10. The Genome Median Problem
Literature:
- Tannier, Eric, Chunfang Zheng, and David Sankoff. "Multichromosomal median
and halving problems under different genomic distances." BMC
Bioinformatics 10.1 (2009): 120.
- Kováč, Jakub. "On the complexity of rearrangement problems under the
breakpoint distance." Journal of Computational Biology 21.1 (2014): 1-15.
10.1 Multiple genome rearrangement
Similar to inferring phylogenies using models of sequence evolution (which use
distances derived from substitution mutations in nucleotide sequences), we
are interested in inferring phylogenies using genome rearrangement models:
Problem 10.1 (Multiple Genome Rearrangement (MGR) problem): Given n genomes,
find a tree T with the n genomes as leaf nodes and assign ancestral genomes to
internal nodes of T such that the tree is optimal, i.e., the sum of
rearrangement distances over all edges of the tree is minimal.
This problem is also called the /big parsimony problem/. In contrast, when a tree
T is already given and only the ancestral assignment is needed, the problem is
reduced to the /small parsimony problem/.
Restricting MGR to three input genomes reduces the problem to finding the genome
median of three genomes.
10.2 Breakpoint median models
Problem 10.2 (Genome Median Problem): Given three genomes A, B and C, and a
genome distance measure d, find a fourth genome M, called /median/, that
minimizes the sum-of-pairs distance
s(M) = d(A, M) + d(B, M) + d(C, M),
where M must satisfy given karyotypic constraints on the number and types
(linear/circular) of its chromosomes.
For most distances, including the breakpoint distance, the genome median problem
is NP-hard. Before getting into the details, we need to settle how the
breakpoint distance is defined for multiple chromosomes.
Definition 10.1 (Multichromosomal breakpoint distance): Given genomes A and B,
the /multichromosomal breakpoint distance/ is
b(A, B) = n - a(A, B) - t(A, B)/2,
where n is the number of genes and a(A, B) and t(A, B) are the numbers of
conserved adjacencies and conserved telomeres between genomes A and B,
respectively.
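Definition 10.1 can be illustrated with a minimal Python sketch. The genome encoding used here is an assumption of this example, not part of the definition: a genome is a list of chromosomes, each given as a list of signed genes plus a circularity flag, and extremities are written as strings such as "2t" (tail) and "2h" (head). The helper names extremities, genome, and breakpoint_distance are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene in reading order."""
    g = abs(gene)
    return (f"{g}t", f"{g}h") if gene > 0 else (f"{g}h", f"{g}t")


def genome(chromosomes):
    """Build (adjacencies, telomeres) from a list of (signed genes, is_circular)."""
    adjacencies, telomeres = set(), set()
    for genes, circular in chromosomes:
        ends = [extremities(g) for g in genes]
        # Consecutive genes contribute one adjacency each.
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adjacencies.add(frozenset((right, left)))
        if circular:
            # A circular chromosome closes with one more adjacency.
            adjacencies.add(frozenset((ends[-1][1], ends[0][0])))
        else:
            # A linear chromosome has two telomeres.
            telomeres.update((ends[0][0], ends[-1][1]))
    return adjacencies, telomeres


def breakpoint_distance(a, b, n):
    """b(A, B) = n - a(A, B) - t(A, B)/2 for genomes over n genes."""
    conserved_adjacencies = len(a[0] & b[0])
    conserved_telomeres = len(a[1] & b[1])
    return n - conserved_adjacencies - conserved_telomeres / 2


A = genome([([1, 2, 3], False)])   # one linear chromosome: 1 2 3
B = genome([([1, -2, 3], False)])  # same, with gene 2 inverted
print(breakpoint_distance(A, B, n=3))  # 2.0
```

In this example the inversion destroys both adjacencies but keeps both telomeres, so b(A, B) = 3 - 0 - 2/2 = 2.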
The breakpoint median problem has been shown to be NP-hard, even for the
uni-chromosomal variant, i.e., when the median is constrained to a single
linear or circular chromosome (Pe'er and Shamir, 1998; Bryant, 1997). Yet, for
two models, a polynomial-time solution exists:
- Circular-multichromosomal breakpoint median with an unbounded number of
chromosomes
- Mixed-multichromosomal breakpoint median with an unbounded number of
chromosomes
Solutions to both models can be computed using the same algorithm.
- Algorithm 10.1 (perfect matching algorithm) --------------------------------
Input: Genomes A, B, C
Output: Median M
1: Construct an undirected weighted graph G = (V, E), where vertex set V has two
types of vertices, "extremity" x and "telomere" t_x, for each gene extremity
x of the genome dataset A, B, C, and where edge set E = V^2. Weights of edges
are assigned as follows:
- For every pair of gene extremities {x, y}, assign a weight to edge {x, y}
equal to the number of genomes, among A, B, and C, in which {x, y} is an
adjacency. Thus, each edge in E connecting two extremities has weight 0,
1, 2, or 3.
- For every extremity x, assign a weight to edge {x, t_x} equal to half the
number of genomes, among A, B, and C, in which x is a telomere. Thus, each
edge {x, t_x} in E has weight 0, 1/2, 1, or 3/2.
- Assign weight 0 to all remaining edges.
2: Compute a maximum weight perfect matching P \subseteq E
3: Construct genome M from X \subseteq P where X only includes edges between
extremity vertices or an extremity vertex and a telomere vertex.
4: Return M
-------------------------------------------------------------------------------
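The three steps of Algorithm 10.1 can be traced on a toy instance with the sketch below. It is not an efficient implementation: step 2 is done by exhaustive enumeration of all perfect matchings, which is exponential and only stands in for a polynomial-time maximum weight matching algorithm. The encoding (a genome as a pair of an adjacency set and a telomere set, telomere vertices named "t_" + x) and all function names are assumptions of this example.

```python
def edge_weights(genomes):
    """Step 1: weights of all non-zero edges; every other edge weighs 0."""
    w = {}
    for adjacencies, telomeres in genomes:
        for adj in adjacencies:
            w[adj] = w.get(adj, 0) + 1          # one per genome with {x, y}
        for x in telomeres:
            e = frozenset((x, "t_" + x))
            w[e] = w.get(e, 0) + 0.5            # one half per genome
    return w


def best_perfect_matching(vertices, w):
    """Step 2 by brute force: try every partner for the first vertex."""
    if not vertices:
        return 0, []
    x, rest = vertices[0], vertices[1:]
    best = (-1, None)
    for i, y in enumerate(rest):
        e = frozenset((x, y))
        score, edges = best_perfect_matching(rest[:i] + rest[i + 1:], w)
        score += w.get(e, 0)
        if score > best[0]:
            best = (score, edges + [e])
    return best


def median(genomes, extremities):
    """Steps 1-3: returns (adjacencies, telomeres, matching weight) of M."""
    vertices = list(extremities) + ["t_" + x for x in extremities]
    weight, matching = best_perfect_matching(vertices, edge_weights(genomes))
    adjacencies, telomeres = set(), set()
    for e in matching:
        real = [v for v in e if not v.startswith("t_")]
        if len(real) == 2:        # extremity-extremity edge: adjacency
            adjacencies.add(e)
        elif len(real) == 1:      # extremity-telomere edge: telomere
            telomeres.add(real[0])
    return adjacencies, telomeres, weight


# Two genes; A and C are the linear chromosome (1 2), B is (1 -2).
A = ({frozenset(("1h", "2t"))}, {"1t", "2h"})
B = ({frozenset(("1h", "2h"))}, {"1t", "2t"})
C = A
adjacencies, telomeres, weight = median([A, B, C], ["1t", "1h", "2t", "2h"])
print(adjacencies, telomeres, weight)
```

On this instance the matching has weight 4.5 and the resulting median equals A. Summing Definition 10.1 over the three genomes gives s(M) = 3n minus the matching weight, here 6 - 4.5 = 1.5, which is why maximizing the weight minimizes the sum-of-pairs distance.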
A maximum weight perfect matching can be computed in O(|V|^3) time. This
solution can be improved by exploiting the following two lemmas:
Lemma 10.1: Every adjacency and every telomere conserved in all three genomes
A, B, and C is part of the median.
Lemma 10.2: There exists a median that contains all adjacencies and telomeres
conserved in at least two out of three genomes A, B, and C.
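One reason the features named in Lemma 10.2 can all be placed in a single genome is that they never conflict: two features that each occur in at least two of the three genomes must co-occur in some genome, and within one genome every extremity has exactly one role. The sketch below collects these features and checks the property on a small example. The encoding (a genome as a pair of an adjacency set and a telomere set) and the function names are assumptions of this example.

```python
from collections import Counter


def conserved_in_two(genomes):
    """Adjacencies and telomeres that occur in at least two genomes."""
    adj_count, tel_count = Counter(), Counter()
    for adjacencies, telomeres in genomes:
        adj_count.update(adjacencies)
        tel_count.update(telomeres)
    adjs = {a for a, c in adj_count.items() if c >= 2}
    tels = {t for t, c in tel_count.items() if c >= 2}
    return adjs, tels


def is_conflict_free(adjs, tels):
    """True if no extremity is used by more than one adjacency or telomere."""
    used = [x for a in adjs for x in a] + list(tels)
    return len(used) == len(set(used))


# A: linear (1 2), B: linear (1 -2), C: circular (1 2)
A = ({frozenset(("1h", "2t"))}, {"1t", "2h"})
B = ({frozenset(("1h", "2h"))}, {"1t", "2t"})
C = ({frozenset(("1h", "2t")), frozenset(("2h", "1t"))}, set())

adjs, tels = conserved_in_two([A, B, C])
print(adjs, tels, is_conflict_free(adjs, tels))
```

Here the adjacency {1h, 2t} (from A and C) and the telomere 1t (from A and B) are fixed in advance, leaving only extremity 2h undecided.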
The breakpoint median problem can then be solved by computing a
maximum cardinality matching. This is possible after applying a so-called "graph
doubling" technique.
- Algorithm 10.2 (maximum cardinality matching algorithm) --------------------
Input: Genomes A, B, C
Output: Median M
1: Initialize genome M with the genes of input genomes A, B, C
2: Establish adjacencies and telomeres in M that are conserved in two or more
genomes of A, B, and C
3: Construct an undirected graph G = (V, E), where vertex set V has two types of
vertices, x and x', for each gene extremity x of the genome dataset A, B, C.
Edge set E is constructed as follows:
- For every pair of gene extremities {x, y} that is an adjacency in exactly
one genome A, B, or C and not in conflict with any of the adjacencies
already added to median genome M, add edges {x, y}, {x', y'} to E
- For every telomere x that occurs in exactly one genome A, B, or C, add
edge {x, x'} to E
4: Compute a maximum cardinality matching P \subseteq E
5: Create an adjacency in genome M between any two extremities corresponding to
an edge {x, y} \in P (or alternatively, {x', y'} \in P, but do not mix edges
from both types)
6: Create telomeres in genome M for extremities corresponding to a matched edge
{x, x'} \in P
7: Join all remaining single extremities in genome M arbitrarily
8: Return M
-------------------------------------------------------------------------------
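The graph construction step of Algorithm 10.2 ("graph doubling") can be sketched as follows. The encoding (a genome as a pair of an adjacency set and a telomere set, the copy of extremity x written as x + "'") and the function name doubled_graph are assumptions of this example. The sketch also assumes that a telomere edge is skipped when its extremity is already fixed in M, which the algorithm text states explicitly only for adjacencies.

```python
from collections import Counter


def doubled_graph(genomes, fixed_adjs, fixed_tels):
    """Edges of the doubled graph, given the features already fixed in M."""
    used = {x for a in fixed_adjs for x in a} | set(fixed_tels)
    adj_count, tel_count = Counter(), Counter()
    for adjacencies, telomeres in genomes:
        adj_count.update(adjacencies)
        tel_count.update(telomeres)
    edges = set()
    for adj, c in adj_count.items():
        # Adjacency in exactly one genome, both extremities still free.
        if c == 1 and not (adj & used):
            x, y = sorted(adj)
            edges.add(frozenset((x, y)))
            edges.add(frozenset((x + "'", y + "'")))
    for x, c in tel_count.items():
        # Telomere in exactly one genome (assumption: x must still be free).
        if c == 1 and x not in used:
            edges.add(frozenset((x, x + "'")))
    return edges


# A: linear (1 2), B: linear (1 -2), C: linear (2 1)
A = ({frozenset(("1h", "2t"))}, {"1t", "2h"})
B = ({frozenset(("1h", "2h"))}, {"1t", "2t"})
C = ({frozenset(("2h", "1t"))}, {"2t", "1h"})

# Telomeres 1t (A, B) and 2t (B, C) are conserved in two genomes.
edges = doubled_graph([A, B, C], fixed_adjs=set(), fixed_tels={"1t", "2t"})
print(sorted(sorted(e) for e in edges))
```

The four resulting edges form a cycle 1h, 2h, 2h', 1h'. Both of its maximum cardinality matchings have size 2: one yields the adjacency {1h, 2h}, the other yields telomeres at 1h and 2h, and in this instance both give a genome M with s(M) = 3.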
A maximum cardinality matching can be computed in O(|E| \sqrt{|V|}) time. Since
G has only O(|V|) edges, this amounts to O(|V| \sqrt{|V|}).