10. The Genome Median Problem

Literature:
- Tannier, Eric, Chunfang Zheng, and David Sankoff. "Multichromosomal median and halving problems under different genomic distances." BMC Bioinformatics 10.1 (2009): 120.
- Kováč, Jakub. "On the complexity of rearrangement problems under the breakpoint distance." Journal of Computational Biology 21.1 (2014): 1-15.

10.1 Multiple genome rearrangement

Similar to inferring phylogenies using models of sequence evolution (which use distances derived from substitution mutations in nucleotide sequences), we are interested in inferring phylogenies using genome rearrangement models:

Problem 10.1 (Multiple Genome Rearrangement (MGR) problem): Given n genomes, find a tree T with the n genomes as leaf nodes and assign ancestral genomes to the internal nodes of T such that the tree is optimal, i.e., the sum of rearrangement distances over all edges of the tree is minimal.

This problem is also called the /big parsimony problem/. In contrast, when a tree T is already given and only the ancestral assignment is needed, the problem reduces to the /small parsimony problem/. Restricting MGR to three input genomes reduces the problem to finding the genome median of three genomes.

10.2 Breakpoint median models

Problem 10.2 (Genome Median Problem): Given three genomes A, B, and C, and a genome distance measure d, find a fourth genome M, called the /median/, that minimizes the sum-of-pairs distance s(M) = d(A, M) + d(B, M) + d(C, M) and satisfies given karyotypic constraints on the number and types (linear/circular) of its chromosomes.

For most distances, including the breakpoint distance, the genome median problem is NP-hard. But before we can get into the details, we need to settle the question of how the breakpoint distance is defined for genomes with multiple chromosomes.
Definition 10.1 (Multichromosomal breakpoint distance): Given genomes A and B, the /multichromosomal breakpoint distance/ is

    b(A, B) = n - a(A, B) - t(A, B)/2

where n is the number of genes and a(A, B) and t(A, B) are the numbers of conserved adjacencies and conserved telomeres between genomes A and B, respectively.

The breakpoint median problem has been shown to be NP-hard, even for the unichromosomal variant, i.e., when the median is constrained to a single linear or circular chromosome (Pe'er and Shamir, 1998; Bryant, 1997). Yet, for two models, a polynomial-time solution exists:
- Circular-multichromosomal breakpoint median with an unbounded number of chromosomes
- Mixed-multichromosomal breakpoint median with an unbounded number of chromosomes

Solutions to both models can be computed using the same algorithm.

- Algorithm 10.1 (perfect matching algorithm) --------------------------------
Input: Genomes A, B, C
Output: Median M
1: Construct an undirected weighted graph G = (V, E), where vertex set V has two types of vertices, "extremity" x and "telomere" t_x, for each gene extremity x of the genome dataset A, B, C, and where edge set E = V^2, i.e., G is a complete graph. Weights of edges are assigned as follows:
   - For every pair of gene extremities {x, y}, assign a weight to edge {x, y} equal to the number of genomes, among A, B, and C, in which {x, y} is an adjacency. Thus, each edge in E connecting two extremities has weight 0, 1, 2, or 3.
   - For every extremity x, assign a weight to edge {x, t_x} equal to half the number of genomes, among A, B, and C, in which x is a telomere. Thus, each edge {x, t_x} in E has weight 0, 1/2, 1, or 3/2.
   - Assign weight 0 to all remaining edges.
2: Compute a maximum weight perfect matching P \subseteq E
3: Construct genome M from X \subseteq P, where X only includes edges between two extremity vertices or between an extremity vertex and a telomere vertex.
4: Return M
-------------------------------------------------------------------------------

A maximum weight perfect matching can be computed in O(|V|^3) time. This solution can be improved by exploiting the following two lemmas:

Lemma 10.1: Every adjacency and every telomere conserved in all three genomes A, B, and C is part of the median.

Lemma 10.2: There exists a median that contains all adjacencies and telomeres conserved in at least two out of the three genomes A, B, and C.

The breakpoint median problem can then be solved by computing a maximum cardinality matching. This is possible after applying a so-called "graph doubling" technique.

- Algorithm 10.2 (maximum cardinality matching algorithm) --------------------
Input: Genomes A, B, C
Output: Median M
1: Initialize genome M with the genes of input genomes A, B, C
2: Establish those adjacencies and telomeres in M that are conserved in two or more of the genomes A, B, and C
3: Construct an undirected graph G = (V, E), where vertex set V has two types of vertices, x and x', for each gene extremity x of the genome dataset A, B, C. Edge set E is constructed as follows:
   - For every pair of gene extremities {x, y} that is an adjacency in exactly one genome A, B, or C and not in conflict with any of the adjacencies already added to median genome M, add edges {x, y} and {x', y'} to E
   - For every extremity x that is a telomere in exactly one genome A, B, or C, add edge {x, x'} to E
4: Compute a maximum cardinality matching P \subseteq E
5: Create an adjacency in genome M between any two extremities corresponding to an edge {x, y} \in P (or alternatively, {x', y'} \in P, but do not mix edges from both types)
6: Create telomeres in genome M for extremities corresponding to a matched edge {x, x'} \in P
7: Join all remaining single extremities in genome M arbitrarily
8: Return M
-------------------------------------------------------------------------------

A maximum cardinality matching can be computed in O(|E| \sqrt{|V|}) time in general. Here every vertex of G has only a constant number of incident edges, so |E| = O(|V|) and the running time becomes O(|V| \sqrt{|V|}).
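
To make Definition 10.1 and the objective of Algorithm 10.1 concrete, the following is a minimal Python sketch for the mixed-multichromosomal model. It is not the polynomial-time algorithm: an exhaustive search over all matchings stands in for the maximum weight perfect matching routine, so it is only usable for a handful of genes. It relies on the observation that a perfect matching in G either pairs an extremity x with another extremity y (an adjacency) or with its own telomere vertex t_x (a telomere), while the leftover telomere vertices pair up among themselves at weight 0. The input encoding (signed integers, a flag for circular chromosomes) and all function names are assumptions made for this illustration.

```python
def extremities(gene):
    """Extremities of a signed gene in reading direction (tail "t", head "h")."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def genome(chromosomes):
    """Convert [(signed genes, is_circular), ...] to (adjacencies, telomeres)."""
    adjacencies, telomeres = set(), set()
    for genes, is_circular in chromosomes:
        ends = [extremities(g) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adjacencies.add(frozenset((right, left)))
        if is_circular:
            adjacencies.add(frozenset((ends[-1][1], ends[0][0])))
        else:
            telomeres.update((ends[0][0], ends[-1][1]))
    return adjacencies, telomeres


def breakpoint_distance(a, b, n):
    """b(A, B) = n - a(A, B) - t(A, B)/2 from Definition 10.1."""
    return n - len(a[0] & b[0]) - len(a[1] & b[1]) / 2


def best_matching(exts, adj_w, tel_w):
    """Exhaustive stand-in for step 2: returns (weight, adjacencies, telomeres)."""
    if not exts:
        return 0, set(), set()
    x, rest = exts[0], exts[1:]
    # Option 1: x is matched to its telomere vertex t_x.
    w, adj, tel = best_matching(rest, adj_w, tel_w)
    best = (w + tel_w.get(x, 0), adj, tel | {x})
    # Option 2: x is matched to another extremity y.
    for i, y in enumerate(rest):
        w, adj, tel = best_matching(rest[:i] + rest[i + 1:], adj_w, tel_w)
        w += adj_w.get(frozenset((x, y)), 0)
        if w > best[0]:
            best = (w, adj | {frozenset((x, y))}, tel)
    return best


def breakpoint_median(genomes, n):
    """Edge weights as in step 1 of Algorithm 10.1, then exhaustive matching."""
    adj_w, tel_w = {}, {}
    for adjacencies, telomeres in genomes:
        for a in adjacencies:
            adj_w[a] = adj_w.get(a, 0) + 1      # number of genomes with {x, y}
        for t in telomeres:
            tel_w[t] = tel_w.get(t, 0) + 0.5    # half the number of genomes
    exts = tuple((g, side) for g in range(1, n + 1) for side in "th")
    _, adjacencies, telomeres = best_matching(exts, adj_w, tel_w)
    return adjacencies, telomeres


def median_score(m, genomes, n):
    """Sum-of-pairs score s(M) under the breakpoint distance."""
    return sum(breakpoint_distance(m, g, n) for g in genomes)


# Three single-chromosome linear genomes over genes 1, 2, 3.
A = genome([([1, 2, 3], False)])
B = genome([([1, 2, -3], False)])
C = genome([([1, -2, 3], False)])
M = breakpoint_median([A, B, C], 3)
print(median_score(M, [A, B, C], 3))
```

Since s(M) equals 3n minus the weight of the matching, maximizing the weight minimizes the score. For the three example genomes the search returns A itself, with s(M) = 0 + 1.5 + 2 = 3.5. A practical implementation would replace `best_matching` with a polynomial-time maximum weight matching algorithm on the graph G of step 1.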