10. The Genome Median Problem

Literature: 
    - Tannier, Eric, Chunfang Zheng, and David Sankoff. "Multichromosomal median
      and halving problems under different genomic distances." BMC
      Bioinformatics 10.1 (2009): 120.
    - Kováč, Jakub. "On the complexity of rearrangement problems under the
      breakpoint distance." Journal of Computational Biology 21.1 (2014): 1-15.

10.1 Multiple genome rearrangement

Similar to inferring phylogenies using models of sequence evolution (which use
distances derived from substitution mutations in nucleotide sequences), we are
interested in inferring phylogenies using genome rearrangement models:

Problem 10.1 (Multiple Genome Rearrangement (MGR) problem): Given n genomes,
find a tree T with the n genomes as leaf nodes and assign ancestral genomes to
internal nodes of T such that the tree is optimal, i.e., the sum of
rearrangement distances over all edges of the tree is minimal.

This problem is also called the /big parsimony problem/. In contrast, when a tree
T is already given and only the ancestral genomes must be assigned, the problem
reduces to the /small parsimony problem/.

Restricting MGR to three input genomes reduces the problem to finding the genome
median of three genomes. 

10.2 Breakpoint median models

Problem 10.2 (Genome Median Problem): Given three genomes A, B and C, and a
genome distance measure d, find a fourth genome M, called /median/, that
minimizes the sum-of-pairs distance 
    s(M) = d(A, M) + d(B, M) + d(C, M),
where M must satisfy the given karyotypic constraints on the number and types
(linear/circular) of its chromosomes.

For most distances, including the breakpoint distance, the genome median problem
is NP-hard. Before getting into the details, we need to settle how the
breakpoint distance is defined for multichromosomal genomes.

Definition 10.1 (Multichromosomal breakpoint distance): Given genomes A and B,
the /multichromosomal breakpoint distance/ is
    b(A, B) = n - a(A, B) - t(A, B)/2,
where n is the number of genes and a(A, B) and t(A, B) are the numbers of
conserved adjacencies and conserved telomeres between genomes A and B,
respectively.
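
The definition can be tried out on small examples. The following is a minimal
sketch, assuming a genome is stored as a pair (set of adjacencies, set of
telomeres) and extremities are named "1t"/"1h" for the tail/head of gene 1. The
helper names make_genome and breakpoint_distance and the input encoding are
illustrative choices, not part of the cited literature.

```python
def extremities(gene):
    """Left and right extremity of a signed gene (+g reads tail -> head)."""
    g = abs(gene)
    return (f"{g}t", f"{g}h") if gene > 0 else (f"{g}h", f"{g}t")


def make_genome(chromosomes):
    """Build (adjacencies, telomeres) from [(signed_gene_list, is_circular), ...].

    Assumes every chromosome contains at least one gene.
    """
    adjacencies, telomeres = set(), set()
    for genes, circular in chromosomes:
        ends = [extremities(g) for g in genes]
        # Consecutive genes: right end of one is adjacent to left end of the next.
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adjacencies.add(frozenset((right, left)))
        if circular:
            adjacencies.add(frozenset((ends[-1][1], ends[0][0])))
        else:
            telomeres.update((ends[0][0], ends[-1][1]))
    return adjacencies, telomeres


def breakpoint_distance(genome_a, genome_b, n):
    """b(A, B) = n - a(A, B) - t(A, B)/2 for genomes over the same n genes."""
    adj_a, tel_a = genome_a
    adj_b, tel_b = genome_b
    a = len(adj_a & adj_b)   # conserved adjacencies
    t = len(tel_a & tel_b)   # conserved telomeres
    return n - a - t / 2


A = make_genome([([1, 2], False)])   # one linear chromosome (1 2)
C = make_genome([([1, 2], True)])    # one circular chromosome (1 2)
print(breakpoint_distance(A, C, n=2))
```

In this example A and C share the adjacency {1h, 2t} and no telomere, so the
distance is 2 - 1 - 0 = 1.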

The breakpoint median problem has been shown to be NP-hard, even for the
unichromosomal variant, i.e., when the median is constrained to a single
linear or circular chromosome (Pe'er and Shamir, 1998; Bryant, 1997). Yet, for
two models, a polynomial-time solution exists:
    - Circular-multichromosomal breakpoint median with an unbounded number of
      chromosomes
    - Mixed-multichromosomal breakpoint median with an unbounded number of
      chromosomes

Solutions to both models can be computed using the same algorithm.

- Algorithm 10.1 (perfect matching algorithm)  --------------------------------

Input: Genomes A, B, C
Output: Median M

1: Construct an undirected weighted graph G = (V, E), where vertex set V has two
   types of vertices, "extremity" x and "telomere" t_x, for each gene extremity
   x of the genome dataset A, B, C, and where edge set E = V^2. Weights of edges
   are assigned as follows:
    - For every pair of gene extremities {x, y}, assign a weight to edge {x, y}
      equal to the number of genomes, among A, B, and C, in which {x, y} is an
      adjacency. Thus, each edge in E connecting two extremities has weight 0,
      1, 2, or 3. 
    - For every extremity x, assign a weight to edge {x, t_x} equal to half the
      number of genomes, among A, B, and C, in which x is a telomere. Thus, each
      edge {x, t_x} in E has weight 0, 1/2, 1, or 3/2. 
    - Assign weight 0 to all other remaining edges
2: Compute a maximum weight perfect matching P \subseteq E
3: Construct genome M from X \subseteq P where X only includes edges between
   extremity vertices or an extremity vertex and a telomere vertex.
4: Return M
-------------------------------------------------------------------------------
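
The steps of Algorithm 10.1 can be sketched in a few lines for very small
inputs. The sketch below assumes genomes are given as (adjacencies, telomeres)
pairs of extremity names, and it replaces the polynomial-time matching routine
by an exhaustive bitmask search, so it is only meant for a handful of genes.
All weights are doubled to keep them integral. The function name
breakpoint_median is hypothetical. Since the weight of the selected edges
equals 3n - s(M), a maximum weight matching yields a minimum score.

```python
from functools import lru_cache


def breakpoint_median(genomes):
    """Exhaustive sketch of the perfect matching algorithm (tiny inputs only)."""
    extremities = sorted({x for adjs, tels in genomes
                          for x in tels | {e for adj in adjs for e in adj}})
    k = len(extremities)
    # Vertices 0..k-1 are extremities; vertex k+i is the telomere vertex of i.

    def weight(i, j):
        # Doubled weights (assumption of this sketch): adjacency edges get
        # 2 * count, telomere edges {x, t_x} get count, all others get 0.
        if j < k:
            pair = frozenset((extremities[i], extremities[j]))
            return 2 * sum(pair in adjs for adjs, _ in genomes)
        if i < k and j == i + k:
            return sum(extremities[i] in tels for _, tels in genomes)
        return 0

    @lru_cache(maxsize=None)
    def best(mask):
        # Maximum weight perfect matching on the vertices whose bits are set.
        if mask == 0:
            return 0, ()
        i = (mask & -mask).bit_length() - 1   # lowest unmatched vertex
        rest = mask & ~(1 << i)
        result = None
        for j in range(i + 1, 2 * k):
            if (rest >> j) & 1:
                w, edges = best(rest & ~(1 << j))
                cand = (w + weight(i, j), edges + ((i, j),))
                if result is None or cand[0] > result[0]:
                    result = cand
        return result

    doubled_weight, matching = best((1 << (2 * k)) - 1)
    # Keep extremity-extremity edges as adjacencies and extremity-telomere
    # edges as telomeres; edges between two telomere vertices are dropped.
    adjacencies = {frozenset((extremities[i], extremities[j]))
                   for i, j in matching if j < k}
    telomeres = {extremities[i] for i, j in matching if i < k <= j}
    return adjacencies, telomeres, doubled_weight / 2


A = ({frozenset(("1h", "2t"))}, {"1t", "2h"})                   # linear (1 2)
B = ({frozenset(("1h", "2t"))}, {"1t", "2h"})                   # linear (1 2)
C = ({frozenset(("1h", "2t")), frozenset(("2h", "1t"))}, set()) # circular (1 2)
print(breakpoint_median([A, B, C]))
```

For this input the median coincides with A and B: the adjacency {1h, 2t} is kept
and 1t and 2h become telomeres, with matching weight 5 and score 3n - 5 = 1.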

A maximum weight perfect matching can be computed in O(|V|^3) time. This
solution can be improved by exploiting the following two lemmas:

Lemma 10.1: Every adjacency and every telomere conserved in all three genomes
A, B, and C is part of the median.

Lemma 10.2: There exists a median that contains all adjacencies and telomeres
conserved in at least two out of three genomes A, B, and C.
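
Lemma 10.2 suggests a simple preprocessing step: collect every adjacency and
telomere that occurs in at least two of the three genomes. The following is a
minimal sketch, assuming the (adjacencies, telomeres) representation with
extremity names such as "1t" and "1h"; the function name conserved_in_two is
hypothetical.

```python
from collections import Counter


def conserved_in_two(genomes):
    """Adjacencies and telomeres present in at least two of the genomes."""
    adj_count = Counter(adj for adjs, _ in genomes for adj in adjs)
    tel_count = Counter(tel for _, tels in genomes for tel in tels)
    # With three genomes, two such features cannot share an extremity: each
    # extremity has exactly one role per genome, and 2 + 2 > 3.
    adjacencies = {adj for adj, c in adj_count.items() if c >= 2}
    telomeres = {tel for tel, c in tel_count.items() if c >= 2}
    return adjacencies, telomeres


A = ({frozenset(("1h", "2t"))}, {"1t", "2h"})                   # linear (1 2)
B = ({frozenset(("1h", "2t"))}, {"1t", "2h"})                   # linear (1 2)
C = ({frozenset(("1h", "2t")), frozenset(("2h", "1t"))}, set()) # circular (1 2)
print(conserved_in_two([A, B, C]))
```

Here the adjacency {1h, 2t} occurs in all three genomes and the telomeres 1t and
2h occur in two, so all of them are fixed before any matching is computed.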

The breakpoint median problem can then be solved by computing a
maximum cardinality matching. This is possible after applying a so-called "graph
doubling" technique.

- Algorithm 10.2 (maximum cardinality matching algorithm)  --------------------

Input: Genomes A, B, C
Output: Median M

1: Initialize genome M with genes of input genomes A, B, C
2: Establish adjacencies and telomeres in M that are conserved in two or more
   genomes of A, B, and C
3: Construct an undirected graph G = (V, E), where vertex set V has two types of
   vertices, x and x', for each gene extremity x of the genome dataset A, B, C.
   Edge set E is constructed as follows:
    - For every pair of gene extremities {x, y} that is an adjacency in exactly
      one genome A, B, or C and not in conflict with any of the adjacencies
      already added to median genome M, add edges {x, y}, {x', y'} to E
    - For every telomere x that occurs in exactly one genome A, B, or C, add
      edge {x, x'} to E
4: Compute a maximum cardinality matching P \subseteq E
5: Create an adjacency in genome M between any two extremities corresponding to
   an edge {x, y} \in P (or alternatively, {x', y'} \in P, but do not mix edges
   from both types)
6: Create telomeres in genome M for extremities corresponding to a matched edge
   {x, x'} \in P
7: Join all remaining single extremities in genome M arbitrarily
8: Return M
-------------------------------------------------------------------------------

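A maximum cardinality matching can be computed in O(|E| \sqrt{|V|}) time. Since
each extremity takes part in at most three adjacencies across A, B, and C, we
have |E| = O(|V|) here, which gives O(|V| \sqrt{|V|}) time.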