11. The Genome Halving Problem
Literature:
- Julia Mixtacki. "Genome Halving under DCJ revisited", Proceedings of
COCOON 2008, LNCS 5092 (2008): 276-286.
Motivation: reconstruct the genome as it was immediately after a whole-genome
duplication.
Definition 11.1 (duplicated genome): In a /duplicated genome/, each gene appears
twice, or (in adjacency notation) each head and tail appears twice. Further,
- the paralogous extremity of p is denoted by \bar p,
- the paralogous adjacency of x = {p,q} is denoted by \bar x = {\bar p,\bar q},
  and
- the paralogous chromosome of C is denoted by \bar C.
ex.: (o,-d_2,a_2,-d_1,-c_2,b_2,o) (o,-b_1,c_1,a_1,o)
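The adjacency notation of the example can be sketched in code. This is a minimal sketch under stated assumptions, not part of the original notes: extremities are written as strings such as "a1h" (head of a_1) and "a1t" (tail of a_1), a gene read in forward direction goes tail first and head second, and each chromosome is given as a pair (genes, is_linear). The names paralog, extremities and adjacency_notation are hypothetical helpers.

```python
def paralog(ext):
    """Paralogous extremity: 'a1h' <-> 'a2h' (copy index 1 <-> 2)."""
    gene, copy, end = ext[:-2], ext[-2], ext[-1]
    return gene + ("2" if copy == "1" else "1") + end


def extremities(signed_gene):
    """Left and right extremity of a signed gene such as 'a2' or '-d1'."""
    if signed_gene.startswith("-"):
        g = signed_gene[1:]
        return g + "h", g + "t"  # reversed gene: head comes first
    return signed_gene + "t", signed_gene + "h"


def adjacency_notation(chromosomes):
    """Return (adjacencies, telomeres) for a list of (genes, is_linear)."""
    adjacencies, telomeres = [], []
    for genes, is_linear in chromosomes:
        ends = [extremities(g) for g in genes]
        # Neighbouring genes share an adjacency: right end of one, left end of next.
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adjacencies.append(frozenset((right, left)))
        if is_linear:
            telomeres.append(frozenset((ends[0][0],)))
            telomeres.append(frozenset((ends[-1][1],)))
        else:
            # A circular chromosome closes up with one more adjacency.
            adjacencies.append(frozenset((ends[-1][1], ends[0][0])))
    return adjacencies, telomeres


# The example genome of Definition 11.1 (both chromosomes are linear).
genome = [(["-d2", "a2", "-d1", "-c2", "b2"], True),
          (["-b1", "c1", "a1"], True)]
adjacencies, telomeres = adjacency_notation(genome)
print(len(adjacencies), "adjacencies,", len(telomeres), "telomeres")
```

For the example genome this yields 6 adjacencies and 4 telomeres; every extremity and its paralog each occur exactly once.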
Definition 11.2: A genome is
- /linear-perfectly duplicated/, if for each linear chromosome C_i, there
is also a chromosome C_j = \bar C_i for some j \not= i
- /circular-perfectly duplicated/, if for each circular chromosome C_i,
either there is also a chromosome C_j = \bar C_i for some j \not= i,
or C_i = C \cup \bar C, where each adjacency of C_i occurs either in C or in
\bar C, but not in both
- /perfectly duplicated/ if it is linear-perfectly duplicated and
circular-perfectly duplicated.
ex.: (o,a_1,-d_2,-c_1,b_1,o) (o,a_2,-d_1,-c_2,b_2,o)
ex.: (a_1,d_1,a_2,d_2) (o,c_1,b_1,o) (o,c_2,b_2,o)
Lemma 11.1: A genome A is perfectly duplicated if and only if
- for each adjacency {u,v} in A, also {\bar u,\bar v} is in A and u\not=\bar v
and
- for each telomere {u} in A, also {\bar u} is in A
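The two conditions of Lemma 11.1 translate directly into a membership test. The following is a minimal sketch, assuming extremities are encoded as strings such as "a1h" (head of a_1) and "a1t" (tail of a_1); this encoding and the function names are illustrative assumptions, not notation from the notes.

```python
def paralog(ext):
    """Paralogous extremity: 'a1h' <-> 'a2h' (copy index 1 <-> 2)."""
    gene, copy, end = ext[:-2], ext[-2], ext[-1]
    return gene + ("2" if copy == "1" else "1") + end


def is_perfectly_duplicated(adjacencies, telomeres):
    """Check the two conditions of Lemma 11.1."""
    adj = {frozenset(a) for a in adjacencies}
    tel = {frozenset(t) for t in telomeres}
    for a in adj:
        u, v = tuple(a)
        if u == paralog(v):
            return False  # violates u != \bar v
        if frozenset((paralog(u), paralog(v))) not in adj:
            return False  # paralogous adjacency is missing
    # Every telomere {u} needs its paralogous telomere {\bar u}.
    return all(frozenset(paralog(u) for u in t) in tel for t in tel)


# First example after Definition 11.2:
# (o,a_1,-d_2,-c_1,b_1,o) (o,a_2,-d_1,-c_2,b_2,o)
perfect_adj = [("a1h", "d2h"), ("d2t", "c1h"), ("c1t", "b1t"),
               ("a2h", "d1h"), ("d1t", "c2h"), ("c2t", "b2t")]
perfect_tel = [("a1t",), ("b1h",), ("a2t",), ("b2h",)]
print(is_perfectly_duplicated(perfect_adj, perfect_tel))  # True
```

The rearranged genome from the example of Definition 11.1 fails the test, because for instance the adjacency {d_2^t, a_2^t} has no paralogous counterpart {d_1^t, a_1^t}.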
Problem 11.1 (Genome Halving Problem): Given a (rearranged) duplicated genome A,
find a perfectly duplicated genome B such that the DCJ distance between A and B
is minimal.
Definition 11.3 (Natural graph): The natural graph NG(A) of genome A is a graph
whose vertices are the adjacencies and telomeres of A and in which each vertex
containing an extremity p is connected to the vertex containing the paralogous
extremity \bar p.
Definition 11.4: The set of paths and cycles of a natural graph is divided,
according to the parity of their length (number of edges), into four sets:
- EC = set of even cycles,
- EP = set of even paths,
- OC = set of odd cycles,
- OP = set of odd paths
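Definitions 11.3 and 11.4 can be illustrated by a short traversal of the natural graph. This is a sketch under assumptions that are not part of the notes: extremities are encoded as strings such as "a1h" and "a1t", the length of a component is counted in edges, and a component is treated as a path exactly when it contains a telomere. Since every edge joins an extremity p to \bar p, the number of edges of a component is half the number of extremities it contains.

```python
def paralog(ext):
    """Paralogous extremity: 'a1h' <-> 'a2h' (copy index 1 <-> 2)."""
    gene, copy, end = ext[:-2], ext[-2], ext[-1]
    return gene + ("2" if copy == "1" else "1") + end


def classify_components(vertices):
    """Count even/odd cycles and paths of the natural graph.

    vertices: adjacencies (2 extremities) and telomeres (1 extremity).
    """
    vertices = [tuple(v) for v in vertices]
    where = {p: i for i, v in enumerate(vertices) for p in v}
    counts = {"EC": 0, "EP": 0, "OC": 0, "OP": 0}
    seen = set()
    for start in range(len(vertices)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search along paralogy edges
            i = stack.pop()
            component.append(i)
            for p in vertices[i]:
                j = where[paralog(p)]
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        length = sum(len(vertices[i]) for i in component) // 2
        is_path = any(len(vertices[i]) == 1 for i in component)
        key = ("E" if length % 2 == 0 else "O") + ("P" if is_path else "C")
        counts[key] += 1
    return counts


# Adjacencies and telomeres of (o,-d_2,a_2,-d_1,-c_2,b_2,o) (o,-b_1,c_1,a_1,o)
example = [("d2h",), ("d2t", "a2t"), ("a2h", "d1h"), ("d1t", "c2h"),
           ("c2t", "b2t"), ("b2h",), ("b1h",), ("b1t", "c1t"),
           ("c1h", "a1t"), ("a1h",)]
print(classify_components(example))
```

For this genome the natural graph has one component of each kind: a 2-cycle, a 3-cycle, a 2-path and a 1-path.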
Observation 11.1: A genome is perfectly duplicated if and only if n = |EC| +
|OP|/2, where n is the number of distinct genes (each present in two copies);
in this case all cycles are 2-cycles and all paths are 1-paths.
Theorem 11.1: d_{GH}(A) = \min_B d_{DCJ}(A,B) = n - (|EC| + \lfloor |OP|/2 \rfloor)
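As a worked instance of the formula, consider the rearranged genome from the example of Definition 11.1. Working out its natural graph by hand gives |EC| = 1 and |OP| = 1 (plus one even path and one odd cycle, which do not enter the formula). The sketch below takes n to be the number of distinct genes, here n = 4 (a, b, c, d); this reading of n is an interpretation consistent with Observation 11.1, and the function name is hypothetical.

```python
def halving_distance(n, num_even_cycles, num_odd_paths):
    """Evaluate n - (|EC| + floor(|OP| / 2)) from Theorem 11.1."""
    return n - (num_even_cycles + num_odd_paths // 2)


# Rearranged example genome: n = 4, |EC| = 1, |OP| = 1.
print(halving_distance(4, 1, 1))  # 3

# Perfectly duplicated example genome: n = 4, |EC| = 3, |OP| = 2.
print(halving_distance(4, 3, 2))  # 0
```

The second call matches Observation 11.1: a perfectly duplicated genome has halving distance 0.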
Proof:
1. The right-hand side is a lower bound: a single DCJ operation can change the
   number of components of the natural graph by at most 1.
2. The algorithm below achieves this lower bound.
Algorithm:
1. Construct the natural graph
2. Maximize the number of even cycles and odd paths in the natural graph:
   - k-path with k > 1 -> 2-cycle (and a (k-2)-path if k > 2)
     => all paths have length 1
   - k-cycle with k > 2 -> 2-cycle and a (k-2)-cycle
     => all cycles have length 1 or 2
   - 1-cycle + 1-cycle -> 2-cycle
   - remaining 1-cycle -> 1-path
3. Reconstruct the perfectly duplicated genome from the resulting natural graph
The algorithm runs in linear time and space.
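The bookkeeping of step 2 can be simulated on component lengths alone. The sketch below is an illustration, not the algorithm of the notes: it does not perform DCJ operations on actual adjacencies and does not reconstruct the genome B (step 3). It only counts how many operations the rules of step 2 spend, assuming each rule application costs one DCJ operation and lengths are counted in edges. The function name and the input format (two lists of lengths) are assumptions.

```python
def count_step2_operations(path_lengths, cycle_lengths):
    """Apply the rules of step 2 to component lengths.

    Returns (operations, number of 2-cycles, number of 1-paths).
    """
    ops = two_cycles = one_paths = one_cycles = 0

    for k in path_lengths:
        while k > 1:  # k-path -> 2-cycle and (k-2)-path
            ops += 1
            two_cycles += 1
            k -= 2
        one_paths += k  # k is 1 (a 1-path remains) or 0 (path used up)

    for k in cycle_lengths:
        while k > 2:  # k-cycle -> 2-cycle and (k-2)-cycle
            ops += 1
            two_cycles += 1
            k -= 2
        if k == 2:
            two_cycles += 1
        else:
            one_cycles += 1

    # 1-cycle + 1-cycle -> 2-cycle
    ops += one_cycles // 2
    two_cycles += one_cycles // 2
    if one_cycles % 2 == 1:  # remaining 1-cycle -> 1-path
        ops += 1
        one_paths += 1

    return ops, two_cycles, one_paths


# Example genome of Definition 11.1: paths of length 1 and 2,
# cycles of length 2 and 3 (component lengths worked out by hand).
print(count_step2_operations([1, 2], [2, 3]))  # (3, 3, 2)
```

For the example the simulation spends 3 operations, which agrees with n - (|EC| + \lfloor |OP|/2 \rfloor) = 4 - (1 + 0) = 3, and it ends with three 2-cycles and two 1-paths, so that n = |EC| + |OP|/2 = 3 + 1 = 4 as required by Observation 11.1.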