11. The Genome Halving Problem

Literature:
- Julia Mixtacki. "Genome Halving under DCJ revisited", Proceedings of COCOON 2008, LNCS 5092 (2008): 276-286.

Motivation: reconstruct genome just after whole genome duplication

Definition 11.1 (duplicated genome): In a /duplicated genome/, each gene appears twice, or (in adjacency notation) each head and tail appears twice. Further,
- a paralogous extremity of p is denoted by \bar p,
- a paralogous adjacency of x = {p,q} is denoted by \bar x = {\bar p,\bar q}, and
- a paralogous chromosome of C is denoted by \bar C.

ex.: (o,-d_2,a_2,-d_1,-c_2,b_2,o) (o,-b_1,c_1,a_1,o)

Definition 11.2: A genome is
- /linear-perfectly duplicated/, if for each linear chromosome C_i, there is also a chromosome C_j = \bar C_i for some j \not= i
- /circular-perfectly duplicated/, if for each circular chromosome C_i, either there is also a chromosome C_j = \bar C_i for some j \not= i, or C_i = C \cup \bar C, where each adjacency of C_i occurs either in C or in \bar C, but not in both
- /perfectly duplicated/ if it is linear-perfectly duplicated and circular-perfectly duplicated.

ex.: (o,a_1,-d_2,-c_1,b_1,o) (o,a_2,-d_1,-c_2,b_2,o)
ex.: (a_1,d_1,a_2,d_2) (o,c_1,b_1,o) (o,c_2,b_2,o)

Lemma 11.1: A genome A is perfectly duplicated if and only if
- for each adjacency {u,v} in A, also {\bar u,\bar v} is in A and u \not= \bar v, and
- for each telomere {u} in A, also {\bar u} is in A.

Problem 11.1 (Genome Halving Problem): Given a (rearranged) duplicated genome A, find a perfectly duplicated genome B such that the DCJ distance between A and B is minimal.

Definition 11.3 (Natural graph): The natural graph NG(A) of genome A is a graph whose vertices are the adjacencies and telomeres of A and in which each vertex containing an extremity p is connected to the vertex containing the paralogous extremity \bar p.
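
Sketch (not part of the original notes): one possible way to build the vertices and edges of NG(A) from Definition 11.3 and to test the condition of Lemma 11.1. The encoding is an assumption made for illustration only: a gene is a string such as "a1" or "-d2" (last character = copy number 1 or 2), a chromosome is a pair (genes, is_circular), and the names ends, paralog, vertices, natural_graph and is_perfectly_duplicated are hypothetical helpers.

```python
from typing import FrozenSet, List, Tuple

Extremity = Tuple[str, int, str]      # (gene name, copy 1 or 2, "h" or "t")
Chromosome = Tuple[List[str], bool]   # (signed genes such as "-d2", is_circular)


def ends(gene: str) -> Tuple[Extremity, Extremity]:
    """Return the (left, right) extremities of a signed gene like "a1" or "-d2"."""
    reverse = gene.startswith("-")
    name, copy = gene.lstrip("-")[:-1], int(gene[-1])
    tail, head = (name, copy, "t"), (name, copy, "h")
    return (head, tail) if reverse else (tail, head)


def paralog(p: Extremity) -> Extremity:
    """Map an extremity p to its paralogous extremity \\bar p (copy 1 <-> copy 2)."""
    name, copy, end = p
    return (name, 3 - copy, end)


def vertices(genome: List[Chromosome]) -> List[FrozenSet[Extremity]]:
    """List all adjacencies (2 extremities) and telomeres (1 extremity)."""
    result = []
    for genes, circular in genome:
        pairs = [ends(g) for g in genes]
        for (_, right), (left, _) in zip(pairs, pairs[1:]):
            result.append(frozenset({right, left}))
        if circular:
            result.append(frozenset({pairs[-1][1], pairs[0][0]}))
        else:
            result.append(frozenset({pairs[0][0]}))
            result.append(frozenset({pairs[-1][1]}))
    return result


def natural_graph(genome: List[Chromosome]):
    """Return (vertices, edges); one edge per pair of paralogous extremities."""
    verts = vertices(genome)
    where = {p: i for i, v in enumerate(verts) for p in v}
    edges = [(where[p], where[paralog(p)]) for p in where if p[1] == 1]
    return verts, edges


def is_perfectly_duplicated(genome: List[Chromosome]) -> bool:
    """Check Lemma 11.1: every vertex x has its paralog \\bar x, and x != \\bar x."""
    verts = set(vertices(genome))
    for v in verts:
        bar = frozenset(paralog(p) for p in v)
        # bar == v can only happen for an adjacency {u, \bar u}, i.e. u = \bar v
        if bar not in verts or bar == v:
            return False
    return True


rearranged = [(["-d2", "a2", "-d1", "-c2", "b2"], False), (["-b1", "c1", "a1"], False)]
perfect = [(["a1", "-d2", "-c1", "b1"], False), (["a2", "-d1", "-c2", "b2"], False)]
print(is_perfectly_duplicated(rearranged))  # False
print(is_perfectly_duplicated(perfect))     # True
```

The two genomes are the examples given after Definition 11.1 and Definition 11.2. Storing vertices as frozensets makes the lookup of \bar x in the lemma a plain set-membership test.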

Definition 11.4: The paths and cycles of a natural graph are divided into four sets according to their length (number of edges):
- EC = set of even cycles,
- EP = set of even paths,
- OC = set of odd cycles,
- OP = set of odd paths

Observation 11.1: A genome with n distinct genes (each present in two copies) is perfectly duplicated if and only if n = |EC| + |OP|/2 (all cycles are 2-cycles, all paths are 1-paths)

Theorem 11.1: d_{GH}(A) = min_B d_{DCJ}(A,B) = n - (|EC| + \lfloor |OP|/2 \rfloor)

Proof:
1. This is a lower bound (a single DCJ changes the number of components by at most 1 and increases |EC| + \lfloor |OP|/2 \rfloor by at most 1)
2. There is an algorithm that achieves this lower bound

Algorithm:
1. Construct the natural graph
2. Maximize the number of even cycles and odd paths in the natural graph:
   - k-path with k>1 -> 2-cycle (and (k-2)-path if k>2)
     => all paths have length 1
   - k-cycle with k>2 -> 2-cycle and (k-2)-cycle
     => all cycles have length 1 or 2
   - 1-cycle + 1-cycle -> 2-cycle
   - 1-cycle -> 1-path
3. Reconstruct the perfectly duplicated genome from the resulting natural graph

Linear time and space.
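
Sketch (not part of the original notes): a minimal way to evaluate the formula of Theorem 11.1 for a given genome. It only classifies the components of the natural graph and computes the distance; it does not perform steps 2 and 3 of the algorithm. The encoding is an assumption for illustration: genes are strings such as "a1" or "-d2", a chromosome is a pair (genes, is_circular), and ends and halving_distance are hypothetical names. Components are found with union-find for brevity, which is near-linear rather than strictly linear; a graph traversal would give the linear bound.

```python
from collections import Counter
from typing import List, Tuple

Chromosome = Tuple[List[str], bool]   # (signed genes such as "-d2", is_circular)


def ends(gene: str):
    """Return the (left, right) extremities of a signed gene like "a1" or "-d2"."""
    name, copy = gene.lstrip("-")[:-1], int(gene[-1])
    tail, head = (name, copy, "t"), (name, copy, "h")
    return (head, tail) if gene.startswith("-") else (tail, head)


def halving_distance(genome: List[Chromosome]) -> Tuple[int, Counter]:
    """Return (d_GH, counts of EC/EP/OC/OP) following Theorem 11.1."""
    # Vertices of the natural graph: adjacencies and telomeres.
    verts = []
    for genes, circular in genome:
        pairs = [ends(g) for g in genes]
        verts += [{r, l} for (_, r), (l, _) in zip(pairs, pairs[1:])]
        if circular:
            verts.append({pairs[-1][1], pairs[0][0]})
        else:
            verts += [{pairs[0][0]}, {pairs[-1][1]}]
    where = {p: i for i, v in enumerate(verts) for p in v}

    # Union-find over vertices; one edge per pair of paralogous extremities.
    parent = list(range(len(verts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for (name, copy, end), i in where.items():
        if copy == 1:
            j = where[(name, 2, end)]
            edges.append((i, j))
            parent[find(i)] = find(j)

    # A component is a path iff it contains a telomere; length = number of edges.
    length = Counter(find(i) for i, _ in edges)
    has_telomere = {find(i) for i, v in enumerate(verts) if len(v) == 1}
    kinds = Counter()
    for root, k in length.items():
        parity = "E" if k % 2 == 0 else "O"
        kinds[parity + ("P" if root in has_telomere else "C")] += 1

    n = len(edges) // 2   # two extremities per gene, so 2n edges in total
    return n - (kinds["EC"] + kinds["OP"] // 2), kinds


rearranged = [(["-d2", "a2", "-d1", "-c2", "b2"], False), (["-b1", "c1", "a1"], False)]
print(halving_distance(rearranged)[0])  # 3
```

For the example genome after Definition 11.1 the natural graph has one 2-cycle, one 3-cycle, one 1-path and one 2-path, so with n = 4 the formula gives 4 - (1 + 0) = 3.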