6. Double-Cut-and-Join (DCJ)

Literature:
- Anne Bergeron, Julia Mixtacki, Jens Stoye: A Unifying View of Genome Rearrangements. Proceedings of WABI 2006, Springer Verlag, LNBI 4175, 163-173, 2006

6.1. The double cut and join operation

Definition 6.1: Let G = (V, E) be a graph whose vertices have degree 1 or 2. We denote a vertex by its set of /incident edges/, i.e., a vertex incident to edges e, f \in E is denoted by {e, f}. Likewise, a vertex that is incident to a single edge g \in E is denoted by {g}. Vertices of degree 2 are called /internal/, vertices of degree 1 are called /external/.

Definition 6.2: The /double cut and join/ (DCJ) operation acts on two vertices u and v of a graph with vertices of degree one or two in one of the following three ways:
(a) If both u = {p, q} and v = {r, s} are internal vertices, these are replaced by the two vertices {p, r} and {s, q} or by the two vertices {p, s} and {q, r}.
(b) If u = {p, q} is internal and v = {r} is external, these are replaced by {p, r} and {q} or by {q, r} and {p}.
(c) If both u = {q} and v = {r} are external, these are replaced by {q, r}.
In addition, as an inverse of case (c), a single internal vertex {q, r} can be replaced by two external vertices {q} and {r}.

Definition 6.3 (genome graph): The /genome graph/ is a graph in which edges correspond to genes and vertices correspond to /adjacencies/, i.e., the neighboring extremities of adjacent genes. Each gene x is given by its two extremities, the tail x^t and the head x^h. Each connected component corresponds to a linear or a circular chromosome.

Note:
- each gene extremity belongs to exactly one vertex, so each gene is incident to exactly two vertices
- a vertex consisting of a single extremity is called a /telomere/

Ex. (o 1 -2 -3 4 -5 o) (o -6 7 o), where o marks a telomere and a negative sign marks a gene read from head to tail, corresponds to the genome graph G = (V, E) with vertex set
  V = {{1^t}, {1^h, 2^h}, {2^t, 3^h}, {3^t, 4^t}, {4^h, 5^h}, {5^t}, {6^h}, {6^t, 7^t}, {7^h}}
and edge set
  E = {(x^h, x^t) | x = 1..7}

What kind of rearrangements can be modeled by a double-cut-and-join operation?
- inversion
    (o 1 |-2 -6| 4 -5 o) (o -3 7 o) -> (o 1 6 2 4 -5 o) (o -3 7 o)
- chromosome fission
    (o 1 |6 2| 4 -5 o) (o -3 7 o) -> (o 1 4 -5 o) (6 2) (o -3 7 o)
- chromosome fusion
    (o 1 4 -5 |o) (6 2) (o| -3 7 o) -> (o 1 4 -5 -3 7 o) (6 2), discard (o o)
- chromosome linearization
    (o 1 4 -5 |o) (6 |2) (o| -3 7 o) (o | o) -> (o 1 4 -5 -3 7 o) (o 2 6 o)
- chromosome circularization
- translocation
    (o 1 4 -5 -3 | 7 o) (o 2 | 6 o) -> (o 1 4 -5 -3 6 o) (o 2 7 o)
- transposition (2 DCJs)
    (o 1 |4 -5 |-3 6 o) (o 2 7 o) -> (o 1 -3| 6 o) (4 -5|) (o 2 7 o) -> (o 1 -3 4 -5 6 o) (o 2 7 o)
- block interchange (2 DCJs: excise a circular intermediate containing one block and the segment between the blocks, then reinsert it on the other side of the second block)

6.2. The adjacency graph

In an adjacency graph AG(A, B) of genomes A and B, vertices are the adjacencies and telomeres of A and B, and edges connect vertices of A and B that share an extremity. More formally, we have:

Definition 6.4 (adjacency graph): The /adjacency graph/ AG(A, B) of two genome graphs A and B over the same set of genes is an undirected multi-graph whose vertices are the elements of the multi-set V(A) \cup V(B). For each u \in V(A) and v \in V(B) with u \cap v \neq \emptyset there are |u \cap v| edges between u and v in AG(A, B), i.e., one edge per shared extremity.

The vertices of the adjacency graph again have degree 1 or 2, hence we can apply the same DCJ operation on it as on the genome graph. BUT: we will only apply DCJ operations on vertices associated with genome A.

Observation 6.1:
- the connected components of the adjacency graph are:
  - cycles (always of even length)
  - paths of even length (both end points are telomeres of the same genome)
  - paths of odd length (one end point is a telomere of A, the other a telomere of B)
- cycles of length 2 correspond to common adjacencies
- paths of length 1 correspond to common telomeres
- the adjacency graph between two identical genomes consists of cycles of length 2 and paths of length 1.

6.3. Sorting by DCJs

Problem 6.1 (sorting by DCJs): Given the adjacency graph AG(A, B) of two genomes A and B, find a shortest sequence of DCJ operations O_1, O_2, ..., O_d that transforms A into B, i.e., O_d(... O_2(O_1(A)) ...) = B. We call dcj(A, B) := d the /double-cut-and-join/ distance.
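
To make Definition 6.3, Definition 6.4 and Observation 6.1 concrete, here is a minimal Python sketch (not part of the original notes). It derives the vertex set of a genome graph and classifies the connected components of the adjacency graph. The function names vertices and classify_components, the (genes, is_circular) input encoding and the (gene, "h"/"t") extremity tuples are assumptions chosen for illustration only.

```python
from collections import defaultdict


def vertices(genome):
    """Adjacencies and telomeres of a genome as frozensets of extremities.

    Assumed encoding: genome is a list of (genes, is_circular) pairs with
    non-empty lists of signed integers, e.g. ([1, -2, -3, 4, -5], False).
    """
    result = []
    for genes, is_circular in genome:
        # (left, right) extremity of every gene in reading direction:
        # tail then head for x, head then tail for -x.
        ends = [((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            result.append(frozenset([right, left]))      # adjacency
        if is_circular:
            result.append(frozenset([ends[-1][1], ends[0][0]]))
        else:
            result.append(frozenset([ends[0][0]]))        # left telomere
            result.append(frozenset([ends[-1][1]]))       # right telomere
    return result


def classify_components(genome_a, genome_b):
    """Count cycles, odd paths and even paths of AG(A, B)."""
    va, vb = vertices(genome_a), vertices(genome_b)
    nodes = va + vb
    # Index of the B-vertex that contains a given extremity.
    where_b = {e: len(va) + j for j, v in enumerate(vb) for e in v}
    nbrs = defaultdict(list)
    for i, v in enumerate(va):
        for e in v:                      # one edge per shared extremity
            nbrs[i].append(where_b[e])
            nbrs[where_b[e]].append(i)
    seen = set()
    counts = {"cycles": 0, "odd_paths": 0, "even_paths": 0}
    for start in range(len(nodes)):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:                     # depth-first search of one component
            n = stack.pop()
            comp.append(n)
            for m in nbrs[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        # The degree of a vertex equals its number of extremities.
        n_edges = sum(len(nodes[n]) for n in comp) // 2
        if all(len(nodes[n]) == 2 for n in comp):
            counts["cycles"] += 1
        elif n_edges % 2 == 1:
            counts["odd_paths"] += 1
        else:
            counts["even_paths"] += 1
    return counts
```

Calling vertices([([1, -2, -3, 4, -5], False), ([-6, 7], False)]) reproduces the vertex set V of the example in Definition 6.3, and comparing a genome with itself yields only cycles (of length 2) and odd paths (of length 1), as stated in Observation 6.1.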
Lemma 6.1: Two genomes A and B over the same N genes are identical if and only if their adjacency graph has N - I/2 cycles, where I is the number of odd paths. In other words:

  A = B  <=>  N = C + I/2  <=>  N - C - I/2 = 0

where C = #cycles and I = #odd paths.

The application of one DCJ operation can change the graph AG(A, B) in the following ways:
- the number of cycles changes by -1, 0 or +1
- the number of odd paths changes by -2, 0 or +2
- no DCJ changes the number of odd paths and the number of cycles at the same time

Therefore, we have

  \Delta(C + I/2) \in {-1, 0, +1}

Since C + I/2 = N must hold at the end and each operation increases C + I/2 by at most one, this directly leads to a lower bound of the DCJ distance:

  dcj(A, B) >= N - C - I/2

Let's look at this simple greedy sorting algorithm:

Algorithm 6.1 (greedy sorting by DCJ)
-------------------------------------------------------------------------------
Input: adjacency graph AG(A, B)
Output: sequence O_1, ..., O_d of DCJ operations such that d = dcj(A, B)

for each adjacency {p, q} in genome B do
    let u, v be the elements of genome A that contain p and q, respectively
    if u \neq v then
        replace u and v in A by {p, q} and (u \ {p}) \cup (v \ {q})
            (the second vertex is omitted if it is empty)
        report the corresponding DCJ operation
    end if
end for
for each telomere {p} in genome B do
    let u be the element of genome A that contains p
    if u is an adjacency then
        replace u in A by {p} and (u \ {p})
        report the corresponding DCJ operation
    end if
end for
-------------------------------------------------------------------------------

Each reported DCJ operation increases C + I/2 by one. Therefore the greedy algorithm transforms A into B in N - (C + I/2) steps, which matches the lower bound, i.e., Algorithm 6.1 is optimal and dcj(A, B) = N - C - I/2. Using lookup tables that map each gene extremity to the vertex of the adjacency graph containing it, the algorithm runs in O(N) time and space.
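
The following Python sketch (an illustration, not the authors' implementation) puts the distance formula N - C - I/2 next to the greedy procedure of Algorithm 6.1, so that the number of reported operations can be compared with the formula. The names vertices, dcj_distance and greedy_dcj_sort, the (genes, is_circular) input encoding, the (gene, "h"/"t") extremity tuples and the (old vertices, new vertices) format of a reported operation are all assumptions made for this example.

```python
from collections import defaultdict


def vertices(genome):
    """Adjacencies/telomeres as frozensets of (gene, 'h'|'t') extremities.
    Assumed input: list of (non-empty signed gene list, is_circular)."""
    result = []
    for genes, is_circular in genome:
        ends = [((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")) for g in genes]
        result += [frozenset([a[1], b[0]]) for a, b in zip(ends, ends[1:])]
        if is_circular:
            result.append(frozenset([ends[-1][1], ends[0][0]]))
        else:
            result += [frozenset([ends[0][0]]), frozenset([ends[-1][1]])]
    return result


def dcj_distance(va, vb):
    """N - C - I/2, with components of AG(A, B) found by union-find."""
    nodes = [("A", v) for v in va] + [("B", v) for v in vb]
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    where_b = {e: v for v in vb for e in v}
    for v in va:
        for e in v:                          # one edge per shared extremity
            parent[find(("A", v))] = find(("B", where_b[e]))
    comps = defaultdict(list)
    for n in nodes:
        comps[find(n)].append(n[1])
    cycles = odd_paths = 0
    for comp in comps.values():
        n_edges = sum(len(v) for v in comp) // 2
        if all(len(v) == 2 for v in comp):
            cycles += 1
        elif n_edges % 2 == 1:
            odd_paths += 1
    n_genes = sum(len(v) for v in va) // 2   # two extremities per gene
    return n_genes - cycles - odd_paths // 2


def greedy_dcj_sort(va, vb):
    """Greedy sorting in the spirit of Algorithm 6.1.
    Returns (list of operations, final vertex set of A)."""
    a = set(va)
    where = {e: v for v in a for e in v}     # extremity -> vertex of A
    ops = []

    def replace(old, new):
        a.difference_update(old)
        for v in new:
            a.add(v)
            for e in v:
                where[e] = v
        ops.append((tuple(old), tuple(new)))

    for adj in (v for v in vb if len(v) == 2):
        p, q = tuple(adj)
        u, w = where[p], where[q]
        if u != w:
            rest = (u - {p}) | (w - {q})     # empty if u and w are telomeres
            replace([u, w], [adj] + ([rest] if rest else []))
    for tel in (v for v in vb if len(v) == 1):
        (p,) = tel
        u = where[p]
        if len(u) == 2:                      # u is an adjacency: cut it
            replace([u], [tel, u - {p}])
    return ops, a
```

For the example genome (o 1 -2 -3 4 -5 o) (o -6 7 o) and the target (o 1 2 3 4 5 o) (o 6 7 o), the adjacency graph has C = 1 cycle and I = 4 odd paths, so the formula gives 7 - 1 - 2 = 4, and the sketch reports the same number of operations. The dictionary where plays the role of the lookup table mentioned above; it is what keeps each step constant time.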