6. Double-Cut-and-Join (DCJ)

Literature: 
    - Anne Bergeron, Julia Mixtacki, Jens Stoye: A Unifying View of Genome
      Rearrangements.  Proceedings of WABI 2006: Springer Verlag, LNBI 4175,
      163-173, 2006

6.1. The double cut and join operation

Definition 6.1: Let G = (V, E) be a graph whose vertices have degree 1 or 2. We
denote a vertex by its set of /incident edges/, i.e., a vertex incident to edges
e, f \in E is denoted by {e, f}. Likewise, a vertex that is incident to a single
edge g \in E is denoted by {g}. Vertices of degree 2 are called /internal/,
vertices of degree 1 are called /external/.

Definition 6.2: The /double cut and join/ (DCJ) operation acts on two vertices u
and v of a graph with vertices of degree one or two in one of the following
three ways:

    (a) If both u = {p, q} and v = {r, s} are internal vertices, these are
    replaced by the two vertices {p,r} and {s,q} or by the two vertices {p,s}
    and {q,r}.
    (b) If u = {p, q} is internal and v = {r} is external, these are replaced by
    {p, r} and {q} or by {q,r} and {p}.
    (c) If both u = {q} and v = {r} are external, these are replaced by {q, r}.

In addition, as an inverse of case (c), a single internal vertex {q,r} can be
replaced by two external vertices {q} and {r}.
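
The cases of Definition 6.2 can be sketched in a few lines of Python. This is
only an illustration under assumptions that are not part of the definition:
vertices are passed as tuples so that p, q, r, s have a fixed order, the names
dcj, split and the flag alternative are hypothetical, and the results are
returned as frozensets.

```python
def dcj(u, v, alternative=False):
    """Apply one DCJ to the vertices u and v (tuples of length 1 or 2).

    Returns the list of vertices that replace u and v. The flag
    `alternative` (an assumption of this sketch) selects the second of the
    two possible outcomes in cases (a) and (b).
    """
    u, v = tuple(u), tuple(v)
    if len(u) == 1 and len(v) == 2:
        u, v = v, u                      # make u the internal vertex
    if len(u) == 2 and len(v) == 2:      # case (a): both internal
        (p, q), (r, s) = u, v
        result = [(p, s), (q, r)] if alternative else [(p, r), (s, q)]
    elif len(u) == 2 and len(v) == 1:    # case (b): internal and external
        (p, q), (r,) = u, v
        result = [(q, r), (p,)] if alternative else [(p, r), (q,)]
    elif len(u) == 1 and len(v) == 1:    # case (c): both external
        result = [(u[0], v[0])]
    else:
        raise ValueError("vertices must have degree 1 or 2")
    return [frozenset(x) for x in result]


def split(u):
    """Inverse of case (c): replace the internal vertex {q, r} by {q}, {r}."""
    q, r = u
    return [frozenset((q,)), frozenset((r,))]


print(dcj(("p", "q"), ("r", "s")))
print(dcj(("p", "q"), ("r",), alternative=True))
print(split(("q", "r")))
```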

Definition 6.3 (genome graph): The /genome graph/ is a graph in which edges
correspond to genes (a gene is a tuple of its extremities 'h' (head), 't'
(tail)) and vertices correspond to /adjacencies/, i.e., the neighboring
extremities of adjacent genes. Each connected component corresponds to a linear
or a circular chromosome.

Note: 
    - each gene is incident to exactly two vertices (adjacencies or telomeres)
    - a vertex corresponding to a single extremity is called a /telomere/

Ex. (o 1 -2 -3 4 -5 o) (o -6 7 o) corresponds to genome graph G = (V, E) with
    vertex set V = {{1^t}, {1^h, 2^h}, {2^t, 3^h}, {3^t, 4^t}, {4^h, 5^h}, {5^t},
    {6^h}, {6^t, 7^t}, {7^h}} and
    edge set E = {(x^h, x^t) | x=1..7}
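
The vertex set of the example can be derived mechanically from the signed gene
order. The following is a minimal sketch, assuming chromosomes are given as
non-empty lists of signed integers and extremities are encoded as tuples
(gene, "t") and (gene, "h"); the function names are hypothetical.

```python
def chromosome_vertices(genes, circular=False):
    """Vertices (adjacencies and telomeres) of one chromosome.

    `genes` is a non-empty list of signed integers; +x is read tail to
    head, -x is read head to tail.
    """
    ext = []
    for g in genes:
        tail, head = (abs(g), "t"), (abs(g), "h")
        ext += [tail, head] if g > 0 else [head, tail]
    # Neighboring extremities of consecutive genes form the adjacencies.
    vertices = [frozenset(ext[i:i + 2]) for i in range(1, len(ext) - 1, 2)]
    if circular:
        vertices.append(frozenset((ext[-1], ext[0])))   # close the circle
    else:
        vertices = [frozenset(ext[:1])] + vertices + [frozenset(ext[-1:])]
    return vertices


def genome_vertices(linear=(), circular=()):
    """Vertex list of a genome given its linear and circular chromosomes."""
    vertices = []
    for chromosome in linear:
        vertices += chromosome_vertices(chromosome)
    for chromosome in circular:
        vertices += chromosome_vertices(chromosome, circular=True)
    return vertices


# The example genome (o 1 -2 -3 4 -5 o) (o -6 7 o)
for vertex in genome_vertices(linear=[[1, -2, -3, 4, -5], [-6, 7]]):
    print(sorted(vertex))
```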


What kind of rearrangements can be modeled by a double-cut-and-join operation?

- inversion (o 1 |-2 -6| 4 -5 o) (o -3 7 o) -> (o 1 6 2 4 -5 o) (o -3 7 o)
- chromosome fission 
    (o 1 |6 2| 4 -5 o) (o -3 7 o) ->  (o 1 4 -5 o) (6 2) (o -3 7 o)
- chromosome fusion
    (o 1 4 -5 |o) (6 2) (o| -3 7 o) -> (o 1 4 -5 -3 7 o)  (6 2), discard (o o)
- chromosome linearization
    (o 1 4 -5 -3 7 o) (6 |2) (o | o) -> (o 1 4 -5 -3 7 o)  (o 2 6 o)
- chromosome circularization
- translocation 
    (o 1 4 -5 -3 | 7 o) (o 2 | 6 o) -> (o 1 4 -5 -3 6 o) (o 2 7 o)
- transposition (2 DCJs)
    (o 1 |4 -5 |-3 6 o) (o 2 7 o) -> (o 1 -3| 6 o) (4 -5|) (o 2 7 o) 
                                  -> (o 1 -3 4 -5 6 o) (o 2 7 o)
- block interchange (2 DCJs: excision of a circular intermediate, followed by
  its reintegration at a different position)
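
That the inversion and the translocation above are single DCJ operations can be
checked by comparing vertex sets before and after the rearrangement: exactly
two vertices disappear and two new ones appear, built from the same four
extremities. The sketch below assumes linear chromosomes given as lists of
signed integers; the helper names are hypothetical.

```python
def vertices(genes):
    """Vertex set of one linear chromosome given as signed integers."""
    ext = []
    for g in genes:
        ends = [(abs(g), "t"), (abs(g), "h")]
        ext += ends if g > 0 else ends[::-1]
    inner = {frozenset(ext[i:i + 2]) for i in range(1, len(ext) - 1, 2)}
    return inner | {frozenset(ext[:1]), frozenset(ext[-1:])}


def changed(before, after):
    """Vertices removed (cut) and vertices created (joined)."""
    return before - after, after - before


# inversion: (o 1 |-2 -6| 4 -5 o) -> (o 1 6 2 4 -5 o)
cut, joined = changed(vertices([1, -2, -6, 4, -5]), vertices([1, 6, 2, 4, -5]))
print(len(cut), len(joined))

# translocation: (o 1 4 -5 -3 | 7 o) (o 2 | 6 o) -> (o 1 4 -5 -3 6 o) (o 2 7 o)
cut2, joined2 = changed(
    vertices([1, 4, -5, -3, 7]) | vertices([2, 6]),
    vertices([1, 4, -5, -3, 6]) | vertices([2, 7]),
)
print(len(cut2), len(joined2))
```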

6.2. The adjacency graph

In an adjacency graph AG(A, B) of genomes A and B, vertices are the adjacencies
and telomeres of A and B and edges connect corresponding extremities of A and B.

More formally, we have:

Definition 6.4 (adjacency graph): The /adjacency graph/ AG(A, B) of two genome
graphs A and B is an undirected multi-graph whose vertex set is the multi-set
V(A) \cup V(B). For each u \in V(A) and v \in V(B) there are |u \cap v| edges
between u and v in AG(A, B), i.e., one edge per shared extremity.

The adjacency graph is a graph whose vertices have again degree 1 or 2, hence we
can apply the same DCJ operation on it as on the genome graph. BUT: we will only
apply DCJ operations on vertices associated with genome A.
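
As an illustration, the edges of AG(A, B) can be listed with one pass over the
extremities. The sketch assumes that both genomes are given as lists of
frozensets over the same extremities, here encoded as strings such as "1h";
the function name and the edge representation (index in A, index in B, shared
extremity) are choices made for this example only.

```python
def adjacency_graph(A, B):
    """Edge list of AG(A, B) as triples (i, j, x).

    i and j index the vertex lists of A and B, x is the extremity shared
    by A[i] and B[j]. Two shared extremities yield two parallel edges.
    """
    where_in_b = {x: j for j, v in enumerate(B) for x in v}
    return [(i, where_in_b[x], x) for i, u in enumerate(A) for x in u]


# A = (o 1 2 o), B = (o 1 -2 o)
A = [frozenset({"1t"}), frozenset({"1h", "2t"}), frozenset({"2h"})]
B = [frozenset({"1t"}), frozenset({"1h", "2h"}), frozenset({"2t"})]
for edge in sorted(adjacency_graph(A, B)):
    print(edge)
```

Every vertex of A or B contributes one edge per extremity, which is why the
degrees in AG(A, B) are again 1 or 2.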

Observation 6.1:
- the connected components of the adjacency graph are:
   - cycles (always of even length)
   - paths of even length (both ends are telomeres of the same genome)
   - paths of odd length (one end is a telomere of A, the other a telomere of B)
- cycles of length 2 correspond to common adjacencies
- paths of length 1 correspond to common telomeres
- the adjacency graph between two identical genomes consists of cycles of length
  2 and paths of length 1.
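
The component structure can be explored by walking along the edges. The
following sketch returns the lengths (in edges) of all cycles and paths of
AG(A, B); it assumes genomes given as lists of frozensets over the same
extremities (strings such as "1h"), and the function name is hypothetical.

```python
def component_lengths(A, B):
    """Return (cycle_lengths, path_lengths) of AG(A, B), both sorted."""
    genomes = (A, B)
    # index[g][x] is the vertex of genome g that contains extremity x
    index = [{x: v for v in g for x in v} for g in genomes]
    used = set()   # extremities (= edges) that were already traversed

    def walk(g, x):
        """Leave a vertex of genome g via x and count edges until a
        telomere or an already traversed edge is reached."""
        n = 0
        while x is not None and x not in used:
            used.add(x)
            n += 1
            g = 1 - g                    # cross the edge to the other genome
            rest = index[g][x] - {x}     # other extremity of that vertex
            x = next(iter(rest), None)   # None: the vertex is a telomere
        return n

    # Paths start at telomeres; the second end of each path yields 0.
    paths = [walk(g, next(iter(v)))
             for g in (0, 1) for v in genomes[g] if len(v) == 1]
    # All edges that are still unused lie on cycles.
    cycles = [walk(0, x) for v in A for x in v]
    return sorted(n for n in cycles if n), sorted(n for n in paths if n)


A = [frozenset({"1t"}), frozenset({"1h", "2t"}), frozenset({"2h"})]   # (o 1 2 o)
B = [frozenset({"1t"}), frozenset({"1h", "2h"}), frozenset({"2t"})]   # (o 1 -2 o)
print(component_lengths(A, A))   # identical genomes
print(component_lengths(A, B))
```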

6.3. Sorting by DCJs

Problem 6.1 (sorting by DCJs): Given the adjacency graph AG(A, B) of two genomes
A and B, find a shortest sequence of DCJ operations O_1, O_2, ..., O_d that
transforms A into B. We call dcj(A, B) := d the /double-cut-and-join distance/.

Lemma 6.1: Two genomes A and B over the same N genes are identical if and only
if their adjacency graph has N - I/2 cycles, where I is the number of odd paths.

In other words, with C = #cycles and I = #odd paths:
    A = B  <=>  N = C + I/2  <=>  N - C - I/2 = 0

The application of one DCJ operation can change the graph AG(A,B) in
the following ways:
- number of cycles by −1, 0 or +1
- number of odd paths by −2, 0 or +2
- no DCJ changes odd paths and cycles at the same time

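Therefore, we have ∆(C + I/2) \in {−1, 0, +1} for every DCJ operation.

Since C + I/2 = N holds once A has been transformed into B (Lemma 6.1) and a
single DCJ increases C + I/2 by at most one, this directly leads to a lower
bound of the DCJ distance:
    dcj(A, B) >= N-C-I/2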
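
The bound can be evaluated directly from the two vertex sets. The sketch below
counts cycles and odd paths with a small union-find structure; this is one
possible way to do the counting, not necessarily the one intended here. It
assumes genomes given as lists of frozensets over the same extremities, and
the function name is hypothetical.

```python
from collections import Counter


def dcj_lower_bound(A, B):
    """Return N - C - I/2 for the adjacency graph AG(A, B)."""
    nodes = [("A", i) for i in range(len(A))] + [("B", j) for j in range(len(B))]
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    in_b = {x: ("B", j) for j, v in enumerate(B) for x in v}
    edges = [(("A", i), in_b[x]) for i, u in enumerate(A) for x in u]
    for a, b in edges:
        parent[find(a)] = find(b)

    n_nodes = Counter(find(n) for n in nodes)
    n_edges = Counter(find(a) for a, _ in edges)
    # A component with as many edges as vertices is a cycle; otherwise it
    # is a path whose length is its number of edges.
    cycles = sum(1 for r in n_nodes if n_edges[r] == n_nodes[r])
    odd_paths = sum(1 for r in n_nodes
                    if n_edges[r] < n_nodes[r] and n_edges[r] % 2 == 1)
    n_genes = len(edges) // 2               # one edge per extremity
    return n_genes - cycles - odd_paths // 2


# A = (o 1 2 3 o), B = (o 1 3 2 o): a transposition
A = [frozenset({"1t"}), frozenset({"1h", "2t"}),
     frozenset({"2h", "3t"}), frozenset({"3h"})]
B = [frozenset({"1t"}), frozenset({"1h", "3t"}),
     frozenset({"3h", "2t"}), frozenset({"2h"})]
print(dcj_lower_bound(A, B))
```

For this pair the graph has no cycle and two odd paths (of lengths 1 and 5), so
the bound is 3 - 0 - 1 = 2, in line with a transposition needing two DCJs.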

The following simple greedy sorting algorithm attains this bound:

- Algorithm 6.1 (greedy sorting by DCJ) ---------------------------------------

Input: adjacency graph AG(A, B)
Output: sequence O_1, ..., O_d of DCJ operations that transforms A into B,
        such that d = dcj(A, B)

for each adjacency {p, q} in genome B do
    let u, v be the vertices of genome A that contain p and q, respectively
    if u \neq v then
        replace u and v in A by {p, q} and (u \ {p}) \cup (v \ {q}) and
        report the corresponding DCJ operation
        (if u and v are both telomeres, the second set is empty and is dropped)
    end
end
for each telomere {p} in genome B do
    let u be the vertex of genome A that contains p
    if u is an adjacency then
        replace u in A by {p} and (u \ {p}) and report the corresponding DCJ
        operation
    end
end

-------------------------------------------------------------------------------

Each reported DCJ operation increases (C + I/2) by one; therefore the greedy
algorithm transforms A into B in N - (C + I/2) steps, which matches the lower
bound, i.e., Algorithm 6.1 is optimal. Using lookup tables that map gene
extremities to vertices of the adjacency graph, the algorithm runs in O(N) time
and space.
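
A possible Python rendering of Algorithm 6.1 is sketched below. It works on the
vertex sets only and keeps a dictionary from extremities to vertices of A as
the lookup table. The representation (frozensets of strings such as "1h"), the
function name, and the format of the reported operations are assumptions of
this sketch.

```python
def greedy_dcj_sort(A, B):
    """Transform vertex set A into B.

    Returns (operations, final vertex set of A); each operation is a pair
    (removed vertices, inserted vertices).
    """
    A = set(A)
    loc = {x: v for v in A for x in v}   # extremity -> vertex of A
    ops = []

    def replace(old, new):
        new = [v for v in new if v]      # drop the empty set of case (c)
        A.difference_update(old)
        for v in new:
            A.add(v)
            for x in v:
                loc[x] = v
        ops.append((old, new))

    for adjacency in B:
        if len(adjacency) != 2:
            continue
        p, q = adjacency
        u, v = loc[p], loc[q]
        if u != v:
            replace([u, v], [frozenset((p, q)), (u - {p}) | (v - {q})])

    for telomere in B:
        if len(telomere) != 1:
            continue
        (p,) = telomere
        u = loc[p]
        if len(u) == 2:                  # u is an adjacency in A
            replace([u], [frozenset((p,)), u - {p}])

    return ops, A


# A = (o 1 2 3 o), B = (o 1 3 2 o)
A = [frozenset({"1t"}), frozenset({"1h", "2t"}),
     frozenset({"2h", "3t"}), frozenset({"3h"})]
B = [frozenset({"1t"}), frozenset({"1h", "3t"}),
     frozenset({"3h", "2t"}), frozenset({"2h"})]
ops, result = greedy_dcj_sort(A, B)
print(len(ops), result == set(B))
```

On this input the first operation excises gene 2 as a circular chromosome and
the second reintegrates it after gene 3, i.e., two DCJs.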