8. The breakpoint distance with duplicates
Literature:
- Angibaud, Sébastien, et al. "Efficient tools for computing the number of
breakpoints and the number of adjacencies between two genomes with
duplicate genes." Journal of Computational Biology 15.8 (2008): 1093-1115.
8.1 Orthologs and Paralogs
Two genes that descended from the same ancestral sequence are /homologous/.
A set of homologs is also called a /gene family/. A branching point in the
phylogeny of a gene family is associated with an evolutionary event; here we
only consider two kinds, namely /speciation/ and /duplication/. Two genes are
/orthologous/ if they separated into different evolutionary paths through a
speciation event. Conversely, if the branching point was a duplication event,
they are /paralogous/.
In a comparison between two genomes, two or more genes of one genome can be
(co-)orthologous to two or more genes of the other genome. This raises the
question of how gene orders can be compared in the presence of co-orthologous
genes, which are henceforth called /gene duplicates/.
8.2 Models with duplicates
Definition 8.1 (general signed genome): A /genome/ is a string drawn from a
signed alphabet of gene families \Sigma.
ex.:      0  1  2  3  4  5  6  7  8  9
     G = (o  b  a -b -c  d -e  a -d  o)
Note that in this genome model, genes are represented by their gene family
membership. Yet, each gene is unique and can be uniquely identified by its
position in the genome.
In the following, F(G) denotes the set of gene families of genome G.
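As a minimal sketch of this genome model, the example genome can be stored as a Python list whose indices are the gene positions. The helper names `parse_genome` and `families` are illustrative assumptions and do not come from the cited literature.

```python
def parse_genome(text):
    """Parse 'o b -a ...' into a list of (family, sign) pairs.

    The list index is the gene's position, which identifies it uniquely.
    """
    genes = []
    for token in text.split():
        sign = -1 if token.startswith("-") else 1
        genes.append((token.lstrip("-"), sign))
    return genes


def families(genome):
    """Return F(G), the set of gene families occurring in the genome."""
    return {family for family, _sign in genome}


G = parse_genome("o b a -b -c d -e a -d o")
print(G[3])                 # gene at position 3: family 'b', negative sign
print(sorted(families(G)))  # all families of G, including 'o'
```

Storing the family and the sign separately keeps family membership tests independent of orientation, while the position still distinguishes duplicates such as the two genes of family a.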
Definition 8.2 (matching): A /matching/ M of two genomes G and H is a set of
assignments (g, h), with g \in G and h \in H, such that for each assignment
(g, h) \in M (i) genes g and h are members of the same gene family and
(ii) neither g nor h is contained in any other assignment of M.
ex.:      0  1  2  3  4  5  6  7  8  9 10
     H = (o  c -a  c  a  c  e -b  b -d  o)
a matching of G and H could be M = {(o_0, o_0), (a_2, -a_2), (-c_4, c_1),
(-e_6, e_6), (-d_8, -d_9), (o_9, o_10)}
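The two conditions of Definition 8.2 can be checked mechanically. The following sketch assumes that a matching is given as a set of position pairs (position in G, position in H); the function names are hypothetical.

```python
def parse_genome(text):
    """Parse 'o b -a ...' into (family, sign) pairs indexed by position."""
    return [(t.lstrip("-"), -1 if t.startswith("-") else 1) for t in text.split()]


def is_matching(G, H, M):
    """Check Definition 8.2 for a set M of (position in G, position in H) pairs."""
    used_g, used_h = set(), set()
    for i, j in M:
        if G[i][0] != H[j][0]:          # (i) both genes belong to the same family
            return False
        if i in used_g or j in used_h:  # (ii) each gene occurs in at most one assignment
            return False
        used_g.add(i)
        used_h.add(j)
    return True


G = parse_genome("o b a -b -c d -e a -d o")
H = parse_genome("o c -a c a c e -b b -d o")
M = {(0, 0), (2, 2), (4, 1), (6, 6), (8, 9), (9, 10)}

print(is_matching(G, H, M))             # the example matching is valid
print(is_matching(G, H, M | {(2, 4)}))  # invalid: a_2 would be assigned twice
```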
Definition 8.3 (exemplar matching model): A matching of two genomes G and H is
/exemplar/ if it contains exactly one assignment for each gene family in F(G)
\cap F(H).
ex.: M = {(o_0, o_0), (b_1, -b_7), (a_2, -a_2), (-c_4, c_1), (-e_6, e_6), (-d_8,
-d_9), (o_9, o_10)} is an exemplar matching
Definition 8.4 (maximum matching model): A matching of two genomes G and H is
/maximum/ if it contains as many assignments between genes of G and H as
possible.
ex.: M = {(o_0, o_0), (b_1, -b_7), (a_2, -a_2), (-b_3, b_8), (-c_4, c_1), (-e_6,
e_6), (a_7, a_4), (-d_8, -d_9), (o_9, o_10)} is a maximum matching
Definition 8.5 (intermediate matching model): A matching of two genomes G and H
is /intermediate/ if it contains at least one assignment for each gene family of
F(G) \cap F(H) and both pairs of telomeres.
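The three matching models can be told apart by counting assignments per gene family. The sketch below rests on two assumptions that are not stated in the definitions: the symbol o marks the telomeres and is therefore excluded from the per-family counts, and M is already a valid matching given as position pairs. The name `model_properties` is hypothetical.

```python
from collections import Counter

TELOMERE = "o"  # assumption: 'o' marks the telomeres, as in the examples


def parse_genome(text):
    """Parse 'o b -a ...' into (family, sign) pairs indexed by position."""
    return [(t.lstrip("-"), -1 if t.startswith("-") else 1) for t in text.split()]


def model_properties(G, H, M):
    """Report which matching models a valid matching M satisfies."""
    count_g = Counter(f for f, _ in G)
    count_h = Counter(f for f, _ in H)
    common = set(count_g) & set(count_h)
    shared = common - {TELOMERE}              # F(G) \cap F(H) without telomeres
    used = Counter(G[i][0] for i, _ in M)     # number of assignments per family
    max_size = sum(min(count_g[f], count_h[f]) for f in common)
    telomeres = {(0, 0), (len(G) - 1, len(H) - 1)} <= set(M)
    return {
        "exemplar": all(used[f] == 1 for f in shared),
        "intermediate": telomeres and all(used[f] >= 1 for f in shared),
        "maximum": len(M) == max_size,
    }


G = parse_genome("o b a -b -c d -e a -d o")
H = parse_genome("o c -a c a c e -b b -d o")
one_per_family = {(0, 0), (1, 7), (2, 2), (4, 1), (6, 6), (8, 9), (9, 10)}
larger = one_per_family | {(3, 8), (7, 4)}    # adds (-b_3, b_8) and (a_7, a_4)

print(model_properties(G, H, one_per_family))
print(model_properties(G, H, larger))
```

For these two genomes the largest possible matching has nine assignments, since each family contributes the smaller of its two occurrence counts. The first matching is therefore exemplar and intermediate but not maximum, while the second is intermediate and maximum but not exemplar.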
8.2 Conserved adjacencies and breakpoints in matchings
Definition 8.6 (adjacency, conserved adjacency, breakpoint): Given a matching M
of genomes G and H, two genes g, g' of G form an /adjacency/ if g and g' are
both contained in assignments of M and there exists no assignment (g^*, .) in M
s.t. g^* lies between g and g' in G, i.e., g < g^* < g'. The same definition
applies to any two genes h, h' of H. Two assignments (g, h) and (g', h') of M
form a /conserved adjacency/ if (i) g, g' and h, h' are both adjacencies and
(ii) the relative orientation of g and g' is the same as that of h and h'.
Otherwise, if only g, g' or only h, h' is an adjacency, then the assignments
(g, h) and (g', h') form a /breakpoint/.
"Relative orientation" simply means that if g and g' face each other e.g. tail
to tail, so must h and h'.
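One way to make "relative orientation" concrete is to record, for every adjacency, which gene ends touch: an adjacency is conserved exactly when the same two ends touch in both genomes. The sketch below counts a(M) this way. It assumes that a gene with positive sign is read from tail to head (any consistent convention yields the same count) and that M is a valid matching given as position pairs; all function names are hypothetical.

```python
def parse_genome(text):
    """Parse 'o b -a ...' into (family, sign) pairs indexed by position."""
    return [(t.lstrip("-"), -1 if t.startswith("-") else 1) for t in text.split()]


def touching_ends(genome, matched):
    """Adjacencies of one genome, each as the set of the two touching gene ends.

    `matched` maps a position in the genome to the index of its assignment.
    A gene end is a pair (assignment index, 'h' or 't') for head or tail.
    """
    order = sorted(matched)                      # matched positions, left to right
    result = set()
    for left, right in zip(order, order[1:]):    # consecutive matched genes
        right_end = "h" if genome[left][1] > 0 else "t"  # right end of the left gene
        left_end = "t" if genome[right][1] > 0 else "h"  # left end of the right gene
        result.add(frozenset({(matched[left], right_end),
                              (matched[right], left_end)}))
    return result


def conserved_adjacencies(G, H, M):
    """Count a(M): adjacencies whose touching gene ends agree in G and H."""
    pairs = sorted(M)
    in_g = {i: k for k, (i, _j) in enumerate(pairs)}
    in_h = {j: k for k, (_i, j) in enumerate(pairs)}
    return len(touching_ends(G, in_g) & touching_ends(H, in_h))


G = parse_genome("o b a -b -c d -e a -d o")
H = parse_genome("o c -a c a c e -b b -d o")
M = {(0, 0), (1, 7), (2, 2), (4, 1), (6, 6), (8, 9), (9, 10)}
print(conserved_adjacencies(G, H, M))
```

For this matching the count is 2: the pair a_2, -c_4 of G reappears reversed and with flipped signs as c_1, -a_2 in H, and -d, o is adjacent with the same orientation at the right end of both genomes.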
We now study the number of conserved adjacencies, denoted a(M), and the number
of breakpoints, denoted b(M), of a matching M between two genomes G and H.
Lemma 8.1: For any matching M of two genomes, a(M) + b(M) = |M| + 1.
We are interested in solutions of the following six optimization problems: Which
(exemplar | maximum | intermediate ) matching M (minimizes b(M) | maximizes
a(M)) between two genomes G and H?
Lemma 8.2: Minimizing b(M) under the exemplar model is equivalent to maximizing
a(M); the same result holds for the maximum matching model.
Lemma 8.3: Minimizing b(M) under the exemplar model is equivalent to minimizing
b(M) under the intermediate model.
By Lemmas 8.2 and 8.3, only three of the six problems are distinct. Solving any
of them is NP-hard, as shown by Bryant (1999) for the exemplar model and by
Angibaud et al. (2008) for the maximum and intermediate models.
8.3 Computing exact solutions with integer linear programming
An integer linear program (ILP) is a way of writing combinatorial optimization
problems:
Input:
1. a set of variables x_1, ..., x_n,
2. an objective (linear) function, and
3. a set of linear inequalities and equalities
Output:
A feasible solution (i.e. a solution that obeys the linear (in-) equalities)
that optimizes the objective function
ex.: maximize 3 x_1 + 6 x_2 + 8 x_3
subject to 2 x_1 + 7 x_2 <= 58
1 x_2 + 4 x_3 <= 43
variables: x_1, x_2, x_3
Solving ILPs is NP-hard in general, but there exist fast solvers (software) that
can produce solutions for many instances in little time.
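The small example ILP above can be verified by exhaustive enumeration. The example does not state the domain of its variables, so this sketch assumes non-negative integers; without a lower bound the program would be unbounded. Real instances are handed to an ILP solver, and enumeration is used here only because the search space is tiny.

```python
from itertools import product

best_value, best_point = None, None

# With all variables >= 0, the constraints bound x1 <= 29, x2 <= 8 and x3 <= 10.
for x1, x2, x3 in product(range(30), range(9), range(11)):
    feasible = 2 * x1 + 7 * x2 <= 58 and x2 + 4 * x3 <= 43
    if not feasible:
        continue
    value = 3 * x1 + 6 * x2 + 8 * x3        # objective function
    if best_value is None or value > best_value:
        best_value, best_point = value, (x1, x2, x3)

print(best_value, best_point)
```

Under the stated assumption the search reports the optimum 167 at (x_1, x_2, x_3) = (29, 0, 10).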
How can maximum bipartite matching be written as an ILP?
Given a bipartite graph G = (U, V, E)
maximize \sum_{(u, v) \in E} x_uv
subject to:
for each vertex u of U
\sum_{v is neighbor of u} x_uv <= 1
for each vertex v of V
\sum_{u is neighbor of v} x_uv <= 1
variables: for each (u, v) \in E, x_uv \in {0, 1}
An ILP whose integer variables can only take the values 0 and 1 is also
called a /0-1 linear program/ or a /boolean linear program/.
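To illustrate the 0-1 formulation of maximum bipartite matching, the following sketch evaluates it by brute force on a small, made-up graph (the vertices and edges are assumptions chosen for illustration). It enumerates every 0-1 vector x with one entry per edge, keeps the vectors that satisfy the degree constraints, and picks one with maximal objective value. A solver would replace the enumeration in practice.

```python
from itertools import product

# Hypothetical bipartite graph: vertex sets U and V, edge list E.
U, V = ["u1", "u2", "u3"], ["v1", "v2"]
E = [("u1", "v1"), ("u2", "v1"), ("u2", "v2"), ("u3", "v2")]


def feasible(x):
    """Check that every vertex is covered by at most one chosen edge."""
    for u in U:
        if sum(x[k] for k, (a, _b) in enumerate(E) if a == u) > 1:
            return False
    for v in V:
        if sum(x[k] for k, (_a, b) in enumerate(E) if b == v) > 1:
            return False
    return True


# The objective is the sum of all x_uv, i.e. the number of chosen edges.
best = max((x for x in product((0, 1), repeat=len(E)) if feasible(x)), key=sum)
matching = [edge for edge, chosen in zip(E, best) if chosen]
print(sum(best), matching)
```

Since V has only two vertices, no matching in this graph can contain more than two edges, and the enumeration finds one of that size.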