8. The breakpoint distance with duplicates

Literature:
- Angibaud, Sébastien, et al. "Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes." Journal of Computational Biology 15.8 (2008): 1093-1115.

8.1 Orthologs and Paralogs

Two genes that descended from the same ancestral sequence are /homologous/. A set of homologs is also called a /gene family/. Each branching point in the phylogeny of a gene family is associated with an evolutionary event; here we consider only two kinds, /speciation/ and /duplication/. Two genes are /orthologous/ if they separated into different evolutionary paths through a speciation event. Conversely, if the branching point was a duplication event, they are /paralogous/.

In a comparison between two genomes, two or more genes of one genome can be (co-)orthologous to two or more genes of the other genome. This raises the question of how gene orders can be compared in the presence of co-orthologous genes, which are called /gene duplicates/ in the following.

8.2 Models with duplicates

Definition 8.1 (general signed genome): A /genome/ is a string drawn from a signed alphabet of gene families \Sigma.

ex.:
         0  1  2  3  4  5  6  7  8  9
    G = (o  b  a -b -c  d -e  a -d  o)

Note that in this genome model, genes are represented by their gene family membership. Yet each gene is unique and can be identified by its position in the genome. In the following, F(G) denotes the set of gene families of genome G.

Definition 8.2 (matching): A /matching/ M of two genomes G and H is a set of assignments (g, h), with g a gene of G and h a gene of H, such that for each assignment (g, h) \in M, (i) genes g and h are members of the same gene family and (ii) neither g nor h is contained in any other assignment of M.
ex.:
         0  1  2  3  4  5  6  7  8  9 10
    H = (o  c -a  c  a  c  e -b  b -d  o)

A matching of G and H could be
    M = {(o_0, o_0), (a_2, -a_2), (-c_4, c_1), (-e_6, e_6), (-d_8, -d_9), (o_9, o_10)}

Definition 8.3 (exemplar matching model): A matching of two genomes G and H is /exemplar/ if it contains exactly one assignment for each gene family in F(G) \cap F(H).

ex.:
    M = {(o_0, o_0), (b_1, -b_7), (a_2, -a_2), (-c_4, c_1), (-e_6, e_6), (-d_8, -d_9), (o_9, o_10)}
is an exemplar matching.

Definition 8.4 (maximum matching model): A matching of two genomes G and H is /maximum/ if it contains as many assignments between genes of G and H as possible.

ex.:
    M = {(o_0, o_0), (b_1, -b_7), (a_2, -a_2), (-b_3, b_8), (-c_4, c_1), (-e_6, e_6), (a_7, a_4), (-d_8, -d_9), (o_9, o_10)}
is a maximum matching. The assignment (a_7, a_4) is required: without it, one more pair of genes of family a could still be matched.

Definition 8.5 (intermediate matching model): A matching of two genomes G and H is /intermediate/ if it contains at least one assignment for each gene family of F(G) \cap F(H) and both pairs of telomeres.

8.2 Conserved adjacencies and breakpoints in matchings

Definition 8.6 (adjacency, conserved adjacency, breakpoint): Given a matching M of genomes G and H, two genes g, g' of G form an /adjacency/ if g and g' are both contained in assignments of M and there exists no assignment (g^*, .) in M s.t. g < g^* < g', where < compares positions in G. The same definition applies to any two genes h, h' of H. Two assignments (g, h) and (g', h') of M form a /conserved adjacency/ if (i) g, g' and h, h' are both adjacencies and (ii) the relative orientation of g and g' is the same as that of h and h'. Otherwise, if only g, g' is an adjacency or only h, h', then the assignments (g, h), (g', h') form a /breakpoint/.

"Relative orientation" simply means that if g and g' face each other in a certain way, e.g. tail to tail, then so must h and h'.

We now study the number of conserved adjacencies, denoted a(M), and the number of breakpoints, denoted b(M), of a matching M between two genomes G and H.

Lemma 8.1: For any matching M of two genomes it holds that a(M) + b(M) = |M| + 1.
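The definitions above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: all function names are hypothetical, genes are referred to by their positions in G and H, the symbol "o" is assumed to mark the telomeres (and is therefore left out of the per-family count in the exemplar check), and "same relative orientation" is read as one common convention: the two partners are consecutive in H either in the same order with unchanged signs, or in reverse order with both signs flipped.

```python
from collections import Counter

# Genomes from the examples above; a leading "-" marks reversed orientation.
G = "o b a -b -c d -e a -d o".split()
H = "o c -a c a c e -b b -d o".split()

TELOMERE = "o"  # assumption: "o" marks the telomeres, not a gene family


def family(gene):
    """Gene family of a signed gene, e.g. family("-b") == "b"."""
    return gene.lstrip("-")


def sign(gene):
    """Orientation of a signed gene: -1 if reversed, +1 otherwise."""
    return -1 if gene.startswith("-") else 1


def is_matching(G, H, M):
    """Definition 8.2: same family per assignment, each gene used at most once."""
    gs = [i for i, _ in M]
    hs = [j for _, j in M]
    same_family = all(family(G[i]) == family(H[j]) for i, j in M)
    return same_family and len(set(gs)) == len(gs) and len(set(hs)) == len(hs)


def is_exemplar(G, H, M):
    """Definition 8.3: exactly one assignment per shared family (telomeres aside)."""
    shared = ({family(g) for g in G} & {family(h) for h in H}) - {TELOMERE}
    used = Counter(family(G[i]) for i, _ in M if family(G[i]) != TELOMERE)
    return is_matching(G, H, M) and all(used[f] == 1 for f in shared)


def max_matching_size(G, H):
    """Size of any maximum matching (Definition 8.4): sum of min family counts."""
    count_g = Counter(family(g) for g in G)
    count_h = Counter(family(h) for h in H)
    return sum(min(count_g[f], count_h[f]) for f in count_g)


def conserved_adjacencies(G, H, M):
    """a(M) under the orientation convention assumed in the lead-in."""
    pairs = sorted(M)  # assignments ordered by position in G
    # Rank of each matched H position among all matched H positions.
    rank = {j: r for r, j in enumerate(sorted(j for _, j in M))}
    count = 0
    for (g1, h1), (g2, h2) in zip(pairs, pairs[1:]):
        s1 = sign(G[g1]) * sign(H[h1])  # +1 if g1 and h1 have equal orientation
        s2 = sign(G[g2]) * sign(H[h2])
        step = rank[h2] - rank[h1]      # +1: same order in H, -1: reversed order
        if (step == 1 and s1 == s2 == 1) or (step == -1 and s1 == s2 == -1):
            count += 1
    return count


# The exemplar matching from the example, written as (position in G, position in H).
M = {(0, 0), (1, 7), (2, 2), (4, 1), (6, 6), (8, 9), (9, 10)}
print(is_matching(G, H, M), is_exemplar(G, H, M), conserved_adjacencies(G, H, M))
# prints: True True 2
```

Under this convention the two conserved adjacencies are (a_2, -c_4), which appears reversed and sign-flipped as (c_1, -a_2) in H, and (-d_8, o_9), which appears unchanged as (-d_9, o_10).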
We are interested in solutions of the following six optimization problems: Which (exemplar | maximum | intermediate) matching M (minimizes b(M) | maximizes a(M)) between two genomes G and H?

Lemma 8.2: Minimizing b(M) under the exemplar model is equivalent to maximizing a(M); the same result holds for the maximum matching model.

Lemma 8.3: Minimizing b(M) under the exemplar model is equivalent to minimizing b(M) under the intermediate model.

By Lemmas 8.2 and 8.3, only three of the six problems are distinct. Solving any of them is NP-hard, as shown by Bryant (1999) for the exemplar model and by Angibaud et al. (2008) for the maximum and intermediate models.

8.3 Computing exact solutions with integer linear programming

An integer linear program (ILP) is a way of writing combinatorial optimization problems:

Input:
1. a set of variables x_1, ..., x_n,
2. a linear objective function, and
3. a set of linear inequalities and equalities

Output: A feasible solution (i.e. an assignment of integer values to the variables that obeys all linear (in-)equalities) that optimizes the objective function

ex.:
    maximize    3 x_1 + 6 x_2 + 8 x_3
    subject to  2 x_1 + 7 x_2 <= 58
                1 x_2 + 4 x_3 <= 43
    variables:  x_1, x_2, x_3

Solving ILPs is NP-hard in general, but there exist fast solvers (software) that produce solutions for many instances in little time.

How to write a maximum bipartite matching as an ILP? Given a bipartite graph G = (U, V, E):

    maximize    \sum_{(u, v) \in E} x_uv
    subject to  for each vertex u of U:  \sum_{v is neighbor of u} x_uv <= 1
                for each vertex v of V:  \sum_{u is neighbor of v} x_uv <= 1
    variables:  for each (u, v) \in E, x_uv \in {0, 1}

An ILP whose variables can only take the values 0 and 1 is also called a /0-1 linear program/ or a /boolean linear program/.
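To make the matching formulation concrete, the following Python sketch evaluates it by brute force on a tiny graph: it enumerates every 0-1 assignment of the variables x_uv, keeps those that satisfy the degree constraints, and returns one with the largest objective value. This is not how ILP solvers work; it takes time exponential in |E| and only illustrates what "feasible" and "optimal" mean here. The function name and the example graph are hypothetical.

```python
from itertools import product


def max_bipartite_matching_01(edges):
    """Solve the 0-1 program for bipartite matching by full enumeration.

    edges: list of (u, v) pairs with u in U and v in V.
    Returns (objective value, list of edges with x_uv = 1).
    """
    edges = list(edges)
    best_value, best_edges = -1, []
    # Each tuple x assigns a value in {0, 1} to every variable x_uv.
    for x in product((0, 1), repeat=len(edges)):
        load_u, load_v = {}, {}
        for (u, v), x_uv in zip(edges, x):
            load_u[u] = load_u.get(u, 0) + x_uv  # left-hand side for vertex u
            load_v[v] = load_v.get(v, 0) + x_uv  # left-hand side for vertex v
        feasible = (all(s <= 1 for s in load_u.values())
                    and all(s <= 1 for s in load_v.values()))
        value = sum(x)  # objective: number of selected edges
        if feasible and value > best_value:
            best_value = value
            best_edges = [e for e, x_uv in zip(edges, x) if x_uv]
    return best_value, best_edges


# Hypothetical example graph with U = {u1, u2, u3} and V = {v1, v2}.
example_edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u3", "v2")]
value, chosen = max_bipartite_matching_01(example_edges)
print(value)  # prints: 2
```

Since V has only two vertices, no matching of this graph can contain more than two edges, which agrees with the optimum found. For instances of realistic size the same objective and constraints would be handed to an ILP solver instead of being enumerated.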