9. Family-free genome comparison
Gene order analysis is part of a larger, circular problem which is rooted in the
fact that homologies between genomic sequences can be neither /observed/ nor
/detected/, but only /inferred/. Nowadays, homology inference is usually done /in
silico/ by automated assessment of sequence similarity through the computation
of pairwise or multiple alignments. Yet, this approach is error-prone, giving
rise to a non-negligible number of false predictions. This presents a serious
problem for gene order analysis, which presupposes knowledge of (true)
homologies.
Ironically, a powerful tool to improve homology inference is gene order analysis
itself: The knowledge of a gene's genomic neighborhood and the neighborhood's
homologies helps to deduce the evolutionary relationships of the gene itself.
This leads to a circularity problem: Homologies are required for gene order
analysis, yet gene order analysis can improve homology inference.
(Gene) family-free genome comparison is a branch of research in the field of
comparative genomics that aims to break the circularity by devising methods that
work directly on the input data of homology inference, that is to say a measure
of sequence similarity between genes, to perform both homology inference and
gene order analysis.
9.1 Computing the number of family-free adjacencies
Definition 9.1 (gene similarity graph): Given two genomes S and T, a gene
similarity graph G = (V_S, V_T, E) of S and T is a bipartite graph with one
vertex for each gene of S and one vertex for each gene of T. An edge e = (u, v)
with weight w(e) between two vertices u and v (one from S and one from T)
indicates sequence similarity between the two genes represented by these
vertices.
We will use the same notion of adjacencies and conserved adjacencies as in
Definition 8.6 from the previous chapter, except that the matching is now a
matching of a gene similarity graph. Given a matching M, we denote by S^M and
T^M the M-induced subsequences of genomes S and T.
Because the edges are now weighted, where some edges can be weak (i.e. have a
low weight) and others can be strong (i.e. have a high weight), it no longer
makes sense to compute a matching that optimizes the bare number of adjacencies
or breakpoints. Instead, we quantify the quality of a matching by two different
measures:
- edge measure: edg(M) = \sum_{e \in M} w(e)
- adjacency measure: adj_{ST}(M) = \sum_{e,f \in M; e,f form a conserved
adjacency in S^M and T^M} \sqrt{w(e) * w(f)}
This leads to the following optimization problem:
Problem 9.1 (FF-Adjacencies): Given two genomes S, T and some α ∈ [0,1], find a
matching M in the gene similarity graph G = (V_S, V_T, E) of S and T such that
the following formula is maximized: F_α(M) = α * adj_{ST}(M) + (1−α) * edg(M).
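The two measures and the objective F_α can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions that are not part of
the text above: genomes are linear, genes are unsigned and identified by their
position, an edge is a pair (i, j) of positions in S and T, and two matched
edges form a conserved adjacency if they are consecutive in both S^M and T^M
(the precise notion is the one of Definition 8.6). All function names are
hypothetical, and the exhaustive search is exponential, so it is meant for toy
instances only.

```python
from itertools import combinations
from math import sqrt

def edg(matching, weights):
    """Edge measure: total weight of the matched edges."""
    return sum(weights[e] for e in matching)

def adj(matching, weights):
    """Adjacency measure under the simplified (unsigned, linear) assumption."""
    by_s = sorted(matching)  # matched edges in the order of S^M
    by_t = sorted(matching, key=lambda e: e[1])  # order of T^M
    rank_t = {e: r for r, e in enumerate(by_t)}
    total = 0.0
    for e, f in zip(by_s, by_s[1:]):  # e, f consecutive in S^M
        if abs(rank_t[e] - rank_t[f]) == 1:  # ... and consecutive in T^M
            total += sqrt(weights[e] * weights[f])
    return total

def f_alpha(matching, weights, alpha):
    """Objective of FF-Adjacencies for a given matching."""
    return alpha * adj(matching, weights) + (1 - alpha) * edg(matching, weights)

def all_matchings(edges):
    """Yield every set of edges in which no gene occurs twice."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            if len({i for i, _ in subset}) == k and len({j for _, j in subset}) == k:
                yield subset

def best_matching(weights, alpha):
    """Exhaustive search for a matching maximizing F_alpha (toy sizes only)."""
    return max(all_matchings(weights), key=lambda m: f_alpha(m, weights, alpha))

# Toy instance: three genes per genome, weights chosen for illustration.
weights = {(0, 0): 0.6, (1, 1): 0.6, (0, 1): 0.7, (1, 0): 0.7, (2, 2): 0.9}
print(best_matching(weights, alpha=0.5))
print(best_matching(weights, alpha=0.0))
```

In this toy instance, α = 0 selects the heaviest matching (the two crossing
edges plus the last gene pair), whereas α = 0.5 prefers the three collinear
edges, because they form two conserved adjacencies instead of one.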
Lemma 9.1: Given two genomes S and T, any matching M' that is a solution to the
maximum weighted matching problem in the gene similarity graph G = (V_S, V_T, E)
of S and T is a 1/(1−α)-approximation for problem FF-Adjacencies for α < 1.
Furthermore, for any optimal solution M of FF-Adjacencies, it holds that
(1−α) * edg(M') ≤ F_α(M') ≤ F_α(M) ≤ edg(M) ≤ edg(M'),
where the last two inequalities hold for every matching M.
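One possible way to see where these bounds come from is sketched below. The
sketch assumes non-negative edge weights and that every matched edge takes part
in at most two conserved adjacencies; M^* denotes an optimal solution of
FF-Adjacencies and is notation introduced here for illustration only.

```latex
\begin{align*}
\sqrt{w(e)\,w(f)} &\le \tfrac{1}{2}\bigl(w(e) + w(f)\bigr)
  && \text{(arithmetic vs. geometric mean)} \\
\mathrm{adj}_{ST}(M) &\le \sum_{e \in M} 2 \cdot \tfrac{1}{2}\, w(e) = \mathrm{edg}(M)
  && \text{(each edge lies in at most two adjacencies)} \\
F_\alpha(M) &= \alpha\,\mathrm{adj}_{ST}(M) + (1-\alpha)\,\mathrm{edg}(M)
  \le \mathrm{edg}(M) \le \mathrm{edg}(M')
  && \text{(for every matching } M\text{)} \\
F_\alpha(M') &\ge (1-\alpha)\,\mathrm{edg}(M')
  && \text{(since } \mathrm{adj}_{ST}(M') \ge 0\text{)} \\
F_\alpha(M^*) &\le \mathrm{edg}(M') \le \frac{1}{1-\alpha}\, F_\alpha(M')
  && \text{(approximation ratio, } \alpha < 1\text{)}
\end{align*}
```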
9.2 Polynomial-time data reduction: an algorithmic approach towards solving
NP-hard problems
Idea: Devise a rule to reduce an instance of an NP-hard problem and then solve
the reduced instance with an ILP. The reduction can lead to a massive speed-up
in computation time.
Two strategies for data reduction in a given instance I of an NP-hard problem:
1. Discover a partial solution that is a subset of at least one optimal solution
of instance I.
2. Compute a potentially sub-optimal solution of instance I with a
polynomial-time algorithm. Then reduce the search space by discarding partial
solutions that can only lead to solutions with a lower score.
The time for computing solutions to FF-Adjacencies can be significantly reduced
by making use of both strategies. Strategy (1) can be implemented by identifying
optimal adjacencies between directly neighboring genes in the gene similarity
graph of two genomes. Strategy (2) can be implemented by exploiting the
following lemma:
Lemma 9.2: Given a gene similarity graph G = (V_S, V_T, E) of genomes S and T,
let M ⊆ E and M' ⊆ E' be two maximum weighted matchings in the bipartite graphs
G and G' = (V_S', V_T', E'), respectively, where V_S' ⊂ V_S, V_T' ⊂ V_T, and
E' = E \ E_D for some non-empty edge set E_D that is consecutive in G.
Further, let M'' ⊆ E be a heuristic solution, which is any matching in gene
similarity graph G such that F_α(M'') ≥ F_α(M). Then all potentially conserved
adjacencies formed by edges of E_D can be discarded if F_α(M'') > edg(M')
without losing optimality in solving problem FF-Adjacencies for gene similarity
graph G.