9. Family-free genome comparison

Gene order analysis is part of a larger, circular problem that is rooted in the fact that homologies between genomic sequences can be neither /observed/ nor /detected/, but only /inferred/. Nowadays, homology inference is usually done /in silico/ by automated assessment of sequence similarity through the computation of pairwise or multiple alignments. Yet this approach is error-prone and gives rise to a non-negligible number of false predictions. This is a serious problem for gene order analysis, which presupposes knowledge of (true) homologies. Ironically, a powerful tool to improve homology inference is gene order analysis itself: knowledge of a gene's genomic neighborhood and of the neighborhood's homologies helps to deduce the evolutionary relationships of the gene itself. This leads to a circularity problem: homologies are required for gene order analysis, yet gene order analysis can improve homology inference.

(Gene) family-free genome comparison is a branch of research in comparative genomics that aims to break this circularity. It devises methods that work directly on the input data of homology inference, that is, a measure of sequence similarity between genes, and that perform both homology inference and gene order analysis.

9.1 Computing the number of family-free adjacencies

Definition 9.1 (gene similarity graph): Given two genomes S and T, a gene similarity graph G = (V_S, V_T, E) of S and T is a bipartite graph with one vertex for each gene of S and one vertex for each gene of T. An edge e = (u, v) with weight w(e) between two vertices u and v (one from S and one from T) indicates sequence similarity between the two genes represented by these vertices.

We use the same notions of adjacencies and conserved adjacencies as in Definition 8.6 of the previous chapter, except that the matching is now a matching of a gene similarity graph.
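The gene similarity graph of Definition 9.1 can be held in a small data structure. The following Python sketch is illustrative only: the class name GeneSimilarityGraph, the helper is_matching, the gene names, and all weights are assumptions made for this example and do not come from the notes.

```python
from dataclasses import dataclass, field


@dataclass
class GeneSimilarityGraph:
    """Bipartite graph with one vertex per gene of S and one per gene of T."""
    genes_s: list                                 # genes of S in genome order
    genes_t: list                                 # genes of T in genome order
    weights: dict = field(default_factory=dict)   # (u, v) -> w(e), u in S, v in T

    def add_edge(self, u, v, w):
        # Edges may only join a gene of S with a gene of T (bipartite graph).
        if u not in self.genes_s or v not in self.genes_t:
            raise ValueError("edge must join a gene of S with a gene of T")
        self.weights[(u, v)] = w


def is_matching(graph, edges):
    """True if all `edges` exist in the graph and no gene occurs in two of them."""
    edges = list(edges)
    if any(e not in graph.weights for e in edges):
        return False
    s_side = [u for u, _ in edges]
    t_side = [v for _, v in edges]
    return len(set(s_side)) == len(s_side) and len(set(t_side)) == len(t_side)


# Hypothetical toy instance: three genes per genome, made-up similarity weights.
G = GeneSimilarityGraph(["s1", "s2", "s3"], ["t1", "t2", "t3"])
G.add_edge("s1", "t1", 0.9)
G.add_edge("s2", "t2", 0.8)
G.add_edge("s2", "t3", 0.3)
G.add_edge("s3", "t3", 0.7)
```

Storing the genes as ordered lists keeps the gene order available, which is what adjacencies are defined on. The edge dictionary carries only the similarity information.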
Given a matching M, we denote by S^M and T^M the M-induced subsequences of genomes S and T. Because edges are now weighted, and some edges can be weak (i.e. have a low weight) while others are strong (i.e. have a high weight), it no longer makes sense to compute a matching that optimizes the bare number of adjacencies or breakpoints. Instead, we quantify the quality of a matching by two different measures:

- edge measure: edg(M) = \sum_{e \in M} w(e)
- adjacency measure: adj_{ST}(M) = \sum_{e,f \in M, e,f form a conserved adjacency in S^M and T^M} \sqrt{w(e) * w(f)}

This leads to the following optimization problem:

Problem 9.1 (FF-Adjacencies): Given two genomes S, T and some α ∈ [0,1], find a matching M in the gene similarity graph G = (V_S, V_T, E) such that the following formula is maximized: F_α(M) = α * adj_{ST}(M) + (1−α) * edg(M).

Lemma 9.1: Given two genomes S and T, any matching M' that is a solution to the maximum weighted matching problem in the gene similarity graph G = (V_S, V_T, E) of S and T is a 1/(1−α)-approximation of problem FF-Adjacencies for α < 1. Furthermore, for any matching M it holds that (1−α) * edg(M) ≤ F_α(M) ≤ edg(M) ≤ edg(M').

9.2 Polynomial-time data reduction: an algorithmic approach towards solving NP-hard problems

Idea: Devise a rule to reduce an instance of an NP-hard problem and then solve the reduced instance with an ILP. The reduction can lead to a massive speed-up in computation time.

Two strategies for data reduction in a given instance I of an NP-hard problem:

1. Discover a partial solution that is a subset of at least one optimal solution of instance I.
2. Compute a potentially sub-optimal solution of instance I with a polynomial-time algorithm. Then reduce the search space by discarding partial solutions that would produce a solution with a lower score.

The time for computing solutions to FF-Adjacencies can be significantly reduced by making use of both strategies.
Strategy (1) can be implemented by identifying optimal adjacencies between directly neighboring genes in the gene similarity graph of two genomes. Strategy (2) can be implemented by exploiting the following lemma:

Lemma 9.2: Given a gene similarity graph G = (V_S, V_T, E) of genomes S and T, let M ⊆ E and M' ⊆ E' be two maximum weighted matchings in the bipartite graphs G and G' = (V_S', V_T', E'), where V_S' ⊂ V_S, V_T' ⊂ V_T, and E' = E \ E_D for some non-empty edge set E_D that is consecutive in the gene similarity graph G. Further, let M'' ⊆ E be a heuristic solution, that is, any matching in the gene similarity graph G such that F_α(M'') ≥ F_α(M). Then all potentially conserved adjacencies formed by edges of E_D can be discarded if F_α(M'') > edg(M'), without losing optimality in solving problem FF-Adjacencies for the gene similarity graph G.
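The measures edg, adj_{ST}, and F_α from Section 9.1, and the way a maximum weighted matching bounds F_α from above (the kind of bound that Lemma 9.2 compares against a heuristic score), can be made concrete on a toy instance. The Python sketch below rests on several assumptions that are not taken from the notes: genes are unsigned and genomes are linear, and two matching edges are treated as a conserved adjacency when their genes are consecutive in both S^M and T^M, which simplifies Definition 8.6 (not reproduced here). The instance, the weights, and the brute-force enumeration are hypothetical and only suitable for tiny inputs. The loop checks, for every matching M of the toy graph, that (1−α) * edg(M) ≤ F_α(M) ≤ edg(M) ≤ edg(M'), and that the optimum is within a factor 1/(1−α) of F_α(M').

```python
from itertools import combinations
from math import sqrt, isclose

# Hypothetical toy instance (all gene names and weights are made up).
S = ["s1", "s2", "s3"]
T = ["t1", "t2", "t3"]
W = {("s1", "t1"): 0.9, ("s2", "t2"): 0.8, ("s2", "t3"): 0.3, ("s3", "t3"): 0.7}


def edg(M):
    """Edge measure: sum of the weights of the matching edges."""
    return sum(W[e] for e in M)


def adj(M):
    """Adjacency measure under the simplified, unsigned adjacency notion."""
    matched_s = {u for u, _ in M}
    matched_t = {v for _, v in M}
    # M-induced subsequences S^M and T^M: matched genes in genome order.
    s_m = [g for g in S if g in matched_s]
    t_m = [g for g in T if g in matched_t]
    pos_s = {g: i for i, g in enumerate(s_m)}
    pos_t = {g: i for i, g in enumerate(t_m)}
    total = 0.0
    for e, f in combinations(M, 2):
        consecutive_in_s = abs(pos_s[e[0]] - pos_s[f[0]]) == 1
        consecutive_in_t = abs(pos_t[e[1]] - pos_t[f[1]]) == 1
        if consecutive_in_s and consecutive_in_t:
            total += sqrt(W[e] * W[f])
    return total


def F(M, alpha):
    """Objective of FF-Adjacencies: alpha * adj(M) + (1 - alpha) * edg(M)."""
    return alpha * adj(M) + (1 - alpha) * edg(M)


def all_matchings():
    """Enumerate every matching by brute force (toy inputs only)."""
    edges = list(W)
    for k in range(len(edges) + 1):
        for M in combinations(edges, k):
            if len({u for u, _ in M}) == k and len({v for _, v in M}) == k:
                yield M


alpha = 0.5
matchings = list(all_matchings())
M_prime = max(matchings, key=edg)                   # maximum weighted matching M'
M_opt = max(matchings, key=lambda M: F(M, alpha))   # brute-force optimum

eps = 1e-12  # tolerance for floating-point comparisons
for M in matchings:
    assert (1 - alpha) * edg(M) <= F(M, alpha) + eps
    assert F(M, alpha) <= edg(M) + eps
    assert edg(M) <= edg(M_prime) + eps
assert F(M_opt, alpha) <= F(M_prime, alpha) / (1 - alpha) + eps
```

A real implementation would replace the enumeration with a polynomial-time maximum weighted bipartite matching algorithm. Brute force is used here only so that the optimum is available for comparison.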