Literature:
- Landau, G. M., Parida, L., & Weimann, O. (2005). Gene proximity analysis
  across whole genomes via PQ trees. Journal of Computational Biology, 12(10),
  1289–1306. http://doi.org/10.1089/cmb.2005.12.1289
- Bergeron, A., Chauve, C., de Montgolfier, F., & Raffinot, M. (2008).
  Computing common intervals of K permutations, with applications to modular
  decomposition of graphs. SIAM Journal on Discrete Mathematics, 22(3),
  1022–1039. http://doi.org/10.1137/060651331

------------------------------------------------------------------------------
1. Synteny Hierarchies

Synteny hierarchies can be represented by PQ-trees.

------------------------------------------------------------------------------
2. PQ-Trees

Def.: A PQ-tree on set V is a tree whose leaves are labeled from 1 to |V| and
whose internal nodes are labeled P-nodes or Q-nodes. A P-node must have at
least two children, and a Q-node must have at least three children. The
children of a P-node are unordered, and the children of a Q-node are totally
ordered (up to reversal).

Example: The PQ-tree of permutations P1 = (1..8) = ID and
P2 = (2 1 5 3 4 8 6 7) is shown below.

        _________Q_________
        |       |         |
      __P__   __P___    __P___
      |   |   |   _P_   |   _P_
      |   |   |   | |   |   | |
P2 = (2   1   5   3 4   8   6 7)

In the following, we will always assume that the identity permutation is part
of the studied collection of permutations.

If a binary matrix obeys the consecutive ones property (C1P), then it can be
represented by a PQ-tree. The PQ-tree represents the class of all admissible
permutations under which the matrix is C1P.

Example: C1P Matrix:

___________________|_1_|_2_|_3_|_4_|_5_|_6_|_7_|_8_|
O_1 = {3,4,5,6,7,8}|         1   1   1   1   1   1 |
O_2 = {1,2,3,4,5}  | 1   1   1   1   1             |
O_3 = {1,2}        | 1   1                         |
O_4 = {3,4,5}      |         1   1   1             |
O_5 = {3,4}        |         1   1                 |
O_6 = {6,7,8}      |                     1   1   1 |
O_7 = {6,7}        |                     1   1     |

We will now study an algorithm that constructs the PQ-tree of a collection of
k permutations of size n in optimal O(kn) time, if such a tree exists.
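As a rough illustration of the PQ-tree definition above, the following sketch
encodes the example tree and enumerates every leaf order it admits. The
encoding (an int for a leaf, a pair (kind, children) for an internal node) and
the names TREE and frontiers are assumptions made for this illustration only;
they do not come from the cited papers.

```python
from itertools import permutations, product

# Hypothetical encoding: a leaf is an int, an internal node is (kind, children)
# with kind "P" or "Q". This is the example tree of P1 = ID and P2.
TREE = ("Q", [("P", [1, 2]),
              ("P", [5, ("P", [3, 4])]),
              ("P", [8, ("P", [6, 7])])])


def frontiers(node):
    """Yield every leaf order (as a tuple) admitted by the PQ-tree."""
    if isinstance(node, int):
        yield (node,)
        return
    kind, children = node
    if kind == "P":
        # children of a P-node may appear in any order
        orders = permutations(children)
    else:
        # children of a Q-node keep their order, up to reversal
        orders = [tuple(children), tuple(reversed(children))]
    for order in orders:
        # pick one admissible leaf order per child and concatenate them
        for parts in product(*(list(frontiers(c)) for c in order)):
            yield tuple(leaf for part in parts for leaf in part)


orders = set(frontiers(TREE))
print(len(orders))
print((2, 1, 5, 3, 4, 8, 6, 7) in orders)
```

For this tree the Q-node contributes 2 orientations and each of the five
P-nodes has two children, so 2 * 2^5 = 64 leaf orders are admitted, among them
ID and P2.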
Observe that there is a direct relation between a PQ-tree and certain
collections of intervals of the studied permutations. A PQ-tree is a generator
for the following sets:

1. Trivial intervals: (1), ..., (n), and (1..n).
2. For every internal node, the (single) interval given by the union of all
   intervals generated by its children.
3. Intervals of a Q-node: the union of any consecutive subset of the intervals
   of its children.

These sets correspond to collections of intervals in the given set of
permutations that satisfy the following property:

Def. (Common Intervals): Intervals (l1, r1), ..., (lk, rk) of k permutations
P1, ..., Pk are /common/ if P1[l1,r1] = ... = Pk[lk,rk] as sets.

The common intervals of permutation P2 are:

P2 = (2 1 5 3 4 8 6 7)
     |_______________|   <- "root" interval
     |_|_|_|_|_|_|_|_|   <- singletons
         |___________|
     |_________|
         |_____|_____|
     |___| |___| |___|

The set C of common intervals is /closed/: if two intervals (i1..j1) and
(i2..j2) of C overlap, i.e. i1 < i2 <= j1 < j2, then their union (i1..j2),
their intersection (i2..j1), and the two differences (i1..i2-1) and
(j1+1..j2) are also in C.

   i1           j1
   |____________|
          i2            j2
          |_____________|
=> |____________________|
   |____| |_____| |_____|

A straightforward strategy would be to (i) identify all common intervals of
the permutations, then (ii) start with the most confined PQ-tree (a single
P-node whose children are the leaves 1..n), which can only generate the
trivial common intervals that are part of every set of permutations (i.e.,
{(1..n), (1), (2), ..., (n)}). Subsequently, (iii) refine the tree by
iterating through the set of intervals and adding internal vertices to the
PQ-tree. In each iteration, the tree is the most confined tree that can
generate the intervals observed so far.

Even if all common intervals could be found in optimal time and the tree could
be refined in constant time per interval, the algorithm would need quadratic
time in the worst case. The reason is the number of common intervals, which
can be quadratic in n: if all permutations equal the identity, all n(n+1)/2
intervals are common.

But: not all intervals are required!
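To make the definition of common intervals and the quadratic bound concrete,
here is a brute-force sketch. It is not the algorithm developed in these
notes; the function name common_intervals and the representation of an
interval as a value range (lo, hi) of the identity are assumptions made for
illustration.

```python
def common_intervals(perms):
    """Return all value ranges (lo, hi) that are contiguous in every permutation.

    Brute force in O(k * n^2): for each permutation and each start position,
    extend the window to the right while tracking its minimum and maximum.
    The window holds a range of consecutive values exactly when
    max - min == window length - 1. The identity is implicitly part of the
    collection, because only ranges of consecutive values are reported.
    """
    n = len(perms[0])
    result = None
    for p in perms:
        found = set()
        for left in range(n):
            lo = hi = p[left]
            for right in range(left, n):
                lo, hi = min(lo, p[right]), max(hi, p[right])
                if hi - lo == right - left:
                    found.add((lo, hi))
        result = found if result is None else result & found
    return result


P2 = (2, 1, 5, 3, 4, 8, 6, 7)
print(sorted(common_intervals([P2])))
```

For P2 this reports the 16 intervals drawn above (8 singletons plus 8 larger
intervals), while the identity on 8 elements alone yields all 8 * 9 / 2 = 36
intervals.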
The set of common intervals can be partitioned into "overlap classes":

Def. (commuting and overlapping intervals): Two intervals A and B /commute/ if
A \subseteq B or B \subseteq A or A and B are disjoint; otherwise, they
/overlap/.

Def. (overlap class): An /overlap class/ is an equivalence class formed by the
transitive closure of the overlap relation within a given set of intervals.

The overlap classes of the set of all common intervals of a permutation can be
organized as follows:

Def.: An overlap class holding only a single member is /trivial/, and
/non-trivial/ otherwise.

Def.: The intervals corresponding to trivial overlap classes are /strong/.

Lemma: The set of strong (common) intervals of k permutations is in bijection
with the vertices of their PQ-tree.

Obs.: The strong intervals commute pairwise.

The set of strong intervals of permutation P2 is (1..8), {all singletons},
(3,4,5), (6,7,8), (1,2), (3,4), and (6,7).

P2 = (2 1 5 3 4 8 6 7)
     |_______________|
     |_|_|_|_|_|_|_|_|
         |_____|_____|
     |___| |___| |___|

Obs.: A PQ-tree is an inclusion tree. Inclusion trees can be built in time
linear in their number of intervals (vertices).

Algorithm 1 (Construction of an inclusion tree)-------------------------------
Input: Set F of commuting intervals
Output: Inclusion tree of F
 1. Bucket-sort in decreasing order the intervals of F according to their
    right bound
 2. Bucket-sort (stably) in increasing order the intervals of F according to
    their left bound
 3. Let I1..Im be the list of sorted intervals
 4. J <- I1                  // I1 = V is the root
 5. k <- 2
 6. While k ≤ m
 7.   If Ik ⊂ J
 8.     Parent(Ik) <- J
 9.     J <- Ik
10.     k <- k+1
11.   Else
12.     J <- Parent(J)
------------------------------------------------------------------------------

Labeling the internal vertices of the PQ-tree can be done by the following
rule set:

1. If v has exactly two children, label it P.
2. Otherwise, test if the interval represented by the union of the first two
   of its children is a common interval: if so, label it Q, otherwise P.
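Algorithm 1 can be sketched in a few lines. The sketch below assumes that
intervals are pairs (lo, hi), that the input contains a root interval
enclosing all others, and that a comparison sort is acceptable; the name
inclusion_tree and the parent-dictionary output are illustrative choices.

```python
def inclusion_tree(intervals):
    """Build the inclusion tree of pairwise commuting intervals (lo, hi).

    Returns a dict mapping each interval to its parent (None for the root).
    Assumes the input contains one interval that encloses all others.
    """
    # Left bound ascending, ties broken by right bound descending, so every
    # interval appears after all intervals that contain it. sorted() costs
    # O(m log m); the two bucket sorts of Algorithm 1 would make this linear.
    order = sorted(set(intervals), key=lambda iv: (iv[0], -iv[1]))
    parent = {order[0]: None}
    current = order[0]
    k = 1
    while k < len(order):
        lo, hi = order[k]
        if current[0] <= lo and hi <= current[1]:
            parent[order[k]] = current      # order[k] lies inside current
            current = order[k]
            k += 1
        else:
            current = parent[current]       # climb until an enclosing interval is found
    return parent


# Strong intervals of P2 as listed in the text.
strong = [(i, i) for i in range(1, 9)]
strong += [(1, 8), (3, 5), (6, 8), (1, 2), (3, 4), (6, 7)]
tree = inclusion_tree(strong)
print(tree[(3, 4)], tree[(5, 5)], tree[(6, 8)])
```

Because the intervals commute, the climb in the else-branch only ever leaves
subtrees that are already complete, so every interval is attached exactly once.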
Thus, all that is left is to identify the strong intervals.

------------------------------------------------------------------------------
2. Generators of common intervals

Def.: A generator for the common intervals of a set of permutations P is a
pair (R, L) of vectors of size n such that:

1. R[i] ≥ i and L[j] ≤ j for all i, j ∈ {1,2,...,n},
2. (i..j) is a common interval of P if and only if
   (i..j) = (i..R[i]) ∩ (L[j]..j), or, equivalently, L[j] ≤ i ≤ j ≤ R[i].

There are many possible generators; here is one:

Def.: Let P = (p1,..,pn) be a permutation of size n. For each element pi, we
define two intervals containing pi:
- IMax[pi] is the largest *set* of elements ≥ pi that forms an interval
  around pi in P.
- IMin[pi] is the largest *set* of elements ≤ pi that forms an interval
  around pi in P.

And we define the following two integer vectors:
- Sup[pi] is the largest integer such that (pi..Sup[pi]) ⊆ IMax[pi];
- Inf[pi] is the smallest integer such that (Inf[pi]..pi) ⊆ IMin[pi].

The pair of vectors (Sup, Inf) is a generator for the common intervals of a
permutation P.

Example: IMax, IMin, Sup, and Inf of permutation P2 are

    IMax                        Sup   IMin                        Inf
1   p[1,8] = (1,2,3,4,5,6,7,8)  [8]   p[2,2] = (1)                [1]
2   p[1,1] = (2)                [2]   p[1,2] = (1,2)              [1]
3   p[3,8] = (3,4,5,6,7,8)      [8]   p[4,4] = (3)                [3]
4   p[5,8] = (4,6,7,8)          [4]   p[4,5] = (3,4)              [3]
5   p[3,3] = (5)                [5]   p[1,5] = (1,2,3,4,5)        [1]
6   p[6,8] = (6,7,8)            [8]   p[7,7] = (6)                [6]
7   p[8,8] = (7)                [7]   p[7,8] = (6,7)              [6]
8   p[6,6] = (8)                [8]   p[1,8] = (1,2,3,4,5,6,7,8)  [1]

Lemma: Let (R1, L1) and (R2, L2) be generators for the common intervals of two
sets A1 and A2 of permutations. The pair (min(R1, R2), max(L1, L2)) is a
generator for the common intervals of A1 ∪ A2.
Example: The generators (R1, L1) = (Sup, Inf) of P2 and (R2, L2) = (Sup, Inf)
of P3 = (1 3 2 4 5 7 6 8), and their combination, are

    R1   L1    R2   L2    R=min(R1, R2)  L=max(L1, L2)
1   [8]  [1]   [8]  [1]   [8]            [1]
2   [2]  [1]   [8]  [2]   [2]            [2]
3   [8]  [3]   [3]  [1]   [3]            [3]
4   [4]  [3]   [8]  [1]   [4]            [3]
5   [5]  [1]   [8]  [1]   [5]            [1]
6   [8]  [6]   [8]  [6]   [8]            [6]
7   [7]  [6]   [7]  [1]   [7]            [6]
8   [8]  [1]   [8]  [1]   [8]            [1]

Proof. Interval (i..j) is a common interval of A1 ∪ A2 if and only if it is a
common interval of both A1 and A2, which is equivalent to
L1[j] ≤ i ≤ j ≤ R1[i] and L2[j] ≤ i ≤ j ≤ R2[i], and finally to
max(L1[j], L2[j]) ≤ i ≤ j ≤ min(R1[i], R2[i]).

Given IMin, Inf can be computed in linear time with this simple algorithm:

Algorithm 2 (Construction of Inf from IMin)-----------------------------------
1. Inf[k] <- k for k=1..n
2. For k from 2 to n
3.   While Inf[k] − 1 is in IMin[k]
4.     Inf[k] <- Inf[Inf[k] − 1]
------------------------------------------------------------------------------
(A similar algorithm can be designed for the computation of Sup from IMax.)

We will now aim to use the set of common intervals defined by R and L to find
the strong intervals of the given set of permutations. But there are still two
problems:

1. Not all intervals of R and L are necessarily common intervals.
2. Recall that the set of strong intervals commutes. The set of intervals
   defined by R commutes, and the same holds for L. However, their union
   does not!

------------------------------------------------------------------------------
2.1 Fixing Problem 1

Example: Intervals of R and L of permutations P2 and P3:

Intervals of R:         Intervals of L:
   1 2 3 4 5 6 7 8         1 2 3 4 5 6 7 8
1  ---------------      1  -
2    -                  2    -
3      -                3      -
4        -              4      ---   <-- interval is not common!
5          -            5  ---------
6            -----      6            -
7              -        7            ---
8                -      8  ---------------

Generators are called /canonical/ if each interval of R or L is a common
interval:

Def.: A generator (R, L) for a closed family of common intervals F is
/canonical/ if, for all i=1..n, intervals (i..R[i]) and (L[i]..i) belong to F.
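The generator (Sup, Inf), the combination lemma and the notion of a canonical
generator can be checked on P2 and P3 with a short sketch. It computes IMax
and IMin directly from their definitions and is therefore quadratic rather
than linear; the names generator, combine, is_common and is_canonical, and the
1-based list layout with an unused index 0, are assumptions for illustration.

```python
def generator(p):
    """Return (Sup, Inf) of permutation p as 1-based lists (index 0 unused)."""
    n = len(p)
    pos = {v: i for i, v in enumerate(p)}       # 0-based position of each value

    def window(v, keep):
        # largest block of positions around v whose values all satisfy keep
        left = right = pos[v]
        while left > 0 and keep(p[left - 1]):
            left -= 1
        while right < n - 1 and keep(p[right + 1]):
            right += 1
        return set(p[left:right + 1])

    sup, inf = [0] * (n + 1), [0] * (n + 1)
    for v in range(1, n + 1):
        imax = window(v, lambda x: x >= v)      # IMax[v]
        imin = window(v, lambda x: x <= v)      # IMin[v]
        sup[v] = v
        while sup[v] + 1 in imax:
            sup[v] += 1
        inf[v] = v
        while inf[v] - 1 in imin:
            inf[v] -= 1
    return sup, inf


def combine(g1, g2):
    """Generator of the union of two sets of permutations (min of R, max of L)."""
    (r1, l1), (r2, l2) = g1, g2
    return ([min(a, b) for a, b in zip(r1, r2)],
            [max(a, b) for a, b in zip(l1, l2)])


def is_common(gen, i, j):
    r, l = gen
    return l[j] <= i <= j <= r[i]


def is_canonical(gen):
    r, l = gen
    n = len(r) - 1
    return all(is_common(gen, i, r[i]) and is_common(gen, l[i], i)
               for i in range(1, n + 1))


P2 = (2, 1, 5, 3, 4, 8, 6, 7)
P3 = (1, 3, 2, 4, 5, 7, 6, 8)
g = combine(generator(P2), generator(P3))
print(g[0][1:], g[1][1:])
print(is_common(g, 3, 4), is_canonical(g))
```

Under these assumptions the combined generator is R = (8,2,3,4,5,8,7,8) and
L = (1,2,3,3,1,6,6,1); the interval (L[4]..4) = (3..4) is not common, so this
generator is not canonical.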
We can always construct a canonical version of any generator by processing R
and L independently. To find a canonical variant of R, we apply the following
strategy:

1. Iterate through each interval I_i = (i..R[i]), 2 <= i <= n, in
   *decreasing* order of i.
2. If I_i is not common:
3.   Truncate the right bound of I_i to the right bound of the largest
     subinterval I_j, j > i, if such an interval exists; otherwise set
     R[i] = i.

(The same can be done for L.)

But there is a faster way, which uses the /support/ of R and L, respectively.

Def.: The /support/ of vector R is a vector Support_R that refers at each
position Support_R[i] to the index i' of the smallest interval (i'..R[i'])
that is a proper super interval of (i..R[i]), and is undefined if no such
interval exists.

Example:

Intervals of R:         Support_R
   1 2 3 4 5 6 7 8
1  ---------------      /
2    -                  1
3      -                1
4        -              1
5          -            1
6            -----      1
7              -        6
8                -      6

We make the following observations:
- The support is undefined only for the first interval of R, which is always
  (1..n).
- The support of interval (i..R[i]) must be the interval corresponding to the
  *highest index* i' < i s.t. (i..R[i]) \subset (i'..R[i']).

Support_R can be computed in O(n) (no proof).

Using the support, we can compute the canonical vector of R in linear time,
using the following algorithm:

Algorithm 3 (Construction of canonical variant of vector R)-------------------
Input: Support_R, R, L
Output: canonical vector R'
1. R' <- [1..n]
2. R'[1] <- n
3. For k from n to 2
     // test if (Support_R[k]..R'[k]) is a common interval
4.   If L[R'[k]] <= Support_R[k] and R'[k] <= R[Support_R[k]]
5.     R'[Support_R[k]] <- max(R'[k], R'[Support_R[k]])
------------------------------------------------------------------------------
(A similar algorithm can be designed for computing the canonical variant of L
from Support_L.)

The correctness of the algorithm can be derived from the facts that
1. (k..R'[k]) is the largest common interval with left bound k by the time
   iteration k is processed, and
2. once the largest common interval for R'[k'] is found, it will never be
   truncated (the max(.) function in line 5 ensures that).

Example: Intervals of canonical R and L of permutations P2 and P3:

Intervals of canonical R:   Intervals of canonical L:
   1 2 3 4 5 6 7 8             1 2 3 4 5 6 7 8
1  ---------------          1  -
2    -                      2    -
3      -                    3      -
4        -                  4        -
5          -                5  ---------
6            -----          6            -
7              -            7            ---
8                -          8  ---------------

------------------------------------------------------------------------------
2.2 Fixing Problem 2

Lemma: The interval of a trivial overlap class of the interval set
{(i..R[i]) | i=1..n} \cup {(L[i]..i) | i=1..n} is a strong interval of the
given set of permutations.

Example: Intervals of canonical R and L of permutation P2:

Intervals of canonical R:   Intervals of canonical L:
   1 2 3 4 5 6 7 8             1 2 3 4 5 6 7 8
1  ---------------          1  -
2    -                      2  ---
3      -----------          3      -
4        -                  4      ---
5          -                5  ---------
6            -----          6            -
7              -            7            ---
8                -          8  ---------------

We make the following observation:

Observation: Let (R,L) be the canonical generator of a closed family of
intervals. We have the following:
(1) if (i..R[i]) overlaps (L[j]..j) and (L[j']..j'), then L[j] = L[j']; and
(2) if (L[j]..j) overlaps (i..R[i]) and (i'..R[i']), then R[i] = R[i'].

The following lemma shows that the overlaps generate strong intervals of the
given permutations:

Lemma: Let (R,L) be the canonical generator of a closed family of intervals F,
and let C be a non-trivial overlap class containing (i1..R[C]), ...,
(ik..R[C]) and (L[C]..j1), ..., (L[C]..jl), with i1 < ··· < ik and
j1 < ··· < jl, where R[C] and L[C] denote the right and left bound shared by
these intervals. Then k = l, and for all a ∈ (1..k), (ia..ja) is a strong
interval of F.

Theorem: The family of strong intervals of a closed family of intervals is
equal to the union of (a) the intervals of the trivial overlap classes of R
and L and (b) the strong intervals constructed from the non-trivial overlap
classes.
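The lemma and theorem above can be tried out on the canonical generator of P2
with a brute-force sketch. It groups the 2n intervals into overlap classes
with a small union-find and pairs the left and right bounds of each
non-trivial class. The quadratic pairwise test, the function name
strong_from_overlap_classes and the 1-based list layout are assumptions made
for illustration, not the linear-time method of the notes.

```python
def strong_from_overlap_classes(R, L):
    """Strong intervals from a canonical generator (1-based lists, index 0 unused)."""
    n = len(R) - 1
    right_family = {(i, R[i]) for i in range(1, n + 1)}   # intervals (i..R[i])
    left_family = {(L[j], j) for j in range(1, n + 1)}    # intervals (L[j]..j)
    items = sorted(right_family | left_family)

    def overlap(a, b):
        (a1, a2), (b1, b2) = sorted((a, b))
        return a1 < b1 <= a2 < b2

    rep = {iv: iv for iv in items}                         # union-find parents

    def find(x):
        while rep[x] != x:
            rep[x] = rep[rep[x]]
            x = rep[x]
        return x

    for a in items:
        for b in items:
            if a < b and overlap(a, b):
                rep[find(a)] = find(b)

    classes = {}
    for iv in items:
        classes.setdefault(find(iv), []).append(iv)

    strong = set()
    for members in classes.values():
        if len(members) == 1:
            strong.add(members[0])                         # trivial class
        else:
            starts = sorted(lo for lo, hi in members if (lo, hi) in right_family)
            ends = sorted(hi for lo, hi in members if (lo, hi) in left_family)
            strong.update(zip(starts, ends))               # intervals (ia..ja)
    return strong


R = [0, 8, 2, 8, 4, 5, 8, 7, 8]   # canonical R of P2
L = [0, 1, 1, 3, 3, 1, 6, 6, 1]   # canonical L of P2
print(sorted(strong_from_overlap_classes(R, L)))
```

For P2 the only non-trivial class is {(1..5), (3..8)}, which contributes the
strong interval (3..5); all other intervals of R and L form trivial classes.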
------------------------------------------------------------------------------
2.3 Algorithm for enumerating all strong intervals

The above theorem motivates the following simple strategy for enumerating all
strong intervals of a closed family of intervals.

1. Sort the 4n bounds of the intervals of the families (i..R[i]) and
   (L[j]..j) for i, j ∈ (1..n) in increasing order, with the left bounds
   placed before the right bounds when they are equal.
2. Apply Algorithm 4.

Example: The 4n bounds of intervals for permutation P2, where "x(" denotes a
left bound and "x)" a right bound at position x:

1(,1(,1(,1(,1(,1),2(,2),2),3(,3(,3(,3),4(,4),4),5(,5),5),6(,6(,6(,6),7(,7),7),
8(,8),8),8),8),8)

Algorithm 4 (Computation of the strong intervals)-----------------------------
Input: Sorted list a1, ..., a4n of the 4n bounds of intervals of the families
       (i..R[i]) and (L[j]..j)
Output: Set of strong intervals
(S is a stack of bounds; s denotes the top of S.)
1. For i from 1 to 4n:
2.   If ai is a left bound
3.     Push ai on S
4.   Else
5.     Output (s..ai)   // Interval (s..ai) is strong
6.     Pop the top of S
------------------------------------------------------------------------------
(The same interval may be output more than once; the output is a set.)
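Algorithm 4 translates almost directly into code. The sketch below assumes a
canonical generator given as 1-based lists (index 0 unused) and uses a
comparison sort for brevity; the function name strong_intervals and the
encoding of a bound as a pair (position, kind) are illustrative assumptions.

```python
def strong_intervals(R, L):
    """Enumerate strong intervals from a canonical generator (R, L)."""
    n = len(R) - 1
    bounds = []
    for i in range(1, n + 1):
        bounds.append((i, 0))        # left bound of (i..R[i])
        bounds.append((R[i], 1))     # right bound of (i..R[i])
        bounds.append((L[i], 0))     # left bound of (L[i]..i)
        bounds.append((i, 1))        # right bound of (L[i]..i)
    # Increasing position; at equal positions left bounds (0) precede right
    # bounds (1). A counting sort would keep this step linear.
    bounds.sort()

    stack, strong = [], set()
    for position, is_right in bounds:
        if not is_right:
            stack.append(position)
        else:
            # the interval from the top of the stack to this right bound is strong
            strong.add((stack.pop(), position))
    return strong


R = [0, 8, 2, 8, 4, 5, 8, 7, 8]   # canonical R of P2
L = [0, 1, 1, 3, 3, 1, 6, 6, 1]   # canonical L of P2
print(sorted(strong_intervals(R, L)))
```

On P2 the right bound of (1..5) is matched with the left bound of (3..8),
which is how the stack produces the strong interval (3..5) from the one
non-trivial overlap class; (1..8) is emitted several times and collapses in
the set.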