Literature: - Bergeron, A., Chauve, C., de Montgolfier, F., & Raffinot, M. (2008). Computing common intervals of K permutations, with applications to modular decomposition of graphs. SIAM Journal on Discrete Mathematics, 22(3), 1022–1039. http://doi.org/10.1137/060651331 - Didier, G., Schmidt, T., Stoye, J., & Tsur, D. (2006). Character sets of strings. Journal of Discrete Algorithms, 5(2), 330–340. http://doi.org/10.1016/j.jda.2006.03.021 ------------------------------------------------------------------------------ 1. Gene Clusters Gene clusters are sets of genes that have an associated function, such as being part of the same metabolic pathway. Gene clusters are common in prokaryotes and fungi, but can also be found in higher eukaryotes. For many gene clusters holds true that their members are neighboring each other in the genome. This structural organization is positively enforced by natural selection of the genes and their host, as it can simplify transcription (polycistronic transcription -- operons) or be advantageous for horizontal gene transfer. The function of the large majority of genes in sequenced organisms remains unknown until today. Yet, without the knowledge of the genes' function, it seems hopeless to identify gene clusters. However, comparative genomics provides means to identify gene cluster candidates, by identifying those sets of genes that are closely located to each other in multiple genomes of related species. ------------------------------------------------------------------------------ 2. Common Intervals of Permutations The O(N) common intervals of of k permutations of size n can be enumerated in optimal time O(kn+N) by computing generators (Sup, Inf) of each permutation independently and then constructing their union (R, L) as shown in the previous lecture. The enumeration relies on the Support_R vector, which has also been previously mentioned, but without giving an explicit algorithm for its construction. Recall that Support_R is the vector that indices for each interval (i..R[i]) the smallest interval (i'..R[i']), in which (i..R[i]) is contained. Clearly, i' must be strictly smaller than i. Thus, by maintaining a stack of previous intervals, Support_R can be constructed in linear time. By definition, all intervals (2..R[2]),..(n..R[n]) must be a subinterval of (1..R[n]) = (1..n). The algorithm thus iterates through all intervals from (2..R[2]) to (n..R[n]), popping intervals off the stack that are not super-intervals of the current interval as long as a proper super-interval is found. The popped intervals cannot be super-intervals of subsequent intervals, if all intervals of R commute (i.e., one is included in the other or they are disjoint). This holds true for Sup (and consequently for the union of Sup vectors), as noted in the previous lecture. At last, note that it is sufficient to test only whether the left bound i of the current interval (i..R[i]) is smaller than the right bound R[s] of any previously observed interval s (which will be an interval of the stack) if one wants to know if (i..R[i]) is included in (s..R[s]). That is because (a) i > s holds true by definition and (b) all intervals of R commute by supposition. Algorithm 1 shows the construction of Support_R in pseudo-code. Algorithm 1 (Construction of Support_R from a commuting generator (R, L) )--- Input: Generator (R, L) Output: Support_R 1. Initialize empty stack S // in the following, s denotes the top of S 2. Push 1 on S 3. For i from 2 to n 4. While R[s] < i 5. Pop the top of S 6. Support_R[i] <- s 7. Push i on S ------------------------------------------------------------------------------ The vector L and Support_R are sufficient for enumerating all common intervals, as demonstrated by Algorithm 2 below. The algorithm simply iterates through all possible right bounds j of a common interval in decreasing order. Recall that, if (i..j) is a common interval, it must hold that L[j] <= i <= j <= R[i]. Let (s..R[s]) = (Support_R[i], R[Support_R[i]]) be the smallest interval that contains (i..R[i]). Recall that each interval of R is maximal, it is the largest (common -- if R is canonical) interval around element i that contains only elements larger i, with R[i] being the largest one. We have: (i..R[i]) \subset (s..R[s]) and thus R[i] <= R[s]. Observe that due to maximality, R[i] = R[s]. The idea of the algorithm is to fix j=R[i], and continuously decrease, thereby allowing i to take on values corresponding to the left bound of the supporting interval, i.e. s = Support_R[i]. Because by definition (s..R[i]) is the smallest super-interval of (i..R[i]), it is also the smallest possible common interval that is larger than (i..R[i]) for fixed right bound R[i]. It then suffices to execute the test for common intervals for the left bound, i.e L[j] <= i. As long as this condition is true, the algorithm will report common interval (i..R[i]) and decrease the left bound i to that of its supporting interval. Algorithm 2 shows the full procedure. Algorithm 2 (Enumerator for Common Intervals)--------------------------------- Input: Generator (R, L), Support_R Output: All common intervals of (R, L) 1. For j from n to 1 2. i <- j 3. While i ≥ L[j] 4. Output (i..j) // interval (i..j) is a common interval 5. i <- Support_R[i] ------------------------------------------------------------------------------ The running time of the algorithm is O(kn+N), where N is the number of common intervals of (R, L). The time complexity is apparent: In each step the algorithm either finds a common interval, or continues to the next right bound j <- j-1. ------------------------------------------------------------------------------ 3. Common Intervals of Strings Def. The character set C(S) of a string is S is the (unordered) set of characters that constitute S. Example: S = biadkfblhdba, C(S) = {a,b,d,f,h,i,k,l} An interval of a string is represented by [i, j] := [i, i+1, .., j] and corresponds to the indices of its bounds. Example: Interval [5,8] of S = biadkfblhdba is associated with substring S[5,8] = kfbl. Of particular interest to us are intervals that are /maximal/, which means that the interval cannot be extended to its left or right without changing its character set. Maximality can be sub-divided into /left-/ and /right-maximality/. Def. An interval [i, j] in string S is /left-maximal/ iff i == 1 or C(S[i,j]) != C(S[i-1, j]), it is /right-maximal/ iff j == |S| or C(S[i,j]) != C(S[i,j+1]). It is /maximal/, if its both left- and right-maximal. Example: Interval [1,6] of S = biadkfblhdba is left-maximal, but not right-maximal. Interval [1,7] is maximal (and so are many other intervals of S). Def. A pair of intervals [i, j] and [i', j'] of two strings S and T, respectively, is /common/, iff their character sets are identical, i.e., C(S[i,j]) = C(T[i', j']). Example: S = biadkfblhdba, T = iecdblfhkbhea, intervals [4,11] of S and [4,11] of T are maximal common intervals This leads to the following problem statement: Problem: Given two strings S and T, find all their maximal common intervals. ------------------------------------------------------------------------------ 3.1 Didier's Algorithm The algorithm implements the following strategy to identify maximal common intervals of two strings S and T: 1. Iterate through each position i of S, thereby fixing i as left bound of all intervals of S that are examined in the current iteration. 2. Identify positions in T that correspond to the character S[i]. Mark these positions as (trivial) intervals. 3. Successively move the right bound j of S from i to |S|. In each such move, do as follows: a. To ensure right-maximality, move j to the rightmost position such that character set C(S[i,j]) has exactly one more character than the character set C(S[i,j']) corresponding to the substring of its preceding position j' < j. For further reference, let c be this new character. b. If S[i-1] == c, i.e., S[i,j] is not left-maximal, go to Step 1 and continue with next position i'=i+1. c. Extend all intervals marked in T to the right and left if c is neighboring their current bounds. d. Let T[k,l] be such a marked interval after its extension. Report the interval pair [i,j], [k, l] iff C(S[i,j]) = C(T[k,l]). Note that marked intervals in T are by definition left- and right-maximal. The new right bound j for step 3.a can be identified in constant time using a table that tracks the leftmost occurrences of characters in string S[i,|S|]: Def. The /rank/ of a character c in a string S is the number of different characters that occur up to and including the leftmost occurrence of c in S. The ranks of all characters are stored in a /rank table/ called Rank. Pos is a table that maps each rank to the leftmost position of the corresponding character in S. If a character does not occur, then its rank is infinite and its position undefined. Example: S = biadkfblhdba, the rank table of S is: Rank = character | Rank | Pos a | 3 | 3 b | 1 | 1 c | ∞ | / d | 4 | 4 e | ∞ | / f | 6 | 6 g | ∞ | / h | 8 | 9 i | 2 | 2 j | ∞ | / k | 5 | 5 l | 7 | 8 We can now use Rank and Pos in Step 3.a to determine the next right bound j of S. Let j' be the bound before the move, then S[j'+1] is the new character c. Then j = Pos[Rank[c]+1]-1 is the next right-maximal bound in S. We now construct a data structure very similar to the set of intervals associated to vector R of a generator of common intervals in permutations. Def. Given rank table Rank, the /rank interval/ of a position p in T is the largest interval [k, l], k <= p <= k, such that max ({Rank[c] | c in C(T[k, l]) }) = Rank[T[p]]. Didier's Algorithm pre-computes all rank Intervals and stores them in a table called Int. Rank intervals enable constant time extension of each marked interval in Step 3.a of the strategy. One last challenge remains unsolved: How can the equality of character sets C(T[k,l]) and C(S[i,j]) in Step 3.d be tested efficiently? For this, Didier uses another concept, called the /rank-nearest successor/, which is determined by computing /rank distances/ between positions of string T: Def. Given a rank table Rank, the rank distance d(p, p') of two positions p, p' of string T is d(p, p') = max({Rank[c] | c \in C(T[p, p'])}) Def. Given a rank table Rank and a position p in string T. Let P' be the set of all positions p' of T such that Rank[T[p']] = Rank[T[p]]+1. If P' is empty, no rank-nearest successor of p exists. But if P' is not empty, let p'' \in P' be the position with minimal rank distance d(p, p''). Ties are broken by choosing the leftmost among all co-optimal positions. The path of rank-nearest successors spans a forest in T, which Didier's Algorithm follows from the leaves to the roots. Following this path is ingenious, because it is clear that all characters the corresponding interval [i, j] in S will be visited one by one. . Assume we followed a path of rank-nearest successors and arrived at position p_r. To test, whether the rank interval [p, p'] of position p_r is a common interval, it suffices to check whether all positions previously encountered on this path are within the bounds of [p, p']. More formally, let p_r be the r-th position of a path that follows rank-nearest successors from a position p_1 of string T with Rank[T[p_1]] = 1 to its rank-nearest successor p_2 with Rank[T[p_2]]= 2, etc, to position p_r with Rank[T[p_r]] = r. Further, let p_k and p_l the outermost positions observed on this path, i.e., p_l = min({p_1,...p_r}) and p_k = max({p_1,..p_r}). The rank interval [p, p'] of position p_r has the same character set as C(S[i,j]), where j = Pos[r+1]-1], iff [p_k, p_l] \subseteq [p, p']. Didier precomputes rank-nearest successors and stores them in a table called Succ. The full algorithm is described in pseudo-code below. Algorithm 3 (Didier's Algorithm)---------------------------------------------- Input: Two strings S and T Output: List of all maximal common intervals between S and T 1. For i=1,...,|S| 2. Compute tables Rank w.r.t. S[i,|S|] 3. Compute table of intervals Int for T 4. Compute table Succ for T 5. Initialize List with positions in T of rank 1 and their corresponding rank intervals 6. While List is not empty 7. (y, [p_k,p_l]) <- the first element of List verifying [p_k,p_l] \subseteq Int[y] 8. r_y <- Rank[T[y]] 9. j <- Pos[r_y+1]-1 // test if there are locations to output 10. If i = 1 or Rank[S[i-1]] > r_y 11. Output ([i,j], Int[y]) 12. previous <- Int[y] 13. For each following element (y, [p_k,p_l]) in List 14. If [p_k,p_l] \subseteq Int[y] and Int[y] != previous 15. Output ([i,j], Int[y]) 16. previous <- Int[y] // compute the next level 17. For each element (y,[p_k,p_l]) in List 18. Remove (y, [p_k,p_l]) from List 19. If Succ[y] exists and y is the leftmost index with minimal rank distance to Succ[y] among positions of the list having the same successor 20. y <- Succ[y] // update path bounds 21. If y < p_k, then p_k <- y 22. If y > p_l, then p_l <- y 23. Add (y, [p_k,p_l]) to List ------------------------------------------------------------------------------ Runtime. Each position i in S is used once as left bound, hence the outer for loop contributes O(|S|) time. For each fixed i, table Rank can be computed in O(|S|) time, whereas tables Int and Succ can be computed in O(|T|) time (no proof). The number of positions visited when following all paths of rank-nearest successors in T cannot exceed T. In each step, tracking and extending the path's bounds and testing whether the path's bounds are within the current rank interval can be performed in constant time. Thus, the algorithm overall takes O(n^2) time, where n=max(|S|, |T|).