Literature:
- Bergeron, A., Chauve, C., de Montgolfier, F., & Raffinot, M. (2008).
Computing common intervals of K permutations, with applications to modular
decomposition of graphs. SIAM Journal on Discrete Mathematics, 22(3),
1022–1039. http://doi.org/10.1137/060651331
- Didier, G., Schmidt, T., Stoye, J., & Tsur, D. (2006). Character sets of
strings. Journal of Discrete Algorithms, 5(2), 330–340.
http://doi.org/10.1016/j.jda.2006.03.021
------------------------------------------------------------------------------
1. Gene Clusters
Gene clusters are sets of genes that have an associated function, such as
being part of the same metabolic pathway. Gene clusters are common in
prokaryotes and fungi, but can also be found in higher eukaryotes.
For many gene clusters holds true that their members are neighboring each other
in the genome. This structural organization is positively enforced by natural
selection of the genes and their host, as it can simplify transcription
(polycistronic transcription -- operons) or be advantageous for horizontal gene
transfer.
The function of the large majority of genes in sequenced organisms remains
unknown until today. Yet, without the knowledge of the genes' function, it
seems hopeless to identify gene clusters. However, comparative genomics
provides means to identify gene cluster candidates, by identifying those sets
of genes that are closely located to each other in multiple genomes of related
species.
------------------------------------------------------------------------------
2. Common Intervals of Permutations
The O(N) common intervals of of k permutations of size n can be enumerated in
optimal time O(kn+N) by computing generators (Sup, Inf) of each permutation
independently and then constructing their union (R, L) as shown in the
previous lecture.
The enumeration relies on the Support_R vector, which has also been previously
mentioned, but without giving an explicit algorithm for its construction.
Recall that Support_R is the vector that indices for each interval (i..R[i])
the smallest interval (i'..R[i']), in which (i..R[i]) is contained. Clearly,
i' must be strictly smaller than i. Thus, by maintaining a stack of previous
intervals, Support_R can be constructed in linear time. By definition, all
intervals (2..R[2]),..(n..R[n]) must be a subinterval of (1..R[n]) = (1..n).
The algorithm thus iterates through all intervals from (2..R[2]) to (n..R[n]),
popping intervals off the stack that are not super-intervals of the current
interval as long as a proper super-interval is found.
The popped intervals cannot be super-intervals of subsequent intervals, if all
intervals of R commute (i.e., one is included in the other or they are
disjoint). This holds true for Sup (and consequently for the union of Sup
vectors), as noted in the previous lecture.
At last, note that it is sufficient to test only whether the left bound i of
the current interval (i..R[i]) is smaller than the right bound R[s] of any
previously observed interval s (which will be an interval of the stack) if one
wants to know if (i..R[i]) is included in (s..R[s]). That is because (a) i > s
holds true by definition and (b) all intervals of R commute by supposition.
Algorithm 1 shows the construction of Support_R in pseudo-code.
Algorithm 1 (Construction of Support_R from a commuting generator (R, L) )---
Input: Generator (R, L)
Output: Support_R
1. Initialize empty stack S // in the following, s denotes the top of S
2. Push 1 on S
3. For i from 2 to n
4. While R[s] < i
5. Pop the top of S
6. Support_R[i] <- s
7. Push i on S
------------------------------------------------------------------------------
The vector L and Support_R are sufficient for enumerating all common
intervals, as demonstrated by Algorithm 2 below. The algorithm simply iterates
through all possible right bounds j of a common interval in decreasing order.
Recall that, if (i..j) is a common interval, it must hold that L[j] <= i <= j
<= R[i].
Let (s..R[s]) = (Support_R[i], R[Support_R[i]]) be the smallest interval that
contains (i..R[i]). Recall that each interval of R is maximal, it is the
largest (common -- if R is canonical) interval around element i that contains
only elements larger i, with R[i] being the largest one. We have: (i..R[i])
\subset (s..R[s]) and thus R[i] <= R[s]. Observe that due to maximality, R[i]
= R[s]. The idea of the algorithm is to fix j=R[i], and continuously decrease,
thereby allowing i to take on values corresponding to the left bound of the
supporting interval, i.e. s = Support_R[i]. Because by definition (s..R[i]) is
the smallest super-interval of (i..R[i]), it is also the smallest possible
common interval that is larger than (i..R[i]) for fixed right bound R[i]. It
then suffices to execute the test for common intervals for the left bound, i.e
L[j] <= i. As long as this condition is true, the algorithm will report common
interval (i..R[i]) and decrease the left bound i to that of its supporting
interval. Algorithm 2 shows the full procedure.
Algorithm 2 (Enumerator for Common Intervals)---------------------------------
Input: Generator (R, L), Support_R
Output: All common intervals of (R, L)
1. For j from n to 1
2. i <- j
3. While i ≥ L[j]
4. Output (i..j) // interval (i..j) is a common interval
5. i <- Support_R[i]
------------------------------------------------------------------------------
The running time of the algorithm is O(kn+N), where N is the number of common
intervals of (R, L). The time complexity is apparent: In each step the
algorithm either finds a common interval, or continues to the next right bound
j <- j-1.
------------------------------------------------------------------------------
3. Common Intervals of Strings
Def. The character set C(S) of a string is S is the (unordered) set of
characters that constitute S.
Example: S = biadkfblhdba, C(S) = {a,b,d,f,h,i,k,l}
An interval of a string is represented by [i, j] := [i, i+1, .., j] and
corresponds to the indices of its bounds.
Example: Interval [5,8] of S = biadkfblhdba is associated with substring
S[5,8] = kfbl.
Of particular interest to us are intervals that are /maximal/, which means
that the interval cannot be extended to its left or right without changing its
character set. Maximality can be sub-divided into /left-/ and
/right-maximality/.
Def. An interval [i, j] in string S is /left-maximal/ iff i == 1 or C(S[i,j])
!= C(S[i-1, j]), it is /right-maximal/ iff j == |S| or C(S[i,j]) !=
C(S[i,j+1]). It is /maximal/, if its both left- and right-maximal.
Example: Interval [1,6] of S = biadkfblhdba is left-maximal, but not
right-maximal. Interval [1,7] is maximal (and so are many other intervals of
S).
Def. A pair of intervals [i, j] and [i', j'] of two strings S and T,
respectively, is /common/, iff their character sets are identical, i.e.,
C(S[i,j]) = C(T[i', j']).
Example: S = biadkfblhdba, T = iecdblfhkbhea, intervals [4,11] of S and [4,11]
of T are maximal common intervals
This leads to the following problem statement:
Problem: Given two strings S and T, find all their maximal common intervals.
------------------------------------------------------------------------------
3.1 Didier's Algorithm
The algorithm implements the following strategy to identify maximal common
intervals of two strings S and T:
1. Iterate through each position i of S, thereby fixing i as left bound of all
intervals of S that are examined in the current iteration.
2. Identify positions in T that correspond to the character S[i]. Mark these
positions as (trivial) intervals.
3. Successively move the right bound j of S from i to |S|. In each such move,
do as follows:
a. To ensure right-maximality, move j to the rightmost position such that
character set C(S[i,j]) has exactly one more character than the
character set C(S[i,j']) corresponding to the substring of its preceding
position j' < j. For further reference, let c be this new character.
b. If S[i-1] == c, i.e., S[i,j] is not left-maximal, go to Step 1 and
continue with next position i'=i+1.
c. Extend all intervals marked in T to the right and left if c is
neighboring their current bounds.
d. Let T[k,l] be such a marked interval after its extension. Report the
interval pair [i,j], [k, l] iff C(S[i,j]) = C(T[k,l]).
Note that marked intervals in T are by definition left- and right-maximal.
The new right bound j for step 3.a can be identified in constant time using a
table that tracks the leftmost occurrences of characters in string S[i,|S|]:
Def. The /rank/ of a character c in a string S is the number of different
characters that occur up to and including the leftmost occurrence of c in S.
The ranks of all characters are stored in a /rank table/ called Rank. Pos is a
table that maps each rank to the leftmost position of the corresponding
character in S. If a character does not occur, then its rank is infinite and
its position undefined.
Example:
S = biadkfblhdba, the rank table of S is:
Rank = character | Rank | Pos
a | 3 | 3
b | 1 | 1
c | ∞ | /
d | 4 | 4
e | ∞ | /
f | 6 | 6
g | ∞ | /
h | 8 | 9
i | 2 | 2
j | ∞ | /
k | 5 | 5
l | 7 | 8
We can now use Rank and Pos in Step 3.a to determine the next right bound j of
S. Let j' be the bound before the move, then S[j'+1] is the new character c.
Then j = Pos[Rank[c]+1]-1 is the next right-maximal bound in S.
We now construct a data structure very similar to the set of intervals
associated to vector R of a generator of common intervals in permutations.
Def. Given rank table Rank, the /rank interval/ of a position p in T
is the largest interval [k, l], k <= p <= k, such that max ({Rank[c] | c in
C(T[k, l]) }) = Rank[T[p]].
Didier's Algorithm pre-computes all rank Intervals and stores them in a table
called Int. Rank intervals enable constant time extension of each marked
interval in Step 3.a of the strategy.
One last challenge remains unsolved: How can the equality of character sets
C(T[k,l]) and C(S[i,j]) in Step 3.d be tested efficiently? For this, Didier
uses another concept, called the /rank-nearest successor/, which is determined
by computing /rank distances/ between positions of string T:
Def. Given a rank table Rank, the rank distance d(p, p') of two positions p,
p' of string T is
d(p, p') = max({Rank[c] | c \in C(T[p, p'])})
Def. Given a rank table Rank and a position p in string T. Let P' be the set
of all positions p' of T such that Rank[T[p']] = Rank[T[p]]+1. If P' is empty,
no rank-nearest successor of p exists. But if P' is not empty, let p'' \in P'
be the position with minimal rank distance d(p, p''). Ties are broken by
choosing the leftmost among all co-optimal positions.
The path of rank-nearest successors spans a forest in T, which Didier's
Algorithm follows from the leaves to the roots. Following this path is
ingenious, because it is clear that all characters the corresponding interval
[i, j] in S will be visited one by one. . Assume we followed a path of
rank-nearest successors and arrived at position p_r. To test, whether the rank
interval [p, p'] of position p_r is a common interval, it suffices to check
whether all positions previously encountered on this path are within the
bounds of [p, p'].
More formally, let p_r be the r-th position of a path that follows
rank-nearest successors from a position p_1 of string T with Rank[T[p_1]] = 1
to its rank-nearest successor p_2 with Rank[T[p_2]]= 2, etc, to position p_r
with Rank[T[p_r]] = r. Further, let p_k and p_l the outermost positions
observed on this path, i.e., p_l = min({p_1,...p_r}) and p_k =
max({p_1,..p_r}). The rank interval [p, p'] of position p_r has the same
character set as C(S[i,j]), where j = Pos[r+1]-1], iff [p_k, p_l] \subseteq
[p, p'].
Didier precomputes rank-nearest successors and stores them in a table called
Succ. The full algorithm is described in pseudo-code below.
Algorithm 3 (Didier's Algorithm)----------------------------------------------
Input: Two strings S and T
Output: List of all maximal common intervals between S and T
1. For i=1,...,|S|
2. Compute tables Rank w.r.t. S[i,|S|]
3. Compute table of intervals Int for T
4. Compute table Succ for T
5. Initialize List with positions in T of rank 1 and their corresponding
rank intervals
6. While List is not empty
7. (y, [p_k,p_l]) <- the first element of List verifying [p_k,p_l]
\subseteq Int[y]
8. r_y <- Rank[T[y]]
9. j <- Pos[r_y+1]-1
// test if there are locations to output
10. If i = 1 or Rank[S[i-1]] > r_y
11. Output ([i,j], Int[y])
12. previous <- Int[y]
13. For each following element (y, [p_k,p_l]) in List
14. If [p_k,p_l] \subseteq Int[y] and Int[y] != previous
15. Output ([i,j], Int[y])
16. previous <- Int[y]
// compute the next level
17. For each element (y,[p_k,p_l]) in List
18. Remove (y, [p_k,p_l]) from List
19. If Succ[y] exists and y is the leftmost index with minimal
rank distance to Succ[y] among positions of the list
having the same successor
20. y <- Succ[y]
// update path bounds
21. If y < p_k, then p_k <- y
22. If y > p_l, then p_l <- y
23. Add (y, [p_k,p_l]) to List
------------------------------------------------------------------------------
Runtime. Each position i in S is used once as left bound, hence the outer
for loop contributes O(|S|) time. For each fixed i, table Rank can be computed
in O(|S|) time, whereas tables Int and Succ can be computed in O(|T|) time (no
proof). The number of positions visited when following all paths of
rank-nearest successors in T cannot exceed T. In each step, tracking and
extending the path's bounds and testing whether the path's bounds are within
the current rank interval can be performed in constant time. Thus, the
algorithm overall takes O(n^2) time, where n=max(|S|, |T|).