0. Organizational stuff
- exercises: you may solve them in groups, but you must hand in your solutions
individually and in your own words. Submit them just before the tutorial of the
following week, preferably by email (PDF!) or on paper. Solutions will then
be discussed right away.
- there will be a written exam in the last class (28.07.2017)
- this lecture is part of the module "Spezielle Algorithmen der Bioinformatik";
the module can be completed with
- 392208 Algorithms in Bioinformatics (seminar with Omar Castillo) and
- 392231 Implementation of Algorithms
-------------------------------------------------------------------------------
1. Introduction
1.1 What are genome rearrangements and what causes them?
- double-strand breaks are common, but are usually repaired by DNA ligases
- mutagens: chemicals, UV-light, etc.
- topoisomerases induce double strand breaks when changing chromatid states
between supercoiled and uncoiled regions.
- if two chromosomal strands are spatially close and double strand breaks occur
simultaneously, errors in re-ligating the correct strands with each other can
occur
(example of rearrangements between human/mouse shown)
1.2 Rearrangements:
- reversal, transposition, translocation, block interchange, fusion, fission,
circularization, linearization
(sorting scenario between human and mouse shown)
(phylogenetic tree, rearrangements along the paths between two species)
- rearrangement events are undirected
1.3 Important questions:
- genomic distance between two genomes: what is the minimum number of
rearrangement operations that transform one genome into another? (assumption
of parsimony)
- genomic sorting scenario between two genomes, i.e. can we find a sequence of
rearrangements with a minimum number of operations?
- relationship of multiple genomes (small parsimony, large parsimony)
- duplication history, genome halving
- detection of conserved regions (syntenic blocks)
1.4 Genome models
- markers: genes vs. blocks
- genes: locations of the genes are known
- markers: location is unknown/arbitrary
- signed vs. unsigned:
- genes: strands known
- markers: strand orientation only makes sense in comparative analysis
- main orthology relations: gene families with exactly one member per genome
(permutation)
- unequal gene content:
- insertions, deletions, otherwise one gene family member per genome
- gene families (unique, with duplicates) vs. family-free (gene
similarities)
-------------------------------------------------------------------------------
2. The (unsigned) reversal distance
Literature:
Kececioglu and Sankoff (1995): Exact and Approximation Algorithms for Sorting by
Reversals, with Application to Genome Rearrangement
Bafna and Pevzner (1996): Genome Rearrangements and Sorting by Reversals
A genome is represented by a permutation, which is a bijection on the set
{1,...,n}: π = (π_1 π_2 ··· π_n)
π^1 = (2 1 4 3 5 8 6 7)
Definition 2.1: A /reversal/ ρ(i,j), 1 ≤ i ≤ j ≤ n, reverses the order of the
elements at positions i through j (both inclusive).
π^1 = (2 1 |4 3 5 8| 6 7) # maybe color the numbers within |..| differently
π^1 ∘ ρ(3, 6) = (2 1 8 5 3 4 6 7)
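A minimal Python sketch of Definition 2.1, assuming 1-based, inclusive positions
i ≤ j as in the example above. The name apply_reversal is chosen here for
illustration and is not part of the lecture.

```python
def apply_reversal(perm, i, j):
    """Return a copy of perm with positions i..j (1-based, inclusive) reversed."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= n")
    p = list(perm)
    p[i - 1:j] = reversed(p[i - 1:j])  # 1-based inclusive -> Python slice
    return tuple(p)


pi1 = (2, 1, 4, 3, 5, 8, 6, 7)
print(apply_reversal(pi1, 3, 6))  # (2, 1, 8, 5, 3, 4, 6, 7)
```

Applying the same reversal twice restores the original permutation, so every
reversal is its own inverse.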
Problem 2.1 (sorting by reversals): Find a shortest series of reversals that
transforms permutation π into σ
Input: Permutations π and σ
Output: A series of reversals ρ_1, ... , ρ_d ("sorting scenario") transforming
π into σ, such that d is minimum.
d is called the reversal distance and is denoted rd(π, σ) from here on
-> in practice, we assume that σ is the identity
-> rd(π) := rd(π, id)
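Assuming σ = id loses no generality: rename every element by its position in σ.
Reversals act on positions, so the same series of reversals that turns π into σ
turns the renamed π into the identity. A small sketch of this renaming (the
function name relabel is hypothetical):

```python
def relabel(pi, sigma):
    """Rename every element of pi by its 1-based position in sigma."""
    if sorted(pi) != sorted(sigma):
        raise ValueError("pi and sigma must contain the same elements")
    position_in_sigma = {x: k for k, x in enumerate(sigma, start=1)}
    return tuple(position_in_sigma[x] for x in pi)


sigma = (3, 1, 2, 5, 4)
pi = (1, 3, 2, 4, 5)
print(relabel(pi, sigma))     # (2, 1, 3, 5, 4)
print(relabel(sigma, sigma))  # (1, 2, 3, 4, 5)
```

With this renaming, rd(π, σ) equals rd(relabel(π, σ)).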
Problem 2.1 was proven NP-hard by Caprara (1997), so computing exact solutions
is infeasible in practice for large n.
2.1 Finding approximate rather than optimal solutions
(Recall) The approximation ratio of an algorithm A on input π is:
r = A(π) / OPT(π)
where: A(π) is the number of reversals used by algorithm A
OPT(π) is the number of reversals in an optimal solution, i.e. rd(π)
How can we sort this permutation?
π^2 = (2 1 5 3 4)
(1 2 5 3 4)
(1 2 3 5 4)
(1 2 3 4 5)
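Algorithm 2.1 (naive): In the i-th step (i = 1, ..., n−1), bring element i to
position i with the reversal ρ(i, j), where j is the current position of i; skip
the step if i is already in place. At most n − 1 reversals are needed, since
element n is in place once all others are.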
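The naive strategy can be sketched in Python as follows, assuming the input is a
permutation of 1..n and reversals are given as 1-based, inclusive pairs (i, j).
The function name naive_sort is ours.

```python
def naive_sort(perm):
    """Sort perm by reversals, fixing positions 1, 2, ... from left to right.

    Returns the list of reversals (i, j), 1-based and inclusive.
    """
    p = list(perm)
    scenario = []
    for i in range(1, len(p)):      # position n is correct once 1..n-1 are
        j = p.index(i) + 1          # current 1-based position of element i
        if j != i:
            p[i - 1:j] = reversed(p[i - 1:j])
            scenario.append((i, j))
    assert p == sorted(perm)
    return scenario


print(naive_sort((2, 1, 5, 3, 4)))  # [(1, 2), (3, 4), (4, 5)]
```

The output reproduces the three steps shown for π^2 above. The sketch only
guarantees a valid scenario, not a shortest one.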
Let's do another example:
π^3 = (5 1 2 3 4)
(1 5 2 3 4)
(1 2 5 3 4)
(1 2 3 5 4)
(1 2 3 4 5)
The naive algorithm needs 4 reversals here, but 2 suffice:
(5 1 2 3 4) -> (5 4 3 2 1) -> (1 2 3 4 5)
Worst case for the naive algorithm: it needs n−1 reversals on π = (n 1 ... n−1),
whereas two reversals suffice:
π = (n 1 ... n−1) -> (n n−1 ... 1) -> (1 2 ... n)
In this case, the ratio between the naive and the optimal solution is (n−1)/2.
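To check the ratio (n−1)/2 on small instances, the sketch below compares the
naive reversal count with the exact distance. The exact distance is computed by
a brute-force breadth-first search over all permutations, which is our own
illustrative device (exponential in n, usable for tiny n only) and not an
algorithm from the lecture. Function names are hypothetical.

```python
from collections import deque


def naive_count(perm):
    """Number of reversals used by the left-to-right naive strategy."""
    p, count = list(perm), 0
    for i in range(1, len(p)):
        j = p.index(i) + 1
        if j != i:
            p[i - 1:j] = reversed(p[i - 1:j])
            count += 1
    return count


def exact_distance(perm):
    """Exact reversal distance by breadth-first search (tiny n only)."""
    start, target = tuple(perm), tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p == target:
            return dist[p]
        for i in range(len(p)):
            for j in range(i + 2, len(p) + 1):  # slices p[i:j] of length >= 2
                q = p[:i] + p[i:j][::-1] + p[j:]
                if q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)


for n in range(3, 8):
    worst = (n,) + tuple(range(1, n))   # the permutation (n 1 ... n-1)
    print(n, naive_count(worst), exact_distance(worst))
```

Each printed line shows n, then n−1 for the naive count, then 2 for the exact
distance.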
Why is this algorithm bad? Because it breaks some “good” parts: applying the
reversal ρ(1, 2) to π = (5 1 2 3 4) to put 1 in its position breaks the “good”
connection between 1 and 2.
2.2 A 2-approximation algorithm for the reversal distance (see Kececioglu and
Sankoff, 1995)
Definition 2.2: For a permutation π = (π_1 π_2 ··· π_n), two elements π_i and
π_{i+1} form a /breakpoint/ (BP) if |π_i − π_{i+1}| > 1, and an
/adjacency/ (ADJ) otherwise.
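Ex.: in π^2 = (2 1 5 3 4), the pairs (1 5) and (5 3) are breakpoints
... but the first element (2) and the last element (4) are out of place, too!
-> add π_0 = 0 and π_{n+1} = n+1 to the beginning and end of the permutation,
so that (0 2) and (4 6) also count as breakpoints.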
Observation 2.1: The extended permutation has at most n+1 breakpoints, and the
only permutation without breakpoints is the identity.
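Counting breakpoints is straightforward once the permutation is extended by 0
and n+1. A minimal sketch, assuming the input is a permutation of 1..n (the
function name breakpoints is ours):

```python
def breakpoints(perm):
    """Breakpoints of the extended permutation (0, perm..., n+1)."""
    ext = (0,) + tuple(perm) + (len(perm) + 1,)
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) > 1]


print(breakpoints((2, 1, 5, 3, 4)))     # [(0, 2), (1, 5), (5, 3), (4, 6)]
print(breakpoints((1, 2, 3, 4, 5)))     # []
print(len(breakpoints((3, 1, 4, 2))))   # 5, i.e. n + 1 for n = 4
```

The last call shows that the upper bound n+1 of Observation 2.1 is attained.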