13. Scaffolding ancient contigs
Literature:
- Rajaraman, Tannier, and Chauve. "FPSAC: fast phylogenetic scaffolding of
ancient contigs." Bioinformatics 29.23 (2013): 2987-2994.
- Luhmann, Nina. "Phylogenetic assembly of paleogenomes integrating ancient
DNA data." (2017).
- Maňuch, et al. "Linearization of ancestral multichromosomal genomes." BMC
bioinformatics 13.19 (2012): S11.
- Luhmann, Doerr, and Chauve. "Comparative scaffolding and gap filling of
ancient bacterial genomes applied to two ancient Yersinia pestis genomes."
Microbial Genomics (2017).
Computational paleogenomics is the discipline that aims at developing methods
for reconstructing the organization of ancestral genomes based on extent
(present) and extant genomic data.
Modern sequencing equipment and -protocols permits the recovery of ancient DNA
(aDNA) sequences from archaeological probes. The obtained paleogenomic sequences
are typically highly fragmented due to the fact that DNA degenerates quickly
over time. As a result, assembling such data is difficult. New computational
methods make use of information from extant genomic data to improve the
assembly. Such approaches proceed in four steps:
1. Construction of an assembly graph from aDNA reads
2. Identification and mapping of ancient markers in aDNA contigs onto extant
sequences
3. Construction of a non-conflicting set of adjacencies in assembly graph based
on the conservation of extant adjacencies in a given phylogeny.
4. Closing of sequence gaps between ancestral markers
Step 1 can be performed with traditional De Bruijn-based assembly methods.
Computing local alignments between ancestral contigs and extant genome sequences
allows the subsequent construction of marker sets and their extant counterparts.
13.1 Finding a maximum set of non-conflicting adjacencies
In Chapter 10, we have already discussed a method that allows us to obtain a
non-conflicting subset of adjacencies from three genomes by computing a maximum
weighted matching. This approach can be generalized for arbitrary weights of
adjacencies. In doing so, we can take phylogenetic relationships into account
and weight adjacencies not only by their presence in extant genomes, but also by
their evolutionary distance from the reconstructed ancestral genome.
13.2 Ensuring a non-conflicting evolutionary history of adjacencies
Sankoff-Rousseau's algorithm solves the weighted parsimony problem when state
changes can have different costs. In doing so, it takes edge weights of the
phylogeny into account. If edge weights correlate with evolutionary time, then a
state change on a long branch is more likely to occur than on a short branch,
simply because there is more time for the event to happen. Instead of minimizing
the number of changes (Fitch parsimony), Sankoff-Rousseau minimizes the total
number of costs of changes. To this end the algorithm makes use of a the
following cost functions:
- Cost for transition from state s to t between two nodes u, v:
| 0 if s = t
d(s, t) := <
| 1/l(v, u) otherwise
- C(v, l) := min cost of subtree rooted at node v if v has state l
- Sankoff-Rousseau's Algorithm ------------------------------------------------
1. Bottom-up phase:
- Initialization for each leaf i and observed character x
| 0 if x_i = x
C(i, x) = <
| \infty otherwise
- Post-order traversal:
for inner node i with children j, k:
C(i, x) = min_{states y}( C(j, y) + d(x, y) )+ min_{states y}( C(k, y) +
d(x, y))
set pointers x -> y_j, y_k to state with minimum cost C(i, x)
2. Top-down phase:
- Choose min-cost state for root
- pre-order traversal of tree:
for node j with parent i, choose state y that produced C(i, x) using
pre-computed pointers
-------------------------------------------------------------------------------
The following algorithm makes use of a /sparse/ variant of Sankoff-Rousseau's
algorithm, meaning that when the costs of absence (0) and presence (1) of an
adjacency in any node is equal, absence is always chosen in order to ensure
a conflict-free adjacency set.
- Algorithm EWRA -------------------------------------------------------------
1. Attach an additional leaf to the assembly graph node v
2. Assign (possibly conflicting) adjacencies inferred from assembly graph to
new leaf node
3. Reroot the tree such that v becomes its root
4. Perform Bottom-up phase of sparse edge-weighted Sankoff-Rousseau Algorithm
for each adjacency at the leaves
5. Choose min-cost states for all adjacencies A at the root
6. Find a maximum non-conflicting subset of adjacencies A' of A
7. Perform Top-down phase of sparse edge-weighted Sankoff-Rousseau Algorithm
for each adjacency of A'
-------------------------------------------------------------------------------
13.3 Closing gaps
If gap lengths between related species in the phylogeny are of roughly equal
size, one can assume that only point mutations occurred. In this case, the
ancestral sequence can be recovered in four steps:
- Algorithm AGapEs -----------------------------------------------------------
1. Compute an alignment between sequences of extant genomes associated
with an ancestral gap
2. Construct a /template sequence/ using Fitch's or Sankoff's algorithm
3. Map aDNA reads onto the template sequence and its surrounding markers
4. Find sequence of overlapping mappings that covers t with minimal distance (an
application of the shortest path problem)
-------------------------------------------------------------------------------