13. Scaffolding ancient contigs Literature: - Rajaraman, Tannier, and Chauve. "FPSAC: fast phylogenetic scaffolding of ancient contigs." Bioinformatics 29.23 (2013): 2987-2994. - Luhmann, Nina. "Phylogenetic assembly of paleogenomes integrating ancient DNA data." (2017). - Maňuch, et al. "Linearization of ancestral multichromosomal genomes." BMC bioinformatics 13.19 (2012): S11. - Luhmann, Doerr, and Chauve. "Comparative scaffolding and gap filling of ancient bacterial genomes applied to two ancient Yersinia pestis genomes." Microbial Genomics (2017). Computational paleogenomics is the discipline that aims at developing methods for reconstructing the organization of ancestral genomes based on extent (present) and extant genomic data. Modern sequencing equipment and -protocols permits the recovery of ancient DNA (aDNA) sequences from archaeological probes. The obtained paleogenomic sequences are typically highly fragmented due to the fact that DNA degenerates quickly over time. As a result, assembling such data is difficult. New computational methods make use of information from extant genomic data to improve the assembly. Such approaches proceed in four steps: 1. Construction of an assembly graph from aDNA reads 2. Identification and mapping of ancient markers in aDNA contigs onto extant sequences 3. Construction of a non-conflicting set of adjacencies in assembly graph based on the conservation of extant adjacencies in a given phylogeny. 4. Closing of sequence gaps between ancestral markers Step 1 can be performed with traditional De Bruijn-based assembly methods. Computing local alignments between ancestral contigs and extant genome sequences allows the subsequent construction of marker sets and their extant counterparts. 13.1 Finding a maximum set of non-conflicting adjacencies In Chapter 10, we have already discussed a method that allows us to obtain a non-conflicting subset of adjacencies from three genomes by computing a maximum weighted matching. This approach can be generalized for arbitrary weights of adjacencies. In doing so, we can take phylogenetic relationships into account and weight adjacencies not only by their presence in extant genomes, but also by their evolutionary distance from the reconstructed ancestral genome. 13.2 Ensuring a non-conflicting evolutionary history of adjacencies Sankoff-Rousseau's algorithm solves the weighted parsimony problem when state changes can have different costs. In doing so, it takes edge weights of the phylogeny into account. If edge weights correlate with evolutionary time, then a state change on a long branch is more likely to occur than on a short branch, simply because there is more time for the event to happen. Instead of minimizing the number of changes (Fitch parsimony), Sankoff-Rousseau minimizes the total number of costs of changes. To this end the algorithm makes use of a the following cost functions: - Cost for transition from state s to t between two nodes u, v: | 0 if s = t d(s, t) := < | 1/l(v, u) otherwise - C(v, l) := min cost of subtree rooted at node v if v has state l - Sankoff-Rousseau's Algorithm ------------------------------------------------ 1. Bottom-up phase: - Initialization for each leaf i and observed character x | 0 if x_i = x C(i, x) = < | \infty otherwise - Post-order traversal: for inner node i with children j, k: C(i, x) = min_{states y}( C(j, y) + d(x, y) )+ min_{states y}( C(k, y) + d(x, y)) set pointers x -> y_j, y_k to state with minimum cost C(i, x) 2. Top-down phase: - Choose min-cost state for root - pre-order traversal of tree: for node j with parent i, choose state y that produced C(i, x) using pre-computed pointers ------------------------------------------------------------------------------- The following algorithm makes use of a /sparse/ variant of Sankoff-Rousseau's algorithm, meaning that when the costs of absence (0) and presence (1) of an adjacency in any node is equal, absence is always chosen in order to ensure a conflict-free adjacency set. - Algorithm EWRA ------------------------------------------------------------- 1. Attach an additional leaf to the assembly graph node v 2. Assign (possibly conflicting) adjacencies inferred from assembly graph to new leaf node 3. Reroot the tree such that v becomes its root 4. Perform Bottom-up phase of sparse edge-weighted Sankoff-Rousseau Algorithm for each adjacency at the leaves 5. Choose min-cost states for all adjacencies A at the root 6. Find a maximum non-conflicting subset of adjacencies A' of A 7. Perform Top-down phase of sparse edge-weighted Sankoff-Rousseau Algorithm for each adjacency of A' ------------------------------------------------------------------------------- 13.3 Closing gaps If gap lengths between related species in the phylogeny are of roughly equal size, one can assume that only point mutations occurred. In this case, the ancestral sequence can be recovered in four steps: - Algorithm AGapEs ----------------------------------------------------------- 1. Compute an alignment between sequences of extant genomes associated with an ancestral gap 2. Construct a /template sequence/ using Fitch's or Sankoff's algorithm 3. Map aDNA reads onto the template sequence and its surrounding markers 4. Find sequence of overlapping mappings that covers t with minimal distance (an application of the shortest path problem) -------------------------------------------------------------------------------