Mini-Workshop on the Storage, Search and Annotation of Multiple Similar Genomes


Place: Bielefeld University, U10-146
Date: December 9-10, 2013

Guests

Name Institution City Country
Corinna Ernst Duisburg-Essen University Essen Germany
Sven Rahmann Duisburg-Essen University Essen Germany
Johannes Fischer Dortmund University Dortmund Germany
Jochen Blom Justus Liebig University Gießen Germany
Dirk Willrodt Hamburg University Hamburg Germany
Veli A. T. Mäkinen Helsinki University Helsinki Finland
Pierre Peterlongo INRIA Rennes France
Claire Lemaitre INRIA Rennes France
Eric Rivals LIRMM Montpellier France
Tobias Marschall CWI Amsterdam Netherlands
Alexander Schönhuth CWI Amsterdam Netherlands

Timetable

Each talk is followed by up to twenty minutes of questions/discussion

Monday 9th December 2013

Start Name
9h Veli A. T. Mäkinen Indexing Pan-Genomes for Variant Calling
10h Coffee Break
10h30 Johannes Fischer Inducing Suffix Arrays in External Memory
11h15 Tobias Marschall The Dutch Pan-Genome
12h05 Lunch time
14h Dirk Willrodt Alignment-free comparison of multiple similar genomes with tiny memory footprints
15h Eric Rivals Building the Assembly De Bruijn Graph from an Implicit Suffix Tree
16h Coffee Break
16h30 Sven Rahmann A Historical Perspective on Integer Linear Programs for Discovering Approximate Gene Clusters
Start Name
9h Corinna Ernst PanCake – A Data Structure for Pangenomes
9h50 Coffee Break
10h20 Pierre Peterlongo Extremely low memory de-bruin graph representation and its potential pan-genomic applications
11h05 Jochen Blom EDGAR - A database framework for comparative gene content analyses
11h55 Lunch time
14h Discussion Session

Talks

PanCake – A Data Structure for Pangenomes

by Corinna Ernst

The PanCake data structure for pangenomes is based on bundling similar sequence regions into shared features, which are derived from genome-wide pairwise sequence alignments. In contrast to many other pangenome analysis tools, like EDGAR or PGAT, PanCake allows identification of core and singleton regions independently of gene annotations. Nevertheless, comparison of identified core and singleton regions show good agreements between methods. Due to the storage of similar subsequences as edit operations, the PanCake data structure requires significantly less space than the individual sequence files. The PanCake software is available from https://bitbucket.org/CorinnaErnst/pancake.


Inducing Suffix Arrays in External Memory

by Johannes Fischer

We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/O-volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields a new internal memory LCP array construction algorithm, and the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).


Building the Assembly De Bruijn Graph from an Implicit Suffix Tree

by Eric Rivals

Work in collaboration between: B. Cazaux, T. Lecroq, E. Rivals

Suffix trees belong to the most studied indexing data structures for strings. Generalised suffix trees can index the substrings of a set of input words. In its construction algorithm Ukkonen used implicit suffix trees, which relax the assumption that a suffix is not a prefix of another suffix, and thus represent some suffixes by internal nodes rather than by leaves. In computational biology, for the sake of genome assembly, one wishes to assemble a target sequence from a multitude of input strings, usually called the reads. Numerous algorithms build assembly related graphs that represent the overlaps of the input reads: for instance, the overlap graph, or a special version of the De Bruijn graph, in which only k-mers occurring within the reads are represented by nodes. We address the question of algorithms for building the overlap graph and the De Bruijn graph directly from indexing data structures of the read set, and investigate it when the index in an implicit generalised suffix tree. We will present the algorithms for the De Bruijn graph and for its contracted version, and show they run in a time that is linear in the size of the index. Generally, this work establishes algorithmic paths to convert classical indexing data structures in graph representations useful for assembly.


Indexing Pan-Genomes for Variant Calling

by Veli Mäkinen

We survey different approaches for indexing individual genomes from the same species or reference genome with its frequent variants observed in population (a.k.a. pan-genome). Then we show how such approaches can be used for variant calling purposes through read alignment. Experiments on different realistic scenarios are reported for some selected indexing approaches. The experiments show that on highly-polymorphic genome regions, that show strong enrichment of known, disease-causing mutations, pan-genome indexing offers a signicant improvement in alignment accuracy against the standard single-reference approaches.


A Historical Perspective on Integer Linear Programs for Discovering Approximate Gene Clusters

by Sven Rahmann

In 2006, Gunnar Klau and myself presented a general framework to describe the similarity between genomes by providing not a single definition, but a class of definitions of the concept of approximate gene clusters that (1) can be written as integer linear programs (ILPs) and (2) allows several variations that include existing specialized definitions such as common intervals, r-windows, and max-gap clusters or gene teams. While the ILP formulation does not directly lead to optimal algorithms, it provides unprecedented generality and is competitive in practice. It allows a non-heuristic study of large approximate clusters in several genomes.


The Dutch Pan-Genome

by Tobias Marschall

In the Genome of the Netherlands (GoNL) Project, we have sequenced the whole genomes of 769 Dutch individuals from 250 families. These data give rise to a rather complete knowledge of genetic variation in the Dutch. After briefly advertizing the tools we developed to discover and genotype structural variations (especially “twilight zone” deletions of length 20-100bp), I will present some statistics on types, lengths, and allele frequencies of structural variations and SNPs in the Dutch population. This will showcase what types pan-genome data structures and associated tools are needed to aid a population-scale sequencing effort like GoNL.


Extremely low memory de-bruin graph representation and its potential pan-genomic applications

by Pierre Peterlongo

We will present an exact de-bruijn graph representation based on a bloom filter. This data-structure circumvents the issue of bloom filter false positives by storing those that are problematic for graph traversals. For instance, such a data-structure, included in the Minia assembler (http://minia.genouest.org), enabled the assembly of a human genome using less than 6 GB memory. We will review in a few words the algorithmic concepts for creating and using this structure. We will additionally discuss the pro/cons while using it for storing/querying pan-genomes and how it could be adapted.


Alignment-free comparison of multiple similar genomes with tiny memory footprints

by Dirk Willrodt

When comparing whole genomes one often needs to calculate an evolutionary distances based on the pairwise number of mutations. Traditionally, this is done by multiple sequence alignment methods, which often do not scale for large genomes. For closely related organisms, alignment free methods can be applied. Several studies have shown that alignment free distance calculations well approximate alignment based methods and often lead to considerable improvements in calculation time. Alignment-free methods are an efficient alternative, though until recently alignment-free distance measures could not be interpreted as mutation distances. A recently published distance measure, Kr, now combines the efficiency of alignment-free methods with the biological relevance of mutation-based distances. It can be interpreted as the well-known Jukes-Cantor estimator of the number of substitutions per site and is based on the lengths of shortest unique substrings between genomes. Kr is implemented in the software kr2, which looks up the required shortest unique substring lengths using enhanced suffix arrays. When kr2 is applied to large similar genomes, such as those of mammals, these data structures can become very large. For example, kr2 requires 72 GB of RAM to compare 12 complete Drosophila genomes. We modified the algorithm used in kr2 such that random access to the enhanced suffix array is eliminated. The resulting algorithm streams the tables of the enhanced suffix array in sequential order, leading to a very small memory peak during the calculation. It is complemented by an algorithm which constructs the enhanced suffix array in disjoint blocks, similar to, but much more efficient than this. We implemented these algorithms in the GenomeTools software package (http://genometools.org). This provides a scalable solution to computing the Kr-distance measure. For example, the 12 Drosophila genomes can be compared in about 3 hours using only 3 GB of RAM, which means a 20-fold space improvement over kr2 and a comparable running time. Another challenging dataset consists of 17 mice strain genomes (38:44 GB sequence data) which could be processed in 4:25 h using 109:97 GB of RAM.


EDGAR - A database framework for comparative gene content analyses

by Jochen Blom

As one result of next generation sequencing, it is now feasible to analyze large groups of related genomes in a comparative approach. A main task in comparative genomics is the identification of orthologous genes in different genomes and the classification of genes as core genes or singletons. To support these studies EDGAR, an “Efficient Database framework for comparative Genome Analyses using BLAST score Ratios” - was developed. EDGAR is designed to automatically perform genome comparisons in a high throughput approach. EDGAR significantly simplifies the comparative analysis of related genomes. The software supports a quick survey of evolutionary relationships and simplifies the process of obtaining new biological insights into the differential gene content of kindred genomes. Currently, the EDGAR platform is providing public access to 131 projects with precomputed genome comparisons for 1,450 genomes. In addition to the public EDGAR data resource, EDGAR is used for the analysis of unpublished data in scientific collaborations within 125 private projects investigating roughly 2,000 genomes.


How to reach the Workshop place

The workshop takes place on the 10th floor of the building U, where the DiDy Research Group is situated.

By train: Arriving at Bielefeld Hbf (main station), exit the building towards the city center; you should see the Bielefelder Hof hotel in front of you (if you see a Cinemaxx, you are on the wrong side!). Cross the street to get to the underground tram (Stadtbahn / U-Bahn) station. Take tram number 4 towards Universität / Lohmannshof. Exit the tram at Universität. Walk up the stairs and cross the red bridge towards the university main building. Enter it and walk up the stairs. You are now standing in the main hall. Turn right, walk about 50m, and look on your left hand side for a staircase labelled U,V,M. Take the elevator to the 10th floor (if the elevator does not go up to the 10th floor, you are at the wrong elevator!). When exiting, turn left and again left. You are now in U10, where our offices are.