Workshop on Data Structures in Bioinformatics 2015/16


Place: Bielefeld University, room X-E1-201
Date: February 23-24, 2016


Introduction

DSB is an incubator of ideas and facilitates exchanges as well as collaborations on topics related to data structures in bioinformatics. Speakers will present state-of-the-art techniques and recent advances in this field.


Participants

Name - Affiliation
Pall Melsted - University of Iceland, Reykjavik, Iceland
Anthony J. Cox - Illumina, Cambridge, UK
Zamin Iqbal - Wellcome Trust Centre for Human Genetics, Oxford, UK
Sorina Maciuca - Wellcome Trust Centre for Human Genetics, Oxford, UK
Rachel Norris - Wellcome Trust Centre for Human Genetics, Oxford, UK
Solon P. Pissis - King's College London, UK
Rayan Chikhi - University of Lille, France
Pierre Peterlongo - INRIA, Rennes, France
Antoine Limasset - INRIA, Rennes, France
Bastien Cazaux - LIRMM and IBC, Montpellier, France
Rodrigo Canovas - LIRMM and IBC, Montpellier, France
Karel Břinda - University of Paris-Est, France
Nicola Prezza - University of Udine, Italy
Jasmijn Baaijens - CWI, Amsterdam, Netherlands
Alexander Schoenhuth - CWI, Amsterdam, Netherlands
Johannes Fischer - University of Dortmund, Germany
Agnieszka Danek - Silesian University of Technology, Gliwice, Poland
Adam Gudys - Silesian University of Technology, Gliwice, Poland
Marek Kokot - Silesian University of Technology, Gliwice, Poland
Maciej Dlugosz - Silesian University of Technology, Gliwice, Poland


Organizing Committee

Name - Affiliation
Guillaume Holley - Bielefeld University, Germany
Roland Wittler - Bielefeld University, Germany
Jens Stoye - Bielefeld University, Germany
Eric Rivals - LIRMM, Montpellier, France


Schedule

Each 25-minute talk is followed by up to 20 minutes of questions and discussion.

Tuesday, February 23rd

Start - Speaker - Title
9h - Rayan Chikhi - Fast navigation in de Bruijn graphs
9h45 - Pall Melsted - The de Bruijn Graph as an Index
10h30 - Coffee Break
11h - Nicola Prezza - Space-efficient compression and indexing of genomic big data: theory and practice
11h45 - Johannes Fischer - Approximating LZ77
12h30 - Lunch
14h - Anthony Cox - BWT-based predictive compression of DNA sequences
14h45 - Guillaume Holley - BFT-Comp: Alignment-free and reference-free compression of pan-genome sequencing reads
15h30 - Coffee Break
16h - Rodrigo Canovas - Affix Tree Data Structures
16h45 - Open discussions
Wednesday, February 24th

Start - Speaker - Title
9h - Bastien Cazaux - The Superstring Graph
9h45 - Solon P. Pissis - Sequence Comparison Using Positive and Negative Information
10h30 - Coffee Break
11h - Sorina Maciuca - Mapping to a graph of genome variation, with applications to P. falciparum
11h45 - Agnieszka Danek - Compression and indexing of collections of genomes
12h30 - Lunch
14h - Jasmijn Baaijens - Viral quasispecies assembly to the next level: going de novo
14h45 - Karel Břinda - Indexing structures for metagenomic classification
15h30 - Coffee Break
16h - Open discussions


Talks

Sequence Comparison Using Positive and Negative Information

by Solon P. Pissis

Joint work with Maxime Crochemore, Gabriele Fici, Alice Heliou and Robert Mercaş.

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this talk, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. We will present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we will present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences.
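
To make the definitions above concrete, here is a small brute-force Python sketch (purely illustrative, and far from the linear-time, linear-space algorithm presented in the talk) that enumerates the minimal absent words of a sequence. It relies on the fact that every minimal absent word of length at least two has the form aub, where au and ub both occur in the sequence:

    from itertools import product

    def minimal_absent_words(s, alphabet="ACGT"):
        """Brute-force enumeration of the minimal absent words of s.
        A word is a minimal absent word (MAW) if it does not occur in
        s but all of its proper factors do.  Quadratic space, for
        illustration only."""
        # All factors (substrings) of s, including the empty string.
        factors = {s[i:j] for i in range(len(s) + 1)
                          for j in range(i, len(s) + 1)}
        maws = {c for c in alphabet if c not in factors}  # length-1 MAWs
        for u in factors:
            for a, b in product(alphabet, repeat=2):
                if (a + u in factors and u + b in factors
                        and a + u + b not in factors):
                    maws.add(a + u + b)
        return sorted(maws)

    print(minimal_absent_words("ACGT"))  # all 2-mers except AC, CG, GT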


Space-efficient compression and indexing of genomic big data: theory and practice

by Nicola Prezza

As the number of sequenced species and taxonomies grows at an ever-increasing pace, new computational challenges arise in the effort to process these huge amounts of data. In particular, the problem of indexing big datasets (in order to speed up subsequent search/access queries) and compressing repetitive collections (e.g. taxonomies) is complicated by the fact that existing algorithms are extremely memory-consuming, which significantly limits the size of the datasets that can be handled. Even if the size of the final compressed representation can be orders of magnitude smaller than that of the original dataset, standard algorithms require a lot of space (RAM or disk) at construction time; in some cases, this space can be several times the input size.

As shown by recent algorithmic results, the problems of text indexing and text compression are tightly connected. As a matter of fact, a large portion of text indexes, as well as compression algorithms, makes use of two main techniques: the Burrows-Wheeler Transform (BWT) and Lempel-Ziv parsing (LZ77). An important problem to tackle is therefore the following: can we build the BWT and LZ77 using working space close to the size of the final compressed representation?
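
For readers less familiar with the BWT, the following Python sketch computes it by the textbook sorted-rotations definition. Note how this naive construction materializes all rotations: it is exactly this gap between construction space and final compressed size that the talk addresses.

    def bwt(s, terminator="$"):
        """Burrows-Wheeler transform via sorted rotations.  The sort
        key materializes every rotation, so the working space is far
        larger than the output -- unlike the compressed-space
        construction algorithms discussed in the talk."""
        s += terminator  # unique, lexicographically smallest end marker
        n = len(s)
        order = sorted(range(n), key=lambda i: s[i:] + s[:i])
        return "".join(s[i - 1] for i in order)  # last column

    print(bwt("BANANA"))  # ANNB$AA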

In this talk, I will answer the above question positively by showing algorithms and data structures that build the BWT and LZ77 in compressed working space. This is particularly useful when the dataset is extremely repetitive: in such cases, our results require much less working space than standard algorithms working in uncompressed space. All the results have been implemented in a highly optimized C++ library collecting several dynamic compressed data structures. The final part of the talk will be dedicated to a description of our library and to experimental results. I will conclude with some open problems concerning text repetitiveness measures and algorithmic challenges in text compression.


BFT-Comp: Alignment-free and reference-free compression of pan-genome sequencing reads

by Guillaume Holley

Joint work with Faraz Hach.

The advent of High Throughput Sequencing (HTS) raises major concerns about the storage and transmission of the data produced by sequencers. In particular, large-scale sequencing projects extract large amounts of DNA sequence from tens to several thousands of genomes per species. Such a large collection of highly similar and redundant sequences, called a pan-genome, is ideal for compression but is often compressed into multiple archives. Indeed, the HTS-specific compression tools developed to counter the massive growth of such data and reduce storage costs do not allow archives to be edited with newly produced data without complete decompression. We present a new alignment-free and reference-free compression method called BFT-Comp that is based on the Bloom Filter Trie. It tackles the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. Sequencing reads can be added to a BFT-Comp archive without the need to decompress it entirely.
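
As a rough illustration of the probabilistic k-mer membership underlying the Bloom Filter Trie, here is a plain Bloom filter sketch in Python (not the trie itself, which additionally shares prefixes and supports graph traversal; the size and hash choices below are arbitrary):

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter for k-mer membership (illustration only)."""

        def __init__(self, m=1 << 16, num_hashes=3):
            self.m, self.k = m, num_hashes
            self.bits = bytearray(m // 8)

        def _positions(self, kmer):
            for seed in range(self.k):
                h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, kmer):
            for p in self._positions(kmer):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, kmer):
            return all(self.bits[p // 8] >> (p % 8) & 1
                       for p in self._positions(kmer))

    read, k = "ACGTACGTGA", 4
    bf = BloomFilter()
    for i in range(len(read) - k + 1):  # insert all 4-mers of a toy read
        bf.add(read[i:i + k])
    print("ACGT" in bf, "TTTT" in bf)   # True False (False almost surely)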


The Superstring Graph

by Bastien Cazaux

Joint work with Eric Rivals.

Merging words according to their overlaps yields a superstring. This basic operation makes it possible to infer long strings from a collection of short pieces, as in genome assembly. To capture a maximum of overlaps, the goal is to infer the shortest superstring of a set of input words. The Shortest Cyclic Cover of Strings (SCCS) problem asks, instead of a single linear superstring, for a set of cyclic strings that contain the words as substrings and whose sum of lengths is minimal. SCCS is used as a crucial step in polynomial-time approximation algorithms for the notoriously hard Shortest Superstring problem, but it is currently solved in cubic time. The cyclic strings are then cut and merged to build a linear superstring. SCCS can also be solved by a greedy algorithm. Here, we propose a linear-time algorithm for solving SCCS based on an Eulerian graph that captures all greedy solutions in linear space. Because the graph is Eulerian, this algorithm can also find a greedy solution of SCCS with the least number of cyclic strings. This has implications for solving certain instances of the Shortest linear or cyclic Superstring problems.
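
The basic merge operation can be illustrated with a naive greedy superstring sketch in Python (a cubic-time toy that assumes no input word is a substring of another, in contrast with the linear-time superstring graph of the talk):

    def overlap(u, v):
        """Length of the longest proper suffix of u that is a prefix of v."""
        for L in range(min(len(u), len(v)) - 1, 0, -1):
            if u.endswith(v[:L]):
                return L
        return 0

    def greedy_superstring(words):
        """Repeatedly merge the pair of words with the largest overlap."""
        words = list(words)
        while len(words) > 1:
            L, i, j = max((overlap(u, v), i, j)
                          for i, u in enumerate(words)
                          for j, v in enumerate(words) if i != j)
            merged = words[i] + words[j][L:]
            words = [w for k, w in enumerate(words) if k not in (i, j)]
            words.append(merged)
        return words[0]

    print(greedy_superstring(["ACGT", "CGTA", "GTAC"]))  # ACGTAC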


Mapping to a graph of genome variation, with applications to P. falciparum

by Sorina Maciuca

Joint work with Pf3k Consortium, Dominic Kwiatkowski, Gil McVean, Zamin Iqbal.

The current paradigm of mapping NGS reads to a single reference genome per species creates considerable problems in regions of high diversity and structural variation. In principle, a data structure that incorporated known variation would provide a substrate for less biased analysis. We have developed a data structure that incorporates the variation seen in a population by embedding alternative sequences into the linear reference genome. This defines an implicit partial-order graph that stores homologous sequences in regions with variation. The graph is linearised and encoded with an FM-index, enabling construction in low time and memory. We then use a wavelet tree over the Burrows-Wheeler Transform and a modified version of the backward search algorithm to align reads to paths in our reference graph. From this, we infer a personalised reference sequence for each sample, which enables the re-use of standard downstream analyses based on mapping pileups, and allows easier incorporation of annotation.
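
The classic backward search on which the modified algorithm builds can be sketched as follows (a toy Python version in which rank queries scan the BWT directly; the wavelet tree answers them in logarithmic time):

    def backward_search(bwt, pattern):
        """Count occurrences of pattern using only the BWT of text$.
        The talk's modified version can additionally follow the
        alternative alleles embedded in the graph."""
        # C[c]: number of symbols in the BWT strictly smaller than c.
        C = {c: sum(x < c for x in bwt) for c in set(bwt)}
        lo, hi = 0, len(bwt)  # current suffix-array interval
        for c in reversed(pattern):
            if c not in C:
                return 0
            lo = C[c] + bwt[:lo].count(c)  # rank of c before lo
            hi = C[c] + bwt[:hi].count(c)  # rank of c before hi
            if lo >= hi:
                return 0
        return hi - lo

    # "ANNB$AA" is the BWT of "BANANA$"; "ANA" occurs twice in "BANANA".
    print(backward_search("ANNB$AA", "ANA"))  # 2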


Fast navigation in de Bruijn graphs

by Rayan Chikhi

De Bruijn graphs are among the basic objects used by bioinformatics methods to analyze sequencing data. Formally, a de Bruijn graph is a directed graph, where nodes are all substrings of length k present in the data, and edges represent exact overlaps of length k-1 between pairs of nodes. A majority of de novo assembly tools are based on de Bruijn graphs. In order to assemble large genomes, it is critical to have efficient algorithms and data structures to construct and represent these graphs in memory.

This presentation deals with algorithms and data structures for the representation of de Bruijn graphs. In a RECOMB 2014 article, we proposed and studied the concept of navigational data structures, as well as algorithms focused on achieving small memory. In this presentation, I will quickly recall some results and open problems from this paper. Then, I will present a simple succinct data structure for representing de Bruijn graphs, based on a minimal perfect hash function. The structure offers better performance for navigational queries, at the expense of using slightly more memory. An implementation is available in the GATB library.
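
Navigational queries on such a graph reduce to a handful of membership queries on the k-mer set, as this toy Python sketch shows (a plain hash set stands in for the minimal-perfect-hash-based structure of the talk):

    def successors(node, kmers, alphabet="ACGT"):
        """Out-neighbours of a k-mer: shift left, try each letter."""
        return [node[1:] + c for c in alphabet if node[1:] + c in kmers]

    reads, k = ["ACGTC", "CGTCA"], 3
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    print(successors("ACG", kmers))  # ['CGT']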


Indexing structures for metagenomic classification

by Karel Břinda

Joint work with Gregory Kucherov, Kamil Salikhov and Maciej Sykulski.

Metagenomics is a powerful approach to studying the genetic content of environmental samples, which has been strongly promoted by NGS technologies. One of the main tasks is the assignment of the reads of a metagenome to taxonomic units, and the subsequent abundance estimation. Most recently developed programs for this task (such as LMAT, KRAKEN, KALLISTO) perform the assignment based on k-mers shared between reads and references. In such an approach, two major algorithmic subproblems can be distinguished: designing a k-mer index for a huge database of reference genomes and a given taxonomic tree, and designing an algorithm for assigning reads to taxonomic units from information on shared k-mers.

In this talk, we consider the problem of index design and present two novel data structures, both providing a full list of the genomes containing a queried k-mer. The first structure is a tree of Bloom filters (somewhat similar to, but different from, the Sequence Bloom Trees of [Solomon, Kingsford 2015]), applied to a binarized version of the taxonomic tree. The second approach is based on a BWT-index applied to sequences encoding the k-mers proper to each node of the taxonomic tree. We analyse the usefulness of both indexes and compare them in terms of speed and memory requirements.
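
The query logic of the first, hierarchical index can be sketched as follows (a toy Python version in which exact k-mer sets stand in for Bloom filters, so the false positives a real Bloom filter may report are ignored here):

    class Node:
        """Node of a binarized taxonomic tree; stores the k-mers of
        all genomes below it (a Bloom filter in the real index)."""
        def __init__(self, kmers, children=(), genome=None):
            self.kmers, self.children, self.genome = kmers, list(children), genome

    def genomes_with_kmer(node, kmer):
        """List all genomes containing kmer, descending only into
        subtrees whose k-mer set reports a hit."""
        if kmer not in node.kmers:
            return []
        if not node.children:
            return [node.genome]
        return [g for child in node.children
                for g in genomes_with_kmer(child, kmer)]

    g1 = Node({"ACG", "CGT"}, genome="genome1")
    g2 = Node({"CGT", "GTA"}, genome="genome2")
    root = Node(g1.kmers | g2.kmers, children=[g1, g2])
    print(genomes_with_kmer(root, "CGT"))  # ['genome1', 'genome2']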


The de Bruijn Graph as an Index

by Pall Melsted

The problem of mapping short reads to highly identical sequences (targets) is fundamental to RNA-Seq, metagenomic sequencing, and population-based graph references. The concept of a pseudoalignment was first introduced with kallisto to effectively solve the first two problems. Simply put, a pseudoalignment is a list of the targets a read could have come from, rather than a base-by-base alignment of nucleotides.

The raw speed of kallisto is obtained by using a Target de Bruijn Graph (T-DBG), namely a dBG of the references provided, to produce pseudoalignments efficiently with as few cache misses as possible. In this talk we will delve into the nitty-gritty details of the kallisto algorithm. We will also explore the trade-offs between various k-mer sizes, extensions using minimizers and how they relate to T-DBGs, as well as future applications of T-DBGs.
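
At its core, a pseudoalignment can be sketched as the intersection of the target sets of a read's k-mers, as in this toy Python version (kallisto gains its speed by walking the T-DBG so that redundant k-mer lookups are skipped):

    def pseudoalign(read, kmer_index, k):
        """Targets compatible with every k-mer of the read."""
        targets = None
        for i in range(len(read) - k + 1):
            hits = kmer_index.get(read[i:i + k], set())
            targets = hits if targets is None else targets & hits
            if not targets:
                return set()  # read incompatible with all targets
        return targets or set()

    # Toy index: k-mer -> set of transcripts containing it (k = 3).
    index = {"ACG": {"t1", "t2"}, "CGT": {"t1", "t2"}, "GTA": {"t1"}}
    print(pseudoalign("ACGTA", index, 3))  # {'t1'}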


Viral quasispecies assembly to the next level: going de novo

by Jasmijn Baaijens

Joint work with Amal Makrini, Eric Rivals and Alexander Schoenhuth.

A virus infection usually consists of a group of virus strains, each of which presents its own haplotypic sequence, together commonly referred to as a viral quasispecies. The number of different haplotypes can vary widely, and each strain may appear at a different frequency. In order to assemble not just a consensus sequence but each of the individual haplotypes, one needs to distinguish sequencing errors from true variants of very low frequency. Furthermore, the high diversity within a virus population may also cause the available reference genomes to be very different from the haplotypes in question. This can be problematic for reference-guided assembly methods, which is why we have developed an algorithm for de novo assembly of viral quasispecies. Our method is based on an existing tool that uses a reference genome to construct an overlap-like graph, followed by an iterative procedure that merges maximal cliques into super-reads. Doing this reference-free presents several challenges, such as finding all approximate suffix-prefix overlaps among the reads, eliminating cycles from the graph, and enumerating maximal cliques efficiently. Eventually, this method yields error-corrected haplotype contigs that can grow into full haplotypes. We present some preliminary results on a dataset consisting of Illumina reads of five HIV strains.
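
The exact version of the suffix-prefix overlap step can be sketched naively in Python (a quadratic toy; the de novo setting needs approximate overlaps, which are found with indexing techniques rather than by direct comparison):

    def overlap_graph(reads, min_overlap):
        """Edge u -> v with weight L when a suffix of u of length
        L >= min_overlap equals a prefix of v (exact overlaps only)."""
        edges = {}
        for u in reads:
            for v in reads:
                if u is v:
                    continue
                for L in range(min(len(u), len(v)) - 1, min_overlap - 1, -1):
                    if u.endswith(v[:L]):
                        edges[(u, v)] = L
                        break
        return edges

    reads = ["ACGTAC", "GTACGG", "TACGGA"]
    print(overlap_graph(reads, 3))
    # {('ACGTAC', 'GTACGG'): 4, ('ACGTAC', 'TACGGA'): 3, ('GTACGG', 'TACGGA'): 5}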


Approximating LZ77

by Johannes Fischer

We show how to approximate the LZ77 parse of a string of length n in O(n log n) time. If the optimal parse consists of z phrases, then using only O(z) working space we can return a parse consisting of at most 2z phrases in O(n log n) time, and a parse of at most (1 + ε)z phrases in O(n log² n) time, for any constant ε > 0. As previous linearithmic-time algorithms for LZ77 use Ω(n poly log n) space, while z can be exponentially small in n, these improvements in space are substantial.
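
To fix notation, the exact greedy parse being approximated can be computed by this quadratic-time Python sketch, where each phrase is the longest factor that already occurs earlier in the string plus one fresh character:

    def lz77(s):
        """Greedy LZ77 parse as (previous factor, next character)
        pairs; the empty string marks 'no previous occurrence'."""
        phrases, i = [], 0
        while i < len(s):
            L = 0
            while i + L < len(s):
                j = s.find(s[i:i + L + 1])  # leftmost occurrence
                if j < i:                   # it starts before i: extend
                    L += 1
                else:
                    break
            phrases.append((s[i:i + L], s[i + L] if i + L < len(s) else ""))
            i += L + 1
        return phrases

    print(lz77("ABABABA"))  # [('', 'A'), ('', 'B'), ('ABABA', '')] -> z = 3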


BWT-based predictive compression of DNA sequences

by Anthony Cox

Joint work with Lilian Janin.

Most DNA sequencing technologies exploit parallelization in some form - as time proceeds, data is added to multiple sequences at once. For Illumina platforms, every “cycle” of sequencing appends a base to each of the sequences found on the flowcell.

Once enough cycles have been completed, the next base of many of these sequences can often be correctly predicted from the bases that have been accumulated already. We will describe methods based on Burrows-Wheeler indexes that achieve compression by encoding only the relatively few discrepancies between this prediction and the actual data.

The compression is done in a cycle-by-cycle fashion that allows, for example, transfer of compressed data to the cloud to begin even before a sequencing run has been completed. We will consider both base calls and quality scores, and describe reference-free and reference-based flavours of the approach that offer different trade-offs between the computational resources that are needed and the level of compression that is attained.
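
The "encode only the discrepancies" idea can be illustrated with a toy cycle-by-cycle Python sketch. Note that this toy predicts each base from a simple order-k context model, not from the Burrows-Wheeler indexes of the actual method:

    from collections import Counter, defaultdict

    def predictive_encode(reads, k=3):
        """Process reads cycle by cycle, predict each next base as the
        most frequent continuation of its k-base context seen so far,
        and record only the discrepancies."""
        model = defaultdict(Counter)  # context -> next-base counts
        discrepancies = []            # (read index, cycle, actual base)
        for cycle in range(max(len(r) for r in reads)):
            for ri, read in enumerate(reads):
                if cycle >= len(read):
                    continue
                counts = model[read[max(0, cycle - k):cycle]]
                predicted = counts.most_common(1)[0][0] if counts else None
                if read[cycle] != predicted:
                    discrepancies.append((ri, cycle, read[cycle]))
                counts[read[cycle]] += 1
        return discrepancies

    reads = ["ACGTACGT", "ACGTACGA", "ACGTTCGT"]
    print(len(predictive_encode(reads)))  # 12 discrepancies encode 24 bases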


Compression and indexing of collections of genomes

by Agnieszka Danek

Joint work with Sebastian Deorowicz.

NGS technologies have made sequencing faster and cheaper and, thus, available to a broad community of researchers. Attention is not only on the human genome: more and more sequenced genomes are available, and there is growing interest in sequencing the genomes of multiple individuals from the same species. The more genomes in a collection, the more challenging the storage, processing and analysis of such large datasets become. In this talk, I will present state-of-the-art approaches for the compression and indexing of large collections of genomes of the same species, focusing on the results achieved by our solutions.

Attempts to compress a single genome have revealed that it is almost incompressible. However, when dealing with a collection of genomes of the same species, the fact that they are highly similar makes them much more susceptible to compression. Although dictionary-based methods are the most suitable, most general-purpose LZ-style compressors cannot cope well with large distances between long matches and are often useless in such cases. Consequently, many specialised algorithms have appeared, in which one (or more) genome is compressed with respect to another.

As for indexing a large collection of genomes, the most challenging aspect is the memory requirement of the resulting data structure. Efficient search for a pattern using the constructed index is possible only when the index fits in main memory. Classic indexes are not a good choice, as they struggle even with a single mammalian-sized genome.

Most of the existing approaches to compressing or indexing a collection of genomes work with raw input sequences. Only a few of the algorithms use additional knowledge about the differences between sequences in the set, whereas the common practice nowadays is to represent a collection of genomes of the same species as a reference sequence together with a description of all variants occurring in each individual. I will show that this knowledge can greatly improve the efficiency of compression and indexing, while reducing the amount of necessary resources.
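
The reference-plus-variants representation can be sketched as follows (a toy Python model restricted to substitutions; real variant descriptions also cover indels and larger rearrangements):

    def apply_variants(reference, variants):
        """Rebuild one individual's sequence from the shared reference
        and its own (position, alternative base) list."""
        seq = list(reference)
        for pos, alt in variants:
            seq[pos] = alt
        return "".join(seq)

    reference = "ACGTACGTACGT"
    collection = {  # one compact variant list per individual
        "ind1": [(3, "A")],
        "ind2": [(3, "A"), (7, "C")],
    }
    for name, variants in collection.items():
        print(name, apply_variants(reference, variants))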


Affix Tree Data Structures

by Rodrigo Canovas

Joint work with Eric Rivals.

The Suffix Tree is a well-known data structure for string analysis over a text, offering unidirectional algorithms for pattern search problems. The Affix Tree is a generalization of the Suffix Tree in which bidirectional search is supported. The bottleneck of Affix Trees is the space required to store the data structure and the complexity of creating a practical implementation. In this work we explore different approaches to storing the Affix Tree data structure in uncompressed or compressed form.
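
Bidirectional search itself can be illustrated with a naive Python stand-in that rescans the text at every extension; an affix tree supports exactly these left and right extensions without rescanning:

    def occurrences(text, pattern):
        """All start positions of pattern in text (naive scan)."""
        return [i for i in range(len(text) - len(pattern) + 1)
                if text.startswith(pattern, i)]

    class BidirectionalSearch:
        """Grow a match by one character on either side."""
        def __init__(self, text):
            self.text, self.pattern = text, ""

        def extend_right(self, c):
            self.pattern += c
            return occurrences(self.text, self.pattern)

        def extend_left(self, c):
            self.pattern = c + self.pattern
            return occurrences(self.text, self.pattern)

    bs = BidirectionalSearch("BANANA")
    print(bs.extend_right("N"))  # 'N'   -> [2, 4]
    print(bs.extend_left("A"))   # 'AN'  -> [1, 3]
    print(bs.extend_right("A"))  # 'ANA' -> [1, 3]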


How to reach the Workshop place

By train: Arriving at Bielefeld Hbf (main station), exit the building towards the city center; you should see the Bielefelder Hof hotel in front of you (if you see a Cinemaxx, you are on the wrong side!). Cross the street to get to the underground tram (Stadtbahn / U-Bahn) station. Take tram number 4 towards Universität / Lohmannshof. Exit the tram at Universität. Walk up the stairs and cross the red bridge towards the university main building. When exiting the bridge, take the stairs going down on your right. The X building is the new white and green building next to the “Frauenparkplatz” on the other side of the street. When entering the X building, take the green stairs in front of you to the first floor (E1). Room 201 is along the first corridor, to the left of the coffee shop.

This link provides a map of the university campus. The workshop takes place in building number 2, “Gebäude X”; the grey dashed line represents the tram line.