Mini Workshop - January 26, 2012

High Performance Sequence Analysis

  • January 26, 2012
  • 15:30-18:00
  • room: U10-146

Speakers

  • Jochen Blom (Uni Bielefeld)
  • Sascha Steinbiss (ZBH Hamburg)
  • Birte Kehr (FU Berlin)

Schedule


15:30 Jochen Blom Exact and Complete Short Read Alignment to Microbial Genomes Using GPU Programming
16:15 Sascha Steinbiss A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences
17:15 Birte Kehr STELLAR: Fast and Exact Local Alignments

Abstracts

Jochen Blom: Exact and Complete Short Read Alignment to Microbial Genomes Using GPU Programming

The introduction of next generation sequencing techniques and especially the high-throughput systems Solexa (Illumina Inc.) and SOLiD (ABI) made the mapping of short reads to reference sequences a standard application in modern bioinformatics. Short read alignment is needed for reference based re-sequencing of complete genomes as well as for gene expression analysis based on transcriptome sequencing. Several approaches were developed during the last years allowing for a fast alignment of short sequences to a given template. Methods available to date use heuristic techniques to gain a speedup of the alignments, thereby missing possible alignment positions. Furthermore, most approaches return only one best hit for every query sequence, thus losing the potentially valuable information of alternative alignment positions with identical scores.

We present SARUMAN (Semiglobal Alignment of short Reads Using CUDA and NeedleMAN-Wunsch), a dedicated exact mapping approach for microbial genomes that returns all possible alignment positions of a read in a reference sequence under a given error threshold.

Alignments are computed in parallel on graphics hardware, facilitating an considerable speedup of this normally time consuming step. Combining an exact filter algorithm with CUDA-accelerated alignments, we were able to rapidly align reads to microbial genomes in time comparable or even faster than all published approaches. SARUMAN was compared to and has proven to be as fast or even faster than SOAP2, Bowtie, BWA, SHRiMP, and Pass, while still providing an exact, complete and optimal result. At the same time SARUMAN runs on every standard Linux PC with a CUDA compatible graphics accelerator.

Sascha Steinbiss: A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences

Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations.

This talk presents a recently published new space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2+8E-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to many metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.

In addition, the talk will cover some new developments in the GtEncseq, particularly regarding efficient storage of short read sequences. Finally, I will briefly present the larger software system into which the GtEncseq is integrated.

Birte Kehr: STELLAR: Fast and Exact Local Alignments

Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. STELLAR is a new local pairwise aligner that has full sensitivity for eps-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. The aligner is composed of two steps, filtering and verification. It applies the SWIFT algorithm for lossless filtering, and a new verification strategy that we have proved to be exact.

In this talk, I will introduce STELLAR and show results on simulated and real genomic data that confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments.