The primary material analyzed by genome informatics are genomic sequences. Beyond the acquisition and basic analysis of these data, the next challenge is to extract the higher-level information encoded in them, which poses the need for sound mathematical models, efficient algorithms, and user-friendly software.
Research in the Genome Informatics group spans a broad spectrum in this exciting field, from the low level of DNA sequence comparison up to the higher levels of comparative genomics, metagenomics and phylogenetics.
A fundamental task in biological sequence analysis is the comparison of two DNA strings, with the aim of finding regions that match each other exactly or approximately.
Currently, we are using our index-based matching algorithm SWIFT as a basic component of the recently developed r2cat and treecat methods. These are employed in the context of whole genome sequencing for closing the gaps that remain between the contigs after a standard assembly of shotgun reads.
Genomes often contain several repetitive regions that do not encode any proteins, which may result in the development of genetic disorders. We have investigated the problem of identifying families of repetitive sequences in genomes that are only partially sequenced. Our approach is based on de Bruijn graphs, whose highly regular structure results in several appealing combinatorial properties.
The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and indexing of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes.
In the first and second year of this project, an alignment-free and reference-free data structure for indexing a pan-genome has been developed. A pan-genome is represented as a colored de Bruijn graph in which vertices represent colored k-mers, words of length k associated with the identity of the strains in which they occur. For storing and traversing such a graph, we proposed the Bloom Filter Trie (BFT), a trie that compresses and indexes strings of fixed-length k and their colors. The BFT is based on the burst trie and integrates Bloom filters to navigate in the trie and accelerate the graph traversal. The specificity of our work compared to existing pan-genome tools resides in four main characteristics:
Experimental results prove better performance compared to another state-of-the-art data structure. An API written in C is provided to access all functionalities of the the data structure as well as colored de Bruijn graph operations such that methods can be designed on top of the BFT. The BFT source code and documentation are available here.
The data structure was presented at the 15th International Workshop on Algorithms in Bioinformatics in Atlanta, USA. An extended version of the paper was published in the journal Algorithms for Molecular Biology in 2016.
Pan-genomes are also ideal for compression. However, HTS-specific compression tools, developed to counter the massive growth of sequencing data and reduce storage costs, are not designed to process them. These methods can only compress genomes individually such that the compression ratio of a pan-genome does not improve with newly compressed genomes. During the third year of this project, a new alignment-free and reference-free compression method has been developed. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update archives with new genome sequences without full decompression of the archive. This method can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, this method outperforms all other tested tools with a 30 % compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.
We are working on developing methods for the functional analysis of a pan-genome. In order to store a pan-genome, we use the Bloom Filter Trie (BFT), in which a pan-genome is represented as a colored de Bruijn graph (C-DBG). The first step is the identification of the core genome. Matching sequences can be interrupted by SNPs or other structural variations, resulting in branching nodes and additional paths in the graph. Therefore, we first introduce a formal definition of a core genome in a C-DBG.
For a set of input genomes, we consider the core as being composed of sequences shared by at least a required number of genomes, called the quorum. In order to model variations and include them in the core genome, we introduce the concept of a core bubble. A distance is used to limit the maximal distance allowed between two core paths, where a core path is a path containing only core nodes. Besides bubbles, we introduce the defi nition of connected and disconnected core paths in order to cover the remaining cases. Currently, we are developing an algorithm for detecting a core genome in a given C-DBG.
The goal of comparative genomics is to compare genomes, not any more at the level of DNA sequence, but rather at the higher level of gene content and gene order.
Our motivation is to devise a quantitative measure for the distance between two genomes, which can in turn be used in phylogenetic studies. A reasonable distance measure is the minimum number of rearrangement operations that transform one genome into another. In this context, we formally defined the Double-Cut-and-Join (DCJ) distance that includes all classical genome rearrangement operations, also for circular chromosomes. Based on this, we have developed the first linear-time algorithm for computing the so-called Hannenhalli-Pevzner distance between two multichromosomal linear genomes. This algorithm was recently implemented in the software UniMoG along with the most efficient algorithms for the genome rearrangement models that employ only DCJ operations, only restricted DCJ operations, only translocations or only inversions. After the DCJ-indel model, we are currently studying the inversion-indel model.
Virtually all algorithmic methods in genome-scale comparative genomics rely on gene family information. While the concept of gene families is well established from a biological point of view, in practice gene families are often predicted by clustering methods. Their outcome is prone to contain errors which deteriorate subsequent analyses.
We promote the idea that genome-scale comparative genomics studies can be performed without prior gene family assignment. Instead of predicting homologies between genes, we propose direct use of similarity values because such information not only allows to make more substantiated choices in resolving gene order in subsequent analyses, but can sometimes better reflect the biological reality. In support of our case, we recently presented new approaches to calculate two rearrangement distances: the number of conserved adjacencies, which is a similarity measure related to the breakpoint distance, and the Double Cut and Join distance (DCJ), without the use of gene families.
Gene clusters are sets of genetic markers that maintain close physical proximities to each other in several genomes. These genetic markers often embody protein coding genes that are functionally and/or regulatorily associated.
Traditional methods take into account one-dimensional distances in the genomic sequence to determine physical proximities. They do not consider any spatial information regarding the structure of the chromatid in the cell and therefore might miss spatial gene clusters within the genome. However, spatial structure and function seem to be correlated in many cases.
Nowadays, it is possible to gain three-dimensional information about the chromatid in the cell (e.g. by Hi-C experiments).
We try to include this Hi-C data to improve the gene cluster model in a spatial manner which - on the other hand - seems to raise the computational complexity. The aim is to develop a polynomial algorithm that is able to find all common spatial gene clusters between two genomes.
We aim at the integration of the concept of gene clusters into the framework of phylogenetics. Here, the focus is not any more on the discovery of new gene clusters, but on their evolution. Given the topology of a phylogenetic tree and the gene orders of the leaf nodes, we studied the reconstruction of parsimonious ancestral gene orders at the internal nodes under different evolutionary (rearrangement) models (see Rococo, RINGO, PhySca).
In addition, the recent development of ancient DNA (aDNA) sequencing led us to the problem of integrating this additional data in the reconstruction of ancestral genomes, aiming to scaffold fragmented aDNA assemblies and to improve the global reconstruction of all ancestors in the phylogeny.
Protein domains are described as a basic structural, evolutionary and functional units of proteins (Holm and Sander, 1994). An individual domain is an independent folding unit in a polypeptide. Although there are disordered protein regions(they lack of a stable secondary or terciary structure), domains are considered the functional building blocks of proteins which can even recombine to create new functional proteins. In prokaryotes, the protein repertoire has aproximately 66% of multidomain proteins, while in eukaryotes is up to 80%. Since domains are strongly related to the protein function itself, finding the domain sets between different species in various scenarios might give us insight on the gene clusters involved in diverse cellular processes.
Next-Generation Sequencing not only allows to sequence the “average” genome of an organism using relatively little money and time, but it also supports the investigation of the genomes of single individuals. These data are the basis for new approaches in scientific and clinical research.
Highly variable cell surface receptors of B cells and T cells are, as part of the human immune system, responsible for fending of everyday molecular threats to human life, for example pathogenous infective agents. Single nucleotide variation in the sequence of individual receptor molecules create a variable pool, the receptor repertoire, able to encounter the vast molecular variability of microbial pathogenic attack. Loss of variation in the receptor pool is thought to result in increasing susceptibility to severe infections possibly leading to death.
Analysis of the immune receptor repertoire for clinical and scientific research gains new insight into the unique and adaptive immune system. Function during health, infection, cancer, and autoimmune diseases project only small areas of target research.
Historically, techniques to analyze immune repertoires were limited by the amount of circulating lymphocyte cells in blood (~10^6) and only the advent of high-throughtput sequencing (HTS) techniques made comprehensive analysis of this unique system feasible. Still the huge data sets resulting from nowadays technologies pose challenges to analysis by being prone to error at frequencies similar to the small quantities of rare receptor variants. We developed one of the first tools, TCRProfiler, that is able to analyze HTS immune receptor data sets on single or multi-CPU systems. The tool aims to improve sequencing error correction and preparatory errors like PCR-induced single nucleotide substitution errors and produces comprehensive results in form of statistical compilation and visualization.
Structural variations in human genomes, such as insertions, deletions, or rearrangements, play an important role in cancer development. Next-Generation Sequencing (NGS) technologies have been central in providing ways to detect such variations, e.g. by paired-end mapping.
Many tools exists to efficiently and accurately call variations from a given NGS sample with respect to a reference genome, and straight-forward approaches became accepted to compare a call set to entries in variant databases or to other call sets. We focus on the precision work:
(1) By modeling a mixture of several cell-lines, we want to detect and unravel overlapping variations. Our deletion detection tool AggloIndel is specifically designed to cluster short-read paired-end data into possibly overlapping predictions.
(2) We address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities. Given a reference sequence
…ACTGCTGCA…, both deleting the first
CTG at or deleting the last
TGC results in exactly the same sequence
…ACTGCA… and should thus be considered equivalent although not even overlapping. Our tool REDDY allows an error-tolerant and repeat-aware comparison of different sets of deletions or insertions, considering duplicates within one set as well.
Metagenomics studies microbial communities in their natural environment. The MGX project aims at the design and implementation of a novel analysis platform for metagenomics data, which will provide the necessary infrastructure to cope with the large amounts of sequence data and offer easy mechanisms for the inclusion of novel algorithms into the platform. Upon release, the platform will feature analysis capabilities employing state-of-the-art methods and high-quality result visualizations, thereby enabling researchers to easily obtain answers to their scientific questions.
Metatranscriptomics is a relatively new field of research in microbiology and gives the opportunity to study communities of microorganisms using the RNA of all microorganism present in an environment. In contrast to metagenomics the focus does not lie on the potential of the community, but on their actual activity. So the sequence reads must be analyzed based on function and taxonomy. In this exciting field, our research is directed at the analysis, visualization and the comparison of metatranscriptomes.
Amplicon sequencing is often used to analyze the composition of a
bacterial community, using e.g. the 16S rRNA marker gene. Prior to sequencing,
the DNA is amplified via PCR where chimeras can arise. These are
artificial, single-stranded DNA sequences formed by cross hybridization of
sequences during the amplification process. When not removed from the data,
chimeras can have a
falsifying influence on the consecutive analysis.
We developed ChimP, a method that identifies chimeras and the sequences they arose from in a set of NGS amplicon reads. Our algorithm identifies a chimera by aligning a query sequence to a multiple alignment of similar sequences with the Jumping Alignment algorithm. If the alignment of a putative chimeric query ‘jumps’ between the rows of the multiple alignment and a specific score threshold is reached, the query is marked as chimeric.
Modern analytical methods, such as GC-MS or LC-MS, produce high-dimensional data whose manual comparison is impossible. Our ChromA software allows to align multiple experiments using a dynamic time warping procedure in conjunction with a-priori chemical knowledge.
Moreover, we have developed the web-based MeltDB platform that provides a complete solution for the analysis and integration of metabolomics experimental data. MELTDB is coupled to CHROMA and other CeBiTec tools, such as GENDB and EMMA.
The cross-platform processing framework Maltcms, the basis for CHROMA, provides algorithms for peak-detection, -integration, peak- and profile alignment of chromatography-mass spectrometry data. It is complemented by the graphical user interface Maui. Current research focuses on the integration of metabolomics and gel-based proteomics data in our application Proteus.
Cancer is a set of diseases that is caused by accumulating genetic mutations
such as single-nucleotide alterations (SNAs) and structural variations including
copy number aberrations (CNAs). Over time, tumor cells evolve and accumulate
different mutations, leading to a heterogeneous cell population that consists of
different subclonal populations. Information about the different mutations in
the subpopulations and their frequency - the subclonal composition - can help to
identify driver mutations and to choose targeted therapies.
We are developing a method - called Onctopus - that reconstructs the clonal composition of a bulk-sequenced cancer sample using both SNAs and CNAs. We built a joint likelihood model and developed a linear relaxation of our model as a mixed integer linear program that can be solved with stateoftheart solvers.
This is a joint project with Gunnar Rätsch from ETH Zürich and Quaid Morris from University of Toronto.