
Research

The primary material analyzed in genome informatics consists of genomic sequences. Beyond the acquisition and basic analysis of these data, the next challenge is to extract the higher-level information encoded in them, which calls for sound mathematical models, efficient algorithms, and user-friendly software.

Research in the Genome Informatics group spans a broad spectrum of this exciting field, from the low level of DNA sequence comparison up to the higher levels of comparative genomics, as well as the development of better research infrastructure.

Efficient Comparison of DNA Sequences (aka team kmers)

Luca

Lucas

The amount of publicly available sequencing data is growing faster than computational power. Searching for a sequence of interest among datasets is a fundamental need; however, no method scales to the dozens of petabytes of data already available today. Thus, new computational methods are required to perform a search against datasets.

Queries on large-scale datasets are usually answered by indexing all k-mers (words of length k) of the sequences, typically in an approximate membership query (AMQ) data structure. The proportion of the query's k-mers that are found in the AMQ then indicates whether the query is present in a dataset. AMQ data structures typically require less space than the set of sequences they index. However, they suffer from a non-zero false positive rate, which biases downstream analysis.
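The k-mer indexing and query scheme described above can be sketched with a Bloom filter, one common AMQ data structure. This is a minimal illustration only: the class `BloomFilter`, the helpers `index_dataset` and `shared_fraction`, and the parameters `m` (number of bits) and `h` (number of hash functions) are hypothetical names chosen for this example, and reverse complements are ignored for simplicity.

```python
import hashlib


def kmers(seq, k):
    """Yield every word of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Toy Bloom filter with m bits and h hash functions (salted BLAKE2b)."""

    def __init__(self, m=1 << 16, h=3):
        self.m = m
        self.h = h
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # One bit position per hash function, derived from a salted digest.
        for seed in range(self.h):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positive).
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))


def index_dataset(sequences, k, m=1 << 16, h=3):
    """Insert all k-mers of all sequences of one dataset into a Bloom filter."""
    bf = BloomFilter(m=m, h=h)
    for seq in sequences:
        for km in kmers(seq, k):
            bf.add(km)
    return bf


def shared_fraction(query, bf, k):
    """Fraction of the query's k-mers reported as present by the filter."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return 0.0
    return sum(km in bf for km in query_kmers) / len(query_kmers)


if __name__ == "__main__":
    index = index_dataset(["ACGTACGTGACCA"], k=5)
    print(shared_fraction("GTACGTG", index, k=5))
    print(shared_fraction("TTTTTTTTTT", index, k=5))
```

A query that is a substring of an indexed sequence always yields a fraction of 1.0, because Bloom filters have no false negatives. For unrelated queries the fraction can be above zero purely because of false positives, which is the bias mentioned above.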

AMQ data structures can be generalized to additionally record the abundance of indexed elements; they are then called “counting AMQ” data structures. The abundance information is crucial for many biological applications such as transcriptomics or metagenomics. However, counting AMQ data structures suffer from both false positives and overestimated abundances.
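To make the overestimation effect concrete, here is a minimal sketch of a counting Bloom filter, one simple example of a counting AMQ. The class `CountingBloomFilter` and its parameters `m` (number of counters) and `h` (number of hash functions) are hypothetical names for this illustration. The reported abundance is the minimum over the counters an element hashes to, so it can never be lower than the true count, but collisions can push it higher.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every word of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountingBloomFilter:
    """Toy counting Bloom filter with m counters and h hash functions."""

    def __init__(self, m=1 << 12, h=3):
        self.m = m
        self.h = h
        self.counters = [0] * m

    def _positions(self, item):
        # Distinct counter positions for this item (salted BLAKE2b digests).
        positions = set()
        for seed in range(self.h):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            positions.add(int.from_bytes(digest, "little") % self.m)
        return positions

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def estimate(self, item):
        # Minimum over the item's counters: an upper bound on the true count.
        return min(self.counters[p] for p in self._positions(item))


if __name__ == "__main__":
    seq = "ACGTACGTACGGTACGT"
    true_counts = Counter(kmers(seq, 4))

    # A deliberately tiny filter so that collisions are likely.
    cbf = CountingBloomFilter(m=8, h=2)
    for km in kmers(seq, 4):
        cbf.add(km)

    for km, count in sorted(true_counts.items()):
        print(km, "true:", count, "estimated:", cbf.estimate(km))
```

With a small `m`, several estimates typically exceed the true counts, and k-mers that were never inserted may receive a non-zero estimate. These are the two error types named above.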

We propose strategies to reduce the false positive rate and overestimation rate of both AMQ and counting AMQ data structures.

Fast Heuristic Local Alignment Search in Pangenome Graphs

The advent of high-throughput sequencing raises a major concern about the storage and indexing of the data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences, ranging from tens to several thousands of genomes per species. These collections of highly similar and redundant sequences are also known as pangenomes. Pangenomes are often represented by graph data structures such as colored de Bruijn graphs (CDBGs), in which vertices represent colored k-mers, that is, words of length k associated with the genomes in which they occur. CDBGs may also be compacted by merging the vertices of unique non-branching paths. Representing pangenomes as compacted CDBGs is beneficial, but it requires modifications of the methods used to query the data. We study the problem of finding high-scoring local alignments between a query sequence and a compacted CDBG that are likely to represent sequence homology. Our work is in line with the popular BLAST algorithm. An implementation of our method is available
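The data structure described above can be illustrated with a small sketch that collects colored k-mers and then compacts non-branching paths into single vertices (unitigs). This only illustrates the graph representation, not the alignment search itself. The function names are hypothetical, and for simplicity the sketch ignores reverse complements and keeps colors per k-mer rather than per unitig.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) it occurs in."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(kmer, kmer_set):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmer_set]


def predecessors(kmer, kmer_set):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmer_set]


def compact(kmer_set):
    """Merge unique non-branching paths of k-mers into unitig strings."""
    visited = set()

    def starts_unitig(km):
        # A k-mer starts a unitig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        preds = predecessors(km, kmer_set)
        return len(preds) != 1 or len(successors(preds[0], kmer_set)) != 1

    def extend(km):
        path = [km]
        visited.add(km)
        while True:
            succ = successors(path[-1], kmer_set)
            if len(succ) != 1:
                break
            nxt = succ[0]
            if nxt in visited or len(predecessors(nxt, kmer_set)) != 1:
                break
            path.append(nxt)
            visited.add(nxt)
        # Spell the path: first k-mer plus the last character of each next one.
        return path[0] + "".join(p[-1] for p in path[1:])

    unitigs = [extend(km) for km in sorted(kmer_set)
               if km not in visited and starts_unitig(km)]
    # Any k-mers left over lie on isolated cycles without a natural start.
    for km in sorted(kmer_set):
        if km not in visited:
            unitigs.append(extend(km))
    return unitigs


if __name__ == "__main__":
    genomes = {"g1": "ACGTT", "g2": "ACGTA"}
    colors = colored_kmers(genomes, k=3)
    print(compact(set(colors)))
    print(colors)
```

In this toy input the two genomes share the path ACG, CGT, which is merged into the unitig ACGT, while the branch after CGT leaves GTA and GTT as separate vertices carrying different colors.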

Tizian

Computational Comparative Genomics

Comparative Pangenomics

Genome rearrangements have been studied extensively in theoretical works of comparative genomics. These results, however, have only been applied on a limited scale to real genomes. The continuing progress of sequencing projects and technology has made more and more high-quality genomes available and has even enabled pangenomic analyses, that is, analyses that include all available genomes of a species. Pangenomics and theoretical rearrangement studies use remarkably similar graph data structures. Given the abundance of theoretical results in comparative genomics, it is likely that many of these results can be applied in pangenomics. Conversely, the abundance of practical results in the construction of pangenome graphs can likely help these theoretical results find more real-world applications.

Orthology Inference via Large-scale Rearrangements

Computing distances based on large-scale rearrangements between two family-annotated genomes (Bohnenkämper et al., 2021) was converted into a method that takes genome rearrangements into consideration for inferring gene orthologies of two genomes (Rubert, Martinez & Braga, 2021). For a set of k genomes, the inference of pairwise gene orthologies is the core of OrthoFFGC, a tool for inferring gene families across the k input genomes (Rubert, Doerr & Braga, 2021; Rubert & Braga, 2023; OrthoFFGC-website).

Sequence-based Phylogenomics

Comparative genomics often involves the reconstruction of phylogenies. The ever-increasing number of available genomes, many of which are published in an unfinished state or lack sufficient annotation, poses challenges to traditional phylogenetic inference methods that rely on the comparison of marker sequences. Whole-genome approaches have emerged as a solution to these challenges, but as these approaches are based on pairwise comparisons between genomes, their runtime increases quadratically with the number of input sequences, making them unsuitable in large-scale scenarios.

SANS (tool-website; Rempel and Wittler, 2021; Wittler, 2020) is a whole-genome based, alignment- and reference-free approach that does not rely on a pairwise comparison of genomes. In a pangenomic approach, evolutionary relationships are determined based on the similarity of the whole sequences. Sequence segments (k-mers) shared by a subset of genomes are interpreted as a phylogenetic split indicating the closeness of these genomes and their separation from the other genomes.
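The idea of reading shared k-mers as phylogenetic splits can be sketched as follows. This is a simplified illustration, not the SANS implementation: the function `kmer_splits` is a hypothetical name, each split is weighted by a plain k-mer count, reverse complements are ignored, and the weighting and filtering steps of the actual tool are not reproduced.

```python
from collections import Counter


def kmer_splits(genomes, k):
    """Count, for each subset of genomes, the k-mers occurring in exactly that subset."""
    occurrence = {}
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            occurrence.setdefault(seq[i:i + k], set()).add(name)

    all_names = frozenset(genomes)
    splits = Counter()
    for names in occurrence.values():
        subset = frozenset(names)
        # k-mers present in every genome separate nothing and are skipped.
        if subset != all_names:
            splits[subset] += 1
    return splits


if __name__ == "__main__":
    genomes = {"A": "ACGTT", "B": "ACGTA", "C": "GGGTA"}
    for subset, weight in kmer_splits(genomes, k=3).most_common():
        rest = sorted(set(genomes) - subset)
        print(sorted(subset), "|", rest, "weight:", weight)
```

Each counted subset corresponds to a split that separates those genomes from all others. Note that no pairwise comparison of genomes is performed; every genome is scanned once.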

Evolution of Gene Clusters

We integrate the concept of conserved gene clusters into the framework of phylogenetics. Here, the focus is no longer on the discovery of new gene clusters, but on their evolution. Given the topology of a phylogenetic tree and the gene orders of the leaf nodes, our methods reconstruct ancestral gene orders at the internal nodes under different evolutionary (rearrangement) models (see Rococo, RINGO, PhySca).
In addition, the development of ancient DNA (aDNA) sequencing led us to the problem of integrating this additional data in the reconstruction of ancestral genomes, aiming to scaffold fragmented aDNA assemblies and to improve the global reconstruction of all ancestors in the phylogeny.

Infrastructure

SANS-ambages

We maintain a software tool for alignment-free, whole-genome based phylogenomics implementing the SANS approach (see above: “Sequence-based Phylogenomics”). The current version “SANS ambages” (abundance-filter, multi-threading and bootstrapping on amino-acid or genomic sequences) provides several new features: besides processing DNA sequences (whole genomes or assemblies), SANS ambages can also work at the amino acid level, taking protein sequences (translated or untranslated) as input. Further, the ability to process read data has been enhanced by the option to filter out low-abundance sequence segments. Multiple input sequences can be processed in parallel, and bootstrapping allows the output to be augmented with confidence values. SANS is hosted as a de.NBI service and can easily be obtained from our GitLab repository.

SOCKS & PanBench

A major challenge in computational pangenomics is the memory- and time-efficient analysis of multiple genomes in parallel. Many software tools used in current research follow the idea of colored k-mer sets to efficiently index and query a large collection of strings, but they differ in their implementation strategies. The aim of this project is to evaluate the different solutions and establish a standard interface.

The SOCKS interface (Software for colored k-mer sets) defines a common set of core features and standard input and output formats that software tools in computational pangenomics should implement. It aims to enhance the comparability and interoperability of these tools for the benefit of both developers and users. A detailed description of the interface including some examples can be found on the dedicated project page.

PanBench (Pangenomics Benchmark and Workbench) is an open catalog of software tools for computational pangenomics and is an example of what such a common interface makes possible. It allows users and developers to search for tools by different criteria, compare the performance of these tools, and test each tool in a user-friendly web interface before downloading and installing the software on their local machine.
