Arbeitsgruppenseminar Genominformatik

392188 Stoye Winter 2012/13 Thursday 16-18 in U10-146 ekvv

Course Description

In this seminar, talks report on current research topics of the Genome Informatics research group.


Date | Topic | Name
05.10.2012, 14 c.t. | Biomarker discovery for personalized treatment - Mining complex, multidimensional and heterogeneous data sets | Jasmin Straube
11.10.2012 | Polynomial Time Algorithms for Estimating Transcript Expression with RNA-seq on Gene Graphs with Some Bounded Parameters | Veli Mäkinen
18.10.2012 | (Many people in Rio de Janeiro) |
25.10.2012 | Organizational matters | Jens Stoye
01.11.2012 | (All Saints' Day) |
08.11.2012 | MetaGX – Taxonomic classification of BWT compressed metagenome reads | Christina Ander
15.11.2012 | Computational assembly of the Black Death agent genome | Cedric Chauve
22.11.2012 | (Several people in Jena) |
29.11.2012 | Modeling Food-borne Disease Outbreaks | Daniel Dörr
06.12.2012 | Unraveling Overlapping Deletions by Agglomerative Clustering | Roland Wittler
13.12.2012 | HI from genotypes and from fragments in pedigrees | Simone Zaccaria
20.12.2012 | CeBiTec X-Mas Party |
03.01.2013 | (too cold) |
10.01.2013 | Integration of differential expression analysis capabilities into VAMP utilizing baySeq, DESeq and an own approach | Kai-Bernd Stadermann
17.01.2013 | (too much snow) |
24.01.2013 | Indel Reversal Distance | Simone Zaccaria
31.01.2013 | Consolidation algorithms for genomes fractionated after higher order polyploidization | Katharina Jahn


Polynomial Time Algorithms for Estimating Transcript Expression with RNA-seq on Gene Graphs with Some Bounded Parameters
Veli Mäkinen

Aligning RNA-sequencing reads to the genome results in coverage values for exons and plausible splice variants. These coverage values can be assigned as weights in an exon chaining graph G=(V,E), where the nodes V are exons and the edges E are splice variants. An RNA transcript candidate is a path from an exon (node) s in V containing a start codon to an exon (node) t in V containing an end codon. We study the problem of finding k transcripts (paths) from s to t, each associated with an expression level, such that together they best explain the coverages (weights) of the exons (nodes) and splice variants (edges). We give a dynamic programming algorithm that finds the best paths and their expression levels and runs in polynomial time, assuming constant bounds on k, on the maximum degree of G, and on the expression levels. We also show that the problem is NP-hard in general. Experimental results show that our method is very competitive: under stringent conditions on prediction accuracy, it provides better precision and recall than popular tools such as Cufflinks and IsoLasso. Joint work with Alexandru Tomescu and Anna Kuosmanen.
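To make the objective concrete, the following minimal sketch scores how well a set of s-t paths with expression levels explains the node and edge weights of a toy exon graph. It is not the dynamic programming algorithm from the talk: the cost function (sum of absolute differences), the names `fit_error` and `best_levels`, and the toy weights are all assumptions for illustration. The brute-force search over levels only shows why bounded k and bounded expression levels keep the search space small.

```python
from collections import defaultdict
from itertools import product


def explained_coverage(paths, levels):
    """Add each path's expression level to every node and edge it uses."""
    node_cov = defaultdict(float)
    edge_cov = defaultdict(float)
    for path, level in zip(paths, levels):
        for node in path:
            node_cov[node] += level
        for edge in zip(path, path[1:]):
            edge_cov[edge] += level
    return node_cov, edge_cov


def fit_error(node_weights, edge_weights, paths, levels):
    """Assumed cost: sum of |observed - explained| over all nodes and edges."""
    node_cov, edge_cov = explained_coverage(paths, levels)
    err = sum(abs(w - node_cov[v]) for v, w in node_weights.items())
    err += sum(abs(w - edge_cov[e]) for e, w in edge_weights.items())
    return err


def best_levels(node_weights, edge_weights, paths, max_level):
    """Brute force over integer levels 0..max_level for a fixed set of paths."""
    candidates = product(range(max_level + 1), repeat=len(paths))
    return min(
        candidates,
        key=lambda lv: fit_error(node_weights, edge_weights, paths, lv),
    )


# Hypothetical toy graph: two transcripts s-a-t and s-b-t.
node_weights = {"s": 10, "a": 7, "b": 3, "t": 10}
edge_weights = {("s", "a"): 7, ("s", "b"): 3, ("a", "t"): 7, ("b", "t"): 3}
paths = [["s", "a", "t"], ["s", "b", "t"]]

perfect = fit_error(node_weights, edge_weights, paths, [7, 3])
worse = fit_error(node_weights, edge_weights, paths, [5, 5])
chosen = best_levels(node_weights, edge_weights, paths, max_level=10)
print(perfect, worse, chosen)
```

In this toy instance the levels (7, 3) explain every weight exactly, whereas splitting the expression evenly leaves a residual on every node and edge that only one of the two paths uses.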

Computational assembly of the Black Death agent genome
Cedric Chauve

The genome of a 650-year-old Yersinia pestis bacterium, the agent responsible for the medieval Black Death, was recently sequenced and assembled into 2,105 contigs from the main chromosome. We apply computational paleogenomics methods, aimed at reconstructing ancestral genome organizations, to correct, order and complete the contig set into a full chromosome. The reconstruction highlights the exceptional mode of structural evolution, by rearrangements or insertion dynamics, in the Yersinia clade.

Modeling Food-borne Disease Outbreaks
Daniel Dörr

Over the last decades, the globalization of trade has significantly altered the topology of food supply chains. Even though food-borne illness has been consistently on the decline, the hazardous impact of contamination events has grown. Possible contaminants include pathogenic bacteria, viruses, parasites, toxins or chemicals. Contamination can occur accidentally, e.g. due to improper handling, preparation, or storage, or intentionally, as the melamine milk crisis showed.

To identify the source of a food-borne disease it is often necessary to reconstruct the food distribution networks spanning different distribution channels or product groups. The time needed to trace back the contamination source ranges from days to weeks and significantly influences the economic and public health impact of a disease outbreak.

In this work we describe a model-based approach designed to speed up the identification of the source of a food-borne disease outbreak. We exploit the geospatial information of wholesaler-retailer food distribution networks limited to a given food type and apply a gravity model for food distribution from retailer to consumer. We present a likelihood framework that determines, based on geo-coded case reports, how likely each wholesale source (or set of sources) is to have distributed the contaminated food. The developed method is independent of the underlying food distribution kernel and is thus particularly applicable to empirical distributions of food acquisition.
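The idea of combining a gravity model with a likelihood over case reports can be sketched as follows. This is an illustrative toy model, not the framework presented in the talk: the power-law kernel (retailer size divided by distance to an assumed exponent), the discretization of consumers into population cells, and all names and numbers are assumptions. For each candidate wholesaler, the sketch computes the probability that a case falls in each cell if exactly the retailers supplied by that wholesaler sold contaminated food, and sums the log-probabilities over the reported cases.

```python
import math


def shop_probs(cell_xy, retailers, exponent):
    """Assumed gravity kernel: P(retailer | cell) ~ size / distance**exponent."""
    scores = {}
    for name, (xy, size) in retailers.items():
        dist = math.hypot(cell_xy[0] - xy[0], cell_xy[1] - xy[1])
        scores[name] = size / max(dist, 1e-6) ** exponent
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


def source_log_likelihood(case_cells, cells, retailers, supplied, exponent=2.0):
    """Log-likelihood of the case locations if `supplied` retailers are contaminated."""
    exposure = []
    for xy, population in cells:
        probs = shop_probs(xy, retailers, exponent)
        # Expected contaminated purchases in this cell (up to a constant).
        exposure.append(population * sum(probs[r] for r in supplied))
    total = sum(exposure)
    return sum(math.log(exposure[c] / total) for c in case_cells)


# Hypothetical toy data: two retailers, two population cells, two wholesalers.
retailers = {"R1": ((0.0, 0.0), 1.0), "R2": ((10.0, 0.0), 1.0)}
cells = [((1.0, 0.0), 100), ((9.0, 0.0), 100)]
wholesalers = {"W1": {"R1"}, "W2": {"R2"}}
case_cells = [0, 0, 0, 1]  # index of the cell each reported case lives in

scores = {
    w: source_log_likelihood(case_cells, cells, retailers, supplied)
    for w, supplied in wholesalers.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the kernel only enters through `shop_probs`, it could be swapped for an empirical food acquisition distribution without touching the likelihood computation.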

Unraveling Overlapping Deletions by Agglomerative Clustering
Roland Wittler

Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping allow simultaneous analysis of data from several samples in order to, e.g., distinguish tumor-specific from patient-specific variations. However, it has been shown that, especially in this setting, overlapping deletions need to be taken into consideration explicitly. Existing tools have only minor capabilities to call overlapping deletions and are unable to unravel complex signals to obtain consistent predictions.

We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions about the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings into clusters based on a similarity score that takes both the putative location and the size of a deletion into account.
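The iterative merging described above can be illustrated with a minimal agglomerative clustering sketch. It is not the method from the talk: representing each mapping as a (position, size) pair of its putative deletion, the similarity formula, the scale parameters, and the stopping threshold are all assumptions chosen for illustration.

```python
def similarity(a, b, pos_scale=200.0, size_scale=100.0):
    """Assumed score in (0, 1]: decreases with location and size differences."""
    pos_diff = abs(a[0] - b[0]) / pos_scale
    size_diff = abs(a[1] - b[1]) / size_scale
    return 1.0 / (1.0 + pos_diff + size_diff)


def centroid(cluster):
    """Mean putative position and size of the mappings in a cluster."""
    n = len(cluster)
    return (sum(p for p, _ in cluster) / n, sum(s for _, s in cluster) / n)


def agglomerate(mappings, threshold=0.5):
    """Repeatedly merge the two most similar clusters until none is similar enough."""
    clusters = [[m] for m in mappings]
    while len(clusters) > 1:
        score, i, j = max(
            (similarity(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical mappings as (putative deletion position, putative deletion size).
mappings = [
    (1000, 500), (1010, 510), (990, 495),  # support one deletion
    (1050, 1500), (1060, 1490),            # overlapping, much larger deletion
    (5000, 300),                           # isolated, possibly erroneous mapping
]
clusters = agglomerate(mappings)
for cluster in clusters:
    print(len(cluster), centroid(cluster))
```

Because the score uses size as well as location, the two deletions stay in separate clusters even though they overlap on the reference, and the isolated mapping remains a singleton.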

We demonstrate that agglomerative clustering is suitable for predicting deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows that, compared to other methods that can output overlapping clusters, our approach achieves high accuracy in separating overlapping from single deletions.