Workshop - September 13, 2012 - Koper (Slovenia)

Combinatorial Algorithms in Bioinformatics

September 13, 2012
9.00 am - 6.00 pm
Lecture room “Velika predavalnica”, UP FAMNIT, Glagoljaška 8, 6000 Koper, Slovenia (map)

Workshop photos

Photos from the workshop are available here.

Overview

The workshop will take place at the University of Primorska, in the Faculty of Mathematics, Natural Sciences and Information Technologies, in the Slovenian coastal town of Koper. It will take place the day after this year's ESA and WABI conferences in Ljubljana.

The purpose of the workshop is to provide an opportunity for researchers working in the field of Combinatorial Algorithms in Bioinformatics or related areas to present their recent results, or to give a tutorial on their favorite topic in the field. The workshop will be informal in nature. There is no registration fee and there will be no proceedings. On Friday September 14, the day after the workshop, two lecture rooms at the University will be available to use for discussions.

Schedule

9:00am-10:00am	Dan Brown	Fast algorithms for problems in phylogenetics
coffee break
10:30am-12:30pm	Matthias Mnich	Haplotype inference for pedigrees with few recombinations and sites
Cinzia Pizzi	On irredundant tandem motifs
Jens Stoye	Multiple genome comparison based on overlap regions of pairwise local alignments
Katharina Jahn	Algorithmic approaches for consolidating genomes fractionated after higher order polyploidization
Annelyse Thévenin	Gene family assignment-free comparative genomics
lunch break
2:00pm-3:30pm	Marília Braga	Restricted DCJ-indel model: sorting linear genomes with DCJ and indels
István Miklós	Counting and sampling most parsimonious rearrangement scenarios
Kati Rozman	Algorithm for maximum clique problem and ProBiS Algorithm to finding protein-protein binding sites
coffee break
4:00pm-6:00pm	Luis Martínez Fernández	The shortest target string cover superstring problem and applications
Simon Dobrišek	An edit-distance model for the approximate matching of timed strings
Matevž Jekovec	Computer aided melodic analysis using suffix tree
Brona Brejova	Speeding up RNA motif search by data-driven element ordering

Accommodation

Hotel reservation and travel arrangements have to be taken care of directly by participants.

Here are some suggestions regarding hotels in Koper (with prices as of July 8, 2012):

Hotel Vodisek, single room: 48,21 EUR per night (if paid in cash), double room: 71,41 EUR per night (if paid in cash). Located near the historical center of Koper, about 10 minute walk from the workshop venue.
Hotel Bio, single room: 39 EUR per night (if paid in cash), double room: 59 EUR per night (if paid in cash). Located about 30 minute walk from the workshop venue.
Hotel Pristan: single room: 78,01 EUR per night, double room: 122,02 EUR per night. Located near the historical center of Koper, about 10-15 minute walk from the workshop venue.
Hotel Koper: double room: 71 EUR per person per night. Located in the historical center of Koper, by the sea, <5 minute walk from the workshop venue.
Hostel Histria, around 17 EUR per night (cash only), 6-8 bed rooms. Located in the historical center of Koper, about 5 minute walk from the workshop venue.

If any help with booking is needed, please feel free to contact the local organizer (Martin Milanič, martin.milanic@upr.si). You may also contact a secretary at the Faculty in Koper, Janja Kozlovič (janja.kozlovic@upr.si) (please mention in the correspondence that you are planning to attend the workshop).

Travel

The airport closest to Koper is Trieste Airport, about 45' drive from Koper. The second closest airport is Ljubljana Airport, about 1h20' drive from Koper.

If you are planning to attend ESA/WABI before coming to Koper, you might consider public transportation (either train or bus).

If you will be coming to Koper from ESA/WABI or from an airport, you might consider using shuttle transportation GoOpti (which can be quite cheap, especially if booked at least a month in advance).

Participants (33)

Marília Braga, Bioinformatics research group, Inmetro, Brazil
Brona Brejova, Comenius University, Bratislava, Slovakia
Dan Brown, University of Waterloo, Canada
Poly Hannah da Silva, Bioinformatics research group, Inmetro, Brazil
Simon Dobrišek, University of Ljubljana, Slovenia
Péter Erdős, Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary
Pavel Fičur, University of Primorska, Koper, Slovenia
Matjaž Hladnik, University of Primorska, Koper, Slovenia
Katharina Jahn, Bielefeld University, Germany
Tobias Jakobi, Bielefeld University, Germany
Matevž Jekovec, University of Ljubljana, Slovenia
Urša Kačar, University of Primorska, Koper, Slovenia
Annachiara Korchmaros, University of Perugia, Italy
István Kóvacs, University of Primorska, Koper, Slovenia
Klemen Krnel, University of Primorska, Koper, Slovenia
Klavdija Kutnar, University of Primorska, Koper, Slovenia
Luis Martínez Fernández, University of the Basque Country, Bilbao, Spain
István Miklós, Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary
Štefko Miklavič, University of Primorska, Koper, Slovenia
Martin Milanič, University of Primorska, Koper, Slovenia
Matthias Mnich, Saarland University, Saarbrücken, Germany
Cinzia Pizzi, University of Padova, Italy
Kati Rozman, National Institute of Chemistry, Ljubljana, Slovenia
Dragan Stevanović, University of Primorska, Koper, Slovenia, and University of Niš, Serbia
Jens Stoye, Bielefeld University, Germany
Linda Sundermann, Bielefeld University, Germany
Annelyse Thévenin, Bielefeld University, Germany
Jakub Truszkowski, University of Waterloo, Canada
Lev Turk, University of Primorska, Koper, Slovenia
Gabriel Verret, University of Primorska, Koper, Slovenia
Tomáš Vinař, Comenius University, Bratislava, Slovakia
Eyla Willing, Bielefeld University, Germany
Urška Zirnstein, University of Primorska, Koper, Slovenia

Abstracts

Fast algorithms for problems in phylogenetics

Dan Brown, University of Waterloo, Canada

We present two extremely fast algorithms for phylogenetic reconstruction.

In the first, we incrementally add new taxa to a growing phylogenetic tree by essentially doing a randomized binary search for the location of each new taxon, using a clever data structure. This algorithm, called QTree, uses O (n log n) runtime to build a tree with n taxa. In experiments, it gives high accuracy, under somewhat unrealistic expectations for the accuracy of distance measurements in trees. (In practice, it is comparable in quality to other fast phylogenetic methods, while being substantially faster than them.)

In the second algorithm, we start with each taxon in its own tree, and merge trees (not necessarily at their roots, which is the practice in algorithms like Neighbour-Joining) until a single tree remains. The algorithm is sped up by using locality-sensitive hashing to find pairs of nearby sequences in the tree. We also approximately reconstruct the sequences at ancestral positions of the tree, to enable joins of trees where the proper edge is far from the leaves of the tree. The algorithm takes sub-quadratic runtime, and can be shown to yield high-quality trees in Markov models of evolution. It also gives excellent trees in practice.

Both algorithms are joint work with PhD student Jakub Truszkowski.

Haplotype inference for pedigrees with few recombinations and sites

Matthias Mnich, Saarland University, Germany

Pedigrees, or family trees, are graphs giving the relationships between the individuals appearing in the graph, and they are useful for studying inheritance in families. Given a pedigree with all the individuals genotyped at every site, our goal is to find haplotypes with minimum number of required recombinations. Such haplotypes are solutions to the Minimum Recombination Haplotype Configuration problem. We provide fixed-parameter algorithms for the inference of haplotypes for pedigrees, when the number of recombinations and number of sites is small.

On irredundant tandem motifs

Cinzia Pizzi, University of Padova, Italy

The problem of discovering “tandem motifs” consists in the extraction of pair of subwords (m1,m2) from a text string of length n, such that the distance from the beginning of the subwords m1 and m2 is at most d, for a given constant d.

We present some recent results that focus on the task of reducing the redundancy from the candidate set of tandem motifs in a given input string. To this aim the concepts of maximality and irredundancy (that have a well known counterpart for single motifs) are defined in the context of tandem motifs.

Our study showed that the number of non-overlapping irredundant tandem motifs is O(d^2n), which is linear in the length of the input string, given that d is a constant. This improves of one order of magnitude previous compact indexes for tandem motifs extraction.

Multiple genome comparison based on overlap regions of pairwise local alignments

Jens Stoye, Bielefeld University, Germany

Mancheron, Uricaru and Rivals (Nucleic Acids Res. 39:e101, 2011) recently introduced a new approach in the context of multiple genome comparison that allows to detect regions of strong overlaps in a set of pairwise local alignments between several reference genomes and one tar- get genome. Such overlap regions are an important source of information in genome annotation. We introduce a series of algorithms that improve over the approach of Mancheron et al., both in terms of computational complexity and in practical runtime. We also extend the problem definition such that overlaps to different reference genomes can be rated differently and regions overlapping only a subset of the reference genomes are detected.

This is joint work with Katharina Jahn and Henner Sudek.

Algorithmic approaches for consolidating genomes fractionated after higher order polyploidization

Katharina Jahn, Bielefeld University, Germany

It has recently been shown that fractionation, the random loss of excess gene copies after a whole genome duplication event, is a major cause of gene order disruption. When estimating evolutionary distances between genomes based on chromosomal rearrangement, fractionation inevitably leads to significant overestimation of classic rearrangement distances. This bias can be largely avoided when genomes are preprocessed by consolidation“, a procedure that identifies and accounts for regions of fractionation.

In this talk, I introduce the theory of fractionation and present an efficient approach to consolidate fractionated genomes that underwent polyploidization events of higher order. The core algorithmic problem of our method is the identification of fractionation intervals”, genomic regions that may have been rearranged internally, but have (so far) been unaffected by rearrangements exchanging genes from within the interval and genes external to the interval. This problem setting is related to the well-studied field of detecting of Common Intervals, but exceeds its complexity, as the gene content of any ancient genomic region may be spread over several and different intervals in each fractionated polyploid present-day genome. I conclude this talk with some results on plant genomes as well as simulated data that both show the effect of fractionation in ancient hexaploid genomes.

Gene family assignment-free comparative genomics

Annelyse Thévenin, Bielefeld University, Germany

The comparison of relative gene orders between two genomes offers deep insights into functional correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene order analyses often require prior knowledge of homologies between all genes of the genomic data set. Since such information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These hypothetical groups of homologous genes are called gene families.

This talk promotes a new branch of gene order studies in which prior assignment of gene families is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our methods on a data set comprising 12 gamma-proteobacteria from literature.

In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.

Restricted DCJ-indel model: sorting linear genomes with DCJ and indels

Marília Braga, Bioinformatics research group, Inmetro, Brazil

The double-cut-and-join (DCJ) is a model which is able to efficiently sort a genome into another, generalizing the typical mutations (inversions, fusions, fissions, translocations) to which genomes are subject, but allowing the existence of circular chromosomes at the intermediate steps. In the general model many circular chromosomes can coexist in some intermediate step. However, when the compared genomes are linear, it is more plausible to use the so-called restricted DCJ model, in which we proceed the reincorporation of a circular chromosome immediately after its creation. These two consecutive DCJ operations, which create and reincorporate a circular chromosome, mimic a transposition or a block-interchange. When the compared genomes have the same content, it is known that the genomic distance for the restricted DCJ model is the same as the distance for the general model. If the genomes have unequal contents, in addition to DCJ it is necessary to consider indels, which are insertions and deletions of DNA segments. Linear time algorithms were proposed to compute the distance and to find a sorting scenario in a general, unrestricted DCJ-indel model which considers DCJ and indels.

In the present work we consider the restricted DCJ-indel model for sorting linear genomes with unequal contents. We allow DCJ operations and indels with the following constraint: if a circular chromosome is created by a DCJ, it has to be reincorporated in the next step (no other DCJ or indel can be applied between the creation and the reincorporation of a circular chromosome). We then develop a sorting algorithm and give an upper bound for the restricted DCJ-indel distance. The question whether this bound can be reduced so that both the general and the restricted DCJ-indel distances are equal remains open.

Joint work with Poly H. da Silva, Raphael Machado and Simone Dantas.

Counting and sampling most parsimonious rearrangement scenarios

István Miklós, Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary

In this talk, we will give a survey on counting and sampling the most parsimonious genome rearrangement scenarios. In case of DCJs, the exact counting and sampling of solutions seems to be hard, however, stochastic approximations for the number of scenarios (a so-called FPRAS algorithm, Fully Polynomial Randomized Approximation Scheme) as well as almost uniform sampling (FPAUS, Fully Polynomial Almost Uniform Sampling) are possible. The DCJ median is known to be NP-complete, and thus, under standard complexity theory assumptions, it is unlikely that even stochastic approximations are possible for the number of solutions to the median problem. In case of SCorJ (Single Cut or Join) model, counting the exact number of SCorJ sorting scenarios between two genomes is possible in polynomial running time, and also it is possible to sample sharp the uniform distribution of the solutions in polynomial running time. One most parsimonious solution for an arbitrary phylogenetic tree can be found in polynomial running time, hence at least stochastic approximations for the number of solutions seems to be expectable. However, we report here a negative result: the number of solutions cannot be stochastically approximated unless the complexity classes RP and NP are the same (which is generally assumed not to be the case). Similarly, if P does not equal to NP, then the exact number of solutions on an arbitrary phylogenetic tree cannot be calculated in polynomial running time. In case of the Hannenhalli-Pevzner model, we have only partial results and conjectures, some of them leading to very nice open combinatorial problems, from which we will introduce our favorite conjecture, the so-called four reversal conjecture. This is a joint work with Eric Tannier, Sandor Z. Kiss, Bence Melykuti and Krister Swenson.

Algorithm for maximum clique problem and ProBiS Algorithm to finding protein-protein binding sites

Kati Rozman, National Institute of Chemistry, Ljubljana, Slovenia

Maximum clique search algorithm in an undirected graph is described. Ordering vertices and a vertex coloring has been improved and used to provide bounds to the size of the maximum clique in a basic algorithm which finds a maximum clique. This basic algorithm was then extended to include dynamically varying bounds. The resulting maximum clique search algorithm is significantly faster than the comparable algorithms. This algorithm is used for comparing protein structures, to provide the information about protein function and also the information about possible interactions between proteins.

Joint work with Dušanka Janežič.

The shortest target string cover superstring problem and applications

Luis Martínez Fernández, University of the Basque Country, Bilbao, Spain

We introduce the problem of covering in a balanced way a given set of substrings of a family of strings by a string of minimum length and discuss its biological applications.

An edit-distance model for the approximate matching of timed strings

Simon Dobrišek, University of Ljubljana, Slovenia

An edit-distance model that can be used for the approximate matching of contiguous and non-contiguous timed strings is presented. The model extends the concept of the weighted string-edit distance by introducing timed edit operations and by making the edit costs time dependent. Special attention is paid to the timed null symbols that are associated with the timed insertions and deletions. The presented model is commonly used for the classification of speech recognition errors. However, it could also be used for the approximate matching of other similar types of strings where time segments are replaced, for instance, by some kind of spatial segments or even by larger genome units/sequences. Different variants and parameter setting of the string alignment algorithm can produce different classification of errors/differences between the two strings and, in this context, several statistical criterion functions are investigated that can be used for the comparison and evaluation of these differences.

Computer aided melodic analysis using suffix tree

Matevž Jekovec, University of Ljubljana, Slovenia

Quality melodic analysis of the score is one of the most significant, but also awkward and time consuming musicological tasks. In this paper we propose a computer aided melodic analysis based on suffix tree. Our data structure represents hierarchically combined melodic patterns for a given score. The potential interestingness of melodic patterns to the musicologist is then estimated from their diversity, length and frequency. We tested the proposed method on 48 fugues from J. S. Bach's Well-Tempered Clavier opus. All our approaches were integrated into the free musicological application Harmonia, which allows musicologists to explore the most common theme, its submelodies, common motifs and other melody-based score features.

Speeding up RNA motif search by data-driven element ordering

Brona Brejova, Comenius University, Bratislava, Slovakia

We present a new tool, RNArobo, for finding occurrences of a given RNA structural motif in DNA sequences. Our tool allows expert human users to describe the most relevant features of a target RNA structure and then to search for distant homologs in available genomic data. We extend a commonly used motif descriptor format with the possibility to allow insertions in RNA helices, which is useful for RNA structures that tolerate bulges. More importantly, we study the problem of element ordering in a backtracking algorithm for this problem and develop a strategy for reordering elements during the search based on observed performance of different orders. Our strategy outperforms existing tools on complex motifs.

Joint work with Ladislav Rampasek, Andrej Luptak and Tomas Vinar

Genome Informatics

Table of Contents