Genome Informatics

Parmigiani

Winter 2025/26

Wednesday 10:15-11:45

U10-146

This course is held in English.

Based on original research papers, the participants will give oral presentations (20-45 min) and write short summaries (5-10 pages) about algorithmic problems in bioinformatics and their solutions. Talks and essays must be done in English. The first day covers an overview of possible topics, which will then be distributed to the students. Aspects of scientific writing and presenting will be covered as well.

The overarching topic of this semester are k-mers (a.k.a. q-grams). This simple concept builds a basis for many algorithmic solutions in bioinformatics, such as assembly, alignment, genome comparison, pangenomics, etc.

To practice algorithm design and presentation, each participant will specify a simple k-mer counting algorithm and present it using pseudocode (a LaTeX template will be provided). Afterwards, they will implement the algorithm as a basic prototype. In a coding showdown, all implementations will battle it out to see which one takes the crown! (This can be credited as “392041 Implementation of Algorithms (Ü)”, 1 LP.)

Possible concrete methods/publications to be presented/discussed in the seminar are:

Assembly

Zerbino, Daniel R., and Ewan Birney. “Velvet: algorithms for de novo short read assembly using de Bruijn graphs.” Genome research 18.5 (2008): 821-829.
Chikhi, Rayan, and Guillaume Rizk. “Space-efficient and exact de Bruijn graph representation based on a Bloom filter. (Minia)” Algorithms for Molecular Biology 8.1 (2013): 1-9.

De Bruijn Graphs

Holley, Guillaume, and Páll Melsted. “Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs.” Genome biology 21.1 (2020): 1-20.
Ekim, Barış, Bonnie Berger, and Rayan Chikhi. “Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer.” Cell systems 12.10 (2021): 958-968.

Alignment

Li, Heng. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics 34.18 (2018): 3094-3100.
Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 1-14.

Counting k-mers

Kokot, Marek, Maciej Długosz, and Sebastian Deorowicz. “KMC 3: counting and manipulating k-mer statistics.” Bioinformatics 33.17 (2017): 2759-2761. (plus previous versions)
Lucas Robidou and Pierre Peterlongo. “fimpera: drastic improvement of Approximate Membership Query data-structures with counts” bioRxiv (2023)
Téo Lemane, Paul Medvedev, Rayan Chikhi, and Pierre Peterlongo. “kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections.” Bioinformatics Advances, 2.1 (2022)
Jens Zentgraf and Sven Rahmann. “Fast gapped k-mer counting with subdivided multi-way bucketed Cuckoo hash table.” In 22nd International Workshop on Algorithms in Bioinformatics (WABI). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2022)

Storing k-mers

Sebastian Schmidt and Jarno N. Alanko. “Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time.” Algorithms for Molecular Biology 18(1), 5 (2023)
Giulio Ermanno Pibir. “Sparse and skew hashing of K-mers.” Bioinformatics 38.Suppl1 (2022):i185-i194.

Timetable

15.10.	–
22.10.	Organization, topic selection, Scientific reading	Slides: HowToRead
29.10.	Pseudocode	algorithm2e-docu, LaTeX template (remove .pdf from file name), notes/example
5.11.	– (self study)	Scientific Writing, Ten Simple Rules for Making Good Oral Presentations, Philip E Bourne, Ten simple rules for short and swift presentations, Christopher J. Lortie
12.11.	–
19.11.	Pseudocode presentation
26.11.	k-mer counting challenge	Everybody
3.12.	Mitan	Sparse and skew hashing of K-mers.
10.12.	Kiril	Space-efficient and exact de Bruijn graph representation based on a Bloom filter. (Minia)
17.12.	–
14.1.	Tom	Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
21.1.	Nils & John	Mash: fast genome and metagenome distance estimation using MinHash. & Fast gapped k-mer (cuckoo hash tables)
28.1.	Mehran	Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time.
4.2.	Nicola	Efficient q-gram filters
25.2.	Elif	Fimpera
4.3.	–
11.3.	–
18.3.	–
25.3.	–

Details on the k-mer counting exercise

Input:

Multiple fasta file, containing nucleotide sequences (small or capital letters, maybe N's)
k-mer length k
cutoff c : a k-mer must appear at least c times in the same fasta file to be counted.

Output:

Number of canonical k-mers occurring at least c times.