References for Algorithms in Genome Research


Parts of the material is covered by the following two textbooks. Some topics are newer than these books. For most of these, specialized references are given below.

  1. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, 1997.
  2. D. W. Mount, editor. Bioinformatics Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 2nd edition, 2004.

Biological data types and formats

The main issues of this section are discussed in Mounts textbook. Some of the major biological sequence databases are:

  1. Genbank (at NCBI):
  2. Ensembl Genome Sequence Database:
  3. DDBJ (Japan):
  4. UniProt (SwissProt and trEMBL):

Sequencing technologies and applications

Papers on SOLiD sequencing:

  1. Voelkerding et al. Next-generation sequencing: from basic research to diagnostics. Clinical Chemistry 55(4):641-658, 2009.
  2. Rumble et al. SHRiMP: Accurate Mapping of Short Color-space Reads. PLOS Computational Biology 5(5), 2009.

Physical mapping

This section is mainly based on Chapter 16 of Gusfield's textbook. The Tightest Layout Problem is originally from (Alizadehet al., 1995). Another reference is (Heber et al., 2000).

  1. F. Alizadeh, R. M. Karp, L. A. Newberg, and D. K. Weisser. Physical mapping of chromosomes: A combinatorial problem in molecular biology. Algorithmica 13(1/2):52-76, 1995.
  2. S. Heber, J. Stoye, M. Frohme, J. Hoheisel, and M. Vingron. Contig selection in physical mapping. J. Comp. Biol. 7(3/4):395-408, 2000.

Genome assembly Ia: Overlap-Layout-Consensus - Clone-by-clone and whole-genome shotgun assembly

Lander-Waterman Statistics:

  1. E.S. Lander and M.S. Waterman, Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2:231-239, 1988.

Overlap-Layout-Consensus Assembly: There are many good description available online, just search for them.

Genome assembly Ib: Re-sequencing, comparative (reference-based) assembly

A good introduction to comparative genome assembly is [1]. The main algorithmic challenge is to map millions of (most very short) sequence reads onto one or more referene geneome(s). Suitable mapping algorithms for this task are SWIFT [2], Bowtie [6], ELAND (Cox, unpublished), MAQ [3], RMAP, SOAP [4], SHRiMP, SeqMap [5], TAGGER [7], ZOOM [8], BWA [9], GSNAP [10], SARUMAN [11], SSAHA2 [12], NextGenMap [13], etc.

  1. M. Pop, A. Phillippy, A. L. Delcher, and S. L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics 5(3):237-248, 2004.
  2. K. Rasmussen, J. Stoye, E. W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comp. Biol. 13(2):296-308, 2006.
  3. H. Li, J. Ruan, R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858, 2008.
  4. R. Li, Y. Li, K. Kristiansen, J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics 24(5):713-714, 2008.
  5. H. Jiang, W. H Wong. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24(20):2395-2396, 2008.
  6. B. Langmead, C. Trapnell, M. Pop, S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25, 2009.
  7. C. Iseli, G. Ambrosini, P. Bucher, C. V. Jongeneel. Indexing Strategies for Rapid Searches of Short Words in Genome Sequences. PLoS ONE 2(6):e579, 2007.
  8. H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, M. Li. ZOOM! Zillions of oligos mapped. Bioinformatics 24(21): 2431-2437, 2008.
  9. H. Li, R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14): 1754-1760, 2009.
  10. T. Wu, S. Nacu. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(7): 873–881, 2010.
  11. J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye, A. Goesmann. Exact and complete short read alignment to microbial genomes using GPU programming. Bioinformatics 27(10): 1351-1358, 2011.
  12. Z. Ning, A.J. Cox. SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 11(10): 1725-1729, 2001.
  13. F. J. Sedlazeck, P. Rescheneder, A. von Haeseler. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 29(21): 2790-2791, 2013.
  14. L. Oesper, A. Ritz, S. J. Aerni, R. Drebin, B. J. Raphael. Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics 13(Suppl. 6):S10, 2012.

Genome assembly IIa: De-novo assembly of HTS data

These methods are all very similar, using variants of de-Bruijn graphs plus additional tricks: EULER-SR [1], Velvet [2,3], MIRA2 (Link), SSAKE [4],VCAKE [5], SHARCGS [6], Medvedev/Brudno [7], ABySS (Link), ALLPATHS [8], SOAPdenovo [9], IDBA [10], etc. A very good recent review paper is [11].

  1. M. J. Chaisson, P. A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 18(2):324-330, 2008.
  2. D. R. Zerbino, E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-829, 2008.
  3. D. R. Zerbino, G. K. McEwen, E. H. Margulies, E. Birney. Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler. PLoS ONE 4(12):e8407, 2009.
  4. R. L. Warren, G. G. Sutton, S. J. M. Jones, R. A. Holt. Assembling millions of short dna sequences using ssake. Bioinformatics 23(4):500-501, 2006.
  5. W. R. Jeck, J. A. Reinhardt, D. A. Baltrus, M. T. Hickenbotham, V. Magrini, E. R. Mardis, J. L. Dang, C. D. Jones. Extending assembly of short dna sequences to handle error. Bioinformatics 23(21):2942-2944, 2007.
  6. J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Sharcgs, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17(11):1697-1706, 2007.
  7. P. Medvedev, M. Brudno. Ab initio whole genome shotgun assembly with mated short reads. Proc. RECOMB 2008 pages 50-64, 2008.
  8. J. Butler, I. Maccallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, D. B. Jaffe. Allpaths: De novo assembly of whole-genome shotgun microreads. Genome Res. 18(5):810-820, 2008.
  9. R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2):265-272, 2010.
  10. Y. Peng, H. C. M. Leung, S. M. Yiu, F. Y. L. Chin. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. Proceedings of RECOMB 2010, LNBI 6044, 426-440, 2010.
  11. J. R. Miller, S. Koren, G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics in press, 2010.
  12. Phillip E. C. Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29, 987-991, 2011.

Genome assembly IIb: Hybrid/long read assembly

  1. A. C. English, S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin, D. M. Muzny, J. G. Reid, K. C. Worley, R. A. Gibbs. Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE 7(11): e47768, 2012.
  2. S. Koren, M. C. Schatz, B. P. Walenz, J. Martin, J. T. Howard, G. Ganapathy, Z. Wang, D. A. Rasko, W. R. McCombie, E. D. Jarvis, A. M. Phillippy. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnolopgy 30:693-700, 2012.
  3. C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner, J. Korlach. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10:563-569, 2013.
  4. G. Myers. Efficient Local Alignment Discovery amongst Noisy Long Reads. Proceedings of WABI 2014, LNBI 8701, 52-67, 2014.
  5. F. J. Sedlazeck, P. Rescheneder, M. Smolka, H. Fang, M. Nattestad, A. von Haeseler, M. C. Schatz. Accurate detection of complex structural variations using single molecule sequencing. Nat. Methods 15(6): 461–468, 2018.
  6. E. Haghshenas, H. Asghari, J. Stoye, C. Chauve, F. Hach. HASLR: Fast Hybrid Assembly of Long Reads. iScience 23(8): 101389, 2020.

String graph assembly for diploid genomes with long reads is explained on the following poster by PacBio.

Pre-processing (correction of long reads):

  1. L. Salmela and E. Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24):3506–3514, 2014.
  2. L. Salmela, R. Walve, E. Rivals and E. Ukkonen. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics, 33(6):799–806, 2017.

Genome annotation I: Basic techniques

A general introduction to HMMs in Bioinformatics is in the textbook by Durbinet al. [1]. Covariance models were introduced in [2]. A similar concept was developed in [3].

  1. R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.
  2. S. R. Eddy, R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Res. 22(11):2079-2088, 1994.
  3. Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjølander, R. C. Underwood, D. Haussler. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res. 22(23):5112-5120, 1994.

Genome annotation II: Finding RNA genes

Here are several references, in chronological order:

  1. R. Staden. A computer program to search for tRNA genes. Nucleic Acids Res. 8:817-825, 1980.
  2. G. A. Fichant and C. Burks. Identifying potential tRNA genes in genomic DNA sequences. J. Mol. Biol. 220:659-671, 1991.
  3. A. Pavesi, F. Conterio, A. Bolchi, G. Dieci, and S. Ottonello. Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions. Nucleic Acids Res. 22:1247-1256, 1994.
  4. S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Res. 22(11):2079-2088, 1994.
  5. T. M. Lowe and S. R. Eddy. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955-964, 1997.

Genome annotation III: Finding protein-coding genes

Here are several references, in chronological order. First, for prokaryotes

  1. M. Borodovsky and J. McIninch. Genmark: Parallel gene recognition for both DNA strands. Computers Chem. 17(2):123-133, 1993.
  2. A. V. Lukashin and M. Borodovsky. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 26(4):1107-1115, 1998.
  3. S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26(2):544-548, 1998.
  4. A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27(23):4636-3641, 1999.
  5. D. Frishman, A. Mironov, H.-W. Mewes, and M. Gelfand. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 26(12):2941-2947, 1998.
  6. S. S. Hannenhalli, W. S. Hayes, A. G. Hatzigeorgiou, and J. W. Fickett. Bacterial start site prediction. Nucleic Acids Res. 27(17):3577-3582, 1999.
  7. J. H. Badger and G. J. Olsen. CRITICA: Coding region identification tool invoking comparative analysis. Mol. Biol. Evol. 16(4):512-524, 1999.
  8. T. S. Larsen and A. Krogh. EasyGene a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4(21), 2003.
  9. L. Krause, A. McHardy, T. Nattkemper, A. Pühler, J. Stoye, F. Meyer. GISMO - gene identification using a support vector machine for ORF classification. Nucleic Acids Res. 35(2):540-549, 2006.

… and now for eukaryotes:

  1. G. Stormo. Consensus patterns in DNA. In R. F. Doolittle, editor, Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Meth. Enzymol. volume 183 chapter 13, pages 211-221. Academic Press, San Diego, CA, 1990.
  2. R. Guigó, S. Knudsen, N. Drake, and T. Smith. Prediction of gene structure. J. Mol. Biol. 226(1):141-157, 1992.
  3. E. C. Uberbacher and R. J. Mural. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88(24):11261-11265, 1991.
  4. E. C. Uberbacher, Y. Xu, and R. J. Mural. Discovering and understanding genes in human DNA sequence using GRAIL. In R. F. Doolittle, editor, Computer Methods for Macromolecular Sequence Analysis, Meth. Enzymol. volume 266 chapter 28, pages 259-281. Academic Press, San Diego, CA, 1996.
  5. M. G. Reese, F. H. Eeckman, D. Kulp, and D. Haussler. Improved splice site detection in Genie. J. Comp. Biol. 4(3):311-323, 1997.
  6. C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1):78-94, 1997.
  7. G. Parra, E. Blanco, and R. Guigó. GeneID in Drosophila. Genome Res. 10(4):511-515, 2000.
  8. M. G. Reese, D. Kulp, H. Tammana, and D. Haussler. Genie - gene finding in Drosophila melanogaster. Genome Res. 10:529-538, 2000.
  9. M. Pertea and S. L. Salzberg. Computational gene finding in plants. Plant Mol. Biol. 48(1-2):39-48, 2002.

Genome annotation IV: Patterns in non-coding regions

This part is almost completely based on the work of Mathieu Blanchette and co-authors:

  1. M. Blanchette, B. Schwikowski, and M. Tompa. Algorithms for phylogenetic footprinting. J. Comp. Biol. 9(2):211-223, 2002.
  2. M. Blanchette and M. Tompa. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12:739-748, 2002.

Genome annotation V: Functional genome annotation

All bioinformatics textbooks contain good overviews of sequence analysis by Smith-Waterman and fast heuristic methods for database search like FASTA and BLAST. Here are some references to the original papers, including one by ourselves:

  1. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Mol. Biol. 147(1):195-197, 1981.
  2. W. R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. In R. F. Doolittle, editor, Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Meth. Enzymol. volume 183 chapter 5, pages 63-98. Academic Press, San Diego, CA, 1990.
  3. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol. 215(3):403-410, 1990.
  4. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-3402, 1997.
  5. K. Rasmussen, J. Stoye, and E. W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comp. Biol. 13(2):296-308, 2006.

Transcriptomics I: mRNA analysis, RNA-Seq

A few papers on EST clustering and splicing graphs:

  1. S. Heber, M. Alekseyev, S.-H. Sze, H. Tang, P. A. Pevzner. Splicing graphs and EST assembly problem. Bioinformatics 18(Suppl. 1):S181-S188, 2002.
  2. M. Sammeth, S. Foissac, R. Guigó. A general definition and nomenclature for alternative splicing events. PLoS Comput. Biol. 4(8):e1000147, 2008.
  3. V. Lacroix, M. Sammeth, R. Guigó, A. Bergeron. Exact Transcriptome Reconstruction from Short Sequence reads. Proc. WABi 2008, LNBI 5251, 50-63, 2008.

Very good review about NGS transcriptomics (RNA-seq):

  1. J. A. Martin, Z. Wang. Next-generation transcriptome assembly. Nat. Rev. Genet., 12(10), 671–682, 2011.

Examples for software supporting RNA-seq:

  1. C. Trapnell, L. Pachter, S. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9): 1105-1111, 2009.
  2. K. Wang, D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, P. Mieczkowski, S. A. Grimm, C. M. Perou, J. N. MacLeod, D. Y. Chiang, J. F. Prins, J. Liu. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38(18): e178, 2010.

Tanscriptomics II: DNA microarray design

Just a few in-house papers, because they are so nice:

  1. S. Rahmann. Fast large scale oligonucleotide selection using the longest common factor approach. J. Bioinf. Comput. Biol. 1(2):343-361, 2003.
  2. S. Rahmann. The shortest common supersequence problem in a microarray production setting. Bioinformatics 19(Suppl. 2):156-161, 2003. (Proceedings of ECCB 2003).
  3. S.A. de Carvalho Jr. and S. Rahmann. Improving the layout of oligonucleotide microarrays: Pivot Partitioning. In Proc. 6th Workshop of Algorithms in Bioinformatics LNBI volume 4175 pages 321-332. Springer, 2006.
  4. S.A. de Carvalho Jr. and S. Rahmann. Microarray layout as a quadratic assignment problem. In Proc. German Conference on Bioinformatics, volume P-83 of Lecture Notes in Informatics, pages 11-20. GI, 2006.

Tanscriptomics III: DNA microarray analysis

A couple of papers and one book:

  1. W. Huber, A. von Heydebreck, H. Sueltmann, A. Poustka, and M. Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(Suppl. 1):96-104, 2002.
  2. T. P. Speed, editor. Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall/CRC, 2003.
  3. C. Steinhoff and M. Vingron. Normalization and quantification of differential expression in gene expression microarrays. Brief. Bioinform. 7(2):166-177, 2006.

RNA secondary structure, structure space, RNA-based regulation

The most classical algorithm for RNA secondary structure prediction is Nussinov's algorithm:

  1. R. Nussinov, G. Pieczenik, J. R. Griggs, D. J. Kleitman. Algorithms for Loop Matchings. SIAM J. Appl. Math. 35(1):68–82, 1978.
  2. R. Giegerich, B. Voß, M. Rehmsmeier. Abstract shapes of RNA. Nucleic Acids Res. 32(16):4843-4851, 2004.

Proteomics I: 2D gels

A survey of various techniques for two-dimensional gel alignment is [1]. The method described in the lecture is from [2], [3].

  1. A. W. Dowsey, M. J. Dunn, and G.-Z. Yang. The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 3(8):1567-1596, 2003.
  2. F. A. Potra, X. Liu, F. Seillier-Moiseiwitsch, A. Roy, Y. Hang, M. R. Marten, B. Raman, and C. Whisnant. Protein image alignment via piecewise affine transformations. J. Comp. Biol. 13(3):614-630, 2006.
  3. Andrew W. Dowsey, Michael J. Dunn and Guang-Zhong Yang. Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline. Bioinformatics 24(7):950-957, 2008.

Proteomics II: Analysis of mass spectra

The algorithms discussed include: de-novo protein sequencing by mass spectra [1,2], the money changing problem [3], alignment of time-series of mass spectra [4].

  1. T Chen, Ming-Yang Kao, M Tepel, J Rush, and G M Church. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8(3):325-337, 2001.
  2. Lijuan Mo, Debojyoti Dutta, Yunhu Wan, and Ting Chen. MSNovo: A Dynamic Programming Algorithm for de Novo Peptide Sequencing via Tandem Mass Spectrometry. Anal. Chem. 79(13): 4870–4878, 2007.
  3. Sebastian Böcker and Zsuzsanna Lipták. A fast and simple algorithm for the money changing problem. Algorithmica 48(4):413-432, 2007.
  4. Amol Prakash, Parag Mallick, Jeffrey Whiteaker, Heidi Zhang, Amanda Paulovich, Mark Flory, Hookeun Lee, Ruedi Aebersold, and Benno Schwikowski. Signal maps for mass spectrometry-based comparative proteomics. Mol Cell Proteomics 5(3):423-432, 2006.

Computational Metabolomics

Some software developed in Bielefeld has been published here:

  1. H. Neuweger, S. P. Albaum, M. Dondrup, M. Persicke, T. Watt, K. Niehaus, J. Stoye, A. Goesmann. MeltDB: A software Platform for the Analysis and Integration of Metabolomics Experiment Data. Bioinformatics 24(23):2726-2732, 2008.
  2. N. Hoffmann, J. Stoye. Generic Software Frameworks for GC-MS Based Metabolomics. In: U. Roessner (ed.): Metabolomics. Chapter 4, pp. 73-98. InTech, 2012.

Computational Systems Biology: Networks, their models and algorithms

A textbook that covers this (and much more) is:

  1. E. Klipp, R. Herwig, A. Kowald, C. Wierling, H. Lehrach. Systems Biology in Practice - Concepts, Implementation and Application. Wiley-VCH, 2005.

Computational pangenomics

The gene based method is considered here (for example):

  1. J. Blom, S. P. Albaum, D. Doppmeier, A. Pühler, F.-J. Vorhölter, M. Zakrzewski, and A. Goesmann. EDGAR: A software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics 10:154, 2009.
  2. J. Blom, J. Kreis, S. Spänig, T. Juhre, C. Bertelli, C. Ernst, and A. Goesmann. EDGAR 2.0: an enhanced software platform for comparative gene content analyses. Nucleic Acids Res. 44(W1):W22–W28, 2016.
  3. J. Blom, S. P. Glaeser, T. Juhre, J. Kreis, P. H. G. Hanel, J. G. Schrader, P. Kämpfer, and A. Goesmann. EDGAR: A Versatile Tool for Phylogenomics. In: W. B. Whitman (ed.). Bergey's Manual of Systematics of Archaea and Bacteria, Wiley, 2019.

A good overview of genome-based computational pangenomics gives the following review paper:

  1. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief. Bioinf. 19(1), 118–135, 2018.

Some more specialized papers are the following.

(A) Data structures

  1. B. Paten, D. Earl, N. Nguyen, M. Diekhans, D. Zerbino, D. Haussler. Cactus: Algorithms for genome multiple sequence alignment. Genome Research 21, 1512–1528, 2011
  2. C. Ernst, S. Rahmann. PanCake: A Data Structure for Pangenomes. Proc. of GCB 2013, 35-45, 2013.
  3. G. Holley, R. Wittler, and J. Stoye. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11: 3, 2016.
  4. E. Garrison, J. Sirén, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F Lin, B. Paten, and R. Durbin. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879, 2018.

(B) Sequence-to-graph mapping/alignment

  1. M. Rautiainen, V. Mäkinen, and T. Marschall. Bit-parallel sequence-to-graph alignment. Bioinformatics 35(19), 3599-3607, 2019.
  2. R. Martiniano, E. Garrison, E. R. Jones, A. Manica, and R. Durbin. Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. Genome Biol. 21: 250, 2020.
  3. M. Rautiainen and T. Marschall. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21: 253, 2020.
  4. A. Kuhnle, T. Mun, C. Boucher, T. Gagie, B. Langmead, and G. Manzini. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J. Comp. Biol. 27(4), 500-513, 2020.
  5. N. Luhmann, G. Holley, and M. Achtman. BlastFrost: Fast querying of 100,000s of bacterial genomes in Bifrost graphs. BioRxiv, 2020.
  6. T. Schulz, R. Wittler, S. Rahmann, F. Hach, and J. Stoye. Detecting High Scoring Local Alignments in Pangenome Graphs. Bioinformatics 37(16), 2266–2274, 2021.

(C) Phylogenomics:

  1. R. Wittler. Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms Mol. Biol. 15: 4, 2020.
  2. A. Rempel, R. Wittler. SANS serif: alignment-free, whole-genome-based phylogenetic reconstruction. Bioinformatics 37(24), 4868-4870, 2021.

(D) Haplotype inference:

See below.

Comparative genomics I: Genome alignment, repeat analysis

Some relevant papers in this area are:

  1. A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic Acids Res. 27(11):2369-2376, 1999.
  2. S. Schwartz, Z. Zhang, K. A. Fraser, A. Smit, C. Riemer, J. Bouck, R. Gibson, R. Hardisson, and W. Miller. PipMaker-A web server for aligning two genomic DNA sequences. Genome Res. 10(4):577-586, 2000.
  3. M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics 18(Suppl. 1):S312-S320, 2002. (Proceedings of ISMB 2002).
  4. S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, Shumway M., C. Antonescu, and S. L. Salzberg. Versatile and open software for comparing large genomes. Genome Biol. 5:R 12, 2004.
  5. A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14(7):1394-1403, 2004.
  6. E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. In: Handbook of Computational Molecular Biology (Chapter 15), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006.
  7. S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29(22):4643-4653, 2001.

Comparative genomics II: Genome rearrangement

The classical papers are by Hannenhalli and Pevzner (1999 and 1995). A much better readable description for the reversal distance can be found in the book chapter by Bergeron et al. (2005). The general DCJ model is described in (Yancopoulos et al., 2005) and (much nicer) in (Bergeron et al., 2006a). A correct algorithm for sorting by translocations is given in (Bergeron et al., 2006b), an almost correct one for the HP distance in (Bergeron et al., 2009), one formula of which is corrected in (Erdős et al., 2011). A very good overview of median, halving and guided halving results can be found in (Tannier et al., 2009).

  1. S. Hannenhalli and P. A. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In Proc. 36th Annu. Symp. Found. Comput. Sci. (FOCS 1995), pages 581-592. IEEE Press, 1995.
  2. A. Bergeron. A very elementary presentation of the Hannenhalli-Pevzner theory. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM 2001), volume 2089 of LNCS, pages 106-117, 2001.
  3. A. Bergeron, J. Mixtacki, and J. Stoye. The inversion distance problem. In O. Gascuel, editor, Mathematics of Evolution and Phylogeny, chapter 10, pages 262-290. Oxford University Press, Oxford, UK, 2005.
  4. S. Yancopoulos, O. Attie, and R. Friedberg. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21(16):3340-3346, 2005.
  5. A. Bergeron, J. Mixtacki, and J. Stoye. A unifying view of genome rearrangements. In: Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI 2006), volume 4175 of LNBI, pages 163-173, 2006.
  6. A. Bergeron, J. Mixtacki, and J. Stoye. On sorting by translocations. J. Comp. Biol. 13(2):567-578, 2006.
  7. A. Bergeron, J. Mixtacki, and J. Stoye. A new linear time algorithm to compute the genomic distance via the double cut and join distance. Theor. Comput. Sci. 410(51):5300-5316, 2009.
  8. P. L. Erdős, L. Soukup, J. Stoye. Balanced vertices in trees and a simpler algorithm to compute the genomic distance. Appl. Math. Lett. 24:82-86, 2011.
  9. E. Tannier, C. Zheng, D. Sankoff. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics 10:120, 2009.

Comparative genomics III: Gene clusters

The following are the algorithmic papers in this area. Apart from that, many papers on applications of gene clusters and statistical properties exist, but are not listed here.

(a.) Common intervals of permutations:

  1. T. Uno and M. Yagiura. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26(2):290-309, 2000.
  2. S. Heber and J. Stoye. Finding all common intervals of k permutations. In A. Amir and G. Landau, editors, Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM 2001, volume 2089 of LNCS, pages 207-218, Berlin, 2001. Springer Verlag.
  3. S. Heber and J. Stoye. Algorithms for finding gene clusters. In O. Gascuel and B. Moret, editors,Proceedings of the First International Workshop on Algorithms in Bioinformatics, WABI 2001, volume 2149 of LNCS, pages 252-263, Berlin, 2001. Springer Verlag.
  4. A. Bergeron, S. Corteel, and M. Raffinot. The algorithmic of gene teams. In R. Guigó and D. Gusfield,editors, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, WABI 2002, volume 2452 of LNCS, pages 464-476, Berlin, 2002. Springer Verlag.
  5. N. Luc, J.-L. Risler, A. Bergeron, and M. Raffinot. Gene teams: a new formalization of gene clusters for comparative genomics. Comp. Biol. Chem. 27:59-67, 2003.
  6. G. M. Landau, L. Parida, and O. Weimann. Gene proximity analysis across whole genomes via PQ tree. J. Comp. Biol. 12(10):1289–1306, 2005.
  7. A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. Computing common intervals of K permutations, with applications to modular decomposition of graphs. SIAM J. Discrete Mathematics 22(3):1022–1039, 2008.
  8. S. Heber, R. Mayr, J. Stoye. Common Intervals of Multiple Permutations. Algorithmica 60(2):175-206, 2011.

(b.) Common intervals of sequences:

  1. T. Schmidt and J. Stoye. Quadratic time algorithms for finding common intervals in two and more sequences. In S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusoz, editors, Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM 2004, volume 3109 of LNCS, pages 347-358, Berlin, 2004. Springer Verlag.
  2. G. Didier, T. Schmidt, J. Stoye, D. Tsur. Character Sets of Strings. J. Discr. Alg. 5(2):330-340, 2007.
  3. X. He and M. H. Goldwasser. Identifying conserved gene clusters in the presence of homology families. J. Comp. Biol. 12(6):638-656, 2005.

(c.) Approximate common intervals of sequences:

  1. S. Böcker, K. Jahn, J. Mixtacki, J. Stoye. Computation of Median Gene Clusters. J. Comp. Biol. 16(8):1085-1099, 2009.
  2. F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker. Swiftly Computing Center Strings. BMC Bioinformatics 12:106, 2011.

(d.) Common intervals of indeterminate strings:

  1. D. Doerr, J. Stoye, S. Böcker, K. Jahn. Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings. BMC Genomics 15(Suppl. 6): S2, 2014.

Comparative Genomics IV: Multiple genome rearrangement

  1. D. Sankoff, G. Sunaram, J. Kececioglu. Steiner points in the space of genome rearrangements. International Journal of Foundations of Computer Science 7(1):1-9, 1996.
  2. G. Bourque, P. A. Pevzner. Genome-Scale Evolution: Reconstructing Gene Orders in the Ancestral Species. Genome Res. 12:26-36, 2002.
  3. J. Mixtacki. Genome halving under DCJ revisited. In: Proceedings of the 14th Annual International Conference on Computing and Combinatorics (COCOON 2008), volume 5092 of LNCS, pages 276-286, 2008.
  4. R. Warren, D. Sankoff. Genome Halving with Double Cut and Join. Journal of Bioinformatics and Computational Biology 7(2): 357-371, 2009.
  5. P Feijão, J. Meidanis. SCJ: A Breakpoint-Like Distance that Simplifies Several Rearrangement Problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(3): 1318-1329, 2011.

Comparative Genomics V: Reconstruction of ancestral gene clusters

  1. R. Wittler. Phylogeny-based analysis of gene clusters. Ph.D. Thesis, Faculty of Technology, Bielefeld University, 2010.
  2. J. Stoye, R. Wittler. A Unified Approach for Reconstructing Ancient Gene Clusters. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(3):387-400, 2009.
  3. R. Wittler, J. Maňuch, M. Patterson, J. Stoye. Consistency of Sequence-Based Gene Clusters. J. Comp. Biol. 18(9):1023-1039, 2011.

Haplotype inference

A great overview of the combinatorial problems and algorithms in the following book chapter:

  1. D. Gusfield, S. Hecht Orzack. Haplotype Inference. In: Handbook of Computational Molecular Biology (Chapter 18), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006.

A more recent works on the topic, focussing on molecular haplotyping:

  1. G. W. Klau, T. Marschall. A guided tour to computational haplotyping. In: Proc. of CiE 2017, LNCS 10307, Springer Verlag, 2017.
  2. M. Patterson, T. Marschall, N. Pisanti, L. v. Iersel, L. Stougie, G. W. Klau, A. Schönhuth. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. Journal of Computational Biology 22(6), 498-509, 2015.

The ILP discussed in class is from the following textbook, Section 20.2:

  1. Dan Gusfield. Integer Linear Programming in Computational and Systems Biology. Cambridge University Press, 2019.

SNP-disease associations

Here are a few of the more algorithmic papers on the topic, but there exist several more. You may look up the references in this one.

  1. H. T. T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, and J. Kere. Gene mapping by haplotype pattern mining. In IEEE International Symposium on Bio-Informatics and Biomedical Engineering (BIBE'00): 99, 2000.
  2. T. Mailund, S. Besenbacher, and M. Schierup. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics 7:454, 2006.
  3. S. Besenbacher, C. N. S. Pedersen, and T. Mailund. A fast algorithm for genome-wide haplotype pattern mining. BMC Bioinformatics 10(Suppl 1):S74, 2009.
  4. S. Besenbacher, T. Mailand, M. H. Schierup. Association Mapping and Disease: Evolutionary Perspectives. Evolutionary Genomics: Statistical and Computational Methods, Volume 2 Methods in Molecular Biology, vol. 856. Chapter 11, 2012.

Computational metagenomics, astrobiology

Papers on tools for the analysis of metagenomics data:

  1. D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster. MEGAN analysis of metagenomic data. Genome Res. 17:377-386, 2007.
  2. Q. Wang, G. M. Garrity, J. M. Tiedje, J. R. Cole. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Env. Microbiol., 73(16):5261-5267, 2007.
  3. F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards. The Metagenomics RAST server - A public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386, 2008.
  4. L. Krause, N. N. Diaz, A. Goesmann, S. Kelley, T. W. Nattkemper, F. Rohwer, R. A. Edwards, J. Stoye. Phylogenetic classification of short environmental DNA fragments. Nuleic Acids Res. 36(7):2230-2239, 2008.
  5. M. M. Haque, T. S. Ghosh, D. Komanduri, S. S. Mande. SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25(14):1722-1730, 2009.
  6. W. Gerlach, S. Jünemann, F. Tille, A. Goesmann, J. Stoye. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 10:430, 2009.
  7. D. E. Wood, S. L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15:R46, 2014.
  8. A. Oulas, C. Pavloudi, P. Polymenakou, G. A. Pavlopoulos, N. Papanikolaou, G. Kotoulas, C. Arvanitidis, I. Iliopoulos. Metagenomics: Tools and insights for analyzing next-generation sequencing data derived from biodiversity studies. Bioinformatics and Biology Insights 9:75-88, 2015.

(Astrobiology was mainly meant as a joke, but have a look here: NAI. Or here: NASA unveils arsenic life form.)