Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Previous revision
teaching:alggrliterature [2019/02/10 15:33]
teaching:alggrliterature [2022/11/21 09:57] (current)
jstoye [Genome assembly IIb: Hybrid/long read assembly]
Line 1: Line 1:
 +===== References for Algorithms in Genome Research =====
  
 +==== Overview ====
 +Parts of the material is covered by the following two textbooks. Some topics are newer than these books. For most of these, specialized references are given below. ​
 +
 +  - D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, 1997. 
 +  - D. W. Mount, editor. Bioinformatics ​ Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 2nd edition, 2004. 
 +
 +==== Biological data types and formats ====
 +The main issues of this section are discussed in Mounts textbook. Some of the major biological sequence databases are: 
 +
 +  - Genbank (at NCBI): www.ncbi.nlm.nih.gov/​Genbank ​
 +  - Ensembl Genome Sequence Database: www.ensembl.org ​
 +  - DDBJ (Japan): www.ddbj.nig.ac.jp ​
 +  - UniProt (SwissProt and trEMBL): www.uniprot.org ​
 +
 +==== Sequencing technologies and applications ====
 +Papers on SOLiD sequencing: ​
 +
 +  - Voelkerding et al. [[https://​doi.org/​10.1373/​clinchem.2008.112789|Next-generation sequencing: from basic research to diagnostics]]. //Clinical Chemistry// **55**(4):​641-658,​ 2009. 
 +  - Rumble et al. [[https://​doi.org/​10.1371/​journal.pcbi.1000386|SHRiMP:​ Accurate Mapping of Short Color-space Reads]]. //PLOS Computational Biology// **5**(5), 2009. 
 +
 +==== Physical mapping ====
 +This section is mainly based on Chapter 16 of Gusfield'​s textbook. The Tightest Layout Problem is originally from (Alizadeh//​et al.//, 1995). Another reference is (Heber //et al.//, 2000). ​
 +
 +  - F. Alizadeh, R. M. Karp, L. A. Newberg, and D. K. Weisser. [[https://​doi.org/​10.1007/​BF01188581|Physical mapping of chromosomes:​ A combinatorial problem in molecular biology]]. //​Algorithmica//​ **13**(1/​2):​52-76,​ 1995. 
 +  - S. Heber, J. Stoye, M. Frohme, J. Hoheisel, and M. Vingron. [[https://​doi.org/​10.1089/​106652700750050853|Contig selection in physical mapping]]. //J. Comp. Biol.// **7**(3/​4):​395-408,​ 2000.
 +
 +==== Genome assembly Ia: Overlap-Layout-Consensus - Clone-by-clone and whole-genome shotgun assembly ====
 +Lander-Waterman Statistics:
 +
 +  - E.S. Lander and M.S. Waterman, [[https://​doi.org/​10.1016/​0888-7543(88)90007-9|Genomic mapping by fingerprinting random clones: a mathematical analysis]]. //​Genomics//​ **2**:​231-239,​ 1988.
 +
 +Overlap-Layout-Consensus Assembly: There are many good description available online, just search for them.
 +
 +==== Genome assembly Ib: Re-sequencing,​ comparative (reference-based) assembly ====
 +A good introduction to comparative genome assembly is [1]. The main algorithmic challenge is to map millions of (most very short) sequence reads onto one or more referene geneome(s). Suitable mapping algorithms for this task are [[http://​bibiserv.cebitec.uni-bielefeld.de/​swift/​|SWIFT]] [2], [[http://​bowtie-bio.sourceforge.net/​index.shtml|Bowtie]] [6], ELAND (Cox, unpublished),​ [[http://​maq.sourceforge.net/​|MAQ]] [3], [[http://​rulai.cshl.edu/​rmap/​|RMAP]],​ [[http://​soap.genomics.org.cn/​|SOAP]] [4], [[http://​compbio.cs.toronto.edu/​shrimp/​|SHRiMP]],​ SeqMap [5], TAGGER [7], ZOOM [8], [[http://​bio-bwa.sourceforge.net/​bwa.shtml|BWA]] [9], GSNAP [10], SARUMAN [11], SSAHA2 [12], NextGenMap [13], etc.
 +
 +  - M. Pop, A. Phillippy, A. L. Delcher, and S. L. Salzberg. [[https://​doi.org/​10.1093/​bib/​5.3.237|Comparative genome assembly]]. //Briefings in Bioinformatics//​ **5**(3):​237-248,​ 2004. 
 +  - K. Rasmussen, J. Stoye, E. W. Myers. [[https://​doi.org/​10.1089/​cmb.2006.13.296|Efficient q-gram filters for finding all epsilon-matches over a given length]]. //J. Comp. Biol.// **13**(2):​296-308,​ 2006. 
 +  - H. Li, J. Ruan, R. Durbin. [[https://​doi.org/​10.1101/​gr.078212.108|Mapping short DNA sequencing reads and calling variants using mapping quality scores]]. //Genome Res.// **18**:​1851-1858,​ 2008. 
 +  - R. Li, Y. Li, K. Kristiansen,​ J. Wang. [[https://​doi.org/​10.1093/​bioinformatics/​btn025|SOAP:​ short oligonucleotide alignment program]]. //​Bioinformatics//​ **24**(5):​713-714,​ 2008. 
 +  - H. Jiang, W. H Wong. [[https://​doi.org/​10.1093/​bioinformatics/​btn429|SeqMap:​ mapping massive amount of oligonucleotides to the genome]]. //​Bioinformatics//​ **24**(20):​2395-2396,​ 2008. 
 +  - B. Langmead, C. Trapnell, M. Pop, S. Salzberg. [[https://​doi.org/​10.1186/​gb-2009-10-3-r25|Ultrafast and memory-efficient alignment of short DNA sequences to the human genome]]. //Genome Biol.// **10**:R25, 2009. 
 +  - C. Iseli, G. Ambrosini, P. Bucher, C. V. Jongeneel. [[https://​doi.org/​10.1371/​journal.pone.0000579|Indexing Strategies for Rapid Searches of Short Words in Genome Sequences]]. //PLoS ONE// **2**(6):​e579,​ 2007. 
 +  - H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, M. Li. [[https://​doi.org/​10.1093/​bioinformatics/​btn416|ZOOM! Zillions of oligos mapped]]. //​Bioinformatics//​ **24**(21): 2431-2437, 2008. 
 +  - H. Li, R. Durbin. [[https://​doi.org/​10.1093/​bioinformatics/​btp324|Fast and accurate short read alignment with Burrows–Wheeler transform]]. //​Bioinformatics//​ **25**(14): 1754-1760, 2009. 
 +  - T. Wu, S. Nacu. [[https://​doi.org/​10.1093/​bioinformatics/​btq057|Fast and SNP-tolerant detection of complex variants and splicing in short reads]]. //​Bioinformatics//​ **26**(7): 873–881, 2010. 
 +  - J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye, A. Goesmann. [[https://​doi.org/​10.1093/​bioinformatics/​btr151|Exact and complete short read alignment to microbial genomes using GPU programming]]. //​Bioinformatics//​ **27**(10): 1351-1358, 2011. 
 +  - Z. Ning, A.J. Cox. [[https://​doi.org/​10.1101/​gr.194201|SSAHA:​ A Fast Search Method for Large DNA Databases]]. //Genome Res.// **11**(10): 1725-1729, 2001. 
 +  - F. J. Sedlazeck, P. Rescheneder,​ A. von Haeseler. [[https://​doi.org/​10.1093/​bioinformatics/​btt468|NextGenMap:​ fast and accurate read mapping in highly polymorphic genomes]]. //​Bioinformatics//​ **29**(21): 2790-2791, 2013.
 +  - L. Oesper, A. Ritz, S. J. Aerni, R. Drebin, B. J. Raphael. [[https://​doi.org/​10.1186/​1471-2105-13-S6-S10|Reconstructing cancer genomes from paired-end sequencing data]]. //BMC Bioinformatics//​ **13**(Suppl. 6):S10, 2012. 
 +
 +==== Genome assembly IIa: De-novo assembly of HTS data ====
 +These methods are all very similar, using variants of de-Bruijn graphs plus additional tricks: EULER-SR [1], Velvet [2,3], MIRA2 ([[http://​www.chevreux.org/​projectsmira.html|Link]]),​ SSAKE [4],VCAKE [5], SHARCGS [6], Medvedev/​Brudno [7], ABySS ([[http://​www.bcgsc.ca/​platform/​bioinfo/​software/​abyss|Link]]),​ ALLPATHS [8], SOAPdenovo [9], IDBA [10], etc. A very good recent review paper is [11]. 
 +
 +  - M. J. Chaisson, P. A. Pevzner. [[https://​doi.org/​10.1101/​gr.7088808|Short read fragment assembly of bacterial genomes]]. //Genome Res.// **18**(2):​324-330,​ 2008. 
 +  - D. R. Zerbino, E. Birney. [[https://​doi.org/​10.1101/​gr.074492.107|Velvet:​ Algorithms for de novo short read assembly using de Bruijn graphs]]. //Genome Res.// **18**(5):​821-829,​ 2008. 
 +  - D. R. Zerbino, G. K. McEwen, E. H. Margulies, E. Birney. [[https://​doi.org/​10.1371/​journal.pone.0008407|Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler]]. //PLoS ONE// **4**(12):​e8407,​ 2009. 
 +  - R. L. Warren, G. G. Sutton, S. J. M. Jones, R. A. Holt. [[https://​doi.org/​10.1093/​bioinformatics/​btl629|Assembling millions of short dna sequences using ssake]]. //​Bioinformatics//​ **23**(4):​500-501,​ 2006. 
 +  - W. R. Jeck, J. A. Reinhardt, D. A. Baltrus, M. T. Hickenbotham,​ V. Magrini, E. R. Mardis, J. L. Dang, C. D. Jones. [[https://​doi.org/​10.1093/​bioinformatics/​btm451|Extending assembly of short dna sequences to handle error]]. //​Bioinformatics//​ **23**(21):​2942-2944,​ 2007. 
 +  - J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. [[https://​doi.org/​10.1101/​gr.6435207|Sharcgs,​ a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing]]. //Genome Res.// **17**(11):​1697-1706,​ 2007. 
 +  - P. Medvedev, M. Brudno. [[https://​doi.org/​10.1007/​978-3-540-78839-3_5|Ab initio whole genome shotgun assembly with mated short reads]]. //Proc. RECOMB 2008// pages 50-64, 2008. 
 +  - J. Butler, I. Maccallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, D. B. Jaffe. [[https://​doi.org/​10.1101/​gr.7337908|Allpaths:​ De novo assembly of whole-genome shotgun microreads]]. //Genome Res.// **18**(5):​810-820,​ 2008. 
 +  - R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen,​ S. Li, H. Yang, J. Wang, J. Wang. [[https://​doi.org/​10.1101/​gr.097261.109|De novo assembly of human genomes with massively parallel short read sequencing]]. //Genome Res.// **20**(2):​265-272,​ 2010. 
 +  - Y. Peng, H. C. M. Leung, S. M. Yiu, F. Y. L. Chin. [[https://​doi.org/​10.1007/​978-3-642-12683-3_28|IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler]]. Proceedings of RECOMB 2010, LNBI 6044, 426-440, 2010. 
 +  - J. R. Miller, S. Koren, G. Sutton. [[https://​doi.org/​10.1016/​j.ygeno.2010.03.001|Assembly algorithms for next-generation sequencing data]]. //​Genomics//​ in press, 2010. 
 +  - Phillip E. C. Compeau, Pavel A. Pevzner, Glenn Tesler. [[https://​doi.org/​10.1038/​nbt.2023|How to apply de Bruijn graphs to genome assembly]]. //Nature biotechnology//,​ **29**, 987-991, 2011.
 +
 +==== Genome assembly IIb: Hybrid/long read assembly ====
 +  - A. C. English, S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin, D. M. Muzny, J. G. Reid, K. C. Worley, R. A. Gibbs. [[https://​doi.org/​10.1371/​journal.pone.0047768|Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology]]. //PLoS ONE// **7**(11): e47768, 2012.
 +  - S. Koren, M. C. Schatz, B. P. Walenz, J. Martin, J. T. Howard, G. Ganapathy, Z. Wang, D. A. Rasko, W. R. McCombie, E. D. Jarvis, A. M. Phillippy. [[https://​doi.org/​10.1038/​nbt.2280|Hybrid error correction and de novo assembly of single-molecule sequencing reads]]. //Nature Biotechnolopgy//​ **30**:​693-700,​ 2012.
 +  - C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner, J. Korlach. [[https://​doi.org/​10.1038/​nmeth.2474|Nonhybrid,​ finished microbial genome assemblies from long-read SMRT sequencing data]]. //Nature Methods// **10**:​563-569,​ 2013.
 +  - G. Myers. [[https://​doi.org/​10.1007/​978-3-662-44753-6_5|Efficient Local Alignment Discovery amongst Noisy Long Reads]]. //​Proceedings of WABI 2014//, LNBI 8701, 52-67, 2014.
 +  - F. J. Sedlazeck, P. Rescheneder,​ M. Smolka, H. Fang, M. Nattestad, A. von Haeseler, M. C. Schatz. [[https://​doi.org/​10.1038/​s41592-018-0001-7|Accurate detection of complex structural variations using single molecule sequencing]]. //Nat. Methods// **15**(6): 461–468, 2018.
 +  - E. Haghshenas, H. Asghari, J. Stoye, C. Chauve, F. Hach. [[https://​doi.org/​10.1016/​j.isci.2020.101389|HASLR:​ Fast Hybrid Assembly of Long Reads]]. //​iScience//​ **23**(8): 101389, 2020.
 +
 +String graph assembly for diploid genomes with long reads is explained on the following [[https://​s3.amazonaws.com/​files.pacb.com/​pdf/​String+Graph+Assembly+For+Diploid+Genomes+with+Long+Reads.pdf|poster by PacBio]]. ​
 +
 +Pre-processing (correction of long reads):
 +
 +  - L. Salmela and E. Rivals. [[https://​doi.org/​10.1093/​bioinformatics/​btu538|LoRDEC:​ accurate and efficient long read error correction]]. //​Bioinformatics//​ **30**(24):​3506–3514,​ 2014.
 +  - L. Salmela, R. Walve, E. Rivals and E. Ukkonen. [[https://​doi.org/​10.1093/​bioinformatics/​btw321|Accurate self-correction of errors in long reads using de Bruijn graphs]]. //​Bioinformatics//,​ **33**(6):​799–806,​ 2017.
 + 
 +==== Genome annotation I: Basic techniques ====
 +A general introduction to HMMs in Bioinformatics is in the textbook by Durbin//et al.// [1]. Covariance models were introduced in [2]. A similar concept was developed in [3]. 
 +
 +  - R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998. 
 +  - S. R. Eddy, R. Durbin. [[https://​doi.org/​10.1093/​nar/​22.11.2079|RNA sequence analysis using covariance models]]. //Nucleic Acids Res.// **22**(11):​2079-2088,​ 1994. 
 +  - Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjølander, R. C. Underwood, D. Haussler. [[https://​doi.org/​10.1093/​nar/​22.23.5112|Stochastic context-free grammars for tRNA modeling]]. //Nucleic Acids Res.// **22**(23):​5112-5120,​ 1994.
 +
 +==== Genome annotation II: Finding RNA genes ====
 +Here are several references, in chronological order: ​
 +
 +  - R. Staden. [[http://​www.pubmedcentral.nih.gov/​picrender.fcgi?​artid=327313&​amp;​blobtype=pdf|A computer program to search for tRNA genes]]. //Nucleic Acids Res.// **8**:​817-825,​ 1980. 
 +  - G. A. Fichant and C. Burks. [[https://​doi.org/​10.1016/​0022-2836(91)90108-I|Identifying potential tRNA genes in genomic DNA sequences]]. //J. Mol. Biol.// **220**:​659-671,​ 1991. 
 +  - A. Pavesi, F. Conterio, A. Bolchi, G. Dieci, and S. Ottonello. [[https://​doi.org/​10.1093/​nar/​22.7.1247|Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions]]. //Nucleic Acids Res.// **22**:​1247-1256,​ 1994. 
 +  - S. R. Eddy and R. Durbin. [[https://​doi.org/​10.1093/​nar/​22.11.2079|RNA sequence analysis using covariance models]]. //Nucleic Acids Res.// **22**(11):​2079-2088,​ 1994. 
 +  - T. M. Lowe and S. R. Eddy. [[https://​doi.org/​10.1093/​nar/​25.5.0955|tRNAscan-SE:​ A program for improved detection of transfer RNA genes in genomic sequence]]. //Nucleic Acids Res.// **25**:​955-964,​ 1997.
 +
 +==== Genome annotation III: Finding protein-coding genes ====
 +Here are several references, in chronological order. First, for **prokaryotes** ...
 +
 +  - M. Borodovsky and J. McIninch. [[https://​doi.org/​10.1016/​0097-8485(93)85004-V|Genmark:​ Parallel gene recognition for both DNA strands]]. //Computers Chem.// **17**(2):​123-133,​ 1993. 
 +  - A. V. Lukashin and M. Borodovsky. [[https://​doi.org/​10.1093/​nar/​26.4.1107|GeneMark.hmm:​ New solutions for gene finding]]. //Nucleic Acids Res.// **26**(4):​1107-1115,​ 1998. 
 +  - S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White. [[https://​doi.org/​10.1093/​nar/​26.2.544|Microbial gene identification using interpolated Markov models]]. //Nucleic Acids Res.// **26**(2):​544-548,​ 1998. 
 +  - A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg. [[https://​doi.org/​10.1093/​nar/​27.23.4636|Improved microbial gene identification with GLIMMER]]. //Nucleic Acids Res.// **27**(23):​4636-3641,​ 1999. 
 +  - D. Frishman, A. Mironov, H.-W. Mewes, and M. Gelfand. [[https://​doi.org/​10.1093/​nar/​26.12.2941|Combining diverse evidence for gene recognition in completely sequenced bacterial genomes]]. //Nucleic Acids Res.// **26**(12):​2941-2947,​ 1998. 
 +  - S. S. Hannenhalli,​ W. S. Hayes, A. G. Hatzigeorgiou,​ and J. W. Fickett. [[https://​doi.org/​10.1093/​nar/​27.17.3577|Bacterial start site prediction]]. //Nucleic Acids Res.// **27**(17):​3577-3582,​ 1999. 
 +  - J. H. Badger and G. J. Olsen. [[http://​mbe.oxfordjournals.org/​content/​16/​4/​512|CRITICA:​ Coding region identification tool invoking comparative analysis]]. //Mol. Biol. Evol.// **16**(4):​512-524,​ 1999. 
 +  - T. S. Larsen and A. Krogh. [[https://​doi.org/​10.1186/​1471-2105-4-21|EasyGene ​ a prokaryotic gene finder that ranks ORFs by statistical significance]]. //BMC Bioinformatics//​ **4**(21), 2003. 
 +  - L. Krause, A. McHardy, T. Nattkemper, A. Pühler, J. Stoye, F. Meyer. [[https://​doi.org/​10.1093/​nar/​gkl1083|GISMO - gene identification using a support vector machine for ORF classification]]. //Nucleic Acids Res.// **35**(2):​540-549,​ 2006.
 + 
 +... and now for **eukaryotes**:​
 +
 +  - G. Stormo. [[https://​doi.org/​10.1016/​0076-6879(90)83015-2|Consensus patterns in DNA]]. In R. F. Doolittle, editor, Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, //Meth. Enzymol.// **volume 183** chapter 13, pages 211-221. Academic Press, San Diego, CA, 1990. 
 +  - R. Guigó, S. Knudsen, N. Drake, and T. Smith. [[https://​doi.org/​10.1016/​0022-2836(92)90130-C|Prediction of gene structure]]. //J. Mol. Biol.// **226**(1):​141-157,​ 1992. 
 +  - E. C. Uberbacher and R. J. Mural. [[http://​www.pnas.org/​content/​88/​24/​11261|Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach]]. //Proc. Natl. Acad. Sci. USA// **88**(24):​11261-11265,​ 1991. 
 +  - E. C. Uberbacher, Y. Xu, and R. J. Mural. [[https://​doi.org/​10.1016/​S0076-6879(96)66018-2|Discovering and understanding genes in human DNA sequence using GRAIL]]. In R. F. Doolittle, editor, Computer Methods for Macromolecular Sequence Analysis, //Meth. Enzymol.// **volume 266** chapter 28, pages 259-281. Academic Press, San Diego, CA, 1996. 
 +  - S. L. Salzberg. [[https://​doi.org/​10.1093/​bioinformatics/​13.4.365|A method for identifying splice sites and translation start sites in eukaryotic mRNA]]. //CABIOS// **13**(4):​365-376,​ 1997. 
 +  - M. G. Reese, F. H. Eeckman, D. Kulp, and D. Haussler. [[https://​doi.org/​10.1089/​cmb.1997.4.311|Improved splice site detection in Genie]]. //J. Comp. Biol.// **4**(3):​311-323,​ 1997. 
 +  - C. Burge and S. Karlin. [[https://​doi.org/​10.1006/​jmbi.1997.0951|Prediction of complete gene structures in human genomic DNA]]. //J. Mol. Biol.// **268**(1):​78-94,​ 1997. 
 +  - G. Parra, E. Blanco, and R. Guigó. [[https://​doi.org/​10.1101/​gr.10.4.511|GeneID in Drosophila]]. //Genome Res.// **10**(4):​511-515,​ 2000. 
 +  - M. G. Reese, D. Kulp, H. Tammana, and D. Haussler. [[https://​doi.org/​10.1101/​gr.10.4.529|Genie - gene finding in Drosophila melanogaster]]. //Genome Res.// **10**:​529-538,​ 2000. 
 +  - M. Pertea and S. L. Salzberg. [[https://​doi.org/​10.1023/​A:​1013770123580|Computational gene finding in plants]]. //Plant Mol. Biol.// **48**(1-2):​39-48,​ 2002.
 +
 +==== Genome annotation IV: Patterns in non-coding regions ====
 +This part is almost completely based on the work of Mathieu Blanchette and co-authors: ​
 +
 +  - M. Blanchette, B. Schwikowski,​ and M. Tompa. [[https://​doi.org/​10.1089/​10665270252935421|Algorithms for phylogenetic footprinting]]. //J. Comp. Biol.// **9**(2):​211-223,​ 2002. 
 +  - M. Blanchette and M. Tompa. [[https://​doi.org/​10.1101/​gr.6902|Discovery of regulatory elements by a computational method for phylogenetic footprinting]]. //Genome Res.// **12**:​739-748,​ 2002.
 +
 +==== Genome annotation V: Functional genome annotation ====
 +All bioinformatics textbooks contain good overviews of sequence analysis by Smith-Waterman and fast heuristic methods for database search like FASTA and BLAST. Here are some references to the original papers, including one by ourselves: ​
 +
 +  - T. F. Smith and M. S. Waterman. [[https://​doi.org/​10.1016/​0022-2836(81)90087-5|Identification of common molecular subsequences]]. //J. Mol. Biol.// **147**(1):​195-197,​ 1981. 
 +  - W. R. Pearson. [[https://​doi.org/​10.1016/​0076-6879(90)83007-V|Rapid and sensitive sequence comparison with FASTP and FASTA]]. In R. F. Doolittle, editor, Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, //Meth. Enzymol.// **volume 183** chapter 5, pages 63-98. Academic Press, San Diego, CA, 1990. 
 +  - S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. [[https://​doi.org/​10.1016/​S0022-2836(05)80360-2|Basic local alignment search tool]]. //J. Mol. Biol.// **215**(3):​403-410,​ 1990. 
 +  - S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. [[https://​doi.org/​10.1093/​nar/​25.17.3389|Gapped BLAST and PSI-BLAST: A new generation of protein database search programs]]. //Nucleic Acids Res.// **25**(17):​3389-3402,​ 1997. 
 +  - K. Rasmussen, J. Stoye, and E. W. Myers. [[https://​doi.org/​10.1089/​cmb.2006.13.296|Efficient q-gram filters for finding all epsilon-matches over a given length]]. //J. Comp. Biol.// **13**(2):​296-308,​ 2006.
 +
 +==== Transcriptomics I: mRNA analysis, RNA-Seq ====
 +A few papers on EST clustering and splicing graphs: ​
 +
 +  - S. Heber, M. Alekseyev, S.-H. Sze, H. Tang, P. A. Pevzner. [[https://​doi.org/​10.1093/​bioinformatics/​18.suppl_1.S181
 +|Splicing graphs and EST assembly problem]]. //​Bioinformatics//​ **18**(Suppl. 1):​S181-S188,​ 2002. 
 +  - M. Sammeth, S. Foissac, R. Guigó. [[https://​doi.org/​10.1371/​journal.pcbi.1000147|A general definition and nomenclature for alternative splicing events]]. //PLoS Comput. Biol.// **4**(8):​e1000147,​ 2008. 
 +  - V. Lacroix, M. Sammeth, R. Guigó, A. Bergeron. [[https://​doi.org/​10.1007/​978-3-540-87361-7_5|Exact Transcriptome Reconstruction from Short Sequence reads]]. Proc. WABi 2008, LNBI 5251, 50-63, 2008. 
 +
 +Very good review about NGS transcriptomics (RNA-seq): ​
 +
 +  - J. A. Martin, Z. Wang. [[https://​doi.org/​10.1038/​nrg3068|Next-generation transcriptome assembly]]. Nat. Rev. Genet., **12**(10), 671–682, 2011.  ​
 +
 +Examples for software supporting RNA-seq: ​
 +
 +  - C. Trapnell, L. Pachter, S. Salzberg. [[https://​doi.org/​10.1093/​bioinformatics/​btp120|TopHat:​ discovering splice junctions with RNA-Seq]]. //​Bioinformatics//​ **25**(9): 1105-1111, 2009. 
 +  - K. Wang, D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, P. Mieczkowski,​ S. A. Grimm, C. M. Perou, J. N. MacLeod, D. Y. Chiang, J. F. Prins, J. Liu. [[https://​doi.org/​10.1093/​nar/​gkq622|MapSplice:​ Accurate mapping of RNA-seq reads for splice junction discovery]]. //Nucleic Acids Res.// **38**(18): e178, 2010.
 +
 +==== Tanscriptomics II: DNA microarray design ====
 +Just a few in-house papers, because they are so nice: 
 +
 +  - S. Rahmann. [[https://​doi.org/​10.1142/​S0219720003000125|Fast large scale oligonucleotide selection using the longest common factor approach]]. //J. Bioinf. Comput. Biol.// **1**(2):​343-361,​ 2003. 
 +  - S. Rahmann. [[https://​doi.org/​10.1093/​bioinformatics/​btg1073|The shortest common supersequence problem in a microarray production setting]]. //​Bioinformatics//​ **19**(Suppl. 2):156-161, 2003. (Proceedings of ECCB 2003). ​
 +  - S.A. de Carvalho Jr. and S. Rahmann. [[https://​doi.org/​10.1007/​11851561_30|Improving the layout of oligonucleotide microarrays:​ Pivot Partitioning]]. In Proc. 6th Workshop of Algorithms in Bioinformatics //LNBI// volume **4175** pages 321-332. Springer, 2006. 
 +  - S.A. de Carvalho Jr. and S. Rahmann. [[http://​subs.emis.de/​LNI/​Proceedings/​Proceedings83/​article7005.html|Microarray layout as a quadratic assignment problem]]. In Proc. German Conference on Bioinformatics,​ volume P-83 of Lecture Notes in Informatics,​ pages 11-20. GI, 2006.
 +
 +==== Tanscriptomics III: DNA microarray analysis ====
 +A couple of papers and one book: 
 +
 +  - W. Huber, A. von Heydebreck, H. Sueltmann, A. Poustka, and M. Vingron. [[https://​doi.org/​10.1093/​bioinformatics/​18.suppl_1.S96|Variance stabilization applied to microarray data calibration and to the quantification of differential expression]]. //​Bioinformatics//​ **18**(Suppl. 1):96-104, 2002. 
 +  - T. P. Speed, editor. [[https://​www.crcpress.com/​Statistical-Analysis-of-Gene-Expression-Microarray-Data/​Speed/​p/​book/​9781584883272|Statistical Analysis of Gene Expression Microarray Data]]. Chapman and Hall/CRC, 2003. 
 +  - G. K. Smyth. [[https://​doi.org/​10.2202/​1544-6115.1027|Linear models and empirical bayes methods for assessing differential expression in microarray experiments]]. //Stat. Appl. Genet. Mol. Biol.// **3**(1):​1-25,​ 2004. 
 +  - C. Steinhoff and M. Vingron. [[https://​doi.org/​10.1093/​bib/​bbl002|Normalization and quantification of differential expression in gene expression microarrays]]. //Brief. Bioinform.//​ **7**(2):​166-177,​ 2006.
 +
 +==== RNA secondary structure, structure space, RNA-based regulation ====
 +The most classical algorithm for RNA secondary structure prediction is Nussinov'​s algorithm:
 +
 +  - R. Nussinov, G. Pieczenik, J. R. Griggs, D. J. Kleitman. [[https://​doi.org/​10.1137/​0135006|Algorithms for Loop Matchings]]. //SIAM J. Appl. Math.// **35**(1):​68–82,​ 1978.
 +  - R. Giegerich, B. Voß, M. Rehmsmeier. [[https://​doi.org/​10.1093/​nar/​gkh779|Abstract shapes of RNA]]. //Nucleic Acids Res.// **32**(16):​4843-4851,​ 2004.
 +
 +==== Proteomics I: 2D gels ====
 +A survey of various techniques for two-dimensional gel alignment is [1]. The method described in the lecture is from [2], [3]. 
 +
 +  - A. W. Dowsey, M. J. Dunn, and G.-Z. Yang. [[https://​doi.org/​10.1002/​pmic.200300459|The role of bioinformatics in two-dimensional gel electrophoresis]]. //​Proteomics//​ **3**(8):​1567-1596,​ 2003. 
 +  - F. A. Potra, X. Liu, F. Seillier-Moiseiwitsch,​ A. Roy, Y. Hang, M. R. Marten, B. Raman, and C. Whisnant. [[https://​doi.org/​10.1089/​cmb.2006.13.614|Protein image alignment via piecewise affine transformations]]. //J. Comp. Biol.// **13**(3):​614-630,​ 2006. 
 +  - F. A. Potra, X. Liu. [[https://​doi.org/​10.1089/​cmb.2006.13.1384|Aligning Families of Two-Dimensional Gels by a Combined Multiresolution Forward-Inverse Transformation Approach]]. //J. Comp. Biol.// **13**(7):​1384-1395,​ 2006. 
 +  - Andrew W. Dowsey, Michael J. Dunn and Guang-Zhong Yang. [[https://​doi.org/​10.1093/​bioinformatics/​btn059|Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline]]. //​Bioinformatics//​ **24**(7):​950-957,​ 2008.
 +
 +==== Proteomics II: Analysis of mass spectra ====
 +The algorithms discussed include: de-novo protein sequencing by mass spectra [1,2], the money changing problem [3], alignment of time-series of mass spectra [4]. 
 +
 +  - T Chen, Ming-Yang Kao, M Tepel, J Rush, and G M Church. [[https://​doi.org/​10.1089/​10665270152530872|A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry]]. //J. Comput. Biol.// **8**(3):​325-337,​ 2001. 
 +  - Lijuan Mo, Debojyoti Dutta, Yunhu Wan, and Ting Chen. [[https://​doi.org/​10.1021/​ac070039n|MSNovo:​ A Dynamic Programming Algorithm for de Novo Peptide Sequencing via Tandem Mass Spectrometry]]. //Anal. Chem.// **79**(13): 4870–4878,​ 2007. 
 +  - Sebastian Böcker and Zsuzsanna Lipták. [[https://​doi.org/​10.1007/​s00453-007-0162-8|A fast and simple algorithm for the money changing problem]]. //​Algorithmica//​ **48**(4):​413-432,​ 2007. 
 +  - Amol Prakash, Parag Mallick, Jeffrey Whiteaker, Heidi Zhang, Amanda Paulovich, Mark Flory, Hookeun Lee, Ruedi Aebersold, and Benno Schwikowski. [[https://​doi.org/​10.1074/​mcp.M500133-MCP200|Signal maps for mass spectrometry-based comparative proteomics]]. //Mol Cell Proteomics//​ **5**(3):​423-432,​ 2006. 
 +
 +==== Computational Metabolomics ====
 +Some software developed in Bielefeld has been published here: 
 +
 +  - H. Neuweger, S. P. Albaum, M. Dondrup, M. Persicke, T. Watt, K. Niehaus, J. Stoye, A. Goesmann. [[https://​doi.org/​10.1093/​bioinformatics/​btn452|MeltDB:​ A software Platform for the Analysis and Integration of Metabolomics Experiment Data]]. //​Bioinformatics//​ **24**(23):​2726-2732,​ 2008.
 +  - N. Hoffmann, J. Stoye. [[https://​doi.org/​10.5772/​31224|Generic Software Frameworks for GC-MS Based Metabolomics]]. In: U. Roessner (ed.): Metabolomics. Chapter 4, pp. 73-98. InTech, 2012.
 +
 +==== Computational Systems Biology: Networks, their models and algorithms ====
 +A textbook that covers this (and much more) is: 
 +
 +  - E. Klipp, R. Herwig, A. Kowald, C. Wierling, H. Lehrach. [[https://​doi.org/​10.1002/​3527603603|Systems Biology in Practice - Concepts, Implementation and Application]]. Wiley-VCH, 2005. 
 +
 +==== Computational pangenomics ====
 +The gene based method is considered here (for example):
 +
 +  - H. Tettelin et al. [[https://​doi.org/​10.1073/​pnas.0506758102|Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implicationsfor the microbial ‘‘pan-genome’’]]. //Proc. Natl. Academy. Sci. USA// **102**(39):​ 13950-13955,​ 2005.
 +  - J. Blom, S. P. Albaum, D. Doppmeier, A. Pühler, F.-J. Vorhölter, M. Zakrzewski, and A. Goesmann. [[https://​doi.org/​10.1186/​1471-2105-10-154|EDGAR:​ A software framework for the comparative analysis of prokaryotic genomes]]. //BMC Bioinformatics//​ 10:154, 2009.
 +  - J. Blom,  J. Kreis, ​ S. Spänig, ​ T. Juhre, ​ C. Bertelli, C. Ernst, and A. Goesmann. [[https://​doi.org/​10.1093/​nar/​gkw255|EDGAR 2.0: an enhanced software platform for comparative gene content analyses]]. //Nucleic Acids Res.// **44**(W1):​W22–W28,​ 2016.
 +  - J. Blom, S. P. Glaeser, T. Juhre, J. Kreis, P. H. G. Hanel, J. G. Schrader, P. Kämpfer, and A. Goesmann. [[https://​doi.org/​10.1002/​9781118960608.bm00038|EDGAR:​ A Versatile Tool for Phylogenomics]]. In: W. B. Whitman (ed.). Bergey'​s Manual of Systematics of Archaea and Bacteria, Wiley, 2019.
 +
 +A good overview of genome-based computational pangenomics gives the following review paper:
 +
 +  - The Computational Pan-Genomics Consortium. [[https://​doi.org/​10.1093/​bib/​bbw089|Computational pan-genomics:​ status, promises and challenges]]. //Brief. Bioinf.// **19**(1), 118–135, 2018.
 +
 +Some more specialized papers are the following.
 +
 +(A) Data structures
 +  - B. Paten, D. Earl, N. Nguyen, M. Diekhans, D. Zerbino, D. Haussler. [[https://​doi.org/​10.1101/​gr.123356.111|Cactus:​ Algorithms for genome multiple sequence alignment]]. //Genome Research// **21**, 1512–1528,​ 2011
 +  - C. Ernst, S. Rahmann. [[https://​drops.dagstuhl.de/​opus/​volltexte/​2013/​4231/​pdf/​p035-ernst.pdf|PanCake:​ A Data Structure for Pangenomes]]. Proc. of //GCB 2013//, 35-45, 2013.
 +  - G. Holley, R. Wittler, and J. Stoye. [[https://​doi.org/​10.1186/​s13015-016-0066-8 |Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage]]. //​Algorithms Mol. Biol.// **11**: 3, 2016.
 +  -  E. Garrison, J. Sirén, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F Lin, B. Paten, and R. Durbin. [[https://​doi.org/​10.1038/​nbt.4227|Variation graph toolkit improves read mapping by representing genetic variation in the reference]]. //Nat. Biotechnol.//​ **36**, 875–879, 2018.
 +  - G. Holley and P. Melsted. [[https://​doi.org/​10.1186/​s13059-020-02135-8|Bifrost:​ highly parallel construction and indexing of colored and compacted de Bruijn graphs]]. //Genome Biol.// **21**: 249, 2020.
 +
 +(B) Sequence-to-graph mapping/​alignment
 +  - M. Rautiainen, V. Mäkinen, and T. Marschall. [[https://​doi.org/​10.1093/​bioinformatics/​btz162|Bit-parallel sequence-to-graph alignment]]. //​Bioinformatics//​ **35**(19), 3599-3607, 2019. 
 +  -  R. Martiniano, E. Garrison, E. R. Jones, A. Manica, and R. Durbin. [[ https://​doi.org/​10.1186/​s13059-020-02160-7|Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph]]. //Genome Biol.// **21**: 250, 2020.
 +  - M. Rautiainen and T. Marschall. [[https://​doi.org/​10.1186/​s13059-020-02157-2|GraphAligner:​ rapid and versatile sequence-to-graph alignment]]. //Genome Biol.// **21**: 253, 2020.
 +  - A. Kuhnle, T. Mun, C. Boucher, T. Gagie, B. Langmead, and G. Manzini. [[https://​doi.org/​10.1089/​cmb.2019.0309|Efficient Construction of a Complete Index for Pan-Genomics Read Alignment]]. //J. Comp. Biol.// **27**(4), 500-513, 2020.
 +  -  N. Luhmann, G. Holley, and M. Achtman. [[https://​doi.org/​10.1101/​2020.01.21.914168|BlastFrost:​ Fast querying of 100,000s of bacterial genomes in Bifrost graphs]]. //​BioRxiv//,​ 2020.
 +  -  T. Schulz, R. Wittler, S. Rahmann, F. Hach, and J. Stoye. [[https://​doi.org/​10.1093/​bioinformatics/​btab077|Detecting High Scoring Local Alignments in Pangenome Graphs]]. //​Bioinformatics//​ **37**(16), 2266–2274,​ 2021.
 +
 +(C) Phylogenomics:​
 +
 +  - R. Wittler. [[https://​doi.org/​10.1186/​s13015-020-00164-3|Alignment- and reference-free phylogenomics with colored de Bruijn graphs]]. //​Algorithms Mol. Biol.// **15**: 4, 2020.
 +  - A. Rempel, R. Wittler. [[https://​doi.org/​10.1093/​bioinformatics/​btab444|SANS serif: alignment-free,​ whole-genome-based phylogenetic reconstruction]]. //​Bioinformatics//​ **37**(24), 4868-4870, 2021.
 +
 +(D) Haplotype inference:
 +
 +See [[#​haplotype_inference|below]].
 +
 +==== Comparative genomics I: Genome alignment, repeat analysis ====
 +Some relevant papers in this area are: 
 +
 +  - A. L. Delcher, S. Kasif, R. D. Fleischmann,​ J. Peterson, O. White, and S. L. Salzberg. [[https://​doi.org/​10.1093/​nar/​27.11.2369|Alignment of whole genomes]]. //Nucleic Acids Res.// **27**(11):​2369-2376,​ 1999. 
 +  - S. Schwartz, Z. Zhang, K. A. Fraser, A. Smit, C. Riemer, J. Bouck, R. Gibson, R. Hardisson, and W. Miller. [[https://​doi.org/​10.1101/​gr.10.4.577|PipMaker-A web server for aligning two genomic DNA sequences]]. //Genome Res.// **10**(4):​577-586,​ 2000. 
 +  - M. Höhl, S. Kurtz, and E. Ohlebusch. [[https://​doi.org/​10.1093/​bioinformatics/​18.suppl_1.S312|Efficient multiple genome alignment]]. //​Bioinformatics//​ **18**(Suppl. 1):​S312-S320,​ 2002. (Proceedings of ISMB 2002). ​
 +  - S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, Shumway M., C. Antonescu, and S. L. Salzberg. [[https://​doi.org/​10.1186/​gb-2004-5-2-r12|Versatile and open software for comparing large genomes]]. //Genome Biol.// **5**:R 12, 2004. 
 +  - A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. [[https://​doi.org/​10.1101/​gr.2289704|Mauve:​ Multiple alignment of conserved genomic sequence with rearrangements]]. //Genome Res.// **14**(7):​1394-1403,​ 2004. 
 +  - E. Ohlebusch, M. I. Abouelhoda. [[https://​doi.org/​10.1201/​9781420036275.ch15|Chaining Algorithms and Applications in Comparative Genomics]]. In: Handbook of Computational Molecular Biology (Chapter 15), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006. 
 +  - S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher,​ J. Stoye, and R. Giegerich. [[https://​doi.org/​10.1093/​nar/​29.22.4633|REPuter:​ the manifold applications of repeat analysis on a genomic scale]]. //Nucleic Acids Res.// **29**(22):​4643-4653,​ 2001.
 +
 +==== Comparative genomics II: Genome rearrangement ====
 +The classical papers are by Hannenhalli and Pevzner (1999 and 1995). A much better readable description for the reversal distance can be found in the book chapter by Bergeron //et al.// (2005). The general DCJ model is described in (Yancopoulos //et al.//, 2005) and (much nicer) in (Bergeron //et al.//, 2006a). A correct algorithm for sorting by translocations is given in (Bergeron //et al.//, 2006b), an almost correct one for the HP distance in (Bergeron //et al.//, 2009), one formula of which is corrected in (Erdős et al., 2011). A very good overview of median, halving and guided halving results can be found in (Tannier et al., 2009). ​
 +
 +  - S. Hannenhalli and P. A. Pevzner. [[https://​doi.org/​10.1145/​300515.300516|Transforming cabbage into turnip: Polynomial algorithm for sorting signed permutations by reversals]]. //J. ACM// **46**(1):​127,​ 1999. 
 +  - S. Hannenhalli and P. A. Pevzner. [[https://​doi.org/​10.1109/​SFCS.1995.492588|Transforming men into mice (polynomial algorithm for genomic distance problem)]]. In //Proc. 36th Annu. Symp. Found. Comput. Sci.// (FOCS 1995), pages 581-592. IEEE Press, 1995.
 +  - A. Bergeron. [[https://​doi.org/​10.1007/​3-540-48194-X_9|A very elementary presentation of the Hannenhalli-Pevzner theory]]. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM 2001), volume 2089 of LNCS, pages 106-117, 2001.
 +  - A. Bergeron, J. Mixtacki, and J. Stoye. The inversion distance problem. In O. Gascuel, editor, [[https://​global.oup.com/​academic/​product/​mathematics-of-evolution-and-phylogeny-9780198566106?​q=9780198566106&​lang=en&​cc=de|Mathematics of Evolution and Phylogeny]],​ chapter 10, pages 262-290. Oxford University Press, Oxford, UK, 2005. 
 +  - S. Yancopoulos,​ O. Attie, and R. Friedberg. [[https://​doi.org/​10.1093/​bioinformatics/​bti535|Efficient sorting of genomic permutations by translocation,​ inversion and block interchange]]. //​Bioinformatics//​ **21**(16):​3340-3346,​ 2005. 
 +  - A. Bergeron, J. Mixtacki, and J. Stoye. [[https://​doi.org/​10.1007/​11851561_16|A unifying view of genome rearrangements]]. In: Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI 2006), volume 4175 of LNBI, pages 163-173, 2006. 
 +  - A. Bergeron, J. Mixtacki, and J. Stoye. [[https://​doi.org/​10.1089/​cmb.2006.13.567|On sorting by translocations]]. //J. Comp. Biol.// **13**(2):​567-578,​ 2006. 
 +  - A. Bergeron, J. Mixtacki, and J. Stoye. [[https://​doi.org/​10.1016/​j.tcs.2009.09.008|A new linear time algorithm to compute the genomic distance via the double cut and join distance]]. //Theor. Comput. Sci.// **410**(51):​5300-5316,​ 2009. 
 +  - P. L. Erdős, L. Soukup, J. Stoye. [[https://​doi.org/​10.1016/​j.aml.2010.08.021|Balanced vertices in trees and a simpler algorithm to compute the genomic distance]]. //Appl. Math. Lett.// **24**:​82-86,​ 2011. 
 +  - E. Tannier, C. Zheng, D. Sankoff. [[https://​doi.org/​10.1186/​1471-2105-10-120|Multichromosomal median and halving problems under different genomic distances]]. //BMC Bioinformatics//​ **10**:120, 2009.
 +
 +==== Comparative genomics III: Gene clusters ====
 +The following are the algorithmic papers in this area. Apart from that, many papers on applications of gene clusters and statistical properties exist, but are not listed here. 
 +
 +(a.) Common intervals of permutations:​
 +  - T. Uno and M. Yagiura. [[https://​doi.org/​10.1007/​s004539910014|Fast algorithms to enumerate all common intervals of two permutations]]. //​Algorithmica//​ **26**(2):​290-309,​ 2000. 
 +  - S. Heber and J. Stoye. [[https://​doi.org/​10.1007/​3-540-48194-X_19|Finding all common intervals of k permutations]]. In A. Amir and G. Landau, editors, Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, //CPM 2001//, volume 2089 of LNCS, pages 207-218, Berlin, 2001. Springer Verlag. ​
 +  - S. Heber and J. Stoye. [[https://​doi.org/​10.1007/​3-540-44696-6_20|Algorithms for finding gene clusters]]. In O. Gascuel and B. Moret, editors,​Proceedings of the First International Workshop on Algorithms in Bioinformatics,​ //WABI 2001//, volume 2149 of LNCS, pages 252-263, Berlin, 2001. Springer Verlag.
 +  - A. Bergeron, S. Corteel, and M. Raffinot. [[https://​doi.org/​10.1007/​3-540-45784-4_36|The algorithmic of gene teams]]. In R. Guigó and D. Gusfield,​editors,​ Proceedings of the Second International Workshop on Algorithms in Bioinformatics,​ //WABI 2002//, volume 2452 of LNCS, pages 464-476, Berlin, 2002. Springer Verlag. ​
 +  - N. Luc, J.-L. Risler, A. Bergeron, and M. Raffinot. [[https://​doi.org/​10.1016/​S1476-9271(02)00097-X|Gene teams: a new formalization of gene clusters for comparative genomics]]. //Comp. Biol. Chem.// **27**:​59-67,​ 2003. 
 +  - G. M. Landau, L. Parida, and O. Weimann. [[https://​doi.org/​10.1089/​cmb.2005.12.1289|Gene proximity analysis across whole genomes via PQ tree]]. //J. Comp. Biol.// **12**(10):​1289–1306,​ 2005.
 +  - A. Bergeron, C. Chauve, F. de Montgolfier,​ and M. Raffinot. [[https://​doi.org/​10.1137/​060651331|Computing common intervals of K permutations,​ with applications to modular decomposition of graphs]]. //SIAM J. Discrete Mathematics//​ **22**(3):​1022–1039,​ 2008.
 +  - S. Heber, R. Mayr, J. Stoye. [[https://​doi.org/​10.1007/​s00453-009-9332-1|Common Intervals of Multiple Permutations]]. //​Algorithmica//​ **60**(2):​175-206,​ 2011.
 +
 +(b.) Common intervals of sequences:
 +  - T. Schmidt and J. Stoye. [[https://​doi.org/​10.1007/​978-3-540-27801-6_26|Quadratic time algorithms for finding common intervals in two and more sequences]]. In S. C. Sahinalp, S. Muthukrishnan,​ and U. Dogrusoz, editors, Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, //CPM 2004//, volume 3109 of LNCS, pages 347-358, Berlin, 2004. Springer Verlag. ​
 +  - G. Didier, T. Schmidt, J. Stoye, D. Tsur. [[https://​doi.org/​10.1016/​j.jda.2006.03.021|Character Sets of Strings]]. //J. Discr. Alg.// **5**(2):​330-340,​ 2007.
 +  - X. He and M. H. Goldwasser. [[https://​doi.org/​10.1089/​cmb.2005.12.638|Identifying conserved gene clusters in the presence of homology families]]. //J. Comp. Biol.// **12**(6):​638-656,​ 2005.
 +
 +(c.) Approximate common intervals of sequences:
 +  - S. Böcker, K. Jahn, J. Mixtacki, J. Stoye. [[https://​doi.org/​10.1089/​cmb.2009.0098|Computation of Median Gene Clusters]]. //J. Comp. Biol.// **16**(8):​1085-1099,​ 2009.
 +  - F. Hufsky, L. Kuchenbecker,​ K. Jahn, J. Stoye, S. Böcker. [[https://​doi.org/​10.1186/​1471-2105-12-106|Swiftly Computing Center Strings]]. //BMC Bioinformatics//​ **12**:106, 2011.
 +
 +(d.) Common intervals of indeterminate strings:
 +  - D. Doerr, J. Stoye, S. Böcker, K. Jahn. [[https://​doi.org/​10.1186/​1471-2164-15-S6-S2|Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings]]. //BMC Genomics// **15**(Suppl. 6): S2, 2014.
 +
 +==== Comparative Genomics IV: Multiple genome rearrangement ====
 +
 +  - D. Sankoff, G. Sunaram, J. Kececioglu. [[https://​doi.org/​10.1142/​S0129054196000026|Steiner points in the space of genome rearrangements]]. //​International Journal of Foundations of Computer Science// **7**(1):​1-9,​ 1996. 
 +  - G. Bourque, P. A. Pevzner. [[http://​genome.cshlp.org/​content/​12/​1/​26|Genome-Scale Evolution: Reconstructing Gene Orders in the Ancestral Species]]. //Genome Res.// **12**:​26-36,​ 2002.
 +  - J. Mixtacki. [[https://​doi.org/​10.1007/​978-3-540-69733-6_28|Genome halving under DCJ revisited]]. In: Proceedings of the 14th Annual International Conference on Computing and Combinatorics (COCOON 2008), volume 5092 of LNCS, pages 276-286, 2008.
 +  - R. Warren, D. Sankoff. [[https://​doi.org/​10.1142/​S0219720009004102|Genome Halving with Double Cut and Join]]. //Journal of Bioinformatics and Computational Biology// **7**(2): 357-371, 2009.
 +  - P Feijão, J. Meidanis. [[https://​doi.org/​10.1109/​TCBB.2011.34|SCJ:​ A Breakpoint-Like Distance that Simplifies Several Rearrangement Problems]]. //IEEE/ACM Transactions on Computational Biology and Bioinformatics//​ **6**(3): 1318-1329, 2011.
 +
 +==== Comparative Genomics V: Reconstruction of ancestral gene clusters ====
 +
 +  - R. Wittler. [[https://​pub.uni-bielefeld.de/​publication/​2301939|Phylogeny-based analysis of gene clusters]]. Ph.D. Thesis, Faculty of Technology, Bielefeld University, 2010. 
 +  - J. Stoye, R. Wittler. [[https://​doi.org/​10.1109/​TCBB.2008.135|A Unified Approach for Reconstructing Ancient Gene Clusters]]. //IEEE/ACM Transactions on Computational Biology and Bioinformatics//​ **6**(3):​387-400,​ 2009.
 +  - R. Wittler, J. Maňuch, M. Patterson, J. Stoye. [[https://​doi.org/​10.1089/​cmb.2011.0083|Consistency of Sequence-Based Gene Clusters]]. //J. Comp. Biol.// **18**(9):​1023-1039,​ 2011.
 +
 +==== Haplotype inference ====
 +A great overview of the combinatorial problems and algorithms in the following book chapter: ​
 +
 +  - D. Gusfield, S. Hecht Orzack. [[https://​doi.org/​10.1201/​9781420036275|Haplotype Inference]]. In: Handbook of Computational Molecular Biology (Chapter 18), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006.
 +
 +A more recent works on the topic, focussing on molecular haplotyping:​
 +
 +  - G. W. Klau, T. Marschall. [[https://​doi.org/​10.1007/​978-3-319-58741-7_6|A guided tour to computational haplotyping]]. In: Proc. of CiE 2017, LNCS 10307, Springer Verlag, 2017.
 +  - M. Patterson, T. Marschall, N. Pisanti, L. v. Iersel, L. Stougie, G. W. Klau, A. Schönhuth. [[https://​doi.org/​10.1089/​cmb.2014.0157|WhatsHap:​ Weighted Haplotype Assembly for Future-Generation Sequencing Reads]]. //Journal of Computational Biology// **22**(6), 498-509, 2015.
 +
 +The ILP discussed in class is from the following textbook, Section 20.2:
 +
 +  - Dan Gusfield. [[https://​doi.org/​10.1017/​9781108377737|Integer Linear Programming in Computational and Systems Biology]]. Cambridge University Press, 2019.
 +
 +==== SNP-disease associations ====
 +Here are a few of the more algorithmic papers on the topic, but there exist several more. You may look up the references in this one. 
 +
 +  - H. T. T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, and J. Kere. [[https://​doi.org/​10.1109/​BIBE.2000.889596|Gene mapping by haplotype pattern mining]]. In //IEEE International Symposium on Bio-Informatics and Biomedical Engineering//​ (BIBE'​00):​ 99, 2000. 
 +  - T. Mailund, S. Besenbacher,​ and M. Schierup. [[https://​doi.org/​10.1186/​1471-2105-7-454|Whole genome association mapping by incompatibilities and local perfect phylogenies]]. //BMC Bioinformatics//​ **7**:454, 2006. 
 +  - J. Nielsen and T. Mailund. [[https://​doi.org/​10.1186/​1471-2105-9-526|SNPFile - A software library and file format for large scale association mapping and population genetics studies]]. //BMC Bioinformatics//​ **9**:526, 2008. 
 +  - S. Besenbacher,​ C. N. S. Pedersen, and T. Mailund. [[https://​doi.org/​10.1186/​1471-2105-10-S1-S74|A fast algorithm for genome-wide haplotype pattern mining]]. //BMC Bioinformatics//​ **10**(Suppl 1):S74, 2009.
 +  - S. Besenbacher,​ T. Mailand, M. H. Schierup. [[https://​doi.org/​10.1007/​978-1-61779-585-5_11|Association Mapping and Disease: Evolutionary Perspectives]]. //​Evolutionary Genomics: Statistical and Computational Methods, Volume 2// Methods in Molecular Biology, vol. 856. Chapter 11, 2012.
 +
 +==== Computational metagenomics,​ astrobiology ====
 +Papers on tools for the analysis of metagenomics data:
 +
 +  - D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster. [[https://​doi.org/​10.1101/​gr.5969107|MEGAN analysis of metagenomic data]]. //Genome Res.// **17**:​377-386,​ 2007.
 +  - Q. Wang, G. M. Garrity, J. M. Tiedje, J. R. Cole. [[https://​doi.org/​10.1128/​AEM.00062-07|Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy]]. //Appl. Env. Microbiol.//,​ **73**(16):​5261-5267,​ 2007.
 +  - F. Meyer, D. Paarmann, M. D'​Souza,​ R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards. [[https://​doi.org/​10.1186/​1471-2105-9-386|The Metagenomics RAST server - A public resource for the automatic phylogenetic and functional analysis of metagenomes]]. //BMC Bioinformatics//​ **9**:386, 2008.
 +  - L. Krause, N. N. Diaz, A. Goesmann, S. Kelley, T. W. Nattkemper, F. Rohwer, R. A. Edwards, J. Stoye. [[https://​doi.org/​10.1093/​nar/​gkn038|Phylogenetic classification of short environmental DNA fragments]]. //Nuleic Acids Res.// **36**(7):​2230-2239,​ 2008.
 +  - M. M. Haque, T. S. Ghosh, D. Komanduri, S. S. Mande. [[https://​doi.org/​10.1093/​bioinformatics/​btp317|SOrt-ITEMS:​ Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences]]. //​Bioinformatics//​ **25**(14):​1722-1730,​ 2009.
 +  - W. Gerlach, S. Jünemann, F. Tille, A. Goesmann, J. Stoye. [[https://​doi.org/​10.1186/​1471-2105-10-430|WebCARMA:​ a web application for the functional and taxonomic classification of unassembled metagenomic reads]]. //BMC Bioinformatics//​ **10**:430, 2009.
 +  - D. E. Wood, S. L. Salzberg. [[https://​doi.org/​10.1186/​gb-2014-15-3-r46|Kraken:​ ultrafast metagenomic sequence classification using exact alignments]]. //Genome Biol.// **15**:R46, 2014.
 +  - A. Oulas, C. Pavloudi, P. Polymenakou,​ G. A. Pavlopoulos,​ N. Papanikolaou,​ G. Kotoulas, C. Arvanitidis,​ I. Iliopoulos. [[https://​doi.org/​10.4137/​BBI.S12462|Metagenomics:​ Tools and insights for analyzing next-generation sequencing data derived from biodiversity studies.]] //​Bioinformatics and Biology Insights// **9**:​75-88,​ 2015.
 +
 +(//​Astrobiology//​ was mainly meant as a joke, but have a look here: [[http://​astrobiology.nasa.gov/​nai/​|NAI]]. Or here: [[https://​www.wired.com/​2010/​12/​nasa-finds-arsenic-life-form|NASA unveils arsenic life form]].) ​