This is an old revision of the document!
References for Algorithms in Genome Research
Overview
Parts of the material is covered by the following two textbooks. Some topics are newer than these books. For most of these, specialized references are given below. 
-  D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, 1997.  
-  D. W. Mount, editor. Bioinformatics  Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 2nd edition, 2004.  
 
The main issues of this section are discussed in Mounts textbook. Some of the major biological sequence databases are: 
- 
- 
- 
- 
 
Sequencing technologies and applications
Papers on SOLiD sequencing: 
- 
- 
 
Physical mapping
This section is mainly based on Chapter 16 of Gusfield's textbook. The Tightest Layout Problem is originally from (Alizadehet al., 1995). Another reference is (Heber et al., 2000). 
- 
- 
 
Genome assembly Ia: Overlap-Layout-Consensus - Clone-by-clone and whole-genome shotgun assembly
Lander-Waterman Statistics:
- 
Overlap-Layout-Consensus Assembly: There are many good description available online, just search for them.
 
Genome assembly Ib: Re-sequencing, comparative (reference-based) assembly
A good introduction to comparative genome assembly is [1]. The main algorithmic challenge is to map millions of (most very short) sequence reads onto one or more referene geneome(s). Suitable mapping algorithms for this task are SWIFT [2], Bowtie [6], ELAND (Cox, unpublished), MAQ [3], RMAP, SOAP [4], SHRiMP, SeqMap [5], TAGGER [7], ZOOM [8], BWA [9], GSNAP [10], SARUMAN [11], SSAHA2 [12], NextGenMap [13], etc.
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
 
Genome assembly IIa: De-novo assembly of HTS data
These methods are all very similar, using variants of de-Bruijn graphs plus additional tricks: EULER-SR [1], Velvet [2,3], MIRA2 (Link), SSAKE [4],VCAKE [5], SHARCGS [6], Medvedev/Brudno [7], ABySS (Link), ALLPATHS [8], SOAPdenovo [9], IDBA [10], etc. A very good recent review paper is [11]. 
- 
- 
- 
- 
- 
- 
- 
- 
-  R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, J. Wang.  De novo assembly of human genomes with massively parallel short read sequencing- .  Genome Res. 20- (2):265-272, 2010.  
- 
- 
- 
 
Genome assembly IIb: Hybrid/long read assembly
-  A. C. English, S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin, D. M. Muzny, J. G. Reid, K. C. Worley, R. A. Gibbs.  Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology- .  PLoS ONE 7- (11): e47768, 2012. 
-  S. Koren, M. C. Schatz, B. P. Walenz, J. Martin, J. T. Howard, G. Ganapathy, Z. Wang, D. A. Rasko, W. R. McCombie, E. D. Jarvis, A. M. Phillippy.  Hybrid error correction and de novo assembly of single-molecule sequencing reads- .  Nature Biotechnolopgy 30- :693-700, 2012. 
-  C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner, J. Korlach.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data- .  Nature Methods 10- :563-569, 2013. 
- 
- 
String graph assembly for diploid genomes with long reads is explained on the following poster by PacBio. 
Pre-processing (correction of long reads):
- 
- 
 
Genome annotation I: Basic techniques
A general introduction to HMMs in Bioinformatics is in the textbook by Durbinet al. [1]. Covariance models were introduced in [2]. A similar concept was developed in [3]. 
-  R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.  
- 
- 
 
Genome annotation II: Finding RNA genes
Here are several references, in chronological order: 
- 
- 
- 
- 
- 
 
Genome annotation III: Finding protein-coding genes
Here are several references, in chronological order. First, for prokaryotes …
- 
- 
- 
- 
- 
- 
- 
- 
- 
… and now for eukaryotes:
-  G. Stormo.  Consensus patterns in DNA- . In R. F. Doolittle, editor, Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences,  Meth. Enzymol. volume 183-  chapter 13, pages 211-221. Academic Press, San Diego, CA, 1990.  
- 
- 
- 
- 
- 
- 
- 
- 
- 
 
Genome annotation IV: Patterns in non-coding regions
This part is almost completely based on the work of Mathieu Blanchette and co-authors: 
- 
- 
 
Genome annotation V: Functional genome annotation
All bioinformatics textbooks contain good overviews of sequence analysis by Smith-Waterman and fast heuristic methods for database search like FASTA and BLAST. Here are some references to the original papers, including one by ourselves: 
- 
- 
- 
- 
- 
 
Transcriptomics I: mRNA analysis, RNA-Seq
A few papers on EST clustering and splicing graphs: 
- 
- 
- 
Very good review about NGS transcriptomics (RNA-seq): 
- 
Examples for software supporting RNA-seq: 
- 
-  K. Wang, D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, P. Mieczkowski, S. A. Grimm, C. M. Perou, J. N. MacLeod, D. Y. Chiang, J. F. Prins, J. Liu.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery- .  Nucleic Acids Res. 38- (18): e178, 2010. 
 
Tanscriptomics II: DNA microarray design
Just a few in-house papers, because they are so nice: 
- 
- 
- 
- 
 
Tanscriptomics III: DNA microarray analysis
A couple of papers and one book: 
- 
- 
- 
- 
 
RNA secondary structure, structure space, RNA-based regulation
The most classical algorithm for RNA secondary structure prediction is Nussinov's algorithm:
- 
- 
 
Proteomics I: 2D gels
A survey of various techniques for two-dimensional gel alignment is [1]. The method described in the lecture is from [2], [3]. 
- 
- 
- 
- 
 
Proteomics II: Analysis of mass spectra
The algorithms discussed include: de-novo protein sequencing by mass spectra [1,2], the money changing problem [3], alignment of time-series of mass spectra [4]. 
- 
- 
- 
- 
 
Some software developed in Bielefeld has been published here: 
- 
- 
 
Computational Systems Biology: Networks, their models and algorithms
A textbook that covers this (and much more) is: 
- 
 
Computational pangenomics
The gene based method is considered here (for example):
- 
- 
- 
-  J. Blom, S. P. Glaeser, T. Juhre, J. Kreis, P. H. G. Hanel, J. G. Schrader, P. Kämpfer, and A. Goesmann.  EDGAR: A Versatile Tool for Phylogenomics- . In: W. B. Whitman (ed.). Bergey's Manual of Systematics of Archaea and Bacteria, Wiley, 2019. 
A good overview of genome-based computational pangenomics gives the following review paper:
- 
Some more specialized papers are the following.
(A) Data structures
- 
- 
- 
-   E. Garrison, J. Sirén, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F Lin, B. Paten, and R. Durbin.  Variation graph toolkit improves read mapping by representing genetic variation in the reference- .  Nat. Biotechnol. 36- , 875–879, 2018. 
- 
(B) Sequence-to-graph mapping/alignment
- 
- 
- 
- 
- 
- 
(C) Phylogenomics:
- 
- 
(D) Haplotype inference:
See below.
 
Comparative genomics I: Genome alignment, repeat analysis
Some relevant papers in this area are: 
-  A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg.  Alignment of whole genomes- .  Nucleic Acids Res. 27- (11):2369-2376, 1999.  
- 
- 
- 
- 
- 
- 
 
Comparative genomics II: Genome rearrangement
The classical papers are by Hannenhalli and Pevzner (1999 and 1995). A much better readable description for the reversal distance can be found in the book chapter by Bergeron et al. (2005). The general DCJ model is described in (Yancopoulos et al., 2005) and (much nicer) in (Bergeron et al., 2006a). A correct algorithm for sorting by translocations is given in (Bergeron et al., 2006b), an almost correct one for the HP distance in (Bergeron et al., 2009), one formula of which is corrected in (Erdős et al., 2011). A very good overview of median, halving and guided halving results can be found in (Tannier et al., 2009). 
- 
- 
- 
-  A. Bergeron, J. Mixtacki, and J. Stoye. The inversion distance problem. In O. Gascuel, editor,  Mathematics of Evolution and Phylogeny- , chapter 10, pages 262-290. Oxford University Press, Oxford, UK, 2005.  
- 
-  A. Bergeron, J. Mixtacki, and J. Stoye.  A unifying view of genome rearrangements- . In: Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI 2006), volume 4175 of LNBI, pages 163-173, 2006.  
- 
- 
- 
- 
 
Comparative genomics III: Gene clusters
The following are the algorithmic papers in this area. Apart from that, many papers on applications of gene clusters and statistical properties exist, but are not listed here. 
(a.) Common intervals of permutations:
- 
-  S. Heber and J. Stoye.  Finding all common intervals of k permutations- . In A. Amir and G. Landau, editors, Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching,  CPM 2001- , volume 2089 of LNCS, pages 207-218, Berlin, 2001. Springer Verlag.  
-  S. Heber and J. Stoye.  Algorithms for finding gene clusters- . In O. Gascuel and B. Moret, editors,Proceedings of the First International Workshop on Algorithms in Bioinformatics,  WABI 2001- , volume 2149 of LNCS, pages 252-263, Berlin, 2001. Springer Verlag. 
-  A. Bergeron, S. Corteel, and M. Raffinot.  The algorithmic of gene teams- . In R. Guigó and D. Gusfield,editors, Proceedings of the Second International Workshop on Algorithms in Bioinformatics,  WABI 2002- , volume 2452 of LNCS, pages 464-476, Berlin, 2002. Springer Verlag.  
- 
- 
- 
- 
(b.) Common intervals of sequences:
- 
- 
- 
(c.) Approximate common intervals of sequences:
- 
- 
(d.) Common intervals of indeterminate strings:
- 
 
Comparative Genomics IV: Multiple genome rearrangement
- 
- 
-  J. Mixtacki.  Genome halving under DCJ revisited- . In: Proceedings of the 14th Annual International Conference on Computing and Combinatorics (COCOON 2008), volume 5092 of LNCS, pages 276-286, 2008. 
- 
- 
 
Comparative Genomics V: Reconstruction of ancestral gene clusters
Haplotype inference
A great overview of the combinatorial problems and algorithms in the following book chapter: 
-  D. Gusfield, S. Hecht Orzack.  Haplotype Inference- . In: Handbook of Computational Molecular Biology (Chapter 18), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006. 
A more recent works on the topic, focussing on molecular haplotyping:
- 
- 
The ILP discussed in class is from the following textbook, Section 20.2:
- 
 
SNP-disease associations
Here are a few of the more algorithmic papers on the topic, but there exist several more. You may look up the references in this one. 
-  H. T. T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, and J. Kere.  Gene mapping by haplotype pattern mining- . In  IEEE International Symposium on Bio-Informatics and Biomedical Engineering-  (BIBE'00): 99, 2000.  
- 
- 
- 
- 
 
Papers on tools for the analysis of metagenomics data:
- 
- 
-  F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards.  The Metagenomics RAST server - A public resource for the automatic phylogenetic and functional analysis of metagenomes- .  BMC Bioinformatics 9- :386, 2008. 
- 
- 
- 
- 
- 
(Astrobiology was mainly meant as a joke, but have a look here: NAI. Or here: NASA unveils arsenic life form.)