Differences

This shows you the differences between two versions of the page.

--- teaching:alggrliterature [2020/11/12 21:48]
jstoye
+++ teaching:alggrliterature [2022/11/21 09:57]
jstoye [Genome assembly IIb: Hybrid/long read assembly]
@@ Line 35: / Line 35: @@
 ==== Genome assembly Ib: Re-sequencing, comparative (reference-based) assembly ====
-A good introduction to comparative genome assembly is [1]. The main algorithmic challenge is to map millions of (most very short) sequence reads onto one or more referene geneome(s). Suitable mapping algorithms for this task are [[http://bibiserv.cebitec.uni-bielefeld.de/swift/|SWIFT]] [2], [[http://bowtie-bio.sourceforge.net/index.shtml|Bowtie]] [6], ELAND (Cox, unpublished), [[http://maq.sourceforge.net/|MAQ]] [3], [[http://rulai.cshl.edu/rmap/|RMAP]], [[http://soap.genomics.org.cn/|SOAP]] [4], [[http://compbio.cs.toronto.edu/shrimp/|SHRiMP]], SeqMap [5], TAGGER [7], ZOOM [8], [[http://bio-bwa.sourceforge.net/bwa.shtml|BWA]] [9], GSNAP [10], SARUMAN [11], SSAHA2 [12] etc. Methods especially suited for mapping SOLiD reads are presented in [13,14].
+A good introduction to comparative genome assembly is [1]. The main algorithmic challenge is to map millions of (mostly very short) sequence reads onto one or more reference genome(s). Suitable mapping algorithms for this task are [[http://bibiserv.cebitec.uni-bielefeld.de/swift/|SWIFT]] [2], [[http://bowtie-bio.sourceforge.net/index.shtml|Bowtie]] [6], ELAND (Cox, unpublished), [[http://maq.sourceforge.net/|MAQ]] [3], [[http://rulai.cshl.edu/rmap/|RMAP]], [[http://soap.genomics.org.cn/|SOAP]] [4], [[http://compbio.cs.toronto.edu/shrimp/|SHRiMP]], SeqMap [5], TAGGER [7], ZOOM [8], [[http://bio-bwa.sourceforge.net/bwa.shtml|BWA]] [9], GSNAP [10], SARUMAN [11], SSAHA2 [12], NextGenMap [13], etc.
   - M. Pop, A. Phillippy, A. L. Delcher, and S. L. Salzberg. [[https://doi.org/10.1093/bib/5.3.237|Comparative genome assembly]]. //Briefings in Bioinformatics// **5**(3):237-248, 2004.
@@ Line 49: / Line 49: @@
   - J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye, A. Goesmann. [[https://doi.org/10.1093/bioinformatics/btr151|Exact and complete short read alignment to microbial genomes using GPU programming]]. //Bioinformatics// **27**(10): 1351-1358, 2011.
   - Z. Ning, A.J. Cox. [[https://doi.org/10.1101/gr.194201|SSAHA: A Fast Search Method for Large DNA Databases]]. //Genome Res.// **11**(10): 1725-1729, 2001.
-  - L. Noé, M. Gîrdea, G. Kucherov. [[https://doi.org/10.1007/978-3-642-12683-3_25|Seed Design Framework for Mapping SOLiD Reads]]. Proceedings of RECOMB 2010, LNBI 6044, 384-396, 2010.
+  - F. J. Sedlazeck, P. Rescheneder, A. von Haeseler. [[https://doi.org/10.1093/bioinformatics/btt468|NextGenMap: fast and accurate read mapping in highly polymorphic genomes]]. //Bioinformatics// **29**(21): 2790-2791, 2013.
-  - M. Csűrös, Sz. Juhos, A. Bérces. [[https://doi.org/10.1007/978-3-642-15294-8_15|Fast Mapping and Precise Alignment of AB SOLiD Color Reads to Reference DNA]]. Proceedings of WABI 2010, LNBI 6293, 176-188, 2010.
   - L. Oesper, A. Ritz, S. J. Aerni, R. Drebin, B. J. Raphael. [[https://doi.org/10.1186/1471-2105-13-S6-S10|Reconstructing cancer genomes from paired-end sequencing data]]. //BMC Bioinformatics// **13**(Suppl. 6):S10, 2012.
@@ Line 74: / Line 73: @@
   - C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner, J. Korlach. [[https://doi.org/10.1038/nmeth.2474|Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data]]. //Nature Methods// **10**:563-569, 2013.
   - G. Myers. [[https://doi.org/10.1007/978-3-662-44753-6_5|Efficient Local Alignment Discovery amongst Noisy Long Reads]]. //Proceedings of WABI 2014//, LNBI 8701, 52-67, 2014.
+  - F. J. Sedlazeck, P. Rescheneder, M. Smolka, H. Fang, M. Nattestad, A. von Haeseler, M. C. Schatz. [[https://doi.org/10.1038/s41592-018-0001-7|Accurate detection of complex structural variations using single molecule sequencing]]. //Nat. Methods// **15**(6): 461–468, 2018.
   - E. Haghshenas, H. Asghari, J. Stoye, C. Chauve, F. Hach. [[https://doi.org/10.1016/j.isci.2020.101389|HASLR: Fast Hybrid Assembly of Long Reads]]. //iScience// **23**(8): 101389, 2020.
@@ Line 205: / Line 205: @@
   - E. Klipp, R. Herwig, A. Kowald, C. Wierling, H. Lehrach. [[https://doi.org/10.1002/3527603603|Systems Biology in Practice - Concepts, Implementation and Application]]. Wiley-VCH, 2005.
+==== Computational pangenomics ====
+The gene-based approach is covered, for example, in:
+  - H. Tettelin et al. [[https://doi.org/10.1073/pnas.0506758102|Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome”]]. //Proc. Natl. Acad. Sci. USA// **102**(39): 13950-13955, 2005.
+  - J. Blom, S. P. Albaum, D. Doppmeier, A. Pühler, F.-J. Vorhölter, M. Zakrzewski, and A. Goesmann. [[https://doi.org/10.1186/1471-2105-10-154|EDGAR: A software framework for the comparative analysis of prokaryotic genomes]]. //BMC Bioinformatics// **10**:154, 2009.
+  - J. Blom, J. Kreis, S. Spänig, T. Juhre, C. Bertelli, C. Ernst, and A. Goesmann. [[https://doi.org/10.1093/nar/gkw255|EDGAR 2.0: an enhanced software platform for comparative gene content analyses]]. //Nucleic Acids Res.// **44**(W1):W22–W28, 2016.
+  - J. Blom, S. P. Glaeser, T. Juhre, J. Kreis, P. H. G. Hanel, J. G. Schrader, P. Kämpfer, and A. Goesmann. [[https://doi.org/10.1002/9781118960608.bm00038|EDGAR: A Versatile Tool for Phylogenomics]]. In: W. B. Whitman (ed.). Bergey's Manual of Systematics of Archaea and Bacteria, Wiley, 2019.
+The following review paper gives a good overview of genome-based computational pangenomics:
+  - The Computational Pan-Genomics Consortium. [[https://doi.org/10.1093/bib/bbw089|Computational pan-genomics: status, promises and challenges]]. //Brief. Bioinf.// **19**(1), 118–135, 2018.
+Some more specialized papers are the following.
+(A) Data structures:
+  - B. Paten, D. Earl, N. Nguyen, M. Diekhans, D. Zerbino, D. Haussler. [[https://doi.org/10.1101/gr.123356.111|Cactus: Algorithms for genome multiple sequence alignment]]. //Genome Research// **21**, 1512–1528, 2011.
+  - C. Ernst, S. Rahmann. [[https://drops.dagstuhl.de/opus/volltexte/2013/4231/pdf/p035-ernst.pdf|PanCake: A Data Structure for Pangenomes]]. Proc. of //GCB 2013//, 35-45, 2013.
+  - G. Holley, R. Wittler, and J. Stoye. [[https://doi.org/10.1186/s13015-016-0066-8|Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage]]. //Algorithms Mol. Biol.// **11**: 3, 2016.
+  - E. Garrison, J. Sirén, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F. Lin, B. Paten, and R. Durbin. [[https://doi.org/10.1038/nbt.4227|Variation graph toolkit improves read mapping by representing genetic variation in the reference]]. //Nat. Biotechnol.// **36**, 875–879, 2018.
+  - G. Holley and P. Melsted. [[https://doi.org/10.1186/s13059-020-02135-8|Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs]]. //Genome Biol.// **21**: 249, 2020.
+(B) Sequence-to-graph mapping/alignment:
+  - M. Rautiainen, V. Mäkinen, and T. Marschall. [[https://doi.org/10.1093/bioinformatics/btz162|Bit-parallel sequence-to-graph alignment]]. //Bioinformatics// **35**(19), 3599-3607, 2019.
+  - R. Martiniano, E. Garrison, E. R. Jones, A. Manica, and R. Durbin. [[https://doi.org/10.1186/s13059-020-02160-7|Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph]]. //Genome Biol.// **21**: 250, 2020.
+  - M. Rautiainen and T. Marschall. [[https://doi.org/10.1186/s13059-020-02157-2|GraphAligner: rapid and versatile sequence-to-graph alignment]]. //Genome Biol.// **21**: 253, 2020.
+  - A. Kuhnle, T. Mun, C. Boucher, T. Gagie, B. Langmead, and G. Manzini. [[https://doi.org/10.1089/cmb.2019.0309|Efficient Construction of a Complete Index for Pan-Genomics Read Alignment]]. //J. Comp. Biol.// **27**(4), 500-513, 2020.
+  - N. Luhmann, G. Holley, and M. Achtman. [[https://doi.org/10.1101/2020.01.21.914168|BlastFrost: Fast querying of 100,000s of bacterial genomes in Bifrost graphs]]. //BioRxiv//, 2020.
+  - T. Schulz, R. Wittler, S. Rahmann, F. Hach, and J. Stoye. [[https://doi.org/10.1093/bioinformatics/btab077|Detecting High Scoring Local Alignments in Pangenome Graphs]]. //Bioinformatics// **37**(16), 2266–2274, 2021.
+(C) Phylogenomics:
+  - R. Wittler. [[https://doi.org/10.1186/s13015-020-00164-3|Alignment- and reference-free phylogenomics with colored de Bruijn graphs]]. //Algorithms Mol. Biol.// **15**: 4, 2020.
+  - A. Rempel, R. Wittler. [[https://doi.org/10.1093/bioinformatics/btab444|SANS serif: alignment-free, whole-genome-based phylogenetic reconstruction]]. //Bioinformatics// **37**(24), 4868-4870, 2021.
+(D) Haplotype inference:
+See [[#haplotype_inference|below]].
 ==== Comparative genomics I: Genome alignment, repeat analysis ====
@@ Line 231: / Line 269: @@
   - E. Tannier, C. Zheng, D. Sankoff. [[https://doi.org/10.1186/1471-2105-10-120|Multichromosomal median and halving problems under different genomic distances]]. //BMC Bioinformatics// **10**:120, 2009.
-==== Comparative genomics III: Synteny Hierarchies and Gene clusters ====
+==== Comparative genomics III: Gene clusters ====
 The following are the algorithmic papers in this area. Apart from that, many papers on applications of gene clusters and statistical properties exist, but are not listed here.
+(a.) Common intervals of permutations:
   - T. Uno and M. Yagiura. [[https://doi.org/10.1007/s004539910014|Fast algorithms to enumerate all common intervals of two permutations]]. //Algorithmica// **26**(2):290-309, 2000.
   - S. Heber and J. Stoye. [[https://doi.org/10.1007/3-540-48194-X_19|Finding all common intervals of k permutations]]. In A. Amir and G. Landau, editors, Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, //CPM 2001//, volume 2089 of LNCS, pages 207-218, Berlin, 2001. Springer Verlag.
   - S. Heber and J. Stoye. [[https://doi.org/10.1007/3-540-44696-6_20|Algorithms for finding gene clusters]]. In O. Gascuel and B. Moret, editors, Proceedings of the First International Workshop on Algorithms in Bioinformatics, //WABI 2001//, volume 2149 of LNCS, pages 252-263, Berlin, 2001. Springer Verlag.
   - A. Bergeron, S. Corteel, and M. Raffinot. [[https://doi.org/10.1007/3-540-45784-4_36|The algorithmic of gene teams]]. In R. Guigó and D. Gusfield, editors, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, //WABI 2002//, volume 2452 of LNCS, pages 464-476, Berlin, 2002. Springer Verlag.
   - N. Luc, J.-L. Risler, A. Bergeron, and M. Raffinot. [[https://doi.org/10.1016/S1476-9271(02)00097-X|Gene teams: a new formalization of gene clusters for comparative genomics]]. //Comp. Biol. Chem.// **27**:59-67, 2003.
+  - G. M. Landau, L. Parida, and O. Weimann. [[https://doi.org/10.1089/cmb.2005.12.1289|Gene proximity analysis across whole genomes via PQ tree]]. //J. Comp. Biol.// **12**(10):1289–1306, 2005.
+  - A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. [[https://doi.org/10.1137/060651331|Computing common intervals of K permutations, with applications to modular decomposition of graphs]]. //SIAM J. Discrete Mathematics// **22**(3):1022–1039, 2008.
+  - S. Heber, R. Mayr, J. Stoye. [[https://doi.org/10.1007/s00453-009-9332-1|Common Intervals of Multiple Permutations]]. //Algorithmica// **60**(2):175-206, 2011.
+(b.) Common intervals of sequences:
   - T. Schmidt and J. Stoye. [[https://doi.org/10.1007/978-3-540-27801-6_26|Quadratic time algorithms for finding common intervals in two and more sequences]]. In S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusoz, editors, Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, //CPM 2004//, volume 3109 of LNCS, pages 347-358, Berlin, 2004. Springer Verlag.
+  - G. Didier, T. Schmidt, J. Stoye, D. Tsur. [[https://doi.org/10.1016/j.jda.2006.03.021|Character Sets of Strings]]. //J. Discr. Alg.// **5**(2):330-340, 2007.
   - X. He and M. H. Goldwasser. [[https://doi.org/10.1089/cmb.2005.12.638|Identifying conserved gene clusters in the presence of homology families]]. //J. Comp. Biol.// **12**(6):638-656, 2005.
-  - G. M. Landau, L. Parida, and O. Weimann. [[https://doi.org/10.1089/cmb.2005.12.1289|Gene proximity analysis across whole genomes via PQ tree]]. //J. Comp. Biol.// **12**(10):1289–1306, 2005.
-  - A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. [[https://doi.org/10.1137/060651331|Computing common intervals of K permutations, with applications to modular decomposition of graphs.]] //SIAM J. Discrete Mathematics// **22**(3):1022–1039, 2008.
+(c.) Approximate common intervals of sequences:
   - S. Böcker, K. Jahn, J. Mixtacki, J. Stoye. [[https://doi.org/10.1089/cmb.2009.0098|Computation of Median Gene Clusters]]. //J. Comp. Biol.// **16**(8):1085-1099, 2009.
   - F. Hufsky, L. Kuchenbecker, K. Jahn, J. Stoye, S. Böcker. [[https://doi.org/10.1186/1471-2105-12-106|Swiftly Computing Center Strings]]. //BMC Bioinformatics// **12**:106, 2011.
+(d.) Common intervals of indeterminate strings:
   - D. Doerr, J. Stoye, S. Böcker, K. Jahn. [[https://doi.org/10.1186/1471-2164-15-S6-S2|Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings]]. //BMC Genomics// **15**(Suppl. 6): S2, 2014.
@@ Line 264: / Line 311: @@
 A great overview of the combinatorial problems and algorithms is given in the following book chapter:
-  - D. Gusfield, S. Hecht Orzack. [[https://doi.org/10.1201/9781420036275.ch18|Haplotype Inference]]. In: Handbook of Computational Molecular Biology (Chapter 18), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006.
+  - D. Gusfield, S. Hecht Orzack. [[https://doi.org/10.1201/9781420036275|Haplotype Inference]]. In: Handbook of Computational Molecular Biology (Chapter 18), edited by S. Aluru, Chapman & Hall/CRC Computer and Information Science Series, 2006.
-A more recent paper on the topic is:
+More recent works on the topic, focusing on molecular haplotyping:
   - G. W. Klau, T. Marschall. [[https://doi.org/10.1007/978-3-319-58741-7_6|A guided tour to computational haplotyping]]. In: Proc. of CiE 2017, LNCS 10307, Springer Verlag, 2017.
+  - M. Patterson, T. Marschall, N. Pisanti, L. van Iersel, L. Stougie, G. W. Klau, A. Schönhuth. [[https://doi.org/10.1089/cmb.2014.0157|WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads]]. //Journal of Computational Biology// **22**(6), 498-509, 2015.
+The ILP discussed in class is from the following textbook, Section 20.2:
+  - Dan Gusfield. [[https://doi.org/10.1017/9781108377737|Integer Linear Programming in Computational and Systems Biology]]. Cambridge University Press, 2019.
 ==== SNP-disease associations ====
