Genome Informatics

Applied Comparative Genomics

Cedric Chauve – guest lecturer (Internationales Gastdozentenprogramm)
Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
cedric.chauve@sfu.ca

Block seminar

16.–27.09.2019
10.–14.02.2020

Course Materials

… can be found in this GitHub repository.

Schedule

First block (16.–27.09.2019)
Mon	16.09.	10-12h	Overview: The genomes of 16 Anopheles mosquitoes
Tue	17.09.	10-12h	Data quality
Wed	18.09.	10-12h	The gambiae phylogeny
Thu	19.09.	10-12h	Use synteny for gene family analysis
Fri	20.09.	10-12h	Open reading

Mon	23.09.	10-12h	Open reading
Tue	24.09.	10-12h	Gene-tree species-tree reconciliation
Wed	25.09.	10-12h	Fabian: Synteny blocks
Thu	26.09.	10-12h	Jan: Improving Scaffolding
			Andreas: OrthoDB (BUSCO)
		13-15h	Feedback
Fri	27.09.	—

Second block (10.–14.02.2020)
Mon	10.02.	10-12h	The genome rearrangement landscape
Tue	11.02.	10-12h	Handling duplications / Small parsimony problem (SPP)
Wed	12.02.	10-12h	Ancestral gene orders in a model-free framework
Thu	13.02.	10-12h	Joint scaffolding and ancestral gene order reconstruction
Fri	14.02.	10-12h	Student talks

Course Outline

This seminar is composed of two blocks and can ideally be combined with the lecture "Algorithms in Comparative Genomics" that is part of the same module (Special Algorithms in Bioinformatics).

The first block closes methodological gaps between the basic aspects covered in the module “Sequence Analysis” and the topics of the above-mentioned lecture “Algorithms in Comparative Genomics”, which takes place after this seminar block during WS 19/20. The second block, held after the lecture, discusses advanced applied topics.

Block 1

With the increase in the number of available genomes sequenced with Illumina technology, comparative genomics projects consider large groups of species whose genomes are often provided as very fragmented assemblies. This creates challenges for applying classical comparative genomics algorithms, especially those for the analysis of genome rearrangements, which often assume that the considered genomes are fully assembled. A typical example is a recently sequenced group of 21 genomes of mosquitoes of the genus Anopheles, which includes the major malaria vectors. In this group of genomes, of primary importance from a public health point of view in many tropical countries, understanding the evolution of gene order is crucial, especially for associating gene order genotypes with ecological phenotypes such as resistance to insecticides. The purpose of the course is to illustrate the various challenges posed by such data within comparative genomics projects and to introduce protocols, algorithms and tools that address these challenges, using the Anopheles dataset. We will provide first-hand experience with the analysis of these motivating data (Science, 2015; BMC Genomics, 2018).

In the detailed descriptions below, each topic corresponds to one scientific article, which is to be prepared, presented (30-45 minute talk) and summarized (5-10 pages) by a student, as specified by the module requirements, and discussed by the group. Further individual topics are available if required.

In addition to this standard procedure, we will provide computational results on actual experimental data, which the students can analyze and use to enrich both the content of their presentations and their experience of working with real data.

In the first block of the seminar (in preparation for the lecture), we will consider the problem of generating, from the provided genome assemblies and sequence data, the gene orders needed to study the evolution of gene order in the considered genomes. This will cover the following aspects:

Genome assembly. The first step from the sequencing data to gene orders is to assemble the data into contigs and scaffolds. We will discuss the de Bruijn graph assembler ALLPATHS-LG, used to assemble the Anopheles genomes, as well as metrics commonly used to evaluate the quality of genome assemblies.
Comparative scaffolding. To address the issue of fragmented assemblies, a classical approach relies on the use of one (or a few) reference genome(s) to scaffold contigs. The Anopheles dataset contains a fully assembled genome that can be used as such a reference. We will discuss the recent RAGOUT 2 comparative scaffolder and evaluate its results on two Anopheles genomes (experimental results provided ahead of time).
Gene annotation. Once a genome has been assembled, an important step is to annotate its genes, especially the protein-coding genes. This is obviously important for functional genomics, but also for evolutionary studies, as protein-coding genes provide natural genomic markers. We will study the annotation protocol followed for the Anopheles dataset, as described in “Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes”.
Gene families. A gene family is a set of genes, covering several species, that evolved from a unique ancestral gene through speciation, duplication and transfer. Clustering annotated genes into families is crucial for defining single-copy genes that can be used as markers with most genome rearrangement algorithms. We will discuss the OrthoDB algorithm, used within the Anopheles project, and its sister method BUSCO for benchmarking genome assemblies based on single-copy orthologous genes.
Synteny blocks. An alternative to using single-copy orthologous genes as markers is to detect single-copy genome segments whose gene or DNA content is unaltered up to small variations, often referred to as synteny blocks. We will study a recent synteny block construction protocol and compare its results on Anopheles data (provided ahead of time) with single-copy gene markers.
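
As a small illustration of the assembly quality metrics mentioned under "Genome assembly", the following sketch computes the N50 of a set of contig or scaffold lengths: the largest length L such that sequences of length at least L cover at least half of the total assembly length. N50 is only one of several commonly used metrics, the function name is our own, and the lengths are toy values, not Anopheles data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, not taken from any real assembly).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # total is 300; 80 + 70 reaches 150, so N50 is 70
```

A fragmented assembly with many short contigs yields a low N50 even if its total length is correct, which is why the metric is often reported alongside the number of contigs and the total assembly size.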

At the end of this block, students will have worked through the major steps required to process input genomes toward the analysis of genome rearrangements as required by the models and methods that are taught in the subsequent, complementary lecture.

Block 2

The second block will address advanced applied topics, in particular the difficulties arising from the fact that the provided genomes contain relatively few single-copy orthologs and are highly fragmented. We will study the following aspects:

Ancestral gene order reconstruction using single-copy markers. We will discuss the PATHGROUPS, MGRA and ANGES methods, all of which were used to reconstruct Anopheles ancestral gene orders, and compare their results on the Anopheles dataset.
Reconciled gene trees. To handle duplicated genes, computing reconciled gene trees makes it possible to refine the orthology/paralogy relations between genes. We will review the concept of gene tree-species tree reconciliation and discuss the ecceTERA algorithm used to compute such trees with Anopheles data.
DeCoSTAR. Provided with reconciled gene trees, it is possible to reconstruct parsimonious evolutionary histories for individual gene adjacencies using the DeCoSTAR algorithm. We will discuss this algorithm and how to apply it to our data to reconstruct ancestral Anopheles gene adjacencies.
Ancestral gene order reconstruction using duplicated markers. The result of DeCoSTAR is often not fully compatible with the expected structure of ancestral gene orders, as the resulting assembly graph may contain cycles and branchings. We will discuss approaches to handle this and resolve these conflicts through a variant of the Small Parsimony Problem that can handle duplications. We will also see, through provided results on Anopheles data, how the same algorithm can be used for comparative scaffolding.
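
To make the notion of cycles and branchings concrete, the sketch below scans a set of inferred adjacencies and reports the two kinds of conflict with a linear gene order: markers with more than two neighbours (branchings) and adjacencies that close a cycle. It is a deliberately simplified illustration, assuming unsigned markers and unweighted adjacencies. It is not DeCoSTAR, nor the Small Parsimony variant discussed in the seminar, and the function name and marker names are hypothetical.

```python
from collections import defaultdict


def find_conflicts(adjacencies):
    """Return (branching_markers, cycle_closing_adjacencies).

    Simplifying assumption: markers are unsigned, so in a linear gene
    order every marker has at most two neighbours and the adjacency
    graph contains no cycle.
    """
    degree = defaultdict(int)
    parent = {}  # union-find forest over markers

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    cycle_closing = []
    for a, b in adjacencies:
        degree[a] += 1
        degree[b] += 1
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # a and b are already connected: this adjacency closes a cycle
            cycle_closing.append((a, b))
        else:
            parent[root_a] = root_b
    branching = sorted(m for m, d in degree.items() if d > 2)
    return branching, cycle_closing


# Toy adjacencies between hypothetical markers g1..g6.
toy = [("g1", "g2"), ("g2", "g3"), ("g3", "g1"), ("g3", "g4"), ("g5", "g6")]
branching, cycles = find_conflicts(toy)
print(branching)  # ['g3']
print(cycles)     # [('g3', 'g1')]
```

Detecting conflicts is the easy part. Deciding which adjacencies to discard so that the remaining ones form valid chromosomes is the optimization question that the methods discussed in this block address.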