Mini-Workshop on Storage of and Search in Viral Quasispecies 2018

Organized by:	Faculty of Technology, GRK 1906 DiDy
Place:	Bielefeld University, main building, room U10-146
Date:	April 16, 2018

Introduction

One of the biggest challenges when dealing with viruses is their high mutation rates. These mutation rates make it impossible to represent a given virus by a single reference sequence and, at the same time, complicate de novo assembly of full viral genomes. Because of their high sequence variability, viruses are often seen as quasispecies, which consist of many viral haplotypes surrounding a master virus.

Schedule

One purpose of this mini-workshop is informal exchange, so the schedule is rather relaxed. Times in the following table are tentative.

11h00	Welcome (Jens Stoye)
11h15	Manja Marz	Nanopore quasispecies reconstruction
Lunch Break
13h00	Tizian Schulz	Efficient querying of viral quasispecies
13h45	Kassian Kobert	Emergence of variants in viral quasispecies
Coffee Break
15h30	Dominik Heider	Machine learning models for molecular diagnostics

End of workshop: approximately 16h30.

Talks

Nanopore quasispecies reconstruction

by Manja Marz

Introduction: Third-generation sequencing techniques have made it possible to generate read data of unprecedented length at a low cost. Relatively short sequences such as viral genomes can now be fully covered by a few or even single long reads. Due to their error-prone replication mechanism, virus populations are made up of a diverse spectrum of haplotypes. Characterizing this so-called quasispecies is an important task in virus discovery and diagnostics.

Objectives: We aim to demonstrate that long read sequencing of viral genomes allows fast identification and makes assembly almost obsolete when single reads cover the whole genome. Furthermore, we aim to determine the viral haplotypes contained in a sample. Using the long read information, it is possible to determine which specific sequence features (e.g. co-occurring SNPs, recombination events) occur in each haplotype.

Methods: Using the recently released Direct RNA kit for the Oxford Nanopore Technologies MinION platform, we sequenced a human coronavirus 229E (HCoV-229E) sample from human cell culture. As a work-in-progress approach, a de Bruijn graph is constructed from the overlapping k-mers of the long read data while preserving the information denoting which k-mers originated from a single long read. After tip and bulge removal, potential haplotype consensus sequences are produced by assembly of overlapping subgraph consensus sequences.
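The graph construction described in the Methods can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function name build_read_aware_dbg, the toy reads and the choice of k are assumptions made for illustration, and tip and bulge removal as well as consensus generation are left out. It only shows one way to build a de Bruijn graph over k-mers while recording which reads each k-mer came from.

```python
from collections import defaultdict


def build_read_aware_dbg(reads, k):
    """Build a de Bruijn graph over k-mers and remember the read of origin.

    Hypothetical helper for illustration only.
    Returns (edges, kmer_reads):
      edges      maps a k-mer to the set of k-mers that directly follow it
                 in at least one read (overlap of k-1 characters),
      kmer_reads maps a k-mer to the set of read indices that contain it.
    """
    edges = defaultdict(set)
    kmer_reads = defaultdict(set)
    for read_id, seq in enumerate(reads):
        prev = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmer_reads[kmer].add(read_id)
            if prev is not None:
                edges[prev].add(kmer)
            prev = kmer
    return dict(edges), dict(kmer_reads)


# Two toy "long reads" that differ at a single position (assumed data).
reads = ["ACGTAC", "ACGTTC"]
edges, kmer_reads = build_read_aware_dbg(reads, k=3)

# The k-mer CGT has two successors, one per read: a branch in the graph.
print(sorted(edges["CGT"]))
# Shared k-mers carry both read indices, variant k-mers only one.
print(kmer_reads["ACG"], kmer_reads["GTT"])
```

In this toy case the branch after CGT separates the two reads, and the per-k-mer read sets show which branch belongs to which read. With real nanopore data, sequencing errors create many additional branches of this kind.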

Results: Sequencing the coronavirus sample yielded 293k reads, of which 27% are of viral origin. Median read length was 2.5 kb, and two reads exceeded 26 kb, thus covering almost the full HCoV-229E genome of 27.3 kb. The de Bruijn graph reveals the haplotype subgraph structure, but subgraph separation and consensus generation are hindered by the high number of sequencing errors of the platform (15% insertions and deletions). An error correction step could mitigate this, but the result may also indicate that other approaches are more suitable.

Conclusion: Long read sequencing enables viral full genome sequencing with minimal assembly, and also captures haplotype information. However, sequencing errors disrupt haplotype separation and necessitate appropriate error correction steps.

Efficient querying of viral quasispecies

by Tizian Schulz

According to Marschall et al. (2016), a quasispecies can be considered as a viral pan-genome which, in turn, can be represented as a colored de Bruijn graph (CDBG). Such a CDBG is well suited to represent viral pan-genomes, as it directly exposes differences and commonalities of all sequences even without a preceding assembly. However, the development of efficient methods to search within the graph is challenging. In this talk, a new method is introduced that queries a colored de Bruijn graph following the idea of the prominent database search algorithm BLAST.
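To make the idea of seed-based search in a colored de Bruijn graph more concrete, here is a minimal Python sketch. It is not the method presented in the talk, which the abstract does not specify. The function names build_colored_index and longest_seed_run, the haplotype labels and the toy sequences are assumptions. The sketch stores each k-mer together with its colors (the sequences containing it) and reports, per color, the longest run of consecutive query k-mers found, which is a crude stand-in for the seed-and-extend step known from BLAST.

```python
from collections import defaultdict


def build_colored_index(genomes, k):
    """Map every k-mer to the set of colors (sequence names) containing it."""
    index = defaultdict(set)
    for color, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return index


def longest_seed_run(index, query, k):
    """For each color, length of the longest run of consecutive query k-mers
    that all carry this color.

    Simplification: a run only guarantees a path in the graph whose k-mers
    share the color, not an identical position in the underlying sequence.
    """
    best = defaultdict(int)
    current = defaultdict(int)
    for i in range(len(query) - k + 1):
        colors = index.get(query[i:i + k], set())
        # A run continues only for colors that also contain this k-mer.
        current = defaultdict(int, {c: current[c] + 1 for c in colors})
        for c, run in current.items():
            best[c] = max(best[c], run)
    return dict(best)


# Two toy haplotypes differing at one position (assumed data).
genomes = {"hapA": "ACGTACGGA", "hapB": "ACGTTCGGA"}
index = build_colored_index(genomes, k=4)

hits_a = longest_seed_run(index, "CGTACGG", k=4)
hits_mixed = longest_seed_run(index, "ACGTT", k=4)
print(hits_a)
print(hits_mixed)
```

The first query matches only the hapA variant, while the second starts in a region shared by both haplotypes and then follows the hapB branch. A real implementation would also need to handle mismatches, scoring and statistical significance.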

Emergence of variants in viral quasispecies

by Kassian Kobert

The high mutation rates of RNA viruses make it possible to study the adaptation of a viral quasispecies in a laboratory setting in “real time”. Here we have a look at the study of adaptation and the emergence of co-existing variants that can be observed when starting from a single clonal sequence.

Machine learning models for molecular diagnostics

by Dominik Heider

The development of computational approaches for predictive modeling of drug resistance, e.g., in HIV, has opened a new era in precision medicine. Clinical decision support systems have been designed to assist in molecular diagnostics (MDx) or companion diagnostics (CDx) and thereby enhance therapeutic success. These systems are typically based on statistical or machine learning models built from clinical data. However, such models depend on the data used to build them, and it can be shown that the current clinical routine, i.e., Sanger sequencing, cannot capture the whole viral quasispecies and can thus lead to false predictions.
