Detection of Genomic Structural Variation

Dr. Roland Wittler
Seminar: Wednesday, 10.15-11.45 in M3-115
Office hours: by arrangement
Office: U10-145

Content

In addition to small mutations in the genome, such as the deletion, insertion or substitution of single bases, larger, so-called structural variations play an important role, e.g., in the development of cancer. These include the deletion, insertion, rearrangement, inversion or duplication of whole segments of the genome sequence. High-throughput whole-genome sequencing makes it possible to detect structural variations in several ways.

This is a classical literature seminar, i.e., on the first day, topics are introduced and selected by the students. In the following sessions, each student gives a presentation on their topic and afterwards writes an essay (“Hausarbeit”). Aspects of scientific writing and presenting will be covered as well.

Talks and essays can be done in German or English.

Literature

A collection of publications discussed in the seminar is provided in the “Lernraum” in the eKVV, including some review articles on structural variation detection.

https://bis.uni-bielefeld.de/sites/8358/Start.aspx (You have to register for this seminar in the eKVV by adding it to your eKVV schedule.)

Requirements

Recommended prior knowledge: Sequence Analysis
Oral presentation (20-45 minutes)
Essay (8-15 pages)

Topics

Array, Array CGH
Read alignment (BWA, BWA-SW, Bowtie2, mrFAST)
Representation and handling of mappings and call sets (samtools, VCF, IGV)
Genome Analysis Toolkit (GATK)
Copy number variation approaches (CNVnator, SegSeq)
Split-read methods (Pindel, LASER)
Paired-end mapping approaches, probabilistic (BreakDancer, MoDIL)
Paired-end mapping approaches, combinatorial (CLEVER, GASV)
Assembly-based approaches (SOAPdenovo(2))
Phasing (WhatsHap, review)
Long-read mapping (Chaisson, Pendleton)
Big genome projects (1000 Genomes Project, Genome of the Netherlands)

Timeline

Date	Topic	Who
11.10.2017	Administrative matters, overview of topics and topic selection
18.10.2017
25.10.2017
01.11.2017	– national holiday –
08.11.2017
15.11.2017
22.11.2017	Scientific Writing / Read alignment	Roland / Dennis
29.11.2017	Practical session: Mappings and handling of BAM files	Roland
06.12.2017	Split-read methods / Genome Analysis Toolkit	Lena / Paul B.
13.12.2017
20.12.2017	Paired-end mapping approaches	Timo / Fabienne
– X-Mas break –
10.01.2018	Assembly-based approaches / Phasing	Manuel / Ilja
17.01.2018	Long-read mapping	Pia
24.01.2018	Big genome projects / Copy number variation tools	Matthias / Dennis
31.01.2018

Hands on

Once you have been added to the CeBiTec user group “seqan”, you have access to the volume:

/vol/seqan/svseminar

In the subfolder HG00514, you will find Illumina paired-end sequencing data: one file for each mate (suffixes _1.fastq.gz and _2.fastq.gz) as well as a short extract of each (suffix head.fastq.gz), which is easier to handle for test purposes. The subfolder hg38 contains a reference genome that has already been indexed for use with BWA. There is also a folder TEST, which you should use to play with the data. It contains an example script runBWA.sh that runs BWA on the head version of the read data and also does some SAM/BAM conversion. Please make your own copy of this script before you modify it. Do not run any heavy computations on a standard terminal! Instead, submit the job to the compute cluster:

qsub -cwd -P seqan -l idle=1 -pe multislot 4 runBWA.sh

You can check the status of your job with qstat and kill it with qdel <job id>. The output of the job can be found in files called <scriptname>.o<job id> and <scriptname>.e<job id>, where the first should be empty and the second contains output and/or error messages of the tools used.

If you want to do, say, medium-weight computations interactively, log in to a compute host with qlogin -P seqan.
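Before mapping, it can be worth confirming that the two mate files list the reads in the same order, since BWA treats the i-th read of each file as one pair. The following is a minimal Python sketch of such a check, not part of the provided scripts. The function names `read_names` and `mates_in_sync` are made up for illustration. The sketch assumes plain four-line FASTQ records and that mate names differ at most by a trailing /1 or /2.

```python
import gzip
from itertools import islice


def read_names(path, limit=1000):
    """Yield the names of up to `limit` reads from a gzipped FASTQ file.

    Assumption: plain four-line records (header, sequence, '+', qualities).
    """
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(islice(handle, 4 * limit)):
            if index % 4 == 0:  # first line of each record: '@' + read name
                name = line[1:].split()[0]  # drop '@' and any trailing comment
                if name.endswith(("/1", "/2")):  # drop an optional mate suffix
                    name = name[:-2]
                yield name


def mates_in_sync(path_1, path_2, limit=1000):
    """Return True if both files list the same read names in the same order."""
    return list(read_names(path_1, limit)) == list(read_names(path_2, limit))
```

A call could look like `mates_in_sync("reads_1.fastq.gz", "reads_2.fastq.gz")`, where both file names are placeholders for the actual files in HG00514. Only the first `limit` records are compared, so this is a quick sanity check rather than a full validation.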

Once you have your BAM file, you could, for example, do the following:

Have a first look at the mappings: samtools view roland.bam | head
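Extract all mappings on a certain chromosome (this needs a coordinate-sorted and indexed BAM file, see samtools sort and samtools index; -b requests BAM output): samtools view -b -o roland.chr1.bam roland.bam chr1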
Which chromosome has been hit how many times? samtools view roland.bam | cut -f 3 | sort | uniq -c | sort -n
Extract the fragment lengths: samtools view roland.bam | cut -f 9 > roland.fragmentlengths.tsv
Use R to plot the fragment length statistic:

# read data
fl<-read.table(file="roland.fragmentlengths.tsv", header=FALSE)
# take absolute values from first (and only) column
fla<-abs(fl$V1)
# filter for outliers by using quantiles
flaf<-subset(fla, fla>quantile(fla,0.01) & fla<quantile(fla,0.99))
# plot histogram
hist(flaf,breaks=50,xlab="fragment length")
# quit R
quit()
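
The chromosome count and the fragment length filter above can also be sketched in Python, working on the text output of samtools view. This is only an illustration under stated assumptions: the function names `parse_sam` and `trim_outliers` and the example records are made up, and the input is expected to be tab-separated SAM lines with the reference name in column 3 and the template length in column 9.

```python
from collections import Counter
from statistics import quantiles


def parse_sam(lines):
    """Return (hits per reference name, absolute template lengths)."""
    hits = Counter()
    lengths = []
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        hits[fields[2]] += 1                 # column 3: reference name ('*' if unmapped)
        lengths.append(abs(int(fields[8])))  # column 9: template length (0 if unknown)
    return hits, lengths


def trim_outliers(values, lower=1, upper=99):
    """Keep values strictly between the `lower` and `upper` percentiles.

    method="inclusive" interpolates like the default of R's quantile().
    """
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if low < v < high]


# Made-up example records, shaped like the output of `samtools view`.
EXAMPLE_SAM = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t4M\t=\t346\t250\tACGT\tIIII",
    "r1\t147\tchr1\t346\t60\t4M\t=\t100\t-250\tACGT\tIIII",
    "r2\t99\tchr2\t500\t60\t4M\t=\t896\t400\tACGT\tIIII",
]

hits, lengths = parse_sam(EXAMPLE_SAM)
# Ascending by count, like `sort | uniq -c | sort -n`.
for name, count in sorted(hits.items(), key=lambda item: item[1]):
    print(count, name)
print(lengths)
```

On real data, the lines would come from samtools view, for instance by passing `sys.stdin` to `parse_sam` in a script that reads from a pipe, and `trim_outliers(lengths)` would then give the values to plot.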

If you want to apply one of “your” tools, create an individual subfolder MYTOOL (e.g. LASER, GATK etc.) and make it group writable (chmod g+w <folder>).
