Detection of Genomic Structural Variation

Dr. Roland Wittler
Seminar: Wednesday, 10.15-11.45 in M3-115
Office hours: by arrangement
Office: U10-145

Content

In addition to small mutations in the genome, such as the deletion, insertion or substitution of single bases, larger so-called structural variations play an important role, e.g., in the development of cancer. These include the deletion, insertion, rearrangement, inversion or duplication of whole segments of the genome sequence. High-throughput whole-genome sequencing makes it possible to detect structural variations in several ways.

This is a classical literature seminar: on the first day, topics are introduced and selected by the students. In the following sessions, each student gives a presentation on their topic and afterwards writes an essay (“Hausarbeit”). Aspects of scientific writing and presenting will be covered as well.

Talks may be given and essays written in either German or English.

Literature

A collection of publications discussed in the seminar is provided in the “Lernraum” in the eKVV, including some review articles on structural variation detection.

https://bis.uni-bielefeld.de/sites/8358/Start.aspx (You have to register for this seminar in the eKVV by adding it to your eKVV schedule.)

Requirements

  • Recommended prior knowledge: Sequence Analysis
  • Oral presentation (20-45 minutes)
  • Essay (8-15 pages)

Topics

  • Array, Array CGH
  • Read alignment (BWA, BWA-SW, Bowtie2, mrFAST)
  • Representation and handling of mappings and call sets (samtools, VCF, IGV)
  • Genome Analysis Toolkit (GATK)
  • Copy number variation approaches (CNVnator, SegSeq)
  • Split-read methods (Pindel, LASER)
  • Paired-end mapping approaches, probabilistic (BreakDancer, MoDIL)
  • Paired-end mapping approaches, combinatorial (CLEVER, GASV)
  • Assembly-based approaches (SOAPdenovo(2))
  • Phasing (WhatsHap, review)
  • Long-read mapping (Chaisson, Pendleton)
  • Big genome projects (1000 Genomes Project, Genome of the Netherlands)

Timeline

Date        Topic                                                  Who
11.10.2017  Administrative matters, topic overview and selection
18.10.2017
25.10.2017
01.11.2017  – national holiday –
08.11.2017
15.11.2017
22.11.2017  Scientific Writing / Read alignment                    Roland / Dennis
29.11.2017  Practical session: Mappings and handling of BAM files  Roland
06.12.2017  Split-read methods / Genome Analysis Toolkit           Lena / Paul B.
13.12.2017
20.12.2017  Paired-end mapping approaches                          Timo / Fabienne
            – Christmas break –
10.01.2018  Assembly-based approaches / Phasing                    Manuel / Ilja
17.01.2018  Long-read mapping                                      Pia
24.01.2018  Big genome projects / Copy number variation tools      Matthias / Dennis
31.01.2018

Hands-on

Once you have been added to the CeBiTec user group “seqan”, you have access to the volume:

/vol/seqan/svseminar

In the subfolder HG00514, you will find Illumina paired-end sequencing data: one file for each mate (suffixes _1.fastq.gz and _2.fastq.gz) as well as a short extract of each (suffix head.fastq.gz), which is easier to handle for test purposes. In the subfolder hg38, you will find a reference genome that has already been indexed for use with BWA.

There is also a folder TEST, which you should use to play with the data. It contains an example script runBWA.sh that runs BWA on the head version of the read data and also does some SAM/BAM conversion. Please make your own copy of this script before you modify it. Do not run any heavy computations on a standard terminal! Instead, submit the job to the compute cluster:

qsub -cwd -P seqan -l idle=1 -pe multislot 4 runBWA.sh

You can check the status of your job with qstat and kill it with qdel <job id>. The output of the job can be found in files called <scriptname>.o<job id> and <scriptname>.e<job id>, where the first should be empty and the second contains output and/or error messages of the tools used.
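
Checking the two output files after a job has finished can be wrapped in a small shell function. The following is only a sketch: it assumes the file naming scheme described above, and the function name check_job_output and the job id in the example call are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: inspect the two output files of a finished cluster job.
# Assumes the naming scheme <scriptname>.o<job id> / <scriptname>.e<job id>
# described above; check_job_output is a hypothetical helper name.
check_job_output() {
    local script="$1" jobid="$2"
    local out="${script}.o${jobid}" err="${script}.e${jobid}"

    if [ ! -e "$out" ] || [ ! -e "$err" ]; then
        echo "output files for job ${jobid} not found (job still running?)"
        return 1
    fi

    # The .o file is expected to be empty; report if it is not.
    if [ -s "$out" ]; then
        echo "note: ${out} is not empty"
    fi

    # The .e file holds the tools' messages; show the last few lines.
    tail -n 20 "$err"
}

# Example call (12345 is a made-up job id):
# check_job_output runBWA.sh 12345
```

Run the function in the directory from which the job was submitted, since qsub -cwd writes the output files there.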

If you want to do, say, medium-weight computations interactively, log in to a compute host with qlogin -P seqan.

Once you have your BAM file, you could, for example, do the following:

  • Have a first look at the mappings: samtools view roland.bam | head
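  • Extract all mappings on a certain chromosome (this requires a coordinate-sorted BAM file that has been indexed with samtools index roland.bam): samtools view -b -o roland.chr1.bam roland.bam chr1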
  • Which chromosome has been hit how many times? samtools view roland.bam | cut -f 3 | sort | uniq -c | sort -n
  • Extract the fragment lengths: samtools view roland.bam | cut -f 9 > roland.fragmentlengths.tsv
  • Use R to plot the fragment length statistic:
# read data
fl<-read.table(file="roland.fragmentlengths.tsv", header=FALSE)
# take absolute values from first (and only) column
fla<-abs(fl$V1)
# filter for outliers by using quantiles
flaf<-subset(fla, fla>quantile(fla,0.01) & fla<quantile(fla,0.99))
# plot histogram (fragment length on the x-axis, frequency on the y-axis)
hist(flaf,breaks=50,xlab="fragment length")
# quit R
quit()
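
The chromosome count and the fragment length extraction above can be refined slightly. In SAM output, unmapped reads carry "*" in column 3, and column 9 (TLEN) is positive for the leftmost mate of a pair, negative for the rightmost one, and 0 when no template length is available. The following sketch uses this to skip unmapped reads and to report each fragment only once. Both function names are hypothetical, and the functions only read SAM text from standard input.

```shell
#!/usr/bin/env bash
# Sketch: helpers that read SAM records (text) from standard input.
# Both function names are hypothetical.

# Count mapped records per reference sequence (column 3, RNAME).
# Unmapped reads carry "*" in that column and are skipped, as are
# header lines starting with "@".
hits_per_reference() {
    awk -F'\t' '!/^@/ && $3 != "*" { n[$3]++ } END { for (r in n) print n[r] "\t" r }' | sort -n
}

# Print one fragment length per read pair: column 9 (TLEN) is positive
# for the leftmost mate and negative for the rightmost one, so keeping
# only positive values counts each pair once and drops the zeros of
# pairs without a usable template length.
fragment_lengths() {
    awk -F'\t' '!/^@/ && $9 > 0 { print $9 }'
}

# Example calls:
# samtools view roland.bam | hits_per_reference
# samtools view roland.bam | fragment_lengths > roland.fragmentlengths.tsv
```

Because fragment_lengths emits only positive values, the abs() step in the R code is then a no-op, and the quantile filter is no longer skewed by zero entries.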

If you want to apply one of “your” tools, create an individual subfolder MYTOOL (e.g., LASER, GATK, etc.) and make it group-writable (chmod g+w <folder>).
