====== Detection of Genomic Structural Variation ======

Dr. Roland Wittler\\
Seminar: Wednesday, 10.15-11.45 in M3-115\\
Office hours: by arrangement \\
Office: U10-145\\

===== Content =====

In addition to small mutations in the genome, like the deletion, insertion or substitution of single bases, larger, so-called //structural variations//, like the deletion, insertion, rearrangement, inversion or duplication of whole segments of the genome sequence, play an important role, e.g., in the development of cancer. High-throughput whole-genome sequencing enables detecting structural variations in several ways.

This is a classical literature seminar, i.e., on the first day, topics are introduced and selected by the students. In the following sessions, students give a presentation on their topic and afterwards write an essay ("Hausarbeit"). Aspects of scientific writing and presenting will be covered as well. Talks and essays can be done in German or English.

===== Literature =====

A collection of publications discussed in the seminar is provided in the "Lernraum" in the [[https://ekvv.uni-bielefeld.de/kvv_publ/publ/vd?id=103587990|eKVV]], including some review articles on structural variation detection.

https://bis.uni-bielefeld.de/sites/8358/Start.aspx

(You have to register for this seminar in the eKVV by adding it to your eKVV schedule.)

===== Requirements =====

  * Recommended prior knowledge: Sequence Analysis
  * Oral presentation (20-45 minutes)
  * Essay (8-15 pages)

===== Topics =====

  * Array, Array CGH
  * Read alignment (BWA, BWA-SW, Bowtie2, mrFAST)
  * Representation and handling of mappings and call sets (samtools, VCF, IGV)
  * Genome Analysis Toolkit (GATK)
  * Copy number variation approaches (CNVnator, SegSeq)
  * Split-read methods (Pindel, LASER)
  * Paired-end mapping approaches, probabilistic (BreakDancer, MoDIL)
  * Paired-end mapping approaches, combinatorial (CLEVER, GASV)
  * Assembly-based approaches (SOAPdenovo(2))
  * Phasing (WhatsHap, review)
  * Long-read mapping (Chaisson, Pendleton)
  * Big genome projects (1000 Genomes Project, Genome of the Netherlands)

===== Timeline =====

^ Date ^ Topic ^ Who ^
| 11.10.2017 | Administrative matters, overview of topics and topic selection | |
| 18.10.2017 | | |
| 25.10.2017 | | |
| 01.11.2017 | -- national holiday --||
| 08.11.2017 | | |
| 15.11.2017 | | |
| 22.11.2017 | Scientific Writing / Read alignment | Roland / Dennis |
| 29.11.2017 | Practical session: Mappings and handling of BAM files | Roland |
| 06.12.2017 | Split-read methods / Genome Analysis Toolkit | Lena / Paul B. |
| 13.12.2017 | | |
| 20.12.2017 | Paired-end mapping approaches | Timo / Fabienne |
| -- X-Mas break -- |||
| 10.01.2018 | Assembly-based approaches / Phasing | Manuel / Ilja |
| 17.01.2018 | Long-read mapping | Pia |
| 24.01.2018 | Big genome projects / Copy number variation tools | Matthias / Dennis |
| 31.01.2018 | | |

===== Hands on =====

Once you are added to the CeBiTec user group "seqan", you have access to the volume:

  /vol/seqan/svseminar

In the subfolder ''HG00514'', you find Illumina paired-end sequencing data. To be precise, you will find one file for each mate (suffix ''_1.fastq.gz'' and ''_2.fastq.gz'') as well as a short extract of each (suffixes ''head.fastq.gz''), which is easier to handle for test purposes.
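
As a quick sanity check before mapping, one can confirm that the two mate files described above contain the same number of reads. The following is only a minimal sketch: it assumes standard four-line FASTQ records, and the function names ''count_reads'' and ''same_read_count'' as well as the file names in the usage comment are hypothetical and not part of the seminar material.

```shell
#!/usr/bin/env bash

# Count the reads in a gzipped FASTQ file.
# Assumption: every record occupies exactly four lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Succeed only if both mate files contain the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}

# Hypothetical usage; replace the placeholders with the actual file names:
# same_read_count sample_1.head.fastq.gz sample_2.head.fastq.gz && echo "mate files match"
```

Run this on the head extracts first; decompressing the full files takes considerably longer and belongs on the compute cluster.
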
In the subfolder ''hg38'', you find a reference genome (that has already been indexed to be used by BWA).

There is also a folder ''TEST'' which you should use to play with the data. Here you find an example script ''runBWA.sh'' that runs BWA on the head version of the read data and also does some SAM/BAM conversion. Please make your own copy of this script before you modify it.

**Do not do any heavy computations on a standard terminal!** Instead, submit the job to the compute cluster:

  qsub -cwd -P seqan -l idle=1 -pe multislot 4 runBWA.sh

You can check the status of your job with ''qstat'' and kill it with ''qdel ''. The output of the job can be found in files called ''.o'' and ''.e'', where the first should be empty and the second contains output and/or error messages of the tools used.

If you want to do, say, medium-weight computations interactively, log in to a compute host with ''qlogin -P seqan''.

Once you have your BAM file, you could, e.g., do the following things.

  * Have a first look at the mappings: ''samtools view roland.bam | head''
  * Extract all mappings on a certain chromosome (this needs a coordinate-sorted and indexed BAM file; ''-b'' requests BAM output): ''samtools view -b -o roland.chr1.bam roland.bam chr1''
  * Which chromosome has been hit how many times? ''samtools view roland.bam | cut -f 3 | sort | uniq -c | sort -n''
  * Extract the fragment lengths: ''samtools view roland.bam | cut -f 9 > roland.fragmentlengths.tsv''
  * Use ''R'' to plot the fragment length statistic:

  # read data
  fl<-read.table(file="roland.fragmentlengths.tsv", header=FALSE)
  # take absolute values from first (and only) column
  fla<-abs(fl$V1)
  # filter for outliers by using quantiles
  flaf<-subset(fla, fla>quantile(fla,0.01) & fla'').

Back to [[:teaching|Teaching]]