Intensive Course on Data Mining and Visualization

  • October 19 to November 6, 2015
  • Martin Ester (SFU), Tim Nattkemper, Barbara Hammer
  • Mon-Fri 10-12 (lecture) and 14-16 (seminar/discussions)
  • Room: V2-105/115 (exceptions: 19./22.10.: V2-121; 3.11., morning: V4-112; 3.11., afternoon and 4.11.: U10-146)

The morning sessions will consist of lectures introducing the topics, whereas the afternoon sessions will cover special problems and methods (mostly for genomics), featuring presentations of selected research papers and giving participants the opportunity to discuss issues related to their own research projects.

Week 1: Data Mining for Genomics - Martin Ester

Oct. 19-23

The lectures will introduce the area of data mining, focusing on algorithms, particularly for cluster analysis and classification.

  1. Introduction
  2. Cluster analysis
    • Cluster validation
    • Representative-based clustering
    • Hierarchical clustering
    • Probabilistic model-based clustering
    • Density-based clustering
    • Non-negative matrix factorization
    • Consensus clustering
    • High-dimensional clustering
    • Semi-supervised clustering
  3. Classification
    • Classifier evaluation
    • Decision trees
    • Naïve Bayes classifier
    • Logistic regression
    • Bayesian networks
    • Support vector machines
    • Nearest neighbor classifier
    • Ensemble methods
    • Regression analysis
  4. Conclusion

Lecture schedule:

  • Monday: Introduction, Clustering (to slide 49)
  • Tuesday: Clustering (slides 50-95)
  • Wednesday: Clustering (until end), Classification (to slide 125)
  • Thursday: Classification (to slide 164)
  • Friday: Classification (to the end), Conclusion

Seminar papers (Monday to Friday, in order of presentation):

  • Hofree et al. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10, 1108–15.
  • Lawrence et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–14.
  • Gardy et al. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res., 31, 3613–3617.
  • Vazquez et al. (2003) Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol., 21, 697–700.
  • Bleakley et al. (2007) Supervised reconstruction of biological networks with local models. Bioinformatics, 23, i57–65.
  • Yu et al. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks.

Week 2: Principles in Data Visualization - Tim Nattkemper

Oct. 26-30

In this part of the course, principles of data visualization will be introduced. First, the history of visualization is reviewed to develop a definition of the terms “visualization”, “scientific visualization” and “information visualization”. Second, categories of data and metadata will be defined and basic visualization techniques will be described for each category. Third, basic aspects of human visual perception and cognition will be reviewed in the context of data visualization.

Week 3: Nonlinear Dimensionality Reduction and Metric Learning - Barbara Hammer

Nov. 2-6

The last part of the course will center on two recent developments in advanced data analysis: (a) nonlinear dimensionality reduction technologies for intuitive data inspection, which address the question of how to map high-dimensional data points to low dimensions such that the structure of the data becomes visible, and (b) metric learning techniques, which adjust the metric, i.e. the data representation, according to auxiliary knowledge. Both technologies constitute mature fields of research, with a variety of methods readily available and a number of advanced applications in bioinformatics and beyond. We will give an overview of the underlying concepts and try to provide a clear classification of the differences between the currently most popular technologies.