Applied Unsupervised Machine Learning
This course is under construction. None of the information should be used until this message has been removed.
Overview
This course on unsupervised learning provides a systematic introduction to dimensionality reduction and clustering techniques. The course covers fundamental concepts of unsupervised learning and data normalization, then progresses through the practical applications of Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and hierarchical clustering algorithms.
The course emphasizes both theoretical understanding and hands-on application, teaching students to recognize when different techniques are appropriate and when they may fail. A key learning objective is understanding the limitations of linear methods like PCA. Students learn to evaluate the performance of unsupervised learning methods across diverse data types, with the ultimate goal of generating meaningful hypotheses for further research.
The learning objectives and course outline are detailed below.
Session 1 (half‑day) Introduction to unsupervised learning
- Introduction to unsupervised learning and normalization: Understand the fundamental principles of unsupervised learning and recognize the role that data normalization plays in preparing datasets for analysis.
- Why normalization is required: Explain why normalization is necessary to ensure that features with different scales do not unduly influence unsupervised learning algorithms.
- Why dimensionality reduction is required: Understand why high-dimensional datasets often need to be reduced to fewer dimensions before they can be analysed or visualized effectively.
- Basics of dimensionality reduction: Describe the core concepts of dimensionality reduction, then describe Principal Component Analysis (PCA), including how it reduces dimensionality by identifying directions of maximum variance.
- Evaluating unsupervised learning results: Check the performance and quality of your unsupervised learning results.
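As a minimal illustration of the normalization step, here is a z-score (standardization) sketch in plain NumPy. The data values are made up for illustration; in practice a library helper such as scikit-learn's `StandardScaler` is typically used.

```python
import numpy as np

# Toy data: 5 samples, 2 features on very different scales
# (values are made up purely for illustration).
X = np.array([
    [170.0, 30.0],
    [160.0, 55.0],
    [180.0, 40.0],
    [175.0, 25.0],
    [165.0, 50.0],
])

# Z-score normalization: subtract each feature's mean and divide by
# its standard deviation, so every feature has mean 0 and std 1.
# Without this, the large-scale feature would dominate any
# distance-based unsupervised method.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```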
Session 2 (half‑day) Basics of dimensionality reduction
- Basic applications of PCA: Apply PCA to real datasets, interpret the resulting principal components, and discuss how these components can reveal underlying structure.
- Curse of dimensionality: Explain the concept of the curse of dimensionality and its implications for the performance and interpretability of clustering and dimensionality-reduction algorithms.
- PCA and t-SNE: Compare and contrast PCA and t-Distributed Stochastic Neighbor Embedding (t-SNE) as two popular techniques for dimensionality reduction and data visualization.
- Basics of t-SNE: Explain how t-SNE projects high-dimensional data into two or three dimensions while preserving local similarities between points.
- Applications to data: Demonstrate the use of both PCA and t-SNE on sample datasets to visualize clustering tendencies and uncover hidden patterns.
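The PCA idea above (finding directions of maximum variance) can be sketched directly with NumPy's SVD on a toy dataset; the data below is illustrative, and t-SNE itself is normally run via a library implementation such as scikit-learn's `TSNE` rather than written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points that mostly vary along a single direction,
# embedded in 3 dimensions with a little noise (illustrative only).
t = rng.normal(size=(100, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(100, 3))

# PCA by hand: center the data, then take the SVD.
# The rows of Vt are the principal directions (axes of max variance).
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first 2 principal components.
X_pca = X_centered @ Vt[:2].T

# Fraction of total variance explained by each component.
explained = S**2 / np.sum(S**2)
print(explained)  # the first component dominates
```

Because the toy data was built around one dominant direction, the first principal component captures almost all of the variance, which is exactly the structure PCA is designed to expose.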
Session 3 (half‑day) Basics of Clustering
- Clustering: Define clustering in the context of unsupervised learning and outline its importance in discovering groupings within data.
- Basics of k-means: Describe the k-means clustering algorithm, including how cluster centroids are initialized and updated to minimize within-cluster variance.
- Basics of hierarchical clustering: Explain the steps of agglomerative hierarchical clustering, visualize results with heatmaps, and interpret dendrograms.
- Deciding on your clustering approach: Discuss the situations in which hierarchical clustering is advantageous, such as when the number of clusters is unknown or when a tree-based representation is desired.
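A minimal sketch of the k-means loop described above (random initial centroids, then alternating assignment and update steps), applied to made-up two-blob data; real analyses would typically use a library implementation such as scikit-learn's `KMeans`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two well-separated toy blobs (made-up data).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: random initial centroids, then alternate
    assignment and centroid-update steps for a fixed number of
    iterations (no convergence check, for brevity)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points,
        # which minimizes within-cluster variance for that assignment.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(X, k=2)
```

On clearly separated blobs like these, the two recovered clusters line up with the two generating blobs.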
Session 4 (half‑day) Practical applications (hands-on)
- When not to apply PCA and t-SNE: Identify situations where PCA or t-SNE may produce misleading results or be computationally infeasible, and propose alternative strategies.
- Practical applications: Explore real-world scenarios where unsupervised learning methods provide actionable insights across various domains.
- Practical applications of PCA, t-SNE and hierarchical clustering to biological data: Apply PCA, t-SNE, and hierarchical clustering to biological datasets (e.g., gene expression or single-cell data), interpret the results, and discuss the biological insights gained.
- Evaluating unsupervised learning methods: Evaluate these techniques on different kinds of data (single-cell data, electronic healthcare records, social sciences data), where they are typically used to generate hypotheses and motivate next steps.
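One concrete way to evaluate a clustering result is the silhouette score: for each point, a is the mean distance to its own cluster, b is the lowest mean distance to any other cluster, and s = (b - a) / max(a, b), so scores near 1 indicate tight, well-separated clusters. Below is a hand-rolled sketch on toy data; in practice `sklearn.metrics.silhouette_score` does this directly.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score over all points (naive O(n^2) version)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton clusters score 0 by convention
        a = dists[i, same].mean()
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(1)
# Two well-separated toy blobs, labelled correctly...
X = np.vstack([
    rng.normal([0.0, 0.0], 0.2, size=(30, 2)),
    rng.normal([4.0, 4.0], 0.2, size=(30, 2)),
])
good = np.array([0] * 30 + [1] * 30)
# ...versus a random labelling of the same points.
bad = rng.integers(0, 2, size=60)

print(silhouette(X, good))  # close to 1
print(silhouette(X, bad))   # close to 0
```

Comparing the correct labelling against a random one shows why the score is useful for hypothesis generation: it quantifies whether a proposed grouping reflects real structure in the data.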
Target Audience
Students who have some basic familiarity with Python. No prior knowledge of biology or statistics is required. The course is designed for those who want to learn how to apply unsupervised machine learning techniques to real-world datasets.
Prerequisites
Basic familiarity with Python is required; see the Introduction to Python course webpage.
Exercises
Exercises in these materials are labelled according to their level of difficulty:
Level | Description
---|---
1 | Simpler exercises designed to get you familiar with the concepts and syntax covered in the course.
2 | Exercises that combine different concepts and apply them to a given task.
3 | Exercises that require going beyond the concepts and syntax introduced to solve new problems.
Acknowledgements
- We thank Martin van Rongen, Vicki Hodgson, Hugo Tavares, Paul Fannon, Matt Castle and the Bioinformatics Facility Training Team for their support and guidance.
- Introduction to Statistical Learning in Python