Applied Unsupervised Machine Learning
This course is under construction. None of the information should be used until this message has been removed.
Overview
Session 1 (half‑day) Introduction to unsupervised learning
Introduction to unsupervised learning and normalization:
Understand the fundamental principles of unsupervised learning and recognize the role that data normalization plays in preparing datasets for analysis.
Why normalization is required:
Explain why normalization is necessary to ensure that features with different scales do not unduly influence unsupervised learning algorithms.
Why dimensionality reduction is required:
Explain why high‑dimensional datasets need dimensionality reduction before analysis.
Basics of dimensionality reduction:
Describe the core concepts of dimensionality reduction, then describe Principal Component Analysis (PCA), including how it reduces dimensionality by identifying directions of maximum variance.
Evaluating unsupervised learning results:
Check the performance and quality of your unsupervised learning results.
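As a rough illustration of the normalization and PCA objectives above, the following is a minimal sketch using NumPy only. The toy dataset (height in metres, weight in grams) and the helper names `standardize` and `pca` are assumptions made for this example and are not part of the course materials. It shows that, without standardization, the first principal component is dominated by the feature with the largest scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples with two correlated features on very
# different scales (height in metres, weight in grams).
height_m = rng.normal(1.7, 0.1, size=200)
weight_g = 70_000 + 40_000 * (height_m - 1.7) + rng.normal(0, 5_000, size=200)
X = np.column_stack([height_m, weight_g])


def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca(X, n_components):
    """PCA via singular value decomposition of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    scores = Xc @ components.T  # coordinates of each sample on the components
    return scores, components, explained_ratio[:n_components]


# PCA on the raw data: the weight column (largest variance) dominates PC1.
scores_raw, comps_raw, ratio_raw = pca(X, n_components=1)

# PCA on standardized data: both features contribute equally to PC1.
X_std = standardize(X)
scores_std, comps_std, ratio_std = pca(X_std, n_components=1)

print("PC1 loadings, raw data:         ", np.round(comps_raw[0], 3))
print("PC1 loadings, standardized data:", np.round(comps_std[0], 3))
```

On the raw data the loading on the weight column is close to 1 in absolute value, whereas after standardization the two loadings have equal magnitude, which is the effect that normalization is meant to achieve.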
Session 2 (half‑day) Basics of dimensionality reduction
Basic applications of PCA:
Apply PCA to real datasets, interpret the resulting principal components, and discuss how these components can reveal underlying structure.
Curse of dimensionality:
Explain the concept of the curse of dimensionality and its implications for the performance and interpretability of clustering and dimensionality‑reduction algorithms.
PCA and t‑SNE:
Compare and contrast PCA and t‑Distributed Stochastic Neighbor Embedding (t‑SNE) as two popular techniques for dimensionality reduction and data visualization.
Basics of t‑SNE:
Explain how t‑SNE projects high‑dimensional data into two or three dimensions while preserving local similarities between points.
Applications to data:
Demonstrate the use of both PCA and t‑SNE on sample datasets to visualize clustering tendencies and uncover hidden patterns.
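The comparison of PCA and t‑SNE described above can be sketched with scikit-learn. This is only an illustrative example: the choice of the iris dataset and the parameter values (`perplexity=30`, `random_state=0`) are assumptions for the sketch, not settings prescribed by the course.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Sample dataset (an assumption for this sketch): 150 flowers, 4 features.
X, y = load_iris(return_X_y=True)

# Standardize so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# PCA: linear projection onto the two directions of maximum variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# t-SNE: non-linear embedding that tries to preserve local neighbourhoods.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)

print("Variance explained by PC1 and PC2:", pca.explained_variance_ratio_)
print("PCA embedding shape:  ", X_pca.shape)
print("t-SNE embedding shape:", X_tsne.shape)
```

Both embeddings can then be drawn as scatter plots coloured by `y`. Note that PCA axes have an interpretation in terms of explained variance, while t‑SNE axes and between-cluster distances generally do not.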
Session 3 (half‑day) Basics of Clustering
Clustering:
Define clustering in the context of unsupervised learning and outline its importance in discovering groupings within data.
Basics of k‑means:
Describe the k‑means clustering algorithm, including how cluster centroids are initialized and updated to minimize within‑cluster variance.
Basics of hierarchical clustering:
Explain the steps of agglomerative hierarchical clustering, and interpret the resulting dendrograms and heatmaps.
Deciding on your clustering approach:
Identify situations in which hierarchical clustering is advantageous, such as when the number of clusters is unknown or when a tree‑based representation is desired.
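The two clustering approaches above can be sketched side by side. The code below is a minimal illustration, not the course's implementation: the toy blobs are hypothetical, the k‑means loop is a plain Lloyd-style iteration, and the deterministic farthest-point initialization (`farthest_point_init`, a helper invented for this example) is an assumption chosen for reproducibility; random or k‑means++ initialization is more common in practice. Hierarchical clustering uses SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated 2-D blobs of 50 points each.
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in true_centers])


def farthest_point_init(X, k):
    """Deterministic initialization (an assumption for this sketch): start at
    X[0], then repeatedly add the point farthest from the chosen centroids."""
    centroids = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dists)])
    return np.array(centroids)


def kmeans(X, k, n_iter=100):
    """Lloyd-style k-means: alternate assignment and centroid update."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


kmeans_labels, centroids = kmeans(X, k=3)

# Agglomerative clustering: build the full merge tree with Ward linkage,
# then cut it so that at most three clusters remain.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

print("k-means cluster sizes:     ", np.bincount(kmeans_labels))
print("hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
```

The linkage matrix `Z` encodes the whole tree, so the number of clusters can be chosen after the fact by cutting at a different level, whereas k‑means needs `k` up front. `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree with matplotlib.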
Session 4 (half‑day) Practical applications (hands-on)
When not to apply PCA and t‑SNE:
Identify situations where PCA or t‑SNE may produce misleading results or be computationally infeasible, and propose alternative strategies.
Practical applications:
Explore real‑world scenarios where unsupervised learning methods provide actionable insights across various domains.
Practical applications of PCA, t‑SNE and hierarchical clustering to biological data:
Apply PCA, t‑SNE, and hierarchical clustering to biological datasets (e.g., gene expression or single‑cell data), interpret the results, and discuss biological insights gained.
Evaluating unsupervised learning methods:
Evaluate these techniques on different kinds of data (single-cell data, electronic healthcare records, social sciences data), where they are used to generate hypotheses, and discuss motivations for next steps.
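One possible way to evaluate a clustering without ground-truth labels is sketched below. The silhouette score is an assumption for this example (the overview does not say which metrics the course uses), and the synthetic blobs from `make_blobs` stand in for real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated 2-D blobs, 150 points in total.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for several candidate k and record the mean silhouette score,
# which lies in [-1, 1]; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Silhouette score per k:", {k: round(float(s), 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```

A score like this supports hypothesis generation rather than proving that the clusters are real: on real data it is worth checking whether the chosen grouping is stable across methods and parameter settings.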
Target Audience
Students who have some basic familiarity with Python.
Prerequisites
Basic familiarity with Python, as covered in the Introduction to Python course.
Exercises
Exercises in these materials are labelled according to their level of difficulty:
Level | Description |
---|---|
1 | Exercises in level 1 are simpler and designed to get you familiar with the concepts and syntax covered in the course. |
2 | Exercises in level 2 combine different concepts together and apply them to a given task. |
3 | Exercises in level 3 require going beyond the concepts and syntax introduced to solve new problems. |
Acknowledgements
- We thank Martin van Rongen, Vicki Hodgson, Hugo Tavares, Paul Fannon, Matt Castle and the Bioinformatics Facility Training Team for their support and guidance.
- Introduction to Statistical Learning in Python