4 Introduction to Unsupervised Learning
By the end of this module, learners will be able to:
- Define unsupervised learning and explain how it differs from supervised learning in terms of inputs, outputs, and goals.
- Identify common unsupervised techniques, including clustering (e.g., k-means, hierarchical) and dimensionality reduction (e.g., PCA), and describe when each is appropriate.
- Discuss real-world applications of unsupervised learning, such as customer segmentation, anomaly detection, and image compression.
- Explain the role of unsupervised learning in exploratory data analysis.
- Interpret principal component analysis (PCA) intuitively to understand how PCA finds the directions of greatest variance in data.
- Apply dimensionality reduction to a simple multivariate dataset (e.g., crime rates and population by state) to visualize high-dimensional data in two or three dimensions.
- Differentiate unsupervised from supervised problems by examining datasets and deciding whether the task is to uncover patterns versus predict a known target variable.
- Articulate the value of unsupervised learning in uncovering hidden structure in unlabelled data and its importance as data complexity grows.
4.1 Introduction
Unsupervised learning is a branch of machine learning that deals with finding hidden patterns or intrinsic structures in data without the use of labeled responses. Unlike supervised learning, where the model learns from labeled data to predict outcomes, unsupervised learning works with input data that does not have any corresponding output variables. The primary goal is to explore the underlying structure, groupings, or features in the data.
One of the most common applications of unsupervised learning is clustering, where the algorithm groups similar data points together based on their characteristics. This is particularly useful in scenarios such as customer segmentation, anomaly detection, and image compression. Another key technique is dimensionality reduction, which aims to reduce the number of variables under consideration, making it easier to visualize and interpret large datasets.
Unsupervised learning is valuable because it can reveal insights that may not be immediately apparent, uncovering relationships and patterns that might otherwise go unnoticed. It is commonly used in exploratory data analysis and as a preprocessing step for other algorithms. As data continues to grow in complexity and volume, unsupervised learning plays a critical role in making sense of unstructured information.
4.1.1 Motivation
Here is a picture I took of a pavement in Cambridge the day after Valentine's Day. Why did this picture capture my attention? The starkness of the grey pavement contrasted with the bright red rose. It may have triggered some unsupervised learning mechanism in my brain that allows me to pick out anomalies!
Unsupervised learning is all about discovering structure in data without any explicit "right answers" to guide you. The rose-on-pavement photo is a perfect real-world illustration of a few core ideas:
- Anomaly (or Outlier) Detection
  - What happened in your brain: When you look at a uniform grey pavement, your visual system builds an internal "model" of what's normal: flat, texture-repeating, monochrome. The bright red rose doesn't fit that model, so it "pops," drawing your attention.
  - In machine learning: Algorithms like Isolation Forests, One-Class SVMs, or autoencoder-based detectors learn a representation of "normal" data (e.g. patches of pavement) and then flag anything that deviates significantly (e.g. the rose) as an anomaly.
- Feature Extraction & Saliency
  - Human vision analogy: Early in the visual cortex, neurons respond to edges, color contrasts, and textures. A red circle on grey evokes strong responses in "color" and "shape-edge" channels.
  - ML counterpart: Techniques like PCA or deep autoencoders learn low-dimensional "features" (color histograms, texture filters). Dimensions where the rose is extreme (a high red-channel value) are exactly the ones that give us the "anomaly" score.
- Clustering & Pattern Discovery
  - You might not only notice the rose: if there were lots of petals scattered around, your brain could start grouping (clustering) regions of similar color/shape.
  - Unsupervised clustering algorithms (k-means, DBSCAN) would partition image patches into clusters: "pavement patches," "rose petals," maybe even "shadows." Anything that doesn't belong to a big cluster may again be flagged as rare.
- Dimensionality Reduction & Visualization
  - In a high-dimensional feature space (e.g. each 10×10 pixel patch becomes a 300-dimensional vector), you can't "see" clusters easily. Algorithms like t-SNE or UMAP compress that down to 2D so you can actually plot and see the rose patches separate from the pavement patches.

This is why, for instance, visual analytics tools will show outliers as distant points on a scatterplot, just as you instantly spot the rose on the pavement.
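To make the anomaly detection idea concrete, here is a minimal sketch using scikit-learn's IsolationForest. The grey "pavement" patches and the single red "rose" patch are synthetic mean-color triples invented purely for illustration.

```python
# Anomaly detection sketch: synthetic grey "pavement" patches plus one red outlier.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
pavement = rng.normal(loc=[0.5, 0.5, 0.5], scale=0.02, size=(200, 3))  # mean (R, G, B) of grey patches
rose = np.array([[0.9, 0.1, 0.1]])                                     # one bright red patch
X = np.vstack([pavement, rose])

detector = IsolationForest(random_state=0).fit(X)  # learns what "normal" patches look like
flags = detector.predict(X)                        # -1 = anomaly, +1 = normal
print("Patches flagged as anomalous:", np.where(flags == -1)[0])  # the rose is at index 200
```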
4.1.2 Resources
In unsupervised learning, the bottleneck concept refers to a deliberate architectural constraint in a model, typically an autoencoder, where information is compressed through a narrow intermediate representation, often called a latent code or embedding. The model is trained to reconstruct the input data after passing it through this low-dimensional bottleneck, forcing it to learn a compact and informative representation of the underlying structure of the data. Since there are no labels guiding the learning process, the model relies solely on reconstructing its input as accurately as possible, using only the limited information passed through this narrow channel. This compression encourages the model to capture essential features while discarding noise or redundancy.
The bottleneck acts as an inductive bias that promotes dimensionality reduction, feature learning, and denoising. By minimizing reconstruction error while constrained by a reduced latent space, the model implicitly discovers patterns, clusters, and hierarchies within the input data. In practical terms, this is a foundational principle behind many unsupervised representation learning methods, including classical autoencoders, variational autoencoders (VAEs), and self-supervised learning systems that rely on contrastive or generative objectives. The learned low-dimensional codes can then be used for downstream tasks such as clustering, visualization (e.g., with t-SNE or PCA), or as inputs to supervised models in a semi-supervised setting.
Imagine you have a huge library of biological images (say, pictures of different cell types under a microscope) and you want to teach a computer to recognize patterns in those images without telling it what any of the cells are. A "bottleneck" in this context is like asking the computer to summarize each image using only a few key words instead of the entire picture. By forcing it to compress all the rich detail down to a small summary, the computer has to figure out which features (like cell shape, size, or texture) are truly important. This is similar to how a biologist might sketch a simplified diagram of a cell, highlighting its nucleus and membrane but leaving out every ribosome and microtubule.
Because the computer must recreate the original image from that stripped-down summary, it learns to ignore random noise or unimportant quirks (like slight variations in lighting) and focus on the core characteristics shared by similar cell types. In other words, the bottleneck helps the machine discover the hidden "essence" of the data. Once you have those concise summaries, you can use them to cluster cells into groups, visualize how different cell types relate, or even feed them into a second analysis, just as you might reduce a complex DNA dataset to a handful of genetic markers before drawing a phylogenetic tree. This approach lets you explore and interpret large biological datasets more effectively, all without ever providing explicit labels.
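Here is a minimal sketch of the bottleneck idea, using scikit-learn's MLPRegressor as a stand-in for a dedicated deep-learning autoencoder; the data are synthetic (20 correlated features generated from 2 hidden factors) and the 16-2-16 layer sizes are purely illustrative. Because the target is the input itself, no labels are needed, and the reconstruction error measures how much information survived the 2-dimensional bottleneck.

```python
# Bottleneck autoencoder sketch with scikit-learn's MLPRegressor (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                   # 2 true underlying factors
W = rng.normal(size=(2, 20))
X = latent @ W + 0.1 * rng.normal(size=(500, 20))    # 20 observed, correlated features
X = StandardScaler().fit_transform(X)

# Hidden layers 16 -> 2 -> 16: the 2-unit middle layer is the bottleneck (latent code).
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="tanh",
                           max_iter=5000, random_state=0)
autoencoder.fit(X, X)                                # unsupervised: the target is the input

X_reconstructed = autoencoder.predict(X)
mse = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error through the 2-dimensional bottleneck: {mse:.3f}")
```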
4.1.3 Example
Given the data below, how should we reduce the number of features and/or visualize it? This is an unsupervised machine learning problem.
NOTE (IMPORTANT CONCEPT): The columns of this data are the features.
State | Murder (per 100k) | Robbery (per 100k) | Population |
---|---|---|---|
California | 9.1 | 45.3 | 39,512,223 |
Texas | 7.8 | 38.6 | 28,995,881 |
Florida | 5.9 | 31.7 | 21,477,737 |
New York | 3.4 | 26.4 | 19,453,561 |
Illinois | 6.4 | 35.1 | 12,671,821 |
Pennsylvania | 4.8 | 22.9 | 12,801,989 |
Importantly, we are not trying to predict anything. By contrast, in the data below we could try to predict the number of people who moved to each state last year; that would be a supervised machine learning problem (Gareth et al. 2017).
State | Murder (per 100k) | Robbery (per 100k) | Population | People Who Moved (per 100k) |
---|---|---|---|---|
California | 9.1 | 45.3 | 39,512,223 | 5,400 |
Texas | 7.8 | 38.6 | 28,995,881 | 4,100 |
Florida | 5.9 | 31.7 | 21,477,737 | 6,200 |
New York | 3.4 | 26.4 | 19,453,561 | 3,800 |
Illinois | 6.4 | 35.1 | 12,671,821 | 2,900 |
Pennsylvania | 4.8 | 22.9 | 12,801,989 | 2,500 |
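Returning to the first (unsupervised) table, here is a minimal sketch of the dimensionality reduction task: compress the three features to two principal components and plot the states. The numbers are typed in from the table above, and the features are standardized first so that population does not dominate the variance.

```python
# PCA on the state crime-and-population table: 3 features -> 2 principal components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

states = ["California", "Texas", "Florida", "New York", "Illinois", "Pennsylvania"]
X = np.array([
    [9.1, 45.3, 39_512_223],
    [7.8, 38.6, 28_995_881],
    [5.9, 31.7, 21_477_737],
    [3.4, 26.4, 19_453_561],
    [6.4, 35.1, 12_671_821],
    [4.8, 22.9, 12_801_989],
])

X_scaled = StandardScaler().fit_transform(X)        # scale so population does not dominate
scores = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(scores[:, 0], scores[:, 1])
for name, (x, y) in zip(states, scores):
    plt.annotate(name, (x, y))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```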
4.2 What PCA does to the data
4.2.1 Projection of 3D Data
We generate three clusters of synthetic 3-dimensional points, compute the first two principal components using scikit-learn's PCA, and then create a two-panel figure:
- Left panel: A 3D scatter of the original points, the best-fit plane defined by the first two principal components, and projection lines from each point down onto that plane.
- Right panel: A 2D scatter of the projected coordinates (the principal component scores) along the first two components, colored by cluster.
Use this visualization to understand how PCA finds the plane that maximizes variance and how the data look when reduced to two dimensions.
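A minimal sketch of that figure is below; the cluster centers and noise level are made up, and any three reasonably separated 3D clusters would serve just as well.

```python
# Two-panel PCA projection figure: 3D data plus best-fit plane (left), 2D scores (right).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [4, 4, 1], [0, 5, 4]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 3)) for c in centers])  # three 3D clusters
labels = np.repeat([0, 1, 2], 50)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)                   # principal component scores
X_proj = pca.inverse_transform(scores)      # points projected back onto the best-fit plane

fig = plt.figure(figsize=(10, 4))

# Left panel: original points, the PC plane, and projection lines onto it
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.scatter(*X.T, c=labels, cmap="viridis", s=15)
grid = np.linspace(-4, 4, 10)
U, V = np.meshgrid(grid, grid)
plane = pca.mean_ + U[..., None] * pca.components_[0] + V[..., None] * pca.components_[1]
ax3d.plot_surface(plane[..., 0], plane[..., 1], plane[..., 2], alpha=0.15, color="grey")
for p, q in zip(X, X_proj):
    ax3d.plot(*zip(p, q), color="grey", linewidth=0.5)   # projection lines
ax3d.set_title("Original data and best-fit PC plane")

# Right panel: the projected coordinates along the first two components
ax2d = fig.add_subplot(1, 2, 2)
ax2d.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="viridis", s=15)
ax2d.set_xlabel("PC1")
ax2d.set_ylabel("PC2")
ax2d.set_title("Scores along PC1 and PC2")

plt.tight_layout()
plt.show()
```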
4.3 PCA is lossy
PCA does lose some information, but it can still capture most of the salient aspects of your data.
NOTE (IMPORTANT CONCEPT)
Dimensionality reduction techniques (such as PCA) lose information whenever fewer components are kept than there are original features. In other words, they are lossy.
4.3.1 Lesson on lossy compression (PCA applied to image)
4.3.2 Learning Objectives
- Understand how Principal Component Analysis (PCA) can be applied to images.
- Observe how PCA captures the most significant patterns in image data.
- Visualize how the number of principal components affects image reconstruction.
- Appreciate the trade-off between compression and information loss.
4.3.3 Key Concepts
- PCA is a dimensionality reduction technique that identifies directions (principal components) along which the variance in the data is maximized.
- Images can be viewed as high-dimensional data (each pixel as a feature), and PCA helps reduce that dimensionality while preserving key patterns.
4.3.4 Procedure Overview
- Load and display an image from a URL.
- Apply PCA to each RGB channel of the image separately.
- Reconstruct the image using an increasing number of principal components.
- Visualize the reconstructions to show how few components capture most of the image's structure.
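A minimal sketch of this procedure follows. Instead of downloading from a URL, it uses a sample image bundled with scikit-image (so scikit-image is assumed to be installed); the component counts shown are arbitrary.

```python
# PCA applied to each RGB channel of an image, reconstructed from k components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from skimage import data   # bundled sample images stand in for the URL download

image = data.astronaut() / 255.0            # shape (512, 512, 3), values in [0, 1]

def reconstruct(img, n_components):
    """Fit PCA to each color channel separately and reconstruct from n_components."""
    channels = []
    for c in range(3):
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(img[:, :, c])   # each row of the channel is one sample
        channels.append(pca.inverse_transform(scores))
    return np.clip(np.dstack(channels), 0, 1)

fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for ax, k in zip(axes, [5, 20, 50, 200]):
    ax.imshow(reconstruct(image, k))
    ax.set_title(f"{k} components")
    ax.axis("off")
plt.show()
```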
4.3.5 Takeaway Message
PCA can significantly reduce image data dimensionality while preserving salient features, making it a powerful tool for image compression and understanding. However, perfect reconstruction is only possible with all components, revealing the balance between efficiency and fidelity.
4.4 Visual explanations of PCA
- PCA maximizes the variance captured
- App to explain the intuition behind PCA
4.5 Key Concepts
4.5.1 1. Scores and Loadings
What is being plotted on the axes (PC1 and PC2) are the scores.
The scores for each principal component are calculated as follows:
\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]
where \(X\), \(Y\) and \(Z\) are the normalized features.
The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm; they are called the loadings.
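In scikit-learn terms, the scores are returned by pca.transform() and the loadings are the rows of pca.components_. A minimal sketch on synthetic data (the three columns stand in for the normalized features \(X\), \(Y\), \(Z\)):

```python
# Scores and loadings with scikit-learn (synthetic, illustrative data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                  # 100 samples, three features X, Y, Z

X_norm = StandardScaler().fit_transform(X)     # normalize the features first
pca = PCA(n_components=2).fit(X_norm)

scores = pca.transform(X_norm)      # what gets plotted on the PC1 and PC2 axes
loadings = pca.components_          # row 0 holds (alpha, beta, gamma) for PC1

print("Loadings for PC1 (alpha, beta, gamma):", loadings[0])
print("First few PC1 scores:", scores[:3, 0])
```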
4.5.2 2. Variance
- Variance = how spread out the data is.
- PCA finds directions (principal components) that maximize variance.
Formula for variance of variable \(x\):
\[ \text{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
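A quick numerical check of this formula with NumPy (the small data vector is made up); np.var with ddof=1 uses the same \(n - 1\) denominator:

```python
# Sample variance computed by hand and with NumPy; both match the formula above.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
var_manual = np.sum((x - x.mean()) ** 2) / (len(x) - 1)
var_numpy = np.var(x, ddof=1)
print(var_manual, var_numpy)   # both 4.571...
```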
4.6 Example: Gene Expression Data
- Rows = samples (patients)
- Columns = gene expression levels
4.6.1 Goal:
- Reduce dimensionality from 20,000 genes to 2-3 PCs
- Visualize patterns between patient groups (e.g., healthy vs. cancer)
```python
# Sample Python code (requires numpy, sklearn, matplotlib)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = ...  # gene expression matrix (rows = patients, columns = genes)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Gene Expression')
plt.show()
```
4.7 Lesson Summary
Basics of unsupervised learning
Useful for visualization, outlier detection and making sense of your data if there are many features
What it is: Discover hidden patterns or groupings in unlabeled data, without predicting a specific target.
Key techniques:
- Clustering (e.g. k-means, hierarchical) for grouping similar observations
- Dimensionality reduction (e.g. PCA) for compressing and visualizing high-dimensional data
Why it matters:
- Reveals structure in customer segmentation, anomaly detection, image compression, etc.
- Serves as exploratory analysis and preprocessing for downstream tasks
Information bottleneck: Forcing models (like autoencoders) to squeeze data through a narrow "latent code" uncovers the most essential features and removes noise
Hands-on example: Apply PCA to crime-and-population data by state to project three features into two dimensions for visualization
Unsupervised vs. supervised:
- Unsupervised: No labels, focus on pattern discovery
- Supervised: With labels, focus on predicting a known outcome
4.8 Acknowledgements
We thank Martin van Rongen, Vicki Hodgson, Hugo Tavares, Paul Fannon, Matt Castle and the Bioinformatics Facility Training Team for their support and guidance.
4.9 Resources
Video lectures by the authors of the book Introduction to Statistical Learning in Python
https://github.com/neelsoumya/public_teaching_unsupervised_learning