4  Introduction to Unsupervised Learning

Learning Objectives

By the end of this module, learners will be able to:

  • Define unsupervised learning and explain how it differs from supervised learning in terms of inputs, outputs, and goals.

  • Identify common unsupervised techniques, including clustering (e.g., k‑means, hierarchical) and dimensionality reduction (e.g., PCA), and describe when each is appropriate.

  • Discuss real‑world applications of unsupervised learning, such as customer segmentation, anomaly detection, and image compression.

  • Explain the role of unsupervised learning in exploratory data analysis.

  • Interpret principal component analysis (PCA) intuitively to understand how PCA finds the directions of greatest variance in data.

  • Apply dimensionality reduction to a simple multivariate dataset (e.g., crime rates and population by state) to visualize high‑dimensional data in two or three dimensions.

  • Differentiate unsupervised from supervised problems by examining datasets and deciding whether the task is to uncover patterns versus predict a known target variable.

  • Articulate the value of unsupervised learning in uncovering hidden structure in unlabelled data and its importance as data complexity grows.

4.1 Introduction

Unsupervised learning is a branch of machine learning that deals with finding hidden patterns or intrinsic structures in data without the use of labeled responses. Unlike supervised learning, where the model learns from labeled data to predict outcomes, unsupervised learning works with input data that does not have any corresponding output variables. The primary goal is to explore the underlying structure, groupings, or features in the data.

One of the most common applications of unsupervised learning is clustering, where the algorithm groups similar data points together based on their characteristics. This is particularly useful in scenarios such as customer segmentation, anomaly detection, and image compression. Another key technique is dimensionality reduction, which aims to reduce the number of variables under consideration, making it easier to visualize and interpret large datasets.

Unsupervised learning is valuable because it can reveal insights that may not be immediately apparent, uncovering relationships and patterns that might otherwise go unnoticed. It is commonly used in exploratory data analysis and as a preprocessing step for other algorithms. As data continues to grow in complexity and volume, unsupervised learning plays a critical role in making sense of unstructured information.

4.1.1 Motivation

Here is a picture I took of a pavement in Cambridge the day after Valentine’s Day. Why did this picture capture my attention? The starkness of the grey pavement contrasted with the bright red rose. It may have triggered some unsupervised learning mechanism in my brain that allows me to pick out anomalies!

Rose after Valentine’s Day

Unsupervised learning is all about discovering structure in data without any explicit “right answers” to guide you. The rose‑on‑pavement photo is a perfect real‑world illustration of a few core ideas:

  • Anomaly (or Outlier) Detection

  • What happened in your brain:
    When you look at a uniform grey pavement, your visual system builds an internal “model” of what’s normal—flat, texture‑repeating, monochrome. The bright red rose doesn’t fit that model, so it “pops,” drawing your attention.

  • In machine learning:
    Algorithms like Isolation Forests, One‑Class SVMs, or autoencoder‑based detectors learn a representation of “normal” data (e.g. patches of pavement) and then flag anything that deviates significantly (e.g. the rose) as an anomaly.

  • Feature Extraction & Saliency

  • Human vision analogy:
    Early in the visual cortex, neurons respond to edges, color contrasts, textures. A red circle on grey evokes strong responses in “color” and “shape‑edge” channels.

  • ML counterpart:
    Techniques like PCA or deep autoencoders learn low‑dimensional “features” (color histograms, texture filters). Dimensions where the rose is extreme (high red‑channel value) are exactly the ones that give us the “anomaly” score.

  • Clustering & Pattern Discovery You might not only notice the rose, but if there were lots of petals scattered around, your brain could start grouping (clustering) regions of similar color/shape.

Unsupervised clustering algorithms (k‑means, DBSCAN) would partition image patches into clusters—“pavement patches,” “rose petals,” maybe even “shadows.” Anything that doesn’t belong to a big cluster may again be flagged as rare.

  • Dimensionality Reduction & Visualization In a high‑dimensional feature space (e.g. each 10×10 pixel patch → a 300‑dim vector), you can’t “see” clusters easily. Algorithms like t‑SNE or UMAP compress that down to 2D so you can actually plot and see the rose‑patches separate from pavement.

This is why, for instance, visual analytics tools will show outliers as distant points on a scatterplot—just as you instantly spot the rose on the pavement.
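
To make the anomaly‑detection idea above concrete, here is a minimal sketch (not from the original lesson; the colour values are invented purely for illustration) that trains scikit‑learn’s IsolationForest on synthetic grey “pavement” features and flags a red “rose” point as an outlier:

# Minimal anomaly-detection sketch: grey "pavement" patches vs. one red "rose" patch.
# The RGB-like feature values are made up purely for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 200 "pavement" patches: roughly equal R, G, B values around mid-grey
pavement = rng.normal(loc=[0.5, 0.5, 0.5], scale=0.05, size=(200, 3))

# One "rose" patch: strongly red, weak green and blue
rose = np.array([[0.9, 0.1, 0.15]])
X = np.vstack([pavement, rose])

# fit_predict returns -1 for points flagged as anomalies and 1 for "normal" points
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)

print("Flagged as anomalies:", np.where(labels == -1)[0])  # the rose (index 200) should be among them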

4.1.2 Resources

PCA intuition

Key Concept

Information bottleneck

In unsupervised learning, the bottleneck concept refers to a deliberate architectural constraint in a model—typically an autoencoder—where information is compressed through a narrow intermediate representation, often called a latent code or embedding. The model is trained to reconstruct the input data after passing it through this low-dimensional bottleneck, forcing it to learn a compact and informative representation of the underlying structure of the data. Since there are no labels guiding the learning process, the model relies solely on reconstructing its input as accurately as possible, using only the limited information passed through this narrow channel. This compression encourages the model to capture essential features while discarding noise or redundancy.

The bottleneck acts as an inductive bias that promotes dimensionality reduction, feature learning, and denoising. By minimizing reconstruction error while constrained by a reduced latent space, the model implicitly discovers patterns, clusters, and hierarchies within the input data. In practical terms, this is a foundational principle behind many unsupervised representation learning methods, including classical autoencoders, variational autoencoders (VAEs), and self-supervised learning systems that rely on contrastive or generative objectives. The learned low-dimensional codes can then be used for downstream tasks such as clustering, visualization (e.g., with t-SNE or PCA), or as inputs to supervised models in a semi-supervised setting.

Imagine you have a huge library of biological images—say, pictures of different cell types under a microscope—and you want to teach a computer to recognize patterns in those images without telling it what any of the cells are. A “bottleneck” in this context is like asking the computer to summarize each image using only a few key words instead of the entire picture. By forcing it to compress all the rich detail down to a small summary, the computer has to figure out which features—like cell shape, size, or texture—are truly important. This is similar to how a biologist might sketch a simplified diagram of a cell, highlighting its nucleus and membrane but leaving out every ribosome and microtubule.

Because the computer must recreate the original image from that stripped‑down summary, it learns to ignore random noise or unimportant quirks (like slight variations in lighting) and focus on the core characteristics shared by similar cell types. In other words, the bottleneck helps the machine discover the hidden “essence” of the data. Once you have those concise summaries, you can use them to cluster cells into groups, visualize how different cell types relate, or even feed them into a second analysis—just as you might reduce a complex DNA dataset to a handful of genetic markers before drawing a phylogenetic tree. This approach lets you explore and interpret large biological datasets more effectively, all without ever providing explicit labels.
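
The simplest concrete analogue of this bottleneck is PCA itself: compress the data into a few components, reconstruct it, and see what was lost. The sketch below is a minimal illustration on synthetic data (not part of the original lesson); a trained autoencoder does the same thing with a nonlinear encoder and decoder.

# Bottleneck sketch: squeeze 50-dimensional synthetic data through a 2-dimensional
# "code" with PCA, reconstruct it, and measure what was lost.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

latent = rng.normal(size=(300, 2))                      # hidden 2-D structure
mixing = rng.normal(size=(2, 50))                       # map it into 50 observed features
X = latent @ mixing + 0.1 * rng.normal(size=(300, 50))  # add a little noise

pca = PCA(n_components=2)
codes = pca.fit_transform(X)                   # the narrow "bottleneck" representation
X_reconstructed = pca.inverse_transform(codes)

mse = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error (MSE): {mse:.4f}")
print(f"Variance captured by the 2-D code: {pca.explained_variance_ratio_.sum():.1%}")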

4.1.3 Example

Given the data below, how should we reduce the number of features and/or visualize it? This is an unsupervised machine learning problem.

Tip

NOTE (IMPORTANT CONCEPT): The columns of this data are the features.

State        | Murder (per 100k) | Robbery (per 100k) | Population
California   | 9.1               | 45.3               | 39,512,223
Texas        | 7.8               | 38.6               | 28,995,881
Florida      | 5.9               | 31.7               | 21,477,737
New York     | 3.4               | 26.4               | 19,453,561
Illinois     | 6.4               | 35.1               | 12,671,821
Pennsylvania | 4.8               | 22.9               | 12,801,989

Importantly, we are not trying to predict anything. By contrast, in the data below we could try to predict the number of people who moved to each state last year; that would be a supervised machine learning problem (Gareth et al. 2017).

State        | Murder (per 100k) | Robbery (per 100k) | Population | People Who Moved (per 100k)
California   | 9.1               | 45.3               | 39,512,223 | 5,400
Texas        | 7.8               | 38.6               | 28,995,881 | 4,100
Florida      | 5.9               | 31.7               | 21,477,737 | 6,200
New York     | 3.4               | 26.4               | 19,453,561 | 3,800
Illinois     | 6.4               | 35.1               | 12,671,821 | 2,900
Pennsylvania | 4.8               | 22.9               | 12,801,989 | 2,500
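
Returning to the unsupervised version, here is a minimal sketch of how it could be set up in Python, using only the three feature columns from the first table (the “People Who Moved” column is left out because nothing is being predicted):

# Reduce the three state-level features to two principal components for plotting.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame(
    {
        "Murder": [9.1, 7.8, 5.9, 3.4, 6.4, 4.8],
        "Robbery": [45.3, 38.6, 31.7, 26.4, 35.1, 22.9],
        "Population": [39_512_223, 28_995_881, 21_477_737, 19_453_561, 12_671_821, 12_801_989],
    },
    index=["California", "Texas", "Florida", "New York", "Illinois", "Pennsylvania"],
)

# Standardize first: Population is on a vastly larger scale than the crime rates.
X_scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)   # one (PC1, PC2) pair per state

for state, (pc1, pc2) in zip(data.index, scores):
    print(f"{state:12s}  PC1 = {pc1:6.2f}  PC2 = {pc2:6.2f}")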

4.2 What PCA does to the data

4.2.1 Projection of 3D Data

We generate three clusters of synthetic 3‑dimensional points, compute the first two principal components using scikit‑learn’s PCA, and then create a two‑panel figure:

  1. Left panel: A 3D scatter of the original points, the best‐fit plane defined by the first two principal components, and projection lines from each point down onto that plane.
  2. Right panel: A 2D scatter of the projected coordinates (the principal component scores) along the first two components, colored by cluster.

Use this visualization to understand how PCA finds the plane that maximizes variance and how the data look when reduced to two dimensions.
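
Below is a minimal sketch of the right‑hand panel, assuming three Gaussian clusters of synthetic 3D points (the best‑fit plane and projection lines of the left panel are omitted for brevity):

# Generate three clusters of 3-D points, project them onto the first two
# principal components, and plot the resulting 2-D scores.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [4, 4, 1], [0, 5, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(50, 3)) for c in centers])
labels = np.repeat([0, 1, 2], 50)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # coordinates of each point on PC1 and PC2

plt.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="viridis")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("3-D clusters projected onto the first two principal components")
plt.show()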

4.3 PCA is lossy

PCA does lose some information, but it can still capture most of the salient aspects of your data.

Tip

NOTE (IMPORTANT CONCEPT)

Dimensionality reduction techniques (such as PCA) always lose some information. In other words, they are lossy.

4.3.1 Lesson on lossy compression (PCA applied to image)

4.3.2 Learning Objectives

  • Understand how Principal Component Analysis (PCA) can be applied to images.
  • Observe how PCA captures the most significant patterns in image data.
  • Visualize how the number of principal components affects image reconstruction.
  • Appreciate the trade-off between compression and information loss.

4.3.3 Key Concepts

  • PCA is a dimensionality reduction technique that identifies directions (principal components) along which the variance in the data is maximized.
  • Images can be viewed as high-dimensional data (each pixel as a feature), and PCA helps reduce that dimensionality while preserving key patterns.

4.3.4 Procedure Overview

  1. Load and display an image from a URL.
  2. Apply PCA to each RGB channel of the image separately.
  3. Reconstruct the image using an increasing number of principal components.
  4. Visualize the reconstructions to show how few components capture most of the image’s structure.
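
A minimal sketch of this procedure, assuming a local RGB image file called example.jpg (a placeholder name; the lesson itself loads its image from a URL) that is at least a few hundred pixels in each dimension:

# Compress each RGB channel of an image with PCA and reconstruct it
# from a limited number of principal components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

img = plt.imread("example.jpg") / 255.0   # placeholder path; uint8 image scaled to [0, 1]

def reconstruct(img, n_components):
    """Apply PCA to each colour channel (rows as samples) and reconstruct it."""
    channels = []
    for c in range(3):
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(img[:, :, c])
        channels.append(pca.inverse_transform(scores))
    return np.clip(np.dstack(channels), 0, 1)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, k in zip(axes, [5, 20, 50, 100]):
    ax.imshow(reconstruct(img, k))
    ax.set_title(f"{k} components")
    ax.axis("off")
plt.show()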

4.3.5 Takeaway Message

PCA can significantly reduce image data dimensionality while preserving salient features, making it a powerful tool for image compression and understanding. However, perfect reconstruction is only possible with all components, revealing the balance between efficiency and fidelity.

Activity: Playable version of PCA in browser

4.4 Visual explanations of PCA

Intuition
  • PCA maximizes the variance captured

  • App to explain the intuition behind PCA

Animation

Animation

4.5 📊 Key Concepts

4.5.1 1. Scores and Loadings

The values plotted on the axes (PC1 and PC2) are the scores.

The scores for each principal component are calculated as follows:

\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]

where \(X\), \(Y\) and \(Z\) are the normalized features.

The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm. They are called the loadings.
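
In scikit‑learn, the loadings are the rows of pca.components_, and applying them to the normalized features reproduces the scores returned by fit_transform. A small sketch on synthetic data (no dataset is specified here, so the numbers are made up):

# Scores vs. loadings in scikit-learn: scores = (normalized data - mean) @ loadings.T
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))  # synthetic stand-ins for features X, Y, Z
X_norm = StandardScaler().fit_transform(X)          # the normalized features

pca = PCA(n_components=2)
scores = pca.fit_transform(X_norm)                  # what gets plotted on the PC1/PC2 axes

loadings = pca.components_                          # shape (2, 3): the alpha, beta, gamma for each PC
manual_scores = (X_norm - pca.mean_) @ loadings.T   # the linear combination written out by hand

print(np.allclose(scores, manual_scores))           # True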

4.5.2 2. Variance

  • Variance = how spread out the data is.
  • PCA finds directions (principal components) that maximize variance.

Formula for variance of variable \(x\):

\[ \text{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
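
A quick numerical check of this formula against NumPy, using a made‑up vector (ddof=1 gives the same n − 1 denominator):

# Check the sample-variance formula against NumPy's built-in (ddof=1 uses n - 1).
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

manual_var = np.sum((x - x.mean()) ** 2) / (n - 1)
print(manual_var, np.var(x, ddof=1))   # both print 4.571428...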


4.6 🔬 Example: Gene Expression Data

  • Rows = samples (patients)
  • Columns = gene expression levels

4.6.1 Goal:

  • Reduce dimensionality from 20,000 genes to 2-3 PCs
  • Visualize patterns between patient groups (e.g., healthy vs. cancer)

# Sample Python code (requires scikit-learn and matplotlib)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = ...  # gene expression matrix: rows = samples (patients), columns = genes
X_scaled = StandardScaler().fit_transform(X)  # standardize each gene to mean 0, variance 1

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # scores for the first two principal components

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Gene Expression')
plt.show()

4.7 Lesson Summary

  • Basics of unsupervised learning

  • Useful for visualization, outlier detection and making sense of your data if there are many features

  • What it is: Discover hidden patterns or groupings in unlabeled data, without predicting a specific target.

  • Key techniques:

    • Clustering (e.g. k‑means, hierarchical) for grouping similar observations
    • Dimensionality reduction (e.g. PCA) for compressing and visualizing high‑dimensional data
  • Why it matters:

    • Reveals structure in customer segmentation, anomaly detection, image compression, etc.
    • Serves as exploratory analysis and preprocessing for downstream tasks
  • Information bottleneck: Forcing models (like autoencoders) to squeeze data through a narrow “latent code” uncovers the most essential features and removes noise

  • Hands‑on example: Apply PCA to crime‑and‑population data by state to project three features into two dimensions for visualization

  • Unsupervised vs. supervised:

    • Unsupervised: No labels, focus on pattern discovery
    • Supervised: With labels, focus on predicting a known outcome

4.8 Acknowledgements

We thank Martin van Rongen, Vicki Hodgson, Hugo Tavares, Paul Fannon, Matt Castle and the Bioinformatics Facility Training Team for their support and guidance.

4.9 Resources