4 Introduction to Unsupervised Learning
By the end of this module, learners will be able to:
- Define unsupervised learning and explain how it differs from supervised learning in terms of inputs, outputs, and goals.
- Identify common unsupervised techniques, including clustering (e.g., k-means, hierarchical) and dimensionality reduction (e.g., PCA), and describe when each is appropriate.
- Discuss real-world applications of unsupervised learning, such as customer segmentation, anomaly detection, and image compression.
- Explain the role of unsupervised learning in exploratory data analysis.
- Interpret principal component analysis (PCA) intuitively to understand how PCA finds the directions of greatest variance in data.
- Apply dimensionality reduction to a simple multivariate dataset (e.g., crime rates and population by state) to visualize high-dimensional data in two or three dimensions.
- Differentiate unsupervised from supervised problems by examining datasets and deciding whether the task is to uncover patterns versus predict a known target variable.
- Articulate the value of unsupervised learning in uncovering hidden structure in unlabelled data and its importance as data complexity grows.
4.1 Introduction
Unsupervised learning is a branch of machine learning that deals with finding hidden patterns or intrinsic structures in data without the use of labeled responses. Unlike supervised learning, where the model learns from labeled data to predict outcomes, unsupervised learning works with input data that does not have any corresponding output variables. The primary goal is to explore the underlying structure, groupings, or features in the data.
One of the most common applications of unsupervised learning is clustering, where the algorithm groups similar data points together based on their characteristics. This is particularly useful in scenarios such as customer segmentation, anomaly detection, and image compression. Another key technique is dimensionality reduction, which aims to reduce the number of variables under consideration, making it easier to visualize and interpret large datasets.
Unsupervised learning is valuable because it can reveal insights that may not be immediately apparent, uncovering relationships and patterns that might otherwise go unnoticed. It is commonly used in exploratory data analysis and as a preprocessing step for other algorithms. As data continues to grow in complexity and volume, unsupervised learning plays a critical role in making sense of unstructured information.
4.1.1 Motivation
Here is a picture I took of a pavement in Cambridge the day after Valentine's Day. Why did this picture capture my attention? The starkness of the grey pavement contrasted with the bright red rose. It may have triggered some unsupervised learning mechanism in my brain that allows me to pick out anomalies!
Unsupervised learning is all about discovering structure in data without any explicit "right answers" to guide you. The rose-on-pavement photo is a perfect real-world illustration of a few core ideas:
Anomaly (or Outlier) Detection
What happened in your brain:
When you look at a uniform grey pavement, your visual system builds an internal "model" of what is normal: flat, texture-repeating, monochrome. The bright red rose doesn't fit that model, so it "pops," drawing your attention.

In machine learning:
Algorithms can learn a representation of "normal" data (e.g. patches of pavement) and then flag anything that deviates significantly (e.g. the rose) as an anomaly.

Human vision analogy:
Early in the visual cortex, neurons respond to edges, color contrasts, and textures. A red circle on grey evokes strong responses in "color" and "shape-edge" channels.
4.1.2 Example
Given the data below, how should we reduce the number of features and/or visualize it? This is an unsupervised machine learning problem.
NOTE (IMPORTANT CONCEPT): The columns of this data are the features.
| State | Murder (per 100k) | Robbery (per 100k) | Population |
|---|---|---|---|
| California | 9.1 | 45.3 | 39,512,223 |
| Texas | 7.8 | 38.6 | 28,995,881 |
| Florida | 5.9 | 31.7 | 21,477,737 |
| New York | 3.4 | 26.4 | 19,453,561 |
| Illinois | 6.4 | 35.1 | 12,671,821 |
| Pennsylvania | 4.8 | 22.9 | 12,801,989 |
NOTE (IMPORTANT CONCEPT): Importantly, in unsupervised machine learning we are not trying to predict anything. For example, with the data below we could try to predict the number of people who moved to each state last year; that would be a supervised machine learning problem (Gareth et al. 2017).
| State | Murder (per 100k) | Robbery (per 100k) | Population | People Who Moved (per 100k) |
|---|---|---|---|---|
| California | 9.1 | 45.3 | 39,512,223 | 5,400 |
| Texas | 7.8 | 38.6 | 28,995,881 | 4,100 |
| Florida | 5.9 | 31.7 | 21,477,737 | 6,200 |
| New York | 3.4 | 26.4 | 19,453,561 | 3,800 |
| Illinois | 6.4 | 35.1 | 12,671,821 | 2,900 |
| Pennsylvania | 4.8 | 22.9 | 12,801,989 | 2,500 |
NOTE (IMPORTANT CONCEPTS):
If there is a column in the table that we are trying to predict, this would be called a label. Supervised machine learning (such as linear regression) tries to predict the label given the features (Gareth et al. 2017).
However in unsupervised machine learning, we only deal with features and do not try to predict anything.
Hence there are no labels in unsupervised machine learning.
NOTE (IMPORTANT CONCEPTS):
The number of columns in the data is called the dimension of the data.
For example, if there are 3 columns, this is 3-dimensional data.
We can visualize it in 3 dimensions in a 3D plot.
Hence the columns of the data, or dimensions, form a coordinate system, and we can visualize the data in a plot: the X axis represents the values in the first column, the Y axis the values in the second column, and so on.
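As a concrete illustration, the three-column crime table above can be plotted with one axis per column. Below is a minimal sketch using matplotlib; the values are simply copied from the table in Section 4.1.2.

# Plotting 3-dimensional data: one axis per column (requires matplotlib)
import matplotlib.pyplot as plt

# Values copied from the table in Section 4.1.2
states = ["California", "Texas", "Florida", "New York", "Illinois", "Pennsylvania"]
murder = [9.1, 7.8, 5.9, 3.4, 6.4, 4.8]           # per 100k
robbery = [45.3, 38.6, 31.7, 26.4, 35.1, 22.9]    # per 100k
population = [39_512_223, 28_995_881, 21_477_737, 19_453_561, 12_671_821, 12_801_989]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # each column of the table becomes one axis
ax.scatter(murder, robbery, population)
for name, x, y, z in zip(states, murder, robbery, population):
    ax.text(x, y, z, name)              # label each point with its state
ax.set_xlabel("Murder (per 100k)")
ax.set_ylabel("Robbery (per 100k)")
ax.set_zlabel("Population")
plt.show()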
How would you visualize a table that has 14 columns/dimensions?
4.2 Curse of dimensionality
How would you visualize data that has 14 dimensions? How about 1 million dimensions (can happen in the age of big data)?
Would you remove one column at a time?
Would you plot each feature/column vs the other exhaustively? How many pairwise plots would you need to plot for exhaustive visualization?
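For n features there are n(n - 1)/2 possible pairs of features ("n choose 2"), so the number of plots grows explosively. A quick check using Python's standard library:

# Counting the pairwise scatter plots needed for exhaustive visualization
from math import comb

for n in [5, 14, 1_000_000]:
    print(f"Number of pairwise plots for {n} features: {comb(n, 2)}")

# Output:
#   5 features         -> 10 plots
#   14 features        -> 91 plots
#   1,000,000 features -> 499,999,500,000 plots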
NOTE (IMPORTANT CONCEPT)
It is not possible to exhaustively plot one feature vs. another or remove any feature willy-nilly. For example, in a dataset from patient healthcare records, we cannot just randomly throw away data on blood pressure.
This is why we use dimensionality reduction: it reduces the dimensions of the data, yielding a smaller set of new features.
4.3 What PCA does to the data
4.3.1 Projection of 3D Data
We generate three clusters of synthetic 3-dimensional points, compute the first two principal components using scikit-learn's PCA, and then create a two-panel figure:
- Left panel: A 3D scatter of the original points, the best-fit plane defined by the first two principal components, and projection lines from each point down onto that plane.
- Right panel: A 2D scatter of the projected coordinates (the principal component scores) along the first two components, colored by cluster.
Use this visualization to understand how PCA finds the plane that maximizes variance and how the data look when reduced to two dimensions.
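Below is a minimal sketch of how such a figure could be produced. The synthetic cluster centers, spreads, and figure layout are illustrative choices, and the best-fit plane is only hinted at by the grey projection lines rather than drawn as a surface.

# Sketch: project synthetic 3D clusters onto the first two principal components
# (requires numpy, scikit-learn, matplotlib; cluster centers below are illustrative)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [4, 4, 1], [0, 5, 3]])
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 3)) for c in centers])
labels = np.repeat([0, 1, 2], 50)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)              # coordinates in the PC1-PC2 plane
X_proj = pca.inverse_transform(scores)     # the same points projected back into 3D

fig = plt.figure(figsize=(10, 4))

# Left panel: original 3D points and their projection lines onto the PCA plane
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
for p, q in zip(X, X_proj):
    ax1.plot([p[0], q[0]], [p[1], q[1]], [p[2], q[2]], color="grey", linewidth=0.5)
ax1.set_title("Original 3D data")

# Right panel: the 2D scores along the first two principal components
ax2 = fig.add_subplot(1, 2, 2)
ax2.scatter(scores[:, 0], scores[:, 1], c=labels)
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_title("Projected onto 2 components")

plt.tight_layout()
plt.show()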
4.4 Another intuition behind PCA
Principal Component Analysis (PCA) reduces the number of dimensions in your data by finding the directions where the data varies the most and keeping only those directions.
Imagine you have a book: it has length, width, and height. But if the book is very thin, almost all of its size or volume is in the length and widthâthe height is so small that you can almost ignore it. If you wanted to describe where the book is on a table, you could just use two numbers (length and width), and you wouldnât lose much information by ignoring the height.
PCA works the same way: it finds the thin directions in your data that do not add much information, and lets you focus on the most important dimensions. This makes it easier to visualize and analyze your data without losing the main patterns.
NOTE (IMPORTANT CONCEPT)
Say your data has too many columns/dimensions/features. Dimensionality reduction techniques (such as PCA) reduce the number of dimensions or the number of columns in your data.
4.5 PCA is lossy
PCA does lose some information. But it can capture some/most of the salient aspects of your data.
NOTE (IMPORTANT CONCEPT)
Dimensionality reduction techniques (such as PCA) always lose some information. In other words, they are lossy.
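A minimal numeric sketch of this point, using random 3-dimensional data purely for illustration: keeping only 2 of the 3 components captures less than 100% of the variance, and the reconstruction no longer matches the original exactly.

# Sketch: PCA with fewer components cannot reconstruct the data exactly
# (requires numpy and scikit-learn; the random data is purely illustrative)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # 200 points, 3 features

pca = PCA(n_components=2)                 # keep only 2 of the 3 dimensions
scores = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(scores)

print("Variance captured:", pca.explained_variance_ratio_.sum())     # < 1.0
print("Reconstruction error:", np.mean((X - X_reconstructed) ** 2))  # > 0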
4.5.1 Lesson on lossy compression (PCA applied to image)
4.5.2 Learning Objectives
- Understand how Principal Component Analysis (PCA) can be applied to images.
- Observe how PCA captures the most significant patterns in image data.
- Visualize how the number of principal components affects image reconstruction.
- Appreciate the trade-off between compression and information loss.
4.5.3 Key Concepts
- PCA is a dimensionality reduction technique that identifies directions (principal components) along which the variance in the data is maximized.
- Images can be viewed as high-dimensional data (each pixel as a feature), and PCA helps reduce that dimensionality while preserving key patterns.
4.5.4 Procedure Overview
- Load and display an image from a URL.
- Apply PCA to each RGB channel of the image separately.
- Reconstruct the image using an increasing number of principal components.
- Visualize the reconstructions to show how few components capture most of the image's structure.
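A minimal sketch of this procedure is shown below. The filename "example.jpg" is a placeholder for whatever image the lesson loads, the component counts are illustrative, and each RGB channel is treated as a matrix whose rows are the samples PCA operates on.

# Sketch: lossy image compression with PCA, one RGB channel at a time
# (requires numpy, scikit-learn, matplotlib, and Pillow for JPEG reading;
#  "example.jpg" is a placeholder path)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

image = plt.imread("example.jpg") / 255.0      # JPEGs load as 0-255 integers

def reconstruct(channel, n_components):
    """Compress one H x W channel to n_components and reconstruct it."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(channel)        # rows of pixels act as samples
    return pca.inverse_transform(scores)

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, k in zip(axes, [5, 20, 50, 100]):      # assumes the image has >= 100 rows and columns
    channels = [reconstruct(image[:, :, c], k) for c in range(3)]
    approx = np.clip(np.stack(channels, axis=2), 0, 1)
    ax.imshow(approx)
    ax.set_title(f"{k} components")
    ax.axis("off")
plt.show()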
4.5.5 Takeaway Message
PCA can significantly reduce image data dimensionality while preserving salient features, making it a powerful tool for image compression and understanding. However, perfect reconstruction is only possible with all components, revealing the balance between efficiency and fidelity.
In unsupervised learning, the bottleneck concept refers to a deliberate constraint where information is compressed through a narrow intermediate step. The model is trained to reconstruct the input data after passing it through this low-dimensional bottleneck, forcing it to learn a compact and informative representation of the underlying structure of the data. Since there are no labels guiding the learning process, the model relies solely on reconstructing its input as accurately as possible, using only the limited information passed through this narrow channel. This compression encourages the model to capture essential features while discarding noise or redundancy.
Because the computer must recreate the original image from that stripped-down summary, it learns to ignore random noise.
4.6 Visual explanations of PCA
- PCA maximizes the variance captured
- App to explain the intuition behind PCA
4.7 Differences between PCA and linear regression
Does the figure above look similar to linear regression? Is PCA the same as linear regression?
NOTE (IMPORTANT CONCEPT): PCA is not linear regression. It looks similar though, does it not?
Linear regression is a predictive model. PCA is not. You cannot use PCA to predict anything. You can use PCA to only pick out patterns in your data.
4.8 Key Concepts
4.8.1 1. Scores and Loadings
What is being plotted on the axes (PC1 and PC2) are the scores. The scores for each principal component are calculated as follows:

\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]
where \(X\), \(Y\) and \(Z\) are the normalized features.
The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm. They are called the loadings.
4.8.2 2. Linear combinations
NOTE (IMPORTANT CONCEPT): The principal components are linear combinations of the original features. Hence they can be a bit difficult to interpret.
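A minimal sketch of this relationship, using random data purely for illustration: scikit-learn stores the loadings in pca.components_, and the scores returned by fit_transform() are exactly these linear combinations applied to the mean-centered, normalized features.

# Sketch: scores are linear combinations of the normalized features,
# with the loadings as coefficients (random data used purely for illustration)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 features (X, Y, Z)
X_norm = StandardScaler().fit_transform(X)         # normalized features

pca = PCA(n_components=2)
scores = pca.fit_transform(X_norm)                 # what gets plotted on PC1 and PC2

loadings = pca.components_                         # one row of (alpha, beta, gamma) per PC
manual_scores = (X_norm - pca.mean_) @ loadings.T  # apply the linear combinations by hand

print(np.allclose(scores, manual_scores))          # True: identical numbers
print("Loadings for PC1:", loadings[0])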
4.8.3 3. Variance
- Variance = how spread out the data is.
- PCA finds directions (principal components) that maximize variance.
4.9 Example: Gene Expression Data
- Rows = samples (patients)
- Columns = gene expression levels
4.9.1 Goal:
- Reduce dimensionality from 20,000 genes to 2-3 PCs
- Visualize patterns between patient groups (e.g., healthy vs. cancer)
# Sample Python code (requires numpy, sklearn, matplotlib)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = ...  # gene expression matrix
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Gene Expression')
plt.show()
4.10 Lesson Summary
Basics of unsupervised learning
Useful for visualization, outlier detection, and making sense of your data when there are many features
What it is: Discover hidden patterns or groupings in unlabeled data, without predicting a specific target.
Key techniques:
- Clustering for grouping similar observations
- Dimensionality reduction (e.g. PCA) for compressing and visualizing high-dimensional data
Why it matters:
- Reveals structure in customer segmentation, anomaly detection, image compression, etc.
- Serves as exploratory analysis and preprocessing for downstream tasks
Information bottleneck: Forcing models to squeeze data through a narrow bottleneck uncovers the most essential features and removes noise
Hands-on example: Apply PCA to crime-and-population data by state to project three features into two dimensions for visualization
Unsupervised vs. supervised:
- Unsupervised: No labels, focus on pattern discovery
- Supervised: With labels, focus on predicting a known outcome