8  Clustering

Learning Objectives
  • By the end of this lesson, students will be able to:
  1. Explain the goal and applications of clustering
    • Define unsupervised learning and distinguish clustering from classification.
    • Cite real‑world uses (e.g., customer segmentation, anomaly detection).
  2. Describe and execute hierarchical clustering methods
    • Compute and interpret a dendrogram.
    • Apply various linkage criteria (single, complete, average, Ward’s, etc.) and distance metrics (Euclidean, Manhattan, cosine).
  3. Implement and analyze K‑Means clustering
    • Articulate the iterative assignment–update steps and convergence conditions.
    • Write or follow pseudocode for K‑Means, including centroid initialization strategies.
    • Compute and interpret the within‑cluster variation objective.
  4. Compare clustering techniques and select appropriately
    • Identify strengths and weaknesses of hierarchical vs. K‑Means approaches.
    • Choose between methods based on data characteristics (number of clusters, scalability, hierarchy needs).

8.1 Warm-up Puzzle

  • Is the picture below fake or real?

  • How can a computer determine if the picture below is fake or real?

Taj Mahal bathed in the Northern Lights. Generated using the DALL-E tool.

8.2 Clustering Overview

Clustering is an unsupervised learning technique used to group similar data points together. Unlike classification, there are no pre-defined labels. Instead, the algorithm tries to discover structure in the data by maximizing intra-cluster similarity and minimizing inter-cluster similarity.

Key points:

  • Objective: Identify natural groupings in the data.

  • Applications: Customer segmentation, image compression, anomaly detection, document clustering.

8.3 Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of clusters using either a bottom‑up (agglomerative) or top‑down (divisive) approach.

8.3.1 Agglomerative (Bottom‑Up)

  1. Initialization: Start with each data point as its own cluster.

  2. Merge Steps:

    • Compute distance between every pair of clusters.
    • Merge the two closest clusters.
    • Update the distance matrix.
  3. Termination: Repeat until all points are in a single cluster or a stopping criterion (e.g., desired number of clusters) is met.
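To make these merge steps concrete, here is a minimal, illustrative sketch in NumPy that measures the distance between two clusters as the distance between their closest members (the "single linkage" rule introduced below). The function name and toy data are made up for this example; it is deliberately naive and quadratic, and in practice you would use scipy or scikit-learn as shown later in this lesson.

import numpy as np

def naive_agglomerative(X, n_clusters=2):
    """Illustrative single-linkage agglomerative clustering (not optimized)."""
    # 1. Initialization: every point starts in its own cluster
    clusters = [[i] for i in range(len(X))]

    # 3. Termination: stop when the desired number of clusters remains
    while len(clusters) > n_clusters:
        # 2. Merge step: find the pair of clusters whose closest members are nearest
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]                           # "update the distance matrix" step

    # Return a cluster label per data point
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Tiny example: two obvious groups on a line
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(naive_agglomerative(X_demo, n_clusters=2))   # expected: [0 0 0 1 1]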

8.3.2 Dendrogram

        [ALL POINTS]
         /      \
    Cluster A   Cluster B
     /    \       /    \
    …      …     …      …
  • Cutting the tree at different levels yields different numbers of clusters.
  • Linkage methods determine how distance between clusters is computed:
    • Single linkage: Minimum pairwise distance
    • Complete linkage: Maximum pairwise distance
    • Average linkage: Average pairwise distance
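In code, this cutting is usually done with scipy's fcluster on the linkage matrix: you can request a fixed number of clusters or cut at a fixed height. A small sketch on the iris data (used again in the practical below); the height 2.0 is an arbitrary illustrative choice.

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data
Z = linkage(X, method='average')                  # build the hierarchy once

# Cut the tree to obtain exactly 3 clusters ...
labels_3 = fcluster(Z, t=3, criterion='maxclust')
# ... or cut at a fixed height (distance threshold) instead
labels_h = fcluster(Z, t=2.0, criterion='distance')

print(labels_3[:10])                              # cluster ids are 1-based
print(len(set(labels_h)), 'clusters when cutting at height 2.0')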

8.3.3 Important Concepts

Tip
  • Metric
    The metric (or distance function or dissimilarity function) defines how you measure the distance between individual data points. Common choices include Euclidean, Manhattan (cityblock), or cosine distance. This metric determines the raw pairwise distances.

Manhattan distance. Image created using DALL-E.

How would a crow navigate in Manhattan? (I have never been to Manhattan, but the internet says there are crows in Manhattan, so it must be true.) A crow flies in a straight line, which is the Euclidean distance; walking the street grid gives the Manhattan distance.

A crow in Manhattan. Image created using DreamUp.
  • Linkage
    The linkage method defines how to compute the distance between two clusters based on the pairwise distances of their members. Examples:
    • Single: the distance between the closest pair of points (one from each cluster).
    • Complete: the distance between the farthest pair of points.
    • Average: the average of all pairwise distances.
    • Ward: the merge that minimizes the increase in total within‑cluster variance.

| Linkage Method | How It Works | Intuition |
|---|---|---|
| Single | Distance = minimum pairwise distance between points in the two clusters | “Friends‑of‑friends” – clusters join if any two points are close, yielding chain‑like clusters |
| Complete | Distance = maximum pairwise distance between points in the two clusters | “Everyone must be close” – only merge when all points are relatively near, producing compact clusters |
| Average (UPGMA) | Distance = average of all pairwise distances between points in the two clusters | Balances single and complete by averaging close and far pairs |
| Weighted (WPGMA) | Distance = average of the previous cluster’s distance to the new cluster (equal weight per cluster) | Prevents large clusters from dominating, giving equal say to each cluster |
| Centroid | Distance = distance between the centroids (mean vectors) of the two clusters | Merges based on “centers of mass,” but centroids can shift non‑monotonically |
| Median (WPGMC) | Distance = distance between the medians of the two clusters | More robust to outliers than centroid linkage, but can also invert dendrogram order |
| Ward’s | Merge that minimizes the increase in total within‑cluster sum of squares (variance) | Keeps clusters as tight and homogeneous as possible, often resulting in evenly sized groups |
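In scipy these two choices are made separately, which makes the distinction easy to see: the metric is given when computing the pairwise distances, the linkage method when building the tree. A minimal sketch on the iris data (used again in the practical below):

from sklearn.datasets import load_iris
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = load_iris().data

# The metric controls the raw point-to-point distances
for metric in ['euclidean', 'cityblock', 'cosine']:
    d = pdist(X, metric=metric)                   # condensed distance matrix
    print(metric, 'mean pairwise distance:', round(float(d.mean()), 3))

# The linkage method is a separate choice layered on top of those distances
Z = linkage(pdist(X, metric='cityblock'), method='average')
print(Z.shape)                                    # (n_samples - 1, 4): one row per merge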

8.3.4 Single Linkage

  • How it works: Measures the distance between two clusters as the smallest distance between any single point in one cluster and any single point in the other.
  • Intuition: “Friends‑of‑friends” clustering—if any two points (one from each cluster) are close, the clusters join. Can produce long, straggly chains of points.

8.3.5 Complete Linkage

  • How it works: Measures the distance between two clusters as the largest distance between any point in one cluster and any point in the other.
  • Intuition: “Everyone must be close”—clusters merge only when all their points are relatively near each other, leading to tight, compact groups.

8.3.6 Average Linkage (UPGMA)

  • How it works: Takes the average of all pairwise distances between points in the two clusters.
  • Intuition: A middle‑ground between single and complete linkage—balances the effect of very close and very far pairs by averaging them.

8.3.7 Weighted Linkage (WPGMA)

  • How it works: Similar to average linkage, but treats each cluster as a single entity by averaging the distance from each original cluster to the target cluster, regardless of cluster size.
  • Intuition: Prevents larger clusters from dominating the average—gives each cluster equal say in how far apart they are.

8.3.8 Centroid Linkage

  • How it works: Computes the distance between the centroids (mean vectors) of the two clusters.
  • Intuition: Clusters merge based on whether their “centers of mass” are close. Can sometimes lead to non‑monotonic merges if centroids shift oddly.

8.3.9 Median Linkage (WPGMC)

  • How it works: Uses the median point of each cluster instead of the mean when computing distance between clusters.
  • Intuition: Like centroid linkage but more robust to outliers, since the median isn’t pulled by extreme values—though can also cause inversion issues.

8.3.10 Ward’s Method

  • How it works: At each step, merges the two clusters whose union leads to the smallest possible increase in total within‑cluster variance (sum of squared deviations).
  • Intuition: Always chooses the merge that keeps clusters as tight and homogeneous as possible, often yielding groups of similar size and shape.
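The behavioural differences described in these subsections can be seen by building a tree with each linkage method and cutting it into the same number of clusters: single linkage tends to give one large, chained cluster plus a few stragglers, while complete and Ward's give more balanced groups. A small sketch on the iris data:

import numpy as np
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion='maxclust')        # cut each tree into 3 clusters
    print(method, 'cluster sizes:', np.bincount(labels)[1:])  # fcluster labels are 1-based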
Concept about distances

There is no single “best” distance metric for clustering—what works well for one dataset or problem may not work for another. The choice of distance metric (such as Euclidean or Manhattan) depends on the nature of your data and what you want to capture about similarity.

For example, Euclidean distance works well when the scale of features is meaningful and differences are linear, while cosine distance is better for text data or situations where the direction of the data matters more than its magnitude.

It is important to experiment with different distance metrics and see which one produces clusters that make sense for your specific problem. Always check the results and, if possible, use domain knowledge to guide your choice.
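One way to follow this advice is to loop over a few metrics and inspect how the resulting clusters change, for example via their sizes (and, once you have seen Section 8.8, via evaluation scores). A minimal sketch on the iris data, assuming a scikit-learn version whose AgglomerativeClustering accepts the metric parameter (as used later in this lesson):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

X = load_iris().data

# Same data, same linkage: only the metric changes
for metric in ['euclidean', 'manhattan', 'cosine']:
    agg = AgglomerativeClustering(n_clusters=3, linkage='average', metric=metric)
    labels = agg.fit_predict(X)
    print(metric, 'cluster sizes:', np.bincount(labels))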

8.4 Practical

  • n_clusters in AgglomerativeClustering specifies the number of clusters you want the algorithm to find. After building the hierarchical tree, the algorithm will cut the tree so that exactly n_clusters groups are formed. For example, n_clusters=3 will result in 3 clusters in your data.

  • The default value for n_clusters in AgglomerativeClustering is 2. The default value for linkage is ward. So if you do not specify these parameters, the algorithm will produce 2 clusters using Ward linkage.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X = iris.data

# Fit Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

print(labels)  # Cluster assignments for each sample

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Compute linkage matrix
Z = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram (Iris Data)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 0 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]
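Because the iris species are actually known, one quick sanity check (anticipating the external evaluation in Section 8.8) is to cross-tabulate the cluster assignments against the species. This is purely illustrative; in a genuinely unsupervised setting no such labels exist.

# Cross-tabulate the cluster assignments against the known iris species
# (pandas was already imported above as pd; labels comes from fit_predict above)
print(pd.crosstab(labels, iris.target_names[iris.target],
                  rownames=['cluster'], colnames=['species']))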

8.4.1 Scale the data

# Load data
iris = load_iris()
X = iris.data

# scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

8.4.2 Perform hierarchical clustering

# Fit Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3)  # linkage='ward' is the default
labels = agg.fit_predict(X_scaled)

print(labels)  # Cluster assignments for each sample

# Compute linkage matrix
Z = linkage(X_scaled, method='ward')  # use 'ward' to match the clustering above (scipy's default is 'single')

# Plot dendrogram
plt.figure()
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram (Iris Data)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 2 1 1 1 1 1 1 1 1 0 0 0 2 0 2 0 2 0 2 2 0 2 0 2 0 2 2 2 2 0 0 0 0
 0 0 0 0 0 2 2 2 2 0 2 0 0 2 2 2 2 0 2 2 2 2 2 0 2 2 0 0 0 0 0 0 2 0 0 0 0
 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]

8.5 Exercise (changing the linkage function)

  • Let us try another linkage function

  • Change the linkage function in Z = linkage(X_scaled, method='ward')

  • How does the clustering change as you change this to another function?

  • How does this change if you do not scale the data?

# Fit Agglomerative Clustering on unscaled data with 'average' linkage
agg = AgglomerativeClustering(linkage='average')
labels = agg.fit_predict(X)

# Compute linkage matrix on unscaled data with 'average' linkage
Z = linkage(X, method='average')

# Plot dendrogram
plt.figure()
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram (Iris Data, Unscaled, Average Linkage)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

8.6 Exercise (trying a different dissimilarity metric)

# Fit Agglomerative Clustering on unscaled data with 'average' linkage and 'manhattan' distance
agg = AgglomerativeClustering(
    linkage='average',
    metric='manhattan'      # use metric instead of the deprecated affinity parameter
)
labels = agg.fit_predict(X)

# Compute linkage matrix on unscaled data with 'average' linkage and 'cityblock' (manhattan) distance
Z = linkage(X, method='average', metric='cityblock')

# Plot dendrogram
plt.figure()
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram (Iris Data, Unscaled, Average Linkage, Manhattan Distance)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

8.7 Practical on cancer data

  • You have been given some data on cancer cell lines

  • Team up with someone and perform hierarchical clustering on this data

  • You have been given some starter code to help you load the data

  • The data will be downloaded and processed for you when you run the code below.

  • The data is in the variable named X

import numpy as np
import requests
import io
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# URLs for the raw data and labels from GitHub repository
data_url = 'https://raw.githubusercontent.com/neelsoumya/python_machine_learning/main/data/NCI60data.npy'
labs_url = 'https://raw.githubusercontent.com/neelsoumya/python_machine_learning/main/data/NCI60labs.csv'

# Load Data from GitHub
response = requests.get(data_url)
response.raise_for_status()
X = np.load(io.BytesIO(response.content))
print("NCI60 data loaded successfully from GitHub.")

# Load Labels from GitHub
print("Fetching labels from GitHub...")
response = requests.get(labs_url)
response.raise_for_status()
# Read the raw text and split into lines.
all_lines = response.text.strip().splitlines()

# Skip the first line (the header) to match the data dimensions.
labs = all_lines[1:]

# The labels in the file are quoted (e.g., "CNS"), so we remove the quotes.
labs = [label.strip('"') for label in labs]

# The data is in the variable named X
print(X[:5])
print(X.shape)


# Your code below ......
NCI60 data loaded successfully from GitHub.
Fetching labels from GitHub...
[[ 0.3       1.18      0.55     ...  0.28     -0.34     -1.93    ]
 [ 0.679961  1.289961  0.169961 ... -0.770039 -0.390039 -2.000039]
 [ 0.94     -0.04     -0.17     ... -0.12     -0.41      0.      ]
 [ 0.28     -0.31      0.68     ...  1.09     -0.26     -1.1     ]
 [ 0.485    -0.465     0.395    ...  0.745     0.425     0.145   ]]
(64, 6830)
  • Write your code and work in pairs
# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# Compute linkage matrix for the dendrogram
Z = linkage(X, method='average', metric='cityblock')

# Plot Dendrogram
plt.figure()
dendrogram(Z, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60, Average Linkage, Manhattan Distance)')
plt.xlabel('Cell Line')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

8.7.1 Exercise: try another linkage method and another distance metric

Important Concept (recall)
  • There is no “correct” answer in unsupervised machine learning!

  • So how do you know when you are done?

8.8 Evaluating the Quality of Clusters

Evaluating the quality of clusters is a crucial step in any unsupervised learning task. Since we do not have a single correct answer, we use several methods that fall into three main categories:

8.8.1 Internal Evaluation

Measures how good the clustering is based only on the data itself (e.g., how dense and well-separated the clusters are).

8.8.2 External Evaluation

Measures how well the clustering results align with known, ground-truth labels. This is possible here because the NCI60 dataset has known cancer cell line types, which we loaded as labs.

8.8.3 Visual Evaluation

Inspecting plots (like the dendrogram or PCA) to see if the groupings seem logical.
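As a sketch of such a visual check, project the data to two dimensions with PCA and colour the points by cluster assignment (this assumes X and cluster_labels from the NCI60 practical above):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the high-dimensional expression data onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Colour points by the hierarchical cluster assignment computed above
plt.figure()
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels)
plt.title('NCI60 cell lines: PCA projection coloured by cluster')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()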


Let us add the two most common metrics: one internal and one external.


8.8.4 Internal Evaluation: Silhouette Score

The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

  • Score Range: -1 to +1
  • Interpretation:
    • +1: The sample is far away from the neighboring clusters (very good).
    • 0: The sample is on or very close to the decision boundary between two neighboring clusters.
    • -1: The sample is assigned to the wrong cluster.

8.8.5 External Evaluation: Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the similarity between the true labels (labs) and the labels assigned by our clustering algorithm (cluster_labels). It accounts for chance groupings.

  • Score Range: -1 to +1
  • Interpretation:
    • +1: Perfect agreement between true and predicted labels.
    • 0: Random labeling (no correlation).
    • < 0: Worse than random labeling.
  • Here is how you would implement this
from sklearn.metrics import silhouette_score, adjusted_rand_score # Import evaluation metrics

# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# 1. Internal Evaluation: Silhouette Score
# Measures how well-separated clusters are based on the data itself.
silhouette = silhouette_score(X, cluster_labels, metric='manhattan')
print("Silhouette score")
print(silhouette)
print("Score is from -1 to 1. Higher is better")

# 2. External Evaluation: Adjusted Rand Index
# Compares our cluster labels to the true cancer type labels.
ari = adjusted_rand_score(labs, cluster_labels)
print("Adjusted Rand Index")
print(ari)
print("Compares to true labels. Score is from -1 to 1. Higher is better")
Silhouette score
0.1436950300449066
Score is from -1 to 1. Higher is better
Adjusted Rand Index
0.0554671516253694
Compares to true labels. Score is from -1 to 1. Higher is better
  • Plain old visual evaluation

  • Compare the clusters to the labels of what these cell lines are

  • Each label is the cancer type (tissue of origin) of the corresponding NCI60 cell line, e.g. CNS, RENAL, BREAST, MELANOMA; a few entries (the ‑repro lines and UNKNOWN) are not standard tissue types

print(labs)

# Plot the dendrogram with the true cell-line labels on the leaves
plt.figure()
dendrogram(Z, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60)')
plt.xlabel('Cell Line (True Labels)')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
['CNS', 'CNS', 'CNS', 'RENAL', 'BREAST', 'CNS', 'CNS', 'BREAST', 'NSCLC', 'NSCLC', 'RENAL', 'RENAL', 'RENAL', 'RENAL', 'RENAL', 'RENAL', 'RENAL', 'BREAST', 'NSCLC', 'RENAL', 'UNKNOWN', 'OVARIAN', 'MELANOMA', 'PROSTATE', 'OVARIAN', 'OVARIAN', 'OVARIAN', 'OVARIAN', 'OVARIAN', 'PROSTATE', 'NSCLC', 'NSCLC', 'NSCLC', 'LEUKEMIA', 'K562B-repro', 'K562A-repro', 'LEUKEMIA', 'LEUKEMIA', 'LEUKEMIA', 'LEUKEMIA', 'LEUKEMIA', 'COLON', 'COLON', 'COLON', 'COLON', 'COLON', 'COLON', 'COLON', 'MCF7A-repro', 'BREAST', 'MCF7D-repro', 'BREAST', 'NSCLC', 'NSCLC', 'NSCLC', 'MELANOMA', 'BREAST', 'BREAST', 'MELANOMA', 'MELANOMA', 'MELANOMA', 'MELANOMA', 'MELANOMA', 'MELANOMA']

  • Also compare to clusterings of other cancer cell lines

  • Does the cell line also show up in other datasets? (external validation)

8.9 K‑Means Clustering

K‑Means is a partitional clustering algorithm that aims to partition the data into K disjoint clusters.

8.9.1 Algorithm Steps

  1. Choose K, the number of clusters.

  2. Initialization: Randomly select K initial centroids (or use k‑means++ for better seeding).

  3. Assignment Step:

    for each data point x_i:
        assign x_i to cluster j whose centroid μ_j is nearest (minimize ||x_i - μ_j||²)
  4. Update Step:

    for each cluster j:
        μ_j = (1 / |C_j|) * sum_{x_i in C_j} x_i
  5. Convergence Check:

    • Stop when assignments no longer change, OR
    • The change in centroids is below a threshold, OR
    • A maximum number of iterations is reached.
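These steps translate almost line for line into NumPy. The sketch below implements them with random initialization (the function name, tolerance, and toy data are made up for this example); in practice you would use sklearn.cluster.KMeans, as in the exercises below.

import numpy as np

def simple_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 2. Initialization: pick K random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # 3. Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # 4. Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # 5. Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return labels, centroids

# Tiny illustration with two well-separated groups
X_demo = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels, centroids = simple_kmeans(X_demo, k=2)
print(labels)      # two groups of five
print(centroids)   # approximately [0, 0] and [10, 10]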

8.9.2 Animation

Animation of k-means

8.9.3 Within-Cluster Variation

In \(K\)‑means clustering, we partition our \(n\) observations into \(K\) disjoint clusters \(\{C_1, C_2, \dots, C_K\}\). A “good” clustering is one for which the within‑cluster variation is minimized.
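Concretely, one standard way to write this objective (matching the WCSS definition in the next subsection) is:

\[
W(C_1, \dots, C_K) \;=\; \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
\]

where \(\mu_i\) is the centroid (mean vector) of cluster \(C_i\). \(K\)‑means seeks the partition that makes this quantity as small as possible, although the algorithm only guarantees a local, not a global, minimum.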

8.9.4 Elbow point

When using k‑means clustering, one of the key questions is: how many clusters (k) should I choose? The elbow method is a simple, visual way to pick a reasonable k by looking at how the “within‑cluster” variation decreases as k increases.


1. The Within‑Cluster Sum of Squares (WCSS)

For each choice of k, you run k‑means and compute the within‑cluster sum of squares (WCSS), also called inertia or distortion. This is the sum of squared Euclidean distances between each point and the centroid of its cluster:

\[
\mathrm{WCSS}(k) \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
\]

  • \(C_{i}\) is cluster \(i\)
  • \(\mu_{i}\) is the centroid of cluster \(i\)

As k increases, WCSS will always decrease (or stay the same), because more centroids can only reduce distances.

2. Plotting WCSS versus k

  1. Choose a range for k (e.g. 1 to 10).
  2. For each k, fit k‑means and record WCSS(k).
  3. Plot WCSS(k) on the y-axis against k on the x-axis.

You will get a curve that starts high at k = 1 and steadily goes down as k increases.


3. Identifying the “Elbow”

  • At first, adding clusters dramatically reduces WCSS, because you are splitting large, heterogeneous clusters into more homogeneous groups.
  • After some point, adding more clusters yields diminishing returns—each new cluster only slightly reduces WCSS.

The elbow point is the value of k at which the decrease in WCSS “bends” most sharply—like an elbow in your arm. It balances model complexity (more clusters) against improved fit (lower WCSS).

An elbow point
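If you prefer a programmatic hint over eyeballing the curve, one simple, admittedly rough heuristic is to pick the k where the curve bends most sharply, i.e. where the second difference of WCSS is largest. The helper below is a sketch under that assumption; it expects a wcss list for k = 1, 2, 3, … like the one built in the exercise that follows.

import numpy as np

def elbow_by_second_difference(wcss):
    """Rough elbow heuristic: the k with the largest second difference of WCSS."""
    wcss = np.asarray(wcss, dtype=float)
    # second difference wcss[k-1] - 2*wcss[k] + wcss[k+1]; large where the curve bends
    second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
    # index 0 of second_diff corresponds to k = 2 (wcss[0] is k = 1)
    return int(np.argmax(second_diff)) + 2

# Made-up WCSS values for k = 1..6 with a clear bend at k = 3
print(elbow_by_second_difference([1000, 700, 300, 250, 230, 220]))  # -> 3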

8.9.5 Exercise (k-means)

  • NOTE: The c parameter in plt.scatter() is used to specify the color of the scatter plot points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load data
iris = load_iris()
X = iris.data

# Perform k means with k = 2
kmeans = KMeans(n_clusters=2, random_state=2, n_init=20)
kmeans.fit(X)

# The cluster assignments of the observations are contained in kmeans.labels_
kmeans.labels_

# Plot the data, with each observation colored according to its cluster assignment.
plt.figure()
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_)
plt.title("K-means on Iris data")
plt.show()

8.9.6 Exercise (evaluate k-means clusters using the within-cluster variation)

  • Find the optimal value of the number of clusters \(K\)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load data
iris = load_iris()
X = iris.data

# Plot WCSS (inertia) as a function of the number of clusters
wcss = []

# try a few different values of k
k_range = range(1,11)

for k_var in k_range:

   # fit kmeans
   kmeans = KMeans(n_clusters=k_var, random_state=2)
   kmeans.fit(X)

   # append WCSS to a list
   wcss.append(kmeans.inertia_)


# plot
plt.figure()
plt.plot( k_range , wcss , marker='o' )
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Optimal K')
plt.show()

8.10 Choosing Between Methods

8.10.1 Hierarchical Clustering

  • No need to pre-specify the number of clusters (can decide by cutting the dendrogram).
  • Produces a full hierarchy of clusters.
  • Computationally expensive for large datasets (pairwise distances grow quadratically with the number of samples).

8.10.2 K‑Means

  • Requires pre-specifying \(K\).
  • Scales well to large datasets.
  • Results depend on centroid initialization, so runs are typically repeated (e.g., the n_init parameter in scikit-learn).

8.11 Summary

Key Points
  • Hierarchical clustering builds a tree-like structure (dendrogram) to group similar data points
  • Distance Metrics: how we measure similarity between points
    • Euclidean distance (straight-line distance)
    • Manhattan distance (city-block distance)
    • Cosine distance (for directional data)
  • Linkage Methods: how we measure distance between clusters
    • Single: Uses closest pair of points (can create chains)
    • Complete: Uses farthest pair of points (creates compact clusters)
    • Average: Uses average of all pairwise distances (balanced approach)
    • Ward’s: Minimizes increase in variance (creates similar-sized clusters)
  • Dendrogram: visual representation showing how clusters merge
    • Height shows distance when clusters merged
    • Cutting at different heights gives different numbers of clusters
  • Key Code Patterns:
# Basic hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

# Create dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, method='ward')
dendrogram(Z)
  • Important Takeaways

    • No “correct” answer: unsupervised learning requires interpretation and domain knowledge
    • Multiple methods: try different linkage methods and distance metrics
    • Evaluation is key: use both internal (silhouette) and external (ARI) metrics

8.12 Further Reading