12  Evaluation for unsupervised machine learning

Tip: Learning Objectives
  • Understand how to evaluate unsupervised machine learning techniques.
  • Know the difference between internal, external, and biological evaluation.
  • Be able to compute and interpret common metrics (silhouette).
  • Be able to choose metrics depending on whether you have ground truth, partial labels, or purely exploratory goals.
  • Communicate results in biologically meaningful ways (marker genes).

12.1 Conceptual framing

  • Internal metrics: use only the data + clustering labels. Measure compactness vs separation (e.g., silhouette).
  • External metrics: require ground truth/labels (experimental groups, annotated cell types).
  • Biological validation: compare clusters to known marker genes, pathways, experimental metadata (batch, donor), or enrichment tests. Often the most important for biologists.

12.2 Quick metrics cheat-sheet (what they tell you)

  • Explained variance (PCA) — fraction of variance captured by components (useful for dimensionality reduction decisions).
  • Silhouette score (−1..1) — how well each sample fits its cluster vs nearest other cluster; good general-purpose internal metric.
  • Stability / reproducibility — how consistent cluster assignments are under parameter changes.
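For example, here is a minimal sketch of the first and last of these. The toy data X_demo and the variable names are illustrative only, not part of the exercise below; the Adjusted Rand Index (introduced later in this chapter) doubles here as an agreement measure between two clusterings.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))  # toy data: 100 samples, 20 features

# Explained variance: fraction of total variance captured by each PCA component
pca = PCA(n_components=10).fit(X_demo)
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative, for choosing a cut-off

# Crude stability check: how much do assignments change under a parameter change?
labels_a = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X_demo)
labels_b = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X_demo)
print(adjusted_rand_score(labels_a, labels_b))  # close to 1 => stable assignments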

12.3 (Optional) Exercise on cancer data

  • You have been given some data on cancer cell lines

  • Team up with someone and perform hierarchical clustering on this data

  • You have been given some starter code to help you load the data

  • The data will be downloaded and processed for you when you run the code below.

  • The data is in the variable named X

import numpy as np
import pandas as pd
import os
import requests
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Load data
X = pd.read_csv("https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/cancer_data_saved_NC160.csv", index_col=0)


print("Fetching labels from GitHub...")
labs_url = 'https://raw.githubusercontent.com/neelsoumya/python_machine_learning/main/data/NCI60labs.csv'
response = requests.get(labs_url)
response.raise_for_status()
# Read the raw text and split into lines.
all_lines = response.text.strip().splitlines()

# Skip the first line (the header) to match the data dimensions.
labs = all_lines[1:]

# The labels in the file are quoted (e.g., "CNS"), so we remove the quotes.
labs = [label.strip('"') for label in labs]

# Your code below ......
Fetching labels from GitHub...
  • Write your code while working in pairs or a group
# Hierarchical clustering (AgglomerativeClustering defaults to n_clusters=2)
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# Compute the linkage matrix for the dendrogram
# ('cityblock' is SciPy's name for the Manhattan distance)
Z = linkage(X, method='average', metric='cityblock')

# Plot Dendrogram
plt.figure()
dendrogram(Z, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60, Average Linkage, Manhattan Distance)')
plt.xlabel('Cell Line')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

12.3.1 Exercise: try another linkage method and another distance metric
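One possible variation to get you started (a sketch, assuming X and labs are still in memory from the starter code above; any other linkage/metric combination supported by SciPy works the same way):

# A variation on the clustering above: complete linkage with Euclidean distance
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

Z_alt = linkage(X, method='complete', metric='euclidean')

plt.figure()
dendrogram(Z_alt, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60, Complete Linkage, Euclidean Distance)')
plt.xlabel('Cell Line')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()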

Tip: Important Concept (recall)
  • There is no “correct” answer in unsupervised machine learning!

  • So how do you know when you are done?

12.4 Evaluating the Quality of Clusters

Evaluating the quality of clusters is a crucial step in any unsupervised learning task. Since we do not have a single correct answer, we use several methods that fall into three main categories:

12.4.1 1. Internal Evaluation

Measures how good the clustering is based only on the data itself (e.g., how dense and well-separated the clusters are).

12.4.2 2. External Evaluation

Measures how well the clustering results align with known, ground-truth labels. This is possible here because the NCI60 dataset has known cancer cell line types, which we loaded as labs.
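A quick way to eyeball this agreement (a sketch, assuming cluster_labels from the clustering code above) is a contingency table of cluster assignments against the known cell line types:

import pandas as pd

# Rows: known cancer types; columns: cluster IDs from AgglomerativeClustering
print(pd.crosstab(pd.Series(labs, name='true type'),
                  pd.Series(cluster_labels, name='cluster')))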

12.4.3 3. Visual Evaluation

Inspecting plots (like the dendrogram or PCA) to see if the groupings seem logical.
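For instance (a sketch, assuming X and cluster_labels from above), you can project the data onto its first two principal components and colour the points by cluster:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

coords = PCA(n_components=2).fit_transform(X)  # project onto the first two PCs

plt.figure()
plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, cmap='tab10')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Hierarchical clusters in PCA space')
plt.show()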


Let us add the two most common metrics: one internal and one external.


12.4.4 Internal Evaluation: Silhouette Score

The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

  • Score Range: -1 to +1
  • Interpretation:
    • +1: The sample is far away from the neighboring clusters (very good).
    • 0: The sample is on or very close to the decision boundary between two neighboring clusters.
    • -1: The sample is assigned to the wrong cluster.

12.4.5 External Evaluation: Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the similarity between the true labels (labs) and the labels assigned by our clustering algorithm (cluster_labels). It accounts for chance groupings.

  • Score Range: -1 to +1
  • Interpretation:
    • +1: Perfect agreement between true and predicted labels.
    • 0: Random labeling (no correlation).
    • < 0: Worse than random labeling.
  • Here is how you would implement this:
from sklearn.metrics import silhouette_samples, silhouette_score, adjusted_rand_score # Import evaluation metrics
import numpy as np
import matplotlib.pyplot as plt

# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# 1. Internal Evaluation: Silhouette Score
# Measures how well-separated clusters are based on the data itself.
silhouette = silhouette_score(X, cluster_labels, metric='manhattan')
print("Silhouette score")
print(silhouette)
print("Score is from -1 to 1. Higher is better")

# 2. External Evaluation: Adjusted Rand Index
# Compares our cluster labels to the true cancer type labels.
ari = adjusted_rand_score(labs, cluster_labels)
print("Adjusted Rand Index")
print(ari)
print("Compares to true labels. Score is from -1 to 1. Higher is better")
Silhouette score
0.1436950300449066
Score is from -1 to 1. Higher is better
Adjusted Rand Index
0.0554671516253694
Compares to true labels. Score is from -1 to 1. Higher is better

12.4.6 Intuition of silhouette score

Think of each data point as asking two questions:

  • How close am I to points in my own cluster? (call this a)
  • How close would I be, on average, to the nearest other cluster? (call this b)

The silhouette value for the point compares those two answers:

  • If a is much smaller than b (you are much closer to your own cluster than to any other), the silhouette is close to +1: a great fit.
  • If a ≈ b, the silhouette is near 0: the point sits on the boundary between two clusters.
  • If a > b, the silhouette is negative: the point is probably misassigned (it is closer to another cluster than to its own).

Numerically:

\[ s = \frac{b - a}{\max(a, b)} \]

so that

\[ -1 \le s \le 1 \]
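As a quick worked check: if a point's average distance to its own cluster is a = 2 and its average distance to the nearest other cluster is b = 5, then s = (5 − 2) / max(2, 5) = 0.6, a reasonably good fit.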

Beyond numeric metrics, you can also:

  • Compare to the literature and what others have done.

  • Use plain old visual evaluation.

  • Compare to labels of what these cell lines are (assuming this information is available).

How to interpret (practical heuristics)

  • Mean silhouette ≳ 0.5 → strong structure (good clustering).

  • Mean silhouette ≈ 0.25–0.5 → weak to moderate structure; inspect clusters individually (see the sketch after this list).

  • Mean silhouette ≲ 0.25 → little structure; clustering may be unreliable. (These are rules of thumb — context and domain knowledge matter.)

  • Also compare to clusterings of other cancer cell line datasets.

  • Do the same cell lines group together in other datasets? (external validation)
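To inspect clusters individually, as the heuristics above suggest, you can look at per-sample silhouette values; the same machinery also gives a simple answer to "when are you done" choosing the number of clusters. A sketch, assuming X and cluster_labels from the evaluation code above (silhouette_samples was already imported there):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

# Per-sample silhouette values, using the same Manhattan distance as the clustering
sample_sil = silhouette_samples(X, cluster_labels, metric='manhattan')

# Mean silhouette per cluster: a low mean flags a poorly separated cluster
for k in np.unique(cluster_labels):
    print(f"Cluster {k}: mean silhouette = {sample_sil[cluster_labels == k].mean():.3f}")

# Scan over the number of clusters and keep the value with the highest mean silhouette
for n in range(2, 8):
    labels_n = AgglomerativeClustering(n_clusters=n, linkage='average',
                                       metric='manhattan').fit_predict(X)
    score = silhouette_score(X, labels_n, metric='manhattan')
    print(f"n_clusters={n}: mean silhouette = {score:.3f}")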

12.5 Summary

Tip: Key Points
  • We learnt that evaluation is difficult in unsupervised machine learning!