How to evaluate unsupervised machine learning techniques
Know the difference between internal, external, and biological evaluation.
Be able to compute and interpret common metrics (silhouette, ARI).
Learn practical diagnostics: the silhouette plot.
Be able to choose metrics depending on whether you have ground truth, partial labels, or purely exploratory goals.
Communicate results in biologically meaningful ways (marker genes).
9.1 Conceptual framing
Internal metrics: use only the data + clustering labels. Measure compactness vs separation (e.g., silhouette).
External metrics: require ground truth/labels (experimental groups, annotated cell types). Use ARI, NMI, precision/recall on pairwise same/different labels.
Biological validation: compare clusters to known marker genes, pathways, experimental metadata (batch, donor), or enrichment tests. Often the most important for biologists.
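The "precision/recall on pairwise same/different labels" idea can be sketched as follows. This is a minimal illustration, not a library routine: the helper name pairwise_precision_recall and the toy labels are made up for this example. Every pair of samples is treated as a yes/no decision ("same group" or "different group"), and the clustering's decisions are scored against the ground truth.

```python
from itertools import combinations


def pairwise_precision_recall(true_labels, cluster_labels):
    """Precision and recall over all sample pairs (same vs. different group)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = cluster_labels[i] == cluster_labels[j]
        if same_pred and same_true:
            tp += 1  # pair correctly placed together
        elif same_pred and not same_true:
            fp += 1  # pair wrongly placed together
        elif same_true and not same_pred:
            fn += 1  # pair wrongly split apart
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for illustration only
true_labels = ["CNS", "CNS", "CNS", "RENAL", "RENAL", "RENAL"]
cluster_labels = [0, 0, 1, 1, 1, 1]

precision, recall = pairwise_precision_recall(true_labels, cluster_labels)
print(f"pairwise precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case one CNS sample lands in the RENAL cluster, so both precision and recall fall below 1.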
9.2 Quick metrics cheat-sheet (what they tell you)
Explained variance (PCA) — fraction of variance captured by components (useful for dimensionality reduction decisions).
Silhouette score (−1..1) — how well each sample fits its cluster vs nearest other cluster; good general-purpose internal metric.
Calinski–Harabasz — ratio of between/within dispersion (higher = better).
Davies–Bouldin (lower = better) — average similarity between each cluster and its most similar one.
Adjusted Rand Index (ARI) — similarity between two labelings corrected for chance (usually between 0 and 1; negative when agreement is worse than chance).
Normalized Mutual Information (NMI) — information overlap between labelings (0..1).
Trustworthiness (for embeddings like UMAP/t-SNE) — how well local neighborhoods are preserved.
Stability / reproducibility — how consistent cluster assignments are under parameter changes.
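As a rough sketch, most of the metrics in this cheat-sheet can be computed with scikit-learn in a few lines. The data here are synthetic blobs generated purely for illustration (they are not the course dataset), and the names X_demo, y_demo and demo_labels are placeholders chosen for this example.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Synthetic stand-in data: 150 samples, 10 features, 3 known groups
X_demo, y_demo = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# Explained variance of a 2-component PCA
pca = PCA(n_components=2).fit(X_demo)
X_2d = pca.transform(X_demo)
explained = pca.explained_variance_ratio_.sum()

# Cluster, then score the result
demo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

metrics = {
    # internal: need only the data and the cluster labels
    "silhouette": silhouette_score(X_demo, demo_labels),
    "calinski_harabasz": calinski_harabasz_score(X_demo, demo_labels),
    "davies_bouldin": davies_bouldin_score(X_demo, demo_labels),
    # external: need ground-truth labels
    "ari": adjusted_rand_score(y_demo, demo_labels),
    "nmi": normalized_mutual_info_score(y_demo, demo_labels),
    # embedding quality: how well the 2-D PCA preserves local neighborhoods
    "trustworthiness": trustworthiness(X_demo, X_2d, n_neighbors=5),
}

print(f"explained variance (2 PCs): {explained:.3f}")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

No single number settles the question; the metrics are most useful when several of them point in the same direction.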
9.3 (Optional) Exercise on cancer data
You have been given some data on cancer cell lines
Team up with someone and perform hierarchical clustering on this data
You have been given some starter code to help you load the data
The data has been downloaded and processed for you (after you run the code below).
The data is in the variable named X
import numpy as np
import pandas as pd
import os
import requests
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Load data
X = pd.read_csv("https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/cancer_data_saved_NC160.csv", index_col=0)

print("Fetching labels from GitHub...")
labs_url = 'https://raw.githubusercontent.com/neelsoumya/python_machine_learning/main/data/NCI60labs.csv'
response = requests.get(labs_url)
response.raise_for_status()

# Read the raw text and split into lines.
all_lines = response.text.strip().splitlines()

# Skip the first line (the header) to match the data dimensions.
labs = all_lines[1:]

# The labels in the file are quoted (e.g., "CNS"), so we remove the quotes.
labs = [label.strip('"') for label in labs]

# Your code below ......
Fetching labels from GitHub...
Write your code while working in pairs or a group
Note
# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# Compute linkage matrix for the dendrogram
Z = linkage(X, method='average', metric='cityblock')

# Plot Dendrogram
plt.figure()
dendrogram(Z, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60, Average Linkage, Manhattan Distance)')
plt.xlabel('Cell Line')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
9.3.1 Exercise: try another linkage method and another distance metric
Tip: Important Concept (recall)
There is no “correct” answer in unsupervised machine learning!
So how do you know when you are done?
9.4 Evaluating the Quality of Clusters
Evaluating the quality of clusters is a crucial step in any unsupervised learning task. Since we do not have a single correct answer, we use several methods that fall into three main categories:
9.4.1 1. Internal Evaluation
Measures how good the clustering is based only on the data itself (e.g., how dense and well-separated the clusters are).
9.4.2 2. External Evaluation
Measures how well the clustering results align with known, ground-truth labels. This is possible here because the NCI60 dataset has known cancer cell line types, which we loaded as labs.
9.4.3 3. Visual Evaluation
Inspecting plots (like the dendrogram or PCA) to see if the groupings seem logical.
Let us add the two most common metrics: one internal and one external.
9.4.4 Internal Evaluation: Silhouette Score
The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
Score Range: -1 to +1
Interpretation:
+1: The sample is far away from the neighboring clusters (very good).
0: The sample is on or very close to the decision boundary between two neighboring clusters.
-1: The sample is probably assigned to the wrong cluster (on average it is closer to a neighboring cluster than to its own).
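To make the score concrete: for a sample i, let a be its mean distance to the other members of its own cluster and b its mean distance to the members of the nearest other cluster. The silhouette value is then s(i) = (b - a) / max(a, b). The sketch below computes this by hand on a made-up one-dimensional toy dataset and checks it against scikit-learn. The helper silhouette_by_hand is illustrative only and assumes every cluster has at least two samples.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Toy data, invented for illustration: two tight groups on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

# All pairwise Manhattan distances
D = pairwise_distances(X_toy, metric="manhattan")


def silhouette_by_hand(i, D, labels):
    """s(i) = (b - a) / max(a, b); assumes clusters have >= 2 samples."""
    own = labels == labels[i]
    own[i] = False  # exclude the sample itself
    a = D[i, own].mean()  # cohesion: mean distance within own cluster
    b = min(  # separation: mean distance to the nearest other cluster
        D[i, labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)


by_hand = np.array([silhouette_by_hand(i, D, toy_labels) for i in range(len(X_toy))])
from_sklearn = silhouette_samples(X_toy, toy_labels, metric="manhattan")

print(np.round(by_hand, 3))
print(np.round(from_sklearn, 3))
```

For the first sample, a = (1 + 2) / 2 = 1.5 and b = (10 + 11 + 12) / 3 = 11, so s = 9.5 / 11, which is about 0.864.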
9.4.5 External Evaluation: Adjusted Rand Index (ARI)
The Adjusted Rand Index (ARI) measures the similarity between the true labels (labs) and the labels assigned by our clustering algorithm (cluster_labels). It accounts for chance groupings.
Score Range: -0.5 to +1 (often quoted loosely as -1 to +1)
Interpretation:
+1: Perfect agreement between true and predicted labels.
0: Agreement no better than random labeling.
< 0: Worse than random labeling.
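A small sketch of these three cases, using made-up labels (the cancer type names here are placeholders, not the course labels). It also illustrates that ARI ignores the names of the clusters: only the grouping matters.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth: 3 types, 4 samples each
true_types = ["CNS"] * 4 + ["RENAL"] * 4 + ["MELANOMA"] * 4

# Same grouping, different cluster names: ARI is still 1
renamed = [2] * 4 + [0] * 4 + [1] * 4
ari_perfect = adjusted_rand_score(true_types, renamed)

# Random cluster assignments: ARI averages out near 0
rng = np.random.default_rng(0)
random_aris = [
    adjusted_rand_score(true_types, rng.integers(0, 3, size=len(true_types)))
    for _ in range(500)
]
ari_random_mean = float(np.mean(random_aris))

# A maximally discordant labeling of 4 samples: ARI is negative
ari_discordant = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])

print(f"perfect (renamed clusters): {ari_perfect:.3f}")
print(f"random (mean of 500 runs):  {ari_random_mean:.3f}")
print(f"discordant:                 {ari_discordant:.3f}")
```

Individual random labelings can score slightly above or below zero, which is why a single small ARI should be read as "roughly chance level" rather than as weak agreement.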
Here is how you would implement this:
from sklearn.metrics import silhouette_samples, silhouette_score, adjusted_rand_score  # Import evaluation metrics
import numpy as np
import matplotlib.pyplot as plt

# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# 1. Internal Evaluation: Silhouette Score
# Measures how well-separated clusters are based on the data itself.
silhouette = silhouette_score(X, cluster_labels, metric='manhattan')
print("Silhouette score")
print(silhouette)
print("Score is from -1 to 1. Higher is better")

# 2. External Evaluation: Adjusted Rand Index
# Compares our cluster labels to the true cancer type labels.
ari = adjusted_rand_score(labs, cluster_labels)
print("Adjusted Rand Index")
print(ari)
print("Compares to true labels. Score is from -1 to 1. Higher is better")
Silhouette score
0.1436950300449066
Score is from -1 to 1. Higher is better
Adjusted Rand Index
0.0554671516253694
Compares to true labels. Score is from -1 to 1. Higher is better
Silhouette plots
A silhouette plot is a visual diagnostic for clustering quality that (1) computes a silhouette value for each sample and (2) shows the distribution of those values for every cluster. It helps you see which clusters are tight and well-separated and which contain ambiguous or poorly assigned samples.
Notes on interpreting silhouette plots
Each horizontal block is a cluster; the width at a given vertical position is the silhouette value of a sample.
Values close to +1 → sample is well matched to its own cluster and poorly matched to neighbors.
Values near 0 → sample lies between clusters.
Negative values → sample is likely assigned to the wrong cluster.
The dashed vertical line is the average silhouette score; use it as a quick summary, but always inspect the per-cluster distributions, because a high average can hide poorly formed small clusters.
def plot_silhouette_simple(X, labels, metric='manhattan'):
    """
    Minimal silhouette-bar plot.

    X : array-like, shape (n_samples, n_features)
    labels : array-like, cluster labels (integers)
    metric : distance metric for silhouette (use same metric as clustering)
    """
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)
    if n_clusters < 2:
        print("Need at least 2 clusters to compute silhouette.")
        return

    # overall score
    s_score = silhouette_score(X, labels, metric=metric)
    print(f"Silhouette score: {s_score:.3f} (range -1 to 1; higher is better)")

    # per-sample silhouette values
    sample_vals = silhouette_samples(X, labels, metric=metric)

    plt.figure()
    y_lower = 10  # starting y position for first cluster
    for i, cl in enumerate(unique_labels):
        vals = sample_vals[labels == cl]
        vals.sort()
        size = vals.shape[0]
        y_upper = y_lower + size

        # draw horizontal filled bars for this cluster
        plt.fill_betweenx(np.arange(y_lower, y_upper), 0, vals, alpha=0.7)

        # cluster label at left
        plt.text(-0.05, y_lower + 0.5 * size, str(cl), va='center', fontsize=9)

        y_lower = y_upper + 10  # 10-pixel spacing between groups

    plt.axvline(x=s_score, color='k', linestyle='--', label=f'avg = {s_score:.3f}')
    plt.xlabel('Silhouette coefficient')
    plt.xlim(-0.1, 1.0)
    plt.ylabel('Samples (clusters stacked)')
    plt.title(f'Silhouette plot (n_clusters = {n_clusters})')
    plt.legend(loc='lower right')
    plt.ylim(0, y_lower)
    plt.tight_layout()
    plt.show()

# Usage
# from sklearn.cluster import AgglomerativeClustering
# agg = AgglomerativeClustering(linkage='average', metric='manhattan')
# cluster_labels = agg.fit_predict(X)
plot_silhouette_simple(X, cluster_labels, metric='manhattan')
Silhouette score: 0.144 (range -1 to 1; higher is better)
Compare to literature and what others have done
Plain old visual evaluation
Compare to the known labels of these cell lines (assuming these are available)
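One simple way to compare clusters with known labels is a contingency table: rows are the known types, columns are the clusters, and each cell counts the samples that fall in both. The sketch below uses made-up labels (known_types and cluster_ids are placeholders for this example). It also reports the fraction of samples that belong to the majority type of their cluster, a summary sometimes called purity.

```python
import pandas as pd

# Hypothetical labels, invented for illustration
known_types = ["CNS", "CNS", "CNS", "RENAL", "RENAL", "RENAL", "MELANOMA", "MELANOMA"]
cluster_ids = [0, 0, 1, 1, 1, 1, 2, 2]

# Rows: known type, columns: cluster, cells: number of samples
table = pd.crosstab(
    pd.Series(known_types, name="known type"),
    pd.Series(cluster_ids, name="cluster"),
)
print(table)

# Fraction of samples that match the majority type of their cluster
purity = table.max(axis=0).sum() / table.to_numpy().sum()
print(f"purity = {purity:.3f}")
```

A table like this shows at a glance which types are split across clusters and which clusters mix several types, which a single score such as ARI cannot show.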
How to interpret (practical heuristics)
Mean silhouette ≳ 0.5 → strong structure (good clustering).
Mean silhouette ≈ 0.25–0.5 → weak to moderate structure; inspect clusters individually.
Mean silhouette ≲ 0.25 → little structure; clustering may be unreliable. (These are rules of thumb — context and domain knowledge matter.)
Also compare to clusterings of other cancer cell lines
Does the cell line also show up in other datasets? (external validation)
9.5 Summary
Tip: Key Points
We learnt that evaluation is difficult in unsupervised machine learning!