9  Evaluation for unsupervised machine learning

Learning Objectives
  • Know how to evaluate unsupervised machine learning techniques.
  • Know the difference between internal, external, and biological evaluation.
  • Be able to compute and interpret common metrics (silhouette, ARI).
  • Learn practical diagnostics such as the silhouette plot.
  • Be able to choose metrics depending on whether you have ground truth, partial labels, or purely exploratory goals.
  • Communicate results in biologically meaningful ways (marker genes).

9.1 Conceptual framing

  • Internal metrics: use only the data + clustering labels. Measure compactness vs separation (e.g., silhouette).
  • External metrics: require ground truth/labels (experimental groups, annotated cell types). Use ARI, NMI, precision/recall on pairwise same/different labels.
  • Biological validation: compare clusters to known marker genes, pathways, experimental metadata (batch, donor), or enrichment tests. Often the most important for biologists.

9.2 Quick metrics cheat-sheet (what they tell you)

  • Explained variance (PCA) — fraction of variance captured by components (useful for dimensionality reduction decisions).
  • Silhouette score (−1..1) — how well each sample fits its cluster vs nearest other cluster; good general-purpose internal metric.
  • Calinski–Harabasz — ratio of between/within dispersion (higher = better).
  • Davies–Bouldin (lower = better) — average similarity between each cluster and its most similar one.
  • Adjusted Rand Index (ARI) — similarity between two labelings corrected for chance (1 = identical, about 0 = chance level, negative = worse than chance).
  • Normalized Mutual Information (NMI) — information overlap between labelings (0..1).
  • Trustworthiness (for embeddings like UMAP/t-SNE) — how well local neighborhoods are preserved.
  • Stability / reproducibility — how consistent cluster assignments are under parameter changes.
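As a rough illustration of the cheat-sheet above, the sketch below computes each metric with scikit-learn on synthetic data. The data (`X_demo`, `y_true`) and the choice of three clusters are assumptions made for this example only; they are not the course dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Synthetic stand-in data (assumption): 150 samples, 10 features, 3 known groups.
X_demo, y_true = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# Explained variance: fraction of total variance kept by a 2-component PCA.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X_demo)
explained = pca.explained_variance_ratio_.sum()

# Internal metrics: need only the data and the cluster labels.
labels_avg = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X_demo)
internal = {
    "silhouette": silhouette_score(X_demo, labels_avg),                # higher is better
    "calinski_harabasz": calinski_harabasz_score(X_demo, labels_avg),  # higher is better
    "davies_bouldin": davies_bouldin_score(X_demo, labels_avg),        # lower is better
}

# External metrics: compare the clustering to known labels.
external = {
    "ari": adjusted_rand_score(y_true, labels_avg),
    "nmi": normalized_mutual_info_score(y_true, labels_avg),
}

# Trustworthiness: are local neighbourhoods preserved in the 2-D embedding?
trust = trustworthiness(X_demo, X_2d, n_neighbors=5)

# Stability: agreement between clusterings obtained with different settings.
labels_complete = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X_demo)
stability = adjusted_rand_score(labels_avg, labels_complete)

print(f"explained variance (2 PCs): {explained:.3f}")
print("internal:", {k: round(float(v), 3) for k, v in internal.items()})
print("external:", {k: round(float(v), 3) for k, v in external.items()})
print(f"trustworthiness: {trust:.3f}")
print(f"stability (ARI, average vs complete linkage): {stability:.3f}")
```

On real data you would replace `X_demo` with your own matrix; the external metrics are only available when some form of reference labels exists.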

9.3 (Optional) Exercise on cancer data

  • You have been given some data on cancer cell lines

  • Team up with someone and perform hierarchical clustering on this data

  • You have been given some starter code to help you load the data

  • The data has been downloaded and processed for you (after you run the code below).

  • The data is in the variable named X

import numpy as np
import pandas as pd
import os
import requests
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Load data
X = pd.read_csv("https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/cancer_data_saved_NC160.csv", index_col=0)


print("Fetching labels from GitHub...")
labs_url = 'https://raw.githubusercontent.com/neelsoumya/python_machine_learning/main/data/NCI60labs.csv'
response = requests.get(labs_url)
response.raise_for_status()
# Read the raw text and split into lines.
all_lines = response.text.strip().splitlines()

# Skip the first line (the header) to match the data dimensions.
labs = all_lines[1:]

# The labels in the file are quoted (e.g., "CNS"), so we remove the quotes.
labs = [label.strip('"') for label in labs]

# Your code below ......
Fetching labels from GitHub...
  • Write your code while working in pairs or a group
# Hierarchical Clustering
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# Compute linkage matrix for the dendrogram
Z = linkage(X, method='average', metric='cityblock')

# Plot Dendrogram
plt.figure()
dendrogram(Z, labels=labs)
plt.title('Hierarchical Clustering Dendrogram (NCI60, Average Linkage, Manhattan Distance)')
plt.xlabel('Cell Line')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

9.3.1 Exercise: try another linkage method and another distance metric
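One possible way to approach this exercise systematically is to loop over several linkage methods and distance metrics and record a score for each combination. The sketch below does this on synthetic data (`X_demo` is an assumption, not the NCI60 matrix) and cuts each tree into four clusters, which is also an arbitrary choice for illustration.

```python
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): 60 samples, 20 features.
X_demo, _ = make_blobs(n_samples=60, n_features=20, centers=4, random_state=1)

# Note: 'ward' linkage is left out because it requires Euclidean distance.
methods = ["single", "complete", "average"]
metrics = ["euclidean", "cityblock", "correlation"]

results = {}
for method, metric in product(methods, metrics):
    # Build the tree, then cut it into at most 4 flat clusters.
    Z_demo = linkage(X_demo, method=method, metric=metric)
    labels = fcluster(Z_demo, t=4, criterion="maxclust")
    if len(np.unique(labels)) < 2:
        continue  # silhouette needs at least two clusters
    # Score with the same distance metric that was used to build the tree.
    results[(method, metric)] = silhouette_score(X_demo, labels, metric=metric)

# Print combinations from highest to lowest silhouette score.
for (method, metric), score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{method:>8} + {metric:<11} silhouette = {score:.3f}")
```

A higher silhouette for one combination does not make it the right answer; it is one piece of evidence to weigh alongside the dendrogram and what you know about the biology.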

Important Concept (recall)
  • There is no “correct” answer in unsupervised machine learning!

  • So how do you know when you are done?

9.4 Evaluating the Quality of Clusters

Evaluating the quality of clusters is a crucial step in any unsupervised learning task. Since we do not have a single correct answer, we use several methods that fall into three main categories:

9.4.1 1. Internal Evaluation

Measures how good the clustering is based only on the data itself (e.g., how dense and well-separated the clusters are).

9.4.2 2. External Evaluation

Measures how well the clustering results align with known, ground-truth labels. This is possible here because the NCI60 dataset has known cancer cell line types, which we loaded as labs.

9.4.3 3. Visual Evaluation

Inspecting plots (like the dendrogram or PCA) to see if the groupings seem logical.


Let us add the two most common metrics: one internal and one external.


9.4.4 Internal Evaluation: Silhouette Score

The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

  • Score Range: -1 to +1
  • Interpretation:
    • +1: The sample is far away from the neighboring clusters (very good).
    • 0: The sample is on or very close to the decision boundary between two neighboring clusters.
    • -1: The sample is likely assigned to the wrong cluster (it is closer, on average, to a neighboring cluster than to its own).
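To make the definition concrete, the sketch below computes the silhouette value of each sample by hand and checks it against scikit-learn. For sample i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette value is (b - a) / max(a, b). The toy points and the helper `silhouette_by_hand` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Toy data (assumption): two tight groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

# All pairwise Manhattan distances between samples.
D = pairwise_distances(X_toy, metric="manhattan")


def silhouette_by_hand(i, D, labels):
    """Silhouette value of sample i; assumes its cluster has at least 2 members."""
    own = labels == labels[i]
    own[i] = False  # exclude the sample itself from its own cluster
    a = D[i, own].mean()  # cohesion: mean distance within own cluster
    # separation: mean distance to the nearest other cluster
    b = min(D[i, labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)


by_hand = np.array([silhouette_by_hand(i, D, labels_toy) for i in range(len(X_toy))])
from_sklearn = silhouette_samples(X_toy, labels_toy, metric="manhattan")

print(np.round(by_hand, 3))
print(np.allclose(by_hand, from_sklearn))
```

The overall silhouette score reported by `silhouette_score` is simply the mean of these per-sample values.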

9.4.5 External Evaluation: Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the similarity between the true labels (labs) and the labels assigned by our clustering algorithm (cluster_labels). It accounts for chance groupings.

  • Score Range: -0.5 to +1
  • Interpretation:
    • +1: Perfect agreement between true and predicted labels.
    • 0: Random labeling (no correlation).
    • < 0: Worse than random labeling.
  • Here is how you would compute both scores:
from sklearn.metrics import silhouette_samples, silhouette_score, adjusted_rand_score # Import evaluation metrics
import numpy as np
import matplotlib.pyplot as plt

# Hierarchical Clustering (n_clusters is left at its default of 2)
agg = AgglomerativeClustering(linkage='average', metric='manhattan')
cluster_labels = agg.fit_predict(X)

# 1. Internal Evaluation: Silhouette Score
# Measures how well-separated clusters are based on the data itself.
silhouette = silhouette_score(X, cluster_labels, metric='manhattan')
print("Silhouette score")
print(silhouette)
print("Score is from -1 to 1. Higher is better")

# 2. External Evaluation: Adjusted Rand Index
# Compares our cluster labels to the true cancer type labels.
ari = adjusted_rand_score(labs, cluster_labels)
print("Adjusted Rand Index")
print(ari)
print("Compares to true labels. Score is at most 1 (0 is chance level). Higher is better")
Silhouette score
0.1436950300449066
Score is from -1 to 1. Higher is better
Adjusted Rand Index
0.0554671516253694
Compares to true labels. Score is at most 1 (0 is chance level). Higher is better
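To build some intuition for ARI values like the one above, the sketch below evaluates three toy cases: a perfect clustering whose cluster numbers happen to differ from the true names, a purely random labeling, and a maximally discordant labeling. All labels here are invented for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Case 1: perfect grouping, but cluster IDs are arbitrary integers.
# ARI only looks at which samples are grouped together, not at the label names.
true_types = ["CNS", "CNS", "RENAL", "RENAL", "BREAST", "BREAST"]  # toy labels
renamed_clusters = [2, 2, 0, 0, 1, 1]
ari_perfect = adjusted_rand_score(true_types, renamed_clusters)

# Case 2: random cluster assignments on a larger toy problem give ARI near 0.
rng = np.random.default_rng(0)
true_big = np.repeat(np.arange(5), 200)          # 5 groups of 200 samples
random_labels = rng.integers(0, 5, size=true_big.size)
ari_random = adjusted_rand_score(true_big, random_labels)

# Case 3: a clustering that systematically splits every true group.
ari_discordant = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])

print(f"perfect (renamed): {ari_perfect:.3f}")
print(f"random:            {ari_random:.3f}")
print(f"discordant:        {ari_discordant:.3f}")
```

An ARI of about 0.06, as obtained above, therefore means the two clusters agree with the cancer-type labels only slightly better than chance. Part of the reason is that two clusters cannot reproduce a labeling with many more categories.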
  • Silhouette plots

A silhouette plot is a visual diagnostic for clustering quality that (1) computes a silhouette value for each sample and (2) shows the distribution of those values for every cluster. It helps you see which clusters are tight and well-separated and which contain ambiguous or poorly assigned samples.

  • Notes on interpreting silhouette plots

  • Each horizontal block is a cluster; the width at a given vertical position is the silhouette value of a sample.

  • Values close to +1 → sample is well matched to its own cluster and poorly matched to neighbors.

  • Values near 0 → sample lies between clusters.

  • Negative values → sample is likely assigned to the wrong cluster.

  • The dashed vertical line marks the average silhouette score; use it as a quick summary, but always inspect the per-cluster distributions, because a high average can hide poorly formed small clusters.

def plot_silhouette_simple(X, labels, metric='manhattan'):
    """
    Minimal silhouette-bar plot.
    X : array-like, shape (n_samples, n_features)
    labels : array-like, cluster labels (integers)
    metric : distance metric for silhouette (use same metric as clustering)
    """
    labels = np.asarray(labels)  # allow plain lists as well as arrays
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)
    if n_clusters < 2:
        print("Need at least 2 clusters to compute silhouette.")
        return

    # overall score
    s_score = silhouette_score(X, labels, metric=metric)
    print(f"Silhouette score: {s_score:.3f}  (range -1 to 1; higher is better)")

    # per-sample silhouette values
    sample_vals = silhouette_samples(X, labels, metric=metric)

    plt.figure()
    y_lower = 10  # starting y position for first cluster
    for i, cl in enumerate(unique_labels):
        vals = sample_vals[labels == cl]
        vals.sort()
        size = vals.shape[0]
        y_upper = y_lower + size

        # draw horizontal filled bars for this cluster
        plt.fill_betweenx(np.arange(y_lower, y_upper),
                          0, vals,
                          alpha=0.7)

        # cluster label at left
        plt.text(-0.05, y_lower + 0.5 * size, str(cl), va='center', fontsize=9)

        y_lower = y_upper + 10  # leave a gap of 10 y-axis units between clusters

    plt.axvline(x=s_score, color='k', linestyle='--', label=f'avg = {s_score:.3f}')
    plt.xlabel('Silhouette coefficient')
    plt.xlim(min(-0.1, sample_vals.min() - 0.05), 1.0)  # widen the axis so negative values stay visible
    plt.ylabel('Samples (clusters stacked)')
    plt.title(f'Silhouette plot (n_clusters = {n_clusters})')
    plt.legend(loc='lower right')
    plt.ylim(0, y_lower)
    plt.tight_layout()
    plt.show()

# Usage
# from sklearn.cluster import AgglomerativeClustering
# agg = AgglomerativeClustering(linkage='average', metric='manhattan')
# cluster_labels = agg.fit_predict(X)
plot_silhouette_simple(X, cluster_labels, metric='manhattan')
Silhouette score: 0.144  (range -1 to 1; higher is better)

  • Compare your clusters with the literature and with what others have found.

  • Plain old visual evaluation.

  • Compare the clusters to the known labels of these cell lines (assuming these are available).

How to interpret (practical heuristics)

  • Mean silhouette ≳ 0.5 → strong structure (good clustering).

  • Mean silhouette ≈ 0.25–0.5 → weak to moderate structure; inspect clusters individually.

  • Mean silhouette ≲ 0.25 → little structure; clustering may be unreliable. (These are rules of thumb — context and domain knowledge matter.)

  • Also compare to clusterings of other cancer cell lines

  • Does the cell line also show up in other datasets? (external validation)
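The rule-of-thumb thresholds above can also be applied cluster by cluster, which is one way to "inspect clusters individually". The sketch below is a minimal example on synthetic data; the helper names `summarise_silhouette_per_cluster` and `describe_structure` are hypothetical, and the cut-offs are the heuristics from this section, not fixed standards.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples


def describe_structure(mean_silhouette):
    """Map a mean silhouette value to the rule-of-thumb categories."""
    if mean_silhouette >= 0.5:
        return "strong structure"
    if mean_silhouette >= 0.25:
        return "weak to moderate structure"
    return "little structure"


def summarise_silhouette_per_cluster(X, labels, metric="euclidean"):
    """Return size, mean silhouette and number of negative values per cluster."""
    labels = np.asarray(labels)
    values = silhouette_samples(X, labels, metric=metric)
    summary = {}
    for cl in np.unique(labels):
        vals = values[labels == cl]
        summary[int(cl)] = {
            "n": int(vals.size),
            "mean": float(vals.mean()),
            "n_negative": int((vals < 0).sum()),  # samples closer to another cluster
        }
    return summary


# Synthetic stand-in data (assumption): 120 samples in 3 groups.
X_demo, _ = make_blobs(n_samples=120, n_features=5, centers=3, random_state=2)
demo_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X_demo)

summary = summarise_silhouette_per_cluster(X_demo, demo_labels)
for cl, stats in summary.items():
    print(f"cluster {cl}: n={stats['n']}, mean={stats['mean']:.3f}, "
          f"negative={stats['n_negative']} -> {describe_structure(stats['mean'])}")
```

With the course data you could call `summarise_silhouette_per_cluster(X, cluster_labels, metric='manhattan')` to see whether the low overall score is driven by one cluster or by both.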

9.5 Summary

Key Points
  • We learnt that evaluation is difficult in unsupervised machine learning!