9  Hands-on exercises (Applications of unsupervised machine learning)

Learning Objectives
  • Understand real-world scenarios where unsupervised learning is applied
  • Identify situations where PCA and other dimensionality reduction techniques may not be effective
  • Work through practical examples of data on which to try unsupervised learning techniques
  • Learn how to evaluate the performance of unsupervised learning methods
  • Interpret and communicate the results of these models to each other

9.1 When PCA may not work

9.1.1 Non-linear data

  • Non-linearity: data that lies on curved surfaces (manifolds) or has non-linear relationships between features
  • Single-cell data: Biological data where cell types form non-linear clusters in high-dimensional space

9.1.2 Categorical Features

  • PCA may work poorly with categorical data unless properly encoded
  • One-hot encoding categorical features can create sparse, high-dimensional data where PCA may not capture meaningful structure

9.2 Alternatives

9.2.1 t-SNE (t-Distributed Stochastic Neighbor Embedding)

  • Best for: Non-linear dimensionality reduction and visualization
  • Key parameter: Perplexity (try values 5-50)
  • Use case: Single-cell data, biological expression data, any non-linear clustering
Tip

NOTE (IMPORTANT CONCEPT): Sometimes tSNE may not work well either! It is hard to predict which unsupervised machine learning technique will work best.

You simply have to try several different techniques and compare the results.

9.2.2 Hierarchical Clustering + Heatmaps

  • Best for: Categorical data and understanding relationships between samples
  • Use case: When you want to see how samples group together based on multiple features

9.2.3 Demonstrating how PCA or tSNE may not work well

  • Generate synthetic biological expression data: a matrix of 200 samples × 10 genes, where Gene_1 and Gene_2 form four corner clusters and the remaining genes are pure Gaussian noise. A scatter plot of Gene_1 vs Gene_2 shows that the true structure is four separate clusters, not aligned with any single direction of variance, so PCA may fail to separate these clusters along its principal components (and tSNE may struggle too). A sketch for generating such data is given below.
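
A minimal sketch for generating the synthetic data (the cluster centres at ±5 and the noise scale are assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_per_cluster = 50  # 4 clusters x 50 samples = 200 samples

# Gene_1 and Gene_2: four clusters, one at each corner
corners = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
signal = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_cluster, 2))
                    for c in corners])

# The remaining 8 genes are pure Gaussian noise
noise = rng.normal(size=(4 * n_per_cluster, 8))

df = pd.DataFrame(np.hstack([signal, noise]),
                  columns=[f'Gene_{i}' for i in range(1, 11)])

# Quick look at the true structure
plt.figure()
plt.scatter(df['Gene_1'], df['Gene_2'])
plt.xlabel('Gene_1')
plt.ylabel('Gene_2')
plt.title('True cluster structure (Gene_1 vs Gene_2)')
plt.show()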

  • Perform PCA on this data
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA
pca = PCA()
pcs = pca.fit_transform(df) # where df is a dataframe with your data


# Scatter plot of the first two principal components
plt.figure()
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA on Synthetic Biological Dataset')
plt.show()

  • Let us try tSNE on this data
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE()  # default settings; the default perplexity is 30
tsne_results = tsne.fit_transform(df)

# plot
plt.figure()
plt.scatter(tsne_results[:,0], tsne_results[:,1])
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.title('t-SNE on Synthetic Biological Dataset')
plt.show()

  • What if we try different values of perplexity?
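
A minimal sketch that sweeps a few perplexity values on the same data (assuming the df from above):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Compare several perplexity values side by side
perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(16, 4))
for ax, perp in zip(axes, perplexities):
    embedding = TSNE(perplexity=perp, random_state=42).fit_transform(df)
    ax.scatter(embedding[:, 0], embedding[:, 1])
    ax.set_title(f'perplexity = {perp}')
plt.tight_layout()
plt.show()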

What if data has categorical features?

  • PCA may not work if you have categorical features

For example, if you have data that looks like this ….

  species tissue condition
0   human  liver  diseased
1   mouse  brain  diseased
2   human  liver  diseased
3   human  brain  diseased
4   mouse  brain   healthy

  • We can split the data by condition (diseased/healthy) or by other features.
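
The clustering code below assumes a one-hot encoded matrix called encoded_df. A minimal sketch of the encoding step, recreating the small example table above:

import pandas as pd

# Recreate the example table of categorical features
df_cat = pd.DataFrame({
    'species':   ['human', 'mouse', 'human', 'human', 'mouse'],
    'tissue':    ['liver', 'brain', 'liver', 'brain', 'brain'],
    'condition': ['diseased', 'diseased', 'diseased', 'diseased', 'healthy'],
})

# One-hot encode: each category level becomes a 0/1 column
encoded_df = pd.get_dummies(df_cat, dtype=int)
print(encoded_df.head())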

  • Hierarchical clustering

Recall:

Leaves: Each leaf at the bottom of the dendrogram represents one sample from your dataset.

Branches: The branches connect the samples and groups of samples. The height of the branch represents the distance (dissimilarity) between the clusters being merged.

Height of Merges: Taller branches indicate that the clusters being merged are more dissimilar, while shorter branches indicate more similar clusters.

Clusters: By drawing a horizontal line across the dendrogram at a certain distance, you can define clusters. All samples below that line that are connected by branches form a cluster.

  • In the context of your one-hot encoded categorical data (species, tissue, condition), the dendrogram shows how samples are grouped based on their combinations of these categorical features.

  • Samples with the same or very similar combinations of categories will be closer together in the dendrogram and merge at lower distances.

  • The structure of the dendrogram reflects the relationships and similarities between the different combinations of species, tissue, and condition present in your synthetic dataset.

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
import seaborn as sns

# 'encoded_df' is the one-hot encoded DataFrame from the step above
linked = linkage(encoded_df,
                 method='ward',
                 metric='euclidean',
                 optimal_ordering=True)

# plot dendrogram
plt.figure()
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram on One-Hot Encoded Categorical Data')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# or use sns.clustermap(), which clusters and draws the heatmap in one call
sns.clustermap(data=encoded_df,
               method="ward",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True,
               cmap="vlag")

  • Heatmaps

Heatmaps are a great way to visualize data and clustering

import seaborn as sns
import matplotlib.pyplot as plt

# Assume 'encoded_df' exists from the previous one-hot encoding step

plt.figure()
sns.heatmap(encoded_df.T, cmap='viridis', cbar_kws={'label': 'Encoded Value (0 or 1)'}) # Transpose for features on y-axis

plt.title('Heatmap of One-Hot Encoded Categorical Data')
plt.xlabel('Sample Index')
plt.ylabel('Encoded Feature')
plt.tight_layout()
plt.show()

9.3 Exercises

  • Break up into small groups and work on any one of the following small projects.

9.3.1 Project using electronic healthcare records data

Exercise 1 - Electronic healthcare records data

For this exercise we will be using some data from hospital electronic healthcare records (EHR). No knowledge of biology/healthcare is required for this.

Project briefing

Here is a brief code snippet to help you load the data and get started.

Follow these steps:

  • Data Loading and Preprocessing: Loading a diabetes dataset and normalizing numerical features.

  • Dimensionality Reduction: Applying PCA and t-SNE to reduce the dimensions of the data for visualization and analysis.

  • Clustering: Performing K-Means clustering on the reduced data to identify potential patient subgroups.

  • Visualization: Visualizing the data in lower dimensions and the identified clusters to gain insights.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

#######################
# Load diabetes data 
#######################

url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/diabetes_kaggle.csv"
df = pd.read_csv(url)

######################################
# Perform data munging and filtering
######################################
print(df.head())

# Normalize numeric columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
scaler = MinMaxScaler()
df_normalized = df.copy() # make a copy
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# ALTERNATIVE CODE (repeat for each numeric column)
# Make a copy so the original DataFrame stays unchanged
# df_normalized = df.copy()

# Create the scaler
# scaler = MinMaxScaler()

# Select the 'Glucose' column as a DataFrame (double brackets keep 2D shape)
# glucose_values = df_normalized[['Glucose']]

# Fit the scaler and transform the values
# glucose_scaled = scaler.fit_transform(glucose_values)

# Put the scaled values back into the copy
# df_normalized[['Glucose']] = glucose_scaled

# Filter: Glucose > 0.5 and BMI < 0.3 (normalized values)
filtered_df = df_normalized[
    (df_normalized['Glucose'] > 0.5) &
    (df_normalized['BMI'] < 0.3)
]

print(filtered_df.head())
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
     Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
9       0.470588  0.628141       0.786885       0.000000  0.000000  0.000000   
49      0.411765  0.527638       0.000000       0.000000  0.000000  0.000000   
50      0.058824  0.517588       0.655738       0.111111  0.096927  0.289121   
145     0.000000  0.512563       0.614754       0.232323  0.000000  0.000000   
239     0.000000  0.522613       0.622951       0.000000  0.000000  0.274218   

     DiabetesPedigreeFunction       Age  Outcome  
9                    0.065756  0.550000      1.0  
49                   0.096926  0.050000      0.0  
50                   0.176345  0.016667      0.0  
145                  0.210931  0.000000      0.0  
239                  0.215201  0.100000      0.0  
  • Visualize the data
# Histogram
plt.figure()
sns.histplot(df_normalized['Glucose'], bins=30)
plt.title('Distribution of Normalised Glucose')
plt.xlabel('Normalised Glucose')
plt.ylabel('Frequency')
plt.show()

  • Now visualize the other variables. Do you notice anything interesting/odd about them? Hint: use sns.histplot() as shown above or plt.hist().

  • Data visualization is a key step in machine learning. Make sure to spend some time visualizing all the variables/features. Discuss the plots in your group.

  • Perform PCA

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Exclude the target column for PCA
# We do not want to include it, because 'Outcome' is the value you would
# try to predict; it can be used later in supervised machine learning.
features = df_normalized.drop(columns=['Outcome'])

# Apply PCA
# This is where you fill in your code .........
  • Fill in the rest of the code with your group members.

  • Perform PCA and visualize it. Hint: use plt.scatter() or sns.scatterplot().
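
A minimal sketch of one possible completion (it defines the pca_df used in the colouring code below; reducing to two components is an assumption):

# Reduce to two components for visualization
pca = PCA(n_components=2)
pcs = pca.fit_transform(features)
pca_df = pd.DataFrame(pcs, columns=['PC1', 'PC2'], index=features.index)

plt.figure()
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Diabetes Dataset')
plt.show()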

  • Evaluation (how to interpret the PCA plots?)

  • Reminder: In plt.scatter, the c parameter controls the marker colour (or colours).

  • The alpha parameter controls the transparency (opacity) of the markers.

  • When passing numeric values to c, you can specify cmap (colour map) to control how the numbers map to colours (otherwise the default colour map is used)

  • Let us colour by the feature BMI now

# Visualize PCA results colored by BMI
plt.figure()
scatter = plt.scatter(pca_df['PC1'], pca_df['PC2'], c = df_normalized['BMI'],
                     cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label='BMI (normalized)')
plt.title('PCA of Diabetes Dataset - Coloured by BMI')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

  • Do you see any patterns?

  • Now colour by Pregnancies

  • Try other features: Glucose, BloodPressure, SkinThickness, Insulin, DiabetesPedigreeFunction, Age

  • Try spotting any patterns and discuss this in your group.

  • Recall: The primary goal of unsupervised machine learning is to uncover hidden patterns, structures, and relationships within the data.

  • This can lead to the generation of new hypotheses about the underlying phenomena, which can then be tested in follow-up studies using statistical methods or through the application of supervised machine learning techniques with labeled data.

  • Essentially, unsupervised learning helps us explore the data and formulate questions that can be further investigated.

  • However, unsupervised learning is rarely the end of the data science pipeline; it typically leads to further investigation.

  • Now try tSNE on this data

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Exclude the target column for t-SNE
features = df_normalized.drop(columns=['Outcome'])

# Apply t-SNE
# This is where you fill in your code .........
  • Perform tSNE on this data

  • Vary the perplexity parameter
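
A minimal sketch of one possible completion (it defines tsne_results, used in the colouring code below; the perplexity of 30 is just a starting point to vary):

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_results = tsne.fit_transform(features)

plt.figure()
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], alpha=0.7)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE of Diabetes Dataset')
plt.show()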

  • Now let us colour the tSNE plot by BMI

# The target column 'Outcome' was already excluded above, so we can
# reuse 'features' here

# Create a DataFrame for the t-SNE results
tsne_df = pd.DataFrame(data=tsne_results, columns=['TSNE1', 'TSNE2'])

# Visualize t-SNE colored by BMI
plt.figure()
scatter = plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=df_normalized['BMI'],
                     cmap='viridis', alpha=0.7, s=50)
plt.colorbar(scatter, label='BMI (normalized)')
plt.title('t-SNE of Diabetes Dataset - Colored by BMI')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

  • Now colour the tSNE plot by some other feature. Try Glucose, BloodPressure, SkinThickness, Insulin, DiabetesPedigreeFunction, Age

  • Do you observe any patterns? Discuss in your group.

  • Perform hierarchical clustering on this data

import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import seaborn as sns

# Exclude the target column for clustering
features = df_normalized.drop(columns=['Outcome'])

# Perform hierarchical clustering
# This is where you fill in your code .........
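
A minimal sketch of one possible completion, continuing the starter code above (cutting the tree into four clusters is an assumption):

# Ward linkage on the normalized features
linked = linkage(features, method='ward', metric='euclidean')

# Dendrogram of all samples
plt.figure()
dendrogram(linked, no_labels=True)
plt.title('Hierarchical Clustering of Diabetes Dataset')
plt.xlabel('Sample')
plt.ylabel('Distance')
plt.show()

# Cut the tree at a chosen number of clusters
cluster_labels = fcluster(linked, t=4, criterion='maxclust')
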
  • Alternatively you can use sns.clustermap().

Hint: Here is some code to get you started.

ehr_row_linkage = linkage(features, method="ward")

# plot heatmap using sns.clustermap()
sns.clustermap(data=features,
               row_linkage=ehr_row_linkage,
               cmap="vlag",
               standard_scale=0)

  • Perform k-means on this data.
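
A minimal sketch to get started (k=3, and clustering on the full feature table rather than on the PCA scores, are assumptions):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(features)

# Visualize the clusters on the PCA projection from earlier
plt.figure()
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=cluster_labels,
            cmap='tab10', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clusters on PCA of Diabetes Dataset')
plt.show()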

  • Discuss in your group the outcome of this project.

  • What are your key findings?

  • Do you think we can find partitions of patients/clusters of patients?

  • What can you do with these partitions?

Work in a group!

9.3.2 Project using single-cell sequencing data

Exercise 2 - Single-cell sequencing

For this exercise we will be using some single-cell sequencing data. No biological expertise is required for this.

Exercise

Here is a brief code snippet to help you load the data and get started.

Follow these steps:

  • Data Loading and Preprocessing: Loading a single-cell sequencing dataset and normalizing features.
  • Install packages
!pip install scanpy scipy matplotlib pandas seaborn
  • Load libraries
import scanpy as sc
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
  • Load and Preprocess Data: Load the pbmc3k dataset using scanpy, normalize the total counts per cell to 10,000, and then apply a log transformation.

  • Then pick out a few “marker” genes (genes that may be biologically important, based on our prior knowledge).

  • The single cell data is just a table of numbers: the rows are different cells, the columns are genes measured in those cells. Here is what this would look like:

Cell       CD3D  CD4  CD8A  FOXP3  IL2RA
Cell_001    0.5  1.2   0.0    2.1    0.8
Cell_002    1.1  0.3   1.5    0.0    1.9
Cell_003    0.0  2.4   0.7    1.3    0.4
Cell_004    1.8  0.0   2.2    0.9    1.1
Cell_005    0.3  1.7   0.0    1.6    0.2
# 1. Load data and preprocess
adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# 2. Subset to marker genes
marker_genes = [
    'CD3D','CD3E','CD4','CD8A',
    'CD14','LYZ',
    'MS4A1',
    'GNLY','NKG7'
]

genes = [g for g in marker_genes if g in adata.var_names]

expr = pd.DataFrame(
    adata[:, genes].X.toarray(),
    index=adata.obs_names,
    columns=genes
)

print(expr.head())
                      CD3D      CD3E  CD4      CD8A  CD14       LYZ     MS4A1  \
index                                                                           
AAACATACAACCAC-1  2.863463  2.225817  0.0  1.635208   0.0  1.635208  0.000000   
AAACATTGAGCTAC-1  0.000000  0.000000  0.0  0.000000   0.0  1.962726  2.583047   
AAACATTGATCAGC-1  3.489089  1.994867  0.0  0.000000   0.0  1.994867  0.000000   
AAACCGTGCTTCCG-1  0.000000  0.000000  0.0  0.000000   0.0  4.521174  0.000000   
AAACCGTGTATGCG-1  0.000000  0.000000  0.0  0.000000   0.0  0.000000  0.000000   

                      GNLY      NKG7  
index                                 
AAACATACAACCAC-1  0.000000  0.000000  
AAACATTGAGCTAC-1  0.000000  1.111715  
AAACATTGATCAGC-1  1.429261  0.000000  
AAACCGTGCTTCCG-1  0.000000  1.566387  
AAACCGTGTATGCG-1  3.452557  4.728542  
  • We now have a table of numbers: the rows are cells, and columns are genes measured in those cells.

  • Visualize the data. Use plt.hist() or sns.histplot().

  • Now perform PCA on this data (Hint: expr.values has all the values. Perform PCA on this.)

  • Now colour this PCA plot by one marker gene CD3D. The CD3D gene is crucial for immune response. Mutations in this gene can lead to disease. Hint: expr["CD3D"] will get you all the values of the gene. Use that in the c = option in plt.scatter().
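
A minimal sketch combining the two steps above (the variable name cell_pcs is my own):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
cell_pcs = pca.fit_transform(expr.values)

plt.figure()
scatter = plt.scatter(cell_pcs[:, 0], cell_pcs[:, 1], c=expr['CD3D'],
                      cmap='viridis', s=10, alpha=0.7)
plt.colorbar(scatter, label='CD3D expression')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of PBMC cells - coloured by CD3D')
plt.show()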

  • Discuss in your group: what do you think the plot means?

  • Now try the other marker genes: CD3E, CD4, CD8A, CD14, LYZ, MS4A1, GNLY, NKG7

  • Discuss in your group: what do you think the plot means?

  • Now perform tSNE on this data. Hint: expr.values has all the values; run tSNE on that.

  • Now colour this tSNE plot by one marker gene, CD3D. Hint: expr["CD3D"] will get you all the values of the gene. Use that in the c = option in plt.scatter().
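
A minimal sketch (the perplexity of 30 is a starting value; tSNE on ~2,700 cells may take a minute):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
cell_tsne = tsne.fit_transform(expr.values)

plt.figure()
scatter = plt.scatter(cell_tsne[:, 0], cell_tsne[:, 1], c=expr['CD3D'],
                      cmap='viridis', s=10, alpha=0.7)
plt.colorbar(scatter, label='CD3D expression')
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.title('t-SNE of PBMC cells - coloured by CD3D')
plt.show()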

  • Discuss in your group: what do you think the plot means?

  • Now try the other marker genes: CD3E, CD4, CD8A, CD14, LYZ, MS4A1, GNLY, NKG7

  • Discuss in your group: what do you think the plot means?

  • Reminder: tSNE is stochastic.

  • Run tSNE again. Do the clusters remain the same? Can you see the same patterns?

  • Run tSNE with a different perplexity value. Do the clusters remain the same?

  • Discuss in your group your key findings. What can you say about these clusters?

  • Now perform hierarchical clustering on this data.

  • Try a few distance functions and linkage functions.

  • Plot heatmaps or clustermaps (Hint: seaborn clustermap does both dendrograms + heatmap in one shot).

  • Hint: Here is some code to get you started.

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
import seaborn as sns

# compute linkages: rows of expr are cells, columns are genes
cell_link = linkage(pdist(expr,   metric="euclidean"), method="ward")
gene_link = linkage(pdist(expr.T, metric="euclidean"), method="ward")

# seaborn clustermap does both dendrograms + heatmap in one shot
# Fill in the code below .......
sns.clustermap(.......)
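
One possible completion (a sketch; passing the precomputed linkages keeps the row and column orderings consistent with the dendrograms):

sns.clustermap(expr,
               row_linkage=cell_link,
               col_linkage=gene_link,
               cmap="vlag")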


  • Perform k-means on this data.

  • Discuss in your group the outcome of this project.

  • What are your key findings?

  • Do you think we can find partitions of cells/clusters of cells?

  • What can you do with these partitions?

Work in a group!

9.3.3 Project using GapMinder data

Exercise 3 - GapMinder data

For this exercise we will be using sociological data.

Exercise

In this exercise you will explore the Gapminder dataset, focusing on life expectancy, GDP per capita, and population data. You will perform the following steps initially:

  1. Data Loading and Setup: The gapminder dataset is loaded, and necessary libraries for data manipulation, visualization, and dimensionality reduction are imported.
  2. Feature Selection: The features lifeExp, gdpPercap, and pop are selected for analysis.

Here is a brief code snippet to help you load the data and get started.

  • Install packages
!pip install scipy matplotlib pandas seaborn
  • Load libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import dendrogram, linkage
  • Load data
# Download Gapminder data
url = "https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv"
gap = pd.read_csv(url)

print(gap.head())
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
  • Subset to countries in Asia and aggregate
# Aggregate by country: mean of features for each Asian country
features = ['lifeExp', 'gdpPercap']
asia_gap_unique = gap[gap['continent'] == 'Asia'].groupby('country')[features].mean().reset_index()

print(asia_gap_unique.head())
       country    lifeExp     gdpPercap
0  Afghanistan  37.478833    802.674598
1      Bahrain  65.605667  18077.663945
2   Bangladesh  49.834083    817.558818
3     Cambodia  47.902750    675.367824
4        China  61.785140   1488.307694
  • Visualize the features by using plt.hist() or sns.histplot()

  • Then perform PCA on it. Hint: you need to normalize your data also.
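
A minimal sketch (StandardScaler is one reasonable normalization choice; the pcs array is reused in the labelling code below):

from sklearn.preprocessing import StandardScaler

# Normalize the two features, then apply PCA
scaled = StandardScaler().fit_transform(asia_gap_unique[features])
pca = PCA()
pcs = pca.fit_transform(scaled)

plt.figure()
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA on Gapminder data for Asian countries')
plt.show()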

  • Is there anything “odd” about your plot? Discuss this in your group.

  • Now label each point on the PCA plot with its country name

Hint: The following code will not work (since “country” is categorical). You will have to be a bit creative!

plt.figure()
plt.scatter(pcs[:,0], pcs[:,1], c=asia_gap_unique["country"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Plot of PCA on Gapminder data for Asian countries")
plt.show()

Here are some code hints to help you.

plt.figure()
plt.scatter(pcs[:,0], pcs[:,1])

# add country labels
for i, country in enumerate( asia_gap_unique["country"] ):
    # fill in your code here 
    plt.annotate(.....)

plt.show()
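
One possible completion of the annotate call (a sketch; the font size is a cosmetic choice):

# inside the loop above:
plt.annotate(country, (pcs[i, 0], pcs[i, 1]), fontsize=8)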


  • What does PC1 mean? Are there any features that are correlated with PC1?

Hint: Perform a scatterplot (plt.scatter()) for each feature vs. PC1
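
A short sketch of this hint (assuming pcs from the PCA step above):

for feat in ['lifeExp', 'gdpPercap']:
    plt.figure()
    plt.scatter(asia_gap_unique[feat], pcs[:, 0])
    plt.xlabel(feat)
    plt.ylabel('PC1')
    plt.title(f'{feat} vs PC1')
    plt.show()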

  • Perform tSNE on this data
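
A minimal sketch (it defines asia_tsne, used in the labelling code below, and reuses the scaled features from the PCA step; note that perplexity must be smaller than the number of countries):

tsne = TSNE(perplexity=10, random_state=42)
asia_tsne = tsne.fit_transform(scaled)

plt.figure()
plt.scatter(asia_tsne[:, 0], asia_tsne[:, 1])
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.title('t-SNE on Gapminder data for Asian countries')
plt.show()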

  • Do you notice anything “odd”/“interesting” about this plot?

  • Change the perplexity parameter and observe what it does to the plot.

  • Now add the labels of the countries to this tSNE plot.

Here are some code hints to help you.

plt.figure()
plt.scatter(asia_tsne[:,0], asia_tsne[:,1])

# add country labels
for i, country in enumerate( asia_gap_unique["country"] ):
    # fill in your code here 
    plt.annotate(.....)

plt.show()
  • Now perform hierarchical clustering on this data.

  • Discuss the outcomes of your project in your group. Explain your key outcomes (in a few minutes) to everyone in the class.

Work in a group!

9.4 Summary

Key Points
  • Understand real-world scenarios where unsupervised learning is applied
  • Identify situations where PCA and other dimensionality reduction techniques may not be effective
  • Work through practical examples of data on which to try unsupervised learning techniques