9 Hands-on exercises (Applications of unsupervised machine learning)
- Understand real-world scenarios where unsupervised learning is applied
- Identify situations where PCA and other dimensionality reduction techniques may not be effective
- Work through practical examples of data on which to try unsupervised learning techniques
- Learn how to evaluate the performance of unsupervised learning methods
- Interpret and communicate the results of these models to each other
9.1 When PCA may not work
9.1.1 Non-linear data
- Non-linearity: data that lies on curved surfaces (manifolds) or has non-linear relationships between features.
- Single-cell data: Biological data where cell types form non-linear clusters in high-dimensional space
9.1.2 Categorical Features
- PCA may work poorly with categorical data unless properly encoded
- One-hot encoding categorical features can create sparse, high-dimensional data where PCA may not capture meaningful structure
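As a toy illustration (the column and category names here are made up), one-hot encoding even a single categorical column multiplies the number of features, and each new column is mostly zeros:
import pandas as pd
# Hypothetical example: one categorical column with four levels
df_cat = pd.DataFrame({'tissue': ['liver', 'brain', 'heart', 'liver', 'kidney']})
encoded = pd.get_dummies(df_cat).astype(int)
print(encoded.shape)   # (5, 4): one mostly-zero binary column per category level
print(encoded.head())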
9.2 Alternatives
9.2.1 t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Best for: Non-linear dimensionality reduction and visualization
- Key parameter: Perplexity (try values 5-50)
- Use case: Single-cell data, biological expression data, any non-linear clustering
NOTE (IMPORTANT CONCEPT): Sometimes t-SNE may not work well either! It is hard to predict which unsupervised machine learning technique will work best.
You just need to try a bunch of different techniques.
9.2.2 Hierarchical Clustering + Heatmaps
- Best for: Categorical data and understanding relationships between samples
- Use case: When you want to see how samples group together based on multiple features
9.2.3 Demonstrating how PCA or t-SNE may not work well
- Generate synthetic biological expression data: a matrix of 200 samples × 10 genes, where Gene_1 and Gene_2 form four corner clusters and the remaining eight genes are pure Gaussian noise. A scatter plot of Gene_1 vs Gene_2 shows that the true structure is non-linear and not aligned with any single variance direction, so PCA (or t-SNE) may fail to unfold these clusters into separate components.
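A minimal sketch of one way to generate such a matrix, assuming NumPy and pandas; the exact cluster centres, noise scale, and random seed are arbitrary choices:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Gene_1 and Gene_2: four corner clusters, 50 samples each
corners = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5]])
signal = np.repeat(corners, 50, axis=0) + rng.normal(scale=1.0, size=(200, 2))

# Gene_3 .. Gene_10: pure Gaussian noise
noise = rng.normal(size=(200, 8))

df = pd.DataFrame(np.hstack([signal, noise]),
                  columns=[f'Gene_{i}' for i in range(1, 11)])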
- Perform PCA on this data
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Apply PCA
pca = PCA()
pcs = pca.fit_transform(df)  # where df is a dataframe with your data
# Scatter plot of the first two principal components
plt.figure()
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA on Synthetic Biological Dataset')
plt.show()
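To see why the leading components may not isolate the clusters, it can help to inspect how the variance is spread across components:
# Fraction of variance captured by each principal component; with noisy
# synthetic data like this, variance is often spread across many components
print(pca.explained_variance_ratio_.round(3))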
- Let us try t-SNE on this data
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE()
tsne_results = tsne.fit_transform(df)
# plot
plt.figure()
plt.scatter(tsne_results[:, 0], tsne_results[:, 1])
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.title('t-SNE on Synthetic Biological Dataset')
plt.show()
- What if we try different values of perplexity?
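A sketch of one way to compare several perplexity values side by side; the specific values and figure size are arbitrary choices:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, perplexity in zip(axes, [5, 15, 30, 50]):
    emb = TSNE(perplexity=perplexity, random_state=0).fit_transform(df)
    ax.scatter(emb[:, 0], emb[:, 1], s=10)
    ax.set_title(f'perplexity = {perplexity}')
plt.tight_layout()
plt.show()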
What if the data has categorical features?
- PCA may not work well if you have categorical features.
For example, suppose you have data that looks like this:
  species tissue condition
0   human  liver  diseased
1   mouse  brain  diseased
2   human  liver  diseased
3   human  brain  diseased
4   mouse  brain   healthy
- We can then split the samples by condition (diseased vs. healthy) or by other categorical features.
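The clustering code below assumes an 'encoded_data' array from a one-hot encoding step; a minimal sketch, assuming the categorical table above lives in a DataFrame called df_meta (a hypothetical name):
import pandas as pd

encoded_df = pd.get_dummies(df_meta).astype(int)  # one binary column per category level
encoded_data = encoded_df.values                  # NumPy array for scipy's linkage()
print(encoded_df.columns.tolist())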
- Hierarchical clustering
Recall:
Leaves: Each leaf at the bottom of the dendrogram represents one sample from your dataset.
Branches: The branches connect the samples and groups of samples. The height of the branch represents the distance (dissimilarity) between the clusters being merged.
Height of Merges: Taller branches indicate that the clusters being merged are more dissimilar, while shorter branches indicate more similar clusters.
Clusters: By drawing a horizontal line across the dendrogram at a certain distance, you can define clusters. All samples below that line that are connected by branches form a cluster.
In the context of your one-hot encoded categorical data (species, tissue, condition), the dendrogram shows how samples are grouped based on their combinations of these categorical features.
Samples with the same or very similar combinations of categories will be closer together in the dendrogram and merge at lower distances.
The structure of the dendrogram reflects the relationships and similarities between the different combinations of species, tissue, and condition present in your synthetic dataset.
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
import seaborn as sns
# Assume 'encoded_data' exists from the previous one-hot encoding step
linked = linkage(y=encoded_data,
                 method='ward',
                 metric='euclidean',
                 optimal_ordering=True)
# plot dendrogram
plt.figure()
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram on One-Hot Encoded Categorical Data')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
# or use sns.clustermap()
sns.clustermap(data=encoded_data,
               method="ward",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True,
               cmap="vlag")
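To turn the "horizontal line across the dendrogram" idea from above into flat cluster labels, scipy's fcluster can cut the tree at a chosen distance; the threshold below is an arbitrary example value, best picked by eye from your dendrogram:
from scipy.cluster.hierarchy import fcluster

labels = fcluster(linked, t=1.5, criterion='distance')  # cut the tree at distance 1.5
print(labels)  # one integer cluster label per sample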
- Heatmaps
Heatmaps are a great way to visualize data and clustering results.
import seaborn as sns
import matplotlib.pyplot as plt
# Assume 'encoded_df' exists from the previous one-hot encoding step
plt.figure()
sns.heatmap(encoded_df.T, cmap='viridis', cbar_kws={'label': 'Encoded Value (0 or 1)'})  # Transpose for features on y-axis
plt.title('Heatmap of One-Hot Encoded Categorical Data')
plt.xlabel('Sample Index')
plt.ylabel('Encoded Feature')
plt.tight_layout()
plt.show()
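One simple way to evaluate the resulting clustering without ground-truth labels, tying back to the evaluation goal in the section objectives, is the silhouette score; this sketch assumes the 'labels' array from the fcluster example above:
from sklearn.metrics import silhouette_score

# Ranges from -1 to 1; higher means tighter, better-separated clusters.
# Requires at least two distinct clusters in 'labels'.
score = silhouette_score(encoded_df, labels)
print(f'Silhouette score: {score:.2f}')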
9.3 Exercises
- Break up into small groups and work on any one of the following small projects.
9.3.1 Project using electronic healthcare records data
9.3.2 Project using single-cell sequencing data
9.3.3 Project using GapMinder data
9.4 Summary
- Understand real-world scenarios where unsupervised learning is applied
- Identify situations where PCA and other dimensionality reduction techniques may not be effective
- Work through practical examples of data on which to try unsupervised learning techniques