!pip install umap-learn
12 UMAP (Optional)
- Brief introduction to UMAP
- Code to perform UMAP on data
13 UMAP Intuition
13.1 What is UMAP?
UMAP (Uniform Manifold Approximation and Projection) is another powerful unsupervised machine learning technique that helps visualize high-dimensional data in 2D or 3D. Think of it as a “smart cartographer” that creates a map of your complex data, revealing hidden patterns.
13.2 The Core Idea: Preserve Both Local and Global Structure
13.2.1 The Problem We are Solving
- Say you have cells with 20,000+ gene measurements
- You want to see which cells are similar to each other
- You want to understand both local neighborhoods AND global structure
- But 20,000 dimensions are impossible to visualize!
13.2.2 The Solution
UMAP takes your high-dimensional data and creates a 2D map where:
- Similar cells stay close together (local structure preserved)
- Different cells stay far apart (global structure preserved)
- Both local neighborhoods AND global relationships are maintained
13.3 How UMAP Works: The Intuition
13.3.1 Step 1: Build a Graph of Relationships
Original Space (20,000+ genes):
Cell A: [Gene1=5, Gene2=10, Gene3=2, ... Gene20000=8]
Cell B: [Gene1=6, Gene2=11, Gene3=3, ... Gene20000=9]
Cell C: [Gene1=50, Gene2=100, Gene3=20, ... Gene20000=80]
UMAP creates a "friendship network":
- A and B are close friends (very similar)
- A and C are distant acquaintances (very different)
- B and C are also distant acquaintances
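The "friendship" idea can be checked with a few lines of NumPy. This is only a sketch: it uses just the four gene values shown for each cell (the elided genes are ignored), and it computes plain Euclidean distances. UMAP itself builds a weighted nearest-neighbor graph rather than using raw distances directly.

```python
import numpy as np

# Toy cells: only the four values shown above (Gene1, Gene2, Gene3, Gene20000).
cells = {
    "A": np.array([5.0, 10.0, 2.0, 8.0]),
    "B": np.array([6.0, 11.0, 3.0, 9.0]),
    "C": np.array([50.0, 100.0, 20.0, 80.0]),
}
names = list(cells)
X = np.stack([cells[n] for n in names])

# Pairwise Euclidean distances between all cells
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]}-{names[j]}: {D[i, j]:.2f}")
```

A and B come out far closer to each other than either is to C, which is the relationship the graph is meant to capture.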
13.3.2 Step 2: Create a 2D Map
UMAP creates a 2D layout where:
- Close friends (A and B) are placed near each other
- Distant acquaintances (A and C, B and C) are placed far apart
- The overall "social network" structure is preserved
13.4 The “Manifold” Concept
13.4.1 What is a Manifold?
Think of a manifold like the surface of a balloon:
- From far away, it looks like a simple sphere
- Up close, you can see it is actually a 2D surface curved in 3D space
- Your high-dimensional biological data might be “curved” in ways we can’t see
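To make the balloon picture concrete, the sketch below compares two ways of measuring the distance between two points on a unit sphere: straight through the inside, and along the surface. The two points are arbitrary choices for illustration.

```python
import numpy as np

# Two points on the unit sphere: the "north pole" and a point on the equator.
p = np.array([0.0, 0.0, 1.0])
q = np.array([1.0, 0.0, 0.0])

# Straight-line distance cuts through the inside of the balloon.
straight_line = np.linalg.norm(p - q)

# Great-circle distance follows the curved surface.
along_surface = np.arccos(np.clip(p @ q, -1.0, 1.0))

print(f"straight-line distance: {straight_line:.3f}")
print(f"along-surface distance: {along_surface:.3f}")
```

The surface distance is longer than the straight-line one. Manifold methods such as UMAP try to respect distances along the surface, which is why they rely on small local neighborhoods where the two are nearly the same.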
13.4.2 Why “Uniform”?
UMAP assumes your data is spread “uniformly” across this curved surface:
- No empty regions (uniform coverage)
- No overly crowded regions (uniform density)
- This helps create a balanced, interpretable map
13.5 UMAP vs t-SNE: Key Differences
13.5.1 What UMAP Does Better
Preserves Global Structure:
- t-SNE: Focuses mainly on local neighborhoods
- UMAP: Maintains both local AND global relationships
13.6 Key Parameters to Understand
13.6.1 n_neighbors (Default: 15)
- Controls how many “friends” each cell considers
- Low (5-10): Focus on very close neighbors, creates many small clusters
- High (30-50): Consider more distant neighbors, creates fewer, larger clusters
- Default (15): Usually works well for most datasets
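One way to get a feel for n_neighbors is to look at how the neighbor graph grows as the value increases. The sketch below uses scikit-learn's kneighbors_graph on the Iris data as a stand-in. This is an assumption for illustration: UMAP builds its own weighted graph internally, but the number of neighbors considered per point scales the same way.

```python
from sklearn import datasets
from sklearn.neighbors import kneighbors_graph

X = datasets.load_iris().data  # 150 samples, 4 features

edge_counts = {}
for k in (5, 15, 50):
    # Each sample is linked to its k nearest neighbors (itself excluded).
    graph = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    edge_counts[k] = graph.nnz
    print(f"n_neighbors={k}: {graph.nnz} directed edges")
```

With more edges per point, distant points become connected, which is why larger values tend to merge small clusters into bigger ones.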
13.6.2 min_dist (Default: 0.1)
- Controls how tightly packed points can be in the final map
- Low (0.01): Points can be very close together (tight clusters)
- High (0.5): Points spread out more (looser clusters)
- Default (0.1): Good balance between tightness and readability
13.6.3 metric (Default: ‘euclidean’)
- How to measure distances between cells
- ‘euclidean’: Standard geometric distance (good for most data)
- ‘cosine’: Angle-based distance (good for normalized data)
- ‘manhattan’: City-block distance (good for sparse data)
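The three metrics can give quite different answers for the same pair of points. In this small sketch the two vectors are made up for illustration: b points in the same direction as a but is twice as long.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

euclidean = np.linalg.norm(a - b)   # straight-line distance
manhattan = np.abs(a - b).sum()     # sum of absolute differences
cosine = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cos(angle)

print(f"euclidean: {euclidean:.3f}")
print(f"manhattan: {manhattan:.3f}")
print(f"cosine:    {cosine:.3f}")
```

Euclidean and Manhattan distance both see the two vectors as different, while cosine distance is essentially zero because only the direction matters. This is why cosine can be a reasonable choice when overall magnitude (for example, sequencing depth) should be ignored.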
13.7 UMAP vs Other Methods
13.7.1 UMAP vs PCA
- PCA: Linear method, preserves variance, good for linear relationships
- UMAP: Non-linear method, preserves local structure, good for complex relationships
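For comparison, the PCA side of this contrast can be sketched with scikit-learn on the Iris data. The dataset and the choice of two components are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize so that each feature contributes on the same scale.
X = StandardScaler().fit_transform(datasets.load_iris().data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # one 2D point per sample
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Each PCA axis is a fixed linear combination of the original features, so the axes can be interpreted directly. UMAP axes have no such meaning; only the relative positions of points matter.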
13.7.2 UMAP vs t-SNE
- t-SNE: Great for local structure, slower, harder to interpret distances
- UMAP: Good for both local and global structure, faster, more interpretable
13.8 Practical Tips
13.8.1 1. Start with Default Parameters
- UMAP’s defaults work well for most biological data
- Don’t over-optimize parameters initially
13.8.2 2. Try Different n_neighbors Values
- 5-10: If you want to see fine-grained subpopulations
- 15-30: For general exploration (recommended)
- 50+: If you want to see only major cell types
13.8.3 3. Adjust min_dist for Readability
- 0.01-0.05: If points are too spread out
- 0.1-0.3: Default range (recommended)
- 0.5+: If clusters are too tight
13.8.4 4. Use Multiple Runs
- UMAP has some randomness
- Run multiple times to ensure results are consistent
- Use the random_state parameter for reproducible results
13.8.5 5. Validate with Biology
- Always check if UMAP results make biological sense
- Compare with known cell type markers
- Look for expected developmental trajectories
13.9 Summary
UMAP is like a smart cartographer who:
1. Studies your high-dimensional data (20,000+ genes/proteins)
2. Identifies both local neighborhoods AND global relationships
3. Creates a beautiful 2D map that preserves both types of structure
4. Reveals hidden patterns you couldn’t see before
The key insight: UMAP preserves both local and global structure - cells that are similar stay close together, while the distances between different cell types remain meaningful.
This makes UMAP perfect for biologists who want to understand the complete structure of their complex, high-dimensional data - from individual cell relationships to overall tissue organization!
Remember: UMAP is a tool for exploration and visualization, not a replacement for careful analysis!
13.10 Hands-on with UMAP
Using UMAP in code is very similar to how we used t-SNE.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import umap
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")  # nice simple plots

iris = datasets.load_iris()
X_iris = iris.data  # 150 samples, 4 features (sepal/petal lengths/widths)
y_iris = iris.target  # species labels (0,1,2)
labels_iris = iris.target_names

# Standardize features
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

# Run UMAP
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding_iris = umap_model.fit_transform(X_iris_scaled)

# Put into DataFrame for plotting
df_iris = pd.DataFrame({
    "UMAP1": embedding_iris[:, 0],
    "UMAP2": embedding_iris[:, 1],
    "species": [labels_iris[i] for i in y_iris]
})

plt.figure()
sns.scatterplot(data=df_iris, x="UMAP1", y="UMAP2", hue="species", s=60)
plt.title("UMAP on Iris data")
plt.legend(loc="best")
plt.show()
Quick tips:
- n_neighbors (default 15): how many neighbors UMAP uses to learn local structure. Smaller => captures very local structure (more fragmentation); larger => more global structure.
- min_dist (default 0.1): how tightly points are packed in the low-dimensional space. Smaller => tighter clusters; larger => more spread out.
- Always standardize or log-transform expression data before UMAP (depending on data type).
- Try different random_state values or parameters to see what changes.
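The preprocessing tip can be sketched on a count matrix as follows. The counts here are simulated with a Poisson distribution purely for illustration (200 cells, 50 genes); they are not real expression data, and the right transformation depends on the data type.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=(200, 50))  # simulated: 200 cells x 50 genes

# log1p compresses large counts and is defined at zero.
log_counts = np.log1p(counts)

# Standardize each gene to mean 0 and unit variance.
scaled = StandardScaler().fit_transform(log_counts)

print(scaled.shape)
print(scaled.mean(axis=0)[:3].round(3), scaled.std(axis=0)[:3].round(3))
```

The scaled matrix can then be passed to umap.UMAP(...).fit_transform in the same way as X_iris_scaled in the hands-on code.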
Exercises for learners:
- Change n_neighbors to 5 and then to 50 and observe how the plot changes.
- Change min_dist to 0.01 and 0.8 and observe clustering differences.
- Replace the Iris data with a small real gene-expression matrix and try the pipeline: counts -> log1p -> StandardScaler -> UMAP.
13.11 Summary
- A brief introduction to UMAP