7 tSNE
Learning objectives:
- Why PCA does not work sometimes
- The intuition around the curse of dimensionality
- What is tSNE?
- How to use tSNE
7.1 The curse of dimensionality
In very high-dimensional spaces, almost all the “volume” of a dataset lives near its corners, and pairwise Euclidean distances between points tend to concentrate around a single value.
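A quick way to see this for yourself is to simulate it: draw random points in a low- and a high-dimensional unit cube, compute all pairwise Euclidean distances, and compare how spread out those distances are. This is a minimal sketch; the numbers of points and dimensions are arbitrary choices:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for n_dims in [2, 10, 100, 1000]:
    # 200 random points in the unit hypercube of this dimension
    X = rng.random((200, n_dims))
    d = pdist(X)  # all pairwise Euclidean distances
    # Relative spread of distances: (max - min) / min
    spread = (d.max() - d.min()) / d.min()
    print(f"{n_dims:5d} dimensions: relative spread = {spread:.2f}")

As the number of dimensions grows, the printed spread shrinks: the nearest and farthest points become almost equally far away.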
7.2 Simplified explanations
Think of each cell as a point in a space where each gene’s activity is its own “axis.” When you have only a few genes (low dimensions), you can tell cells apart by how far apart they sit in that space. But as you add more genes, almost every cell ends up about the same distance from every other cell—so you lose any useful sense of “close” or “far.”
Why k-Means fails:
k-Means tries to draw boundaries around groups by asking “Which centroid (group center) is each cell closest to?” When there are very many genes, every cell is nearly the same distance from all centroids. Small moves of the centroids don’t change which cells get assigned to them, so k-Means can’t find real groupings.
Why t-SNE helps:
t-SNE ignores the idea of absolute distance and instead asks, “Which cells are each cell’s few nearest neighbors?” It builds a map that keeps those local neighborhoods intact. In the final 2D picture, cells that were neighbors in the huge gene space stay neighbors on the screen, while cells that weren’t neighbors get pushed apart. This way, you still see meaningful clusters (e.g., cell types) even when dealing with hundreds or thousands of genes.
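To make “each cell’s few nearest neighbors” concrete, here is a tiny sketch that finds the five nearest neighbors of each cell in a made-up expression matrix (the matrix is random noise, purely for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
expression = rng.random((100, 2000))  # 100 cells x 2000 genes (fake data)

# For each cell, find its 5 nearest neighbors in gene space
nn = NearestNeighbors(n_neighbors=6).fit(expression)  # 6 = the cell itself + 5 others
distances, indices = nn.kneighbors(expression)

print(indices[0, 1:])  # the 5 nearest neighbors of cell 0 (excluding itself)

It is exactly this list of neighbors, not the raw distances, that t-SNE tries to preserve in its 2D map.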
7.3 TLDR (Simple explanation)
t-SNE (pronounced “tee-snee”) is a tool that helps us look at complex data by making it easier to see patterns.
7.3.1 Imagine this:
- You have a big box of mixed beads. Each bead has many features: color, size, shape, weight, etc.
- It is hard to see how the beads are similar or different just by looking at all these features at once.
7.3.2 What t-SNE does:
- t-SNE takes all those features and creates a simple map (like a 2D picture).
- In this map, beads that are similar to each other are placed close together.
- Beads that are very different are placed far apart.
7.4 Pictorial explanation of tSNE
High-dimensional beads (hard to see groups):
[🔴] [🔵] [🟢] [🟡] [🔴] [🟢] [🔵] [🟡] [🔴] [🟢] [🔵] [🟡]
Each bead has many features (color, size, shape, etc.)
|
v
t-SNE makes a simple 2D map:
[🔴] [🔴] [🔴]      [🔵] [🔵] [🔵]
[🟢] [🟢] [🟢]
[🟡] [🟡] [🟡]
Now, similar beads are grouped together.
In summary:
t-SNE is like a magic tool that turns complicated data into a simple picture, so we can easily see groups and patterns—even if we do not understand the math behind it!
7.4.1 Why is this useful?
- It helps us see groups or clusters in our data.
- We can spot patterns, like which beads are most alike, or if there are outliers.
- Emphasis on preserving local structure.
7.5 Why t‑SNE Works in High Dimensions
- Bypasses global distance concentration by focusing on nearest neighbors (see the check sketched below).
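One way to check this for yourself is to compare each point’s nearest neighbors before and after embedding. The sketch below uses the Iris data (introduced properly in the hands-on exercise later) and computes the average fraction of each point’s 10 nearest neighbors in the original space that remain among its 10 nearest neighbors in the t-SNE map; the exact number will vary between runs:

import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

X = datasets.load_iris().data
X_2d = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(X)

k = 10
# Neighbor indices in the original 4-dimensional space and in the 2D map (dropping each point itself)
nn_high = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]
nn_low = NearestNeighbors(n_neighbors=k + 1).fit(X_2d).kneighbors(X_2d, return_distance=False)[:, 1:]

# Average fraction of high-dimensional neighbors that are preserved in the embedding
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)])
print(f"Average neighborhood preservation: {overlap:.2f}")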
7.6 What does tSNE look like compared to PCA?
In very high-dimensional spaces, almost all the “volume” of a dataset lives near its corners, and pairwise Euclidean distances between points tend to concentrate around a single value. As dimension \(n\) grows, the volume of an inscribed ball in the hypercube \([-1,1]^n\) shrinks toward zero, and the ratio
\[ \frac{\max d - \min d}{\min d} \]
for distances \(d\) between random points rapidly approaches zero. Intuitively, “nearest” and “farthest” neighbors become indistinguishable, so any method that relies on global distances (like k-Means) loses its ability to meaningfully separate points into clusters.
k-Means clustering exemplifies this breakdown: it repeatedly assigns each point to its nearest centroid based on squared-distance comparisons. When all inter-point distances look almost the same, tiny shifts in centroid positions barely affect those assignments, leading to noisy labels and flat optimization landscapes with no clear gradients. In practice, k-Means can “get stuck” or fail to discover any meaningful grouping once dimensions rise into the dozens or hundreds.
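As a rough illustration of this breakdown (a toy experiment, not a formal result), you can take two well-separated groups, bury them in a growing number of pure-noise dimensions, and watch how well k-Means recovers the true labels; the exact scores depend on the noise level and the random seed:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_per_group = 100
labels = np.repeat([0, 1], n_per_group)

for n_noise in [0, 10, 100, 1000]:
    # Two groups separated along a single informative dimension
    signal = np.concatenate([rng.normal(0, 1, n_per_group),
                             rng.normal(3, 1, n_per_group)])[:, None]
    noise = rng.normal(0, 1, (2 * n_per_group, n_noise))
    X = np.hstack([signal, noise])

    # Cluster and compare against the true labels (1.0 = perfect recovery, 0 = chance)
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"{n_noise:5d} noise dimensions: ARI = {adjusted_rand_score(labels, pred):.2f}")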
t-SNE sidesteps these problems by focusing only on local similarities rather than global distances. It first converts pairwise distances in the high-dimensional space into a distribution of affinities \(p_{ij}\) using Gaussian kernels centered on each point. Then it searches for a low-dimensional embedding whose Student-t affinity distribution \(q_{ij}\) best matches \(p_{ij}\). By emphasizing the preservation of each point’s nearest neighbors and using a heavy-tailed low-dimensional kernel to push dissimilar points apart, t-SNE highlights local clusters even when global geometry has become uninformative—making it a far more effective visualization and exploratory tool in very high dimensions.
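The toy sketch below walks through those two affinity distributions on a handful of random points. For simplicity it uses a single fixed Gaussian bandwidth for every point rather than the per-point bandwidths that real t-SNE derives from the perplexity, so treat it as an illustration of the formulas rather than a working t-SNE implementation:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_high = rng.random((6, 50))   # 6 points in 50 dimensions (toy data)
Y_low = rng.random((6, 2))     # a candidate 2D embedding (random here)

# High-dimensional affinities p_ij: Gaussian kernel on squared pairwise distances
# (real t-SNE chooses a separate bandwidth per point from the perplexity)
sigma = 1.0
D_high = squareform(pdist(X_high)) ** 2
P = np.exp(-D_high / (2 * sigma ** 2))
np.fill_diagonal(P, 0)
P = P / P.sum()                # normalize so the affinities sum to 1

# Low-dimensional affinities q_ij: heavy-tailed Student-t kernel (1 degree of freedom)
D_low = squareform(pdist(Y_low)) ** 2
Q = 1.0 / (1.0 + D_low)
np.fill_diagonal(Q, 0)
Q = Q / Q.sum()

# KL divergence between P and Q -- the quantity t-SNE minimizes by moving the low-dimensional points
kl = np.sum(P[P > 0] * np.log(P[P > 0] / Q[P > 0]))
print(f"KL(P || Q) = {kl:.3f}")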
7.7 Simple code to perform tSNE (hands-on exercise)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # The features (measurements)
y = iris.target  # The species labels (0, 1, 2)

# Run t-SNE to reduce the data to 2 dimensions
tsne = TSNE(n_components=2, random_state=0, perplexity=30)
X_2d = tsne.fit_transform(X)

# Plot the results, one species at a time
plt.figure(figsize=(8, 6))

# Setosa (label 0)
plt.scatter(X_2d[y == 0, 0], X_2d[y == 0, 1], color='red', label='setosa')

# Versicolor (label 1)
plt.scatter(X_2d[y == 1, 0], X_2d[y == 1, 1], color='green', label='versicolor')

# Virginica (label 2)
plt.scatter(X_2d[y == 2, 0], X_2d[y == 2, 1], color='blue', label='virginica')

plt.xlabel("t-SNE feature 1")
plt.ylabel("t-SNE feature 2")
plt.title("t-SNE visualization of the Iris dataset")
plt.legend()
plt.show()
7.8 Exercise: tSNE is stochastic
- Stochasticity: play around with the random_state parameter. Does your tSNE plot look different to the person you are seated next to?
- Play around with the perplexity parameter (pair up with someone).
- Which value of perplexity should you use? (A sketch for comparing several values follows below.)
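If you would rather compare settings systematically than edit the script by hand, here is a minimal sketch that reruns t-SNE on the Iris data for several perplexity values and draws the resulting maps side by side; the particular values tried are just suggestions:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE

iris = datasets.load_iris()
X, y = iris.data, iris.target

perplexities = [2, 5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(16, 4))

for ax, perp in zip(axes, perplexities):
    # Re-run t-SNE with a different perplexity for each panel
    X_2d = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=15)
    ax.set_title(f"perplexity = {perp}")

plt.tight_layout()
plt.show()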
7.9 Exercise: Building intuition on how to use tSNE
Let us read the paper How to use t-SNE effectively (Wattenberg, Viégas, and Johnson 2017).
- Distances are not preserved
- Normal does not always look normal
7.10 Key Concept
- Recall that unsupervised machine learning can help you come up with new hypotheses
- Vary the perplexity parameter: ideally your patterns or hypotheses should be true even if you change perplexity
7.11 Exercise: hands-on practical applying tSNE to another dataset
- Load the US Arrests data in Python and perform tSNE on this (pair up with a person).
- How would you evaluate this?
- Vary the perplexity parameter.
Some code to help you get started is here:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Load the US Arrests data
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/main/course_files/data/USArrests.csv"
df = pd.read_csv(url, index_col=0)

# Prepare the data for t-SNE
X = df.values  # Convert to numpy array
# Fill in your code here ..........
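If you get stuck, one possible way to continue is sketched below (continuing from the starter code above, so it assumes df, X and plt are already defined). Because the USArrests columns are on very different scales, it is common to standardize them before running t-SNE, and with only 50 states the perplexity has to stay well below the number of samples. Treat these choices as suggestions rather than the answer:

from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Standardize the columns so no single variable dominates the distances
X_scaled = StandardScaler().fit_transform(X)

# With only 50 states, perplexity must be well below the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_2d = tsne.fit_transform(X_scaled)

# Plot and label each state
plt.scatter(X_2d[:, 0], X_2d[:, 1])
for state, (x0, x1) in zip(df.index, X_2d):
    plt.annotate(state, (x0, x1), fontsize=7)
plt.show()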
7.12 Summary
- High dimensions make global distances meaningless.
- Methods that leverage local structure (t‑SNE) can still find patterns.
- tSNE is stochastic and can be hard to interpret.
- Vary the perplexity parameter (ideally your patterns or hypotheses should be true even if you change perplexity).