6  t-SNE

Learning Objectives
  • Why PCA sometimes does not work
  • The intuition behind the curse of dimensionality
  • What is t-SNE?
  • How to use t-SNE

6.1 The curse of dimensionality

In very high-dimensional spaces, almost all the “volume” of a dataset lives near its corners, and pairwise Euclidean distances between points tend to concentrate around a single value. As the dimension n grows, the volume of a ball inscribed in the hypercube [-1, 1]^n shrinks toward zero, and the relative contrast (d_max - d_min) / d_min between the farthest and nearest distances among random points rapidly approaches zero. Intuitively, “nearest” and “farthest” neighbors become indistinguishable, so any method that relies on global distances (like k-Means) loses its ability to meaningfully separate points into clusters.
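
A quick way to build intuition is to sample random points at increasing dimensions and measure how the nearest and farthest pairwise distances converge. The following is a minimal sketch using NumPy and SciPy; the sample size of 500 points is an arbitrary choice for illustration.

  # Distance concentration: as the dimension n grows, the gap between the
  # farthest and nearest pairwise distances shrinks relative to d_min.
  import numpy as np
  from scipy.spatial.distance import pdist

  rng = np.random.default_rng(0)
  for n in [2, 10, 100, 1000]:
      points = rng.uniform(-1, 1, size=(500, n))  # 500 random points in [-1, 1]^n
      d = pdist(points)                           # all pairwise Euclidean distances
      contrast = (d.max() - d.min()) / d.min()    # relative contrast
      print(f"n = {n:4d}   relative contrast = {contrast:.3f}")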

k-Means clustering exemplifies this breakdown: it repeatedly assigns each point to its nearest centroid based on squared-distance comparisons. When all inter-point distances look almost the same, tiny shifts in centroid positions barely affect those assignments, leading to noisy labels and flat optimization landscapes with no clear gradients. In practice, k-Means can “get stuck” or fail to discover any meaningful grouping once dimensions rise into the dozens or hundreds.
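
This instability is easy to observe by running k-Means several times on the same high-dimensional data. The sketch below, assuming scikit-learn is available, clusters two barely separated Gaussian blobs and reports how much repeated runs agree; the dimension and separation are illustrative choices, not a benchmark.

  # k-Means instability in high dimensions: with almost-uniform distances,
  # different random initializations produce essentially unrelated labelings.
  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import adjusted_rand_score

  rng = np.random.default_rng(1)
  n_cells, n_genes = 400, 500
  centers = rng.normal(0, 0.05, size=(2, n_genes))  # two barely separated centers
  X = np.vstack([c + rng.normal(size=(n_cells // 2, n_genes)) for c in centers])

  # n_init=1 on purpose, so each run keeps its own (arbitrary) local optimum.
  runs = [KMeans(n_clusters=2, n_init=1, random_state=s).fit_predict(X)
          for s in range(3)]
  for a in range(3):
      for b in range(a + 1, 3):
          print(f"runs {a} vs {b}: ARI = {adjusted_rand_score(runs[a], runs[b]):.2f}")

An adjusted Rand index (ARI) near 1 would mean the runs found the same grouping; values near 0 mean the labels are essentially arbitrary.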

t-SNE sidesteps these problems by focusing on local similarities rather than global distances. It first converts pairwise distances in the high-dimensional space into a distribution of affinities p_ij using Gaussian kernels centered on each point. It then searches for a low-dimensional embedding whose Student-t affinity distribution q_ij best matches p_ij, by minimizing the Kullback–Leibler divergence between the two distributions. By emphasizing the preservation of each point’s nearest neighbors and using a heavy-tailed low-dimensional kernel to push dissimilar points apart, t-SNE highlights local clusters even when global geometry has become uninformative, making it a far more effective visualization and exploratory tool in very high dimensions.
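
Written out, these are the standard t-SNE quantities from van der Maaten and Hinton (2008), where x_i are the high-dimensional points, y_i their low-dimensional images, N the number of points, and each bandwidth σ_i is tuned so that the conditional distribution p_{j|i} has a user-specified perplexity:

  p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
                 {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
  \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

  q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
                {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}},
  \qquad C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

The embedding {y_i} is the one that minimizes C, found by gradient descent.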

6.2 Simplified explanations

Think of each cell as a point in a space where each gene’s activity is its own “axis.” When you have only a few genes (low dimensions), you can tell cells apart by how far apart they sit in that space. But as you add more genes, almost every cell ends up about the same distance from every other cell—so you lose any useful sense of “close” or “far.”

Why k-Means fails:
k-Means tries to draw boundaries around groups by asking “Which centroid (group center) is each cell closest to?” In very high-dimensional gene spaces, every cell is nearly the same distance from all centroids. Small moves of the centroids don’t change which cells get assigned to them, so k-Means can’t find real groupings.

Why t-SNE helps:
t-SNE ignores the idea of absolute distance and instead asks, “Which cells are each cell’s few nearest neighbors?” It builds a map that keeps those local neighborhoods intact. In the final 2D picture, cells that were neighbors in the huge gene space stay neighbors on the screen, while cells that weren’t neighbors get pushed apart. This way, you still see meaningful clusters (e.g., cell types) even when dealing with hundreds or thousands of genes.

6.3 Simple code to perform t-SNE
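
What follows is a minimal sketch using scikit-learn’s TSNE; the simulated expression matrix, the perplexity value, and the plotting choices are illustrative assumptions, and any cells-by-genes array can be substituted for X.

  # Embed a high-dimensional expression matrix (cells x genes) into 2D with t-SNE.
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.manifold import TSNE

  rng = np.random.default_rng(42)

  # Simulated data: 300 "cells" drawn from three expression programs over
  # 1000 "genes" (a stand-in for a real counts matrix).
  programs = rng.normal(0, 1, size=(3, 1000))
  cell_type = np.repeat([0, 1, 2], 100)
  X = programs[cell_type] + rng.normal(0, 0.5, size=(300, 1000))

  # Perplexity roughly sets the effective neighborhood size; 30 is a common default.
  tsne = TSNE(n_components=2, perplexity=30, random_state=42)
  embedding = tsne.fit_transform(X)  # shape (300, 2)

  plt.scatter(embedding[:, 0], embedding[:, 1], c=cell_type, s=10)
  plt.xlabel("t-SNE 1")
  plt.ylabel("t-SNE 2")
  plt.show()

Because only local neighborhoods are preserved, the sizes of clusters and the distances between them in the plot should not be over-interpreted, and it is good practice to rerun with several perplexity values (say, 5 to 50); these pitfalls are discussed at length in “How to Use t-SNE Effectively” [1].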

6.4 Exercise: Building intuition on how to use t-SNE

6.5 Summary

Key Points
  • In very high-dimensional spaces, pairwise distances concentrate around a single value, so “near” and “far” lose their meaning.
  • Methods that rely on global distances, such as k-Means, stop finding real groupings once dimensions reach the dozens or hundreds.
  • t-SNE preserves each point’s local neighborhoods instead, matching Gaussian affinities p_ij with a heavy-tailed Student-t distribution q_ij.
  • In the resulting 2D map, cells that were neighbors in gene space stay neighbors, so meaningful clusters such as cell types remain visible.

6.6 References

[1] Wattenberg, M., Viégas, F., and Johnson, I. “How to Use t-SNE Effectively.” Distill, 2016. https://distill.pub/2016/misread-tsne/