8  Conceptual and mathematical basis of PCA

Tip: Learning Objectives
  • Learn the concepts and mathematical basics behind PCA

8.1 Intuitive explanation of PCA

Explanation of PCA (by StatQuest)

8.2 Differences between PCA and linear regression

Does the figure above look similar to linear regression? Is PCA the same as linear regression?

Tip

NOTE (IMPORTANT CONCEPT): PCA is not linear regression. It looks similar though, does it not?

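Linear regression is a predictive model: it fits a response variable as a function of one or more predictors. PCA is not. It has no response variable, so you cannot use PCA to predict anything. You can use PCA only to pick out patterns in your data.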
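The difference can be seen numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are illustrative assumptions, not part of the original material). It fits a regression line of y on x, a regression line of x on y, and the first principal component of the same two variables, then compares their slopes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: y depends on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# Linear regression of y on x: minimizes squared vertical distances to the line
reg_slope = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]

# Regression of x on y, expressed as a slope in the same (x, y) plane
inverse_slope = 1.0 / LinearRegression().fit(y.reshape(-1, 1), x).coef_[0]

# PCA on both variables: PC1 minimizes squared perpendicular distances to the line
pc1 = PCA(n_components=1).fit(np.column_stack([x, y])).components_[0]
pca_slope = pc1[1] / pc1[0]

print(f"regression y~x slope: {reg_slope:.3f}")
print(f"PC1 slope:            {pca_slope:.3f}")
print(f"regression x~y slope: {inverse_slope:.3f}")
```

Regression treats one variable as the response, so swapping x and y gives a different line. PCA treats both variables symmetrically, and its PC1 direction falls between the two regression lines.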

8.3 📊 Key Concepts

8.3.1 1. Scores and Loadings

What is being plotted on the axes (PC1 and PC2) are the scores.

The scores for each principal component are calculated as follows:

\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]

where \(X\), \(Y\) and \(Z\) are the normalized features.

The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm. These are called the loadings.
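The formula above can be checked directly. The sketch below assumes scikit-learn and a small synthetic data set with three features (both are illustrative choices). In scikit-learn the weights are stored in `pca.components_`; note that some texts reserve the word "loadings" for a rescaled version of these weights.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: 50 samples, 3 features (X, Y, Z)
rng = np.random.default_rng(42)
data = rng.normal(size=(50, 3))
data[:, 1] += 0.8 * data[:, 0]  # make two features correlated

# Normalize the features, then fit PCA
data_scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
scores = pca.fit_transform(data_scaled)  # shape (50, 2): one row of scores per sample

# Weights of each feature in each PC, shape (2, 3)
loadings = pca.components_
alpha, beta, gamma = loadings[0]

# Rebuild the PC1 scores by hand as a weighted sum of the normalized features
pc1_manual = (alpha * data_scaled[:, 0]
              + beta * data_scaled[:, 1]
              + gamma * data_scaled[:, 2])

print(np.allclose(pc1_manual, scores[:, 0]))
```

The hand-computed weighted sum matches the first column of the scores returned by `fit_transform`, which is what gets plotted on the PC1 axis.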

8.3.2 2. Linear combinations

Tip

NOTE (IMPORTANT CONCEPT): The principal components are linear combinations of the original features. Hence they can be a bit difficult to interpret.
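One common aid to interpretation is to rank the original features by the absolute size of their weight in a component. The sketch below is only an illustration: the feature names (`gene_A` to `gene_D`) and the synthetic data are hypothetical, constructed so that two features share a common signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_names = ["gene_A", "gene_B", "gene_C", "gene_D"]  # hypothetical names

# gene_A and gene_B share a common signal; gene_C and gene_D are pure noise
shared = rng.normal(size=100)
data = np.column_stack([
    shared + 0.1 * rng.normal(size=100),
    shared + 0.1 * rng.normal(size=100),
    rng.normal(size=100),
    rng.normal(size=100),
])

data_scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(data_scaled)

# Rank features by the absolute value of their PC1 weight
pc1_loadings = pca.components_[0]
order = np.argsort(-np.abs(pc1_loadings))
for i in order:
    print(f"{feature_names[i]}: {pc1_loadings[i]:+.3f}")

top_two = {feature_names[i] for i in order[:2]}
```

Features with large absolute weights contribute most to the component. The sign of a whole component is arbitrary, so only relative signs within a component are meaningful.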

8.3.3 3. Variance

  • Variance = how spread out the data is.
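  • PCA finds directions (principal components) that maximize variance: PC1 captures the most variance, and each subsequent PC captures the most remaining variance while being orthogonal to the previous ones.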
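The variance idea can be illustrated with a short sketch on synthetic 2D data (the data and the random-direction check are assumptions made for illustration). It compares the variance of the data projected onto PC1 with the variance along many randomly chosen directions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, illustrative data: spread out along the first axis
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

pca = PCA(n_components=2).fit(data)
scores = pca.transform(data)

# Variance of the scores along each PC (sample variance, ddof=1)
score_var = scores.var(axis=0, ddof=1)

# Fraction of the total variance captured by each PC
ratio = pca.explained_variance_ratio_
print("explained variance ratio:", ratio)

# No random unit direction should beat PC1 in projected variance
centered = data - data.mean(axis=0)
max_random_var = 0.0
for _ in range(1000):
    direction = rng.normal(size=2)
    direction /= np.linalg.norm(direction)
    max_random_var = max(max_random_var, (centered @ direction).var(ddof=1))

print(max_random_var <= pca.explained_variance_[0] + 1e-9)
```

The `explained_variance_ratio_` attribute is often used to decide how many components are worth keeping.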

8.4 🔬 Example: Gene Expression Data

  • Rows = samples (patients)
  • Columns = genes (expression levels)

8.4.1 Goal:

  • Reduce dimensionality from about 20,000 genes to 2 or 3 PCs
  • Visualize patterns between patient groups (e.g., healthy vs. cancer)

# Sample Python code (requires numpy, scikit-learn, matplotlib)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = ...  # gene expression matrix, shape (n_samples, n_genes); replace with your data

# Standardize each gene to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # scores, shape (n_samples, 2)

# Plot each sample by its PC1 and PC2 scores
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Gene Expression')
plt.show()

8.5 Summary

Tip: Key Points
  • PCA is not linear regression!
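  • Scores are the coordinates of each sample along the principal components; loadings are the weights of the original features in each component.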