6  Normalizing your data and PCA

7 Introduction

This chapter demonstrates basic unsupervised machine learning concepts using Python.

Learning Objectives
  • Understand the difference between supervised and unsupervised learning.
  • Apply PCA and clustering to example data.
  • Visualize results.

7.1 Normalization (Z-score Standardization)

Normalization, specifically Z-score standardization, is a data scaling technique that transforms your data to have a mean of 0 and a standard deviation of 1. This is useful for many machine learning algorithms that are sensitive to the scale of input features.

The formula for Z-score is:

\[ z = \frac{x - \mu}{\sigma} \]

Where:
  • \(x\) is the original data point.
  • \(\mu\) is the mean of the data.
  • \(\sigma\) is the standard deviation of the data.
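As a quick numerical check: a data point \(x = 65\) from a distribution with \(\mu = 50\) and \(\sigma = 10\) has a z-score of \(z = (65 - 50)/10 = 1.5\), i.e. it lies 1.5 standard deviations above the mean.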

For example, say you have two variables or features on very different scales.

Age   Weight (grams)
 25            65000
 30            70000
 35            75000
 40            80000
 45            85000
 50            90000
 55            95000
 60           100000
 65           105000
 70           110000
 75           115000
 80           120000

If these are not brought onto similar scales, weight will have a disproportionate influence on whatever machine learning model we build.

Hence we normalize each feature separately, i.e. age is normalized using the mean and standard deviation of age, and weight is normalized using the mean and standard deviation of weight.

Original data:
Age: mean=43.6, std=13.1
Weight: mean=69.8, std=9.8

  • In an ideal scenario, a feature such as weight might be transformed in the following way after normalization:

  • And here is what it might look like for a feature such as age:
Z-scored mean: -0.00, std: 1.00
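As a concrete check, the age/weight table above can be z-scored column by column directly from the formula. A minimal sketch using pandas (the DataFrame name and column names are assumptions for illustration):

import pandas as pd

# Example data from the table above (age in years, weight in grams)
df = pd.DataFrame({
    "Age":    [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    "Weight": [65000, 70000, 75000, 80000, 85000, 90000, 95000,
               100000, 105000, 110000, 115000, 120000],
})

# Z-score each column separately: subtract the column's own mean and
# divide by its own standard deviation
df_z = (df - df.mean()) / df.std(ddof=0)

print(df_z.mean().round(2))       # both means are ~0 after standardization
print(df_z.std(ddof=0).round(2))  # both standard deviations are 1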

Tip

NOTE (IMPORTANT CONCEPT):

  • After normalization, the features are on comparable scales: weight and age no longer differ by orders of magnitude, and they can be used together as input to machine learning algorithms.

  • The rule of thumb is to (almost) always normalize your data before using it in a machine learning algorithm. (There are a few exceptions, which we will point out in due course.)

7.1.1 Data visualization before doing PCA

Exercise 1 - exercise_data_visualization

Level:

Discuss in a group. What is wrong with the following plot?

Looking at your data

Always look at your data before you try any machine learning technique on it. There is a 150-year-old person in your data!

Tip

NOTE (IMPORTANT CONCEPT):

  • Visualize your data before you do any normalization. If there is anything odd about your data, discuss it with the person who gave you the data or performed the experiment; it could be an error in the machine that generated the data or a data-entry error. If there is justification, you can remove the data point. (A quick plotting sketch is shown after this tip.)

  • Then perform normalization and apply a machine learning technique.
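One quick way to spot such problems is to plot the raw features before any preprocessing. A minimal sketch, assuming a DataFrame with Age and Weight columns like the one built in the earlier snippet:

import matplotlib.pyplot as plt

# Sanity check: scatter the raw (un-normalized) features and look for
# impossible values, e.g. an age of 150
ax = df.plot(kind="scatter", x="Age", y="Weight")
ax.set_title("Raw data before normalization")
plt.show()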

7.2 Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

7.3 Example Data
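The example data used in the next two sections is not shown here. As a minimal sketch (an assumption, not the course dataset), X could be a small synthetic matrix with a couple of correlated features:

# Synthetic example data (an assumption): 100 samples, four features,
# two of which share a common latent signal
rng = np.random.default_rng(42)
n = 100
latent = rng.normal(size=n)
X = np.column_stack([
    latent + rng.normal(scale=0.3, size=n),
    latent + rng.normal(scale=0.5, size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])

# Normalize each feature (z-score), as discussed above, before PCA
X = (X - X.mean(axis=0)) / X.std(axis=0)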

7.4 PCA Example

pca = PCA(n_components=2)      # keep the first two principal components
X_pca = pca.fit_transform(X)   # project the (normalized) data onto them

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA Projection")
plt.show()

A simple PCA plot

7.5 Scree plot

A scree plot is a simple graph that shows how much variance (information) each principal component explains in your data after running PCA. The x-axis shows the principal components (PC1, PC2, etc.), and the y-axis shows the proportion of variance explained by each one.

You can use a scree plot to decide how many principal components to keep: look for the point where the plot levels off (the elbow): this tells you that adding more components doesn’t explain much more variance.

# Scree plot: variance explained by each component
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained Ratio")
plt.show()

A scree plot may have an elbow like the plot below.

7.5.1 Hands-on coding

  • Perform PCA on a dataset of US Arrests

Load data and install the pca Python package

!pip install pca
from pca import pca
import pandas as pd

# Load the US Arrests data
# Read the USArrests data directly from the GitHub raw URL
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/main/course_files/data/USArrests.csv"
df = pd.read_csv(url, index_col=0)

print("US Arrests Data (first 5 rows):")
print(df.head())
print("\nData shape:", df.shape)
US Arrests Data (first 5 rows):
            Murder  Assault  UrbanPop  Rape
State                                      
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6

Data shape: (48, 4)

Normalize the data

from sklearn.preprocessing import StandardScaler

scaler_standard = StandardScaler()
df_scaled = scaler_standard.fit_transform(df)

print("\nData shape after normalization:", df_scaled.shape)

Data shape after normalization: (48, 4)

Perform PCA


model = pca(n_components=4)
out = model.fit_transform(df_scaled)
ax = model.biplot(n_feat=len(df.columns), legend=False)
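Because df_scaled is a plain NumPy array, the original feature names are lost and the components are labelled 1 to 4 (as in the loadings printed below). One way to keep the feature and state names, assuming the pca package's fit_transform accepts col_labels and row_labels (as in recent versions of the package), is:

# Re-fit, passing the original column and state names so the biplot arrows
# and points are labelled (assumes fit_transform supports col_labels/row_labels)
model_labelled = pca(n_components=4)
out_labelled = model_labelled.fit_transform(
    df_scaled,
    col_labels=df.columns.tolist(),
    row_labels=df.index.tolist(),
)
model_labelled.biplot(n_feat=len(df.columns), legend=False)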

  • Variance explained plots
model.plot()
(<Figure size 1440x960 with 1 Axes>,
 <Axes: title={'center': 'Cumulative explained variance\n 4 Principal Components explain [100.0%] of the variance.'}, xlabel='Principal Component', ylabel='Percentage explained variance'>)

  • 3D PCA biplots
model.biplot3d()
(<Figure size 3000x2500 with 1 Axes>,
 <Axes3D: title={'center': '4 Principal Components explain [100.0%] of the variance'}, xlabel='PC1 (61.6% expl.var)', ylabel='PC2 (24.7% expl.var)', zlabel='PC3 (9.14% expl.var)'>)

  • Loadings

Recall

What is being plotted on the axes (PC1 and PC2) are the scores.

The scores for each principal component are calculated as follows:

\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]

where \(X\), \(Y\) and \(Z\) are the normalized features.

The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm. They are called the loadings.
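The loadings of the fitted model can be pulled out directly from its results dictionary (the 'loadings' key, visible in the full printout below):

# Just the loadings: one row per principal component, one column per feature
print(model.results['loadings'])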

print(model.results)
{'loadings':             1         2         3         4
PC1  0.533785  0.583489  0.284213  0.542068
PC2 -0.428765 -0.190485  0.865950  0.173225
PC3 -0.331927 -0.267593 -0.386784  0.817690
PC4 -0.648891  0.742732 -0.140542 -0.086823, 'PC':          PC1       PC2       PC3       PC4
0   0.923886 -1.127792 -0.437720 -0.150321
1   1.884005 -1.032585  2.032973  0.451071
2   1.705462  0.730059  0.043498  0.833066
3  -0.198714 -1.092074  0.111217  0.187022
4   2.462479  1.513698  0.585558  0.340560
5   1.453427  0.982671  1.080932 -0.000710
6  -1.406810  1.081895 -0.661238  0.108387
7  -0.003621  0.319738 -0.730442  0.876779
8   2.947649 -0.070435 -0.569823  0.100756
9   1.571384 -1.281416 -0.326932 -1.066904
10 -0.966398  1.557165  0.034386 -0.910657
11 -1.689257 -0.178154  0.241665  0.495788
12  1.320695  0.653978 -0.681444  0.119479
13 -0.561650  0.161720  0.218372 -0.425644
14 -2.302281  0.133259  0.145716 -0.022891
15 -0.850716  0.279295  0.013602 -0.209501
16 -0.808869 -0.934920 -0.029023 -0.667159
17  1.500981 -0.882536 -0.772483 -0.449015
18 -2.444195 -0.340245 -0.083049  0.325837
19  1.702710 -0.431039 -0.158134  0.562825
20 -0.536401  1.454143 -0.626920  0.169570
21  2.044350  0.144860  0.383014 -0.098068
22 -1.742422  0.647555  0.133541 -0.073811
23  0.932617 -2.374555 -0.724196 -0.204393
24  0.637255  0.263934  0.369919 -0.224783
25 -1.239466 -0.507562  0.236769 -0.123520
26 -1.317489  0.212450  0.160150 -0.019531
27  2.806905  0.760007  1.157898 -0.309200
28 -2.431886  0.048021  0.018380  0.027526
29  0.127587  1.417883 -0.775421 -0.251489
30  1.917815 -0.148279  0.181459  0.343651
31  1.623118  0.790157 -0.646164  0.011216
32  1.064086 -2.207350 -0.854340  0.962604
33 -3.038797 -0.548177  0.281399  0.250127
34 -0.281823  0.736114 -0.041732 -0.477516
35 -0.366423  0.292555 -0.026415 -0.012846
36  0.003276  0.556212  0.921912  0.236698
37 -0.941353  0.568486 -0.411608 -0.364358
38 -0.909909  1.464948 -1.387731  0.600087
39  1.257310 -1.914756 -0.290121  0.141580
40 -2.038884 -0.778125  0.375435  0.109319
41  0.935690 -0.851392  0.192734 -0.645743
42  1.293269  0.387317 -0.490484 -0.642740
43 -0.602262  1.466342  0.271830  0.074469
44 -2.851337 -1.332665  0.825094  0.146559
45 -0.153441 -0.190521  0.005751 -0.210783
46 -0.270617  0.975724  0.604878  0.216519
47 -2.160933 -1.375609  0.097337 -0.129911, 'explained_var': array([0.61629429, 0.86387677, 0.95532444, 1.        ]), 'variance_ratio': array([0.61629429, 0.24758248, 0.09144767, 0.04467556]), 'model': PCA(n_components=4), 'scaler': None, 'pcp': np.float64(1.0000000000000002), 'topfeat':     PC feature   loading  type
0  PC1       2  0.583489  best
1  PC2       3  0.865950  best
2  PC3       4  0.817690  best
3  PC4       2  0.742732  best
4  PC4       1 -0.648891  weak, 'outliers':      y_proba     p_raw    y_score  y_bool  y_bool_spe  y_score_spe
0   0.975525  0.664294   5.847636   False       False     1.457903
1   0.708566  0.054815  15.230543   False       False     2.148419
2   0.975525  0.407776   8.267604   False       False     1.855152
3   0.998476  0.904339   3.432838   False       False     1.110005
4   0.708566  0.071188  14.431529   False       False     2.890516
5   0.975525  0.373658   8.639000   False       False     1.754449
6   0.975525  0.457677   7.755841   False       False     1.774715
7   0.998476  0.852925   4.046269   False       False     0.319759
8   0.791047  0.115361  12.899320   False       False     2.948491
9   0.975525  0.224129  10.620723   False       False     2.027628
10  0.975525  0.383534   8.529405   False       False     1.832672
11  0.975525  0.594216   6.474697   False       False     1.698626
12  0.975525  0.614125   6.295892   False       False     1.473744
13  0.998476  0.958706   2.563460   False       False     0.584469
14  0.975525  0.413845   8.203539   False       False     2.306135
15  0.998476  0.949615   2.739714   False       False     0.895390
16  0.998476  0.729799   5.256893   False       False     1.236262
17  0.975525  0.388804   8.471644   False       False     1.741210
18  0.975525  0.278536   9.811090   False       False     2.467763
19  0.975525  0.532463   7.038705   False       False     1.756421
20  0.975525  0.596351   6.455470   False       False     1.549922
21  0.975525  0.482918   7.508219   False       False     2.049476
22  0.975525  0.571498   6.680189   False       False     1.858861
23  0.895284  0.149214  12.044850   False       False     2.551134
24  0.998476  0.944449   2.832091   False       False     0.689750
25  0.998476  0.791489   4.676863   False       False     1.339364
26  0.998476  0.855485   4.018116   False       False     1.334509
27  0.708566  0.049146  15.558941   False       False     2.907976
28  0.975525  0.381779   8.548750   False       False     2.432360
29  0.975525  0.644948   6.020386   False       False     1.423612
30  0.975525  0.538344   6.984148   False       False     1.923539
31  0.975525  0.479709   7.539345   False       False     1.805231
32  0.708566  0.088571  13.748143   False       False     2.450444
33  0.708566  0.079105  14.103578   False       False     3.087845
34  0.998476  0.932461   3.029997   False       False     0.788218
35  0.998476  0.996014   1.259897   False       False     0.468886
36  0.998476  0.893044   3.578107   False       False     0.556221
37  0.998476  0.790790   4.683656   False       False     1.099691
38  0.975525  0.218048  10.720437   False       False     1.724530
39  0.975525  0.288707   9.673319   False       False     2.290659
40  0.975525  0.350924   8.898577   False       False     2.182321
41  0.975525  0.690997   5.608453   False       False     1.265063
42  0.975525  0.621719   6.227916   False       False     1.350022
43  0.975525  0.679982   5.707287   False       False     1.585206
44  0.708566  0.038174  16.308267   False       False     3.147398
45  0.998476  0.998476   0.962234   False       False     0.244627
46  0.998476  0.829952   4.291088   False       False     1.012557
47  0.975525  0.207258  10.902975   False       False     2.561626, 'outliers_params': {'paramT2': (np.float64(-4.625929269271485e-18), np.float64(0.9999999999999999)), 'paramSPE': (array([-9.25185854e-17, -1.38777878e-17]), array([[2.51762774e+00, 6.29946348e-17],
       [6.29946348e-17, 1.01140077e+00]]))}}

8.1 Exercise for normalization in PCA

Exercise 2 - exercise_pca_normalization

Level:

Work in a group.

  • Try the same code above, but now without normalization (a starter sketch is given below).

  • What differences do you observe in PCA with and without normalization?
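A starter sketch (assuming, as an untested shortcut, that the pca package accepts a pandas DataFrame directly):

# Fit the same model on the raw, un-normalized data and compare the
# resulting biplot with the normalized version above
model_raw = pca(n_components=4)
out_raw = model_raw.fit_transform(df)   # df, not df_scaled
model_raw.biplot(n_feat=len(df.columns), legend=False)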

8.2 Exercise (advanced)

Produce prettier, publication-ready plots for PCA.

Tip

Look into the documentation available here for the PCA package.

8.3 Exercise (theoretical)

Exercise 3 - exercise_theoretical

Level:

Break up into groups and discuss the following problem:

  1. Shown are biological samples with scores

  2. The features are genes

  • Why are Sample 33 and Sample 24 separated from the rest? What can we say about Gene1, Gene2, Gene3 and Gene4?

  • Why is Sample 2 separated from the rest? What can we say about Gene1, Gene2, Gene3 and Gene4?

  • Can we treat Sample 2 as an outlier? Why or why not? Argue your case.

The PCA biplot is shown below:

The table of loadings is shown below:

            PC1       PC2       PC3       PC4
Gene1 -0.535899  0.418181 -0.341233  0.649228
Gene2 -0.583184  0.187986 -0.268148 -0.743075
Gene3 -0.278191 -0.872806 -0.378016  0.133877
Gene4 -0.543432 -0.167319  0.817778  0.089024

8.4 Clustering Example

PCA is different from clustering, where the aim is to find groups of similar observations in your data. We will encounter clustering later in the course.
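As a small teaser, KMeans is already imported in the setup block; a minimal sketch of clustering the PCA scores from the earlier example (the choice of three clusters is an assumption for illustration):

# k-means on the 2-D PCA scores from section 7.4; k=3 is arbitrary here
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.title("K-means clusters in PCA space")
plt.show()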

8.5 🧠 PCA vs. Other Techniques

  • PCA is unsupervised (no labels used)

  • Works best for linear relationships

  • Alternatives:

    • t-SNE for nonlinear structures

8.6 🧬 In Practice: Tips for Biologists

  • Always standardize data before PCA
  • Be cautious interpreting PCs biologically—PCs are mathematical constructs

8.6.1 Goals of unsupervised learning

  • Finding patterns in data

Here is an example from biological data (single-cell sequencing data); the plot is from [2] (Aschenbrenner et al. 2020).

Example tSNE

Example heatmaps

  • Finding interesting patterns

You can also use dimensionality reduction techniques (such as PCA) to find interesting patterns in your data.

  • Finding outliers

You can also use dimensionality reduction techniques (such as PCA) to find outliers in your data.

  • Finding hypotheses

All of these can be used to generate hypotheses. These hypotheses can be tested by collecting more data.

Summary
  • Normalize your data before doing dimensionality reduction.
  • PCA reduces dimensionality for visualization.
  • Clustering algorithms find clusters in unlabeled data.
  • The goal of unsupervised learning is to find patterns and form hypotheses.

8.7 Resources

[1] Article on normalization on Wikipedia

[2] Deconvolution of monocyte responses in inflammatory bowel disease reveals an IL-1 cytokine network that regulates IL-23 in genetic and acquired IL-10 resistance Gut, 2020 link

[3] ISLP book

[4] Video lectures by the authors of the book Introduction to Statistical Learning in Python

[6] Visual explanations of machine learning algorithms