6  Normalizing your data and PCA

Introduction

This chapter demonstrates basic unsupervised machine learning concepts using Python.

Learning Objectives
  • Refresh your Python basics.
  • Understand the difference between supervised and unsupervised learning.
  • Apply PCA and clustering to example data.
  • Visualize results.

6.1 Refresher on Python

# ============================================================================
# 1. IMPORTING PACKAGES
# ============================================================================

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# ============================================================================
# 2. READING DATA WITH PANDAS FROM GITHUB
# ============================================================================

# GitHub URL for the diabetes data
# Convert from GitHub web URL to raw data URL
github_url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/main/course_files/data/diabetes_sample_data.csv"

# Read CSV file directly from GitHub
diabetes_data = pd.read_csv(github_url)
    
# Display basic information about the data
print("\nData shape:", diabetes_data.shape)
print("\nFirst 5 rows:")
print(diabetes_data.head())
        
print("\nBasic statistics:")
print(diabetes_data.describe())

# ============================================================================
# 3. PLOTTING WITH MATPLOTLIB
# ============================================================================

# Plot 1: Histogram of Age
plt.figure(figsize=(10, 6))
plt.hist(diabetes_data['age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Distribution of Age', fontsize=14, fontweight='bold')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
#plt.savefig('age_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

Data shape: (100, 6)

First 5 rows:
   patient_id   age  glucose   bmi  blood_pressure  diabetes
0           1  62.5     97.5  29.8            71.7         0
1           2  52.9    127.4  30.8            74.4         0
2           3  64.7    129.7  33.4            87.5         0
3           4  77.8    115.9  33.3            86.1         1
4           5  51.5    135.2  21.1            79.8         1

Basic statistics:
       patient_id         age     glucose         bmi  blood_pressure  \
count  100.000000  100.000000  100.000000  100.000000      100.000000   
mean    50.500000   53.444000  140.670000   28.322000       81.066000   
std     29.011492   13.625024   28.611669    5.425223        8.842531   
min      1.000000   15.700000   82.400000   11.800000       58.800000   
25%     25.750000   46.000000  115.800000   24.700000       74.350000   
50%     50.500000   53.100000  142.550000   28.500000       80.500000   
75%     75.250000   61.075000  156.175000   31.500000       86.825000   
max    100.000000   82.800000  221.600000   47.300000      101.900000   

         diabetes  
count  100.000000  
mean     0.250000  
std      0.435194  
min      0.000000  
25%      0.000000  
50%      0.000000  
75%      0.250000  
max      1.000000  

6.2 Normalization (Z-score Standardization)

Normalization, specifically Z-score standardization, is a data scaling technique that transforms your data to have a mean of 0 and a standard deviation of 1. This is useful for many machine learning algorithms that are sensitive to the scale of input features.

The formula for Z-score is:

\[ z = \frac{x - \mu}{\sigma} \]

Where:

  • \(x\) is the original data point.
  • \(\mu\) is the mean of the data.
  • \(\sigma\) is the standard deviation of the data.

For example, say you have two variables or features on very different scales.

Age  Weight (grams)
 25           65000
 30           70000
 35           75000
 40           80000
 45           85000
 50           90000
 55           95000
 60          100000
 65          105000
 70          110000
 75          115000
 80          120000

If these are not brought onto similar scales, weight will have a disproportionate influence on whatever machine learning model we build.

Hence we normalize each feature separately: age is normalized relative to age, and weight is normalized relative to weight.
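
To see the formula in action, here is a minimal sketch that applies the Z-score formula by hand to the age values from the table above and confirms the result has mean 0 and standard deviation 1:

import numpy as np

# Age values from the table above
age = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])

# Apply the Z-score formula: z = (x - mu) / sigma
z = (age - age.mean()) / age.std()

print(f"Original ages: mean={age.mean():.1f}, std={age.std():.1f}")
print(f"Z-scores: mean={z.mean():.1f}, std={z.std():.1f}")  # mean 0, std 1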

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# 1. Generate age and weight data
np.random.seed(42)
age = np.random.normal(45, 15, 100)  # 100 people, mean age 45, std 15
age = np.clip(age, 18, 80)  # Keep ages between 18-80

weight = 70 + (age - 45) * 0.3 + np.random.normal(0, 10, 100)  # Weight correlated with age
weight = np.clip(weight, 45, 120)  # Keep weights between 45-120 kg

print("Original data:")
print(f"Age: mean={age.mean():.1f}, std={age.std():.1f}")
print(f"Weight: mean={weight.mean():.1f}, std={weight.std():.1f}")

# 2. Normalize the data
scaler = StandardScaler()
data = np.column_stack((age, weight))
normalized_data = scaler.fit_transform(data)

age_normalized = normalized_data[:, 0]
weight_normalized = normalized_data[:, 1]

# Histogram: Age (Original)
plt.figure()
plt.hist(age, bins=20, alpha=0.7)
plt.title('Age Distribution (Original)')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

# Histogram: Age (Normalized)
plt.figure()
plt.hist(age_normalized, bins=20, alpha=0.7)
plt.title('Age Distribution (Normalized)')
plt.xlabel('Age (Z-score)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Original data:
Age: mean=43.6, std=13.1
Weight: mean=69.8, std=9.8

6.3 Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

6.4 Example Data
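
The code that generated the example data is not shown in this version of the materials. Below is a minimal sketch that creates a synthetic dataset X so the examples that follow can run; make_blobs and its parameter values are assumptions, not the course's original data.

from sklearn.datasets import make_blobs

# Synthetic stand-in for the missing example data:
# 200 samples, 4 features, 3 underlying groups (all assumed values)
X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=42)
print("X shape:", X.shape)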

6.5 PCA Example

# Project the data onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the scores: each point is one sample
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA Projection")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

A simple PCA plot

6.6 Scree plot

A scree plot is a simple graph that shows how much variance (information) each principal component explains in your data after running PCA. The x-axis shows the principal components (PC1, PC2, etc.), and the y-axis shows the proportion of variance explained by each one.

You can use a scree plot to decide how many principal components to keep: look for the point where the plot levels off (the "elbow"). Beyond this point, adding more components explains little additional variance.

# Scree plot: variance explained by each component
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained Ratio")
plt.show()

A scree plot may have an elbow like the plot below.

6.7 Clustering Example

PCA is different from clustering, where the aim is to find groups of similar observations in your data. We will encounter clustering in detail later in the course; a brief preview is sketched below.
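
Here is a minimal sketch of k-means clustering, using the KMeans import from the setup above and applied to the synthetic X and its PCA projection (both carried over from the assumed example data in Section 6.4):

# Fit k-means with 3 clusters (the cluster count is an assumption)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Colour the PCA projection by cluster assignment
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.title("KMeans Clusters in PCA Space")
plt.show()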

6.8 🧠 PCA vs. Other Techniques

  • PCA is unsupervised (no labels used)

  • Works best for linear relationships

  • Alternatives:

    • t-SNE for nonlinear structures

6.9 🧬 In Practice: Tips for Biologists

  • Always standardize data before PCA
  • Be cautious interpreting PCs biologically—PCs are mathematical constructs

6.9.1 Goals of unsupervised learning

  • Finding patterns in data

Here is an example from biological data (single-cell sequencing data); the plot is from [2] (Aschenbrenner et al. 2020).

Example tSNE

Example heatmaps

  • Finding interesting patterns

You can also use dimensionality reduction techniques (such as PCA) to find interesting patterns in your data.

  • Finding outliers

You can also use dimensionality reduction techniques (such as PCA) to find outliers in your data.

  • Finding hypotheses

All of these can be used to generate hypotheses. These hypotheses can be tested by collecting more data.

6.9.2 Exercise

  • Perform PCA on a dataset of US Arrests

Load data

!pip install pca
from pca import pca
import pandas as pd

# Load the US Arrests data
# Read the USArrests data directly from the GitHub raw URL
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/main/course_files/data/USArrests.csv"
df = pd.read_csv(url, index_col=0)

print("US Arrests Data (first 5 rows):")
print(df.head())
print("\nData shape:", df.shape)
US Arrests Data (first 5 rows):
            Murder  Assault  UrbanPop  Rape
State                                      
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6

Data shape: (48, 4)

Perform PCA

# Fit PCA keeping all 4 components, then draw a biplot of PC1 vs PC2
model = pca(n_components=4)
out = model.fit_transform(df)
ax = model.biplot(n_feat=len(df.columns), legend=False)
[05-08-2025 15:56:12] [pca.pca] [INFO] Extracting column labels from dataframe.
[05-08-2025 15:56:12] [pca.pca] [INFO] Extracting row labels from dataframe.
[05-08-2025 15:56:12] [pca.pca] [INFO] The PCA reduction is performed on the 4 columns of the input dataframe.
[05-08-2025 15:56:12] [pca.pca] [INFO] Fit using PCA.
[05-08-2025 15:56:12] [pca.pca] [INFO] Compute loadings and PCs.
[05-08-2025 15:56:12] [pca.pca] [INFO] Compute explained variance.
[05-08-2025 15:56:12] [pca.pca] [INFO] Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[4]
[05-08-2025 15:56:12] [pca.pca] [INFO] Multiple test correction applied for Hotelling T2 test: [fdr_bh]
[05-08-2025 15:56:12] [pca.pca] [INFO] Outlier detection using SPE/DmodX with n_std=[3]
[05-08-2025 15:56:12] [pca.pca] [INFO] Plot PC1 vs PC2 with loadings.
[05-08-2025 15:56:12] [scatterd.scatterd] [INFO] Create scatterplot

  • Variance explained plots

model.plot()
(<Figure size 1440x960 with 1 Axes>,
 <Axes: title={'center': 'Cumulative explained variance\n 4 Principal Components explain [100.0%] of the variance.'}, xlabel='Principal Component', ylabel='Percentage explained variance'>)

  • 3D PCA biplots

model.biplot3d()
[05-08-2025 15:56:13] [pca.pca] [INFO] Plot PC1 vs PC2 vs PC3 with loadings.
[05-08-2025 15:56:13] [scatterd.scatterd] [INFO] Create scatterplot
(<Figure size 3000x2500 with 1 Axes>,
 <Axes3D: title={'center': '4 Principal Components explain [100.0%] of the variance'}, xlabel='PC1 (96.4% expl.var)', ylabel='PC2 (2.88% expl.var)', zlabel='PC3 (0.59% expl.var)'>)

  • Loadings

Recall

The values plotted on the axes (PC1 and PC2) are the scores.

The scores for each principal component are calculated as follows:

\[ PC_{1} = \alpha X + \beta Y + \gamma Z + \dots \]

where \(X\), \(Y\) and \(Z\) are the normalized features.

The constants \(\alpha\), \(\beta\), \(\gamma\) are determined by the PCA algorithm. They are called the loadings.
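
To make this concrete, here is a minimal sketch using scikit-learn (a different library from the pca package used in this exercise; the data here is illustrative): the scores returned by fit_transform are exactly the mean-centred features multiplied by the loadings.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))  # illustrative data: 50 samples, 3 features

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_demo)

# Each row of components_ holds the loadings for one PC;
# the scores are the centred data projected onto the loadings
manual_scores = (X_demo - X_demo.mean(axis=0)) @ pca_demo.components_.T
print(np.allclose(scores, manual_scores))  # True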

print(model.results)
{'loadings':        Murder   Assault  UrbanPop      Rape
PC1  0.041584  0.995187  0.048369  0.074397
PC2 -0.045411 -0.060516  0.976922  0.199748
PC3  0.079078 -0.066443 -0.199673  0.974404
PC4  0.994965 -0.039074  0.058436 -0.071436, 'PC':                        PC1        PC2        PC3       PC4
Alabama          62.104695 -11.569854  -2.571036  2.390729
Alaska           90.091422 -18.173579  20.082295 -4.096991
Arizona         121.406646   8.601610  -1.671659 -4.364402
Arkansas         15.629716 -16.741244   0.078286 -0.535736
California      104.776982  22.313750   6.753359 -2.808592
Colorado         32.307655  13.641366  12.194635 -1.713629
Connecticut     -63.532892  13.048830  -8.617371 -0.704265
Delaware         64.066928   1.238885 -11.338389 -3.746809
Florida         162.579815   5.968716  -2.941584  1.232499
Georgia          37.838646  -7.374991   3.505073  7.334706
Hawaii         -126.174440  24.510283   3.462165  3.486970
Idaho           -54.491992  -9.374532  -1.724029 -3.356955
Illinois         76.343221  12.752929  -5.919698  0.357727
Indiana         -60.229054   2.944616   3.534375  1.650429
Iowa           -118.271210  -3.131830  -0.928109 -0.871706
Kansas          -58.463403   3.255756   0.183712  0.651066
Kentucky        -65.084303 -10.565646   2.013888  3.870227
Louisiana        75.594954  -4.441346  -3.883800  4.467733
Maine           -91.955934 -11.321876  -4.942350 -2.126798
Maryland        126.643967  -5.245981  -2.339614 -1.946027
Massachusetts   -23.901058  19.492807  -7.652137 -1.037690
Michigan         82.775518   5.737532   6.429014  0.495857
Minnesota      -101.624283   5.388593  -0.240855 -0.730665
Mississippi      84.132386 -27.589292  -5.069534  3.852208
Missouri          5.310427   5.252112   5.375275  0.679364
Montana         -65.182355  -9.400729   1.619068  0.240149
Nebraska        -71.776593  -0.087644   0.250121 -0.658994
Nevada           80.943626  14.930242  15.859544  0.342969
New Hampshire  -117.462465  -4.524273  -2.556713 -0.940127
New Jersey      -13.444972  23.158469  -6.442013  1.611612
New Mexico      112.185340  -0.553098   2.255854 -1.392284
New York         81.649603  15.768795  -4.749328  0.884121
North Carolina  161.602001 -31.391611 -11.671292 -2.150116
North Dakota   -130.202864 -15.901552  -1.609817 -2.308756
Ohio            -52.745141  12.365579   1.470217  2.032185
Oklahoma        -22.366204   3.403263  -0.611321 -0.184635
Oregon          -13.831882   3.877062   7.984332 -2.911465
Pennsylvania    -67.348024   9.029093  -3.413268  1.873292
Rhode Island      0.438586  18.381176 -17.586862 -2.321153
South Carolina  104.560644 -23.736093  -2.069733  1.227265
South Dakota    -88.817911 -16.443418   1.062809 -1.260376
Tennessee        14.808170  -6.549591   5.972649  3.917549
Texas            28.636397  12.922119  -0.487939  4.239257
Utah            -52.562195  17.735996   1.609241 -1.862148
Vermont        -127.449367 -27.090724   4.497811 -2.012857
Virginia        -17.501029  -1.730386   0.887159  1.168243
Washington      -27.742336  10.007474   4.624674 -2.687825
West Virginia   -94.265438 -22.787766  -0.667106  0.724843, 'explained_var': array([0.9643326 , 0.99313748, 0.99911593, 1.        ]), 'variance_ratio': array([9.64332599e-01, 2.88048813e-02, 5.97845474e-03, 8.84065325e-04]), 'model': PCA(n_components=4), 'scaler': None, 'pcp': np.float64(1.0000000000000002), 'topfeat':     PC   feature   loading  type
0  PC1   Assault  0.995187  best
1  PC2  UrbanPop  0.976922  best
2  PC3      Rape  0.974404  best
3  PC4    Murder  0.994965  best, 'outliers':                  y_proba     p_raw    y_score  y_bool  y_bool_spe  y_score_spe
Alabama         0.999620  0.799234   4.601111   False       False    63.173211
Alaska          0.999620  0.365520   8.730713   False       False    91.906165
Arizona         0.999620  0.167967  11.640753   False       False   121.710975
Arkansas        0.999620  0.993595   1.444244   False       False    22.903216
California      0.999620  0.257562  10.107498   False       False   107.126651
Colorado        0.999620  0.946624   2.793800   False       False    35.069523
Connecticut     0.999620  0.758702   4.989460   False       False    64.859081
Delaware        0.999620  0.781579   4.772620   False       False    64.078905
Florida         0.420843  0.017535  18.538069   False       False   162.689341
Georgia         0.999620  0.950428   2.724717   False       False    38.550663
Hawaii          0.999620  0.104870  13.208618   False       False   128.533043
Idaho           0.999620  0.871111   3.841755   False       False    55.292486
Illinois        0.999620  0.635299   6.106537   False       False    77.401064
Indiana         0.999620  0.850834   4.069115   False       False    60.300993
Iowa            0.999620  0.214721  10.775933   False       False   118.312668
Kansas          0.999620  0.878457   3.755791   False       False    58.553988
Kentucky        0.999620  0.771674   4.867169   False       False    65.936328
Louisiana       0.999620  0.676410   5.739275   False       False    75.725310
Maine           0.999620  0.451018   7.822307   False       False    92.650303
Maryland        0.999620  0.141316  12.228021   False       False   126.752572
Massachusetts   0.999620  0.968244   2.352761   False       False    30.842018
Michigan        0.999620  0.589923   6.513389   False       False    82.974126
Minnesota       0.999620  0.383510   8.529668   False       False   101.767047
Mississippi     0.999620  0.448614   7.846422   False       False    88.540541
Missouri        0.999620  0.999620   0.659862   False       False     7.468957
Montana         0.999620  0.792670   4.665365   False       False    65.856762
Nebraska        0.999620  0.769963   4.883398   False       False    71.776646
Nevada          0.999620  0.519438   7.160325   False       False    82.309068
New Hampshire   0.999620  0.214470  10.780144   False       False   117.549563
New Jersey      0.999620  0.981723   1.976515   False       False    26.778386
New Mexico      0.999620  0.274248   9.870286   False       False   112.186703
New York        0.999620  0.559034   6.793809   False       False    83.158359
North Carolina  0.420843  0.009995  20.091620   False       False   164.622720
North Dakota    0.999620  0.103125  13.262693   False       False   131.170291
Ohio            0.999620  0.877098   3.771862   False       False    54.175248
Oklahoma        0.999620  0.996651   1.199054   False       False    22.623645
Oregon          0.999620  0.996784   1.185440   False       False    14.364977
Pennsylvania    0.999620  0.758811   4.988439   False       False    67.950576
Rhode Island    0.999620  0.988643   1.710246   False       False    18.386407
South Carolina  0.999620  0.271867   9.903461   False       False   107.220942
South Dakota    0.999620  0.482905   7.508345   False       False    90.327224
Tennessee       0.999620  0.995571   1.298396   False       False    16.191944
Texas           0.999620  0.976601   2.134707   False       False    31.416944
Utah            0.999620  0.855304   4.020116   False       False    55.473866
Vermont         0.999620  0.094545  13.540901   False       False   130.296771
Virginia        0.999620  0.998694   0.922137   False       False    17.586366
Washington      0.999620  0.979528   2.047287   False       False    29.492147
West Virginia   0.999620  0.389782   8.460986   False       False    96.980694, 'outliers_params': {'paramT2': (np.float64(5.0330110449673766e-15), np.float64(1777.6058289930552)), 'paramSPE': (array([1.33226763e-14, 5.62512999e-15]), array([[ 7.00270263e+03, -8.77755141e-13],
       [-8.77755141e-13,  2.09172664e+02]]))}}

6.10 Exercise (advanced)

Tip

Look into the documentation (available here) for the pca package and create prettier, publication-ready plots.
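
As a starting point, here is a minimal sketch that restyles the biplot from the exercise above with plain matplotlib. It assumes biplot() returns a (figure, axes) tuple, as plot() and biplot3d() did in the output above; check this against your installed pca version.

# Re-draw the biplot and style the returned matplotlib objects
fig, ax = model.biplot(n_feat=len(df.columns), legend=False)  # return type assumed
ax.set_title("US Arrests: PCA biplot", fontsize=14, fontweight='bold')
fig.savefig("usarrests_biplot.png", dpi=300, bbox_inches='tight')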

6.11 Exercise (theoretical)

Tip

Break up into groups and discuss the following problem:

  1. The plotted points are biological samples, positioned by their PC scores.

  2. The features are genes.

  • Why are Sample 33 and Sample 24 separated from the rest? What can we say about Gene1, Gene2, Gene3 and Gene4?

  • Why is Sample 2 separated from the rest? What can we say about Gene1, Gene2, Gene3 and Gene4?

  • Can we treat Sample 2 as an outlier? Why or why not? Argue your case.

The PCA biplot is shown below:

The table of loadings is shown below:

            PC1       PC2       PC3       PC4
Gene1 -0.535899  0.418181 -0.341233  0.649228
Gene2 -0.583184  0.187986 -0.268148 -0.743075
Gene3 -0.278191 -0.872806 -0.378016  0.133877
Gene4 -0.543432 -0.167319  0.817778  0.089024

Summary

  • Normalize your data before doing dimensionality reduction.
  • PCA reduces dimensionality for visualization.
  • KMeans finds clusters in unlabeled data.

6.12 Resources

[1] Wikipedia article on normalization

[2] Aschenbrenner et al. (2020). Deconvolution of monocyte responses in inflammatory bowel disease reveals an IL-1 cytokine network that regulates IL-23 in genetic and acquired IL-10 resistance. Gut. link

[3] The ISLP book: An Introduction to Statistical Learning with Applications in Python

[4] https://www.statlearning.com/

[5] Video lectures by the authors of An Introduction to Statistical Learning with Applications in Python

[6] https://mlu-explain.github.io