5 Refresher on Python

Learning Objectives

Refresher on Python

5.1 Refresher on Python

See Python setup instructions here: Python Installation.
Walkthrough of getting set up with Google Colab in the web browser.
Install Python packages

!pip install pandas numpy scikit-learn seaborn matplotlib scanpy pca
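
After installing, it can help to confirm which packages the current Python environment can actually see. The sketch below uses the standard-library importlib.metadata module. The helper name installed_versions is hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib import metadata

# Distribution names as given to pip (same list as the install command above)
PACKAGES = ["pandas", "numpy", "scikit-learn", "seaborn", "matplotlib", "scanpy", "pca"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with the pip command above.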

Loading data and data visualization

Note: Here is an alternative way to read a file, from a local directory on your computer instead of from GitHub

import pandas as pd
import os
# find out which directory is your current working directory
os.getcwd()
# now change directory to where your files are (my files are in the directory shown below)
os.chdir("/Users/soumyabanerjee/soumya_cam_mac/teaching/ml-unsupervised/") 
# now read the file
diabetes_data = pd.read_csv("course_files/data/diabetes_sample_data.csv")
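
Changing the working directory affects every later relative path in the session. One alternative, sketched here, is to build the full path with pathlib and leave the working directory alone. The data_dir value and the helper name csv_path are placeholders, not part of the course material; point data_dir at wherever your copy of the course files lives.

```python
from pathlib import Path

def csv_path(data_dir, filename):
    """Join a folder and a file name into one path, without calling os.chdir()."""
    return Path(data_dir).expanduser() / filename

# Placeholder folder: replace with the location of your own course files
data_dir = "course_files/data"
path = csv_path(data_dir, "diabetes_sample_data.csv")

print(path)           # the path that would be passed to pd.read_csv
print(path.exists())  # True only if the file is present at that location
```

The resulting path object can be passed directly to pd.read_csv(path).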

# 1. IMPORTING PACKAGES

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

# 2. READING DATA WITH PANDAS FROM GITHUB

# GitHub URL for the diabetes data
# Convert from GitHub web URL to raw data URL
github_url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/main/course_files/data/diabetes_sample_data.csv"

# Read CSV file directly from GitHub
diabetes_data = pd.read_csv(github_url)

# Display basic information about the data
print("\nData shape:", diabetes_data.shape)
print("\nFirst 5 rows:")
print(diabetes_data.head())
        
print("\nBasic statistics:")
print(diabetes_data.describe())

# 3. PLOTTING WITH MATPLOTLIB

# Plot 1: Histogram of Age
plt.figure()
plt.hist(diabetes_data['age'], bins=20, alpha=0.7)
plt.title('Distribution of Age', fontsize=14, fontweight='bold')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.savefig('age_distribution.png', dpi=300)
plt.show()

# 4. NUMPY
a = np.array([17, 13, 78, 901])
print("This is a numpy array:")
print(a)
print("Here is the mean/average:")
print( np.mean(a) )


Data shape: (100, 6)

First 5 rows:
   patient_id   age  glucose   bmi  blood_pressure  diabetes
0           1  62.5     97.5  29.8            71.7         0
1           2  52.9    127.4  30.8            74.4         0
2           3  64.7    129.7  33.4            87.5         0
3           4  77.8    115.9  33.3            86.1         1
4           5  51.5    135.2  21.1            79.8         1

Basic statistics:
       patient_id         age     glucose         bmi  blood_pressure  \
count  100.000000  100.000000  100.000000  100.000000      100.000000   
mean    50.500000   53.444000  140.670000   28.322000       81.066000   
std     29.011492   13.625024   28.611669    5.425223        8.842531   
min      1.000000   15.700000   82.400000   11.800000       58.800000   
25%     25.750000   46.000000  115.800000   24.700000       74.350000   
50%     50.500000   53.100000  142.550000   28.500000       80.500000   
75%     75.250000   61.075000  156.175000   31.500000       86.825000   
max    100.000000   82.800000  221.600000   47.300000      101.900000   

         diabetes  
count  100.000000  
mean     0.250000  
std      0.435194  
min      0.000000  
25%      0.000000  
50%      0.000000  
75%      0.250000  
max      1.000000

This is a numpy array:
[ 17  13  78 901]
Here is the mean/average:
252.25
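
The mean above is pulled far upward by the single large value 901. As a small illustration (an addition to the walkthrough, not part of the original code), the median of the same array is much less affected by that value:

```python
import numpy as np

a = np.array([17, 13, 78, 901])

mean_a = np.mean(a)      # sum of the values divided by how many there are
median_a = np.median(a)  # middle of the sorted values: (17 + 78) / 2

print("Mean:", mean_a)      # 252.25
print("Median:", median_a)  # 47.5
```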

You can also go through this Introduction to Visualization in Python course

5.1.1 Optional exercise on Python

Exercise 1 - exercise_python

Load the dataset from this GitHub URL:

https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv

Save it to a variable called crime_data and display:

The shape of the data
The first 3 rows
Column names using .columns

Answer

Simple data munging and visualization

import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv"

crime_data = pd.read_csv(url)

# Display the shape of the data
print("Shape of the dataset: ", crime_data.shape)

# Display the first 3 rows
print("\nFirst 3 rows: ")
print(crime_data.head(3))

# Display column names
print("\nColumn names: ")
print(crime_data.columns)

Shape of the dataset:  (48, 5)

First 3 rows: 
     State  Murder  Assault  UrbanPop  ViolentCrime
0  Alabama    13.2      236        58          21.2
1   Alaska    10.0      263        48          44.5
2  Arizona     8.1      294        80          31.0

Column names: 
Index(['State', 'Murder', 'Assault', 'UrbanPop', 'ViolentCrime'], dtype='object')
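
Once the column names are known, they can be used to select columns and filter rows. The sketch below is an illustration only: it uses a hand-typed copy of the three rows shown above rather than the full dataset, and the threshold of 9.0 is an arbitrary choice.

```python
import pandas as pd

# The three rows shown in the output above, typed in by hand for illustration
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Murder": [13.2, 10.0, 8.1],
    "Assault": [236, 263, 294],
    "UrbanPop": [58, 48, 80],
    "ViolentCrime": [21.2, 44.5, 31.0],
})

# Select two columns by name (a list of names returns a DataFrame)
subset = sample[["State", "Murder"]]

# Keep only rows where a condition holds (boolean indexing)
high_murder = sample[sample["Murder"] > 9.0]

print(subset)
print(high_murder["State"].tolist())  # ['Alabama', 'Alaska']
```

The same two operations should work unchanged on crime_data, since it has the same column names.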

5.1.2 Optional exercise on Python

Exercise 2 - exercise_python_visualization

Load the dataset from this GitHub URL:

https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv

Histogram: Show the distribution of murders

Use plt.hist() with the Murder column
Use 15 bins
Add title: “Distribution of Murder”
Add axis labels and grid

Remember to use plt.show() after each plot!

Answer

Simple data visualization

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv"
crime_data = pd.read_csv(url)

# Draw histogram
plt.figure()
plt.hist(crime_data["Murder"], bins=15)
plt.grid(True)
plt.xlabel("Murder")
plt.ylabel("Frequency")
plt.title("Distribution of Murder")
plt.show()
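
The same histogram can also be drawn with Matplotlib's object-oriented interface, where the figure and axes are explicit objects. This is a sketch of that alternative style, not the course's required solution. It uses a few Murder values that appear in this section's outputs instead of the full dataset, and the output file name is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# A handful of Murder values that appear in this section's outputs (not the full dataset)
murder = [13.2, 10.0, 8.1, 17.4, 16.1, 15.4, 15.4, 14.4]

# Create the figure and axes explicitly, then call methods on the axes
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(murder, bins=15)
ax.set_title("Distribution of Murder")
ax.set_xlabel("Murder")
ax.set_ylabel("Frequency")
ax.grid(True)

# Save to disk; the file name is an arbitrary choice
fig.savefig("murder_distribution.png", dpi=300)
```

In a notebook, calling plt.show() afterwards displays the figure as before.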

5.1.3 Optional exercise on Python

Exercise 3 - exercise_python_numpy

Load the dataset from this GitHub URL:

https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv

Calculate the mean (average) value of the Murder column across all states in the dataset.

Answer

Simple numpy usage

import pandas as pd
import numpy as np

# Load the dataset
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv"
crime_data = pd.read_csv(url)

# Convert the Murder column to a NumPy array
numpy_array_murders = crime_data["Murder"].to_numpy()

print( np.mean(numpy_array_murders) )

7.916666666666667
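
Converting to a NumPy array first is one option; pandas can also compute the mean on the column directly. The sketch below compares the two on three hand-typed Murder values (the first rows of the dataset), so it runs without downloading the file.

```python
import numpy as np
import pandas as pd

# Three Murder values from the first rows of the dataset, typed in for illustration
murder = pd.Series([13.2, 10.0, 8.1], name="Murder")

mean_numpy = np.mean(murder.to_numpy())  # convert to a NumPy array, then average
mean_pandas = murder.mean()              # pandas computes the same average directly

print(mean_numpy, mean_pandas)
print(np.isclose(mean_numpy, mean_pandas))  # True
```

One difference to keep in mind: by default the pandas mean() skips missing values (NaN), whereas np.mean on an array containing NaN returns NaN.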

5.1.4 Optional exercise on Python

Exercise 4 - exercise_python_data_munging_advanced

Load the dataset from this GitHub URL:

https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv

Find the state that has the highest number of murders.

Answer

Data munging (advanced)

import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/cambiotraining/ml-unsupervised/refs/heads/main/course_files/data/USArrests.csv"
crime_data = pd.read_csv(url)

# Option 1: average Murder per state with groupby, then sort from highest to lowest
print(crime_data.groupby("State")["Murder"].mean().sort_values(ascending=False).head())

# Option 2: use idxmax() and .loc[]
# crime_data["Murder"].idxmax() finds the index label (here, the row number) of the maximum value in the Murder column.
# .loc[..., ["State", "Murder"]] looks up that row and shows only the State name and Murder value.
print(crime_data.loc[crime_data["Murder"].idxmax(), ["State", "Murder"]])

State
Georgia           17.4
Mississippi       16.1
Florida           15.4
Louisiana         15.4
South Carolina    14.4
Name: Murder, dtype: float64
State     Georgia
Murder       17.4
Name: 9, dtype: object
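
A third option, sketched below, is nlargest(), which avoids groupby altogether. This assumes each state appears in exactly one row, and it uses a hand-typed copy of the five rows from the output above rather than the full dataset.

```python
import pandas as pd

# The five rows from the output above, typed in by hand for illustration
top_states = pd.DataFrame({
    "State": ["Georgia", "Mississippi", "Florida", "Louisiana", "South Carolina"],
    "Murder": [17.4, 16.1, 15.4, 15.4, 14.4],
})

# nlargest keeps the n rows with the largest values in the given column
top3 = top_states.nlargest(3, "Murder")
print(top3)

# sort_values plus head(1) gives the single highest row
highest = top_states.sort_values("Murder", ascending=False).head(1)
print(highest)
```

Note that Florida and Louisiana are tied at 15.4. With its default settings, nlargest keeps the tied row that comes first in the data, so which one appears in a top-3 list depends on row order.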

5.1.5 Optional exercise on Python

Exercise 5 - exercise_python_numpy

Fill in the blanks in the code below.

import numpy as np

# 1) Reproducibility
np.random.seed(7)

# 2) Make a 5x4 array of random numbers in [0, 1)
X = np.random.rand(5, 4)

# 3) Compute:
# - mean of each column
# - mean of each row
# - overall mean
col_means = ...
row_means = ...
overall_mean = ...

print("X:\n", X)
print("Column means:", col_means)
print("Row means:", row_means)
print("Overall mean:", overall_mean)

# 4) Bonus: random integers from 0..99 (size=12). Compare mean to 49.5
ints = np.random.randint(0, 100, size=12)
ints_mean = ...
print("Random integers:", ints)
print("Integers mean:", ints_mean)

Answer

Numpy (advanced)

import numpy as np

# 1) Reproducibility
np.random.seed(7)

# 2) Make a 5x4 array of random numbers in [0, 1)
X = np.random.rand(5, 4)

# 3) Compute:
# - mean of each column
# - mean of each row
# - overall mean
col_means = X.mean(axis=0)
row_means = X.mean(axis=1)
overall_mean = X.mean()

print("X:\n", X)
print("Column means:", col_means)
print("Row means:", row_means)
print("Overall mean:", overall_mean)

# 4) Bonus: random integers from 0..99 (size=12). Compare mean to 49.5
ints = np.random.randint(0, 100, size=12)
ints_mean = ints.mean()
print("Random integers:", ints)
print("Integers mean:", ints_mean)

X:
 [[0.07630829 0.77991879 0.43840923 0.72346518]
 [0.97798951 0.53849587 0.50112046 0.07205113]
 [0.26843898 0.4998825  0.67923    0.80373904]
 [0.38094113 0.06593635 0.2881456  0.90959353]
 [0.21338535 0.45212396 0.93120602 0.02489923]]
Column means: [0.38341265 0.46727149 0.56762226 0.50674962]
Row means: [0.50452537 0.52241424 0.56282263 0.41115415 0.40540364]
Overall mean: 0.48126400765922045
Random integers: [61 64 34 56 73 78 38  4  9 87 99 67]
Integers mean: 55.833333333333336
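
The axis argument is easier to check on an array small enough to work out by hand. In this illustrative sketch (the array M is made up for the purpose), axis=0 collapses the rows and gives one mean per column, while axis=1 collapses the columns and gives one mean per row.

```python
import numpy as np

# A small 2x3 array with easy-to-check values
M = np.array([[0, 1, 2],
              [3, 4, 5]])

col_means = M.mean(axis=0)  # collapse the rows: one mean per column
row_means = M.mean(axis=1)  # collapse the columns: one mean per row

print(col_means)  # [1.5 2.5 3.5]
print(row_means)  # [1. 4.]
print(M.mean())   # 2.5
```

The same pattern explains the output above: X has shape (5, 4), so X.mean(axis=0) has 4 values and X.mean(axis=1) has 5.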