14  Model comparisons

Learning outcomes

Questions

  • How do I compare linear models?
  • How do I decide which one is the “best” model?

Objectives

  • Be able to compare models using the Akaike Information Criterion (AIC)
  • Use AIC in the context of Backwards Stepwise Elimination

14.1 Libraries and functions

14.1.1 Libraries

# A collection of R packages designed for data science
library(tidyverse)

# Converts stats functions to a tidyverse-friendly format
library(rstatix)

# Creates diagnostic plots using ggplot2
library(ggResidpanel)

# Helper functions for tidying data
library(broom)

14.1.2 Functions

# Calculates the Akaike Information Criterion
stats::AIC()

# Performs a backwards stepwise elimination process
stats::step()

14.1.3 Libraries

# A fundamental package for scientific computing in Python
import numpy as np

# A Python data analysis and manipulation tool
import pandas as pd

# Simple yet exhaustive stats functions.
import pingouin as pg

# Python equivalent of `ggplot2`
from plotnine import *

# Statistical models, conducting tests and statistical data exploration
import statsmodels.api as sm

# Convenience interface for specifying models using formula strings and DataFrames
import statsmodels.formula.api as smf

14.1.4 Functions

# Reads in a .csv file
pandas.read_csv()

# Creates a model from a formula and data frame
statsmodels.formula.api.ols()

14.2 Purpose and aim

In the previous example we used a single data set and fitted five linear models to it, depending on which predictor variables we used. Whilst this was fun (seriously, what else would you be doing right now?) it seems that there should be a “better way”. Well, thankfully there is! In fact, there are several methods that can be used to compare different models and help identify “the best” one. More specifically, we can determine whether a full model (which uses all available predictor variables and interactions) is necessary to appropriately describe the dependent variable, or whether we can throw away some of the terms (e.g. an interaction term) because they don’t offer any useful predictive power.

Here we will use the Akaike Information Criterion in order to compare different models.
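For reference, the AIC of a fitted model is defined as AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the maximised likelihood of the model; lower scores mean a better trade-off between goodness-of-fit and complexity. The sketch below (not part of the original analysis) recomputes this by hand from a statsmodels fit and checks it against the reported value. It assumes the ladybird data introduced in the next section.

# a minimal sketch: recompute AIC = 2k - 2*log-likelihood by hand
# and check it against the value statsmodels reports
import pandas as pd
import statsmodels.formula.api as smf

ladybird_py = pd.read_csv("data/CS5-ladybird.csv")
fit = smf.ols("eggs ~ weight", data = ladybird_py).fit()

k = fit.df_model + 1               # statsmodels counts the coefficients, incl. the intercept
aic_by_hand = 2 * k - 2 * fit.llf  # fit.llf is the maximised log-likelihood
print(aic_by_hand, fit.aic)        # the two values should match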

14.3 Data and hypotheses

This section uses the data/CS5-ladybird.csv data set. The data set comprises 20 observations of three variables (one response and two predictors), plus an id column identifying each observation. It records the clutch size (eggs) in a species of ladybird, alongside two potential predictor variables: the mass of the female (weight) and the colour of the male (male), which is a categorical variable.

14.4 Backwards Stepwise Elimination

First, we load the data and visualise it:

ladybird <- read_csv("data/CS5-ladybird.csv")

head(ladybird)
# A tibble: 6 × 4
     id male  weight  eggs
  <dbl> <chr>  <dbl> <dbl>
1     1 Wild    10.6    27
2     2 Wild    10.6    26
3     3 Wild    12.5    28
4     4 Wild    11.4    24
5     5 Wild    14.9    30
6     6 Wild    10.4    21

We can visualise the data by male, so we can see if the egg clutch size differs a lot between the two groups:

ggplot(ladybird,
       aes(x = male, y = eggs)) +
    geom_boxplot() +
    geom_jitter(width = 0.05)

We can also plot the egg clutch size against the weight, again for each male colour group:

# visualise the data
ggplot(ladybird,
       aes(x = weight, y = eggs,
           colour = male)) +
    geom_point() +
    scale_color_brewer(palette = "Dark2")

ladybird_py = pd.read_csv("data/CS5-ladybird.csv")

We can visualise the data by male, so we can see if the egg clutch size differs a lot between the two groups:

(ggplot(ladybird_py,
        aes(x = "male", y = "eggs")) +
        geom_boxplot() +
        geom_jitter(width = 0.05))

We can also plot the egg clutch size against the weight, again for each male colour group:

(ggplot(ladybird_py,
        aes(x = "weight", y = "eggs",
            colour = "male")) +
        geom_point() +
        scale_color_brewer(type = "qual",
                           palette = "Dark2"))

We can see a few things already:

  1. There aren’t a huge number of data points in each group, so we need to be a bit cautious with drawing any firm conclusions.
  2. There is quite some spread in the egg clutch sizes, with two observations in the Melanic group being rather low.
  3. From the box plot, there does not seem to be much difference in the egg clutch size between the two male colour groups.
  4. The scatter plot suggests that egg clutch size seems to increase somewhat linearly as the weight of the female goes up. There does not seem to be much difference between the two male colour groups in this respect.

14.4.1 Comparing models with AIC (step 1)

We start with the complete or full model, which takes into account any possible interaction between weight and male.

Next, we define the reduced model. This is the next simplest model: in this case we remove the interaction term, leaving an additive model.

# define the full model
lm_full <- lm(eggs ~ weight * male,
              data = ladybird)
# define the additive model
lm_add <- lm(eggs ~ weight + male,
             data = ladybird)

We then extract the AIC values for each of the models:

AIC(lm_full)
[1] 100.0421
AIC(lm_add)
[1] 99.19573
# create the model
model = smf.ols(formula= "eggs ~ weight * C(male)", data = ladybird_py)
# and get the fitted parameters of the model
lm_full_py = model.fit()
# create the additive linear model
model = smf.ols(formula= "eggs ~ weight + C(male)", data = ladybird_py)
# and get the fitted parameters of the model
lm_add_py = model.fit()

We then extract the AIC values for each of the models:

lm_full_py.aic
98.04206256815984
lm_add_py.aic
97.19572958073036

Each line tells you the AIC score for that model. The full model has 4 parameters (the intercept, the coefficient for the continuous variable weight, the coefficient for the categorical variable male and a coefficient for the interaction term weight:male). The additive model has a lower AIC score, with only 3 parameters (since we've dropped the interaction term). You may also notice that the .aic values from statsmodels are exactly 2 lower than those from R's AIC(); the two count the estimated parameters slightly differently (statsmodels does not count the residual variance as a parameter), but this constant offset cancels out whenever you compare models within one language. There are different ways of interpreting AIC scores, but the most widely used interpretation says that:

  • if the difference between two AIC scores is greater than 2, then the model with the lower AIC score is better supported than the model with the higher AIC score
  • if the difference between the two models' AIC scores is less than 2, then both models are equally well supported

This choice of language (supported vs significant) is deliberate and there are areas of statistics where AIC scores are used differently from the way we are going to use them here (ask if you want a bit of philosophical ramble from me). However, in this situation we will use the AIC scores to decide whether our reduced model is at least as good as the full model. Here, since the difference in AIC scores is less than 2, we can say that dropping the interaction term has left us with a model that is both simpler (fewer terms) and at least as good (by AIC score) as the full model. As such, our additive model eggs ~ weight + male is designated our current working minimal model.
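To make that rule of thumb concrete, here is a tiny sketch; the helper name compare_aic is made up for illustration, and the values plugged in are the R scores from above.

# hypothetical helper applying the rule of thumb described above
def compare_aic(aic_simple, aic_complex, threshold = 2):
    if aic_simple <= aic_complex + threshold:
        return "simpler model is at least as well supported: keep it"
    return "complex model is better supported: keep the extra term"

# additive model (99.20) vs full model (100.04)
print(compare_aic(aic_simple = 99.20, aic_complex = 100.04))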

14.4.2 Comparing models with AIC (step 2)

Next, we see which of the remaining terms can be dropped. We will look at the models where we drop male or weight in turn (i.e. eggs ~ weight and eggs ~ male) and compare their AIC values with the AIC of our current minimal model (eggs ~ weight + male). If the AIC value of at least one of our new reduced models is lower (or at least no more than 2 greater) than the AIC of our current minimal model, then we can drop the relevant term and get ourselves a new minimal model. If we find ourselves in a situation where we can drop more than one term, we drop the term that gives us the model with the lowest AIC.

Drop the variable weight and examine the AIC:

# define the model
lm_male <- lm(eggs ~ male,
              data = ladybird)

# extract the AIC
AIC(lm_male)
[1] 118.7093

Drop the variable male and examine the AIC:

# define the model
lm_weight <- lm(eggs ~ weight,
                data = ladybird)

# extract the AIC
AIC(lm_weight)
[1] 97.52601

Drop the variable weight and examine the AIC:

# create the model
model = smf.ols(formula= "eggs ~ C(male)", data = ladybird_py)
# and get the fitted parameters of the model
lm_male_py = model.fit()

# extract the AIC
lm_male_py.aic
116.7092646564482

Drop the variable male and examine the AIC:

# create the model
model = smf.ols(formula= "eggs ~ weight", data = ladybird_py)
# and get the fitted parameters of the model
lm_weight_py = model.fit()

# extract the AIC
lm_weight_py.aic
95.52601286248304

Considering both outputs together and comparing with the AIC of our current minimal model, we can see that dropping male has decreased the AIC further, whereas dropping weight has actually increased the AIC and thus worsened the model quality.

Hence we can drop male and our new minimal model is eggs ~ weight.
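With several candidate models in play at each step, it can help to gather the AIC values in one place before deciding. A small sketch, assuming the Python models fitted above:

# assumes lm_add_py, lm_male_py and lm_weight_py from above
aics = {
    "eggs ~ weight + C(male)": lm_add_py.aic,
    "eggs ~ C(male)": lm_male_py.aic,
    "eggs ~ weight": lm_weight_py.aic,
}

# print the candidates, best (lowest AIC) first
for formula, aic in sorted(aics.items(), key = lambda kv: kv[1]):
    print(f"{aic:8.2f}  {formula}")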

14.4.3 Comparing models with AIC (step 3)

Our final comparison is to drop the variable weight and compare this simple model with a null model (eggs ~ 1), which assumes that the clutch size is constant, i.e. does not depend on any predictor.

Drop the variable weight and see if that has an effect:

# define the model
lm_null <- lm(eggs ~ 1,
              data = ladybird)

# extract the AIC
AIC(lm_null)
[1] 117.2178
# create the model
model = smf.ols(formula= "eggs ~ 1", data = ladybird_py)
# and get the fitted parameters of the model
lm_null_py = model.fit()

# extract the AIC
lm_null_py.aic
115.21783038624153

The AIC of our null model is quite a bit larger than that of our current minimal model eggs ~ weight and so we conclude that weight is important. As such our minimal model is eggs ~ weight.

So, in summary, we could conclude that:

Female size is a useful predictor of clutch size, but male type is not so important.

At this point we can then continue analysing this minimal model, by checking the diagnostic plots and checking the assumptions. If they all pan out, then we can continue with an ANOVA.
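In Python, that follow-up could look something like the sketch below, assuming ladybird_py as loaded earlier; anova_lm is one way of getting an ANOVA table for the fitted minimal model.

# a minimal sketch of the follow-up analysis on the minimal model
import statsmodels.api as sm
import statsmodels.formula.api as smf

lm_min_py = smf.ols("eggs ~ weight", data = ladybird_py).fit()

# ANOVA table for the minimal model
print(sm.stats.anova_lm(lm_min_py, typ = 2))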

14.5 Notes on Backwards Stepwise Elimination

This method of finding a minimal model by starting with a full model and removing variables is called backwards stepwise elimination. Although it is regularly practised in data analysis, the approach has attracted increasing criticism, with calls for it to be avoided entirely.

Why have we made you work through this procedure, then? Given their prevalence in academic papers, it is very useful to be aware of these procedures and to know that there are issues with them. In other situations, using AIC for model comparisons is justified and you will come across it regularly. Additionally, there may be situations where you feel there are good reasons to drop a parameter from your model; using this technique you can check that dropping it doesn't noticeably affect the model fit. Taken together, backwards stepwise elimination is still a useful model comparison technique to know.

Automatic BSE in R (but not Python)

Performing backwards stepwise elimination manually can be quite tedious. Thankfully R acknowledges this and there is a single inbuilt function called step() that can perform all of the necessary steps for you using AIC.

# define the full model
lm_full <- lm(eggs ~ weight * male,
              data = ladybird)

# perform backwards stepwise elimination
step(lm_full)

This will perform a full backwards stepwise elimination process and will find the minimal model for you.

Yes, I could have told you this earlier, but where’s the fun in that? (it is also useful for you to understand the steps behind the technique I suppose…)

When doing this in Python, you are a bit stuck. There does not seem to be an equivalent function. If you want to cobble something together yourself, then use this link as a starting point.
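For the curious, below is a minimal sketch of what such a cobbled-together version could look like. The function name backwards_step is made up, and it only mimics the basic behaviour of step(): each round it refits the model with each droppable term removed, keeps the single drop that lowers the AIC the most, and stops when no drop improves on the current model. Main effects that still appear in an interaction term are never dropped.

import pandas as pd
import statsmodels.formula.api as smf

def backwards_step(data, response, terms):
    # greedy backwards elimination by AIC, loosely mimicking R's step()
    current = list(terms)
    best_aic = smf.ols(f"{response} ~ {' + '.join(current)}", data = data).fit().aic
    while True:
        candidates = []
        for term in current:
            # a main effect is locked in while an interaction still contains it
            locked = any(term in other.split(":")
                         for other in current if other != term and ":" in other)
            if locked:
                continue
            reduced = [t for t in current if t != term]
            formula = f"{response} ~ {' + '.join(reduced) if reduced else '1'}"
            candidates.append((smf.ols(formula, data = data).fit().aic, reduced))
        if not candidates:
            break
        aic, reduced = min(candidates, key = lambda c: c[0])
        if aic >= best_aic:
            break
        best_aic, current = aic, reduced
    return current, best_aic

# example: this recovers eggs ~ weight for the ladybird data
ladybird_py = pd.read_csv("data/CS5-ladybird.csv")
print(backwards_step(ladybird_py, "eggs", ["weight", "C(male)", "weight:C(male)"]))

Note that this is a sketch rather than a drop-in replacement: it does not print the step-by-step tables that step() does, and terms have to be spelled out in full (no * shorthand).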

14.6 Exercises

We are going to practice the backwards stepwise elimination technique on some of the data sets we analysed previously.

For each of the following data sets I would like you to:

  1. Define the response variable
  2. Define the relevant predictor variables
  3. Define relevant interactions
  4. Perform a backwards stepwise elimination and discover the minimal model using AIC

NB: if an interaction term is significant then any main factor that is part of the interaction term cannot be dropped from the model.

Perform a BSE on the following data sets:

  • data/CS5-trees.csv
  • data/CS5-pm2_5.csv

14.6.1 BSE: Trees

Exercise 1

BSE on trees:

14.7 Answer

Let’s start by reading in the data and checking which variables are in the data set.

trees <- read_csv("data/CS5-trees.csv")

head(trees)
# A tibble: 6 × 3
  girth height volume
  <dbl>  <dbl>  <dbl>
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

First, we read in the data (if needed) and have a look at the variables:

trees_py = pd.read_csv("data/CS5-trees.csv")

trees_py.head()
   girth  height  volume
0    8.3      70    10.3
1    8.6      65    10.3
2    8.8      63    10.2
3   10.5      72    16.4
4   10.7      81    18.8
  1. The response variable is volume
  2. The predictor variables are girth and height
  3. The only possible interaction term is girth:height
  4. We perform a BSE on the model using the step() function

The full model is volume ~ girth * height.

We perform the BSE as follows:

Define the model and use the step() function:

# define the full model
lm_trees <- lm(volume ~ girth * height,
               data = trees)

# perform BSE
step(lm_trees)
Start:  AIC=65.49
volume ~ girth * height

               Df Sum of Sq    RSS    AIC
<none>                      198.08 65.495
- girth:height  1    223.84 421.92 86.936

Call:
lm(formula = volume ~ girth * height, data = trees)

Coefficients:
 (Intercept)         girth        height  girth:height  
     69.3963       -5.8558       -1.2971        0.1347  

We first define the full model, then the model without the interaction (model_1).

We extract the AICs for both models. Because we do not have an automated way of performing the BSE, we’re stringing together a few operations to make the code a bit more concise (getting the .fit() of the model and immediately extracting the aic value):

# define the models
model_full = smf.ols(formula= "volume ~ girth * height",
                     data = trees_py)

model_1 = smf.ols(formula= "volume ~ girth + height",
                  data = trees_py)

# get the AIC of the model
model_full.fit().aic
153.46916240903528
model_1.fit().aic
174.9099729872703

This BSE approach only gets as far as the first step (trying to drop the interaction term). We see immediately that dropping the interaction term makes the model worse. This means that the best model is still the full model.
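Incidentally, if you defined something like the backwards_step sketch from the notes section earlier, you could confirm this in a single call (assuming trees_py as loaded above):

# the interaction survives, so both main effects are locked in
print(backwards_step(trees_py, "volume", ["girth", "height", "girth:height"]))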

14.7.1 BSE: Air pollution

Exercise 2

Perform a BSE on the pm2_5 data set.

14.8 Answer

Let's start by reading in the data and checking which variables are in the data set.

pm2_5 <- read_csv("data/CS5-pm2_5.csv")

head(pm2_5)
# A tibble: 6 × 6
  avg_temp date       location pm2_5 rain_mm wind_m_s
     <dbl> <chr>      <chr>    <dbl>   <dbl>    <dbl>
1      4.5 01/01/2019 inner     17.1     2.3     3.87
2      4.9 01/01/2019 outer     10.8     2.3     5.84
3      4.3 02/01/2019 inner     14.9     2.3     3.76
4      4.8 02/01/2019 outer     11.4     2.3     6   
5      4   03/01/2019 inner     18.5     1.4     2.13
6      4.5 03/01/2019 outer     15.0     1.4     2.57
pm2_5_py = pd.read_csv("data/CS5-pm2_5.csv")

pm2_5_py.head()
   avg_temp        date location   pm2_5  rain_mm  wind_m_s
0       4.5  01/01/2019    inner  17.126      2.3      3.87
1       4.9  01/01/2019    outer  10.821      2.3      5.84
2       4.3  02/01/2019    inner  14.884      2.3      3.76
3       4.8  02/01/2019    outer  11.416      2.3      6.00
4       4.0  03/01/2019    inner  18.471      1.4      2.13
  1. The response variable is pm2_5
  2. The predictor variables are all the remaining variables apart from date (and, of course, pm2_5 itself). It would be strange to try and create a model that relies on each individual measurement date!
  3. We can add the wind_m_s:location interaction, since it appeared that there is a difference between inner and outer London pollution levels, in relation to wind speed
  4. We can start the backwards stepwise elimination with the following model:

pm2_5 ~ avg_temp + rain_mm + wind_m_s * location

# define the model
lm_pm2_5 <- lm(pm2_5 ~ avg_temp + rain_mm + wind_m_s * location,
               data = pm2_5)

# perform BSE
step(lm_pm2_5)
Start:  AIC=41.08
pm2_5 ~ avg_temp + rain_mm + wind_m_s * location

                    Df Sum of Sq    RSS     AIC
- rain_mm            1     0.351 760.01  39.412
- avg_temp           1     1.804 761.47  40.806
<none>                           759.66  41.075
- wind_m_s:location  1   134.820 894.48 158.336

Step:  AIC=39.41
pm2_5 ~ avg_temp + wind_m_s + location + wind_m_s:location

                    Df Sum of Sq    RSS     AIC
- avg_temp           1     1.758 761.77  39.098
<none>                           760.01  39.412
- wind_m_s:location  1   134.530 894.54 156.385

Step:  AIC=39.1
pm2_5 ~ wind_m_s + location + wind_m_s:location

                    Df Sum of Sq    RSS     AIC
<none>                           761.77  39.098
- wind_m_s:location  1    136.95 898.72 157.788

Call:
lm(formula = pm2_5 ~ wind_m_s + location + wind_m_s:location, 
    data = pm2_5)

Coefficients:
           (Intercept)                wind_m_s           locationouter  
               18.2422                 -0.2851                 -2.0597  
wind_m_s:locationouter  
               -0.4318  

We first define the full model, again stringing together a few operations to be more concise.

# define the model
model_full = smf.ols(formula= "pm2_5 ~ avg_temp + rain_mm + wind_m_s * C(location)", data = pm2_5_py)

# get the AIC of the model
model_full.fit().aic
2112.725435984824

Can we drop the interaction term or any of the remaining main effects?

# define the model
model_1 = smf.ols(
    formula= "pm2_5 ~ avg_temp + rain_mm + wind_m_s + C(location)",
    data = pm2_5_py)
    
model_2 = smf.ols(
    formula= "pm2_5 ~ avg_temp + wind_m_s * C(location)",
    data = pm2_5_py)
    
model_3 = smf.ols(
    formula= "pm2_5 ~ rain_mm + wind_m_s * C(location)",
    data = pm2_5_py)

# get the AIC of the models
model_1.fit().aic
2229.9865962235162
model_2.fit().aic
2111.06230638372
model_3.fit().aic
2112.456639492829
# compare to the full model
model_full.fit().aic
2112.725435984824

The AIC goes up quite a bit if we drop the interaction term. This means that we cannot drop the interaction term, nor any of the main effects that are included in the interaction. These are wind_m_s and location.

The model with the lowest AIC is the one without the rain_mm term, so our working model is:

working_model = smf.ols(
    formula= "pm2_5 ~ avg_temp + wind_m_s * C(location)",
    data = pm2_5_py)

Now we can again check if dropping the avg_temp term or the wind_m_s:location interaction has any effect on the model performance:

model_1 = smf.ols(
    formula= "pm2_5 ~ wind_m_s * C(location)",
    data = pm2_5_py)
    
model_2 = smf.ols(
    formula= "pm2_5 ~ avg_temp + wind_m_s + C(location)",
    data = pm2_5_py)

# get the AIC of the models
model_1.fit().aic
2110.748543452394
model_2.fit().aic
2228.035595215867
working_model.fit().aic
2111.06230638372

This shows that dropping the avg_temp term lowers the AIC, whereas dropping the interaction term makes the model markedly worse.

So, our new working model is:

working_model = smf.ols(
    formula= "pm2_5 ~ wind_m_s * C(location)",
    data = pm2_5_py)

Lastly, now that we've dropped the avg_temp term, we can do one final check on the interaction term:

model_1 = smf.ols(
    formula= "pm2_5 ~ wind_m_s + C(location) + wind_m_s + C(location)",
    data = pm2_5_py)

model_1.fit().aic
2229.438631394105
working_model.fit().aic
2110.748543452394

Dropping the interaction still makes the model worse.

Our minimal model is thus:

pm2_5 ~ wind_m_s + location + wind_m_s:location
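As a cross-check, a helper like the backwards_step sketch from earlier lands on the same model, and the fitted coefficients (the name final_py is just for illustration) can be pulled out directly:

# cross-check with the backwards_step sketch from earlier
print(backwards_step(pm2_5_py, "pm2_5",
                     ["avg_temp", "rain_mm", "wind_m_s",
                      "C(location)", "wind_m_s:C(location)"]))

# fit the minimal model and inspect its coefficients
final_py = smf.ols("pm2_5 ~ wind_m_s * C(location)", data = pm2_5_py).fit()
print(final_py.params)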

14.9 Summary

Key points
  • We can use Backwards Stepwise Elimination (BSE) on a full model to see if certain terms add to the predictive power of the model or not
  • The AIC allows us to compare different models: if the difference in AIC between two models is more than 2, then the model with the lower AIC score is better supported