In the previous example we used a single data set and fitted five linear models to it, depending on which predictor variables we used. Whilst this was fun (seriously, what else would you be doing right now?), it seems that there should be a "better way". Well, thankfully there is! In fact, there are several methods that can be used to compare different models and help identify "the best" one. More specifically, we can determine whether a full model (which uses all available predictor variables and interactions) is necessary to appropriately describe the dependent variable, or whether we can throw away some of the terms (e.g. an interaction term) because they don't offer any useful predictive power.
Here we will use the Akaike Information Criterion (AIC) to compare different models.
17.3 Data and hypotheses
This section uses the data/CS5-ladybird.csv data set. The data set comprises 20 observations of three variables (one dependent and two predictor). It records the clutch size (eggs) in a species of ladybird, alongside two potential predictor variables: the mass of the female (weight) and the colour of the male (male), which is a categorical variable.
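As a minimal sketch (assuming pandas is available and the data folder sits in the working directory), the data can be read in and inspected like this:

import pandas as pd

# read in the ladybird data and have a quick look
ladybird_py = pd.read_csv("data/CS5-ladybird.csv")
ladybird_py.head()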
There aren’t a huge number of data points in each group, so we need to be a bit cautious about drawing any firm conclusions.
There is quite a bit of spread in the egg clutch sizes, with two observations in the Melanic group being rather low.
From the box plot, there does not seem to be much difference in the egg clutch size between the two male colour groups.
The scatter plot suggests that egg clutch size increases roughly linearly with the weight of the female. There does not seem to be much difference between the two male colour groups in this respect.
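To reproduce plots along these lines in Python, a minimal sketch with pandas and matplotlib (the exact plot styling in the course may differ) is:

import matplotlib.pyplot as plt

# box plot of clutch size per male colour group
ladybird_py.boxplot(column = "eggs", by = "male")

# scatter plot of clutch size against female weight
ladybird_py.plot.scatter(x = "weight", y = "eggs")
plt.show()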
17.4.1 Comparing models with AIC (step 1)
We start with the complete or full model, which takes into account any possible interaction between weight and male:

# define the full model
lm_full <- lm(eggs ~ weight * male, data = ladybird)
Next, we define the reduced model. This is the next simplest model, where we remove the interaction and construct an additive model:

# define the additive model
lm_add <- lm(eggs ~ weight + male, data = ladybird)
We then extract the AIC values for each of the models:
AIC(lm_full)
[1] 100.0421
AIC(lm_add)
[1] 99.19573
# create the model
model = smf.ols(formula = "eggs ~ weight * C(male)", data = ladybird_py)
# and get the fitted parameters of the model
lm_full_py = model.fit()

# create the additive linear model
model = smf.ols(formula = "eggs ~ weight + C(male)", data = ladybird_py)
# and get the fitted parameters of the model
lm_add_py = model.fit()
We then extract the AIC values for each of the models:
lm_full_py.aic
98.04206256815984
lm_add_py.aic
97.19572958073036
Each line tells you the AIC score for that model. The full model has 4 parameters (the intercept, the coefficient for the continuous variable weight, the coefficient for the categorical variable male and a coefficient for the interaction term weight:male). The additive model has a lower AIC score, with only 3 parameters (since we’ve dropped the interaction term). There are different ways of interpreting AIC scores, but the most widely used interpretation says that:
if the difference between two AIC scores is greater than 2, then the model with the smallest AIC score is more supported than the model with the higher AIC score
if the difference between the two models’ AIC scores is less than 2 then both models are equally well supported
This choice of language (supported vs significant) is deliberate, and there are areas of statistics where AIC scores are used differently from the way we are going to use them here (ask if you want a bit of philosophical rambling from me). However, in this situation we will use the AIC scores to decide whether our reduced model is at least as good as the full model. Here, since the difference in AIC scores is less than 2, we can say that dropping the interaction term has left us with a model that is both simpler (fewer terms) and at least as good (in terms of AIC) as the full model. As such, our additive model eggs ~ weight + male is designated our current working minimal model.
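As a quick sanity check, here is a sketch using the fitted Python models from above; the comments show the values we expect given the outputs earlier:

# number of estimated parameters in each model
len(lm_full_py.params)   # 4
len(lm_add_py.params)    # 3

# difference in AIC between the two models
lm_full_py.aic - lm_add_py.aic   # about 0.85, i.e. less than 2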
17.4.2 Comparing models with AIC (step 2)
Next, we see which of the remaining terms can be dropped. We will look at the models where we have dropped either weight or male (i.e. eggs ~ male and eggs ~ weight) and compare their AIC values with the AIC of our current minimal model (eggs ~ weight + male). If the AIC value of at least one of our new reduced models is lower (or at least no more than 2 greater) than the AIC of our current minimal model, then we can drop the relevant term and get ourselves a new minimal model. If we find ourselves in a situation where we can drop more than one term, we will drop the term that gives us the model with the lowest AIC.
Drop the variable weight and examine the AIC:

# define the model
lm_male <- lm(eggs ~ male, data = ladybird)
# extract the AIC
AIC(lm_male)
[1] 118.7093
Drop the variable male and examine the AIC:
# define the model
lm_weight <- lm(eggs ~ weight, data = ladybird)
# extract the AIC
AIC(lm_weight)
[1] 97.52601
Drop the variable weight and examine the AIC:
# create the modelmodel = smf.ols(formula="eggs ~ C(male)", data = ladybird_py)# and get the fitted parameters of the modellm_male_py = model.fit()# extract the AIClm_male_py.aic
116.7092646564482
Drop the variable male and examine the AIC:
# create the modelmodel = smf.ols(formula="eggs ~ weight", data = ladybird_py)# and get the fitted parameters of the modellm_weight_py = model.fit()# extract the AIClm_weight_py.aic
95.52601286248304
Considering both outputs together and comparing them with the AIC of our current minimal model, we can see that dropping male has decreased the AIC further, whereas dropping weight has actually increased the AIC and thus made the model worse.
Hence we can drop male and our new minimal model is eggs ~ weight.
17.4.3 Comparing models with AIC (step 3)
Our final comparison is to drop the variable weight and compare this simple model with a null model (eggs ~ 1), which assumes that the clutch size is constant, i.e. does not depend on any of the predictors.
Drop the variable weight and see if that has an effect:
# define the model
lm_null <- lm(eggs ~ 1, data = ladybird)
# extract the AIC
AIC(lm_null)
[1] 117.2178
# create the modelmodel = smf.ols(formula="eggs ~ 1", data = ladybird_py)# and get the fitted parameters of the modellm_null_py = model.fit()# extract the AIClm_null_py.aic
115.21783038624153
The AIC of our null model is quite a bit larger than that of our current minimal model eggs ~ weight and so we conclude that weight is important. As such our minimal model is eggs ~ weight.
So, in summary, we could conclude that:
Female size is a useful predictor of clutch size, but male type is not so important.
At this point we can then continue analysing this minimal model by checking the diagnostic plots and verifying the assumptions. If they all pan out, then we can continue with an ANOVA.
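As a minimal sketch in Python (assuming the fitted lm_weight_py model from earlier; the exact diagnostic plots used in this course may differ):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# residuals vs fitted values, to check for patterns
plt.scatter(lm_weight_py.fittedvalues, lm_weight_py.resid)
plt.show()

# Q-Q plot of the residuals, to check normality
sm.qqplot(lm_weight_py.resid, line = "s")
plt.show()

# ANOVA table for the minimal model
sm.stats.anova_lm(lm_weight_py)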
17.5 Notes on Backwards Stepwise Elimination
This method of finding a minimal model by starting with a full model and removing variables is called backwards stepwise elimination. Although it is regularly practised in data analysis, this approach has attracted increasing criticism, with calls for it to be avoided entirely.
Why have we made you work through this procedure, then? Given their prevalence in academic papers, it is very useful to be aware of these procedures and to know that there are issues with them. In other situations, using AIC for model comparison is justified, and you will come across it regularly. Additionally, there may be situations where you feel there are good reasons to drop a parameter from your model – using this technique you can check that doing so doesn’t markedly affect the model fit. Taken together, backwards stepwise elimination is still a useful technique for model comparison.
Automatic BSE in R (but not Python)
Performing backwards stepwise elimination manually can be quite tedious. Thankfully R acknowledges this: there is a single built-in function called step() that can perform all of the necessary steps for you using AIC.
# define the full model
lm_full <- lm(eggs ~ weight * male, data = ladybird)
# perform backwards stepwise elimination
step(lm_full)
This will perform a full backwards stepwise elimination process and will find the minimal model for you.
Yes, I could have told you this earlier, but where’s the fun in that? (it is also useful for you to understand the steps behind the technique I suppose…)
When doing this in Python, you are a bit stuck: there does not seem to be an equivalent function. If you want to cobble something together yourself, then use this link as a starting point.
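To give a flavour of what such a function could look like, here is a minimal, unpolished sketch. The function name bse_by_aic is our own; it drops terms while the AIC keeps decreasing (similar in spirit to R’s step()), and it does not enforce the rule that main effects must stay while an interaction containing them remains in the model:

import statsmodels.formula.api as smf

def bse_by_aic(response, terms, data):
    # current set of model terms, e.g. ["weight", "C(male)", "weight:C(male)"]
    current = list(terms)
    best_aic = smf.ols(f"{response} ~ {' + '.join(current)}", data = data).fit().aic
    while len(current) > 1:
        # fit every model with one term dropped
        candidates = [[t for t in current if t != term] for term in current]
        aics = [smf.ols(f"{response} ~ {' + '.join(c)}", data = data).fit().aic
                for c in candidates]
        # stop when no single drop improves the AIC
        if min(aics) >= best_aic:
            break
        best_aic = min(aics)
        current = candidates[aics.index(min(aics))]
    return current, best_aic

# usage, e.g. on the ladybird data from earlier:
# bse_by_aic("eggs", ["weight", "C(male)", "weight:C(male)"], ladybird_py)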
17.6 Exercises
We are going to practise the backwards stepwise elimination technique on some of the data sets we analysed previously.
For each of the following data sets I would like you to:
Define the response variable
Define the relevant predictor variables
Define relevant interactions
Perform a backwards stepwise elimination and discover the minimal model using AIC
NB: if an interaction term is kept in the model, then any main effect that is part of that interaction term cannot be dropped from the model.
Perform a BSE on the following data sets:
data/CS5-trees.csv
data/CS5-pm2_5.csv
17.6.1 BSE: Trees
Exercise
BSE on trees:
Answer
Let’s start by reading in the data and checking which variables are in the data set.
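A minimal way to do this in Python (assuming pandas is imported as pd) might be:

# read in the trees data and have a quick look
trees_py = pd.read_csv("data/CS5-trees.csv")
trees_py.head()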
We first define the full model, then the model without the interaction (model_1).
We extract the AICs for both models. Because we do not have an automated way of performing the BSE, we’re stringing together a few operations to make the code a bit more concise (getting the .fit() of the model and immediately extracting the aic value):
# define the models
model_full = smf.ols(formula = "volume ~ girth * height", data = trees_py)
model_1 = smf.ols(formula = "volume ~ girth + height", data = trees_py)
# get the AIC of the model
model_full.fit().aic
153.46916240903528
model_1.fit().aic
174.9099729872703
This BSE approach only gets as far as the first step (trying to drop the interaction term). We see immediately that dropping the interaction term makes the model markedly worse: the AIC increases by more than 20. This means that the best model is still the full model.
17.6.2 BSE: Air pollution
Exercise
Perform a BSE on pm2_5. Let’s start by reading in the data and checking which variables are in the data set.
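A minimal sketch for reading in the data (again assuming pandas is imported as pd):

# read in the air pollution data and have a quick look
pm2_5_py = pd.read_csv("data/CS5-pm2_5.csv")
pm2_5_py.head()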
The predictor variables are all the variables apart from pm2_5 itself (the response variable) and date – it would be strange to try and create a model that relies on each individual measurement date!
We can add the wind_m_s:location interaction, since it appeared that there is a difference between inner and outer London pollution levels in relation to wind speed.
We can start the backwards stepwise elimination with the following model:

pm2_5 ~ avg_temp + rain_mm + wind_m_s + location + wind_m_s:location
We first define the full model, again stringing together a few operations to be more concise.
# define the model
model_full = smf.ols(formula = "pm2_5 ~ avg_temp + rain_mm + wind_m_s * C(location)", data = pm2_5_py)
# get the AIC of the model
model_full.fit().aic
2112.725435984824
Can we drop the interaction term or any of the remaining main effects?
# define the models
model_1 = smf.ols(formula = "pm2_5 ~ avg_temp + rain_mm + wind_m_s + C(location)", data = pm2_5_py)
model_2 = smf.ols(formula = "pm2_5 ~ avg_temp + wind_m_s * C(location)", data = pm2_5_py)
model_3 = smf.ols(formula = "pm2_5 ~ rain_mm + wind_m_s * C(location)", data = pm2_5_py)
# get the AIC of the models
model_1.fit().aic
2229.9865962235162
model_2.fit().aic
2111.06230638372
model_3.fit().aic
2112.456639492829
# compare to the full model
model_full.fit().aic
2112.725435984824
The AIC goes up quite a bit if we drop the interaction term. This means that we cannot drop the interaction term, nor any of the main effects that are included in the interaction. These are wind_m_s and location.
The model with the lowest AIC is the one without the rain_mm term (model_2), so our working model is:

pm2_5 ~ avg_temp + wind_m_s + location + wind_m_s:location
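The working_model object used below is not defined in the code above; a definition consistent with its AIC output (which matches model_2’s) would be:

# define the current working model
working_model = smf.ols(formula = "pm2_5 ~ avg_temp + wind_m_s * C(location)", data = pm2_5_py)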
Now we can again check if dropping the avg_temp term or the wind_m_s:location interaction has any effect on the model performance:
# define the models
model_1 = smf.ols(formula = "pm2_5 ~ wind_m_s * C(location)", data = pm2_5_py)
model_2 = smf.ols(formula = "pm2_5 ~ avg_temp + wind_m_s + C(location)", data = pm2_5_py)
# get the AIC of the models
model_1.fit().aic
2110.748543452394
model_2.fit().aic
2228.035595215867
working_model.fit().aic
2111.06230638372
This shows that dropping the avg_temp term lowers the AIC slightly, whereas dropping the interaction term makes the model markedly worse. We therefore drop avg_temp; the interaction term (and the main effects it contains) must stay.
Our minimal model is thus:
pm2_5 ~ wind_m_s + location + wind_m_s:location
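To round things off, here is a sketch of fitting this minimal model (equivalent to wind_m_s * C(location) in formula notation) and inspecting the coefficients; the name lm_min_py is our own:

# fit the minimal model and inspect the estimated coefficients
lm_min_py = smf.ols(formula = "pm2_5 ~ wind_m_s * C(location)", data = pm2_5_py).fit()
print(lm_min_py.params)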
17.7 Summary
Key points
We can use Backwards Stepwise Elimination (BSE) on a full model to see if certain terms add to the predictive power of the model or not
The AIC allows us to compare different models: if the difference in AIC between two models is more than 2, then the model with the smallest AIC score is more supported