# Load required R libraries
library(tidyverse)
library(broom)
library(lmtest)
library(performance)
# Read in the required data
<- read_csv("data/diabetes.csv")
diabetes <- read_csv("data/challenger.csv")
challenger <- read_csv("data/levers.csv") levers
9 Significance testing
- Understand the concept of statistical deviance
- Use likelihood ratio tests to perform significance testing for:
  - An entire model (versus the null model)
  - Individual predictor variables
9.1 Context
Up until now we’ve focussed on creating appropriate models for non-continuous data and making model predictions. In this section we’re going to focus on statistical significance testing.
Generalised linear models are fitted a little differently to standard linear models - namely, using maximum likelihood estimation instead of ordinary least squares for estimating the model coefficients.
As a result, we can no longer use F-tests for significance, or interpret \(R^2\) values in quite the same way. This section will introduce likelihood ratio tests, a method for extracting p-values for GLMs.
9.2 Section setup
We’ll use the following libraries and data:
# Load required Python libraries
import pandas as pd
import numpy as np
import math
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import *
from plotnine import *
# Read in the required data
= pd.read_csv("data/diabetes.csv")
diabetes = pd.read_csv("data/challenger.csv")
challenger = pd.read_csv("data/levers.csv") levers
9.3 Deviance
Several of the tests and metrics we’ll discuss below are based heavily on deviance. So, what is deviance, and where does it come from?
Here are a few key definitions:
Maximum likelihood estimation | This is the method by which we fit the GLM (i.e., find the values for the beta coefficients). As the name suggests, we are trying to find the beta coefficients that maximise the likelihood of the dataset/sample. |
Likelihood | In this context, “likelihood” refers to the joint probability of all of the data points in the sample. In other words, how likely is it that you would sample a set of data points like these, if they were being drawn from an underlying population where your model is true? Each candidate model fitted to a dataset will have its own unique likelihood. |
Saturated (perfect) model | For each dataset, there is a “saturated”, or perfect, model. This model has the same number of parameters in it as there are data points, meaning the data are fitted exactly - as if connecting the dots between them. This model has the largest possible likelihood of any model fitted to the dataset. Of course, we don’t actually use the saturated model for drawing real conclusions, but we can use it as a baseline for comparison. |
Deviance (residual deviance) | Each candidate model is compared back to the saturated model to figure out its deviance. Deviance is defined as the difference between the log-likelihood of your fitted model and the log-likelihood of the saturated model (multiplied by 2). Because deviance is all about capturing the discrepancy between fitted and actual values, it’s performing a similar function to the residual sum of squares (RSS) in a standard linear model. In fact, the RSS is really just a specific type of deviance. Sometimes, the deviance of a candidate (non-null) model is referred to more fully as “residual deviance”. |
Null deviance | One of the models that we can compare against the saturated model is the null model (a model with no predictors). This gives us the deviance value for the null model. This is the greatest deviance of any possible model that could be fitted to the data, because it explains zero variance in the response variable. |
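Written as a formula, with \(\ell\) standing for a log-likelihood, the residual deviance of a fitted model is

\[ D = 2\,(\ell_{\text{saturated}} - \ell_{\text{model}}) \]

and the null deviance is the same quantity with the null model’s log-likelihood in place of \(\ell_{\text{model}}\).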
9.4 Revisiting the diabetes dataset
As a worked example, we’ll use a logistic regression fitted to the diabetes dataset that we saw in a previous section.
<- read_csv("data/diabetes.csv") diabetes
Rows: 728 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): glucose, diastolic, test_result
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diabetes_py = pd.read_csv("data/diabetes.csv")

diabetes_py.head()
glucose diastolic test_result
0 148 72 1
1 85 66 0
2 183 64 1
3 89 66 0
4 137 40 1
As a reminder, this dataset contains three variables:
- test_result, the binary result of a diabetes test (1 for positive, 0 for negative)
- glucose, the results of a glucose tolerance test
- diastolic, blood pressure
glm_dia <- glm(test_result ~ glucose * diastolic,
               family = "binomial",
               data = diabetes)
model = smf.glm(formula = "test_result ~ glucose * diastolic",
                family = sm.families.Binomial(),
                data = diabetes_py)

glm_dia_py = model.fit()
9.5 What are the p-values in the summary?
You might have noticed that when you use summary to see the model output, it comes with some p-values automatically.
What are they? Can you use/interpret them?
summary(glm_dia)
Call:
glm(formula = test_result ~ glucose * diastolic, family = "binomial",
data = diabetes)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.5710565 2.7032318 -3.171 0.00152 **
glucose 0.0547050 0.0209256 2.614 0.00894 **
diastolic 0.0423651 0.0363681 1.165 0.24406
glucose:diastolic -0.0002221 0.0002790 -0.796 0.42590
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 936.60 on 727 degrees of freedom
Residual deviance: 748.01 on 724 degrees of freedom
AIC: 756.01
Number of Fisher Scoring iterations: 4
print(glm_dia_py.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: test_result No. Observations: 728
Model: GLM Df Residuals: 724
Model Family: Binomial Df Model: 3
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -374.00
Date: Fri, 25 Jul 2025 Deviance: 748.01
Time: 08:28:22 Pearson chi2: 720.
No. Iterations: 5 Pseudo R-squ. (CS): 0.2282
Covariance Type: nonrobust
=====================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -8.5711 2.703 -3.171 0.002 -13.869 -3.273
glucose 0.0547 0.021 2.614 0.009 0.014 0.096
diastolic 0.0424 0.036 1.165 0.244 -0.029 0.114
glucose:diastolic -0.0002 0.000 -0.796 0.426 -0.001 0.000
=====================================================================================
Each individual parameter, or coefficient, has its own z-value and associated p-value. In each case, a hypothesis test has been performed - these are formally called Wald tests.
The null hypothesis for these Wald tests is that the value of the coefficient = 0. The idea is that if a coefficient isn’t significantly different from 0, then that parameter isn’t useful and could be dropped from the model.
These tests are the equivalent of the t-tests that are calculated as part of the summary output for standard linear models.
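Each z-value is simply the estimated coefficient divided by its standard error; for glucose, for instance, that’s \(0.0547050 / 0.0209256 \approx 2.61\), which is then compared against a standard normal distribution to obtain the p-value.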
In some cases, you can. However, there are a few cases where they don’t give you all the info you need.
Firstly: they don’t tell you about the significance of the model as a whole (versus the null model).
Secondly: for categorical predictors, you will get a separate Wald p-value for each non-reference group (compared back to the reference group). This is not the same as a p-value for the categorical predictor as a whole. The Wald p-values can also be heavily affected by which group was chosen as the reference.
It’s typically preferable to use a likelihood ratio test instead.
9.6 Likelihood ratio tests (LRTs)
When we were assessing the significance of standard linear models, we were able to use the F-statistic to determine:
- the significance of the model versus a null model, and
- the significance of individual predictors.
We can’t use these F-tests for GLMs, but we can use LRTs in a very similar way, to calculate p-values for both the model as a whole, and for individual variables. This works because, in general, if you have a test statistic and you know the distribution of that test statistic, then you can use this to calculate a p-value.
These tests are all built on the idea of deviance, or the likelihood ratio, as discussed above on this page. We can compare any two models fitted to the same dataset by looking at the difference in their deviances, also known as the difference in their log-likelihoods, or more simply as a likelihood ratio.
Helpfully, this likelihood ratio approximately follows a chi-square distribution, which we can capitalise on to calculate a p-value. All we need is the number of degrees of freedom, which is equal to the difference in the number of parameters of the two models you’re comparing.
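In notation, if \(\ell_{\text{simple}}\) and \(\ell_{\text{complex}}\) are the two models’ log-likelihoods, and \(D_{\text{simple}}\) and \(D_{\text{complex}}\) their deviances, the test statistic we’ll be calculating is

\[ \chi^2 = 2\,(\ell_{\text{complex}} - \ell_{\text{simple}}) = D_{\text{simple}} - D_{\text{complex}} \]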
Importantly, we are only able to use this sort of test when one of the two models that we are comparing is a “simpler” version of the other, i.e., one model has a subset of the parameters of the other model.
So while we could perform an LRT just fine between these two models: Y ~ A + B + C and Y ~ A + B + C + D, or between any model and the null (Y ~ 1), we would not be able to use this test to compare Y ~ A + B + C and Y ~ A + B + D.
9.6.1 Testing the model versus the null
Since LRTs involve making a comparison between two models, we must first decide which models we’re comparing, and check that one model is a “subset” of the other.
Let’s use an example from a previous section of the course, where we fitted a logistic regression to the diabetes dataset.
The first step is to create the two models that we want to compare: our original model, and the null model (with and without predictors, respectively).
glm_dia <- glm(test_result ~ glucose * diastolic,
               family = "binomial",
               data = diabetes)

glm_null <- glm(test_result ~ 1,
                family = binomial,
                data = diabetes)
Then, we use the lrtest function from the lmtest package to perform the test itself; we include both the models that we want to compare, listing them one after another.
lrtest(glm_dia, glm_null)
Likelihood ratio test
Model 1: test_result ~ glucose * diastolic
Model 2: test_result ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -374.0
2 1 -468.3 -3 188.59 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see from the output that our chi-square statistic is significant, with a really small p-value. This tells us that, for the difference in degrees of freedom (here, that’s 3), the change in deviance is actually quite big. (In this case, you can use summary(glm_dia) to see those deviances - 936 versus 748!)
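In fact, the chi-square statistic reported by lrtest is exactly that drop in deviance: \(936.60 - 748.01 = 188.59\), tested on \(727 - 724 = 3\) degrees of freedom.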
In other words, our model is better than the null.
The first step is to create the two models that we want to compare: our original model, and the null model (with and without predictors, respectively).
model = smf.glm(formula = "test_result ~ glucose * diastolic",
                family = sm.families.Binomial(),
                data = diabetes_py)

glm_dia_py = model.fit()

model = smf.glm(formula = "test_result ~ 1",
                family = sm.families.Binomial(),
                data = diabetes_py)

glm_null_py = model.fit()
Unlike in R, there isn’t a nice neat function for extracting the \(\chi^2\) value, so we have to do a little bit of work by hand.
# calculate the likelihood ratio (i.e. the chi-square value)
lrstat = -2*(glm_null_py.llf - glm_dia_py.llf)

# calculate the associated p-value
pvalue = chi2.sf(lrstat, glm_null_py.df_resid - glm_dia_py.df_resid)

print(lrstat, pvalue)
188.5931483744455 1.228870036004379e-40
This gives us the likelihood ratio, based on the log-likelihoods that we’ve extracted directly from the models, which approximates a chi-square distribution.
We’ve also calculated the associated p-value, by providing the difference in degrees of freedom between the two models (in this case, that’s 3, though rather than counting parameters by hand it’s easier to extract the residual degrees of freedom directly from the models, as we’ve done here).
Here, we have a large chi-square statistic and a small p-value. This tells us that, for the difference in degrees of freedom (here, that’s 3), the change in deviance is actually quite big. (In this case, you can access those deviances directly as glm_null_py.deviance and glm_dia_py.deviance - 936 versus 748!)
In other words, our model is better than the null.
9.6.2 Testing individual predictors
As well as testing the overall model versus the null, we might want to test particular predictors to determine whether they are individually significant.
The way to achieve this is essentially to perform a series of “targeted” likelihood ratio tests. In each LRT, we’ll compare two models that are almost identical - one with, and one without, our variable of interest in each case.
The first step is to construct a new model that doesn’t contain our predictor of interest. Let’s test the glucose:diastolic interaction.
glm_dia_add <- glm(test_result ~ glucose + diastolic,
                   family = "binomial",
                   data = diabetes)
Now, we can use the lrtest function (or the anova function) to compare the models with and without the interaction:
lrtest(glm_dia, glm_dia_add)
Likelihood ratio test
Model 1: test_result ~ glucose * diastolic
Model 2: test_result ~ glucose + diastolic
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -374.00
2 3 -374.32 -1 0.6288 0.4278
This tells us that our interaction glucose:diastolic isn’t significant - our more complex model doesn’t have a meaningful reduction in deviance.
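As mentioned above, the anova function can perform the same pairwise comparison; a minimal sketch, equivalent to the lrtest call (output not shown):

anova(glm_dia_add, glm_dia, test = "Chisq")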
This might, however, seem like a slightly clunky way to test each individual predictor. Luckily, we can also use our trusty anova function with an extra argument to tell us about individual predictors.
By specifying that we want to use a chi-squared test, we are able to construct an analysis of deviance table (as opposed to an analysis of variance table) that will perform the likelihood ratio tests for us for each predictor:
anova(glm_dia, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: test_result
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 727 936.60
glucose 1 184.401 726 752.20 < 2e-16 ***
diastolic 1 3.564 725 748.64 0.05905 .
glucose:diastolic 1 0.629 724 748.01 0.42779
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You’ll spot that the p-values we get from the analysis of deviance table match the p-values you could calculate yourself using lrtest; this is just more efficient when you have a complex model!
The first step is to construct a new model that doesn’t contain our predictor of interest. Let’s test the glucose:diastolic interaction.
model = smf.glm(formula = "test_result ~ glucose + diastolic",
                family = sm.families.Binomial(),
                data = diabetes_py)

glm_dia_add_py = model.fit()
We’ll then use the same code we used above, to compare the models with and without the interaction:
lrstat = -2*(glm_dia_add_py.llf - glm_dia_py.llf)

pvalue = chi2.sf(lrstat, glm_dia_py.df_model - glm_dia_add_py.df_model)

print(lrstat, pvalue)
0.6288201373599804 0.42778842576800746
This tells us that our interaction glucose:diastolic isn’t significant - our more complex model doesn’t have a meaningful reduction in deviance.
9.7 Exercises
9.7.1 Predicting failure
9.7.2 Predicting failure (with a tweak)
9.7.3 Revisiting rats and levers
9.8 Summary
Likelihood and deviance are very important in generalised linear models - not just for fitting the model via maximum likelihood estimation, but for assessing significance and goodness-of-fit. To determine the quality of a model and draw conclusions from it, it’s important to assess both of these things.
- Deviance measures the discrepancy between fitted and actual values, and is calculated by comparing a model’s log-likelihood to that of the perfect “saturated” model
- Using deviance, likelihood ratio tests can be used in lieu of F-tests for generalised linear models
- This is distinct from (and often better than) using the Wald p-values that are reported automatically in the model summary