# Load required R libraries
library(tidyverse)
library(broom)
library(performance)
library(MASS)
# Read in the required data
parasites <- read_csv("data/parasites.csv")
bacteria <- read_csv("data/bacteria.csv")
galapagos <- read_csv("data/galapagos.csv")
13 Overdispersed count data
- Understand what dispersion is and why it’s important
- Be able to diagnose or recognise overdispersion in count data
- Fit a negative binomial regression model to overdispersed count data
- Know the difference between Poisson and negative binomial regression
- Assess whether assumptions are met and evaluate fit of a negative binomial model
13.1 Context
In the previous chapter we looked at how to analyse count data. We used a Poisson regression to do this, which makes assumptions about the dispersion parameter. If everything is as expected (the mean and variance are the same), this parameter is 1. But data are rarely that compliant, so we need to be able to deal with situations where this is not the case. Here, we look at the situation where we have overdispersion.
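To see what that means in practice, here is a minimal simulated sketch (not part of this chapter's datasets): Poisson counts have variance roughly equal to their mean, whereas negative binomial counts can have a much larger variance.

```r
set.seed(1)

# Poisson draws: variance is (approximately) equal to the mean
x_pois <- rpois(1000, lambda = 10)
c(mean = mean(x_pois), var = var(x_pois))   # both close to 10

# Negative binomial draws with the same mean, but theta (size) = 2:
# variance = mu + mu^2 / theta = 10 + 100 / 2 = 60
x_nb <- rnbinom(1000, mu = 10, size = 2)
c(mean = mean(x_nb), var = var(x_nb))       # variance far exceeds the mean
```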
13.2 Section setup
We’ll use the following libraries and data:
# Load required Python libraries
import pandas as pd
import numpy as np
import math
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.discrete.discrete_model import NegativeBinomial
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import *
from patsy import dmatrix
from plotnine import *
# Read in the required data
parasites = pd.read_csv("data/parasites.csv")
bacteria = pd.read_csv("data/bacteria.csv")
13.3 Removing dispersion assumptions
When we analysed the count data in the previous chapter, we used a Poisson regression. This makes the assumption that the dispersion parameter \(\approx 1\) (for a refresher on dispersion, see Section 12.7).
Here we look at a situation where that assumption is violated and the dispersion parameter is much larger than 1. For this we turn to negative binomial models. Instead of making an assumption about the dispersion parameter, they estimate it as part of the model fitting.
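Concretely, writing \(\mu\) for the expected count and \(\theta\) for the dispersion parameter, the two models assume different mean-variance relationships:

\[ \text{Poisson: } Var(Y) = \mu \qquad \text{Negative binomial: } Var(Y) = \mu + \frac{\mu^2}{\theta} \]

The extra \(\mu^2 / \theta\) term allows the variance to exceed the mean; the smaller the estimated \(\theta\), the stronger the overdispersion.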
13.4 Parasites dataset
We’ll explore this with the `parasites` dataset. These data are about parasites found on the gills of freshwater fish. The ecologists who conducted the research observed a total of 64 fish, which varied by:
- which `lake` they were found in
- the `fish_length` (cm)
They were interested in how these factors influenced the `parasite_count` of small parasites found on each fish.
13.5 Load and visualise the data
First we load the data, then we visualise it.
parasites <- read_csv("data/parasites.csv")

head(parasites)
# A tibble: 6 × 3
parasite_count lake fish_length
<dbl> <chr> <dbl>
1 46 C 21.1
2 138 C 26.4
3 23 C 18.9
4 35 B 27.2
5 118 C 29
6 43 B 24.2
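Before plotting, a quick numerical check already hints that a Poisson model might struggle: we can compare the overall mean and variance of the counts (a rough heuristic only, since it ignores the predictors).

```r
# Compare the raw mean and variance of the response
parasites %>%
  summarise(mean_count = mean(parasite_count),
            var_count  = var(parasite_count))
```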
ggplot(parasites, aes(x = fish_length, y = parasite_count, colour = lake)) +
geom_point()
ggplot(parasites, aes(x = lake, y = parasite_count, colour = fish_length)) +
geom_violin() +
geom_jitter(width = 0.2)
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
parasites = pd.read_csv("data/parasites.csv")

parasites.head()
parasite_count lake fish_length
0 46 C 21.1
1 138 C 26.4
2 23 C 18.9
3 35 B 27.2
4 118 C 29.0
p = (ggplot(parasites, aes(x = "fish_length", y = "parasite_count",
                           colour = "lake")) +
     geom_point())

p.show()
We get a reasonably clear picture here; it seems that lake C might have a higher number of parasites overall, and that across all three lakes, bigger fish have more parasites too.
13.6 Constructing a Poisson model
Given that the response variable, `parasite_count`, is a count variable, let’s try to construct a Poisson regression.
We’ll construct the full model, with an interaction.
glm_para <- glm(parasite_count ~ lake * fish_length,
                data = parasites, family = "poisson")
and we look at the model summary:
summary(glm_para)
Call:
glm(formula = parasite_count ~ lake * fish_length, family = "poisson",
data = parasites)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.656e+00 1.763e-01 9.391 <2e-16 ***
lakeB -7.854e-01 2.804e-01 -2.801 0.0051 **
lakeC 3.527e-01 2.261e-01 1.560 0.1188
fish_length 9.127e-02 6.643e-03 13.738 <2e-16 ***
lakeB:fish_length 1.523e-02 1.031e-02 1.477 0.1398
lakeC:fish_length -5.079e-05 8.389e-03 -0.006 0.9952
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2073.8 on 63 degrees of freedom
Residual deviance: 1056.5 on 58 degrees of freedom
AIC: 1429.2
Number of Fisher Scoring iterations: 5
model = smf.glm(formula = "parasite_count ~ lake * fish_length",
                family = sm.families.Poisson(),
                data = parasites)

glm_para = model.fit()
Let’s look at the model output:
print(glm_para.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: parasite_count No. Observations: 64
Model: GLM Df Residuals: 58
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -708.61
Date: Fri, 25 Jul 2025 Deviance: 1056.5
Time: 08:28:56 Pearson chi2: 1.07e+03
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=========================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 1.6556 0.176 9.391 0.000 1.310 2.001
lake[T.B] -0.7854 0.280 -2.801 0.005 -1.335 -0.236
lake[T.C] 0.3527 0.226 1.560 0.119 -0.090 0.796
fish_length 0.0913 0.007 13.738 0.000 0.078 0.104
lake[T.B]:fish_length 0.0152 0.010 1.477 0.140 -0.005 0.035
lake[T.C]:fish_length -5.079e-05 0.008 -0.006 0.995 -0.016 0.016
=========================================================================================
13.6.1 Checking dispersion
Before we launch into any significance testing, we’re going to check one of the assumptions: that the dispersion parameter = 1 (you might’ve guessed where this is going…)
check_overdispersion(glm_para)
# Overdispersion test
dispersion ratio = 18.378
Pearson's Chi-Squared = 1065.905
p-value = < 0.001
Overdispersion detected.
print(glm_para.deviance/glm_para.df_resid)
18.215417572845308
pvalue = chi2.sf(glm_para.deviance, glm_para.df_resid)
print(pvalue)
2.3125874815114225e-183
Oh no - there is definitely overdispersion here.
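As an aside, you may notice the R and Python dispersion ratios differ slightly: `check_overdispersion()` bases its ratio on the Pearson chi-squared statistic (1065.905 / 58 ≈ 18.4), whereas the Python snippet divides the residual deviance by the residual degrees of freedom (1056.5 / 58 ≈ 18.2). Either way, the ratio is a long way above 1.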
We won’t bother looking at the analysis of deviance table or asking whether the model is better than the null model. The Poisson regression we’ve fitted here is not right, and we need to fit a different type of model.
The main alternative to a Poisson model, used for overdispersed count data, is something called a negative binomial model.
13.7 Negative binomial model
To specify a negative binomial model, we use the `glm.nb` function from the `MASS` package. The syntax is the same as `glm`, except we don’t need to specify a `family` argument.
nb_para <- glm.nb(parasite_count ~ lake * fish_length, parasites)

summary(nb_para)
Call:
glm.nb(formula = parasite_count ~ lake * fish_length, data = parasites,
init.theta = 3.37859216, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.943370 0.739434 2.628 0.00858 **
lakeB -0.784885 1.094625 -0.717 0.47335
lakeC 0.212060 1.006021 0.211 0.83305
fish_length 0.079939 0.029485 2.711 0.00671 **
lakeB:fish_length 0.015428 0.043380 0.356 0.72210
lakeC:fish_length 0.005719 0.039542 0.145 0.88501
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(3.3786) family taken to be 1)
Null deviance: 118.202 on 63 degrees of freedom
Residual deviance: 67.288 on 58 degrees of freedom
AIC: 615.99
Number of Fisher Scoring iterations: 1
Theta: 3.379
Std. Err.: 0.623
2 x log-likelihood: -601.991
This output is very similar to the other GLM outputs that we’ve seen, but with some additional information at the bottom regarding the dispersion parameter that the negative binomial model has used, which in R is called theta (\(\theta\)); here, \(\theta = 3.379\).
This number is estimated from the data. This is what makes negative binomial regression different from Poisson regression.
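If you want to grab the estimate programmatically rather than reading it off the summary, it is stored on the fitted object (a small sketch using the components of a `glm.nb` fit):

```r
# Theta and its standard error, as reported at the bottom of the summary
nb_para$theta     # 3.379
nb_para$SE.theta  # 0.623
```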
We can continue to use `statsmodels`, specifically the `NegativeBinomial` function. The syntax for getting this to work is a little different, and takes several steps.
First, we identify our response variable:
Y = parasites['parasite_count']
Now, we need to set up our set of predictors. The function we’re using requires us to provide a matrix where each row is a unique observation (fish, in this case) and each column is a predictor.
Because we want to include the `lake:fish_length` interaction, we actually need to manually generate those columns ourselves, like this:
# 1. Create dummy variables for 'lake' (switch to True/False values)
lake_dummies = pd.get_dummies(parasites['lake'], drop_first=True)

# 2. Create interaction terms manually: lake_dummies * fish_length
interactions = lake_dummies.multiply(parasites['fish_length'], axis=0)
interactions.columns = [f'{col}:fish_length' for col in interactions.columns]

# 3. Combine all predictors: fish_length, lake dummies, interactions
X = pd.concat([parasites['fish_length'], lake_dummies, interactions], axis=1)

# 4. Add a constant/intercept (required), and make sure that we're using floats
X = sm.add_constant(X).astype(float)
Now, we can fit our model and look at the summary:
from statsmodels.discrete.discrete_model import NegativeBinomial

# Specify the model (Y, X)
model = NegativeBinomial(Y, X)

nb_para = model.fit()
Optimization terminated successfully.
Current function value: 4.703057
Iterations: 32
Function evaluations: 39
Gradient evaluations: 39
print(nb_para.summary())
NegativeBinomial Regression Results
==============================================================================
Dep. Variable: parasite_count No. Observations: 64
Model: NegativeBinomial Df Residuals: 58
Method: MLE Df Model: 5
Date: Fri, 25 Jul 2025 Pseudo R-squ.: 0.05947
Time: 08:28:56 Log-Likelihood: -301.00
converged: True LL-Null: -320.03
Covariance Type: nonrobust LLR p-value: 3.658e-07
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
const 1.9434 0.689 2.819 0.005 0.592 3.294
fish_length 0.0799 0.027 2.915 0.004 0.026 0.134
B -0.7849 1.026 -0.765 0.444 -2.796 1.226
C 0.2121 0.961 0.221 0.825 -1.671 2.095
B:fish_length 0.0154 0.041 0.380 0.704 -0.064 0.095
C:fish_length 0.0057 0.038 0.152 0.879 -0.068 0.080
alpha 0.2960 0.055 5.424 0.000 0.189 0.403
=================================================================================
In particular, we might like to know what the dispersion parameter is. By default, `NegativeBinomial` from `statsmodels` provides something called alpha (\(\alpha\)), which is not the same as the significance threshold. To convert this into the dispersion parameter you’re used to seeing (sometimes referred to as theta, \(\theta\)), you need \(\frac{1}{\alpha}\).
print(1/nb_para.params['alpha'])
3.378591992548569
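The reciprocal makes sense once you see the two parameterisations side by side: `statsmodels` writes the negative binomial variance in terms of \(\alpha\), while R's `glm.nb` uses \(\theta\):

\[ Var(Y) = \mu + \alpha \mu^2 = \mu + \frac{\mu^2}{\theta}, \quad \text{so} \quad \theta = \frac{1}{\alpha} \]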
13.7.1 Extract equation
If we want to express our model as a formal equation, we need to extract the coefficients:
coefficients(nb_para)
(Intercept) lakeB lakeC fish_length
1.943370195 -0.784885086 0.212060149 0.079939234
lakeB:fish_length lakeC:fish_length
0.015428470 0.005718571
print(nb_para.params)
const 1.943370
fish_length 0.079939
B -0.784885
C 0.212060
B:fish_length 0.015428
C:fish_length 0.005719
alpha 0.295981
dtype: float64
These are the coefficients of the negative binomial model equation and need to be placed in the following formula in order to estimate the expected number of parasites as a function of the other variables:
\[ E(parasite\_count) = \exp \left( 1.94 + \begin{pmatrix} 0 \\ -0.78 \\ 0.212 \end{pmatrix} \begin{pmatrix} A \\ B \\ C \end{pmatrix} + 0.08 \times fish\_length + \begin{pmatrix} 0 \\ 0.0154 \\ 0.0057 \end{pmatrix} \begin{pmatrix} A \\ B \\ C \end{pmatrix} \times fish\_length \right) \]
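As a quick sanity check on the equation, here is a worked example with the rounded coefficients (a sketch; for a lake B fish, the lake B entry of each vector is selected):

```r
# Expected parasite count for a 25 cm fish in lake B, by hand
exp((1.943 - 0.785) + (0.0799 + 0.0154) * 25)   # roughly 34.5

# The same prediction from the fitted model, for comparison
predict(nb_para, newdata = tibble(lake = "B", fish_length = 25),
        type = "response")
```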
13.7.2 Comparing Poisson and negative binomial
A Poisson regression and negative binomial regression are very similar to each other. They use the same link function - the log link function. There is however a key difference.
In a Poisson regression the dispersion parameter is fixed at 1, meaning the variance is assumed to equal the mean. In a negative binomial regression the dispersion parameter is estimated from the data.
This means they will both have equations in the same form, as we just showed above. It also means that they can produce some quite similar-looking models.
Let’s visualise and compare them.
First, let’s plot the Poisson model.
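In R, one option is to let `geom_smooth()` fit and draw the Poisson curves for us, one per lake (a sketch, reusing the aesthetics from our earlier scatterplot; fitting a separate curve per lake this way matches the per-lake intercepts and slopes of our interaction model):

```r
ggplot(parasites, aes(x = fish_length, y = parasite_count, colour = lake)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"))
```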
In Python, we need to produce a set of predictions manually, which we can then plot.
# Get unique lake values from your dataset
lake_levels = parasites['lake'].unique()

# Create prediction grid for each lake
new_data = pd.concat([
    pd.DataFrame({'fish_length': np.linspace(10, 40, 100),
                  'lake': lake})
    for lake in lake_levels
])

# Predict
new_data['pred'] = glm_para.predict(new_data)

new_data.head()
fish_length lake pred
0 10.000000 C 18.549643
1 10.303030 C 19.069529
2 10.606061 C 19.603986
3 10.909091 C 20.153422
4 11.212121 C 20.718257
Now, let’s plot the negative binomial model:
All we have to do is switch the `method` for `geom_smooth` over to `glm.nb`.
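A minimal sketch of what that looks like (passing the `MASS::glm.nb` function itself as the smoothing method, which `geom_smooth()` accepts in place of a string):

```r
ggplot(parasites, aes(x = fish_length, y = parasite_count, colour = lake)) +
  geom_point() +
  geom_smooth(method = MASS::glm.nb, se = FALSE)
```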
# Get unique lake values from your dataset
lake_levels = parasites['lake'].unique()

# Create prediction grid for each lake
new_data_nb = pd.concat([
    pd.DataFrame({'fish_length': np.linspace(10, 40, 100),
                  'lake': lake})
    for lake in lake_levels
])

# Repeat the transformations we used when fitting the model
lake_dummies = pd.get_dummies(new_data_nb['lake'], drop_first=True)
interactions = lake_dummies.multiply(new_data_nb['fish_length'], axis=0)
interactions.columns = [f'{col}:fish_length' for col in interactions.columns]
X_new = pd.concat([new_data_nb['fish_length'], lake_dummies, interactions], axis=1)
X_new = sm.add_constant(X_new).astype(float)

# Now we can safely predict:
new_data_nb['pred'] = nb_para.predict(X_new)

new_data_nb.head()
fish_length lake pred
0 10.000000 C 20.328185
1 10.303030 C 20.862750
2 10.606061 C 21.411372
3 10.909091 C 21.974421
4 11.212121 C 22.552276
It’s subtle - the two sets of curves look quite similar - but they’re not quite the same.
Looking at them side-by-side may help:
13.7.3 Checking assumptions & model quality
These steps are performed in precisely the same way we would do for a Poisson regression.
Remember, the things we checked for in a Poisson regression were:
- Response distribution
- Correct link function
- Independence
- Influential points
- Collinearity
- Dispersion
For free, I’ll just tell you now that the first three assumptions are met.
Plus, we don’t have to worry about dispersion with a negative binomial model.
So, all we really need to pay attention to is whether there are any influential points to worry about, or any collinearity.
Once again, the performance
package comes in handy for assessing all of these.
check_model(nb_para, check = c('pp_check', 'vif', 'outliers'))
Cannot simulate residuals for models of class `negbin`. Please try
`check_model(..., residual_type = "normal")` instead.
check_outliers(nb_para, threshold = list('cook'=0.5))
OK: No outliers detected.
- Based on the following method and threshold: cook (0.5).
- For variable: (Whole model)
check_collinearity(nb_para)
Model has interaction terms. VIFs might be inflated.
Try to center the variables used for the interaction, or check
multicollinearity among predictors of a model without interaction terms.
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
fish_length 3.13 [ 2.26, 4.61] 1.77 0.32 [0.22, 0.44]
High Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
lake 1321.85 [863.04, 2024.86] 36.36 7.57e-04 [0.00, 0.00]
lake:fish_length 1430.57 [934.01, 2191.42] 37.82 6.99e-04 [0.00, 0.00]
We check these things similarly to how we’ve done it previously.
For influential points, we’re actually going to refit our model via `smf.glm` (which will give us access to `get_influence`). However, we’ll use the `alpha` value that was estimated from the model we fitted with `NegativeBinomial`, rather than making any assumptions about its value.
alpha_est = nb_para.params['alpha']

model = smf.glm(formula='parasite_count ~ lake * fish_length',
                data=parasites,
                family=sm.families.NegativeBinomial(alpha=alpha_est))

glm_nb_para = model.fit()
Now we can check influential points:
influence = glm_nb_para.get_influence()

cooks = influence.cooks_distance[0]

influential_points = np.where(cooks > 0.5)[0]  # Set appropriate threshold here

print("Influential points:", influential_points)
Influential points: [19 26]
There seem to be a couple of points with a higher Cook’s distance. If you investigate a bit further, they both still have a Cook’s distance of <1, and neither of them causes dramatic changes to the model fit if removed.
The method that we’re using here to assess influence is quite sensitive, so a threshold of 1 might be more appropriate.
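If you want to check this for yourself, one sketch is to refit without the two flagged fish and compare coefficients (assuming the same row order as in Python, whose 0-based indices 19 and 26 correspond to rows 20 and 27 in R):

```r
# Refit the model without the two flagged observations
nb_para_sub <- glm.nb(parasite_count ~ lake * fish_length,
                      data = parasites[-c(20, 27), ])

# Small differences indicate the fit barely changes
round(coef(nb_para) - coef(nb_para_sub), 3)
```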
Collinearity/variance inflation factors:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrix
# Drop the response variable
X = parasites.drop(columns='parasite_count')

# Create design matrix based on model formula
X = dmatrix("lake * fish_length", data=parasites, return_type='dataframe')

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i)
                   for i in range(X.shape[1])]

print(vif_data)
feature VIF
0 Intercept 110.301908
1 lake[T.B] 46.505809
2 lake[T.C] 47.455623
3 fish_length 3.122694
4 lake[T.B]:fish_length 47.367520
5 lake[T.C]:fish_length 49.942831
The posterior predictive check looks alright, and there are no high-leverage points to worry about, but we seem to have some potential issues around our main effect of `lake`, along with the `lake:fish_length` interaction.
This is often a sign that the interaction term is unnecessary. Let’s test that theory with some significance testing and model comparison.
13.7.4 Significance & model comparison
Let’s see whether any of our predictors are significant (main effects and/or interaction effect).
We can use the trusty `anova` and `step` functions:
anova(nb_para, test = "Chisq")
Warning in anova.negbin(nb_para, test = "Chisq"): tests made without
re-estimating 'theta'
Analysis of Deviance Table
Model: Negative Binomial(3.3786), link: log
Response: parasite_count
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 63 118.202
lake 2 19.1628 61 99.040 6.900e-05 ***
fish_length 1 31.6050 60 67.435 1.889e-08 ***
lake:fish_length 2 0.1471 58 67.288 0.9291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step(nb_para)
Start: AIC=613.99
parasite_count ~ lake * fish_length
Df Deviance AIC
- lake:fish_length 2 67.435 610.14
<none> 67.288 613.99
Step: AIC=610.14
parasite_count ~ lake + fish_length
Df Deviance AIC
<none> 67.286 610.14
- lake 2 84.179 623.03
- fish_length 1 98.819 639.67
Call: glm.nb(formula = parasite_count ~ lake + fish_length, data = parasites,
init.theta = 3.370433436, link = log)
Coefficients:
(Intercept) lakeB lakeC fish_length
1.78006 -0.40015 0.35282 0.08654
Degrees of Freedom: 63 Total (i.e. Null); 60 Residual
Null Deviance: 117.9
Residual Deviance: 67.29 AIC: 612.1
Alternatively, we could fit an additive model and compare the two directly:
nb_para_add <- glm.nb(parasite_count ~ lake + fish_length, parasites)

anova(nb_para, nb_para_add)
Likelihood ratio tests of Negative Binomial Models
Response: parasite_count
  Model                theta     Resid. df  2 x log-lik.  Test    df  LR stat.   Pr(Chi)
1 lake + fish_length   3.370433  60         -602.1381
2 lake * fish_length   3.378592  58         -601.9912    1 vs 2  2   0.1469037  0.9291809
This tells us that the interaction term is not significant.
Let’s start by fitting a new additive model without our interaction.
Again, we start by specifying our response variable.
Y = parasites['parasite_count']
This next bit is much simpler than before, without the need to generate the interaction:
# Create dummy variables for 'lake' (switch to True/False values)
lake_dummies = pd.get_dummies(parasites['lake'], drop_first=True)

# Combine the predictors: fish_length, lake dummies
X = pd.concat([parasites['fish_length'], lake_dummies], axis=1)

# Add a constant/intercept (required), and make sure that we're using floats
X = sm.add_constant(X).astype(float)
Now, we can fit our model and look at the summary.
# Specify the model (Y, X)
model = NegativeBinomial(Y, X)

nb_para_add = model.fit()
Optimization terminated successfully.
Current function value: 4.704204
Iterations: 17
Function evaluations: 22
Gradient evaluations: 22
Now, we want to compare the two models (`nb_para` and `nb_para_add`).
lrstat = -2*(nb_para_add.llf - nb_para.llf)

pvalue = chi2.sf(lrstat, nb_para_add.df_resid - nb_para.df_resid)

print(lrstat, pvalue)
0.14690373762027775 0.9291808672960002
This tells us that the interaction term is not significant.
Dropping the interaction will also solve our VIF problem:
check_collinearity(nb_para_add)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
lake 1.01 [1.00, 1.29e+14] 1.00 0.99 [0.00, 1.00]
fish_length 1.01 [1.00, 1.29e+14] 1.00 0.99 [0.00, 1.00]
# Drop the response variable
X = parasites.drop(columns='parasite_count')

# Create design matrix based on model formula
X = dmatrix("lake + fish_length", data=parasites, return_type='dataframe')

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i)
                   for i in range(X.shape[1])]

print(vif_data)
feature VIF
0 Intercept 37.353114
1 lake[T.B] 1.254778
2 lake[T.C] 1.261793
3 fish_length 1.006317
Excellent. Seems we have found a well-fitting model for our `parasites` data!
13.8 Exercises
13.8.1 Bacteria colonies
13.8.2 Galapagos models
13.9 Summary
- Negative binomial regression relaxes the assumption made by Poisson regressions that the variance is equal to the mean (dispersion = 1)
- In a negative binomial regression the dispersion parameter \(\theta\) is estimated from the data
- However, they both use the log link function and often produce similar-looking models