18 Operationalising variables
This section of the course covers how we define and measure variables, and how that can affect our analyses. This is illustrated with an example dataset. If you want to do the exercises yourself, make sure to check if you have all the required libraries installed.
18.1 Libraries and functions
18.1.1 Libraries (R)
# A collection of R packages designed for data science
library(tidyverse)
18.1.2 Libraries (Python)
# A Python data analysis and manipulation tool
import pandas as pd
# Simple yet exhaustive stats functions.
import pingouin as pg
# Python equivalent of `ggplot2`
from plotnine import *
# Statistical models, conducting tests and statistical data exploration
import statsmodels.api as sm
# Convenience interface for specifying models using formula strings and DataFrames
import statsmodels.formula.api as smf
18.2 Exercise 1 - Cycling to work
For this example, we’re interested in finding out whether cycling to work increases staff members’ productivity.
Download the productivity.csv file.
This file contains a fictional dataset that explores the relationship between cycling to work and productivity at work. Each row corresponds to a different staff member at a small Cambridge-based company. There are four variables: cycle is a categorical variable denoting whether the individual cycles to work; distance is the distance in kilometres between the individual’s house and the office; projects is the number of projects successfully completed by the individual within the last 6 months; and mean_hrs is the average number of hours worked per week in the last 6 months.
As you may have noticed, we have two variables here that could serve as measures of productivity, and two ways of looking at cycling - whether someone cycles, versus how far they cycle.
First, let’s start by reading in the data, and visualising it.
# load the data
productivity <- read_csv("data/productivity.csv")
# and have a look
head(productivity)
# load the data
productivity_py = pd.read_csv("data/productivity.csv")
# and have a look
productivity_py.head()
cycle distance projects mean_hrs
0 no 6.68 2 55.0
1 no 4.32 3 48.0
2 no 5.81 2 42.0
3 no 8.49 2 37.5
4 no 6.47 4 32.0
Now it’s time to explore this data in a bit more detail. We can gain some insight by examining our two measures of “cycling” (our yes/no categorical variable, and the distance between home and office) and our two measures of “productivity” (mean hours worked per week, and projects completed in the last 6 months).
# visualise using a boxplot
productivity %>%
  ggplot(aes(x = cycle, y = distance)) +
  geom_boxplot()
Now, we’ll use a t-test to compare distance between those who cycle and those who don’t.
t.test(distance ~ cycle, data = productivity)
Welch Two Sample t-test
data: distance by cycle
t = 3.4417, df = 10.517, p-value = 0.005871
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
1.801439 8.293561
sample estimates:
mean in group no mean in group yes
7.3700 2.3225
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "distance")) +
    geom_boxplot())
Next, we compare the distance between those who cycle and those who do not. We use a t-test, since there are only two groups.
Here we use the ttest() function from the pingouin library. This needs two vectors as input, so we split the data as follows and then run the test:
dist_no_cycle = productivity_py.query('cycle == "no"')["distance"]
dist_yes_cycle = productivity_py.query('cycle == "yes"')["distance"]
pg.ttest(dist_no_cycle, dist_yes_cycle).transpose()
T-test
T 3.441734
dof 10.516733
alternative two-sided
p-val 0.005871
CI95% [1.8, 8.29]
cohen-d 1.683946
BF10 15.278
power 0.960302
Let’s look at the second set of variables: the mean hours worked per week and the number of projects completed in the past 6 months. When visualising this, we need to treat projects as a categorical variable.
# visualise the data
productivity %>%
  ggplot(aes(x = as.factor(projects), y = mean_hrs)) +
  geom_boxplot()
# construct a one-way ANOVA, treating projects as a categorical variable
lm_prod <- lm(mean_hrs ~ as.factor(projects), data = productivity)
anova(lm_prod)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(projects) 7 936.02 133.72 1.598 0.2066
Residuals 16 1338.88 83.68
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = productivity_py['projects'].astype('category'),
            y = "mean_hrs")) +
    geom_boxplot())
# construct a one-way ANOVA, treating projects as a categorical variable
= "mean_hrs",
pg.anova(dv = "projects",
between = productivity_py,
data = True).round(3) detailed
Source SS DF MS F p-unc np2
0 projects 936.023 7 133.718 1.598 0.207 0.411
1 Within 1338.883 16 83.680 NaN NaN NaN
What does this tell you about how these two sets of variables, which (in theory at least!) tap into the same underlying construct, relate to one another? Can you spot any problems, or have you got any concerns at this stage?
If so, hold that thought.
Assessing the effect of cycling on productivity
The next step is to run some exploratory analyses. Since we’re not going to report these data in any kind of paper or article, and the whole point is to look at different versions of the same analysis with different variables, we won’t worry about multiple comparison correction here.
When treating mean_hrs as our response variable, we can use a standard linear model approach, since this variable is continuous.
# visualise using ggplot
productivity %>%
  ggplot(aes(x = cycle, y = mean_hrs)) +
  geom_boxplot()
# run a t-test to compare mean_hrs for those who cycle vs those who don't
t.test(mean_hrs ~ cycle, data = productivity)
Welch Two Sample t-test
data: mean_hrs by cycle
t = -0.51454, df = 11.553, p-value = 0.6166
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
-12.803463 7.928463
sample estimates:
mean in group no mean in group yes
36.6875 39.1250
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "mean_hrs")) +
    geom_boxplot())
# run a t-test to compare mean_hrs for those who cycle vs those who don't
hrs_no_cycle = productivity_py.query('cycle == "no"')["mean_hrs"]
hrs_yes_cycle = productivity_py.query('cycle == "yes"')["mean_hrs"]
pg.ttest(hrs_no_cycle, hrs_yes_cycle).transpose()
T-test
T -0.514542
dof 11.553188
alternative two-sided
p-val 0.616573
CI95% [-12.8, 7.93]
cohen-d 0.24139
BF10 0.427
power 0.083194
Let’s also look at mean_hrs vs distance:
productivity %>%
  ggplot(aes(x = distance, y = mean_hrs)) +
  geom_point()
# run a simple linear regression analysis
lm_hrs1 <- lm(mean_hrs ~ distance, data = productivity)
anova(lm_hrs1)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 473.28 473.28 5.7793 0.02508 *
Residuals 22 1801.63 81.89
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "mean_hrs")) +
    geom_point())
We can perform a linear regression on these data:
# create a linear model
model = smf.ols(formula = "mean_hrs ~ distance",
                data = productivity_py)
# and get the fitted parameters of the model
lm_hrs1_py = model.fit()
# look at the model output
print(lm_hrs1_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mean_hrs R-squared: 0.208
Model: OLS Adj. R-squared: 0.172
Method: Least Squares F-statistic: 5.779
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0251
Time: 07:45:01 Log-Likelihood: -85.875
No. Observations: 24 AIC: 175.8
Df Residuals: 22 BIC: 178.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 43.0833 2.711 15.891 0.000 37.461 48.706
distance -1.1912 0.496 -2.404 0.025 -2.219 -0.164
==============================================================================
Omnibus: 1.267 Durbin-Watson: 1.984
Prob(Omnibus): 0.531 Jarque-Bera (JB): 0.300
Skew: -0.163 Prob(JB): 0.861
Kurtosis: 3.441 Cond. No. 8.18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This shows us that while cycle does not significantly predict mean_hrs, distance does. (If you had some concerns about the distance variable earlier, continue to hold that thought.)
So, that’s the picture for mean_hrs, the first of our two possible outcome variables. What about the predictive relationship(s) of cycling on our other candidate outcome variable, projects?
Let’s try fitting some models with projects as the outcome. We’ll continue to use linear models for now, although, as projects is what we would refer to as a count variable, we should technically be fitting a different type of model called a generalised linear model. This topic will come up later in the week, and for this dataset the two types of model lead to similar outcomes, so we won’t worry about the distinction here.
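As a rough preview of that later topic, the count-based equivalent could be fitted in R with glm() and a Poisson error family. This is only a sketch for the curious, not part of the exercise:
# sketch only: fit projects as a count response using a Poisson GLM
glm_proj <- glm(projects ~ distance,
                data = productivity,
                family = poisson)
summary(glm_proj)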
First, we look at distance vs projects.
productivity %>%
  ggplot(aes(x = distance, y = projects)) +
  geom_point()
lm_proj1 <- lm(projects ~ distance, data = productivity)
anova(lm_proj1)
Analysis of Variance Table
Response: projects
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 12.644 12.644 3.667 0.06859 .
Residuals 22 75.856 3.448
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "projects")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "projects ~ distance",
                data = productivity_py)
# and get the fitted parameters of the model
lm_proj1_py = model.fit()
# look at the model output
print(lm_proj1_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.143
Model: OLS Adj. R-squared: 0.104
Method: Least Squares F-statistic: 3.667
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0686
Time: 07:45:01 Log-Likelihood: -47.864
No. Observations: 24 AIC: 99.73
Df Residuals: 22 BIC: 102.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.5298 0.556 8.143 0.000 3.376 5.683
distance -0.1947 0.102 -1.915 0.069 -0.406 0.016
==============================================================================
Omnibus: 1.216 Durbin-Watson: 1.491
Prob(Omnibus): 0.544 Jarque-Bera (JB): 1.048
Skew: 0.317 Prob(JB): 0.592
Kurtosis: 2.196 Cond. No. 8.18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Next, we look at cycle vs projects.
productivity %>%
  ggplot(aes(x = cycle, y = projects)) +
  geom_boxplot()
lm_proj2 <- lm(projects ~ cycle, data = productivity)
anova(lm_proj2)
Analysis of Variance Table
Response: projects
Df Sum Sq Mean Sq F value Pr(>F)
cycle 1 18.75 18.7500 5.914 0.02362 *
Residuals 22 69.75 3.1705
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "projects")) +
    geom_boxplot())
# create a linear model
model = smf.ols(formula = "projects ~ cycle",
                data = productivity_py)
# and get the fitted parameters of the model
lm_proj2_py = model.fit()
# look at the model output
print(lm_proj2_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.212
Model: OLS Adj. R-squared: 0.176
Method: Least Squares F-statistic: 5.914
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0236
Time: 07:45:02 Log-Likelihood: -46.857
No. Observations: 24 AIC: 97.71
Df Residuals: 22 BIC: 100.1
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 2.5000 0.630 3.971 0.001 1.194 3.806
cycle[T.yes] 1.8750 0.771 2.432 0.024 0.276 3.474
==============================================================================
Omnibus: 0.057 Durbin-Watson: 1.674
Prob(Omnibus): 0.972 Jarque-Bera (JB): 0.221
Skew: 0.096 Prob(JB): 0.895
Kurtosis: 2.571 Cond. No. 3.23
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This shows us that cycle significantly predicts projects: the number of projects that get completed is not completely random, and some of the variance in it can be explained by whether or not a person cycles to work. In contrast, distance does not appear to be a significant predictor of projects (although it’s only marginally non-significant). This is, more or less, the opposite pattern to the one we had for mean_hrs.
That thought you were holding…
Those of you who are discerning may have noticed that the distance variable is problematic as a measure of “cycling to work” in this particular dataset - this is because the dataset includes all the distances to work for the staff members who don’t cycle, as well as those who do.
What happens if we remove those values, and look at the relationship between distance and our response variables again?
# use the filter function to retain only the rows where the staff member cycles
productivity_cycle <- productivity %>%
  filter(cycle == "yes")
productivity_cycle_py = productivity_py[productivity_py["cycle"] == "yes"]
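If you want to reassure yourself that the filtering has worked, a quick optional check in R is to count the rows that remain - only the “yes” group should be left:
# optional sanity check: only staff members who cycle should remain
productivity_cycle %>%
  count(cycle)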
We’ll repeat earlier visualisations and analyses, this time with the colour aesthetic helping us to visualise how the cycle variable affects the relationships between distance, mean_hrs and projects.
productivity %>%
  ggplot(aes(x = distance, y = mean_hrs, colour = cycle)) +
  geom_point()
lm_hrs2 <- lm(mean_hrs ~ distance, data = productivity_cycle)
anova(lm_hrs2)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 202.77 202.766 2.6188 0.1279
Residuals 14 1083.98 77.427
productivity %>%
  ggplot(aes(x = distance, y = projects, colour = cycle)) +
  geom_point()
lm_proj3 <- lm(projects ~ distance, data = productivity_cycle)
summary(lm_proj3)
Call:
lm(formula = projects ~ distance, data = productivity_cycle)
Residuals:
Min 1Q Median 3Q Max
-3.2086 -1.0679 -0.1724 1.6874 3.4674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6899 0.7163 6.547 1.3e-05 ***
distance -0.1356 0.2095 -0.647 0.528
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.103 on 14 degrees of freedom
Multiple R-squared: 0.02904, Adjusted R-squared: -0.04031
F-statistic: 0.4187 on 1 and 14 DF, p-value: 0.528
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "mean_hrs",
            colour = "cycle")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "mean_hrs ~ distance",
                data = productivity_cycle_py)
# and get the fitted parameters of the model
lm_hrs2_py = model.fit()
# look at the model output
print(lm_hrs2_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mean_hrs R-squared: 0.158
Model: OLS Adj. R-squared: 0.097
Method: Least Squares F-statistic: 2.619
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.128
Time: 07:45:04 Log-Likelihood: -56.429
No. Observations: 16 AIC: 116.9
Df Residuals: 14 BIC: 118.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 42.4204 2.998 14.151 0.000 35.991 48.850
distance -1.4189 0.877 -1.618 0.128 -3.299 0.462
==============================================================================
Omnibus: 4.148 Durbin-Watson: 2.906
Prob(Omnibus): 0.126 Jarque-Bera (JB): 2.008
Skew: -0.820 Prob(JB): 0.366
Kurtosis: 3.565 Cond. No. 4.85
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "projects",
            colour = "cycle")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "projects ~ distance",
                data = productivity_cycle_py)
# and get the fitted parameters of the model
lm_proj3_py = model.fit()
# look at the model output
print(lm_proj3_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.029
Model: OLS Adj. R-squared: -0.040
Method: Least Squares F-statistic: 0.4187
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.528
Time: 07:45:05 Log-Likelihood: -33.526
No. Observations: 16 AIC: 71.05
Df Residuals: 14 BIC: 72.60
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.6899 0.716 6.547 0.000 3.154 6.226
distance -0.1356 0.210 -0.647 0.528 -0.585 0.314
==============================================================================
Omnibus: 0.919 Durbin-Watson: 1.657
Prob(Omnibus): 0.632 Jarque-Bera (JB): 0.694
Skew: 0.016 Prob(JB): 0.707
Kurtosis: 1.980 Cond. No. 4.85
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Ah. Turns out we were right to be concerned; when staff members who don’t cycle are removed from the dataset, the significant relationship that we saw earlier between distance and mean_hrs disappears. And the marginally non-significant relationship we observed between distance and projects moves even further from significance.
This leaves us with just one significant result: projects ~ cycle. But if we really were trying to report on these data, in a paper or report of some kind, we’d need to think very carefully about how much we trust this result, or whether perhaps we’ve stumbled on a false positive by virtue of running so many tests. We may also want to think carefully about whether or not we’re happy with these definitions of the variables; for instance, is the number of projects completed really the best metric for productivity at work?
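To give a sense of how fragile that single result is, here is a minimal sketch (in R) of applying a Bonferroni correction to the p-values from the four exploratory tests we ran on the full dataset; even this simple adjustment would push the projects ~ cycle result above the conventional 0.05 threshold:
# p-values from the four exploratory tests on the full dataset, in order:
# mean_hrs ~ cycle, mean_hrs ~ distance, projects ~ distance, projects ~ cycle
p.adjust(c(0.617, 0.025, 0.069, 0.024), method = "bonferroni")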
18.3 Summary
- There are multiple ways to operationalise a variable, which may affect whether the variable is categorical or continuous
- The nature of the response variable will alter what type of model can be fitted to the dataset
- Some operationalisations may better capture your variable of interest than others
- If you do not effectively operationalise your variable in advance, you may find yourself “cherry-picking” your dataset