18 Operationalising variables
This section of the course covers how we define and measure variables, and how that can affect our analyses. This is illustrated with an example dataset. If you want to do the exercises yourself, make sure to check if you have all the required libraries installed.
18.1 Libraries and functions
18.1.1 Libraries (R)
# A collection of R packages designed for data science
library(tidyverse)
18.1.2 Libraries (Python)
# A Python data analysis and manipulation tool
import pandas as pd
# Simple yet exhaustive stats functions.
import pingouin as pg
# Python equivalent of `ggplot2`
from plotnine import *
# Statistical models, conducting tests and statistical data exploration
import statsmodels.api as sm
# Convenience interface for specifying models using formula strings and DataFrames
import statsmodels.formula.api as smf
18.2 Exercise 1 - Cycling to work
For this example, we’re interested in finding out whether cycling to work increases staff members’ productivity.
Download the productivity.csv file.
This file contains a fictional dataset that explores the relationship between cycling to work and productivity at work. Each row corresponds to a different staff member at a small Cambridge-based company. There are four variables: cycle is a categorical variable denoting whether the individual cycles to work; distance is the distance in kilometres between the individual’s house and the office; projects is the number of projects successfully completed by the individual within the last 6 months; and mean_hrs is the average number of hours worked per week in the last 6 months.
As you may have noticed, we have two variables here that could serve as measures of productivity, and two ways of looking at cycling - whether someone cycles, versus how far they cycle.
First, let’s start by reading in the data, and visualising it.
# load the data
productivity <- read_csv("data/productivity.csv")
# and have a look
head(productivity)
# load the data
productivity_py = pd.read_csv("data/productivity.csv")
# and have a look
productivity_py.head()
cycle distance projects mean_hrs
0 no 6.68 2 55.0
1 no 4.32 3 48.0
2 no 5.81 2 42.0
3 no 8.49 2 37.5
4 no 6.47 4 32.0
Now it’s time to explore this data in a bit more detail. We can gain some insight by examining our two measures of “cycling” (our yes/no categorical variable, and the distance between home and office) and our two measures of “productivity” (mean hours worked per week, and projects completed in the last 6 months).
# visualise using a boxplot
productivity %>%
  ggplot(aes(x = cycle, y = distance)) +
  geom_boxplot()
Now, we’ll use a t-test to compare distance between those who cycle and those who don’t.
t.test(distance ~ cycle, data = productivity)
Welch Two Sample t-test
data: distance by cycle
t = 3.4417, df = 10.517, p-value = 0.005871
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
1.801439 8.293561
sample estimates:
mean in group no mean in group yes
7.3700 2.3225
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "distance")) +
    geom_boxplot())
Next, we compare the distance between those who cycle and those who do not. We use a t-test, since there are only two groups.
Here we use the ttest() function from the pingouin library. This needs two vectors as input, so we split the data as follows and then run the test:
dist_no_cycle = productivity_py.query('cycle == "no"')["distance"]
dist_yes_cycle = productivity_py.query('cycle == "yes"')["distance"]
pg.ttest(dist_no_cycle, dist_yes_cycle).transpose()
T-test
T 3.441734
dof 10.516733
alternative two-sided
p-val 0.005871
CI95% [1.8, 8.29]
cohen-d 1.683946
BF10 15.278
power 0.960302
Let’s look at the second set of variables: the mean hours worked per week and the number of projects completed in the past 6 months. When visualising this, we need to treat projects as a categorical variable.
# visualise the data
productivity %>%
  ggplot(aes(x = as.factor(projects), y = mean_hrs)) +
  geom_boxplot()
# construct a one-way ANOVA, treating projects as a categorical variable
lm_prod <- lm(mean_hrs ~ as.factor(projects), data = productivity)
anova(lm_prod)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(projects) 7 936.02 133.72 1.598 0.2066
Residuals 16 1338.88 83.68
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = productivity_py['projects'].astype('category'),
            y = "mean_hrs")) +
    geom_boxplot())
# construct a one-way ANOVA, treating projects as a categorical variable
= "mean_hrs",
pg.anova(dv = "projects",
between = productivity_py,
data = True).round(3) detailed
Source SS DF MS F p-unc np2
0 projects 936.023 7 133.718 1.598 0.207 0.411
1 Within 1338.883 16 83.680 NaN NaN NaN
What does this tell you about how these two sets of variables, which (in theory at least!) tap into the same underlying construct, relate to one another? Can you spot any problems, or have you got any concerns at this stage?
If so, hold that thought.
Assessing the effect of cycling on productivity
The next step is to run some exploratory analyses. Since we’re not going to report these data in any kind of paper or article, and the whole point is to look at different versions of the same analysis with different variables, we won’t worry about multiple comparison correction here.
When treating mean_hrs as our response variable, we can use a standard linear model approach, since this variable is continuous.
# visualise using ggplot
productivity %>%
  ggplot(aes(x = cycle, y = mean_hrs)) +
  geom_boxplot()
# run a t-test to compare mean_hrs for those who cycle vs those who don't
t.test(mean_hrs ~ cycle, data = productivity)
Welch Two Sample t-test
data: mean_hrs by cycle
t = -0.51454, df = 11.553, p-value = 0.6166
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
-12.803463 7.928463
sample estimates:
mean in group no mean in group yes
36.6875 39.1250
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "mean_hrs")) +
    geom_boxplot())
# run a t-test to compare mean_hrs for those who cycle vs those who don't
hrs_no_cycle = productivity_py.query('cycle == "no"')["mean_hrs"]
hrs_yes_cycle = productivity_py.query('cycle == "yes"')["mean_hrs"]
pg.ttest(hrs_no_cycle, hrs_yes_cycle).transpose()
T-test
T -0.514542
dof 11.553188
alternative two-sided
p-val 0.616573
CI95% [-12.8, 7.93]
cohen-d 0.24139
BF10 0.427
power 0.083194
Let’s also look at mean_hrs vs distance:
productivity %>%
  ggplot(aes(x = distance, y = mean_hrs)) +
  geom_point()
# run a simple linear regression analysis
lm_hrs1 <- lm(mean_hrs ~ distance, data = productivity)
anova(lm_hrs1)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 473.28 473.28 5.7793 0.02508 *
Residuals 22 1801.63 81.89
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "mean_hrs")) +
    geom_point())
We can perform a linear regression on these data:
# create a linear model
model = smf.ols(formula = "mean_hrs ~ distance",
                data = productivity_py)
# and get the fitted parameters of the model
lm_hrs1_py = model.fit()
# look at the model output
print(lm_hrs1_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mean_hrs R-squared: 0.208
Model: OLS Adj. R-squared: 0.172
Method: Least Squares F-statistic: 5.779
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0251
Time: 07:45:01 Log-Likelihood: -85.875
No. Observations: 24 AIC: 175.8
Df Residuals: 22 BIC: 178.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 43.0833 2.711 15.891 0.000 37.461 48.706
distance -1.1912 0.496 -2.404 0.025 -2.219 -0.164
==============================================================================
Omnibus: 1.267 Durbin-Watson: 1.984
Prob(Omnibus): 0.531 Jarque-Bera (JB): 0.300
Skew: -0.163 Prob(JB): 0.861
Kurtosis: 3.441 Cond. No. 8.18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This shows us that while cycle does not significantly predict mean_hrs, distance does. (If you had some concerns about the distance variable earlier, continue to hold that thought.)
So, that’s the picture for mean_hrs, the first of our two possible outcome variables. What about the predictive relationship(s) of cycling on our other candidate outcome variable, projects?
Let’s try fitting some models with projects as the outcome. We’ll continue to use linear models for now, although, as projects is what we would refer to as a count variable, we should technically be fitting a different type of model called a generalised linear model. This topic will come up later in the week, and for this dataset the two types of model lead to similar outcomes, so we won’t worry about the distinction here.
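As a rough preview of that later topic, the count-based equivalent could be fitted in R with glm() and a Poisson error family. This is only a sketch for the curious, not part of the exercise:
# sketch only: fit projects as a count response using a Poisson GLM
glm_proj <- glm(projects ~ distance,
                data = productivity,
                family = poisson)
summary(glm_proj)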
First, we look at distance vs projects.
productivity %>%
  ggplot(aes(x = distance, y = projects)) +
  geom_point()
lm_proj1 <- lm(projects ~ distance, data = productivity)
anova(lm_proj1)
Analysis of Variance Table
Response: projects
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 12.644 12.644 3.667 0.06859 .
Residuals 22 75.856 3.448
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "projects")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "projects ~ distance",
                data = productivity_py)
# and get the fitted parameters of the model
lm_proj1_py = model.fit()
# look at the model output
print(lm_proj1_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.143
Model: OLS Adj. R-squared: 0.104
Method: Least Squares F-statistic: 3.667
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0686
Time: 07:45:01 Log-Likelihood: -47.864
No. Observations: 24 AIC: 99.73
Df Residuals: 22 BIC: 102.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.5298 0.556 8.143 0.000 3.376 5.683
distance -0.1947 0.102 -1.915 0.069 -0.406 0.016
==============================================================================
Omnibus: 1.216 Durbin-Watson: 1.491
Prob(Omnibus): 0.544 Jarque-Bera (JB): 1.048
Skew: 0.317 Prob(JB): 0.592
Kurtosis: 2.196 Cond. No. 8.18
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Next, we look at cycle vs projects.
productivity %>%
  ggplot(aes(x = cycle, y = projects)) +
  geom_boxplot()
lm_proj2 <- lm(projects ~ cycle, data = productivity)
anova(lm_proj2)
Analysis of Variance Table
Response: projects
Df Sum Sq Mean Sq F value Pr(>F)
cycle 1 18.75 18.7500 5.914 0.02362 *
Residuals 22 69.75 3.1705
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualise using a boxplot
(ggplot(productivity_py,
        aes(x = "cycle",
            y = "projects")) +
    geom_boxplot())
# create a linear model
model = smf.ols(formula = "projects ~ cycle",
                data = productivity_py)
# and get the fitted parameters of the model
lm_proj2_py = model.fit()
# look at the model output
print(lm_proj2_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.212
Model: OLS Adj. R-squared: 0.176
Method: Least Squares F-statistic: 5.914
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.0236
Time: 07:45:02 Log-Likelihood: -46.857
No. Observations: 24 AIC: 97.71
Df Residuals: 22 BIC: 100.1
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 2.5000 0.630 3.971 0.001 1.194 3.806
cycle[T.yes] 1.8750 0.771 2.432 0.024 0.276 3.474
==============================================================================
Omnibus: 0.057 Durbin-Watson: 1.674
Prob(Omnibus): 0.972 Jarque-Bera (JB): 0.221
Skew: 0.096 Prob(JB): 0.895
Kurtosis: 2.571 Cond. No. 3.23
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This shows us that cycle significantly predicts projects: the number of projects that get completed is not completely random, and some of the variance in it can be explained by whether or not a person cycles to work. In contrast, distance does not appear to be a significant predictor of projects (although it’s only marginally non-significant). This is, more or less, the opposite pattern to the one we had for mean_hrs.
That thought you were holding…
Those of you who are discerning may have noticed that the distance variable is problematic as a measure of “cycling to work” in this particular dataset - this is because the dataset includes all the distances to work for the staff members who don’t cycle, as well as those who do.
What happens if we remove those values, and look at the relationship between distance and our response variables again?
# use the filter function to retain only the rows where the staff member cycles
productivity_cycle <- productivity %>%
  filter(cycle == "yes")
productivity_cycle_py = productivity_py[productivity_py["cycle"] == "yes"]
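If you want to reassure yourself that the filtering has worked, a quick optional check in R is to count the rows that remain - only the “yes” group should be left:
# optional sanity check: only staff members who cycle should remain
productivity_cycle %>%
  count(cycle)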
We’ll repeat earlier visualisations and analyses, this time with the colour aesthetic helping us to visualise how the cycle variable affects the relationships between distance, mean_hrs and projects.
productivity %>%
  ggplot(aes(x = distance, y = mean_hrs, colour = cycle)) +
  geom_point()
lm_hrs2 <- lm(mean_hrs ~ distance, data = productivity_cycle)
anova(lm_hrs2)
Analysis of Variance Table
Response: mean_hrs
Df Sum Sq Mean Sq F value Pr(>F)
distance 1 202.77 202.766 2.6188 0.1279
Residuals 14 1083.98 77.427
productivity %>%
  ggplot(aes(x = distance, y = projects, colour = cycle)) +
  geom_point()
lm_proj3 <- lm(projects ~ distance, data = productivity_cycle)
summary(lm_proj3)
Call:
lm(formula = projects ~ distance, data = productivity_cycle)
Residuals:
Min 1Q Median 3Q Max
-3.2086 -1.0679 -0.1724 1.6874 3.4674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6899 0.7163 6.547 1.3e-05 ***
distance -0.1356 0.2095 -0.647 0.528
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.103 on 14 degrees of freedom
Multiple R-squared: 0.02904, Adjusted R-squared: -0.04031
F-statistic: 0.4187 on 1 and 14 DF, p-value: 0.528
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "mean_hrs",
            colour = "cycle")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "mean_hrs ~ distance",
                data = productivity_cycle_py)
# and get the fitted parameters of the model
lm_hrs2_py = model.fit()
# look at the model output
print(lm_hrs2_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mean_hrs R-squared: 0.158
Model: OLS Adj. R-squared: 0.097
Method: Least Squares F-statistic: 2.619
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.128
Time: 07:45:04 Log-Likelihood: -56.429
No. Observations: 16 AIC: 116.9
Df Residuals: 14 BIC: 118.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 42.4204 2.998 14.151 0.000 35.991 48.850
distance -1.4189 0.877 -1.618 0.128 -3.299 0.462
==============================================================================
Omnibus: 4.148 Durbin-Watson: 2.906
Prob(Omnibus): 0.126 Jarque-Bera (JB): 2.008
Skew: -0.820 Prob(JB): 0.366
Kurtosis: 3.565 Cond. No. 4.85
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# visualise using a scatterplot
(ggplot(productivity_py,
        aes(x = "distance",
            y = "projects",
            colour = "cycle")) +
    geom_point())
# create a linear model
model = smf.ols(formula = "projects ~ distance",
                data = productivity_cycle_py)
# and get the fitted parameters of the model
lm_proj3_py = model.fit()
# look at the model output
print(lm_proj3_py.summary())
OLS Regression Results
==============================================================================
Dep. Variable: projects R-squared: 0.029
Model: OLS Adj. R-squared: -0.040
Method: Least Squares F-statistic: 0.4187
Date: Tue, 16 Apr 2024 Prob (F-statistic): 0.528
Time: 07:45:05 Log-Likelihood: -33.526
No. Observations: 16 AIC: 71.05
Df Residuals: 14 BIC: 72.60
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.6899 0.716 6.547 0.000 3.154 6.226
distance -0.1356 0.210 -0.647 0.528 -0.585 0.314
==============================================================================
Omnibus: 0.919 Durbin-Watson: 1.657
Prob(Omnibus): 0.632 Jarque-Bera (JB): 0.694
Skew: 0.016 Prob(JB): 0.707
Kurtosis: 1.980 Cond. No. 4.85
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Ah. Turns out we were right to be concerned; when staff members who don’t cycle are removed from the dataset, the significant relationship that we saw earlier between distance and mean_hrs disappears. And the marginally non-significant relationship we observed between distance and projects moves even further from significance.
This leaves us with just one significant result: projects ~ cycle. But if we really were trying to report on these data, in a paper or report of some kind, we’d need to think very carefully about how much we trust this result, or whether perhaps we’ve stumbled on a false positive by virtue of running so many tests. We may also want to think carefully about whether or not we’re happy with these definitions of the variables; for instance, is the number of projects completed really the best metric for productivity at work?
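To give a sense of how fragile that single result is, here is a minimal sketch (in R) of applying a Bonferroni correction to the p-values from the four exploratory tests we ran on the full dataset; even this simple adjustment would push the projects ~ cycle result above the conventional 0.05 threshold:
# p-values from the four exploratory tests on the full dataset, in order:
# mean_hrs ~ cycle, mean_hrs ~ distance, projects ~ distance, projects ~ cycle
p.adjust(c(0.617, 0.025, 0.069, 0.024), method = "bonferroni")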
18.3 Summary
- There are multiple ways to operationalise a variable, which may affect whether the variable is categorical or continuous
- The nature of the response variable will alter what type of model can be fitted to the dataset
- Some operationalisations may better capture your variable of interest than others
- If you do not effectively operationalise your variable in advance, you may find yourself “cherry-picking” your dataset