9  Chaining operations

Learning objectives
  • Learn how to chain operations together.

9.1 Context

In the section above we performed several operations on a single data set. Often these operations form a sequence, where the output of one operation is fed into the next. We can simplify this by chaining commands.

9.2 Section setup

If you are using R, we’ll continue this section with the script named da3-06-chaining-operations.R. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys <- read_csv("data/surveys.csv")

If you are using Python, we’ll continue this section with the script named da3-06-chaining-operations.py. If needed, add the following code to the top of your script and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys = pd.read_csv("data/surveys.csv")

9.3 Pipes

So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:

select(surveys, record_id, hindfoot_length)
# A tibble: 35,549 × 2
   record_id hindfoot_length
       <dbl>           <dbl>
 1         1              32
 2         2              33
 3         3              37
 4         4              36
 5         5              35
 6         6              14
 7         7              NA
 8         8              37
 9         9              34
10        10              20
# ℹ 35,539 more rows

Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# grab the relevant columns and store in a new object
subset_surveys <- select(surveys, record_id, hindfoot_length)

# create the new column
mutate(subset_surveys, hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.

Storing intermediate objects like this clutters up your computer’s memory rather quickly when dealing with lots of data. A much better way is to pipe one operation into the next. To do this, we start with the data and use a pipe symbol (|> or %>%) as follows:

surveys |> 
  select(record_id, hindfoot_length) |>
  mutate(hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

An easy way of remembering what the pipe does is to replace (in your head) the pipe symbol with the phrase “and then…”.

So, we select() the record_id and hindfoot_length columns and then use mutate() to create a new column called hindfoot_length_cm.

Which pipe symbol do I use?

You’ll find that people use two pipe symbols quite interchangeably in R: the |> pipe (native, built-in R) and %>% from the magrittr package.

The native, built-in pipe is a relatively new addition, available since R version 4.1. It behaves slightly differently from the %>% pipe (if you want to know more, see here), but for most purposes they work the same.

We tend to use the native, built-in pipe throughout the materials. But the magrittr pipe works just as well! You can change your preference in RStudio by going to Tools > Global options > Code and changing the tickbox enabling/disabling the native pipe operator.

In Python, we can select the same columns with:

surveys[["record_id", "hindfoot_length"]].copy()
       record_id  hindfoot_length
0              1             32.0
1              2             33.0
2              3             37.0
3              4             36.0
4              5             35.0
...          ...              ...
35544      35545              NaN
35545      35546              NaN
35546      35547             15.0
35547      35548             36.0
35548      35549              NaN

[35549 rows x 2 columns]

Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# select the required columns and store in a new data set
selected = surveys[["record_id", "hindfoot_length"]].copy()

# take the new data set and calculate the new column
selected["hindfoot_length_cm"] = selected["hindfoot_length"] / 10

We had to create a new object (here, called selected) to store the intermediate data we were interested in, and then continue with creating the new column.

This clutters up your computer’s memory rather quickly when dealing with lots of data. So, it’d be good if we could pipe these commands through, like we can do in R.

But, a bit of sad news here. Python does not really have an equivalent to the pipe operator in R. You can somewhat emulate it by chaining methods directly onto the data frame, which is less intuitive to read:

(surveys
  [["record_id", "hindfoot_length"]]
  .assign(hindfoot_length_cm = lambda df: df["hindfoot_length"] / 10))
       record_id  hindfoot_length  hindfoot_length_cm
0              1             32.0                 3.2
1              2             33.0                 3.3
2              3             37.0                 3.7
3              4             36.0                 3.6
4              5             35.0                 3.5
...          ...              ...                 ...
35544      35545              NaN                 NaN
35545      35546              NaN                 NaN
35546      35547             15.0                 1.5
35547      35548             36.0                 3.6
35548      35549              NaN                 NaN

[35549 rows x 3 columns]

Here, we do the following:

  • surveys[["record_id", "hindfoot_length"]] selects the columns you want
  • .assign(...) then creates a new column
  • lambda df tells pandas to compute the new column using the current data frame in the chain
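To see what the function inside .assign() actually receives, here is a minimal sketch on a small made-up data frame. The toy values, the extra weight column and the helper to_cm() are assumptions for illustration only and are not taken from surveys.csv. A named function is used in place of the lambda so that it can record the columns it sees; pandas handles both in the same way.

```python
import pandas as pd

# Made-up stand-in for the surveys data (illustrative values only)
toy = pd.DataFrame({
    "record_id": [1, 2, 3],
    "hindfoot_length": [32.0, 33.0, None],
    "weight": [40.0, 48.0, 29.0],
})

seen_columns = []

def to_cm(df):
    # Hypothetical helper: note which columns the intermediate
    # data frame has, then return the new column's values
    seen_columns.append(list(df.columns))
    return df["hindfoot_length"] / 10

result = (toy
  [["record_id", "hindfoot_length"]]
  .assign(hindfoot_length_cm = to_cm))

print(seen_columns)
print(result)
```

The function only sees record_id and hindfoot_length, because the selection step has already dropped weight by the time .assign() runs. The original toy data frame is left unchanged.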

But I guess you’ll agree that this is not that much easier to read. There are some dplyr-style implementations in Python that also include a pipe. One is siuba, but it does not seem to be actively maintained. Another one is dfply, which has not been updated for 7 years and counting…
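Within pandas itself, data frames also have a .pipe() method, which hands the data frame to a function of your choice and returns whatever that function returns. This lets your own functions take part in a chain. The sketch below shows one possible way to use it; the helper add_length_cm() and the small toy data frame are hypothetical and made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the surveys data (illustrative values only)
toy = pd.DataFrame({
    "record_id": [1, 2, 3],
    "hindfoot_length": [32.0, 33.0, 37.0],
})

def add_length_cm(df, source="hindfoot_length"):
    # Hypothetical helper: return a new data frame with an extra
    # column holding the `source` column divided by 10
    return df.assign(**{f"{source}_cm": df[source] / 10})

result = (toy
  [["record_id", "hindfoot_length"]]
  .pipe(add_length_cm, source="hindfoot_length"))

print(result)
```

This is still method chaining rather than a pipe operator, so it does not change the overall picture.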

So, rather than being frustrated about this, I suggest we accept the differences between the two languages and move on! :-)

9.4 Summary

Key points
  • Python has no direct equivalent to the pipe in R; chaining methods such as .assign() only partly emulates it.
  • In R we can use |> (built-in) or %>% (via magrittr package) to chain operations.
  • This allows us to run several operations in sequence without storing intermediate objects, simplifying pipelines and making them easier to read.