9  Chaining operations

Learning objectives
  • Learn how to chain operations together.

9.1 Context

In the section above we performed several operations on a single data set. Often there is a sequence to this, where the output of one operation gets fed into the next. We can simplify this by chaining commands.

9.2 Section setup

We’ll continue this section with the script named da3-06-chaining-operations.R. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys <- read_csv("data/surveys.csv")

We’ll continue this section with the script named da3-06-chaining-operations.py. If needed, add the following code to the top of your script and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys = pd.read_csv("data/surveys.csv")

9.3 Pipes or chaining commands

So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:

select(surveys, record_id, hindfoot_length)

Let’s say we wanted combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# grab the relevant columns and store in a new object
subset_surveys <- select(surveys, record_id, hindfoot_length)

# create the new column
mutate(subset_surveys, hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.

This clutters up your computer’s memory rather quickly when dealing with lots of data. A much better way is that we pipe or chain one after the other. To do this, we start with the data and use a pipe symbol (|> or %>%) as follows:

surveys |> 
  select(record_id, hindfoot_length) |>
  mutate(hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

An easy way of remembering what the pipe does is to replace (in your head) the pipe symbol with the phrase “and then…”.

So, we select() the record_id and hindfoot_length columns and then use mutate() to create a new column called hindfoot_length_cm.

Which pipe symbol do I use?

You’ll find that people use two pipe symbols quite interchangeably in R: the |> pipe (native, built-in R) and %>% from the magrittr package.

The native, built-in pipe is a rather new addition, since version 4.1. It is slightly different in its behaviour than the %>% pipe (if you want to know more, see here), but for most purposes they work the same.

We tend to use the native, built-in pipe throughout the materials. But the magrittr pipe works just as well! You can change your preference in RStudio by going to Tools > Global options > Code and changing the tickbox enabling/disabling the native pipe operator.

surveys[["record_id", "hindfoot_length"]].copy()
       record_id  hindfoot_length
0              1             32.0
1              2             33.0
2              3             37.0
3              4             36.0
4              5             35.0
...          ...              ...
35544      35545              NaN
35545      35546              NaN
35546      35547             15.0
35547      35548             36.0
35548      35549              NaN

[35549 rows x 2 columns]

Let’s say we wanted combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# select the required columns and store in a new data set
subset_surveys = surveys[["record_id", "hindfoot_length"]].copy()

# take the new data set and calculate the new column
subset_surveys["hindfoot_length_cm"] = subset_surveys["hindfoot_length"] / 10

We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.

This clutters up your computer’s memory rather quickly when dealing with lots of data. So, it’d be good if we could pipe or chain these commands, like we can do in R.

Python does not have an exact equivalent to pipes in R, but you can chain methods in pandas, which can make your life a lot easier. Let’s explore that a bit further.

In the code below we are using .loc[:, ["col1", "col2"]] to select columns. This is a bit more chain-friendly than using [["col1", "col2"]] because it also allows filtering (more on this in the next chapter).

(
  surveys
  .loc[:, ['record_id', 'hindfoot_length']]
  .assign(hindfoot_length_cm = lambda df: df["hindfoot_length"] / 10)
)
       record_id  hindfoot_length  hindfoot_length_cm
0              1             32.0                 3.2
1              2             33.0                 3.3
2              3             37.0                 3.7
3              4             36.0                 3.6
4              5             35.0                 3.5
...          ...              ...                 ...
35544      35545              NaN                 NaN
35545      35546              NaN                 NaN
35546      35547             15.0                 1.5
35547      35548             36.0                 3.6
35548      35549              NaN                 NaN

[35549 rows x 3 columns]

Here, we do the following:

  • .loc[:, ['record_id', 'hindfoot_length']] selects the columns you want
  • .assign(...) then creates a new column
  • lambda df tells pandas to compute the new column using the current data frame in the chain
Oh, lambda!

Why Use lambda in Pandas .assign()?

In Python, a lambda is an anonymous function — a quick, in-place function without a name.

Why is lambda important in method chaining?

This all relates to the evaluation order: .assign() evaluates its arguments before the previous steps in the chain are fully applied.

So if you reference a column that was renamed or created earlier in the chain, pandas won’t find it unless you use lambda to delay evaluation.

Using lambda in .assign() ensures that pandas waits to evaluate the new column until the DataFrame is fully updated by the earlier steps.

Suppose we want to:

  • Select two columns
  • Create a new column weight_kg as weight_g / 1000

Without lambda — this will raise an error:

(
  surveys
  .loc[:, ["record_id", "weight"]]
  .assign(weight_kg = surveys["weight_g"] / 1000)  # KeyError: "weight_g"
)
Pipe-style packages in Python

There are some dplyr-style implementations in Python, that also include a pipe. One is siuba but it does not seem to be actively maintained. Another one is dfply, which has not been updated for 7 years and counting…

So, rather than being frustrated about this, I suggest we accept the differences between the two languages and move on! :-)

9.4 Chaining different commands

We can extend this concept a bit further, beyond just the creation of new columns.

surveys |> 
  select(record_id, weight) |>  # select columns
  rename(weight_g = weight) |>  # rename weight
  head()                        # show the first few rows
# A tibble: 6 × 2
  record_id weight_g
      <dbl>    <dbl>
1         1       NA
2         2       NA
3         3       NA
4         4       NA
5         5       NA
6         6       NA
(
  surveys
  .loc[:, ['record_id', 'weight']]          # select columns
  .rename(columns = {'weight': 'weight_g'}) # rename column
  .head()                                   # show first few rows
)
   record_id  weight_g
0          1       NaN
1          2       NaN
2          3       NaN
3          4       NaN
4          5       NaN

9.4.1 Chaining

Exercise 1 - Chaining

Level:

For this exercise we’ll be using the data from data/parasites.csv.

The fish_length variable is the length of the fish measured in centimetres. Do the following:

  • select all but the parasite_count column
  • rename the fish_length column to fish_length_cm
  • create a new column that contains the fish length in inches (divide by 2.54)
  • do all of this in a single chain

Load the data, if needed.

parasites <- read_csv("data/parasites.csv")
parasites = pd.read_csv("data/parasites.csv")

Perform chaining operations:

parasites |> 
  select(lake, fish_length) |> 
  rename(fish_length_cm = fish_length) |> 
  mutate(fish_length_in = fish_length_cm / 2.54)
# A tibble: 64 × 3
   lake  fish_length_cm fish_length_in
   <chr>          <dbl>          <dbl>
 1 C               21.1           8.31
 2 C               26.4          10.4 
 3 C               18.9           7.44
 4 B               27.2          10.7 
 5 C               29            11.4 
 6 B               24.2           9.53
 7 B               31.2          12.3 
 8 B               20.3           7.99
 9 C               27            10.6 
10 A               27            10.6 
# ℹ 54 more rows
(
  parasites
  .loc[:, ["lake", "fish_length"]]
  .rename(columns = {"fish_length": "fish_length_cm"})
  .assign(fish_length_in = lambda df: df["fish_length_cm"] / 2.54)
)
   lake  fish_length_cm  fish_length_in
0     C            21.1        8.307087
1     C            26.4       10.393701
2     C            18.9        7.440945
3     B            27.2       10.708661
4     C            29.0       11.417323
..  ...             ...             ...
59    B            26.5       10.433071
60    A            24.6        9.685039
61    C            27.1       10.669291
62    A            25.9       10.196850
63    C            33.9       13.346457

[64 rows x 3 columns]

9.5 Summary

Key points
  • In Python we use method chaining, which apply a series of transformations to a pandas DataFrame.
  • In R we have dedicated pipe symbols and we can use |> (built-in) or %>% (via magrittr package) to chain operations.
  • Both of these approaches allow us to run multiple lines of code sequentially, simplifying pipelines and making them easier to read.