9 Chaining operations

Learning objectives

Learn how to chain operations together.

9.1 Context

In the section above we performed several operations on a single data set. Often there is a sequence to this, where the output of one operation gets fed into the next. We can simplify this by chaining commands.

9.2 Section setup

We’ll continue this section with the script named 03_session. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys <- read_csv("data/surveys.csv")

We’ll continue this section with the Notebook named 03_session. Add the following code to the first cell and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys = pd.read_csv("data/surveys.csv")

9.3 Pipes or chaining commands

So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:

select(surveys, record_id, hindfoot_length)

Let’s say we wanted combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# grab the relevant columns and store in a new object
subset_surveys <- select(surveys, record_id, hindfoot_length)

# create the new column
mutate(subset_surveys, hindfoot_length_cm = hindfoot_length / 10)

# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.

This clutters up your computer’s memory rather quickly when dealing with lots of data. A much better way is that we pipe or chain one after the other. To do this, we start with the data and use a pipe symbol (|> or %>%) as follows:

surveys |> 
  select(record_id, hindfoot_length) |>
  mutate(hindfoot_length_cm = hindfoot_length / 10)

# A tibble: 35,549 × 3
   record_id hindfoot_length hindfoot_length_cm
       <dbl>           <dbl>              <dbl>
 1         1              32                3.2
 2         2              33                3.3
 3         3              37                3.7
 4         4              36                3.6
 5         5              35                3.5
 6         6              14                1.4
 7         7              NA               NA  
 8         8              37                3.7
 9         9              34                3.4
10        10              20                2  
# ℹ 35,539 more rows

An easy way of remembering what the pipe does is to replace (in your head) the pipe symbol with the phrase “and then…”.

So, we select() the record_id and hindfoot_length columns and then use mutate() to create a new column called hindfoot_length_cm.

Which pipe symbol do I use?

You’ll find that people use two pipe symbols quite interchangeably in R: the |> pipe (native, built-in R) and %>% from the magrittr package.

The native, built-in pipe is a rather new addition, since version 4.1. It is slightly different in its behaviour than the %>% pipe (if you want to know more, see here), but for most purposes they work the same.

We tend to use the native, built-in pipe throughout the materials. But the magrittr pipe works just as well! You can change your preference in RStudio by going to Tools > Global options > Code and changing the tickbox enabling/disabling the native pipe operator.

surveys[["record_id", "hindfoot_length"]].copy()

       record_id  hindfoot_length
0              1             32.0
1              2             33.0
2              3             37.0
3              4             36.0
4              5             35.0
...          ...              ...
35544      35545              NaN
35545      35546              NaN
35546      35547             15.0
35547      35548             36.0
35548      35549              NaN

[35549 rows x 2 columns]

Let’s say we wanted combine that with creating a new column, for example hindfoot length in centimeters.

We would have to do the following:

# select the required columns and store in a new data set
subset_surveys = surveys[["record_id", "hindfoot_length"]].copy()

# take the new data set and calculate the new column
subset_surveys["hindfoot_length_cm"] = subset_surveys["hindfoot_length"] / 10

We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.

This clutters up your computer’s memory rather quickly when dealing with lots of data. So, it’d be good if we could pipe or chain these commands, like we can do in R.

Python does not have an exact equivalent to pipes in R, but you can chain methods in pandas, which can make your life a lot easier. Let’s explore that a bit further.

In the code below we are using .loc[:, ["col1", "col2"]] to select columns. This is a bit more chain-friendly than using [["col1", "col2"]] because it also allows filtering (more on this in the next chapter).

(
  surveys
  .loc[:, ['record_id', 'hindfoot_length']]
  .assign(hindfoot_length_cm = lambda df: df["hindfoot_length"] / 10)
)

       record_id  hindfoot_length  hindfoot_length_cm
0              1             32.0                 3.2
1              2             33.0                 3.3
2              3             37.0                 3.7
3              4             36.0                 3.6
4              5             35.0                 3.5
...          ...              ...                 ...
35544      35545              NaN                 NaN
35545      35546              NaN                 NaN
35546      35547             15.0                 1.5
35547      35548             36.0                 3.6
35548      35549              NaN                 NaN

[35549 rows x 3 columns]

Here, we do the following:

.loc[:, ['record_id', 'hindfoot_length']] selects the columns you want
.assign(...) then creates a new column
lambda df tells pandas to compute the new column using the current data frame in the chain

Oh, lambda!

Why Use lambda in Pandas .assign()?

In Python, a lambda is an anonymous function — a quick, in-place function without a name.

Why is lambda important in method chaining?

This all relates to the evaluation order: .assign() evaluates its arguments before the previous steps in the chain are fully applied.

So if you reference a column that was renamed or created earlier in the chain, pandas won’t find it unless you use lambda to delay evaluation.

Using lambda in .assign() ensures that pandas waits to evaluate the new column until the DataFrame is fully updated by the earlier steps.

Suppose we want to:

Select two columns
Create a new column weight_kg as weight_g / 1000

Without lambda — this will raise an error:

(
  surveys
  .loc[:, ["record_id", "weight"]]
  .assign(weight_kg = surveys["weight_g"] / 1000)  # KeyError: "weight_g"
)

Pipe-style packages in Python

There are some dplyr-style implementations in Python, that also include a pipe. One is siuba but it does not seem to be actively maintained. Another one is dfply, which has not been updated for 7 years and counting…

So, rather than being frustrated about this, I suggest we accept the differences between the two languages and move on! :-)

9.4 Chaining different commands

We can extend this concept a bit further, beyond just the creation of new columns.

surveys |> 
  select(record_id, weight) |>  # select columns
  rename(weight_g = weight) |>  # rename weight
  head()                        # show the first few rows

# A tibble: 6 × 2
  record_id weight_g
      <dbl>    <dbl>
1         1       NA
2         2       NA
3         3       NA
4         4       NA
5         5       NA
6         6       NA

(
  surveys
  .loc[:, ['record_id', 'weight']]          # select columns
  .rename(columns = {'weight': 'weight_g'}) # rename column
  .head()                                   # show first few rows
)

   record_id  weight_g
0          1       NaN
1          2       NaN
2          3       NaN
3          4       NaN
4          5       NaN

9.5 Exercises

9.5.1 Chaining

Exercise 1 - Chaining

Level:

For this exercise we’ll be using the data from data/parasites.csv.

The fish_length variable is the length of the fish measured in centimeters. Do the following:

select all but the parasite_count column
rename the fish_length column to fish_length_cm
create a new column that contains the fish length in inches (divide by 2.54)
do all of this in a single chain

Answer

Load the data, if needed.

parasites <- read_csv("data/parasites.csv")

parasites = pd.read_csv("data/parasites.csv")

Perform chaining operations.

We use select() to select the columns, rename() to rename the fish_length column and mutate() to create a new column containing the fishlength values in inches.

parasites |> 
  select(lake, fish_length) |> 
  rename(fish_length_cm = fish_length) |> 
  mutate(fish_length_in = fish_length_cm / 2.54)

# A tibble: 64 × 3
   lake  fish_length_cm fish_length_in
   <chr>          <dbl>          <dbl>
 1 C               21.1           8.31
 2 C               26.4          10.4 
 3 C               18.9           7.44
 4 B               27.2          10.7 
 5 C               29            11.4 
 6 B               24.2           9.53
 7 B               31.2          12.3 
 8 B               20.3           7.99
 9 C               27            10.6 
10 A               27            10.6 
# ℹ 54 more rows

We use .loc to select the columns, .rename() to rename the fish_length column and .assign() to create the new column containin the fishlength values in inches.

(
  parasites
  .loc[:, ["lake", "fish_length"]]
  .rename(columns = {"fish_length": "fish_length_cm"})
  .assign(fish_length_in = lambda df: df["fish_length_cm"] / 2.54)
)

   lake  fish_length_cm  fish_length_in
0     C            21.1        8.307087
1     C            26.4       10.393701
2     C            18.9        7.440945
3     B            27.2       10.708661
4     C            29.0       11.417323
..  ...             ...             ...
59    B            26.5       10.433071
60    A            24.6        9.685039
61    C            27.1       10.669291
62    A            25.9       10.196850
63    C            33.9       13.346457

[64 rows x 3 columns]

9.5.2 Converting dates: `surveys`

Exercise 2 - Converting dates

Level:

For this exercise we’ll go back to using the data from data/surveys.csv.

In our data we have the day, month and year values in separate columns. This is a safe way of storing the data, but when it comes to plotting or working with the data it’s useful to have it in a different format.

Create two new columns:

record_date that contains the full date in the format yyyy-mm-dd
month_abbr that contains the month abbreviation (e.g. Jan, Feb, …)
Do this using chains, so you don’t end up needing to save lots of intermediate objects

Feel free to use new packages…

Answer

We can do this as follows.

In R there is the lubridate package, which allows us to manipulate dates efficiently. After installing it, make sure to load it.

we use lubridate::make_date() to create the date column
we use the lubridate::month() function to create the month label, using abbreviations

library(lubridate)

surveys |> 
  mutate(record_date = make_date(year, month, day)) |> 
  mutate(month_abbr = month(month, label = TRUE, abbr = TRUE)) |> 
  select(record_id, year, month, day, record_date, month_abbr)

# A tibble: 35,549 × 6
   record_id  year month   day record_date month_abbr
       <dbl> <dbl> <dbl> <dbl> <date>      <ord>     
 1         1  1977     7    16 1977-07-16  Jul       
 2         2  1977     7    16 1977-07-16  Jul       
 3         3  1977     7    16 1977-07-16  Jul       
 4         4  1977     7    16 1977-07-16  Jul       
 5         5  1977     7    16 1977-07-16  Jul       
 6         6  1977     7    16 1977-07-16  Jul       
 7         7  1977     7    16 1977-07-16  Jul       
 8         8  1977     7    16 1977-07-16  Jul       
 9         9  1977     7    16 1977-07-16  Jul       
10        10  1977     7    16 1977-07-16  Jul       
# ℹ 35,539 more rows

We can use some functionality from the calendar package, as well as pandas.

we can use to_datetime() to convert the year, month and day columns to a date.
we can use calendar.month_abbr to convert the numeric value of month to a month abbreviation

import calendar

surveys = (
    surveys
    .assign(record_date = pd.to_datetime(surveys[['year', 'month', 'day']]))
    .assign(month_abbr = surveys['month'].apply(lambda x: calendar.month_abbr[x]))
)

surveys.head()

   record_id  month  day  year  ...  hindfoot_length weight record_date  month_abbr
0          1      7   16  1977  ...             32.0    NaN  1977-07-16         Jul
1          2      7   16  1977  ...             33.0    NaN  1977-07-16         Jul
2          3      7   16  1977  ...             37.0    NaN  1977-07-16         Jul
3          4      7   16  1977  ...             36.0    NaN  1977-07-16         Jul
4          5      7   16  1977  ...             35.0    NaN  1977-07-16         Jul

[5 rows x 11 columns]

9.6 Summary

Key points

In Python we use method chaining, which apply a series of transformations to a pandas DataFrame.
In R we have dedicated pipe symbols and we can use |> (built-in) or %>% (via magrittr package) to chain operations.
Both of these approaches allow us to run multiple lines of code sequentially, simplifying pipelines and making them easier to read.

9.1 Context

9.2 Section setup

9.3 Pipes or chaining commands

9.4 Chaining different commands

9.5 Exercises

9.5.1 Chaining

9.5.2 Converting dates: surveys

9.6 Summary

9.5.2 Converting dates: `surveys`