9 Chaining operations
Learning objectives:
- Learn how to chain operations together.
9.1 Context
In the section above we performed several operations on a single data set. Often there is a sequence to this, where the output of one operation gets fed into the next. We can simplify this by chaining commands.
9.2 Section setup
We’ll continue this section with the script named da3-06-chaining-operations.R. If needed, add the following code to the top of your script and run it.
# A collection of R packages designed for data science
library(tidyverse)
surveys <- read_csv("data/surveys.csv")
We’ll continue this section with the script named da3-06-chaining-operations.py. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
surveys = pd.read_csv("data/surveys.csv")
9.3 Pipes or chaining commands
So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:
select(surveys, record_id, hindfoot_length)
Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.
We would have to do the following:
# grab the relevant columns and store in a new object
subset_surveys <- select(surveys, record_id, hindfoot_length)
# create the new column
mutate(subset_surveys, hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
record_id hindfoot_length hindfoot_length_cm
<dbl> <dbl> <dbl>
1 1 32 3.2
2 2 33 3.3
3 3 37 3.7
4 4 36 3.6
5 5 35 3.5
6 6 14 1.4
7 7 NA NA
8 8 37 3.7
9 9 34 3.4
10 10 20 2
# ℹ 35,539 more rows
We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.
This clutters up your computer’s memory rather quickly when dealing with lots of data. A much better way is to pipe or chain the commands, one after the other. To do this, we start with the data and use a pipe symbol (|> or %>%) as follows:
surveys |>
  select(record_id, hindfoot_length) |>
  mutate(hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
record_id hindfoot_length hindfoot_length_cm
<dbl> <dbl> <dbl>
1 1 32 3.2
2 2 33 3.3
3 3 37 3.7
4 4 36 3.6
5 5 35 3.5
6 6 14 1.4
7 7 NA NA
8 8 37 3.7
9 9 34 3.4
10 10 20 2
# ℹ 35,539 more rows
An easy way of remembering what the pipe does is to replace (in your head) the pipe symbol with the phrase “and then…”.
So, we select() the record_id and hindfoot_length columns and then use mutate() to create a new column called hindfoot_length_cm.
You’ll find that people use two pipe symbols quite interchangeably in R: the |> pipe (native, built-in R) and %>% from the magrittr package.
The native, built-in pipe is a rather new addition, available since R version 4.1. It is slightly different in its behaviour from the %>% pipe (if you want to know more, see here), but for most purposes they work the same.
We tend to use the native, built-in pipe throughout the materials, but the magrittr pipe works just as well! You can change your preference in RStudio by going to Tools > Global Options > Code and changing the tickbox that enables/disables the native pipe operator.
So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:
surveys[["record_id", "hindfoot_length"]].copy()
record_id hindfoot_length
0 1 32.0
1 2 33.0
2 3 37.0
3 4 36.0
4 5 35.0
... ... ...
35544 35545 NaN
35545 35546 NaN
35546 35547 15.0
35547 35548 36.0
35548 35549 NaN
[35549 rows x 2 columns]
Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.
We would have to do the following:
# select the required columns and store in a new data set
subset_surveys = surveys[["record_id", "hindfoot_length"]].copy()

# take the new data set and calculate the new column
subset_surveys["hindfoot_length_cm"] = subset_surveys["hindfoot_length"] / 10
We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.
This clutters up your computer’s memory rather quickly when dealing with lots of data. So, it’d be good if we could pipe or chain these commands, like we can do in R.
Python does not have an exact equivalent to pipes in R, but you can chain methods in pandas, which can make your life a lot easier. Let’s explore that a bit further.
In the code below we are using .loc[:, ["col1", "col2"]] to select columns. This is a bit more chain-friendly than using [["col1", "col2"]], because it also allows filtering (more on this in the next chapter).
(
    surveys
    .loc[:, ['record_id', 'hindfoot_length']]
    .assign(hindfoot_length_cm = lambda df: df["hindfoot_length"] / 10)
)
record_id hindfoot_length hindfoot_length_cm
0 1 32.0 3.2
1 2 33.0 3.3
2 3 37.0 3.7
3 4 36.0 3.6
4 5 35.0 3.5
... ... ... ...
35544 35545 NaN NaN
35545 35546 NaN NaN
35546 35547 15.0 1.5
35547 35548 36.0 3.6
35548 35549 NaN NaN
[35549 rows x 3 columns]
Here, we do the following:
- .loc[:, ['record_id', 'hindfoot_length']] selects the columns you want
- .assign(...) then creates a new column
- lambda df tells pandas to compute the new column using the current data frame in the chain
Why use lambda in pandas .assign()?
In Python, a lambda is an anonymous function: a quick, in-place function without a name.
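For example, the two definitions below behave the same way (the to_cm name is just a hypothetical helper for illustration, not part of the course data):
# a named function that converts millimetres to centimetres
def to_cm(x):
    return x / 10

# the equivalent anonymous lambda, assigned to a name only so we can call it below
to_cm_lambda = lambda x: x / 10

print(to_cm(32))         # 3.2
print(to_cm_lambda(32))  # 3.2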
Why is lambda important in method chaining?
This all relates to how the arguments are evaluated: an expression you pass to .assign() is evaluated straight away, against the original data frame you name, not against the intermediate data frame produced by the previous steps in the chain.
So if you reference a column that was renamed or created earlier in the chain, pandas won’t find it unless you use lambda to delay evaluation.
Using lambda in .assign() ensures that pandas waits to evaluate the new column until the DataFrame is fully updated by the earlier steps.
Suppose we want to:
- Select two columns
- Create a new column weight_kg as weight_g / 1000
Without lambda, this will raise an error:
(
    surveys
    .loc[:, ["record_id", "weight"]]
    .assign(weight_kg = surveys["weight_g"] / 1000)  # KeyError: "weight_g"
)
There are some dplyr-style implementations in Python that also include a pipe. One is siuba, but it does not seem to be actively maintained. Another one is dfply, which has not been updated for 7 years and counting…
So, rather than being frustrated about this, I suggest we accept the differences between the two languages and move on! :-)
9.4 Chaining different commands
We can extend this concept a bit further, beyond just the creation of new columns.
surveys |>
  select(record_id, weight) |>   # select columns
  rename(weight_g = weight) |>   # rename weight
  head()                         # show the first few rows
# A tibble: 6 × 2
record_id weight_g
<dbl> <dbl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
(
    surveys
    .loc[:, ['record_id', 'weight']]            # select columns
    .rename(columns = {'weight': 'weight_g'})   # rename column
    .head()                                     # show first few rows
)
record_id weight_g
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
9.5 Summary
- In Python we use method chaining, which applies a series of transformations to a pandas DataFrame.
- In R we have dedicated pipe symbols and we can use |> (built-in) or %>% (via the magrittr package) to chain operations.
- Both of these approaches allow us to run multiple lines of code sequentially, simplifying pipelines and making them easier to read.