# A collection of R packages designed for data science
library(tidyverse)
surveys <- read_csv("data/surveys.csv")
9 Chaining operations
- Learn how to chain operations together.
9.1 Context
In the section above we performed several operations on a single data set. Often there is a sequence to this, where the output of one operation gets fed into the next. We can simplify this by chaining commands.
9.2 Section setup
In R, we’ll continue this section with the script named da3-06-chaining-operations.R. If needed, add the setup code that loads tidyverse and reads data/surveys.csv to the top of your script and run it.
In Python, we’ll continue this section with the script named da3-06-chaining-operations.py. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
surveys = pd.read_csv("data/surveys.csv")
9.3 Pipes
So far, we’ve used single operations when we were manipulating our data. For example, we can select columns with:
select(surveys, record_id, hindfoot_length)
# A tibble: 35,549 × 2
record_id hindfoot_length
<dbl> <dbl>
1 1 32
2 2 33
3 3 37
4 4 36
5 5 35
6 6 14
7 7 NA
8 8 37
9 9 34
10 10 20
# ℹ 35,539 more rows
Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.
We would have to do the following:
# grab the relevant columns and store in a new object
subset_surveys <- select(surveys, record_id, hindfoot_length)
# create the new column
mutate(subset_surveys, hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
record_id hindfoot_length hindfoot_length_cm
<dbl> <dbl> <dbl>
1 1 32 3.2
2 2 33 3.3
3 3 37 3.7
4 4 36 3.6
5 5 35 3.5
6 6 14 1.4
7 7 NA NA
8 8 37 3.7
9 9 34 3.4
10 10 20 2
# ℹ 35,539 more rows
We had to create a new object (here, called subset_surveys) to store the intermediate data we were interested in, and then continue with creating the new column.
This clutters up your computer’s memory rather quickly when dealing with lots of data. A much better way is to pipe one operation into the next. To do this, we start with the data and use a pipe symbol (|> or %>%) as follows:
surveys |>
  select(record_id, hindfoot_length) |>
  mutate(hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 3
record_id hindfoot_length hindfoot_length_cm
<dbl> <dbl> <dbl>
1 1 32 3.2
2 2 33 3.3
3 3 37 3.7
4 4 36 3.6
5 5 35 3.5
6 6 14 1.4
7 7 NA NA
8 8 37 3.7
9 9 34 3.4
10 10 20 2
# ℹ 35,539 more rows
An easy way of remembering what the pipe does is to replace (in your head) the pipe symbol with the phrase “and then…”.
So, we select() the record_id and hindfoot_length columns and then use mutate() to create a new column called hindfoot_length_cm.
You’ll find that people use two pipe symbols quite interchangeably in R: the |> pipe (native, built-in R) and %>% from the magrittr package.
The native, built-in pipe is a fairly recent addition, available since R version 4.1. Its behaviour differs slightly from that of the %>% pipe (if you want to know more, see here), but for most purposes they work the same.
We tend to use the native, built-in pipe throughout the materials. But the magrittr pipe works just as well! You can change your preference in RStudio by going to Tools > Global options > Code and ticking or unticking the box that enables the native pipe operator.
surveys[["record_id", "hindfoot_length"]].copy()
record_id hindfoot_length
0 1 32.0
1 2 33.0
2 3 37.0
3 4 36.0
4 5 35.0
... ... ...
35544 35545 NaN
35545 35546 NaN
35546 35547 15.0
35547 35548 36.0
35548 35549 NaN
[35549 rows x 2 columns]
Let’s say we wanted to combine that with creating a new column, for example hindfoot length in centimeters.
We would have to do the following:
# select the required columns and store in a new data set
selected = surveys[["record_id", "hindfoot_length"]].copy()
# take the new data set and calculate the new column
selected["hindfoot_length_cm"] = selected["hindfoot_length"] / 10
We had to create a new object (here, called selected) to store the intermediate data we were interested in, and then continue with creating the new column.
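As a small aside on the .copy() call used above, here is a minimal sketch of what it does. The toy data frame and its values are made up for illustration and stand in for the real surveys data. Depending on your pandas version, leaving out .copy() may also trigger a SettingWithCopyWarning when the new column is added.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys.csv (values are made up)
toy = pd.DataFrame({
    "record_id": [1, 2, 3],
    "hindfoot_length": [32.0, 33.0, 37.0],
    "weight": [40.0, 48.0, 29.0],
})

# .copy() gives us an independent data frame to work on
selected = toy[["record_id", "hindfoot_length"]].copy()

# Adding a column to the copy leaves the original untouched
selected["hindfoot_length_cm"] = selected["hindfoot_length"] / 10

print(list(toy.columns))
print(list(selected.columns))
```

Both objects now exist side by side: the original data and the intermediate subset with its extra column.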
This clutters up your computer’s memory rather quickly when dealing with lots of data. So, it’d be good if we could pipe these commands through, like we can do in R.
But, a bit of sad news here. Python does not really have an equivalent to pipes in R. You can somewhat emulate it with a non-intuitive set of operations like this:
(surveys[["record_id", "hindfoot_length"]]
    .assign(hindfoot_length_cm = lambda df: df["hindfoot_length"] / 10))
record_id hindfoot_length hindfoot_length_cm
0 1 32.0 3.2
1 2 33.0 3.3
2 3 37.0 3.7
3 4 36.0 3.6
4 5 35.0 3.5
... ... ... ...
35544 35545 NaN NaN
35545 35546 NaN NaN
35546 35547 15.0 1.5
35547 35548 36.0 3.6
35548 35549 NaN NaN
[35549 rows x 3 columns]
Here, we do the following:
- surveys[["record_id", "hindfoot_length"]] selects the columns you want
- .assign(...) then creates a new column
- lambda df tells pandas to compute the new column using the current data frame in the chain
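To make the role of the lambda a bit more concrete, here is a minimal sketch on a tiny, made-up data frame (the toy object and its values are assumptions for illustration, not the real surveys data). Inside the chain, the selected data frame has no name of its own, so the lambda is how .assign() gets hold of it. The sketch also checks that the chained version gives the same result as the step-by-step version with an intermediate object.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys.csv (values are made up)
toy = pd.DataFrame({
    "record_id": [1, 2, 3],
    "hindfoot_length": [32.0, None, 37.0],
    "weight": [40.0, 48.0, 29.0],
})

# Chained version: select the columns, and then add the new column.
# The lambda receives the data frame as it exists at that point in the chain.
chained = (
    toy[["record_id", "hindfoot_length"]]
    .assign(hindfoot_length_cm=lambda df: df["hindfoot_length"] / 10)
)

# Step-by-step version with a named intermediate object
selected = toy[["record_id", "hindfoot_length"]].copy()
selected["hindfoot_length_cm"] = selected["hindfoot_length"] / 10

# Compare the two results (missing values in the same position count as equal)
print(chained.equals(selected))
```

The chained version never creates a named intermediate object, which is the closest we get to the "and then" reading of a pipe.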
But I guess you’ll agree that this is not that much easier to read. There are some dplyr-style implementations in Python that also include a pipe. One is siuba, but it does not seem to be actively maintained. Another one is dfply, which has not been updated for 7 years and counting…
So, rather than being frustrated about this, I suggest we accept the differences between the two languages and move on! :-)
9.4 Summary
- In Python there is not a clear way to chain operations.
- In R we can use |> (built-in) or %>% (via the magrittr package) to chain operations.
- This allows us to run multiple lines of code sequentially, simplifying pipelines and making them easier to read.