# A collection of R packages designed for data science
library(tidyverse)
<- read_csv("data/surveys.csv") surveys
7 Manipulating columns
- Learn how to select and/or rename specific columns from a data frame
- Be able to create new columns and modify existing ones
7.1 Context
In the tabular data section we learned to deal with, well, tabular data in the form of our surveys
data set. This data set isn’t huge, but sometimes we have many variables and we might only want to work with a subset of them. Or, we might want to create new columns based on existing data. In this section we’ll cover how we can do this.
7.2 Section setup
We’ll start this section with a new script named, for example, 05-manipulation.R
. If needed, add the following code to the top of your script and run it.
We’ll start this section with a new script named, for example, 05-manipulation.py
. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
= pd.read_csv("data/surveys.csv") surveys
7.3 Selecting columns
Let’s remind ourselves to which columns we have in our surveys
data set. After that, we’ll start making some changes.
colnames(surveys)
[1] "record_id" "month" "day" "year"
[5] "plot_id" "species_id" "sex" "hindfoot_length"
[9] "weight"
surveys.columns
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
'hindfoot_length', 'weight'],
dtype='object')
7.3.1 Selecting individual columns
Let’s say we wanted to select only the record_id
and year
columns. We’ve briefly done this when we looked at subsetting rows and columns.
However, there is an alternative way of doing this using the dplyr
package - which is part of tidyverse
.
We can use the select()
function:
select(surveys, record_id, year)
# A tibble: 35,549 × 2
record_id year
<dbl> <dbl>
1 1 1977
2 2 1977
3 3 1977
4 4 1977
5 5 1977
6 6 1977
7 7 1977
8 8 1977
9 9 1977
10 10 1977
# ℹ 35,539 more rows
Using the base R syntax, this is equivalent to surveys[, c("record_id", "year")]
. Notice that with the select()
function (and generally with dplyr
functions) we didn’t need to quote ” the column names. This is because the first input to the function is the table name, and so everything after is assumed to be column names of that table.
The way we need to specify this is by giving a list of column names ["record_id", "year"]
and subsetting the surveys data set with this.
The way we subset is with surveys[ ]
, so we end up with double square brackets:
"record_id", "year"]] surveys[[
record_id year
0 1 1977
1 2 1977
2 3 1977
3 4 1977
4 5 1977
... ... ...
35544 35545 2002
35545 35546 2002
35546 35547 2002
35547 35548 2002
35548 35549 2002
[35549 rows x 2 columns]
7.3.2 Selecting with helper functions
The select()
function becomes particularly useful when we combine it with other helper functions. For example, this code will select all the columns where the column name contains the string (text) "_id"
:
# returns all columns where the column name contains the text "_id"
select(surveys, contains("_id"))
# A tibble: 35,549 × 3
record_id plot_id species_id
<dbl> <dbl> <chr>
1 1 2 NL
2 2 3 NL
3 3 2 DM
4 4 7 DM
5 5 3 DM
6 6 1 PF
7 7 2 PE
8 8 1 DM
9 9 1 DM
10 10 6 PF
# ℹ 35,539 more rows
The subsetting becomes a bit tedious when we’re looking for patterns in the column names. Here, we can instead use the .filter
attribute of the surveys data set, and look for a string (text) where the column name contains "_id"
.
filter(like = "_id") surveys.
record_id plot_id species_id
0 1 2 NL
1 2 3 NL
2 3 2 DM
3 4 7 DM
4 5 3 DM
... ... ... ...
35544 35545 15 AH
35545 35546 15 AH
35546 35547 10 RM
35547 35548 7 DO
35548 35549 5 NaN
[35549 rows x 3 columns]
7.3.3 Selecting a range of columns
Let’s say we’re interested in all the columns from record_id
to year
.
In that case, we can use the :
symbol.
# returns all columns between and including record_id and year
select(surveys, record_id:year)
# A tibble: 35,549 × 4
record_id month day year
<dbl> <dbl> <dbl> <dbl>
1 1 7 16 1977
2 2 7 16 1977
3 3 7 16 1977
4 4 7 16 1977
5 5 7 16 1977
6 6 7 16 1977
7 7 7 16 1977
8 8 7 16 1977
9 9 7 16 1977
10 10 7 16 1977
# ℹ 35,539 more rows
We can also combine this with the previous method:
# returns all columns between and including record_id and year
# and all columns where the column name contains the text "_id"
select(surveys, record_id:year, contains("_id"))
# A tibble: 35,549 × 6
record_id month day year plot_id species_id
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 7 16 1977 2 NL
2 2 7 16 1977 3 NL
3 3 7 16 1977 2 DM
4 4 7 16 1977 7 DM
5 5 7 16 1977 3 DM
6 6 7 16 1977 1 PF
7 7 7 16 1977 2 PE
8 8 7 16 1977 1 DM
9 9 7 16 1977 1 DM
10 10 7 16 1977 6 PF
# ℹ 35,539 more rows
In that case, we can use the :
symbol, in combination with the .loc
indexer.
"record_id":"year"] surveys.loc[:,
record_id month day year
0 1 7 16 1977
1 2 7 16 1977
2 3 7 16 1977
3 4 7 16 1977
4 5 7 16 1977
... ... ... ... ...
35544 35545 12 31 2002
35545 35546 12 31 2002
35546 35547 12 31 2002
35547 35548 12 31 2002
35548 35549 12 31 2002
[35549 rows x 4 columns]
7.3.4 Unselecting columns
Lastly, we can also unselect columns. This can be useful when you want most columns, apart from some.
To do this, we use the -
symbol before the column name.
# returns all columns apart from record_id
select(surveys, -record_id)
# A tibble: 35,549 × 8
month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 7 16 1977 2 NL M 32 NA
2 7 16 1977 3 NL M 33 NA
3 7 16 1977 2 DM F 37 NA
4 7 16 1977 7 DM M 36 NA
5 7 16 1977 3 DM M 35 NA
6 7 16 1977 1 PF M 14 NA
7 7 16 1977 2 PE F NA NA
8 7 16 1977 1 DM M 37 NA
9 7 16 1977 1 DM F 34 NA
10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
To do this, we use the .drop
attribute. Here, we only unselect one column, but we can easily extend this by providing a list of columns do the column =
argument.
= "record_id") surveys.drop(columns
month day year plot_id species_id sex hindfoot_length weight
0 7 16 1977 2 NL M 32.0 NaN
1 7 16 1977 3 NL M 33.0 NaN
2 7 16 1977 2 DM F 37.0 NaN
3 7 16 1977 7 DM M 36.0 NaN
4 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ...
35544 12 31 2002 15 AH NaN NaN NaN
35545 12 31 2002 15 AH NaN NaN NaN
35546 12 31 2002 10 RM F 15.0 14.0
35547 12 31 2002 7 DO M 36.0 51.0
35548 12 31 2002 5 NaN NaN NaN NaN
[35549 rows x 8 columns]
7.4 Renaming and reshuffling columns
7.4.1 Renaming columns
For example, we might want to change the weight
column name to weight_g
, to reflect that the values are in grams.
We can use the rename()
function to change a column name. We do this as follows:
rename(surveys, weight_g = weight)
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight_g
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
We can use the .rename()
attribute of the surveys
Pandas data frame:
= {'weight': 'weight_g'}) surveys.rename(columns
record_id month day year ... species_id sex hindfoot_length weight_g
0 1 7 16 1977 ... NL M 32.0 NaN
1 2 7 16 1977 ... NL M 33.0 NaN
2 3 7 16 1977 ... DM F 37.0 NaN
3 4 7 16 1977 ... DM M 36.0 NaN
4 5 7 16 1977 ... DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 ... AH NaN NaN NaN
35545 35546 12 31 2002 ... AH NaN NaN NaN
35546 35547 12 31 2002 ... RM F 15.0 14.0
35547 35548 12 31 2002 ... DO M 36.0 51.0
35548 35549 12 31 2002 ... NaN NaN NaN NaN
[35549 rows x 9 columns]
7.4.2 Reshuffling columns
It might be that you want to reorder/reshuffle a column. Here, the year
column is our fourth variable. Let’s say we’d want to move this to the second position (after record_id
).
We can use the relocate()
function to do this. The function has several arguments, starting with .
, such as .before =
or .after =
. These allow you to specify where you want to reinsert the column.
relocate(surveys, year, .after = record_id)
# A tibble: 35,549 × 9
record_id year month day plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 1977 7 16 2 NL M 32 NA
2 2 1977 7 16 3 NL M 33 NA
3 3 1977 7 16 2 DM F 37 NA
4 4 1977 7 16 7 DM M 36 NA
5 5 1977 7 16 3 DM M 35 NA
6 6 1977 7 16 1 PF M 14 NA
7 7 1977 7 16 2 PE F NA NA
8 8 1977 7 16 1 DM M 37 NA
9 9 1977 7 16 1 DM F 34 NA
10 10 1977 7 16 6 PF F 20 NA
# ℹ 35,539 more rows
Unlike in R, there isn’t a very clear, straightforward way of reinserting columns in a Pandas data frame. We could show you convoluted ways of doing so, but at this point that’s just confusing. So, we’ll leave you with a link to a Stackoverflow solution.
7.5 Creating new columns
Sometimes we need to create new columns. For example, we might have a variable that is not in the unit of measurement we need (e.g. in millimeter, instead of centimeters).
month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|
7 | 16 | 1977 | 2 | NL | M | 32 | NA |
7 | 16 | 1977 | 3 | NL | M | 33 | NA |
7 | 16 | 1977 | 2 | DM | F | 37 | NA |
7 | 16 | 1977 | 7 | DM | M | 36 | NA |
7 | 16 | 1977 | 3 | DM | M | 35 | NA |
Let’s say we wanted to get hindfoot_length
in centimeters, instead of millimeters. We’d have to go through each row, take the hindfoot_length
value and divide it by 10. We then need to store this output in a column called, for example, hindfoot_length_cm
.
We can use the mutate()
function to create new columns:
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
# A tibble: 35,549 × 10
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
# ℹ 1 more variable: hindfoot_length_cm <dbl>
We use the square brackets to define the name of the new column, then specify what needs to go in the new column:
'hindfoot_length_cm'] = surveys['hindfoot_length'] / 10 surveys[
Although it has created the column, we can’t quite see it because we have too many columns. So, let’s save the new column to the data set and then select the relevant columns.
First, we update our data:
<- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10) surveys
Next, we can select the columns.
select(surveys, record_id, hindfoot_length, hindfoot_length_cm)
# A tibble: 35,549 × 3
record_id hindfoot_length hindfoot_length_cm
<dbl> <dbl> <dbl>
1 1 32 3.2
2 2 33 3.3
3 3 37 3.7
4 4 36 3.6
5 5 35 3.5
6 6 14 1.4
7 7 NA NA
8 8 37 3.7
9 9 34 3.4
10 10 20 2
# ℹ 35,539 more rows
Our previous step already added the new column to the DataFrame, so we can directly select the relevant columns, by giving a list of the columns we’re interested in:
'record_id', 'hindfoot_length', 'hindfoot_length_cm']] surveys[[
record_id hindfoot_length hindfoot_length_cm
0 1 32.0 3.2
1 2 33.0 3.3
2 3 37.0 3.7
3 4 36.0 3.6
4 5 35.0 3.5
... ... ... ...
35544 35545 NaN NaN
35545 35546 NaN NaN
35546 35547 15.0 1.5
35547 35548 36.0 3.6
35548 35549 NaN NaN
[35549 rows x 3 columns]
7.6 Summary
- We have several functions available that allow us to select, move, rename and create new columns