library(tidyverse)
8 Data wrangling
- Be able to make changes to variables (columns).
- Be able to make changes to observations (rows).
- Implement changes on a grouped basis.
- Export a data set to file.
8.1 Libraries and functions
8.1.1 Libraries
8.1.2 Functions
# create / change columns
::mutate()
dplyr
# move columns
::relocate()
dplyr
# group values by one or more variables
::group_by()
dplyr
# count number of unique observations
::count()
dplyr
# summarises data; specify the type of summary within the function
::summarise()
dplyr
# reshapes the data into a wide format
::pivot_wider()
tidyr
# reshapes the data into a long format
::pivot_longer() tidyr
8.2 Purpose and aim
Often, there is not one single data format that allows you to do all of your analysis. Getting comfortable with making changes to the way your data are organised is an important skill. This is sometimes referred to as ‘data wrangling’. In this section we’ll learn how we can change the organisation of columns, how to add new columns, manipulate rows and perform these operations on subgroups of the data.
8.3 Reading in data
We’ll keep using our data set on Darwin’s finches. If you haven’t read these data in, please do so with the following:
<- read_csv("data/finches.csv") finches
8.4 Creating new columns
Sometimes you’ll have to create new columns in your data set. For example, you might have a column that records something in kilograms, but you need it in milligrams. You’d then have to either convert the original column or create a new one with the new data.
Let’s see how to do this using the weight
column from the finches
data.
We’ll use pipes to do this, so we can see what R is doing without immediately updating the data. This is generally a useful technique: check each step one-by-one and after you’re happy with the changes, then update the table.
To add a column, we use the mutate()
function. We first define the name of the new column, then tell it what needs to go in it.
%>%
finches mutate(weight_kg = weight / 1000)
# A tibble: 180 × 13
species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 G. fortis early_… 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431
2 G. fortis early_… 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452
3 G. fortis early_… 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955
4 G. fortis early_… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824
5 G. fortis early_… 15.7 69 18.9 10.9 9.8 9 0.332 1.08
6 G. fortis early_… 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55
7 G. fortis early_… 17.2 69 20.3 11.9 9.8 9 1.86 1.67
8 G. fortis early_… 17.2 68.5 19.2 11.4 9.8 8.6 0.879 1.00
9 G. fortis early_… 16.5 66.3 18.7 9.04 8.42 7.98 -0.227 -1.81
10 G. fortis early_… 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00
# ℹ 170 more rows
# ℹ 3 more variables: pc2_beak <dbl>, is_early <lgl>, weight_kg <dbl>
You’ll probably notice that our new column isn’t visible on screen. This is because we have quite a few columns in our table. We can move the new column to directly after the weight
column. We use the relocate()
function for this.
We tell relocate()
which column we want to move, then use the .after =
argument to specify where we want to insert the column.
%>%
finches mutate(weight_kg = weight / 1000) %>%
relocate(weight_kg, .after = weight)
# A tibble: 180 × 13
species group weight weight_kg wing tarsus blength bdepth bwidth pc1_body
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 G. fortis early… 15.8 0.0158 67.1 19.6 10.3 8.95 8.32 0.382
2 G. fortis early… 15.2 0.0152 66 18.3 10.4 8.7 8.4 -1.06
3 G. fortis early… 18.0 0.0180 68 18.9 11.2 9.6 8.83 0.839
4 G. fortis early… 18.5 0.0185 70.3 19.7 11 9.7 8.73 2.16
5 G. fortis early… 15.7 0.0157 69 18.9 10.9 9.8 9 0.332
6 G. fortis early… 17.8 0.0178 70.1 19.2 12.7 10.9 9.79 1.50
7 G. fortis early… 17.2 0.0172 69 20.3 11.9 9.8 9 1.86
8 G. fortis early… 17.2 0.0172 68.5 19.2 11.4 9.8 8.6 0.879
9 G. fortis early… 16.5 0.0165 66.3 18.7 9.04 8.42 7.98 -0.227
10 G. fortis early… 19.4 0.0194 69 18.7 11.3 9.6 8.8 1.39
# ℹ 170 more rows
# ℹ 3 more variables: pc1_beak <dbl>, pc2_beak <dbl>, is_early <lgl>
We can see that the new column indeed contains the new weight measurements, composed of the original weight
values divided by 1,000.
Now that we know this gives us the result we want, we can update the original table:
<- finches %>%
finches mutate(weight_kg = weight / 1000) %>%
relocate(weight_kg, .after = weight)
8.5 Grouping and summarising
A very common technique used in data analysis is the “split-apply-combine”. This is a three-step process, where we:
- Split the data into subgroups.
- Apply a set of transformations / calculations / … to each subgroup.
- Combine the result into a single table.
8.5.1 Groups
I happen to know that there are two distinct species in this data set. Let’s say we’re interested in finding out how many observations we have for each species.
There are two steps to this process:
- We need to split the data by
species
. - We need to count the number of rows (= observations) in each subgroup.
We can use the group_by()
function to group data by a given variable. Here, we will group the data by species
:
%>%
finches group_by(species)
# A tibble: 180 × 13
# Groups: species [2]
species group weight weight_kg wing tarsus blength bdepth bwidth pc1_body
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 G. fortis early… 15.8 0.0158 67.1 19.6 10.3 8.95 8.32 0.382
2 G. fortis early… 15.2 0.0152 66 18.3 10.4 8.7 8.4 -1.06
3 G. fortis early… 18.0 0.0180 68 18.9 11.2 9.6 8.83 0.839
4 G. fortis early… 18.5 0.0185 70.3 19.7 11 9.7 8.73 2.16
5 G. fortis early… 15.7 0.0157 69 18.9 10.9 9.8 9 0.332
6 G. fortis early… 17.8 0.0178 70.1 19.2 12.7 10.9 9.79 1.50
7 G. fortis early… 17.2 0.0172 69 20.3 11.9 9.8 9 1.86
8 G. fortis early… 17.2 0.0172 68.5 19.2 11.4 9.8 8.6 0.879
9 G. fortis early… 16.5 0.0165 66.3 18.7 9.04 8.42 7.98 -0.227
10 G. fortis early… 19.4 0.0194 69 18.7 11.3 9.6 8.8 1.39
# ℹ 170 more rows
# ℹ 3 more variables: pc1_beak <dbl>, pc2_beak <dbl>, is_early <lgl>
This doesn’t seem to make much difference, since it’s still outputting all of the data. However, if you look closely, you will notice that next to the A tibble: 180 x 13
text in the top-left corner there is now a Groups: species [2]
designation. What this means is that, behind the scenes, the table is now also split by the species
variable and that there are two distinct groups in there.
So, if we want to see how many observations we have in each group we can use the very useful count()
function. We don’t have to specify anything - in this case it just counts the number of rows.
%>%
finches group_by(species) %>%
count()
# A tibble: 2 × 2
# Groups: species [2]
species n
<chr> <int>
1 G. fortis 89
2 G. scandens 91
There we are, we have two distinct species of finch in these data and they more or less have an equal number of observations.
8.5.2 Summarising data
Quite often you might find yourself in a situation where you want to get some summary statistics, based on subgroups within the data. Let’s see how that works with our data.
We now know there are two species in our data. Let’s imagine we wanted to know the average weight
for each species.
We can use the summarise()
function to, well, summarise data. The first bit indicates the name of the new column that will contain the summarised values. The part after it determines what goes into this column.
Here we want the average weight, so we use mean(weight)
to calculate this. Let’s store this in a column called avg_weight
.
%>%
finches group_by(species) %>%
summarise(avg_weight = mean(weight))
# A tibble: 2 × 2
species avg_weight
<chr> <dbl>
1 G. fortis 15.8
2 G. scandens 19.5
This gives us a table where we have the average weight for each species. We can simply expand this for any other variables, for example:
# calculate mean, median, minimum and maximum weight per group
%>%
finches group_by(species) %>%
summarise(avg_weight = mean(weight),
median_weight = median(weight),
min_weight = min(weight),
max_weight = max(weight))
# A tibble: 2 × 5
species avg_weight median_weight min_weight max_weight
<chr> <dbl> <dbl> <dbl> <dbl>
1 G. fortis 15.8 15.5 11.6 19.9
2 G. scandens 19.5 19 15.4 24.4
8.6 Reshaping data
When you’re analysing your data, you’ll often find that you will need to structure your data in different ways, for different purposes.
Ideally, you always have the same starting point where:
- Each column contains a single variable (something you’re measuring).
- Each row is a single observation (all the measurements belonging to a single unit/person/plant etc).
Even though you might still need to have your data in a different shape, having it like this as a starting point means you can always rework your data.
Let’s illustrate this with the following example:
# A tibble: 6 × 3
species group n
<chr> <chr> <int>
1 G. fortis early_blunt 30
2 G. fortis late_blunt 30
3 G. fortis late_pointed 29
4 G. scandens early_pointed 31
5 G. scandens late_blunt 30
6 G. scandens late_pointed 30
Here we have count data (number of observations) for each species and group. It’s quite a list and you can imagine that if you had many more species then it would become tricky to interpret. So, instead we’re going to reshape the this table and have a column for each unique group
and a row for each species
.
We can obtain the data set above by using the count()
function. Here we are counting by two variables: species
and group
.
If we want to reshape the data, we can use the pivot_*
functions. There are two main ones:
pivot_longer()
creates a ‘long’ format data set; here each observation is a single row and data is repeated in the first column.pivot_wider()
creates a ‘wide’ format data set; here data is not repeated in the first column.
So, here we are using the pivot_wider()
function. We need to tell it where the new column names are going to come from (names_from =
). We also need to specify where the values are coming from that are going to be used to populate the new table (values_from =
):
<- finches %>%
finches_wide count(species, group) %>%
pivot_wider(names_from = group, values_from = n)
finches_wide
# A tibble: 2 × 5
species early_blunt late_blunt late_pointed early_pointed
<chr> <int> <int> <int> <int>
1 G. fortis 30 30 29 NA
2 G. scandens NA 30 30 31
This gives us a ‘wide’ table, where the original data are split by the type of group
. We have 4 distinct groups, so we end up with one column for each group plus the original one for species
.
Deciding which format to use can sometimes feel a bit tricky. Relating it to plotting can be helpful. Ask yourself the question: “what is going on the x and y axis?”. Each variable that you want to plot on either the x or y axis needs to be in its own column.
8.7 Exporting data
It can be useful to save data sets you create throughout your analysis.
We can do this using the write_csv()
function. This will write a table to a .csv
file (comma-separated values). The first part tells it which data set we’re saving. We’ll use the finches_wide
as an example. The file =
argument specifies where the file needs to be stored. Here, we are saving it in the data
folder, under the name finches_wide.csv
.
Note: the filename needs to be in quotes and needs to have a file extension.
write_csv(finches_wide, file = "data/finches_wide.csv")
8.8 Summary
- A 3-step process (split - apply - combine) allows you to apply transformations on subgroups of your data.
- The result can be combined in a single table.
- We reshape our data based on our type of analysis.
- Organise your data so that each variable has its own column and each observation is a row.