8 Data wrangling

Learning outcomes

Be able to make changes to variables (columns).
Be able to make changes to observations (rows).
Implement changes on a grouped basis.
Export a data set to file.

8.1 Libraries and functions

Click to expand

8.1.1 Libraries

library(tidyverse)

8.1.2 Functions

# create / change columns
dplyr::mutate()

# move columns
dplyr::relocate()

# group values by one or more variables
dplyr::group_by()

# count number of unique observations
dplyr::count()

# summarises data; specify the type of summary within the function
dplyr::summarise()

# reshapes the data into a wide format
tidyr::pivot_wider()

# reshapes the data into a long format
tidyr::pivot_longer()

8.2 Purpose and aim

Often, there is not one single data format that allows you to do all of your analysis. Getting comfortable with making changes to the way your data are organised is an important skill. This is sometimes referred to as ‘data wrangling’. In this section we’ll learn how we can change the organisation of columns, how to add new columns, manipulate rows and perform these operations on subgroups of the data.

8.3 Reading in data

We’ll keep using our data set on Darwin’s finches. If you haven’t read these data in, please do so with the following:

finches <- read_csv("data/finches.csv")

8.4 Creating new columns

Sometimes you’ll have to create new columns in your data set. For example, you might have a column that records something in kilograms, but you need it in milligrams. You’d then have to either convert the original column or create a new one with the new data.

Let’s see how to do this using the weight column from the finches data.

We’ll use pipes to do this, so we can see what R is doing without immediately updating the data. This is generally a useful technique: check each step one-by-one and after you’re happy with the changes, then update the table.

To add a column, we use the mutate() function. We first define the name of the new column, then tell it what needs to go in it.

finches %>% 
  mutate(weight_kg = weight / 1000)

# A tibble: 180 × 13
   species   group   weight  wing tarsus blength bdepth bwidth pc1_body pc1_beak
   <chr>     <chr>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
 1 G. fortis early_…   15.8  67.1   19.6   10.3    8.95   8.32    0.382   -0.431
 2 G. fortis early_…   15.2  66     18.3   10.4    8.7    8.4    -1.06    -0.452
 3 G. fortis early_…   18.0  68     18.9   11.2    9.6    8.83    0.839    0.955
 4 G. fortis early_…   18.5  70.3   19.7   11      9.7    8.73    2.16     0.824
 5 G. fortis early_…   15.7  69     18.9   10.9    9.8    9       0.332    1.08 
 6 G. fortis early_…   17.8  70.1   19.2   12.7   10.9    9.79    1.50     3.55 
 7 G. fortis early_…   17.2  69     20.3   11.9    9.8    9       1.86     1.67 
 8 G. fortis early_…   17.2  68.5   19.2   11.4    9.8    8.6     0.879    1.00 
 9 G. fortis early_…   16.5  66.3   18.7    9.04   8.42   7.98   -0.227   -1.81 
10 G. fortis early_…   19.4  69     18.7   11.3    9.6    8.8     1.39     1.00 
# ℹ 170 more rows
# ℹ 3 more variables: pc2_beak <dbl>, is_early <lgl>, weight_kg <dbl>

You’ll probably notice that our new column isn’t visible on screen. This is because we have quite a few columns in our table. We can move the new column to directly after the weight column. We use the relocate() function for this.

We tell relocate() which column we want to move, then use the .after = argument to specify where we want to insert the column.

finches %>% 
  mutate(weight_kg = weight / 1000) %>% 
  relocate(weight_kg, .after = weight)

# A tibble: 180 × 13
   species   group  weight weight_kg  wing tarsus blength bdepth bwidth pc1_body
   <chr>     <chr>   <dbl>     <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>
 1 G. fortis early…   15.8    0.0158  67.1   19.6   10.3    8.95   8.32    0.382
 2 G. fortis early…   15.2    0.0152  66     18.3   10.4    8.7    8.4    -1.06 
 3 G. fortis early…   18.0    0.0180  68     18.9   11.2    9.6    8.83    0.839
 4 G. fortis early…   18.5    0.0185  70.3   19.7   11      9.7    8.73    2.16 
 5 G. fortis early…   15.7    0.0157  69     18.9   10.9    9.8    9       0.332
 6 G. fortis early…   17.8    0.0178  70.1   19.2   12.7   10.9    9.79    1.50 
 7 G. fortis early…   17.2    0.0172  69     20.3   11.9    9.8    9       1.86 
 8 G. fortis early…   17.2    0.0172  68.5   19.2   11.4    9.8    8.6     0.879
 9 G. fortis early…   16.5    0.0165  66.3   18.7    9.04   8.42   7.98   -0.227
10 G. fortis early…   19.4    0.0194  69     18.7   11.3    9.6    8.8     1.39 
# ℹ 170 more rows
# ℹ 3 more variables: pc1_beak <dbl>, pc2_beak <dbl>, is_early <lgl>

We can see that the new column indeed contains the new weight measurements, composed of the original weight values divided by 1,000.

Now that we know this gives us the result we want, we can update the original table:

finches <- finches %>% 
  mutate(weight_kg = weight / 1000) %>% 
  relocate(weight_kg, .after = weight)

8.5 Grouping and summarising

A very common technique used in data analysis is the “split-apply-combine”. This is a three-step process, where we:

Split the data into subgroups.
Apply a set of transformations / calculations / … to each subgroup.
Combine the result into a single table.

8.5.1 Groups

I happen to know that there are two distinct species in this data set. Let’s say we’re interested in finding out how many observations we have for each species.

There are two steps to this process:

We need to split the data by species.
We need to count the number of rows (= observations) in each subgroup.

We can use the group_by() function to group data by a given variable. Here, we will group the data by species:

finches %>% 
  group_by(species)

# A tibble: 180 × 13
# Groups:   species [2]
   species   group  weight weight_kg  wing tarsus blength bdepth bwidth pc1_body
   <chr>     <chr>   <dbl>     <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>
 1 G. fortis early…   15.8    0.0158  67.1   19.6   10.3    8.95   8.32    0.382
 2 G. fortis early…   15.2    0.0152  66     18.3   10.4    8.7    8.4    -1.06 
 3 G. fortis early…   18.0    0.0180  68     18.9   11.2    9.6    8.83    0.839
 4 G. fortis early…   18.5    0.0185  70.3   19.7   11      9.7    8.73    2.16 
 5 G. fortis early…   15.7    0.0157  69     18.9   10.9    9.8    9       0.332
 6 G. fortis early…   17.8    0.0178  70.1   19.2   12.7   10.9    9.79    1.50 
 7 G. fortis early…   17.2    0.0172  69     20.3   11.9    9.8    9       1.86 
 8 G. fortis early…   17.2    0.0172  68.5   19.2   11.4    9.8    8.6     0.879
 9 G. fortis early…   16.5    0.0165  66.3   18.7    9.04   8.42   7.98   -0.227
10 G. fortis early…   19.4    0.0194  69     18.7   11.3    9.6    8.8     1.39 
# ℹ 170 more rows
# ℹ 3 more variables: pc1_beak <dbl>, pc2_beak <dbl>, is_early <lgl>

This doesn’t seem to make much difference, since it’s still outputting all of the data. However, if you look closely, you will notice that next to the A tibble: 180 x 13 text in the top-left corner there is now a Groups: species [2] designation. What this means is that, behind the scenes, the table is now also split by the species variable and that there are two distinct groups in there.

So, if we want to see how many observations we have in each group we can use the very useful count() function. We don’t have to specify anything - in this case it just counts the number of rows.

finches %>% 
  group_by(species) %>% 
  count()

# A tibble: 2 × 2
# Groups:   species [2]
  species         n
  <chr>       <int>
1 G. fortis      89
2 G. scandens    91

There we are, we have two distinct species of finch in these data and they more or less have an equal number of observations.

8.5.2 Summarising data

Quite often you might find yourself in a situation where you want to get some summary statistics, based on subgroups within the data. Let’s see how that works with our data.

We now know there are two species in our data. Let’s imagine we wanted to know the average weight for each species.

We can use the summarise() function to, well, summarise data. The first bit indicates the name of the new column that will contain the summarised values. The part after it determines what goes into this column.

Here we want the average weight, so we use mean(weight) to calculate this. Let’s store this in a column called avg_weight.

finches %>% 
  group_by(species) %>% 
  summarise(avg_weight = mean(weight))

# A tibble: 2 × 2
  species     avg_weight
  <chr>            <dbl>
1 G. fortis         15.8
2 G. scandens       19.5

This gives us a table where we have the average weight for each species. We can simply expand this for any other variables, for example:

# calculate mean, median, minimum and maximum weight per group
finches %>% 
  group_by(species) %>% 
  summarise(avg_weight = mean(weight),
            median_weight = median(weight),
            min_weight = min(weight),
            max_weight = max(weight))

# A tibble: 2 × 5
  species     avg_weight median_weight min_weight max_weight
  <chr>            <dbl>         <dbl>      <dbl>      <dbl>
1 G. fortis         15.8          15.5       11.6       19.9
2 G. scandens       19.5          19         15.4       24.4

8.6 Reshaping data

When you’re analysing your data, you’ll often find that you will need to structure your data in different ways, for different purposes.

Ideally, you always have the same starting point where:

Each column contains a single variable (something you’re measuring).
Each row is a single observation (all the measurements belonging to a single unit/person/plant etc).

Even though you might still need to have your data in a different shape, having it like this as a starting point means you can always rework your data.

Let’s illustrate this with the following example:

# A tibble: 6 × 3
  species     group             n
  <chr>       <chr>         <int>
1 G. fortis   early_blunt      30
2 G. fortis   late_blunt       30
3 G. fortis   late_pointed     29
4 G. scandens early_pointed    31
5 G. scandens late_blunt       30
6 G. scandens late_pointed     30

Here we have count data (number of observations) for each species and group. It’s quite a list and you can imagine that if you had many more species then it would become tricky to interpret. So, instead we’re going to reshape the this table and have a column for each unique group and a row for each species.

We can obtain the data set above by using the count() function. Here we are counting by two variables: species and group.

If we want to reshape the data, we can use the pivot_* functions. There are two main ones:

pivot_longer() creates a ‘long’ format data set; here each observation is a single row and data is repeated in the first column.
pivot_wider() creates a ‘wide’ format data set; here data is not repeated in the first column.

So, here we are using the pivot_wider() function. We need to tell it where the new column names are going to come from (names_from =). We also need to specify where the values are coming from that are going to be used to populate the new table (values_from =):

finches_wide <- finches %>% 
  count(species, group) %>% 
  pivot_wider(names_from = group, values_from = n)

finches_wide

# A tibble: 2 × 5
  species     early_blunt late_blunt late_pointed early_pointed
  <chr>             <int>      <int>        <int>         <int>
1 G. fortis            30         30           29            NA
2 G. scandens          NA         30           30            31

This gives us a ‘wide’ table, where the original data are split by the type of group. We have 4 distinct groups, so we end up with one column for each group plus the original one for species.

Long or wide?

Deciding which format to use can sometimes feel a bit tricky. Relating it to plotting can be helpful. Ask yourself the question: “what is going on the x and y axis?”. Each variable that you want to plot on either the x or y axis needs to be in its own column.

8.7 Exporting data

It can be useful to save data sets you create throughout your analysis.

We can do this using the write_csv() function. This will write a table to a .csv file (comma-separated values). The first part tells it which data set we’re saving. We’ll use the finches_wide as an example. The file = argument specifies where the file needs to be stored. Here, we are saving it in the data folder, under the name finches_wide.csv.

Note: the filename needs to be in quotes and needs to have a file extension.

write_csv(finches_wide, file = "data/finches_wide.csv")

8.8 Summary

Key points

A 3-step process (split - apply - combine) allows you to apply transformations on subgroups of your data.
The result can be combined in a single table.
We reshape our data based on our type of analysis.
Organise your data so that each variable has its own column and each observation is a row.