6  Working with data

Learning outcomes
  • Be able to import tabular data
  • Perform basic operations on data

6.1 Libraries and functions

6.1.1 Libraries

library(tidyverse)

6.1.2 Functions

Functions below are mostly shown in the following way:

package_name::name_of_function()

The reason why we’re doing this is two-fold:

  1. To make it explicit that functions are often packaged together into ‘umbrella’ packages. Tidyverse is one of those - it contains many packages such as tidyr, ggplot2, readr. This way it’s clear which package each particular function is coming from.
  2. Sometimes the same function name is used across different packages. We’ll see that later, where there is a filter() function in both the stats and dplyr packages. Throughout the course the correct one should be loaded automatically, but this way you can always check!
# read in a .csv (comma-separated values) file
readr::read_csv()

# show the first 6 rows of a table
head()

# select columns in a table
dplyr::select()

# filter rows in a table
dplyr::filter()

6.2 Purpose and aim

In this section we’re covering the basics of reading in using tabular data.

6.3 Darwin’s finches

We’ll look at some data that come from an analysis of gene flow across two finch species (Lamichhaney et al. 2020).

The data focus on two species, Geospiza fortis and G. scandens. The measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as early (pre-1983) and late (1983 onwards).

6.4 Reading in data

There are several functions to read data into R, we’re going to use one from the readr package, which is part of the tidyverse. As such, we first need to load the package into R’s memory, by using the library() function:

library(tidyverse)

This command has to be run every time you start a new R session. Typically you want to include the library() calls at the top of your script, so that a user knows which packages need to be installed to run the analysis.

Our data is provided in CSV format (comma separated values). This format is a regular text file, where each value (or column of the table) is separated by a comma. To read such a file, we use the read_csv() function, which needs at least one input: the path of the file we want to read. It is also good practice to explicitly define how missing data is encoded in the file with the na option. In our case, missing data are encoded as an empty string (imagine this as an empty cell in a spreadsheet).

Here’s the command:

finches <- read_csv("data/finches.csv")
Rows: 180 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species, group
dbl (9): weight, wing, tarsus, blength, bdepth, bwidth, pc1_body, pc1_beak, ...
lgl (1): is_early

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We see a lot of output there, but this is not an error! It’s a message that read_csv() prints to inform us of what type of data it thinks each column of the data set is. We’ll discuss this in a while.

It’s always useful to have a glimpse at the first few rows of your data set, to see how it is structured. We can do that with the head() function.

head(finches)
# A tibble: 6 × 12
  species   group    weight  wing tarsus blength bdepth bwidth pc1_body pc1_beak
  <chr>     <chr>     <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 G. fortis early_b…   15.8  67.1   19.6    10.3   8.95   8.32    0.382   -0.431
2 G. fortis early_b…   15.2  66     18.3    10.4   8.7    8.4    -1.06    -0.452
3 G. fortis early_b…   18.0  68     18.9    11.2   9.6    8.83    0.839    0.955
4 G. fortis early_b…   18.5  70.3   19.7    11     9.7    8.73    2.16     0.824
5 G. fortis early_b…   15.7  69     18.9    10.9   9.8    9       0.332    1.08 
6 G. fortis early_b…   17.8  70.1   19.2    12.7  10.9    9.79    1.50     3.55 
# ℹ 2 more variables: pc2_beak <dbl>, is_early <lgl>

6.4.1 The data.frame object

A data.frame is the basic type of object that stores tabular data. The readr package reads data in an “extended” version of a data frame that it calls tibble (tbl for short). The details of their differences are not very important unless you are a programmer, but tibbles offer some user conveniences such as a better printing method. For the rest of the course we’ll refer to “data frames” and “tibbles” interchangeably.

6.5 Subsetting data

We can subset the data in our finches table by column or row. The tidyverse package has a series of useful functions that allow you to do this.

6.5.1 Subsetting by column

We can use the select() function to select certain columns, for example if we just wanted the country and year column. The first argument we give to the function is the data set, followed by the name of the columns we want:

select(finches, group, wing)
# A tibble: 180 × 2
   group        wing
   <chr>       <dbl>
 1 early_blunt  67.1
 2 early_blunt  66  
 3 early_blunt  68  
 4 early_blunt  70.3
 5 early_blunt  69  
 6 early_blunt  70.1
 7 early_blunt  69  
 8 early_blunt  68.5
 9 early_blunt  66.3
10 early_blunt  69  
# ℹ 170 more rows

6.5.2 Subsetting by row

Now let’s say we wanted to only keep certain observations - which are organised in rows. Here we can use the filter() function. For example, if we only wanted the data for the United Kingdom:

filter(finches, species == "G. fortis")
# A tibble: 89 × 12
   species   group   weight  wing tarsus blength bdepth bwidth pc1_body pc1_beak
   <chr>     <chr>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
 1 G. fortis early_…   15.8  67.1   19.6   10.3    8.95   8.32    0.382   -0.431
 2 G. fortis early_…   15.2  66     18.3   10.4    8.7    8.4    -1.06    -0.452
 3 G. fortis early_…   18.0  68     18.9   11.2    9.6    8.83    0.839    0.955
 4 G. fortis early_…   18.5  70.3   19.7   11      9.7    8.73    2.16     0.824
 5 G. fortis early_…   15.7  69     18.9   10.9    9.8    9       0.332    1.08 
 6 G. fortis early_…   17.8  70.1   19.2   12.7   10.9    9.79    1.50     3.55 
 7 G. fortis early_…   17.2  69     20.3   11.9    9.8    9       1.86     1.67 
 8 G. fortis early_…   17.2  68.5   19.2   11.4    9.8    8.6     0.879    1.00 
 9 G. fortis early_…   16.5  66.3   18.7    9.04   8.42   7.98   -0.227   -1.81 
10 G. fortis early_…   19.4  69     18.7   11.3    9.6    8.8     1.39     1.00 
# ℹ 79 more rows
# ℹ 2 more variables: pc2_beak <dbl>, is_early <lgl>

Here we’ve taken the finches data set and we asked R to give us the rows where species == "G. fortis" is TRUE. It goes through all the rows, in this case checking the species column. If the statement species == "G. fortis" is TRUE, it returns the row. Otherwise it doesn’t.

We could also use a different conditional statement, for example returning all the rows where the weight is larger than 18 grammes:

filter(finches, weight > 18)
# A tibble: 73 × 12
   species   group   weight  wing tarsus blength bdepth bwidth pc1_body pc1_beak
   <chr>     <chr>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
 1 G. fortis early_…   18.5  70.3   19.7    11      9.7   8.73    2.16     0.824
 2 G. fortis early_…   19.4  69     18.7    11.3    9.6   8.8     1.39     1.00 
 3 G. fortis early_…   18.0  68.2   18.8    12.3   10.6   9.5     0.826    2.88 
 4 G. fortis early_…   19.9  67     20      11     10     8.8     2.01     1.07 
 5 G. fortis early_…   18.4  70.9   20.1    11.4   10.8  10.1     2.57     3.02 
 6 G. fortis early_…   18.2  68     18.4    10.9    9.7   9.03    0.487    1.03 
 7 G. fortis early_…   18.4  70     19.7    11.8   10.3   9.4     2.06     2.29 
 8 G. fortis early_…   19.8  75.6   19.2    12.8    9.3   8.53    3.54     1.45 
 9 G. fortis early_…   18.8  71     19.2    11.8    9.9   8.5     2.08     1.20 
10 G. fortis late_b…   19.0  70     19.8    12     11.2   9.9     2.32     3.40 
# ℹ 63 more rows
# ℹ 2 more variables: pc2_beak <dbl>, is_early <lgl>

6.6 Chaining commands

Sometimes we need to perform many different operations before we have the right data in the correct format that we need. For example, we might want to filter for certain values and then only keep certain columns. We could perform these operations one by one and save the output of each into an object that we then use for the next operation.

But this is not very efficient. So it can be useful to chain certain operations together, performing them one by one.

In R we can do this with the pipe. We’ll be using the pipe operator for tidyverse (%>%). The pipe always starts with data, which it then “pipes through” to a function.

Let’s look at an example, recreating the filter() operation we did earlier, but this time with a pipe:

finches %>% 
  filter(weight > 18)
# A tibble: 73 × 12
   species   group   weight  wing tarsus blength bdepth bwidth pc1_body pc1_beak
   <chr>     <chr>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
 1 G. fortis early_…   18.5  70.3   19.7    11      9.7   8.73    2.16     0.824
 2 G. fortis early_…   19.4  69     18.7    11.3    9.6   8.8     1.39     1.00 
 3 G. fortis early_…   18.0  68.2   18.8    12.3   10.6   9.5     0.826    2.88 
 4 G. fortis early_…   19.9  67     20      11     10     8.8     2.01     1.07 
 5 G. fortis early_…   18.4  70.9   20.1    11.4   10.8  10.1     2.57     3.02 
 6 G. fortis early_…   18.2  68     18.4    10.9    9.7   9.03    0.487    1.03 
 7 G. fortis early_…   18.4  70     19.7    11.8   10.3   9.4     2.06     2.29 
 8 G. fortis early_…   19.8  75.6   19.2    12.8    9.3   8.53    3.54     1.45 
 9 G. fortis early_…   18.8  71     19.2    11.8    9.9   8.5     2.08     1.20 
10 G. fortis late_b…   19.0  70     19.8    12     11.2   9.9     2.32     3.40 
# ℹ 63 more rows
# ℹ 2 more variables: pc2_beak <dbl>, is_early <lgl>

What it’s done is taken the finches data set and then sent this to the filter() function. The function doesn’t need the data set specified explicitly, because it knows it is coming from the pipe.

We can combine this with other functions:

finches %>% 
  filter(weight > 18) %>% 
  select(species, weight)
# A tibble: 73 × 2
   species   weight
   <chr>      <dbl>
 1 G. fortis   18.5
 2 G. fortis   19.4
 3 G. fortis   18.0
 4 G. fortis   19.9
 5 G. fortis   18.4
 6 G. fortis   18.2
 7 G. fortis   18.4
 8 G. fortis   19.8
 9 G. fortis   18.8
10 G. fortis   19.0
# ℹ 63 more rows

Here we’ve performed the filtering, and then selected the species and weight columns.

Chaining operations can be a very powerful tool, since it allows you to break down a complex operation into smaller steps. This often makes the analysis a lot less daunting!

6.7 Summary

Key points
  • Tabular data are an excellent format for programming languages
  • Having variables in columns and observations in rows makes analysis easier
  • We can subset data across columns and rows