Data exploration workflow

When you are working on a project that requires data analysis, you will normally need to perform the following steps:

More information on this workflow can be found in the R for Data Science book. To better understand the workflow in the illustration above, let us go over each stage to see what it entails:

  1. The first step in working with data is to import it into R. This connects the external file/database to your project in R.
  2. Cleaning or tidying the data will follow, which involves making sure that the data is consistent and that each row in the dataset is an observation and each column is a variable.
    e.g. In the surveys data frame the month column specifies months as an integer from 1 to 12. The dataset would be inconsistent if a record specified the month by name, e.g. September rather than 9. A month of 0, or any other number outside the range 1 to 12, would also make the dataset inconsistent. Another common problem is capitalisation: the same word in the same column can be written with or without capitals, e.g. Bird and bird in the same taxa column is inconsistent data. During the tidying stage it is important to make the dataset as consistent as possible so that you can focus on the questions you are trying to answer in your analysis (see the sketch after this list for a few quick consistency checks).
  3. Once the dataset is tidy, we move to the transformation stage. To transform your data you need to plan in advance what analyses you would like to perform on the dataset and what plots you would like to create. This way you can decide which variables/columns you will use, which additional variables you will need to create, and which variables you will not use, so that you keep only the columns relevant to your analyses. By the end of the transformation process you will have a dataset that is focused on your analyses, and you can move on to the main exploratory mechanisms of this workflow: visualisation and modelling. These two stages complement each other, and when exploring your data you normally repeat them several times.
  4. Visualising data is a powerful way to explore your data. Furthermore, it helps you see whether there are any patterns in the data.
  5. Modelling the data involves applying statistics or other mathematical or computational models on your data to explore if there are correlations or patterns in the dataset to help you answer the scientific question you are trying to solve.
  6. The last step in the data exploration workflow is to communicate your results. This is very important: for a project to be successful, you need to be able to communicate your results to others.
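
As a small illustration of the consistency checks described in step 2 above, the sketch below assumes the surveys data frame is already loaded (we will import it later in this lesson) and looks for out-of-range months and inconsistent capitalisation:

# check that month only contains integers from 1 to 12
unique(surveys$month)
# reveal capitalisation differences, e.g. "Bird" vs "bird"
unique(surveys$taxa)
# make capitalisation consistent
surveys$taxa <- tolower(surveys$taxa)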

All these stages in the data exploration workflow can be achieved by programming in R. In this course we will not look into the Model and Communicate stages of the workflow. You will be able to learn more about these in the following courses:

In the next sections we will be looking at the import, tidy, transform and visualise stages of the data exploration workflow by using one of the most popular data science packages in R: Tidyverse.

Packages

So far we have learnt how to use R with its built-in functionality, which we will refer to as R base. There is a way, however, to extend this functionality by using external functions through packages. Packages in R are essentially collections of additional functions that extend what you can do. The functions we have been using so far, like str() or head(), come built into R; packages give you access to many more. Below is an illustration of the concept of a package.

Tidyverse

The package that we will be using in this course is called tidyverse. It is an “umbrella-package” that contains several packages useful for data manipulation and visualisation which work well together, such as readr, tidyr, dplyr, ggplot2, tibble, etc.

Tidyverse is a recent package (launched in 2016) when compared to R base (stable version released in 2000), so you will still come across R resources that do not use tidyverse. However, since its release, tidyverse has been growing in popularity throughout the R programming community and it is now very popular in data science, as it was designed to help data scientists perform their tasks more efficiently.

Some of the main advantages of tidyverse over R base are:

  1. Easier to read

    Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.

    e.g. Get only the rows that have species as albigula surveyed in the year 1977 (see the sketch after this list).
  2. Faster

    Using tidyverse can be up to 10x faster when compared to the corresponding R base functions.

  3. Strings are not converted to factor

    We have seen in our previous lesson that when building or importing a data frame, columns that contain characters (i.e., text) are coerced (converted) into the factor data type. We had to set stringsAsFactors to FALSE to prevent this hidden argument from converting our data. With tidyverse, this does not happen.
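
To illustrate the readability point above, here is a hedged sketch comparing R base bracket subsetting with the tidyverse equivalent, assuming the tidyverse package is loaded and the surveys data frame is available:

# R base: bracket subsetting
surveys[surveys$species == "albigula" & surveys$year == 1977, ]

# tidyverse (dplyr): the same selection reads more like a sentence
surveys %>%
  filter(species == "albigula", year == 1977)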


Installing and loading packages

Before using a package for the first time you will need to install it on your machine; after that, you only need to load it in every R session in which you use it. To install a package in R on your machine you use the install.packages function. To install the tidyverse package type the following straight into the console:
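
install.packages("tidyverse")   # install once per machine; note the quotes around the package name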

It is better to install packages straight from the console than from your script, as there is no need to re-install packages every time you run the script.

Then, to load the package type:
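
library(tidyverse)   # load the package in each R session where you need it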


Importing/Reading data from files

After loading the tidyverse package in R we are now able to use its functions. We will start working through the data exploration workflow by first importing data into R. As before, to import the data into R we will now use the read_csv function, from the tidyverse package readr, instead of read.csv from R base. The readr package contains functions to read tabular data into R. Let us read in the same file we used before, using tidyverse this time:
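
A minimal sketch, assuming the surveys file is saved at a path like "data_raw/portal_data_joined.csv" (substitute the path to your own copy of the file):

surveys <- read_csv("data_raw/portal_data_joined.csv")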


Tibble

After importing the data into R we need to check that it has been loaded correctly.
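
For example, looking at the first six rows of surveys produces the output below:

head(surveys)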

#> # A tibble: 6 x 13
#>   record_id month   day  year plot_id species_id sex   hindfoot_length
#>       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
#> 1         1     7    16  1977       2 NL         M                  32
#> 2        72     8    19  1977       2 NL         M                  31
#> 3       224     9    13  1977       2 NL         <NA>               NA
#> 4       266    10    16  1977       2 NL         <NA>               NA
#> 5       349    11    12  1977       2 NL         <NA>               NA
#> 6       363    11    12  1977       2 NL         <NA>               NA
#> # ... with 5 more variables: weight <dbl>, genus <chr>, species <chr>,
#> #   taxa <chr>, plot_type <chr>

Notice that the first line of the output shows the data structure used to store the imported data: tibble. tibble is the main data structure used in tidyverse. You can think of a tibble as the tidyverse version of a data.frame. One immediate difference from a data.frame is that a tibble displays the data type of each column under its name. Furthermore, as mentioned before, columns of class character are never converted into factor. Another difference is that printing a tibble does not print the whole dataset, but just the first 10 rows and only the columns that fit on the screen (similar to head, but with 10 rows instead of 6). If you would like to print more than the first 10 rows, use the print function.
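
The output below, for example, can be produced by asking print for the first 15 rows (n is the argument of the tibble print method that controls how many rows are shown):

print(surveys, n = 15)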

#> # A tibble: 34,786 x 13
#>    record_id month   day  year plot_id species_id sex   hindfoot_length
#>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
#>  1         1     7    16  1977       2 NL         M                  32
#>  2        72     8    19  1977       2 NL         M                  31
#>  3       224     9    13  1977       2 NL         <NA>               NA
#>  4       266    10    16  1977       2 NL         <NA>               NA
#>  5       349    11    12  1977       2 NL         <NA>               NA
#>  6       363    11    12  1977       2 NL         <NA>               NA
#>  7       435    12    10  1977       2 NL         <NA>               NA
#>  8       506     1     8  1978       2 NL         <NA>               NA
#>  9       588     2    18  1978       2 NL         M                  NA
#> 10       661     3    11  1978       2 NL         <NA>               NA
#> 11       748     4     8  1978       2 NL         <NA>               NA
#> 12       845     5     6  1978       2 NL         M                  32
#> 13       990     6     9  1978       2 NL         M                  NA
#> 14      1164     8     5  1978       2 NL         M                  34
#> 15      1261     9     4  1978       2 NL         M                  32
#> # ... with 3.477e+04 more rows, and 5 more variables: weight <dbl>,
#> #   genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

Since printing a tibble already gives you information about the data structure, the data types of each column and the size of the dataset, the str function is not as useful as it was when using a data.frame.
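
Calling str on the tibble produces:

str(surveys)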

#> Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 34786 obs. of  13 variables:
#>  $ record_id      : num  1 72 224 266 349 363 435 506 588 661 ...
#>  $ month          : num  7 8 9 10 11 11 12 1 2 3 ...
#>  $ day            : num  16 19 13 16 12 12 10 8 18 11 ...
#>  $ year           : num  1977 1977 1977 1977 1977 ...
#>  $ plot_id        : num  2 2 2 2 2 2 2 2 2 2 ...
#>  $ species_id     : chr  "NL" "NL" "NL" "NL" ...
#>  $ sex            : chr  "M" "M" NA NA ...
#>  $ hindfoot_length: num  32 31 NA NA NA NA NA NA NA NA ...
#>  $ weight         : num  NA NA NA NA NA NA NA NA 218 NA ...
#>  $ genus          : chr  "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
#>  $ species        : chr  "albigula" "albigula" "albigula" "albigula" ...
#>  $ taxa           : chr  "Rodent" "Rodent" "Rodent" "Rodent" ...
#>  $ plot_type      : chr  "Control" "Control" "Control" "Control" ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   record_id = col_double(),
#>   ..   month = col_double(),
#>   ..   day = col_double(),
#>   ..   year = col_double(),
#>   ..   plot_id = col_double(),
#>   ..   species_id = col_character(),
#>   ..   sex = col_character(),
#>   ..   hindfoot_length = col_double(),
#>   ..   weight = col_double(),
#>   ..   genus = col_character(),
#>   ..   species = col_character(),
#>   ..   taxa = col_character(),
#>   ..   plot_type = col_character()
#>   .. )

Notice that rather than specifying tibble as the data structure of surveys, the first line of str’s output now specifies ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and ‘data.frame’, which can be a bit confusing. These are the classes tibble inherits from, which in simple terms means that a tibble is a data.frame with a few modifications. Therefore, most of the functions that were used with data.frame can also be used with tibble.


Visualising data in R

After inspecting the surveys dataset in R, the data looks tidy and we are happy with its format, so let us start to understand our data better by visualising it. ggplot2 is the visualisation package in tidyverse and we will be using it to create some plots. ggplot2 is a very popular plotting package, mainly because of the simple way it creates plots from tabular data.

To create a plot, we will use the following basic template.

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

As you can see, there are 3 main elements that you need to create a plot: the data, the aesthetic mappings and a geom function.

The ggplot function takes 2 arguments:

  • data: This is the data frame to attach to the plot. The data frame must contain the variables to plot as columns and the rows must contain the observations that you need to plot.
  • mapping: Aesthetic mappings describe how variables in the data are mapped to visual properties of the plot.

Using the ggplot function on its own will not plot anything. We need to add a geom function as a layer. Layers are added to plots by using +, and each layer is drawn on top of any previous layers.

  • geom_function: This specifies the type of plot you would like to create. The greatest advantage of this is that you can easily change the plot type by changing only the geom function and keeping everything else the same. You can see the whole list of available plot types here.

Let us practice this on our surveys dataset. We would like to create a scatter plot with weight on the x-axis and hindfoot_length on the y-axis.
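
Filling in the template above with our dataset and variables, a minimal version of this plot is:

ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point()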