When you are working on a project that requires data analysis, you will normally need to perform the following steps:
More information on this workflow can be found in the R for Data Science book. To understand better the workflow in the illustration above, let us go over each stage to see what each step entails:
All these stages in the data exploration workflow can be achieved by programming in R. In this course we will not look into the Model and Communicate stages of the workflow in this course. You will be able to learn more about these in the following courses:
In the next sections we will be looking at the import, tidy, transform and visualise stages of the data exploration workflow by using one of the most popular packages in data science in R; Tidyverse.
.
So far we have learnt how to use R with R’s in-built functionality that we will refer to as R base. There is a way, however, to extend this functionality by using external functions through packages. Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str()
or head()
, come built into R; packages give you access to more of them. Below is an illustration of the concept of a package.
The package that we will be using in this course is called tidyverse. It is an “umbrella-package” that contains several packages useful for data manipulation and visualisation which work well together such as readr
, tidyr
, dplyr
, ggplot2
, tibble
, etc…
Tidyverse is a recent package (launched in 2016) when compared to R base (stable version in 2000), thus you will still come across R resources that do not use tidyverse. However, since its’ release, tidyverse has been increasing in popularity throughout the R programming community and it is now very popular in Data Science as it was designed with the aim to help Data Scientists perform their tasks more efficiently.
Some of the main advantages of tidyverse over R base are:
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
e.g. Get only the rows that have species as albigula surveyed in the year 1977.# R Base
surveys[surveys$species=="albigula" &
surveys$year==1977, ]
# tidyverse
filter(surveys, species=="albigula" & year==1977)
Using tidyverse is up to 10x faster1 when compared to the corresponding base R base functions.
factor
We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor
data type. We had to set stringsAsFactors
to FALSE
to avoid this hidden argument to convert our data type. With tidyverse, this does not happen.
Before using a package for the first time you will need to install it on your machine, and then you should import it in every subsequent R session when you need it. To install a package in R on your machine you need to use the install.packages
function. To install the tidyverse
package type the following straight into the console:
It is better to install packages straight from the console then from your script as there’s no need to re-install packages every time you run the script.
Then, to load the package type:
After loading the tidyverse
package in R we are now able to use its’ functions. We will start working through the data exploration workflow by first importing data into R. To import the data into R as before, we will now use the read_csv
function, from the tidyverse package readr
, instead of using read.csv
from R base
. The readr
package contains functions to read tabular data into R. Let us read in the same file we used before using tidyverse
this time:
After importing data into R we need to check if the data has been loaded into R correctly.
#> # A tibble: 6 x 13
#> record_id month day year plot_id species_id sex hindfoot_length
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 7 16 1977 2 NL M 32
#> 2 72 8 19 1977 2 NL M 31
#> 3 224 9 13 1977 2 NL <NA> NA
#> 4 266 10 16 1977 2 NL <NA> NA
#> 5 349 11 12 1977 2 NL <NA> NA
#> 6 363 11 12 1977 2 NL <NA> NA
#> # ... with 5 more variables: weight <dbl>, genus <chr>, species <chr>,
#> # taxa <chr>, plot_type <chr>
Notice that the first line of the output shows the data structure used to store the data imported into: tibble
. tibble
is the main data structure used in tidyverse
. You can look at tibble
as the data.frame
version of tidyverse
. The first immediate difference from a data.frame
is that a tibble
displays the data type of each column under its name and it only prints as many columns as fit on one screen. Furthermore as mentioned before, the columns of class character
are never converted into factor
. Another difference is that printing a tibble
will not print the whole dataset, but just the first 10 rows and only the columns that fit the screen (same as head
but with 10 rows instead of 6). If you would like to print more than the first 10 rows use the print
function.
#> # A tibble: 34,786 x 13
#> record_id month day year plot_id species_id sex hindfoot_length
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 7 16 1977 2 NL M 32
#> 2 72 8 19 1977 2 NL M 31
#> 3 224 9 13 1977 2 NL <NA> NA
#> 4 266 10 16 1977 2 NL <NA> NA
#> 5 349 11 12 1977 2 NL <NA> NA
#> 6 363 11 12 1977 2 NL <NA> NA
#> 7 435 12 10 1977 2 NL <NA> NA
#> 8 506 1 8 1978 2 NL <NA> NA
#> 9 588 2 18 1978 2 NL M NA
#> 10 661 3 11 1978 2 NL <NA> NA
#> 11 748 4 8 1978 2 NL <NA> NA
#> 12 845 5 6 1978 2 NL M 32
#> 13 990 6 9 1978 2 NL M NA
#> 14 1164 8 5 1978 2 NL M 34
#> 15 1261 9 4 1978 2 NL M 32
#> # ... with 3.477e+04 more rows, and 5 more variables: weight <dbl>,
#> # genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Since printing tibble
already gives you information about the data structure, the data types of each column and the size of the dataset, the str
function is not as much useful as it was when using data.frame
.
#> Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 34786 obs. of 13 variables:
#> $ record_id : num 1 72 224 266 349 363 435 506 588 661 ...
#> $ month : num 7 8 9 10 11 11 12 1 2 3 ...
#> $ day : num 16 19 13 16 12 12 10 8 18 11 ...
#> $ year : num 1977 1977 1977 1977 1977 ...
#> $ plot_id : num 2 2 2 2 2 2 2 2 2 2 ...
#> $ species_id : chr "NL" "NL" "NL" "NL" ...
#> $ sex : chr "M" "M" NA NA ...
#> $ hindfoot_length: num 32 31 NA NA NA NA NA NA NA NA ...
#> $ weight : num NA NA NA NA NA NA NA NA 218 NA ...
#> $ genus : chr "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
#> $ species : chr "albigula" "albigula" "albigula" "albigula" ...
#> $ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
#> $ plot_type : chr "Control" "Control" "Control" "Control" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. record_id = col_double(),
#> .. month = col_double(),
#> .. day = col_double(),
#> .. year = col_double(),
#> .. plot_id = col_double(),
#> .. species_id = col_character(),
#> .. sex = col_character(),
#> .. hindfoot_length = col_double(),
#> .. weight = col_double(),
#> .. genus = col_character(),
#> .. species = col_character(),
#> .. taxa = col_character(),
#> .. plot_type = col_character()
#> .. )
Notice that rather than specifing tibble
as the data structure of surveys, the first line of str
’s output now specifies ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame'
which can be a bit confusing. These are the classes tibble
inherts from which in simple terms means that tibble
is a data.frame
with a few modifications. Therefore, most of the functions that were used with data.frame
can also be used with tibble
.
After inspecting the surveys
dataset in R, the data looks tidy and we are happy with its format, so let us start understanding better our data by visualising it. ggplot2
is the visualisation package in tidyverse
and we will be using this to create some plots. ggplot2
is a very popular package used for plotting mainly due to its simple way to create plots from tabular data.
To create a plot, we will use the following basic template.
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
As you can see there are 3 main elements that you need to create a plot:
The ggplot
function takes 2 arguments:
Using the ggplot
function on its own will not plot anything. We need to add a geom_function as a layer. Layers are added to plots by using +
. They are added on top of the other previous layers that might be present.
Let us practice this on our surveys
dataset. We would like to create a scatter plot with weight
on the x-axis, hindfoot_length
on the y-axis
As you can see if you just specify the ggplot
function with the data and aesthetic mappings, it will just create an empty plot. Let us now add the geom_function for the scatter plot (geom_point
) as a layer to the plot:
You can customise some of the visualisations of the plot to extract more information from it. For instance, we can add transparency (alpha
) to avoid overplotting:
You can find a list of aesthetics for each type of plot in the ggplot2 cheat sheet.
We can also add colors for all the points:
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
If we would like to try other type of plots on the data, the best thing is to save the ggplot into a variable as below:
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length))
# Draw a scatter plot
surveys_plot +
geom_point()
Now draw a geom_smooth
plot. This plot is good when you need to see if there is any pattern between the two variables being plotted that you would not normally see in a scatter plot due to overplotting.
Rather than seeing each plot separately, sometimes plotting multiple plots on top of each other is a better way. You can add multiple plots as layers on top of each other as follows:
Note
- Anything you put in the
ggplot()
function can be seen by any geom layers that you add (i.e., these are universal plot settings).- You can also specify mappings for a given geom independently of the mappings defined globally in the
ggplot()
function.- The
+
sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the+
sign is added at the beginning of the line containing the new layer, >ggplot2
will not add the new layer and will return an error message.
Challenge
Scatter plots can be useful exploratory tools for small datasets. For data sets with large numbers of observations, such as the
surveys
data set, overplotting of points can be a limitation of scatter plots. We have already seen how we can visualise data better when we have overplotting with thegeom_smooth
plot. Another way for handling overplotting is to display the density of the data through contours. As this challenge’s task create a script calledplot_density2d.R
which loads the file data/portal_data_joined.csv into the variablesurveys
. It then uses this dataset to plot theweight
on the x-axis andhindfoot_length
on the y-axis in ageom_density2d
plot.
To save a plot to file use the ggsave
function. If you look at the documentation of ggsave you can see the different arguments the ggsave function takes. Let us save the plot present in the surveys_plot variable into a file called plot_weight_hindfoot_density2d.png
into a folder in this project called img_output
.
#save plot that you would like to save into a variable
out_plot <- surveys_plot + geom_density2d()
#save plot to file
ggsave(filename="img_output/plot_weight_hindfoot_density2d.png", plot=out_plot)
Note
- You do not need to save the plot into a variable before saving it to file. If you do not specify the
plot
argument of theggsave
function,ggsave
will take the last plot that you plotted and save it into thefilename
specified.
- You can create folders straight from RStudio from the right bottom pane in the Files section > New Folder icon.
In most of the cases you will need to change the format of your dataset because it will not be in the right format that you will need to plot or analyse the data. tidyverse
has a package called dplyr
which contains functions that help you to select columns/rows, sort, combine and perform other data types of data transformations. In the next sections we will look at different ways to transform our dataset. Now that we already know the basics of visualing data with ggplot
we will also learn how to visualise other plots with the transformed dataset as we go along.
To learn more about
dplyr
please look at the following resources:If you find these resources difficult to understand, return to these after completing the course.
To select columns of a data frame or tibble, use theselect
function. The first argument is the data frame or tibble you are working on (in our example it is surveys
), and the subsequent arguments are the columns to keep.
#extract species_id, weight, hindfoot_lenth, year and sex columns from surveys dataset.
select(surveys, species_id, weight, hindfoot_length, year, sex)
To select all columns except certain ones, put a -
in front of the column to exclude it.
#select all columns of the surveys dataset apart from record_id and species_id columns.
select(surveys, -record_id, -species_id)
To remove rows from a data frame or tibble use the filter
function from the dplyr
package. The first argument is the data frame or tiblle to perform the filtering on and the next arguments are the conditions on which to keep the rows.
To choose rows based on a specific condition, use the filter
function as follows:
#keep only the observations of animals collected from 1995 onwards from the surveys dataset.
filter(surveys, year >= 1995)
You can filter on multiple conditions:
NA
(missing) valuesWhen we were plotting weight
against hindfoot_length
in the previous section, you must have noticed that we were getting a warning message:
#> Warning: Removed 4048 rows containing missing values (geom_point).
This is because some of the values in the weight
and hindfoot_length
are NA
. NA
is short for Not Available and essentially it means that there is no data for that particular index in the table. We also refer to this as missing data. ggplot
does not plot the observations that have missing data and outputs the warning above which shows the number of observations that have missing data in the dataset we are plotting. We can filter these rows before we plot them so that ggplot
will have all the values for the observations it is plotting and so no warning will be disaplyed.
The is.na
function returns TRUE
if the value passed to it is NA
. Applied to a vector or data frame it will return TRUE
or FALSE
for each index in the vector or data frame depending on whether the value at each index is missing or not. The !
symbol negates the result, so !is.na
can be interpreted as is not NA. See how this can be used in the code below:
#which values of the weight column are missing?
is.na(surveys$weight)
#which values of the weight column are not missing?
!is.na(surveys$weight)
Now let us apply the is.na
function in dplyr
’s filter
function to remove the rows that have weight
or hindfoot_length
as NA
from the surveys
tibble.
filter(surveys,
!is.na(weight) & #remove rows that have weight as NA
!is.na(hindfoot_length)) #remove rows that have hindfoot_length as NA
Another way to remove rows that have NA
values is by using the drop_na
function in the tidyr
package. The code above can be replaced by the following code which gives the same answer:
Using drop_na()
without specifying any columns will remove all the rows that have NA
in any of the columns.
Let us save the results of this in surveys_complete
variable which contains only the rows that have all the values in all the columns present and use this for the remaining of the course.
What if you want to select
and filter
at the same time? There are three ways to do this:
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
surveys2 <- select(surveys_complete, species_id, weight, hindfoot_length, year, sex)
surveys_recent <- filter(surveys2, year >= 1995)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e., one function inside of another), like this:
surveys_recent <- filter(
select(surveys_complete, species_id, weight, hindfoot_length, year, sex)
, year >= 1995)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, selecting, then filtering).
The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>%
and are made available via the magrittr
package, installed automatically with dplyr
.
surveys_complete %>%
select(species_id, weight, hindfoot_length, year, sex) %>%
filter(year >= 1995)
In the above code, we use the pipe to send the surveys_complete
dataset first through select
and then through filter
. Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the tibble surveys_complete
, then we select
ed columns species_id
, weight
, hindfoot_length
, year
and sex
. We then filter
ed the rows and only kept the ones that have year >= 1995
.
Since %>%
takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame/tibble as an argument to the select
and filter
functions any more. This is one of the biggest advantages of using pipes as it allows us to perform all the operations that we need to do with the need to create useless variables and store useless data. Furthermore, the code is more readable when using pipes. The dplyr
functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames/tibbles.
If we want to create a new object with the transformed data we can assign it a new name as below:
surveys_recent <- surveys_complete %>%
select(species_id, weight, hindfoot_length, year, sex) %>%
filter(year >= 1995)
surveys_recent
Challenge 1 - pipes
Subset the
surveys_complete
data to keep only thespecies_id
,weight
,hindfoot_length
,year
andsex
columns and the animals collected on and after 1995. Then plot a scatter plot ofweight
(x-axis) againsthindfoot_length
(y-axis) using this transformed dataset. Do all the above using pipes, without creating any variables.
Challenge 2 - plotting subset with different colour
Plot all the animals in the
surveys_complete
dataset asweight
(x-axis) againsthindfoot_length
(y-axis). Use the dataset created above which contains only the animals that were collected on and after 1995 and highlight these points in red in the plot.Answer
survey_recent <- surveys_complete %>% select(species_id, weight, hindfoot_length, year, sex) %>% filter(year >= 1995) ggplot(mapping=aes(x=weight, y=hindfoot_length)) + geom_point(data=surveys_complete) + geom_point(data=survey_recent, aes(color="red")) + #remove lengend theme(legend.position = "none")
Note: In the example above we did not specify the
data
argument in theggplot
function because the arguments in theggplot
function should be common throughout the whole plot. In this case we specified the respective data as separate layers in 2 differentgeom_point
plots; the first plot is the data with all the animals insurveys_complete
(points in black), the second plot is layered on top of the first plot which is a subset of thesurveys_complete
dataset (points in red).
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use the mutate
function.
To create a new column of weight in kg:
You can also create a second new column based on the first new column within the same call of mutate()
:
There are other ways on how to create new columns. Refer to the dplyr cheat sheet Make New Variables section.
Challenge - the boxplot
Using the
surveys_complete
dataset, create a boxplot for each year on the x-axis and weight in kg on the y-axis. Use the subset dataset that was transformed in the previous challenge, i.e. the one that has the recent observations (from 1995 onwards). As before try to do all the operations using pipes, without creating variables.
Answer
Your first attempt would be to use the code we used before to subset the
surveys_complete
dataset, but also adding themutate
function to add a newweight_kg
column. Further more you will need to also add theggplot
function to draw a boxplot:surveys_complete %>% #add weight_kg column mutate(weight_kg = weight / 1000) %>% #select columns select(species_id, weight_kg, hindfoot_length, year, sex) %>% #filter rows filter(year >= 1995) %>% #plot transformed data ggplot(mapping=aes(x=year, y=weight_kg)) + geom_boxplot()
#> Warning: Continuous x aesthetic -- did you forget aes(group=...)?
You might notice that this code produces a warning message about the continuous data type of the x-axis and also creates a box plot with one single box and whisker averaged over all the years on the x-axis, rather than one for each year in the dataset as requested. As shown in the
ggplot2
cheat sheet, box plots display a discrete variable on the x-axis and a continuous variable on the y-axis. At the momentyear
is treated as a continuous variable since its data type isdouble
(you can see this from the tibble). To fix this, we need to convert the data type of theyear
column to a categorical one (factor
) so that each year can be plotted as a separate box and whisker.surveys_complete %>% #add weight_kg column mutate(weight_kg = weight / 1000) %>% #select columns select(species_id, weight_kg, hindfoot_length, year, sex) %>% #filter rows filter(year >= 1995) %>% #convert the year column to a factor mutate(year=as.factor(year)) %>% #plot transformed data ggplot(mapping=aes(x=year, y=weight_kg)) + geom_boxplot()
Notice that since we need to keep only observations that have been observed from 1995 onwards, we need to convert the
year
column to afactor
after we filter the rows, as otherwise the row filtering operation will not be possible. This is because the newyear
column of typefactor
does not have any ordering in its categories so perfomingyear >= 1995
would not be possible.Note on the boxplot
As mentioned the boxplot is used to visualise the distribution of a continuous variable. In the example above we displayed the distribution of
weight
over the different years. The figure below illustrates the different components of the boxplot. The bottom line of the box represents the 25th quantile, middle line the 50th quantile (or median) and the top line of the box is the 75th quantile. The whiskers can be thought of as defining the boundary of the confidence intervals of the distribution. The dots outside these whiskers are the outliers.
To sort your data dplyr
provides function arrange
.
To sort your data in descending order you will need to use desc()
.
You can sort your dataset based on the values of multiple columns:
#sort weight in ascending order and hindfoot_length in descending order
surveys_complete %>%
arrange(weight, desc(hindfoot_length))
As you can see from the result returned, the animals with the smallest weight are at the top. When there is a tie, i.e., more than one animal has the same weight, the animals are sorted in descending order of hindfoot_length
. As you can see, the subset of animals with weight
of 4 have been sorted in descending order based on hindfoot_length
.
Creating summaries of your data would be a good way to start describing the variable you are working with. Summary statistics are a good example of how one can summarise data. We will not cover details about summary statistics in this course, but we will look at how we can summarise data in R. When working with continuous variables, one of the most popular summary statistic is the mean. If we try to caclulate the mean on weight
in the surveys_complete
dataset we get:
#> Error in mean(weight): object 'weight' not found
This is because in dplyr
you will need to use the summarise
function to be able to create summaries of your data.
The summarise
function is used when you want to reduce multiple values in your columns to a single value. If we try to calculate the mean using summarise()
this time:
#> # A tibble: 1 x 1
#> mean_weight
#> <dbl>
#> 1 41.8
As seen from the result, the mean
of the values present in the weight
column is 41.8.
count
Obtaining the frequency of your data is another common way of summarising data. Frequencies are normally calculated when working with discrete variables that have a finite number of values, such as categorical data. In our surveys_complete
dataset, let us obtain the frequecies of male and female animals present. We can do this by counting the number of “M” and “F” present in the dataset. To do this use the dplyr
function count
as follows:
#> # A tibble: 2 x 2
#> sex n
#> <chr> <int>
#> 1 F 14584
#> 2 M 16092
As you can see count
has grouped the categories present in the sex
column and returned the frequency of each category. If we wanted to count combination of factors, such as sex
and species
, we would specify the first and the second factor as the arguments of count()
:
#> # A tibble: 41 x 3
#> sex species n
#> <chr> <chr> <int>
#> 1 F albigula 606
#> 2 F baileyi 1617
#> 3 F eremicus 539
#> 4 F flavus 711
#> 5 F fulvescens 55
#> 6 F fulviventer 15
#> 7 F hispidus 91
#> 8 F leucogaster 436
#> 9 F leucopus 16
#> 10 F maniculatus 354
#> # ... with 31 more rows
Challenge
- How many animals were observed in each
plot_type
surveyed?
- What is the frequency of each species of each sex observed? Sort each species in descending order of freqency.
geom_line()
Now that we know how to obtain frequencies, let us create a time series plot with ggplot
. A time series plot displays values over time with the aim to show how data changes over time. Let us plot years on the x-axis and the frequencies of the yearly observations per genus on the y-axis.
First we need to get the frequencies of the yearly observations per genus:
yearly_counts
now contains the following results:
#> # A tibble: 199 x 3
#> year genus n
#> <dbl> <chr> <int>
#> 1 1977 Chaetodipus 3
#> 2 1977 Dipodomys 222
#> 3 1977 Onychomys 3
#> 4 1977 Perognathus 22
#> 5 1977 Peromyscus 2
#> 6 1977 Reithrodontomys 2
#> 7 1978 Chaetodipus 23
#> 8 1978 Dipodomys 629
#> 9 1978 Neotoma 23
#> 10 1978 Onychomys 80
#> # ... with 189 more rows
Let us plot this in a line plot:
Unfortunately, this does not work because `ggplot
plotted data for all the genera together. We need to tell ggplot
to draw a line for each genus by modifying the aesthetic function to include group = genus
:
This creates a line for each genus. However, since they are all in the same colour we are not able to distinguish which genus is which. If we use a different colour for each genus the plot should be clear. This is done by using the argument color
in the aesthetic function (using color
also automatically groups the data):
If you would like to plot the distribution of a single continuous variable the frequency will be automatically calculated, so you do not need to use count()
to calculate the frequency beforehand. The x-axis is automatically divided into bins and the number of observations of the continuous variable in each bin is shown as a bar in the histogram. In the ggplot2
package a histogram can be plotted using the geom_histogram
function.
Let us plot a histogram for the continuous variable weight
:
You can identify display categories in the histogram by plotting a stacked histogram which will show categories for each group stacked on top of each other. This is done by using the fill
argument in the aesthetic function. If we want to display sex
in our weight
histogram:
Note that the default number of bins in a histogram is 30. To get have a granular display you can increase the number of bins by using the argument bins
in the geom_histogram
function as above.
There are other plots that can be used for a single continuous variable (see ONE VARIABLE continuous section on ggplot2 cheat sheet).
Challenge -
geom_freqpoly()
Use the
geom_freqpoly()
to display frequency polygons instead of bars for the distribution ofweight
. Showsex
category as before in a different colour.
Hint: Use the argumentcolor
in the aesthetic function rather thanfill
to display a frequency polygon for eachsex
category in a different colour.
In the examples above we learnt how to summarise data over all observations, e.g., we calculated the mean over all observations using the summarise
function. However, in data analysis, especially when dealing with big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is to split the data into more managable pieces, apply any operations required on the data independently on each piece and then combine the results together. The figure below illustrates the approach that is done in the split-apply-combine approach.
Let us work on an example on how we can apply the split-apply-combine strategy on the surveys_complete
dataset. We would like to split the data by the different categories present in the sex
column and calculate the mean weight for each category. We can do this as follows:
#> # A tibble: 1 x 1
#> mean_weight
#> <dbl>
#> 1 41.5
#> # A tibble: 1 x 1
#> mean_weight
#> <dbl>
#> 1 42.1
However, this would be a very tedious process to do if we had several categories. We can do this easily by using the group_by
function in the dplyr
package:
#> # A tibble: 2 x 2
#> sex mean_weight
#> <chr> <dbl>
#> 1 F 41.5
#> 2 M 42.1
You can also group by multiple columns:
#> # A tibble: 46 x 3
#> # Groups: sex [2]
#> sex species_id mean_weight
#> <chr> <chr> <dbl>
#> 1 F BA 9.16
#> 2 F DM 41.6
#> 3 F DO 48.5
#> 4 F DS 117.
#> 5 F NL 154.
#> 6 F OL 30.8
#> 7 F OT 24.8
#> 8 F OX 22
#> 9 F PB 30.2
#> 10 F PE 22.8
#> # ... with 36 more rows
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys_complete %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
#> # A tibble: 46 x 4
#> # Groups: sex [2]
#> sex species_id mean_weight min_weight
#> <chr> <chr> <dbl> <dbl>
#> 1 F BA 9.16 6
#> 2 F DM 41.6 10
#> 3 F DO 48.5 12
#> 4 F DS 117. 45
#> 5 F NL 154. 32
#> 6 F OL 30.8 10
#> 7 F OT 24.8 5
#> 8 F OX 22 22
#> 9 F PB 30.2 12
#> 10 F PE 22.8 11
#> # ... with 36 more rows
Challenge
- Use
group_by()
andsummarise()
to find the mean, min, and max hindfoot length for each species (usingspecies_id
). Also add the number of observations (hint: see?n
).Answer
surveys_complete %>% group_by(species_id) %>% summarise( mean_hindfoot_length = mean(hindfoot_length), min_hindfoot_length = min(hindfoot_length), max_hindfoot_length = max(hindfoot_length), n = n() )
#> # A tibble: 24 x 5 #> species_id mean_hindfoot_len~ min_hindfoot_leng~ max_hindfoot_len~ n #> <chr> <dbl> <dbl> <dbl> <int> #> 1 BA 13 6 16 45 #> 2 DM 36.0 16 50 9727 #> 3 DO 35.6 26 64 2790 #> 4 DS 50.0 39 58 2023 #> 5 NL 32.2 21 42 1045 #> 6 OL 20.5 12 39 905 #> 7 OT 20.3 13 50 2081 #> 8 OX 20.4 19 21 5 #> 9 PB 26.1 2 47 2803 #> 10 PE 20.2 11 30 1198 #> # ... with 14 more rows
- What was the heaviest animal measured in each year? Return the columns
year
,genus
,species
, andweight
.Answer
surveys_complete %>% group_by(year) %>% filter(weight == max(weight)) %>% select(year, genus, species_id, weight) %>% arrange(year)
#> # A tibble: 28 x 4 #> # Groups: year [26] #> year genus species_id weight #> <dbl> <chr> <chr> <dbl> #> 1 1977 Dipodomys DS 149 #> 2 1978 Neotoma NL 232 #> 3 1978 Neotoma NL 232 #> 4 1979 Neotoma NL 274 #> 5 1980 Neotoma NL 241 #> 6 1980 Neotoma NL 241 #> 7 1981 Neotoma NL 251 #> 8 1982 Neotoma NL 252 #> 9 1983 Neotoma NL 256 #> 10 1984 Neotoma NL 259 #> # ... with 18 more rows
The ggplot2
package has a way of creating different plots based on the different categories in the data. This is known as facetting. With facetting we do not need to use group_by()
to split the data into different groups to be able to plot the different categories in different plots as ggplot2
does this automatically.
There are two types of facet
functions:
facet_wrap()
arranges the different plots into muliple rows and columns to cleanly fit on one page.facet_grid()
plots all the categories in 1 row or 1 column.Let us see this in action. When we plotted a time series plot, we created a line for each different genus. Given there are several genera, it would be more clearer if we plotted each line is a seperate plot, one plot for each genus. Facetting will do this very easily. Let us start with facet_wrap()
. We supply the variable that we would like to group upon within vars()
as following:
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
As you can see, each genus has been plotted as a separate plot. It is now clear which are the genera that were observed the most. Another advantage of facetting is that it uses a common axes and all plots are aligned to the same values on the axes, making the different plots comparable. If you want to have different axes for each plot you can do so by using the scales
argument.
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus), scales="free")
The pattern of the graphs that before were hardly visible, e.g., Baiomys, is now clear as the axes have been rescaled to fit the data. This is the main advantage of using free scales. The disadvantage is that the different plots are not comparable as before.
If we would like to see if there is any difference between the sex
, we can do this by adding sex
as another grouping to count()
.
yearly_sex_counts
will now look like:
#> # A tibble: 389 x 4
#> year genus sex n
#> <dbl> <chr> <chr> <int>
#> 1 1977 Chaetodipus F 3
#> 2 1977 Dipodomys F 103
#> 3 1977 Dipodomys M 119
#> 4 1977 Onychomys F 2
#> 5 1977 Onychomys M 1
#> 6 1977 Perognathus F 14
#> 7 1977 Perognathus M 8
#> 8 1977 Peromyscus M 2
#> 9 1977 Reithrodontomys F 1
#> 10 1977 Reithrodontomys M 1
#> # ... with 379 more rows
This should now allow us to also split by sex
. We can use colour to distinguish between the sex
categories:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
Let us do the same thing with facet_grid()
so that we can understand the difference between the two facetting techniques in the ggplot2
package. With facet_grid()
you specify what variable you would like to split on as in the rows
or cols
arguments:
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line() +
#display the genera as columns
facet_grid(cols = vars(genus))
As you can see facet_grid()
placed all the categories of genus
in 1 row, unlike facet_wrap()
which have spread them over multiple rows to fit well in 1 page. Let us split the plots by sex
as well by plotting sex
as the rows:
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
More information on further functionality of facetting can be found in the
facet_wrap()
andfacet_grid()
documentation.
Challenge
Instead of splitting the plots based on
sex
display thesex
as different coloured line graphs in the same plot.
Though the default visualisation of ggplot2
plots is already at a good standard, there are several ways one can improve even further the visualisations.
### Labels
Let us start customising the last plot we have plotted by renaming the axes and adding a title to the plot. This is done by using the labs
function:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals")
The major item that needs fixing in the plot is the text on the x-axis as this crammed and is not readable at the moment. This is mainly due to the fact that the size of the plot is dependent on the size of the window (in this case RStudio). You can work around this by saving your plot to a file and specifying the width of the plot (see Saving a plot to a file section). Themes in the ggplot2
package control the display of all non-data elements of the plot. Let us start customising the text on the x-axis by changing its size and position using the theme
function. Note that theme()
has several other arguments and you can read more about them in the theme()
documentation.
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color=sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size=7, angle=90, vjust=0.5),
axis.text.y = element_text(size=7),
strip.text=element_text(size=7, angle=45))
With the plot already looking better, the last thing we would like to change is the legend. Legends are very tricky in ggplot2
as the fuction to use is determined by the data that is being displayed. In this case the legend has been created based on color groupings. Therefore we can change the lengend title, categories and color as follows:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color=sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size=7, angle=90, vjust=0.5),
axis.text.y = element_text(size=7),
strip.text=element_text(size=7, angle=45)) +
scale_color_brewer("Sex",
palette="Set1",
breaks=c("F", "M"),
labels=c("Female", "Male"))
Note: If you would like to see what other palettes are available please see http://colorbrewer2.org/#type=qualitative&scheme=Set1&n=3.
ggplot2
has a set of themes that can be used to change the overall appearance of the graph without much effort. For example, if we create the first plot again and apply the theme_bw()
theme we get a more simpler white background:
A list of themes can be found in the ggplot2
documentation.
Challenge
Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
Now that you have learned how to use dplyr
to transform your raw data, you may want to export these new datasets to share them with your collaborators or for archival.
Similar to the read_csv
function used for reading CSV files into R, there is a write_csv
function that generates CSV files from data frames and tibbles which is also present in the readr
package.
Before using write_csv()
, we are going to create a new folder, data_output
, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data
folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always re-generate them.
Let us save the surveys_complete
tibble in data_output/surveys_complete.csv
file: