# A collection of R packages designed for data science
library(tidyverse)
<- read_csv("data/surveys.csv") surveys
11 Grouped operations
- Learn how to use grouped operations in data analysis.
- Be able to distinguish between grouped summaries and other types of grouped operations.
11.1 Context
We’ve done different types of operations, all on the entire data set. Sometimes there is structure within the data, such as different groups (e.g. genotypes, patient cohorts, geographical areas etc). We might then want information on a group-by-group basis.
11.2 Section setup
We’ll continue this section with the script named da3-08-grouped-operations.R
. If needed, add the following code to the top of your script and run it.
We’ll continue this section with the script named da3-08-grouped-operations.py
. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
= pd.read_csv("data/surveys.csv") surveys
11.3 Split-apply-combine
Conceptually, this kind of operation can be referred to as split-apply-combine, because we split the data, apply some function and then combine the outcome.
Let’s illustrate this with an example. Figure 11.1 shows a hypothetical data set, where we have temperature and rainfall measurements for different cities.
Let’s assume we were interested in the average temperature for each city. We would have to do the following:
- Split the data by
city
- Calculate the average
temperature
- Combine the outcome together in a new table
This is visualised in Figure 11.2.
11.4 Summary operations
Let’s put this into practice with our data set.
11.4.1 Summarising data
A common task in data analysis is to summarise variables to get the mean and the variation around it.
For example, let’s calculate what the mean and standard deviation are for weight
.
We can achieve this task using the summarise()
function.
|>
surveys summarise(weight_mean = mean(weight, na.rm = TRUE),
weight_sd = sd(weight, na.rm = TRUE))
# A tibble: 1 × 2
weight_mean weight_sd
<dbl> <dbl>
1 42.7 36.6
A couple of things to notice:
The output of summarise
is a new table, where each column is named according to the input to summarise()
.
Within summarise()
we should use functions for which the output is a single value. Also notice that, above, we used the na.rm
option within the summary functions, so that they ignored missing values when calculating the respective statistics.
There are many functions whose input is a vector (or a column in a table) and the output is a single number. Here are several common ones:
mean(x)
- arithmetic meanmedian(x)
- mediansd(x)
- standard deviationvar(x)
- varianceIQR(x)
- interquartile rangemad(x)
- median absolute deviationmin(x)
andmax(x)
- minimum and maximumquantile(x, probs = 0.75)
- quantile (use theprobs
option to set the quantile of your choosing)sum(x)
- addition of all values inx
n_distinct(x)
(fromdplyr
) - the number of distinct values in the vectorx
All of these have the option na.rm
, which tells the function remove missing values before doing the calculation.
11.4.2 Grouped summaries
In most cases we want to calculate summary statistics across groups of our data.
We can achieve this by combining summarise()
with the group_by()
function. For example, let’s modify the previous example to calculate the summary for each sex
group:
|>
surveys group_by(sex) |>
summarise(weight_mean = mean(weight, na.rm = TRUE),
weight_sd = sd(weight, na.rm = TRUE))
# A tibble: 3 × 3
sex weight_mean weight_sd
<chr> <dbl> <dbl>
1 F 42.2 36.8
2 M 43.0 36.2
3 <NA> 64.7 62.2
The table output now includes both the columns we defined within summarise()
as well as the grouping columns defined within group_by()
.
11.5 Counting data
Counting or tallying data is an extremely useful way of getting to know your data better.
11.5.1 Simple counting
We can use the count()
function from dplyr
to count data. It always returns the number of rows it counts.
For example, this gives us the total number of observations (rows) in our data set:
count(surveys)
# A tibble: 1 × 1
n
<int>
1 35549
We can also do that using conditional statements:
# count the observations from the year 1982
|>
surveys filter(year == 1982) |>
count()
# A tibble: 1 × 1
n
<int>
1 1978
11.5.2 Counting by group
Counting really comes into its own when we’re combining this with some grouping. For example, we might be interested in the number of observations for each year.
|>
surveys count(year)
# A tibble: 26 × 2
year n
<dbl> <int>
1 1977 503
2 1978 1048
3 1979 719
4 1980 1415
5 1981 1472
6 1982 1978
7 1983 1673
8 1984 981
9 1985 1438
10 1986 942
# ℹ 16 more rows
We can also easily visualise this (we can pipe straight into ggplot()
). We use geom_col()
to create a bar chart of the number of observations per year. We count by sex
and use this variable to fill the colour of the bars.
|>
surveys count(sex, year) |>
ggplot(aes(x = year, y = n, fill = sex)) +
geom_col()
Often we want to do counting when we’re creating summaries. The count()
function can’t be used within summarise()
, but there is a special helper function called n()
. Look at the following example, where we group by year, filter the data, create some summary statistic and also count the number of rows within each group.
|>
surveys group_by(year) |> # group the data
filter(year %in% c(1981, 1982)) |> # filter a subset of years
summarise(mean_weight = mean(weight, na.rm = TRUE), # calculate mean weight
n_obs = n()) |> # number of rows
ungroup() # drop the grouping
# A tibble: 2 × 3
year mean_weight n_obs
<dbl> <dbl> <int>
1 1981 65.8 1472
2 1982 53.8 1978
11.5.3 Counting missing data
Oh, missing data! How we’ve missed you. For something that isn’t there, is has quite the presence. But, it is an important consideration in data analysis. We’ve already seen how we can remove missing data from and also explored ways to visualise them.
We have seen how to use the summary()
function to find missing values. Here we’ll see (even more) ways to tally them.
We can use the is.na()
function to great effect, within a summarise()
pipeline. We can negate with !is.na()
to find non-missing values. Again, using the sum()
function then enables us to tally how many missing / non-missing values there are.
|>
surveys summarise(obs_present = sum(!is.na(species_id)), # count non-missing data
obs_absent = sum(is.na(species_id)), # count missing data
n_obs = n(), # total number of rows
precentage_absent =
/ n_obs) * 100) |> # percentage of missing data
(obs_absent ungroup()
# A tibble: 1 × 4
obs_present obs_absent n_obs precentage_absent
<int> <int> <int> <dbl>
1 34786 763 35549 2.15
11.6 Grouped operations
11.6.1 Grouped filters
Sometimes it can be really handy to filter data, by group. In our surveys
data, for example, you might be interested to find out what the minimum weight
value is for each year
. We can do that as follows:
|>
surveys group_by(year) |>
filter(weight == min(weight, na.rm = TRUE)) |>
ungroup()
# A tibble: 94 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 218 9 13 1977 1 PF M 13 4
2 1161 8 5 1978 6 PF F 16 6
3 1230 9 3 1978 19 PF M 16 6
4 1265 9 4 1978 15 PF F 16 6
5 1282 9 4 1978 4 PF F 16 6
6 1343 10 7 1978 11 PF F 15 6
7 1351 10 8 1978 2 PF M 15 6
8 1380 10 8 1978 3 PF F 16 6
9 1400 11 4 1978 19 PF M 15 6
10 1427 11 4 1978 21 PF F 15 6
# ℹ 84 more rows
You can see that this outputs the minimum value, but if there are multiple entries for each year (such as in 1978
), multiple rows returned. If we only wanted to get a single row per minimum value, per year, then we can use slice(1)
. This slices the first row of each group:
|>
surveys group_by(year) |>
filter(weight == min(weight, na.rm = TRUE)) |>
slice(1) |>
ungroup()
# A tibble: 26 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 218 9 13 1977 1 PF M 13 4
2 1161 8 5 1978 6 PF F 16 6
3 1923 7 25 1979 18 PF F NA 6
4 2506 2 25 1980 3 PF F 15 5
5 4052 4 5 1981 3 PF F 15 4
6 5346 2 22 1982 21 PF F 14 4
7 8736 12 8 1983 19 RM M 17 4
8 8809 2 4 1984 16 RM F 14 7
9 9790 1 19 1985 16 RM F 16 4
10 11299 3 9 1986 3 RM F 16 7
# ℹ 16 more rows
11.6.2 Grouped changes
Sometimes you might need to add a new variable to our table, based on different groups.
Let’s assume we’d have the following summary data:
|>
surveys group_by(year) |>
summarise(n_obs_f = sum(sex == "F", na.rm = TRUE),
n_obs_m = sum(sex == "M", na.rm = TRUE)) |>
ungroup()
# A tibble: 26 × 3
year n_obs_f n_obs_m
<dbl> <int> <int>
1 1977 204 214
2 1978 503 433
3 1979 327 324
4 1980 605 727
5 1981 631 745
6 1982 823 1027
7 1983 771 797
8 1984 445 443
9 1985 636 716
10 1986 414 455
# ℹ 16 more rows
Now, let’s say we’d be interested in the percentage of female observations out of the total of observations where it was scored. We’d have to add a new column. Adding new columns is, as we’ve seen before, a job for mutate()
.
|>
surveys group_by(year) |>
summarise(n_obs_f = sum(sex == "F", na.rm = TRUE),
n_obs_m = sum(sex == "M", na.rm = TRUE)) |>
ungroup() |>
mutate(female_pct = n_obs_f / (n_obs_f + n_obs_m) * 100)
# A tibble: 26 × 4
year n_obs_f n_obs_m female_pct
<dbl> <int> <int> <dbl>
1 1977 204 214 48.8
2 1978 503 433 53.7
3 1979 327 324 50.2
4 1980 605 727 45.4
5 1981 631 745 45.9
6 1982 823 1027 44.5
7 1983 771 797 49.2
8 1984 445 443 50.1
9 1985 636 716 47.0
10 1986 414 455 47.6
# ℹ 16 more rows
The nice thing about chaining all these commands is that we can quickly build up what we want. We could, for example, easily plot the outcome of this.
11.6.3 To ungroup or not ungroup
Each time you do a grouped operation, it’s good practice to remove the grouping afterwards. If you don’t, then you might unintentionally be doing operations within the groups later on. Let’s illustrate this with an example.
We’ll take out any missing values, to simplify things.
<- surveys |>
obs_count drop_na() |>
group_by(sex, year) |>
summarise(n_obs = n())
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
obs_count
# A tibble: 52 × 3
# Groups: sex [2]
sex year n_obs
<chr> <dbl> <int>
1 F 1977 123
2 F 1978 430
3 F 1979 297
4 F 1980 442
5 F 1981 482
6 F 1982 672
7 F 1983 758
8 F 1984 436
9 F 1985 624
10 F 1986 400
# ℹ 42 more rows
Let’s say we now wanted to transform the n_obs
variable to a percentage of the total number of observations in the entire data set (which is 30676).
<- obs_count |>
obs_count mutate(n_obs_pct = n_obs / sum(n_obs) * 100)
We’d expect these values in n_obs_pct
to add up to 100%.
sum(obs_count$n_obs_pct)
[1] 200
However, they add up to 200 instead! Why? That’s because the table was still grouped by sex
and as such, the percentages were calculated by each sex
group. There are two of them (F
, M
- we filtered out the missing values), so the percentages add up to 100% within each sex
group.
The way to avoid this issue is to ensure we remove any groups from our table, which we can do with ungroup()
. Here’s the full string of commands, with the ungrouping step added:
<- surveys |>
obs_count drop_na() |> # remove all NAs
group_by(sex, year) |> # group by sex, year
summarise(n_obs = n()) |> # get number of rows
ungroup() |> # ungroup here
mutate(n_obs_pct = n_obs / sum(n_obs) * 100) # calculate percentage
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
We can check the percentages again and see that all is well:
sum(obs_count$n_obs_pct)
[1] 100
11.7 Summary
- We can split our data into groups and apply operations to each group.
- We can then combine the outcomes in a new table.
- We use
summarise()
to calculate summary statistics (e.g. mean, median, maximum, etc) - Using pipes with groups (e.g.
group_by() |> summarise()
) we can calculate those summaries across groups. - We can also filter (
group_by() |> filter()
) or create new columns (group_by() |> mutate()
). - It is good practice to remove grouping (with
ungroup()
) from tables aftergroup_by()
operations, to avoid issues with retained groupings.