11  Grouped operations

Learning objectives
  • Learn how to use grouped operations in data analysis.
  • Be able to distinguish between grouped summaries and other types of grouped operations.

11.1 Context

We’ve done different types of operations, all on the entire data set. Sometimes there is structure within the data, such as different groups (e.g. genotypes, patient cohorts, geographical areas etc). We might then want information on a group-by-group basis.

11.2 Section setup

We’ll continue this section with the script named da3-08-grouped-operations.R. If needed, add the following code to the top of your script and run it.

# A collection of R packages designed for data science
library(tidyverse)

surveys <- read_csv("data/surveys.csv")

We’ll continue this section with the script named da3-08-grouped-operations.py. If needed, add the following code to the top of your script and run it.

# A Python data analysis and manipulation tool
import pandas as pd

# Python equivalent of `ggplot2`
from plotnine import *

surveys = pd.read_csv("data/surveys.csv")

11.3 Split-apply-combine

Conceptually, this kind of operation can be referred to as split-apply-combine, because we split the data, apply some function and then combine the outcome.

Let’s illustrate this with an example. Figure 11.1 shows a hypothetical data set, where we have temperature and rainfall measurements for different cities.

Figure 11.1: An example of a table with groups

Let’s assume we were interested in the average temperature for each city. We would have to do the following:

  1. Split the data by city
  2. Calculate the average temperature
  3. Combine the outcome together in a new table

This is visualised in Figure 11.2.

Figure 11.2: Split-apply-combine

11.4 Summary operations

Let’s put this into practice with our data set.

11.4.1 Summarising data

A common task in data analysis is to summarise variables to get the mean and the variation around it.

For example, let’s calculate what the mean and standard deviation are for weight.

We can achieve this task using the summarise() function.

surveys |> 
  summarise(weight_mean = mean(weight, na.rm  = TRUE),
            weight_sd = sd(weight, na.rm = TRUE))
# A tibble: 1 × 2
  weight_mean weight_sd
        <dbl>     <dbl>
1        42.7      36.6

A couple of things to notice:

The output of summarise is a new table, where each column is named according to the input to summarise().

Within summarise() we should use functions for which the output is a single value. Also notice that, above, we used the na.rm option within the summary functions, so that they ignored missing values when calculating the respective statistics.

Summary functions

There are many functions whose input is a vector (or a column in a table) and the output is a single number. Here are several common ones:

  • mean(x) - arithmetic mean
  • median(x) - median
  • sd(x) - standard deviation
  • var(x) - variance
  • IQR(x) - interquartile range
  • mad(x) - median absolute deviation
  • min(x) and max(x) - minimum and maximum
  • quantile(x, probs = 0.75) - quantile (use the probs option to set the quantile of your choosing)
  • sum(x) - addition of all values in x
  • n_distinct(x) (from dplyr) - the number of distinct values in the vector x

All of these have the option na.rm, which tells the function remove missing values before doing the calculation.

11.4.2 Grouped summaries

In most cases we want to calculate summary statistics across groups of our data.

We can achieve this by combining summarise() with the group_by() function. For example, let’s modify the previous example to calculate the summary for each sex group:

surveys |> 
  group_by(sex) |> 
  summarise(weight_mean = mean(weight, na.rm  = TRUE),
            weight_sd = sd(weight, na.rm = TRUE))
# A tibble: 3 × 3
  sex   weight_mean weight_sd
  <chr>       <dbl>     <dbl>
1 F            42.2      36.8
2 M            43.0      36.2
3 <NA>         64.7      62.2

The table output now includes both the columns we defined within summarise() as well as the grouping columns defined within group_by().

11.5 Counting data

Counting or tallying data is an extremely useful way of getting to know your data better.

11.5.1 Simple counting

We can use the count() function from dplyr to count data. It always returns the number of rows it counts.

For example, this gives us the total number of observations (rows) in our data set:

count(surveys)
# A tibble: 1 × 1
      n
  <int>
1 35549

We can also do that using conditional statements:

# count the observations from the year 1982
surveys |> 
  filter(year == 1982) |> 
  count()
# A tibble: 1 × 1
      n
  <int>
1  1978

11.5.2 Counting by group

Counting really comes into its own when we’re combining this with some grouping. For example, we might be interested in the number of observations for each year.

surveys |> 
  count(year)
# A tibble: 26 × 2
    year     n
   <dbl> <int>
 1  1977   503
 2  1978  1048
 3  1979   719
 4  1980  1415
 5  1981  1472
 6  1982  1978
 7  1983  1673
 8  1984   981
 9  1985  1438
10  1986   942
# ℹ 16 more rows

We can also easily visualise this (we can pipe straight into ggplot()). We use geom_col() to create a bar chart of the number of observations per year. We count by sex and use this variable to fill the colour of the bars.

surveys |> 
  count(sex, year) |> 
  ggplot(aes(x = year, y = n, fill = sex)) +
  geom_col()
Figure 11.3: Number of observations per year, by sex.
Counting within a summary pipeline

Often we want to do counting when we’re creating summaries. The count() function can’t be used within summarise(), but there is a special helper function called n(). Look at the following example, where we group by year, filter the data, create some summary statistic and also count the number of rows within each group.

surveys |> 
  group_by(year) |>                                   # group the data
  filter(year %in% c(1981, 1982)) |>                  # filter a subset of years
  summarise(mean_weight = mean(weight, na.rm = TRUE), # calculate mean weight
            n_obs = n()) |>                           # number of rows
  ungroup()                                           # drop the grouping
# A tibble: 2 × 3
   year mean_weight n_obs
  <dbl>       <dbl> <int>
1  1981        65.8  1472
2  1982        53.8  1978

11.5.3 Counting missing data

Oh, missing data! How we’ve missed you. For something that isn’t there, is has quite the presence. But, it is an important consideration in data analysis. We’ve already seen how we can remove missing data from and also explored ways to visualise them.

We have seen how to use the summary() function to find missing values. Here we’ll see (even more) ways to tally them.

We can use the is.na() function to great effect, within a summarise() pipeline. We can negate with !is.na() to find non-missing values. Again, using the sum() function then enables us to tally how many missing / non-missing values there are.

surveys |> 
  summarise(obs_present = sum(!is.na(species_id)),    # count non-missing data
            obs_absent = sum(is.na(species_id)),      # count missing data
            n_obs = n(),                              # total number of rows
            precentage_absent = 
              (obs_absent / n_obs) * 100) |>          # percentage of missing data
  ungroup()
# A tibble: 1 × 4
  obs_present obs_absent n_obs precentage_absent
        <int>      <int> <int>             <dbl>
1       34786        763 35549              2.15

11.6 Grouped operations

11.6.1 Grouped filters

Sometimes it can be really handy to filter data, by group. In our surveys data, for example, you might be interested to find out what the minimum weight value is for each year. We can do that as follows:

surveys |> 
  group_by(year) |> 
  filter(weight == min(weight, na.rm = TRUE)) |> 
  ungroup()
# A tibble: 94 × 9
   record_id month   day  year plot_id species_id sex   hindfoot_length weight
       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1       218     9    13  1977       1 PF         M                  13      4
 2      1161     8     5  1978       6 PF         F                  16      6
 3      1230     9     3  1978      19 PF         M                  16      6
 4      1265     9     4  1978      15 PF         F                  16      6
 5      1282     9     4  1978       4 PF         F                  16      6
 6      1343    10     7  1978      11 PF         F                  15      6
 7      1351    10     8  1978       2 PF         M                  15      6
 8      1380    10     8  1978       3 PF         F                  16      6
 9      1400    11     4  1978      19 PF         M                  15      6
10      1427    11     4  1978      21 PF         F                  15      6
# ℹ 84 more rows

You can see that this outputs the minimum value, but if there are multiple entries for each year (such as in 1978), multiple rows returned. If we only wanted to get a single row per minimum value, per year, then we can use slice(1). This slices the first row of each group:

surveys |> 
  group_by(year) |> 
  filter(weight == min(weight, na.rm = TRUE)) |> 
  slice(1) |> 
  ungroup()
# A tibble: 26 × 9
   record_id month   day  year plot_id species_id sex   hindfoot_length weight
       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1       218     9    13  1977       1 PF         M                  13      4
 2      1161     8     5  1978       6 PF         F                  16      6
 3      1923     7    25  1979      18 PF         F                  NA      6
 4      2506     2    25  1980       3 PF         F                  15      5
 5      4052     4     5  1981       3 PF         F                  15      4
 6      5346     2    22  1982      21 PF         F                  14      4
 7      8736    12     8  1983      19 RM         M                  17      4
 8      8809     2     4  1984      16 RM         F                  14      7
 9      9790     1    19  1985      16 RM         F                  16      4
10     11299     3     9  1986       3 RM         F                  16      7
# ℹ 16 more rows

11.6.2 Grouped changes

Sometimes you might need to add a new variable to our table, based on different groups.

Let’s assume we’d have the following summary data:

surveys |> 
  group_by(year) |> 
  summarise(n_obs_f = sum(sex == "F", na.rm = TRUE),
            n_obs_m = sum(sex == "M", na.rm = TRUE)) |> 
  ungroup()
# A tibble: 26 × 3
    year n_obs_f n_obs_m
   <dbl>   <int>   <int>
 1  1977     204     214
 2  1978     503     433
 3  1979     327     324
 4  1980     605     727
 5  1981     631     745
 6  1982     823    1027
 7  1983     771     797
 8  1984     445     443
 9  1985     636     716
10  1986     414     455
# ℹ 16 more rows

Now, let’s say we’d be interested in the percentage of female observations out of the total of observations where it was scored. We’d have to add a new column. Adding new columns is, as we’ve seen before, a job for mutate().

surveys |> 
  group_by(year) |> 
  summarise(n_obs_f = sum(sex == "F", na.rm = TRUE),
            n_obs_m = sum(sex == "M", na.rm = TRUE)) |> 
  ungroup() |> 
  mutate(female_pct = n_obs_f / (n_obs_f + n_obs_m) * 100)
# A tibble: 26 × 4
    year n_obs_f n_obs_m female_pct
   <dbl>   <int>   <int>      <dbl>
 1  1977     204     214       48.8
 2  1978     503     433       53.7
 3  1979     327     324       50.2
 4  1980     605     727       45.4
 5  1981     631     745       45.9
 6  1982     823    1027       44.5
 7  1983     771     797       49.2
 8  1984     445     443       50.1
 9  1985     636     716       47.0
10  1986     414     455       47.6
# ℹ 16 more rows

The nice thing about chaining all these commands is that we can quickly build up what we want. We could, for example, easily plot the outcome of this.

surveys |> 
  group_by(year) |> 
  summarise(n_obs_f = sum(sex == "F", na.rm = TRUE),
            n_obs_m = sum(sex == "M", na.rm = TRUE)) |> 
  ungroup() |> 
  mutate(female_pct = n_obs_f / (n_obs_f + n_obs_m) * 100) |> 
  ggplot(aes(x = year, y = female_pct)) +
  geom_line()

11.6.3 To ungroup or not ungroup

Each time you do a grouped operation, it’s good practice to remove the grouping afterwards. If you don’t, then you might unintentionally be doing operations within the groups later on. Let’s illustrate this with an example.

We’ll take out any missing values, to simplify things.

obs_count <- surveys |> 
  drop_na() |> 
  group_by(sex, year) |> 
  summarise(n_obs = n())
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
obs_count
# A tibble: 52 × 3
# Groups:   sex [2]
   sex    year n_obs
   <chr> <dbl> <int>
 1 F      1977   123
 2 F      1978   430
 3 F      1979   297
 4 F      1980   442
 5 F      1981   482
 6 F      1982   672
 7 F      1983   758
 8 F      1984   436
 9 F      1985   624
10 F      1986   400
# ℹ 42 more rows

Let’s say we now wanted to transform the n_obs variable to a percentage of the total number of observations in the entire data set (which is 30676).

obs_count <- obs_count |> 
  mutate(n_obs_pct = n_obs / sum(n_obs) * 100)

We’d expect these values in n_obs_pct to add up to 100%.

sum(obs_count$n_obs_pct)
[1] 200

However, they add up to 200 instead! Why? That’s because the table was still grouped by sex and as such, the percentages were calculated by each sex group. There are two of them (F, M - we filtered out the missing values), so the percentages add up to 100% within each sex group.

The way to avoid this issue is to ensure we remove any groups from our table, which we can do with ungroup(). Here’s the full string of commands, with the ungrouping step added:

obs_count <- surveys |> 
  drop_na() |>                                  # remove all NAs
  group_by(sex, year) |>                        # group by sex, year
  summarise(n_obs = n()) |>                     # get number of rows
  ungroup() |>                                  # ungroup here
  mutate(n_obs_pct = n_obs / sum(n_obs) * 100)  # calculate percentage
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

We can check the percentages again and see that all is well:

sum(obs_count$n_obs_pct)
[1] 100

11.7 Summary

Key points
  • We can split our data into groups and apply operations to each group.
  • We can then combine the outcomes in a new table.
  • We use summarise() to calculate summary statistics (e.g. mean, median, maximum, etc)
  • Using pipes with groups (e.g. group_by() |> summarise()) we can calculate those summaries across groups.
  • We can also filter (group_by() |> filter()) or create new columns (group_by() |> mutate()).
  • It is good practice to remove grouping (with ungroup()) from tables after group_by() operations, to avoid issues with retained groupings.