13  Abundance Visualisation

Learning Objectives
  • Perform basic exploratory analysis of variant/lineage abundance data using the software R.
  • Generate several plots showing the change in variant abundance over time.
  • Explore which lineages are detected and assess the uncertainty in their estimates due to sequencing.
  • Recognise the pros and cons of analysing data from individual lineages or summarised across variants of concern.

13.1 Data exploration with R

In the previous section we have generated a CSV file that aggregated all the lineage/variant abundances into a single table. Although some basic exploration of these data can be done in Excel, we can perform more advanced and customised visualisations using the R software (see the R fundamentals appendix for a quick introduction to this software).

The main functions used in this section are:

  • Data import (Posit cheatsheet):
    • read_csv() to import a CSV file as a data.frame/tibble object.
  • Data manipulation (Posit cheatsheet):
    • filter() to subset rows of the table that match a particular condition.
    • arrange() to sort the table by the values of selected columns.
    • count() to count the unique values of selected columns.
    • mutate() to add new columns to the table or modify the values of existing ones.
  • Working with categorical variables (Posit cheatsheet):
    • fct_reorder() to order categorical values based on a numeric variable (rather than the default alphabetical ordering).
  • Working with dates (Posit cheatsheet):
    • floor_date() to round a date down to the time unit specified (e.g. “week” or “month”).
  • Visualisation with ggplot2 (Posit cheatsheet).

We start by loading the tidyverse package, which contains several functions for data manipulation and visualisation:

library(tidyverse)
theme_set(theme_bw())

We have also set a “black-and-white” theme for our ggplot plots, instead of the default “grey” theme.

The next step is to read our data in:

vocs <- read_csv("results/tidyfreyja/vocs_abundances.csv")

We can check the first few rows of our data, to check that is was imported correctly:

head(vocs)
# A tibble: 6 × 10
  sample      name   abundance boot_lo boot_up date       country city  latitude
  <chr>       <chr>      <dbl>   <dbl>   <dbl> <date>     <chr>   <chr>    <dbl>
1 SRR18541074 Delta    0.964    0.960  0.968   2021-12-01 United… San …     32.7
2 SRR18541074 Omicr…   0.0197   0.0132 0.0228  2021-12-01 United… San …     32.7
3 SRR18541074 Other    0.00122  0      0.00872 2021-12-01 United… San …     32.7
4 SRR18541043 Omicr…   0.876    0.873  0.881   2021-12-26 United… San …     32.7
5 SRR18541043 Other    0.0786   0.0683 0.0807  2021-12-26 United… San …     32.7
6 SRR18541043 Delta    0.0338   0.0343 0.0447  2021-12-26 United… San …     32.7
# ℹ 1 more variable: longitude <dbl>

We can start with some basic exploration of our data, answering some simple questions. Some examples are given here.

How many VOCs have >75% frequency?

vocs |> 
  # keep rows with >= 75% abundance
  filter(abundance >= 0.75) |> 
  # sort them by date
  arrange(date)
# A tibble: 12 × 10
   sample      name  abundance boot_lo boot_up date       country city  latitude
   <chr>       <chr>     <dbl>   <dbl>   <dbl> <date>     <chr>   <chr>    <dbl>
 1 SRR18541092 Delta     0.966   0.936   0.971 2021-09-05 United… San …     32.7
 2 SRR18541114 Delta     0.982   0.976   0.984 2021-09-06 United… San …     32.7
 3 SRR18541096 Delta     0.905   0.898   0.923 2021-09-07 United… Caer…     40.1
 4 SRR18541099 Delta     0.888   0.876   0.960 2021-09-07 United… Caer…     40.1
 5 SRR18541098 Delta     0.973   0.961   0.977 2021-09-13 United… Caer…     40.1
 6 SRR18541049 Delta     0.962   0.960   0.973 2021-11-28 United… San …     32.7
 7 SRR18541074 Delta     0.964   0.960   0.968 2021-12-01 United… San …     32.7
 8 SRR18541042 Delta     0.795   0.782   0.820 2021-12-06 United… San …     32.7
 9 SRR18541043 Omic…     0.876   0.873   0.881 2021-12-26 United… San …     32.7
10 SRR18541028 Omic…     0.932   0.909   0.945 2021-12-27 United… San …     32.7
11 SRR18541027 Omic…     0.886   0.883   0.890 2022-01-04 United… San …     32.7
12 SRR18541030 Omic…     0.915   0.905   0.946 2022-01-23 United… San …     32.7
# ℹ 1 more variable: longitude <dbl>

What was the count of each detected variant?

vocs |> 
  count(name)
# A tibble: 8 × 2
  name        n
  <chr>   <int>
1 A           7
2 Alpha       6
3 Beta        7
4 Delta      17
5 Epsilon     1
6 Gamma       8
7 Omicron     9
8 Other      15

We can also start doing some visualisations. For example, the previous question can be visualised with a barplot:

vocs |> 
  # count occurrence of each VOC
  count(name) |> 
  # visualise
  ggplot(aes(x = n, y = name)) +
  geom_col()

Even better, we can sort our variants by count rather than alphabetically:

vocs |> 
  # count occurrence of each VOC
  count(name) |> 
  # reorder the names by their count
  mutate(name = fct_reorder(name, n)) |>
  # visualise
  ggplot(aes(x = n, y = name)) +
  geom_col()

We may also want to break this down by city:

vocs |> 
  # count by city and name
  count(city, name) |> 
  # reorder the names by their count
  mutate(name = fct_reorder(name, n)) |> 
  # visualise
  ggplot(aes(x = n, y = name)) +
  geom_col(aes(fill = city))

It seems like Omicron was only detected in San Diego. Could this be because of the dates when the samples were collected in each city?

vocs |> 
  # create a variable with the start of the week for each sample
  mutate(week = floor_date(date, "week")) |> 
  # count how many samples per week and city
  count(week, city) |> 
  # barplot
  ggplot(aes(week, n)) +
  geom_col() +
  facet_grid(rows = vars(city))

Indeed, it seems like San Diego has a wider coverage across time. It is also clear from this plot that we have a time gap in our sampling, missing samples in Oct and Nov. 

Exercise

Given that San Diego has better sampling through time, let’s create a new table for our downstream visualisations.

  1. Create a new table called sandiego:
    • Retain observations from San Diego only.
    • Add a new column with the start of the month that each sample was collected in.
    • Order the sample IDs and variant IDs by their date of collection.
  2. Make a new barplot with the counts of each variant observed in this city.

For the first task, the following functions can be used:

  • filter() to subset rows
  • mutate() to add or modify columns
  • fct_reorder() to order categorical variables based on the date

The following code creates a new table as requested:

sandiego <- vocs |> 
  filter(city == "San Diego") |> 
  mutate(month = floor_date(date, "month"),
         sample = fct_reorder(sample, date),
         name = fct_reorder(name, date))

We can produce a barplot of variant counts as we did before:

sandiego |> 
  count(name) |> 
  mutate(name = fct_reorder(name, n)) |> 
  ggplot(aes(n, name)) +
  geom_col()

13.2 Variant abundances

Relative lineage abundances can be visualised as a barplot, where the samples are ordered by their date of collection. Using ggplot:

sandiego |> 
  ggplot(aes(x = sample, y = abundance)) +
  geom_col(aes(fill = name)) +
  scale_x_discrete(guide = guide_axis(angle = 45))

To add further information about the date of collection, we can also “facet” the plot by the month when the samples were collected:

sandiego |> 
  ggplot(aes(x = sample, y = abundance)) +
  geom_col(aes(fill = name)) +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  facet_grid(cols = vars(month), scales = "free_x", space = "free_x")

You may notice that some of the bars don’t quite add up to 1. This is simply a rounding issue from Freyja output.

We can also visualise the abundances in a heatmap-style plot, which may be useful if the number of samples is very large:

sandiego |> 
  ggplot(aes(sample, name)) +
  geom_tile(aes(fill = abundance)) +
  scale_fill_viridis_c(limits = c(0, 1)) +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  facet_grid(~ month, scales = "free_x", space = "free_x")

We can clearly see the transition between Delta and Omicron. Delta is dominant in Sep and Nov, then samples start to appear more mixed in Dec, and finally replaced by Omicron by Jan.

Another way to visualise these data is using a line plot, with a line for each variant:

sandiego |> 
  ggplot(aes(date, abundance, colour = name)) +
  geom_point(size = 2) +
  geom_line(linewidth = 1)

This plot also shows the clear shift between Delta and Omicron. We can also see the other variants are much less frequent in these samples.

13.3 Lineage abundances

So far, we have been analysing the frequencies summarised by Variant of Concern (VOC). We could do similar analyses using the individual lineages, however these tend to be less clear.

Let’s start by importing our data:

lineages <- read_csv("results/tidyfreyja/lineage_abundances.csv")

Here is an example of the heatmap-style visualisation for the San Diego samples:

lineages |> 
  # keep only samples from San Diego
  filter(city == "San Diego") |> 
  # order the samples and lineages by date
  mutate(sample = fct_reorder(sample, date),
         name = fct_reorder(name, date)) |> 
  # visualise
  ggplot(aes(sample, name)) +
  geom_tile(aes(fill = abundance)) +
  scale_fill_viridis_c(limits = c(0, 1)) +
  theme_classic() +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  scale_y_discrete(guide = guide_axis(check.overlap = TRUE))

In this case, there are too many lineages to be easily visible on a plot like this (not all lineage names are shown on the y-axis, as they were overlapping each other). Therefore, using the VOCs summaries is more suited for these types of visualisations.

We can also see that the abundance of these lineages is generally very low, most lineages have a low frequency. This may be due to a mixture of sub-lineages being present in the sample, or even due to imprecisions in the estimates from our data. This can happen for lineages that have very similar mutation profiles.

Here is a histogram showing the distribution of lineage abundances:

lineages |> 
  ggplot(aes(abundance)) +
  geom_histogram(binwidth = 0.01)

As we can see, most lineages have an abundance of less than 1% (the first bar in the plot). Since many of these lineages are, in fact, part of the same clade, the summarised VOCs table gives a clearler picture for these types of visualisation.

However, the lineage data may be useful to investigate specific samples in more detail. For example, what were the lineages present in the latest sample collected?

lineages |> 
  filter(sample == "SRR18541030")
# A tibble: 10 × 10
   sample      name  abundance boot_lo boot_up date       country city  latitude
   <chr>       <chr>     <dbl>   <dbl>   <dbl> <date>     <chr>   <chr>    <dbl>
 1 SRR18541030 BA.1…   0.738    0.693  0.799   2022-01-23 United… San …     32.7
 2 SRR18541030 BA.1…   0.0991   0.0674 0.102   2022-01-23 United… San …     32.7
 3 SRR18541030 XD      0.0626   0.0282 0.0769  2022-01-23 United… San …     32.7
 4 SRR18541030 BA.1…   0.0604   0.0513 0.0690  2022-01-23 United… San …     32.7
 5 SRR18541030 BA.1…   0.0161   0      0.0349  2022-01-23 United… San …     32.7
 6 SRR18541030 XS      0.00684  0      0.0137  2022-01-23 United… San …     32.7
 7 SRR18541030 AY.1…   0.00260  0      0.00540 2022-01-23 United… San …     32.7
 8 SRR18541030 BA.1…   0.00129  0      0.00389 2022-01-23 United… San …     32.7
 9 SRR18541030 AY.2…   0.00128  0      0.00278 2022-01-23 United… San …     32.7
10 SRR18541030 AY.1…   0.00125  0      0.00177 2022-01-23 United… San …     32.7
# ℹ 1 more variable: longitude <dbl>

We can plot their frequency and bootstrap uncertainty interval:

lineages |> 
  # keep rows for sample of interest only
  filter(sample == "SRR18541030") |> 
  # sort lineage by abundance
  mutate(name = fct_reorder(name, abundance)) |> 
  # make the plot
  ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +
  geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +
  labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%")

We can see several lineages with low frequency (less than 5%), which we should interpret carefully as those tend to less precise (see Sutcliffe et al. 2023). For higher frequency lineages the confidence intervals are relatively narrow, suggesting the uncertainty due to sequencing depth is not problematic for this sample.

13.3.1 Exercise

Lineages abundance uncertainty
  • Based on the code just shown above, make a similar lineage abundance plot for the each of the following samples: “SRR18541092” and “SRR18541114”.
  • What do you think about the uncertainty in the lineage estimates of these two samples? Can you hypothesise the reason for this difference? (Hint: go back to the MultiQC report and check these samples’ quality).
  • For both samples, make a similar plot but for the summaried VOC abundances. What do you think about the uncertainty in this case? Discuss with your neighbours.

Here is the code for these two variants:

lineages |> 
  filter(sample == "SRR18541092") |> 
  mutate(name = fct_reorder(name, abundance)) |> 
  ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +
  geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%", 
       main = "Sample: SRR18541092")

lineages |> 
  filter(sample == "SRR18541114") |> 
  mutate(name = fct_reorder(name, abundance)) |> 
  ggplot(aes(x = name, y = abundance, colour = abundance < 0.05)) +
  geom_pointrange(aes(ymin = boot_lo, ymax = boot_up)) +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  labs(x = "Lineage", y = "Abundance (95% CI)", colour = "< 5%",
       main = "Sample: SRR18541114")

We can see that the uncertainty in sample SRR18541114 is substantially higher compared to SRR18541092.

The reason is possibly the difference in sequencing depth between these samples. If we go back to the MultiQC report from viralrecon we will see that the median coverage for SRR18541114 is only 94x, compared to 377x for the other sample. Looking at the Mosdepth cumulative coverage graph on the report is even more illustrative:

We can see from this plot that, for example, only ~50% of the genome in SRR18541114 is covered at >100x, compared to ~70% on the other sample. This likely leads to several mutations that are missed during sequencing, leading to less stable abundance estimates.

A similar code could be used to visualise the abundance of the VOCs in these two samples. However, we show a modified version of the code, combining both samples in the same plot:

sandiego |> 
  # keep rows for sample of interest only
  filter(sample %in% c("SRR18541092", "SRR18541114")) |> 
  # make the plot
  ggplot(aes(x = name, y = abundance, colour = sample)) +
  geom_pointrange(aes(ymin = boot_lo, ymax = boot_up), 
                  position = position_dodge(width = 0.5))

We’ve used a few tricks for this visualisation:

  • The %in% operator is used to select rows that match either of those samples.
  • We’ve added colour = sample aesthetic, to colour points based on the sample.
  • position_dodge() is used to shift the points of the two samples so they are not overlapping.

This visualisation reveals much less uncertainty when summarising the lineages into variants of concern. This makes some sense, as multiple lineages are combined together for a given VOC, so even though the uncertainty of individual lineages may be high, this uncertainty is reduced when looking at the summarised results.

This reveals a clear tradeoff between looking at individual lineages compared to summarised VOC abundances. On the one hand we get more information about specific lineages detected, but their abundances may have high uncertainty due to sequencing. On the other hand we get more precise abundance estimates for VOCs, but loose the detail of which specific lineages of those variants were present in our samples.

In conclusion, looking at both results is useful. We can start with an analysis of VOCs to get the bigger picture of which variants are circulating, and later explore individual lineages detected in the samples. When looking at individual lineages, we should be mindful of considering samples with higher sequencing depth, to get a more precise picture.

13.4 Freyja dashboard

In this chapter we covered the use of R to generate these visualisations. This is because of the flexibility of this language, which allows us to perform many different types of visualisation and data exploration.

However, the Freyja developers provide a command to generate a dashboard with a barplot of VOC abundances. We have found this solution less flexible, and it requires configuring the metadata file in a very specific format.

There are two steps involved:

  • freyja aggregate is used to combine the results of multiple samples into a single file.
  • freyja dash is then used to create the dashboard.

The Freyja documentation gives more details about how to use these commands. You will need to pay attention to the specific file format for metadata, which requires certain columns to be present in order for the commands to work.

13.5 Summary

Key points
  • There are several useful exploratory analyses that can be done on variant abundances:
    • Which variants were detected at high frequency.
    • How many times each variant was detected across all samples.
    • How many samples were collected weekly in different regions.
  • Variant abundance over time can be displayed as barplots, heatmaps or line plots.
  • Samples with higher sequencing depth usually have narrower confidence intervals compared to those with low sequencing depth.
  • Analysis of both lineage and VOC abundance is useful:
    • VOC abundance gives more stable estimates and provides a “big picture” of the variants circulating the population.
    • Lineage abundance gives more detail about which exact lineages of a variant are circulating in the population, however the estimates are less precise with wider confidence intervals.