# load the libraries
library(tidyverse) # data manipulation
library(patchwork) # to compose plots
library(janitor) # to clean column names
theme_set(theme_minimal()) # change default ggplot2 theme
10 Visualising association results
These materials are still under development
- Summarise what a Manhattan plot displays and why it is useful in the context of GWAS.
- Produce Manhattan plots using
ggplot2
. - Identify the top variants associated with a trait.
- Calculate linkage disequilibrium statistics between a variant of interest and its neighbours.
- Produce a regional association plot that focuses on a region of interest.
10.1 Manhattan plots
The most iconic data visualisation related to GWAS is the so-called Manhattan plot, which displays each variant as a point, with its genome position along the x-axis and its associated -log₁₀(p-value) on the y-axis. Significant associations show as “peaks” passing through the genome-wide significance threshold, represented as a horizontal line. Additionally, each chromosome is shown a separate panel, giving a complete genome-wide view of the association results.
In this section we show how to produce this and other visualisations of our association test results in R.
If you haven’t done so already, start an R session with the following packages loaded:
As an example, we will continue with the results for the “blood pressure” trait. Here is how to read it into R, if you haven’t done so already:
# read the association test results including PCA covariates
<- read_tsv("results/1000G_subset_pca.blood.glm.linear") |>
blood_covar clean_names()
# retain only the SNP test
<- blood_covar |>
blood_covar filter(test == "ADD")
10.2 Visualise GWAS results with ggplot2
To make our Manhattan plot, we can use standard ggplot2
functionality. We use several features of this plotting library to make our plot more effective (see box below for details).
ggplot2
code explanation
Here is the break down of our plotting code:
filter()
is used to retain only p-values below 0.01 before plotting. This is done to reduce the number of points being plotted, to avoid crashing the plotting device. As we are not interested in high p-values, we retain only those below 0.01.ggplot()
initiates the plot, with genome position as the x-axis and the -log₁₀(p-value) as our y-axis.geom_point()
displays points on the plot.geom_hline()
adds a horizontal line at the genome-wide significance threshold of \(5 \times 10^{-8}\), which is often used in human GWAS.facet_grid()
splits the plot into panels, one per chromosome. We split the facets by “column”, and we make sure that both the scale and the space allocated to each facet is allowed to vary for each chromosome. Finally, we switch the facet labels to appear at the bottom of the plot, for aesthetic reasons. You can try removing those options to see what happens.labs()
is used to edit the x-axis label and add a title to the plot.theme_minimal()
andtheme()
are used together to make the plot more aesthetically pleasing, by removing x-axis labels, tick marks and gridlines.
To save time, you can save some of this code in a variable, for example:
<- theme_minimal() +
manhattan_theme theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid = element_blank(),
panel.spacing = unit(0.1, "lines"),
strip.background = element_blank()
)
|>
blood_covar filter(p < 0.01) |>
ggplot(aes(pos, -log10(p))) +
geom_point() +
geom_hline(yintercept = -log10(5e-8), linetype = "dashed") +
facet_grid(~ number_chrom,
scale = "free_x",
space = "free_x",
switch = "x") +
labs(x = "Chromosome",
title = "Manhattan plot") +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid = element_blank(),
panel.spacing = unit(0.1, "lines"),
strip.background = element_blank()
)
The plot shows a genome-wide significant association in chromosome 5. There are other regions that seem to contain peaks of association (e.g. on chromosomes 10 and 12), but these do not pass the genome-wide threshold.
10.3 Regional plots
Now that we have found an association with our trait, we may want to investigate it further. One common visalisation is to make a “regional plot”, where we zoom-in on a SNV of interest and make a Manhattan plot, colouring the points by the linkage coefficient to the target SNP.
First, let’s identify our top-most associated SNV:
|>
blood_covar arrange(p) |>
select(number_chrom, pos, id, beta, p)
# A tibble: 761,360 × 5
number_chrom pos id beta p
<dbl> <dbl> <chr> <dbl> <dbl>
1 5 32788263 rs1158715 0.513 0.00000000728
2 5 32789286 rs2287661 0.513 0.00000000728
3 5 32792663 rs13362538 0.510 0.00000000875
4 5 32790510 rs1345597 0.509 0.0000000104
5 5 32796486 rs4555828 0.503 0.0000000156
6 5 32796761 rs6450927 0.503 0.0000000156
7 5 32797551 rs7703872 0.503 0.0000000156
8 5 32800911 rs75824420 0.503 0.0000000156
9 5 32802025 rs6890651 0.503 0.0000000156
10 5 32792218 rs924612 0.500 0.0000000171
# ℹ 761,350 more rows
We can see at the top we have two SNPs with the same estimated β coefficient and p-value. These must be two SNVs that have the same genotype across all individuals (i.e. they are in perfect linkage). We arbitrarily choose “rs1158715” to proceed with our analysis.
We can calculate the linkage score for our target SNP using PLINK:
plink2 --pfile data/plink/1000G_subset --out results/1000G_subset_rs1158715 \
--geno 0.05 --maf 0.01 --hwe 0.001 keep-fewhet \
--mind 0.05 --keep results/1000G_subset.king.cutoff.in.id \
--r2-unphased --ld-snp rs1158715 \
--ld-window-kb 500 --ld-window-r2 0.05
--r2-unphased
calculates the correlation between pairs of SNVs.--ld-snp
indicates which is the target SNV we are interested in.--ld-window-kb
restricts the calculation to SNVs within 500 kbp of the target SNV.--ld-window-r2
indicates what is the minimum r² we want reported. By default this is 0.2, here we lower this threshold for illustration purposes. In real analysis, it may be sensible to truncate the calculation at 0.2, to reduce the size of the output file.
The --r2-unphased
option outputs a file with .vcor
(variant correlation) extension. As usual, this is a tab-delimited file, which we can read into R:
<- read_tsv("results/1000G_subset_rs1158715.vcor") |>
rs1158715 clean_names()
head(rs1158715)
# A tibble: 6 × 7
number_chrom_a pos_a id_a chrom_b pos_b id_b unphased_r2
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 5 32788263 rs1158715 5 32618642 rs72739273 0.0564
2 5 32788263 rs1158715 5 32626318 rs16889987 0.0608
3 5 32788263 rs1158715 5 32631602 rs10805572 0.0548
4 5 32788263 rs1158715 5 32638605 rs16890002 0.0563
5 5 32788263 rs1158715 5 32665481 rs6881969 0.0525
6 5 32788263 rs1158715 5 32670764 rs4530767 0.0544
The table contains the correlation between genotypes in our SNV of interest (id_a
) and each other SNV (id_b
) within 500 kbp of it. We can quickly look at the distribution of the correlation values:
|>
rs1158715 ggplot(aes(unphased_r2)) +
geom_histogram(breaks = seq(0, 1, 0.02))
We can see that most SNVs have low correlation with our target SNP, but a few have very high correlation.
We can also see the expected decay in LD with distance from the target SNV:
|>
rs1158715 # add column with distance to target variant
mutate(dist = abs(pos_a - pos_b)) |>
# plot with a trend line added
ggplot(aes(dist, unphased_r2)) +
geom_point() +
geom_smooth(se = FALSE)
Finally, we can produce a regional association plot, which displays the strength of association as well as the correlation with the variants surrounding our peak variant. We do this by:
- Filtering our GLM results table to focus on the region of interest.
- Joining it with the correlation table (using the matching variant ids)
- Producing a plot with genomic position on the x-axis, -log₁₀(p-value) on the y-axis, and setting the points colour and size by their correlation value.
|>
blood_covar # retain variants within 250kb each side of our target
filter(number_chrom == 5 & pos > 32788263 - 250e3 & pos < 32788263 + 250e3) |>
# join with our LD table
left_join(rs1158715, by = c("id" = "id_b")) |>
# for SNVs with no correlation value (below 0.05)
# we set them to 0 for plotting purposes
mutate(unphased_r2 = ifelse(is.na(unphased_r2), 0, unphased_r2)) |>
# for the target variant itself, we set it to 1
mutate(unphased_r2 = ifelse(id == "rs1158715", 1, unphased_r2)) |>
# plot
ggplot(aes(pos, -log10(p))) +
geom_point(aes(colour = unphased_r2, size = unphased_r2)) +
geom_hline(yintercept = -log10(5e-8), linetype = "dashed") +
scale_colour_gradient2(low = "#313695",
mid = "#ffffbf",
high = "#a50026",
midpoint = 0.5)
This visualisation clearly shows a cluster of variants in close proximity of the target variant and with high genotype correlation to it.
This visualisation is useful not only to display our results, but also to help us prioritise which variants to focus on in downstream analyses.
10.4 Summary
- Manhattan plots display the results of the association tests for each variant:
- Each variant is displayed along the x-axis according to its genome position.
- The y-axis shows the strenght of association as -log₁₀(p-value).
- Manhattan plots can be produced using standard
ggplot2
functionality, in particular taking advantage of thefacet_grid()
function and custom themes. - Once a variant of interest is identified, its correlation with neighbouring variants can be calculated (a measure of LD between neighbouring variants).
- A combination of PLINK’s options can be used to calculate LD around a target variant:
--r2-unphased
,--ld-snp
,--ld-window-kb
and--ld-window-r2
.
- A combination of PLINK’s options can be used to calculate LD around a target variant:
- Regional association plots are a zoomed-in version of a Manhattan plot, additionally displaying the LD with a target variant using a colour scale.