# load the libraries
library(tidyverse) # data manipulation
library(patchwork) # to compose plots
library(janitor) # to clean column names
theme_set(theme_minimal()) # change default ggplot2 theme
5 Variant QC
These materials are still under development
- List metrics that can be used for variant-level quality control and filtering.
- Use PLINK to calculate genotype call rates, minor allele frequency and deviations from Hardy-Weinberg equilibrium.
- Analyse quality metric results in R and assess the need for filtering in downstream analyses.
- Discuss how these quality metrics are impacted by population structure.
5.1 Per-variant metrics
Before proceeding with downstream analyses, it’s good practice to investigate quality issues in our variants. We will consider the following metrics for each variant:
- Call rate: Fraction of missing genotypes. Variants with a high fraction of missing data may indicate overall low quality and thus be removed from downstream analysis.
- Allele frequency: The frequency of the allele in the population of samples. Variants with low frequency (rare alleles) are usually excluded from downstream analysis as they incur low statistical power for association tests.
- Hardy-Weinberg deviations: In randomly mating populations, there is a theoretical expectation of how many homozygous and heterozygous individuals there should be given the frequency of the two alleles. Deviations from this expectation may be due to genotyping errors.
We will cover each of these below.
If you haven’t done so already, start an R session with the following packages loaded:
5.2 Call rates
One way to assess the genotype quality of each variant is to calculate how many missing genotypes each sample has. This can then be used to assess the need to exclude variants from our downstream analysis. There are no set rules as to what constitutes a “good call rate”, but typically we may exclude variants with more than ~5% of missing genotypes. This threshold may vary, however, depending on the nature of data you have.
In the previous exercise, you should have produced a file containing the counts and frequency of missing genotypes for each variant. As we did before for the allele frequency file, we can import this table into R:
<- read_tsv("results/1000G_subset.vmiss") |>
vmiss clean_names()
# inspect the table
head(vmiss)
# A tibble: 6 × 5
number_chrom id missing_ct obs_ct f_miss
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 rs141149254 0 904 0
2 1 rs76735897 0 904 0
3 1 rs1451704981 0 904 0
4 1 rs556374459 22 904 0.0243
5 1 rs372315362 0 904 0
6 1 rs867544879 0 904 0
We can tabulate how many variants have missing genotypes:
table(vmiss$missing_ct > 0)
FALSE TRUE
4119671 1234004
Around 23% of variants have a missing genotype in at least one of the samples. For those, we can plot the missing rate distribution:
|>
vmiss filter(f_miss > 0) |>
ggplot(aes(f_miss)) +
geom_histogram()
We can see most of these SNVs have relatively low rates of missing data.
We can see how many variants would be discarded if we used the conventional 5% threshold:
sum(vmiss$f_miss > 0.05)/nrow(vmiss)
[1] 0.000653383
At this threshold we will discard a very small fraction of our variants. We can therefore be satisfied that, in general, there are no major issues with our call rates.
In our downstream analyses, we can exclude variants with >5% missing data by adding the option --geno 0.05
to PLINK.
5.3 Allele frequency
In the previous chapter we already saw how to calculate the allele frequency of our variants using the --freq
option.
We can read this file into R as usual:
<- read_tsv("results/1000G_subset.afreq") |>
afreq clean_names()
head(afreq)
# A tibble: 6 × 6
number_chrom id ref alt alt_freqs obs_ct
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 . G A 0.0775 1806
2 1 . A G 0.315 1806
3 1 . A T 0.00332 1806
4 1 . T G 0 1762
5 1 . C G 0.00443 1806
6 1 . G A 0.119 1806
By default PLINK calculates the allele frequency of the alternative allele. However, this is somewhat arbitrary, as the alternative allele is simply defined as the allele that different from whichever happens to be the reference genome. A more common approach is to visualise the minor allele frequency (MAF), i.e. the frequency of the least-common alelle in the population.
We can calculate the MAF for each variant, adding it as a new column to our data frame, followed by a new histogram:
# add a column of minor allele frequency
<- afreq |>
afreq mutate(maf = ifelse(alt_freqs > 0.5, 1 - alt_freqs, alt_freqs))
# MAF histogram
|>
afreq ggplot(aes(maf)) +
geom_histogram(binwidth = 0.01)
We can see the histogram is quite skewed, with many SNPs having very low frequency. In fact, some of them are not variable at all in our samples! We can quickly tabulate how many SNPs are above the commonly-used 1% threshold of allele frequency:
table(afreq$maf > 0.01)
FALSE TRUE
4587526 766149
We can see that the majority of variants have very low frequency. These must be variants that have been found to vary in other individuals of the 1000 genomes project, but happen to be invariant in our relatively small collection of samples.
Low frequency variants are often filtered out when performing downstream analyses, such as the association test. This is because they have low statistical power, leading to noisy estimates (you can think of it as having a low sample size for one of the classes of genotypes).
To exclude variants with low minor allele frequency, for example at a 1% threshold, we can use PLINK’s option --maf 0.01
.
5.4 Hardy–Weinberg
Another quality control step is to check whether SNPs significantly deviate from Hardy-Weinberg equilibrium, which is expected if individuals mate randomly.
plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --hardy
This outputs a file with .hardy
extension, which we can read into R:
<- read_tsv("results/1000G_subset.hardy") |>
hardy clean_names()
head(hardy)
# A tibble: 6 × 10
number_chrom id a1 ax hom_a1_ct het_a1_ct two_ax_ct o_het_a1 e_het_a1
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 . G A 771 124 8 0.137 0.143
2 1 . A G 434 369 100 0.409 0.432
3 1 . A T 897 6 0 0.00664 0.00662
4 1 . T G 881 0 0 0 0
5 1 . C G 895 8 0 0.00886 0.00882
6 1 . G A 708 175 20 0.194 0.210
# ℹ 1 more variable: p <dbl>
One possible visualisation of these data is to plot the expected heterozygosity versus the observed heterozygosity and colour the points as to whether they are below a chosen p-value threshold:
|>
hardy # randomply sample SNPs
# to avoid plot window from crashing
sample_n(10e3) |>
ggplot(aes(e_het_a1, o_het_a1)) +
geom_point(aes(colour = p < 0.001)) +
geom_abline() +
labs(x = "Expected heterozygosity", y = "Observed heterozygosity")
From this plot, we can see an excess of SNVs with lower heterozygosity than expected compared to those with higher heterozygosity. This discrepancy is because our samples originate from diverse global regions that do not form a “randomly mating population”.
As an example, consider a variant present in one geographical area (e.g., individuals from a specific country) but absent elsewhere. The Hardy-Weinberg equilibrium assumes random mating across the entire population, therefore variants with limited geographic distribution may appear to have an excess of homozygotes. In reality, these variants are simply missing from certain populations.
So, while variants might fit Hardy-Weinberg expectations within randomly mating sub-populations, they will seem to deviate from it when these groups are pooled together. This phenomenon is known as the Wahlund effect and results from population structure, a topic which we return to in a later chapter.
In downstream analysis, we can exclude SNPs with a low p-value for the Hardy-Weinberg deviation test. However, due to the population structure issue just discussed, we only exclude SNPs with higher-than-expected heterozygosity. High rates of heterozygosity may indicate genotyping errors, which we want to eliminate. Wehreas low rates of heterozygosity may simply be due to population structure, which we want to retian. To discard only high heterozygosity SNVs having p-value < 0.001, we can use the option --hwe 0.001 keep-fewhe
.
We have seen a few PLINK commands that are useful for checking properties of our genotype data:
--missing
to assess genotype missingness both across SNPs and samples.--freq
to assess the allele frequency across SNPs.--hardy
to assess genotype frequency deviations from the Hardy-Weinberg equilibrium expectation.
So far, we have run each of these options individually, however you can run multiple options simultaneously. For example, our previous analyses could have been run with a single command:
plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --freq --hardy --missing
This would produce all three respective results files in one go.
5.5 Summary
- Key metrics for variant-level quality control include: call rate, minor allele frequency and Hardy-Weinberg equilibrium.
- Call rates (
--missing
): variants with low call rates (e.g. <95% or >5% missing data) are typically excluded.- To remove variants with missing data above a certain threshold use
--geno X
(replaceX
with the desired fraction, e.g. 0.05).
- To remove variants with missing data above a certain threshold use
- Minor allele frequency (
--maf
): very rare variants (e.g. <1%) have low statistical power and are usually removed from downstream analysis.- To remove variants with MAF below a certain threshold use
--maf X
(replaceX
with the desired frequency, e.g. 0.01).
- To remove variants with MAF below a certain threshold use
- Hardy-Weinberg equilibrium: variants for which genotype frequencies of homozygotes and heterozygous individuals deviate from expectation are removed as they may be due to genotyping errors, inbreeding, and other causes.
- Care should be taken with this statistic, as an excess of homozygotes is expected if there are different sub-populations within the sample being analysed (e.g. samples from different geographic regions).
- Genotyping errors usually result in an excess of heterozygous, and these can be removed using
--hwe X keep-fewhe
(replaceX
with a p-value threshold, typically a low value such as 0.001).