5  Variant QC

Caution

These materials are still under development

Learning objectives
  • List metrics that can be used for variant-level quality control and filtering.
  • Use PLINK to calculate genotype call rates, minor allele frequency and deviations from Hardy-Weinberg equilibrium.
  • Analyse quality metric results in R and assess the need for filtering in downstream analyses.
  • Discuss how these quality metrics are impacted by population structure.

5.1 Per-variant metrics

Before proceeding with downstream analyses, it’s good practice to investigate quality issues in our variants. We will consider the following metrics for each variant:

  • Call rate: Fraction of missing genotypes. Variants with a high fraction of missing data may indicate overall low quality and thus be removed from downstream analysis.
  • Allele frequency: The frequency of the allele in the population of samples. Variants with low frequency (rare alleles) are usually excluded from downstream analysis as they incur low statistical power for association tests.
  • Hardy-Weinberg deviations: In randomly mating populations, there is a theoretical expectation of how many homozygous and heterozygous individuals there should be given the frequency of the two alleles. Deviations from this expectation may be due to genotyping errors.

We will cover each of these below.

Set up your R session

If you haven’t done so already, start an R session with the following packages loaded:

# load the libraries
library(tidyverse) # data manipulation
library(patchwork) # to compose plots
library(janitor)   # to clean column names
theme_set(theme_minimal()) # change default ggplot2 theme

5.2 Call rates

One way to assess the genotype quality of each variant is to calculate how many missing genotypes each sample has. This can then be used to assess the need to exclude variants from our downstream analysis. There are no set rules as to what constitutes a “good call rate”, but typically we may exclude variants with more than ~5% of missing genotypes. This threshold may vary, however, depending on the nature of data you have.

Exercise 1 - Missing genotype data
  • Look at PLINK’s documentation to find the option that calculates missing data reports.

  • Run PLINK with that option, recalling the basic command structure:

    plink2 --pfile data/plink/1000G_subset --out results/1000G_subset OPTION-HERE
  • Look at the top lines of the output files from the terminal (using head), to see if you understand their structure. You can also consult PLINK’s file format documentation.

The option we were being asked to use is called --missing, which the documentation says: “produces sample-based and variant-based missing data reports”.

We therefore run the command:

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --missing

This generates two files with extension .smiss (for sample-missingness report) and .vmiss (for variant-missingness report). The file format documentation details the columns present in each of these files.

We can quickly look at the top rows of each file using the standard head command from our terminal:

head results/1000G_subset.vmiss
#CHROM  ID  MISSING_CT  OBS_CT  F_MISS
1       .   0           904     0
1       .   0           904     0
1       .   0           904     0
1       .   17          904     0.0188053
1       .   0           904     0
1       .   0           904     0
1       .   0           904     0
1       .   0           904     0
1       .   0           904     0

In the previous exercise, you should have produced a file containing the counts and frequency of missing genotypes for each variant. As we did before for the allele frequency file, we can import this table into R:

vmiss <- read_tsv("results/1000G_subset.vmiss") |> 
  clean_names()
  
# inspect the table
head(vmiss)
# A tibble: 6 × 5
  number_chrom id           missing_ct obs_ct f_miss
         <dbl> <chr>             <dbl>  <dbl>  <dbl>
1            1 rs141149254           0    904 0     
2            1 rs76735897            0    904 0     
3            1 rs1451704981          0    904 0     
4            1 rs556374459          22    904 0.0243
5            1 rs372315362           0    904 0     
6            1 rs867544879           0    904 0     

We can tabulate how many variants have missing genotypes:

table(vmiss$missing_ct > 0)

  FALSE    TRUE 
4119671 1234004 

Around 23% of variants have a missing genotype in at least one of the samples. For those, we can plot the missing rate distribution:

vmiss |> 
  filter(f_miss > 0) |> 
  ggplot(aes(f_miss)) +
  geom_histogram()

We can see most of these SNVs have relatively low rates of missing data.

We can see how many variants would be discarded if we used the conventional 5% threshold:

sum(vmiss$f_miss > 0.05)/nrow(vmiss)
[1] 0.000653383

At this threshold we will discard a very small fraction of our variants. We can therefore be satisfied that, in general, there are no major issues with our call rates.

In our downstream analyses, we can exclude variants with >5% missing data by adding the option --geno 0.05 to PLINK.

5.3 Allele frequency

In the previous chapter we already saw how to calculate the allele frequency of our variants using the --freq option.

We can read this file into R as usual:

afreq <- read_tsv("results/1000G_subset.afreq") |> 
  clean_names()
  
head(afreq)
# A tibble: 6 × 6
  number_chrom id    ref   alt   alt_freqs obs_ct
         <dbl> <chr> <chr> <chr>     <dbl>  <dbl>
1            1 .     G     A       0.0775    1806
2            1 .     A     G       0.315     1806
3            1 .     A     T       0.00332   1806
4            1 .     T     G       0         1762
5            1 .     C     G       0.00443   1806
6            1 .     G     A       0.119     1806

By default PLINK calculates the allele frequency of the alternative allele. However, this is somewhat arbitrary, as the alternative allele is simply defined as the allele that different from whichever happens to be the reference genome. A more common approach is to visualise the minor allele frequency (MAF), i.e. the frequency of the least-common alelle in the population.

We can calculate the MAF for each variant, adding it as a new column to our data frame, followed by a new histogram:

# add a column of minor allele frequency
afreq <- afreq |> 
  mutate(maf = ifelse(alt_freqs > 0.5, 1 - alt_freqs, alt_freqs)) 

# MAF histogram
afreq |> 
  ggplot(aes(maf)) +
  geom_histogram(binwidth = 0.01)

We can see the histogram is quite skewed, with many SNPs having very low frequency. In fact, some of them are not variable at all in our samples! We can quickly tabulate how many SNPs are above the commonly-used 1% threshold of allele frequency:

table(afreq$maf > 0.01)

  FALSE    TRUE 
4587526  766149 

We can see that the majority of variants have very low frequency. These must be variants that have been found to vary in other individuals of the 1000 genomes project, but happen to be invariant in our relatively small collection of samples.

Low frequency variants are often filtered out when performing downstream analyses, such as the association test. This is because they have low statistical power, leading to noisy estimates (you can think of it as having a low sample size for one of the classes of genotypes).

To exclude variants with low minor allele frequency, for example at a 1% threshold, we can use PLINK’s option --maf 0.01.

5.4 Hardy–Weinberg

Another quality control step is to check whether SNPs significantly deviate from Hardy-Weinberg equilibrium, which is expected if individuals mate randomly.

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --hardy

This outputs a file with .hardy extension, which we can read into R:

hardy <- read_tsv("results/1000G_subset.hardy") |> 
  clean_names()
  
head(hardy)
# A tibble: 6 × 10
  number_chrom id    a1    ax    hom_a1_ct het_a1_ct two_ax_ct o_het_a1 e_het_a1
         <dbl> <chr> <chr> <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
1            1 .     G     A           771       124         8  0.137    0.143  
2            1 .     A     G           434       369       100  0.409    0.432  
3            1 .     A     T           897         6         0  0.00664  0.00662
4            1 .     T     G           881         0         0  0        0      
5            1 .     C     G           895         8         0  0.00886  0.00882
6            1 .     G     A           708       175        20  0.194    0.210  
# ℹ 1 more variable: p <dbl>

One possible visualisation of these data is to plot the expected heterozygosity versus the observed heterozygosity and colour the points as to whether they are below a chosen p-value threshold:

hardy |> 
  # randomply sample SNPs 
  # to avoid plot window from crashing
  sample_n(10e3) |> 
  ggplot(aes(e_het_a1, o_het_a1)) + 
  geom_point(aes(colour = p < 0.001)) +
  geom_abline() +
  labs(x = "Expected heterozygosity", y = "Observed heterozygosity")

From this plot, we can see an excess of SNVs with lower heterozygosity than expected compared to those with higher heterozygosity. This discrepancy is because our samples originate from diverse global regions that do not form a “randomly mating population”.

As an example, consider a variant present in one geographical area (e.g., individuals from a specific country) but absent elsewhere. The Hardy-Weinberg equilibrium assumes random mating across the entire population, therefore variants with limited geographic distribution may appear to have an excess of homozygotes. In reality, these variants are simply missing from certain populations.

So, while variants might fit Hardy-Weinberg expectations within randomly mating sub-populations, they will seem to deviate from it when these groups are pooled together. This phenomenon is known as the Wahlund effect and results from population structure, a topic which we return to in a later chapter.

In downstream analysis, we can exclude SNPs with a low p-value for the Hardy-Weinberg deviation test. However, due to the population structure issue just discussed, we only exclude SNPs with higher-than-expected heterozygosity. High rates of heterozygosity may indicate genotyping errors, which we want to eliminate. Wehreas low rates of heterozygosity may simply be due to population structure, which we want to retian. To discard only high heterozygosity SNVs having p-value < 0.001, we can use the option --hwe 0.001 keep-fewhe.

Tip: Running multiple options at once

We have seen a few PLINK commands that are useful for checking properties of our genotype data:

  • --missing to assess genotype missingness both across SNPs and samples.
  • --freq to assess the allele frequency across SNPs.
  • --hardy to assess genotype frequency deviations from the Hardy-Weinberg equilibrium expectation.

So far, we have run each of these options individually, however you can run multiple options simultaneously. For example, our previous analyses could have been run with a single command:

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --freq --hardy --missing

This would produce all three respective results files in one go.

5.5 Summary

Key Points
  • Key metrics for variant-level quality control include: call rate, minor allele frequency and Hardy-Weinberg equilibrium.
  • Call rates (--missing): variants with low call rates (e.g. <95% or >5% missing data) are typically excluded.
    • To remove variants with missing data above a certain threshold use --geno X (replace X with the desired fraction, e.g. 0.05).
  • Minor allele frequency (--maf): very rare variants (e.g. <1%) have low statistical power and are usually removed from downstream analysis.
    • To remove variants with MAF below a certain threshold use --maf X (replace X with the desired frequency, e.g. 0.01).
  • Hardy-Weinberg equilibrium: variants for which genotype frequencies of homozygotes and heterozygous individuals deviate from expectation are removed as they may be due to genotyping errors, inbreeding, and other causes.
    • Care should be taken with this statistic, as an excess of homozygotes is expected if there are different sub-populations within the sample being analysed (e.g. samples from different geographic regions).
    • Genotyping errors usually result in an excess of heterozygous, and these can be removed using --hwe X keep-fewhe (replace X with a p-value threshold, typically a low value such as 0.001).