6  Sample QC

Caution

These materials are still under development

Learning objectives
  • List metrics that can be used for sample-level quality control and filtering.
  • Use PLINK to calculate genotype call rates, heterozygosity, discordant sex and sample relatedness.
  • Analyse quality metric results in R and assess the need for filtering in downstream analyses.
  • Discuss how these quality metrics are impacted by population structure.

6.1 Per-sample metrics

Similarly to what we did for variants, we also investigate quality issues in our samples. We will consider the following metrics for each sample:

  • Call rate: Fraction of missing genotypes. Variants with a high fraction of missing data may indicate overall low quality for that sample and thus be removed from downstream analysis.
  • Heterozygosity: The fraction of SNVs that are heterozygous in a given sample. We expect most individuals to have a mixture of both homozygous and heterozygous SNVs. Outliers may indicate genotyping errors.
  • Relatedness: Individuals who are related to each other (e.g. siblings, cousins, parents and children) may affect the association test results and create false positive hits. It is therefore good to assess if there are potential close family members before proceeding.
  • Discordant sex: In humans, we expect males to have only one X chromosome and thus no heterozygous variants, while the converse is true for females. This expectation can be used to identify potential mis-matches and correct them before proceeding.

We will cover each of these below.

Set up your R session

If you haven’t done so already, start an R session with the following packages loaded:

# load the libraries
library(tidyverse) # data manipulation
library(patchwork) # to compose plots
library(janitor)   # to clean column names
theme_set(theme_minimal()) # change default ggplot2 theme

6.2 Call rates

Similarly to what we did for variants, we can calculate how many missing genotypes each individual sample has. This can then be used to assess the need to exclude samples from downstream analyses.

There are no set rules as to what constitutes a “good call rate”, but typically we may exclude samples with greater than ~5% of missing genotypes. This threshold may vary, however, depending on the nature of data you have.

As we saw before, the option --missing generates missingness files for both samples and variants, so we don’t need to re-run the PLINK command. However, here it is as a reminder:

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --missing

For the sample missingness report the file extension is .smiss, which we can import into R as usual:

smiss <- read_tsv("results/1000G_subset.smiss") |> 
  clean_names()
  
# inspect the table
head(smiss)
# A tibble: 6 × 5
  number_fid iid     missing_ct  obs_ct  f_miss
  <chr>      <chr>        <dbl>   <dbl>   <dbl>
1 HG00096    HG00096          0 5353675 0      
2 HG00097    HG00097          0 5353675 0      
3 HG00099    HG00099          0 5353675 0      
4 HG00100    HG00100      42076 5353675 0.00786
5 HG00101    HG00101          0 5353675 0      
6 HG00102    HG00102          0 5353675 0      

We can tabulate how many samples have missing genotypes:

table(smiss$missing_ct > 0)

FALSE  TRUE 
  724   180 

Around 20% of our samples have missing data in at least one of the variants.

For those samples with some missing data, we can check the distribution of the fraction of missing genotypes:

smiss |> 
  filter(f_miss > 0) |> 
  ggplot(aes(f_miss)) +
  geom_histogram()

From this distribution it seems that most individuals have a low fraction of missing genotypes, indicating no problematic samples.

A filter is probably not even necessary in this case, as no sample seems to have more than 5% missing data. In any case, individuals with high rates of missing data, e.g. using the 5% threshold, can be excluded using PLINK’s option --mind 0.05.

6.3 Heterozygosity

Another useful metric to assess issues with sample quality is to look at the fraction of variants that are heterozygous. In general, we expect an individual to have both homozygous and heterozygous genotypes. Individuals with high heterozygosity may represent poor quality samples (e.g. contaminated samples composed of a mixture of DNA from two individuals). Conversely, individuals with high homozygosity may indicate inbreeding, which can impact the balance of allele frequencies in the population (as discussed in the Hardy-Weinberg equilibrium section) and potentially bias the results of our association tests.

There isn’t necessarily a clear value, but within a population we should expect the distribution of heterozygosity to be consistent across individuals. If an individual is an outlier (e.g. with too many homozygous or heterozygous sites), then we may infer some quality issues may have ocurred with that sample.

PLINK can calculate per-sample heterozygosity rates using the option --het, applying the variant filters defined in the variant QC section:

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --geno 0.05 --maf 0.01 --het

The output file has extension .het, which we can import into R:

het <- read_tsv("results/1000G_subset.het") |> 
  clean_names()

head(het)
# A tibble: 6 × 6
  number_fid iid      o_hom  e_hom obs_ct      f
  <chr>      <chr>    <dbl>  <dbl>  <dbl>  <dbl>
1 HG00096    HG00096 618655 604746 765962 0.0863
2 HG00097    HG00097 621546 604746 765962 0.104 
3 HG00099    HG00099 618405 604746 765962 0.0847
4 HG00100    HG00100 615209 599977 759947 0.0952
5 HG00101    HG00101 618268 604746 765962 0.0839
6 HG00102    HG00102 620031 604746 765962 0.0948

We can make a histogram of the fraction of homozygous individuals:

het |> 
  ggplot(aes(f)) +
  geom_histogram()

This gives a very suspicious distribution. We clearly have different groups of samples, some with higher rates of homozygosity and others with lower rates.

This is likely a consequence of the fact we have individuals from different populations (geographic areas). This is an issue we will come back to in the population structure chapter.

For now, it is clear that we cannot easily filter samples based on their heterozygosity, as the populations are heterogeneous.

6.4 Sample relatedness

One other thing that can be considered is whether there are potential relationships within your samples. In our dataset all samples are supposed to be unrelated.

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --geno 0.05 --maf 0.01 --mind 0.05 --make-king-table

This outputs a file with format .kin0, which contains pairwise kinship coefficients for each pair of samples. We can read this table into R as usual:

king <- read_tsv("results/1000G_subset.kin0") |> 
  clean_names()
  
head(king)
# A tibble: 6 × 8
  number_fid1 iid1    fid2    iid2      nsnp hethet   ibs0  kinship
  <chr>       <chr>   <chr>   <chr>    <dbl>  <dbl>  <dbl>    <dbl>
1 HG00097     HG00097 HG00096 HG00096 765962 0.0682 0.0362 -0.0161 
2 HG00099     HG00099 HG00096 HG00096 765962 0.0703 0.0338  0.00680
3 HG00099     HG00099 HG00097 HG00097 765962 0.0674 0.0347 -0.0106 
4 HG00100     HG00100 HG00096 HG00096 759947 0.0693 0.0357 -0.00771
5 HG00100     HG00100 HG00097 HG00097 759947 0.0703 0.0383 -0.0189 
6 HG00100     HG00100 HG00099 HG00099 759947 0.0701 0.0330  0.00793

For each pair of samples we now have the kinship coefficient calculated using the KING method. The authors of this method indicate that unrelated individuals should have a kinship coefficient of zero, but they recommend using a threshold of ~0.044 (cf. Table 1 in Manichaikul et al. 2010).

We can check how many individuals are above this threshold:

table(king$kinship > 0.044)

 FALSE   TRUE 
408137     19 

Only 19 individuals are above this threshold. We can also look at the distribution of this kinship coefficient:

king |> 
  ggplot(aes(kinship)) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = 1/16)

Similar to what we’ve seen before for heterozygosity, we get a multi-modal distribution of kinship values. This is again likely because of population structure, i.e. the fact that our samples come from different geographic areas and thus may have differing base levels of “residual relatedness”.

However, we can see all most values are negative, which can happen with this coefficient and essentially indicates no relatedness between individuals.

More worringly, we can see some individuals seem closely related if we sort the table in descending order of kinship and look at the top few rows:

king |> 
  arrange(desc(kinship)) |> 
  head(n = 15)
# A tibble: 15 × 8
   number_fid1 iid1    fid2    iid2      nsnp hethet     ibs0 kinship
   <chr>       <chr>   <chr>   <chr>    <dbl>  <dbl>    <dbl>   <dbl>
 1 NA20900     NA20900 NA20882 NA20882 765962 0.100  0.000256  0.251 
 2 NA20900     NA20900 NA20891 NA20891 765962 0.0971 0.000317  0.243 
 3 2481        NA20321 2481a   NA20320 758628 0.112  0.000414  0.242 
 4 NA21135     NA21135 NA21109 NA21109 760411 0.0819 0.0168    0.122 
 5 NA19042     NA19042 NA19027 NA19027 753225 0.0940 0.0206    0.112 
 6 HG00881     HG00881 HG00851 HG00851 765962 0.0739 0.0222    0.0771
 7 HG00120     HG00120 HG00116 HG00116 765962 0.0759 0.0238    0.0738
 8 NA19312     NA19312 NA19307 NA19307 765962 0.0927 0.0286    0.0723
 9 LWK003      NA19434 NA19355 NA19355 765962 0.0862 0.0285    0.0628
10 HG00240     HG00240 HG00238 HG00238 765962 0.0773 0.0265    0.0611
11 NA19384     NA19384 NA19025 NA19025 763323 0.0871 0.0297    0.0598
12 HG02179     HG02179 HG01795 HG01795 765962 0.0712 0.0256    0.0548
13 NA19452     NA19452 NA19451 NA19451 762593 0.0867 0.0307    0.0532
14 NA20891     NA20891 NA20864 NA20864 765962 0.0756 0.0271    0.0510
15 NA19376     NA19376 NA19347 NA19347 765962 0.0862 0.0308    0.0491

We see a few individuals have a kinship coefficient of ~1/4, which is indicative of full siblings. Others are close to ~1/8 (second degree relationships, e.g. cousins) and a few close to ~1/16 (third degree relationship).

To eliminate close relationships from downstream analysis, we can use PLINK’s option --king-cutoff 0.125 to eliminate at least second degree relationships.

6.5 Discordant sex

Work in progress

Sorry, this section isn’t yet finished. There’s a brief explanation below, but we are still working on having data suitable to demonstrate this analysis.

Another thing that can be checked is whether the sex assigned to each of our samples is correct, based on the genetic data. This relies on having genotypes in a sex chromosome and checking whether the heterozygosity matches the expectation. For example, in humans males have one copy of the X chromosome, and females two. Therefore, all males should be homozygous for variants in the X chromosome, whereas females may have a mixture of heterozygous and homozygous sites.

We can do this quality check using PLINK’s --check-sex option:

plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --check-sex

Unfortunately this does now work in our example data, as we are missing genotypes on the X chromosome.

6.6 Summary

Key Points
  • Key metrics for sample-level quality control include: call rate, heterozygosity, relatedness and sex assignment.
  • Call rates (--missing): samples with low call rates (e.g. <95% or >5% missing genotypes) are typically excluded.
    • To remove samples missing genotypes above a certain threshold use --mind X (replace X with the desired fraction, e.g. 0.05).
  • Heterozygosity (--het): samples that are outlier with regards to the number of heterozygous variants relative to other samples are typically excluded as they may represent errors (e.g. mixed samples or genotyping errors).
    • It is common practice to remove samples a certain number of standard deviations away from the observed distribution (e.g. +/- 3 SDs).
    • However, this should only be applied to samples from relatively homogenous populations, as different sub-populations may have different distributions.
  • Sample relatedness (--king): even when individuals are thought to be unrelated, there may be some criptic relatedness that was previously unknown. Excluding very close relatives is best-practice to avoid biasing downstream association tests.
    • To exlude individuals above a certain kinship score threshold use --king-cutoff X (replace X with the desired threshold). Common thresholds include 1/4 for full siblings, 1/8 for second degree relatives and 1/16 for third degree relatives.
  • Discordant sex (--check-sex): it is good practice to check if the sex assigned to the individual (specified in the .psam/.fam file) matches the sex from genetic data. This can be determined from genotypes in the sex chromosomes. Individual with discordant sex may either be corrected or removed from the analysis.