# load the libraries
library(tidyverse) # data manipulation
library(patchwork) # to compose plots
library(janitor) # to clean column names
theme_set(theme_minimal()) # change default ggplot2 theme
6 Sample QC
These materials are still under development
- List metrics that can be used for sample-level quality control and filtering.
- Use PLINK to calculate genotype call rates, heterozygosity, discordant sex and sample relatedness.
- Analyse quality metric results in R and assess the need for filtering in downstream analyses.
- Discuss how these quality metrics are impacted by population structure.
6.1 Per-sample metrics
Similarly to what we did for variants, we also investigate quality issues in our samples. We will consider the following metrics for each sample:
- Call rate: Fraction of missing genotypes. Variants with a high fraction of missing data may indicate overall low quality for that sample and thus be removed from downstream analysis.
- Heterozygosity: The fraction of SNVs that are heterozygous in a given sample. We expect most individuals to have a mixture of both homozygous and heterozygous SNVs. Outliers may indicate genotyping errors.
- Relatedness: Individuals who are related to each other (e.g. siblings, cousins, parents and children) may affect the association test results and create false positive hits. It is therefore good to assess if there are potential close family members before proceeding.
- Discordant sex: In humans, we expect males to have only one X chromosome and thus no heterozygous variants, while the converse is true for females. This expectation can be used to identify potential mis-matches and correct them before proceeding.
We will cover each of these below.
If you haven’t done so already, start an R session with the following packages loaded:
6.2 Call rates
Similarly to what we did for variants, we can calculate how many missing genotypes each individual sample has. This can then be used to assess the need to exclude samples from downstream analyses.
There are no set rules as to what constitutes a “good call rate”, but typically we may exclude samples with greater than ~5% of missing genotypes. This threshold may vary, however, depending on the nature of data you have.
As we saw before, the option --missing
generates missingness files for both samples and variants, so we don’t need to re-run the PLINK command. However, here it is as a reminder:
plink2 \
--pfile data/plink/1000G_subset \
--out results/1000G_subset \
--missing
For the sample missingness report the file extension is .smiss
, which we can import into R as usual:
<- read_tsv("results/1000G_subset.smiss") |>
smiss clean_names(replace = c("#" = ""))
# inspect the table
head(smiss)
# A tibble: 6 × 6
iid super_pop population missing_ct obs_ct f_miss
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 HG00097 N N 0 4467560 0
2 HG00107 N N 104 4474704 0.0000232
3 HG00108 N N 96 4474704 0.0000215
4 HG00109 N N 35707 4474704 0.00798
5 HG00110 N N 0 4467560 0
6 HG00111 N N 0 4467560 0
We can tabulate how many samples have missing genotypes:
table(smiss$missing_ct > 0)
FALSE TRUE
408 629
Around 61% of our samples have missing data in at least one of the variants.
For those samples with some missing data, we can check the distribution of the fraction of missing genotypes:
|>
smiss filter(f_miss > 0) |>
ggplot(aes(f_miss)) +
geom_histogram()
From this distribution it seems that most individuals have a low fraction of missing genotypes, indicating no problematic samples.
A filter is probably not even necessary in this case, as no sample seems to have more than 5% missing data. In any case, individuals with high rates of missing data, e.g. using the 5% threshold, can be excluded using PLINK’s option --mind 0.05
.
6.3 Heterozygosity and inbreeding
Another useful metric to assess issues with sample quality is to look at the fraction of variants that are heterozygous, or conversely what the inbreeding coefficient of each individual is. In general, we expect an individual to have both homozygous and heterozygous genotypes. Individuals with high heterozygosity may represent poor quality samples (e.g. contaminated samples composed of a mixture of DNA from two individuals). Conversely, individuals with high homozygosity may indicate inbreeding, which can impact the balance of allele frequencies in the population (as discussed in the Hardy-Weinberg equilibrium section) and potentially bias the results of our association tests.
There isn’t necessarily a clear value, but within a population we should expect the distribution of heterozygosity to be consistent across individuals. If an individual is an outlier (e.g. with too many homozygous or heterozygous sites), then we may infer some quality issues may have ocurred with that sample.
PLINK can calculate per-sample heterozygosity rates using the option --het
:
plink2 \
--pfile data/plink/1000G_subset \
--out results/1000G_subset \
--extract results/1000G_subset.prune.in \
--geno 0.05 --maf 0.01 --mind 0.05 \
--het
In this command we also apply the variant filters defined in the variant QC section. Namely, retaining genotypes and samples with a call rate of at least 95% (--geno 0.05 --mind 0.05
) and minor allele frequency of at least 1% (--maf 0.01
). And use the --extract
option to only include uncorrelated variants, obtained by LD prunning.
The output file has extension .het
, which we can import into R:
<- read_tsv("results/1000G_subset.het") |>
het clean_names(replace = c("#" = ""))
head(het)
# A tibble: 6 × 5
iid o_hom e_hom obs_ct f
<chr> <dbl> <dbl> <dbl> <dbl>
1 HG00097 511803 494932 602310 0.157
2 HG00107 509813 494932 602310 0.139
3 HG00108 510716 494932 602310 0.147
4 HG00109 504334 490975 597513 0.125
5 HG00110 513805 494932 602310 0.176
6 HG00111 511396 494932 602310 0.153
We have four values for each individual:
o_hom
is the number of observed homozygous genotypes.e_hom
is the expected number of homozygous genotypes, based on the allele frequencies in the population and under Hardy-Weinberg equilibrium (HWE).obs_ct
is the number of non-missing genotypes.f
is the inbreeding coefficient, which represent the individual heterozygosity relative to the expectation under HWE (see box below for details on its calculation).
Let’s call \(H_{obs}\) the observed fraction of heterozygote variants in an individual and \(H_{exp}\) the expected heterozygosity under Hardy-Weinberg equilibrium (HWE).
The inbreeding coefficient is usually defined as:
\[ F = 1 - \frac{H_{obs}}{H_{exp}} \]
I.e., it is the difference between the expected and observed heterozygosity, divided by the expected heterozygosity.
- A value of 0 would indicate that the individual is heterozygous at the expected rate, since \(H_{obs} = H_{exp}\).
- A value of 1 would indicate that the individual is completely homozygous, since \(H_{obs} = 0\).
- If \(H_{obs} < H_{exp}\), then the inbreeding coefficient will be between 0 and 1, indicating different levels of homozygosity or inbreeding.
- If \(H_{obs} > H_{exp}\), then the individual is more heterozygous than expected under HWE and the inbreeding coefficient takes negative values.
Rather than reporting the observed and expected heterozygosity, PLINK reports things as counts of homozygous variants. Because all of these can be calculated from each other, we can rearrange the equation above to express the inbreeding coefficient in terms of the number of homozygous genotypes.
Let’s call \(O_{hom}\) the number of observed homozygous genotypes, \(E_{exp}\) the number of expected homozygous genotypes and \(O_{total}\) the total number of variants. Then, the equation above could be written as:
\[ F = 1 - \frac{O_{total} - O_{hom}}{O_{total} - E_{hom}} \]
Which we can further rearrange to:
\[ F = \frac{O_{hom} - E_{hom}}{O_{total} - E_{hom}} \]
We can confirm this is how PLINK calculates the inbreeding value:
|>
het mutate(f2 = (o_hom - e_hom) / (obs_ct - e_hom))
# A tibble: 1,037 × 6
iid o_hom e_hom obs_ct f f2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 HG00097 511803 494932 602310 0.157 0.157
2 HG00107 509813 494932 602310 0.139 0.139
3 HG00108 510716 494932 602310 0.147 0.147
4 HG00109 504334 490975 597513 0.125 0.125
5 HG00110 513805 494932 602310 0.176 0.176
6 HG00111 511396 494932 602310 0.153 0.153
7 HG00112 504590 488807 594863 0.149 0.149
8 HG00114 510423 494932 602310 0.144 0.144
9 HG00118 508905 494932 602310 0.130 0.130
10 HG00119 511823 494932 602310 0.157 0.157
# ℹ 1,027 more rows
Let’s consider the distribution of inbreeding coefficients for the individuals in our sample. A value of 0 indicates that the individual is heterozygous at the expected rate, while a value of 1 indicates that the individual is completely homozygous. Negative values indicate that the individual is more heterozygous than expected under HWE.
We can make a histogram of the inbreeding coefficient:
|>
het ggplot(aes(f)) +
geom_histogram()
In general we can see there are no individuals with very extreme inbreeding values (e.g. close to 1). We can see some individuals with slightly negative values, indicating they are more heterozygous than expected under HWE.
The main issue we see is that the distribution is multi-modal. This indicates that we have different groups of samples, some with higher rates of inbreeding and others with lower rates.
This is likely a consequence of the fact we have individuals from different populations (geographic areas). This is an issue we will come back to in the population structure chapter.
For now, it is clear that we cannot easily filter samples based on their heterozygosity, as the populations are heterogeneous.
6.5 Discordant sex
Sorry, this section isn’t yet finished. There’s a brief explanation below, but we are still working on having data suitable to demonstrate this analysis.
Another thing that can be checked is whether the sex assigned to each of our samples is correct, based on the genetic data. This relies on having genotypes in a sex chromosome and checking whether the heterozygosity matches the expectation. For example, in humans males have one copy of the X chromosome, and females two. Therefore, all males should be homozygous for variants in the X chromosome, whereas females may have a mixture of heterozygous and homozygous sites.
We can do this quality check using PLINK’s --check-sex
option:
plink2 --pfile data/plink/1000G_subset --out results/1000G_subset --check-sex
Unfortunately this does now work in our example data, as we are missing genotypes on the X chromosome.
6.6 Summary
- Key metrics for sample-level quality control include: call rate, heterozygosity, relatedness and sex assignment.
- Call rates (
--missing
): samples with low call rates (e.g. <95% or >5% missing genotypes) are typically excluded.- To remove samples missing genotypes above a certain threshold use
--mind X
(replaceX
with the desired fraction, e.g. 0.05).
- To remove samples missing genotypes above a certain threshold use
- Heterozygosity (
--het
): samples that are outlier with regards to the number of heterozygous variants relative to other samples are typically excluded as they may represent errors (e.g. mixed samples or genotyping errors).- It is common practice to remove samples a certain number of standard deviations away from the observed distribution (e.g. +/- 3 SDs).
- However, this should only be applied to samples from relatively homogenous populations, as different sub-populations may have different distributions.
- Sample relatedness (
--king
): even when individuals are thought to be unrelated, there may be some criptic relatedness that was previously unknown. Excluding very close relatives is best-practice to avoid biasing downstream association tests.- To exlude individuals above a certain kinship score threshold use
--king-cutoff X
(replaceX
with the desired threshold). Common thresholds include 1/4 for full siblings, 1/8 for second degree relatives and 1/16 for third degree relatives.
- To exlude individuals above a certain kinship score threshold use
- Discordant sex (
--check-sex
): it is good practice to check if the sex assigned to the individual (specified in the.psam
/.fam
file) matches the sex from genetic data. This can be determined from genotypes in the sex chromosomes. Individual with discordant sex may either be corrected or removed from the analysis.