10  Introduction to QC

Learning Objectives
  • Assess the quality of sequencing data based on common quality metrics and graphs.
  • Interpret and critically evaluate data quality reports.
  • Recognise why contamination poses a problem in bacterial genomics applications.

10.1 Introduction

Before we look at our genomic data, lets take time to explore what to look out for when performing Quality Control (QC) checks on our sequence data. For this course, we will largely focus on next generation sequences obtained from Illumina sequencers. As you may already know from Introduction to NGS, the main output files expected from our Illumina sequencer are FASTQ files.

10.2 QC assessment of NGS data

QC is an important part of any analysis and, in this section, we’re going to look at some of the metrics and graphs that can be used to assess the QC of NGS data.

10.2.1 Base quality

Illumina sequencing technology relies on sequencing by synthesis. One of the most common problems with this is dephasing. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or perhaps misses to incorporate one at all. The more cycles that are run (i.e. the longer the read length gets), the greater the accumulation of these types of errors gets. This leads to a heterogeneous population in the cluster, and a decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

Base Quality

Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality reads. In this course, we’ll do this using the tool fastp. In addition to trimming and removing low quality reads, fastp will also be used to trim Illumina adapter/primer sequences.

The figures below show an example of high-quality read data (left) and poor quality read data (right).

High-quality read data

Poor quality read data

In addition to Phasing noise and signal decay resulting from dephasing issues described above, there are several different reasons for a base to be called incorrectly. You can lookup these later by clicking here.

10.2.2 Mismatches per cycle

Aligning reads to a high-quality reference genome can provide insights into the quality of a sequencing run by showing you the mismatches to the reference sequence. In particular, this can help you detect cycle-specific errors. Mismatches can occur due to two main causes: sequencing errors and differences between your sample and the reference genome; this is important to bear in mind when interpreting mismatch graphs. The figures below show an example of a good run and a bad one. In the first figure, the distribution of the number of mismatches is even between the cycles, which is what we would expect from a good run. However, in the second figure, two cycles stand out with a lot of mismatches compared to the other cycles.

Good run

Poor run

10.2.3 GC bias

It is a good idea to compare the GC content of the reads against the expected distribution in a reference sequence. The GC content varies between species, so a shift in GC content like the one seen below (right image) could be an indication of sample contamination. In the left image below, we can see that the GC content of the sample is about the same as for the theoretical reference, at ~65%. However, in the right figure, the GC content of the sample shows two distributions: one is closer to 40% and the other closer to 65%, indicating that there is an issue with this sample, likely contamination.

Single GC distribution

Double GC distribution

10.2.4 GC content by cycle

Looking at the GC content per cycle can help detect if the adapter sequence was trimmed. For a random library, there is expected to be little to no difference between the different bases of a sequence run, so the lines in this plot should be parallel with each other like in the first of the two figures below. In the second of the figures, the initial spikes are likely due to adapter sequences that have not been removed.

Good run

Poor run

10.2.5 Insert size

For paired-end sequencing the size of DNA fragments also matters. In the first of the examples below, the insert size peaks around 440 bp. In the second however, there is also a peak at around 200 bp. This indicates that there was an issue with the fragment size selection during library prep.

Good run

Poor run

10.2.6 Insertions/Deletions per cycle

Sometimes, air bubbles occur in the flow cell, and this can manifest as false indels. The spike in the second image provides an example of how this can look.

Good run

Poor run

10.3 Assessment of species composition

Understanding the species composition of sequence data is crucial for the accuracy and reliability of bioinformatics analyses, especially in the context of de novo genome assembly and metagenomics. In particular, for de novo genome assembly, knowing the species present in a sample can help identify and filter out contaminant sequences that do not belong to the target organism, improving the quality of the assembly. An abundance of non-target sequences also means fewer reads belonging to the target species leading to lower coverage when mapping these reads to a reference genome.

10.4 Summary

Key Points
  • Common metrics to assess the quality of raw sequencing data include: base quality, mismatches per cycle, GC bias, GC content per cycle, insert size and indels per cycle.
  • Contamination of sequencing data with other organisms is problematic for applications such as de novo genome assembly.
  • Screening the sequencing data for known species can help to remove potential contaminants.

10.5 References

Information on this page has been adapted and modified from the following sources:

  • https://github.com/sanger-pathogens/QC-training

  • https://www.bioinformatics.babraham.ac.uk/projects/fastqc/