5 Quality Control
- Interpret and critically evaluate data quality reports from the assembled sequences and identify sequences suitable for downstream analyses.
5.1 Output Files
After running our pipeline, we get several output directories and files. The directories we get depend on which version of the workflow was used (Illumina or Nanopore). The description of the output is detailed in the pipeline documentation. Although there are many output files, most of these contain results that are aggregated in an interactive MultiQC report, which makes their analysis easier. We highlight some of the main files of interest below.
- The file
multiqc/multiqc_report.html
contains a MultiQC quality and analysis report for the consensus assemblies generated by the pipeline. - The folder
variants/bowtie2/
contains individual BAM files, which can be visualised with IGV if we want to look at the reads mapped to the reference genome. - The folder
variants/ivar/consensus/bcftools
contains individual FASTA files (named*.consensus.fa
) for each sample’s consensus genome sequence. - The file
variants/ivar/variants_long_table.csv
contains a table with the aggregated results of all the variants detected in each sample.
- The file
multiqc/medaka/multiqc_report.html
contains a MultiQC quality and analysis report for the consensus assemblies generated by the pipeline. - The folder
medaka/
contains:- Individual BAM files (named
*.primertrimmed.rg.sorted.bam
), which can be visualised with IGV if we want to look at the reads mapped to the reference genome. - Individual FASTA files (named
*.consensus.fasta
) for each sample’s consensus genome sequence. - A file called
variants_long_table.csv
with a table of all the variants detected in each sample.
- Individual BAM files (named
5.2 Quality Control
The viralrecon
pipeline produces many quality control metrics, which are conveniently compiled in an interactive report with MultiQC, as mentioned above. We will not detail here every section of the report (check the pipeline documentation for a full description), but only highlight some of the sections that can be used for a first assessment of the quality of our samples.
5.2.1 Variant Calling Metrics
The first section or the report - Variant Calling Metrics - contains a summary table with several statistics for each sample, including the total number of reads, the number/percentage of reads mapped to the reference genome, the depth of coverage, the number of SNPs (single-nucleotide polymorphisms) and INDELs (insertions/deletions), the number of missense variants (mutations that result in an aminoacid change) and the fraction of ambiguous bases ‘N’ per 100kb.
Looking at these basic metrics gives us a good first idea of whether some of the samples have a high fraction of missing bases (marked as N
in the FASTA file), leading to a poorer assembly. We can quickly check this by sorting the table by the column named “# Ns per 100kb consensus” (you can convert the values in this column to a percentage by dividing the numbers by 100). There is also the ability to produce simple scatterplots from the data in this table, which can be useful to look at the relationship between the different metrics (see an example in the figure below).
This table also contains information about the lineage/clade assigned to each sample by the programs Pangolin and Nextclade. This gives us an idea of which samples may be more similar to each other, and where they fit in the global context of other sequences available publicly. We will talk more about this topic in the lineage assignment section.
The word coverage is sometimes used in an ambiguous way by bioinformaticians. It can mean two things:
- The number of reads aligning to a position of the genome. This is sometimes called sequencing depth or depth of coverage (we will use these terms in the materials to avoid confusion). For example, we may say that the average depth of coverage of the genome is 20x, meaning that on average there are 20 reads aligned to a position of the genome.
- The fraction of a genome that has a sufficient number of reads for analysis. For example, if 90% of the genome has a sufficient number of reads to be analysed (for example, at a threshold of 10x), we would say it has 90% coverage (we “covered” 90% of the genome with enough data). In the context of genome assembly, we sometimes use the word “coverage” to refer to the percentage of the consensus genome without ambiguous ‘N’ bases.
In the viralrecon
MultiQC report the word “coverage” is used to mean “depth of coverage”.
5.2.2 Amplicon Depth of Coverage
The next section of the report - Amplicon Coverage Heatmap - contains a graphical representation of the average depth of coverage for each PCR amplicon in the ARTIC protocol (i.e. the average number of reads mapped to each amplicon interval). This plot is extremely useful to identify common PCR dropouts occurring across many samples. This may be an indication that our PCR primer scheme is not working for some of the forms of the virus circulating in our population (for example due to mutations that accumulate in the primer site).
We can investigate primer dropout in more detail, for example by looking at the BAM files with mapped reads using IGV.
From our heatmap (shown in the Figure above) we can see one of the PCR fragments - ncov-2019_83 - seems to have poor amplification across many of our samples. Let’s investigate this in more detail by looking at the alignment file:
- Open IGV and go to File → Load from file….
- In the file browser that opens go to the folder
results/viralrecon/variants/bowtie2/
and select the fileGB36.ivar_trim.sorted.bam
to open it. - Go back to File → Load from file… and this time load the BED files containing the primer locations. These can be found in
resources/primers/artic_primers_pool1.bed
andresources/primers/artic_primers_pool2.bed
.
Once you have IGV open, you can navigate to the region where this amplicon is located by searching for the name of one of the primers - “ncov-2019_83_LEFT” - in the search box at the top. By zooming in to the region where this primer is located, we can see there is a mutation right at the start of this primer, which suggests that this may be the reason why this PCR fragment fails to amplify in this sample.
5.3 Mutation/Variant Analysis
One of the output files produced by the pipeline is a table SNP and indel variants detected in our samples saved as a CSV file in variants/ivar/variants_long_table.csv
. This table can be useful to investigate if particular mutations are particularly frequent in our samples and what their likely effects are.
The columns in this table are:
SAMPLE
is the name of our sequenced sample.CHROM
is the name of the “chromosome”, which in our case is just the name of the reference genome (“MN908947.3”).POS
is the position of the genome where the mutation occurs.REF
is the reference nucleotide.ALT
is the alternative nucleotide, that is the nucleotide that was observed in our sample.FILTER
indicates whether the SNP passed quality filters from the variant calling software (“PASS”) or whether some other issue was observed (for example “dp” means that the depth of sequencing was unusually high or low).DP
is the total depth of sequencing at this position, meaning how many reads in total were aligned there.REF_DP
is the depth of sequencing of the reference allele, meaning how many reads contained the reference nucleotide.ALT_DP
is the depth of sequencing of the alternative allele, meaning how many reads contained the alternative nucleotide.AF
is the allele frequence of the alternative allele, meaning the proportion of reads that contained the alternative nucleotide (this column is equivalent toALT_DP
/(ALT_DP
+REF_DP
)).GENE
is the name of the gene in the SARS-CoV-2 annotation.EFFECT
this is the predicted effect of the mutation in the gene. This output comes from thesnpeff
software and uses The Sequence Ontology nomenclature (follow the link to search for each term).HGVS_C
,HGVS_P
andHGVS_P_1LETTER
is the DNA or amino acid change using HGVS nomenclature. For example, “c.1112C>T” means a C changed to a T at position 1112 of the genome; and “p.Pro371Leu” would mean that a Proline changed to a Leucine at position 371 of the respective protein.CALLER
is the software used for variant calling.LINEAGE
is the Pangolin lineage that the sample was assigned to. Note that this column should usually be ignored, sinceviralrecon
doesn’t use the latest Pangolin data version by default.
This table can be opened in a spreadsheet program such as Excel to look at particular features. In terms of quality-control, we can filter the table:
- for mutations with intermediate allele frequency (say < 70%) – if a sample has too many such mutations, that could indicate cross-contamination between samples.
- for frameshift mutations – these mutations should be rare because they are highly disruptive to the functioning of the virus, and their occurrence is more often due to errors. The presence of these mutations are usually not critical for downstream analysis (such as lineage assignment and phylogenetics), but we should make a note that these mutations may be false-positives in our data.
This variants/mutations table can also be used to explore reasons for amplicon dropout. For example, we can identify how many samples contain the mutation we previously saw in the “ncov-2019_83_LEFT” primer (we saw this was in position 25,003 of the genome). We can do this by sorting our table by the column “POS” (position) and then scrolling down to find the samples with this mutation. We will find that only 3 of the samples contain this mutation, suggesting some other additional causes are leading to the dropout in this PCR fragment.
Finally, for the Illumina pipeline (--platform illumina
), it is important to note that not all of the mutations on this table are included in the final consensus. Only mutations with allele frequency >75% (AF
>= 0.75) and minimum depth of 10 (DP
>= 10) are retained in the final consensus. Therefore, to see which mutations are actually present in our consensus sequences, we need to filter this table for these criteria. Again, we note that this only applies to the Illumina pipeline. For the Nanopore pipeline (--platform nanopore
) all the mutations included in this table are retained in the consensus sequence.
The annotation of the ORF1b gene had some changes between the first original assembly and newer versions of this annotation. In particular, there is a frameshift that is not considered in the annotation used by most software, and this causes a discrepancy in the results between viralrecon
and other tools such as Nextclade (which we will cover in the next section).
This issue is under investigation by the developers of viralrecon
and may be corrected in future versions of the pipeline.
For now, the advice is to ignore the variant effects of ORF1b as these correspond to an older version of the annotated gene.
When you analyse the results of a CSV file in a spreadsheet software (such as Excel), it may be a good idea to create a copy of your file to avoid accidentally modifying the original table. For example, if you accidentally sort one column of the table but forget to sort other columns, of if you accidentally filter the results and delete some of the rows from the table, etc.
You can save a copy of the file by going to File → Save As…. You may want to save the copy as an Excel file format, to include graphs and colours. (Remember that the CSV format is a plain text file - it does not support graphs or coloured cells.)
5.4 Cleaning FASTA Files (Optional)
To proceed with our analysis, we need a FASTA file containing all of our consensus sequences. However, our viralrecon
pipeline outputs separate FASTA files for each sample. We can see this by running (from within the uk_illumina/
directory):
ls results/viralrecon/variants/ivar/consensus/bcftools/
Also, the workflow modifies our original sample names in the FASTA file, by adding extra information to the sequence name. For example:
head -n 1 results/viralrecon/variants/ivar/consensus/bcftools/ERR5728910.consensus.fa
>ERR5728910 MN908947.3
What we want to do is clean these sample names, so that we end up with:
>ERR5728910
We also want to make sure to combine all the samples into a single FASTA file.
We can the command cat
to combine (concatenate) the individual files and the sed
command to substitute text and clean our sample names. Let’s do this step by step.
First, we can use the *
wildcard to combine all the FASTA files with the cat
command:
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa
Running this command will print all of the sequences on the screen! To see what happened a little better, we could pipe this command to less
to browse up-and-down through the file:
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | less
We could also check that we now have all our samples combined, we could pass the results to grep
and search for the word >
, which in the FASTA format indicates the sequence name:
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | grep ">" | wc -l
This should give us 7 as the result (which makes sense, since we have 7 samples).
We can now proceed with cleaning the names of the sequences, by using sed
:
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | sed 's| MN908947.3||' > results/clean_sequences.fa
Notice that in this last command we make sure to redirect the result to a new file using >
.
5.5 Missing data intervals
When we align reads to the reference genome, there may be regions that we were not able to sequence (e.g. due to amplicon dropout, discussed above) and therefore cannot tell what the base sequence in those positions is. In those cases, the viralrecon
pipeline includes the missing character ‘N’ in the sequence.
As a further quality check, it is useful to check how many contiguous intervals of missing bases we have in each of our assemblies, as well as how long those intervals are. For example, for this small sequence:
T A N N N G C T N N A T
We have two missing intervals, from positions 3-5 and from positions 9-10.
To obtain a list of missing intervals in a sequence, we can use the software seqkit
, which is a toolkit of commands to do many operations on FASTA/FASTQ files (check out its documentation). In particular, we can use the following command:
seqkit locate -i -P -G -M -r -p "N+" results/clean_sequences.fa
seqID patternName pattern strand start end
GB09 N+ N+ + 1 342
GB09 N+ N+ + 9246 9502
GB09 N+ N+ + 10738 11331
GB09 N+ N+ + 21428 21543
GB09 N+ N+ + 27400 27462
GB09 N+ N+ + 27618 27644
GB09 N+ N+ + 29827 29893
... more output omitted...
The output is a tabular file specifying the consensus sequence name (first column) and the location of intervals in the genome where the pattern of one or more “N” were found.
To see what all the options we used with this command are, see the tool’s documentation.
5.6 Exercises
5.7 Summary
- The output of the pipeline includes, among others:
- A detailed MultiQC report, including information about genome coverage, the depth of sequencing for each amplicon, as well as other quality metrics that can help to troubleshoot problematic samples.
- A table with SNP and indel mutation variants identified in each sample.
- Information about SARS-CoV-2 lineages/clades/variants, which is detailed in the next section.