13 The nf-core/bactmap pipeline
- Summarise the steps performed by the nf-core/bactmap pipeline.
- Memorise where the output files created by the pipeline are located.
- Describe the main output files created by the pipeline.
- Interpret the output QC report generated by the pipeline and use it to make a decision about whether to exclude poor-quality samples.
- Apply the nf-core/bactmap pipeline to a set of samples.
13.1 Pipeline Overview
nf-core/bactmap is a bioinformatics analysis pipeline for mapping short reads from bacterial WGS to a reference sequence, creating filtered VCF files, making pseudogenomes based on high quality positions in the VCF files and, optionally, creating a phylogeny from an alignment of the pseudogenomes.
It runs the following tools:
- BWA index - indexes the reference FASTA file
- fastp - trims reads for quality and adapter sequences (optional)
- mash sketch - estimates genome size if not provided
- Rasusa - downsamples FASTQ files to 100X by default (optional)
- BWA mem - maps reads to the reference
- SAMtools - sorts and indexes alignments
- BCFtools - calls and filters variants
- vcf2pseudogenome.py - converts the filtered BCF to a pseudogenome FASTA
- calculate_fraction_of_non_GATC_bases.py - creates a whole genome alignment from the pseudogenomes by concatenating the FASTA files, having first checked that the sample sequences are high quality
- Gubbins - identifies recombinant regions (optional)
- SNP-sites - extracts variant sites from the whole genome alignment
- RapidNJ, FastTree2, IQ-TREE, RAxML-NG - construct a phylogenetic tree (optional)
See Course Software for a more detailed description of each tool.
Along with the outputs produced by the above tools, the pipeline produces the following summary containing results for all samples run through the pipeline:
- multiqc_report.html - final summary of trimming and mapping statistics for all input files, in HTML format
13.2 Running nf-core/bactmap
The bactmap pipeline requires a samplesheet CSV file in the same format as the one we used for bacQC, so we can re-use it here. If you decided to remove any samples because they didn’t pass QC, edit the samplesheet accordingly. There are many options that can be used to customise the pipeline, but a typical command is shown below:
nextflow run nf-core/bactmap \
-profile singularity \
--max_memory '16.GB' --max_cpus 8 \
--input SAMPLESHEET \
--outdir results/bactmap \
--reference REFERENCE \
--genome_size 4.3M
The options we used are:
- -profile singularity - indicates we want to use the Singularity program to manage all the software required by the pipeline (another option is to use docker). See Data & Setup for details about their installation.
- --max_memory and --max_cpus - set the available RAM and CPUs. You can check these with the commands free -h and nproc --all, respectively.
- --input - the samplesheet with the input files, as explained above.
- --outdir - the output directory for the results.
- --reference - the path and name of the reference genome.
- --genome_size - the estimated size of the genome; Rasusa uses this value to calculate the genome coverage.
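As a reminder, the samplesheet is a plain CSV file with one row per sample pointing at its FASTQ files. A minimal example is shown below; the column names (sample, fastq_1, fastq_2) follow the usual nf-core convention, and the sample IDs and file paths are only illustrative, so replace them with your own:

sample,fastq_1,fastq_2
ERX9450498_ERR9907670,data/reads/ERR9907670_1.fastq.gz,data/reads/ERR9907670_2.fastq.gz
ERX9450499_ERR9907671,data/reads/ERR9907671_1.fastq.gz,data/reads/ERR9907671_2.fastq.gz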
Your next task is to run the bactmap pipeline on your data. In the folder scripts (within your analysis directory) you will find a script named 03-run_bactmap.sh. This script contains the code to run bactmap.
Edit this script, adjusting it to fit your input files and the name and location of the reference you’re going to map to:
- use the same samplesheet as before with the avantonder/bacQC pipeline.
- the reference genome is located in resources/reference.
Then activate the Nextflow software environment (mamba activate nextflow) and run the script using bash scripts/03-run_bactmap.sh.
If the script is running successfully it should start printing the progress of each job in the bactmap pipeline. The pipeline will take a while to run.
The fixed script is:
#!/bin/bash
nextflow run nf-core/bactmap \
-profile singularity \
--max_memory '16.GB' --max_cpus 8 \
--input samplesheet.csv \
--outdir results/bactmap \
--reference resources/reference/MTBC0.fasta \
--genome_size 4.3M
We ran the script as instructed using:
bash scripts/03-run_bactmap.sh
While it was running it printed a message on the screen:
N E X T F L O W ~ version 23.04.1
Launching `https://github.com/nf-core/bactmap` [cranky_swartz] DSL2 - revision: e83f8c5f0e [master]
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/bactmap v1.0.0
------------------------------------------------------
13.3 bactmap results
After running the pipeline, we can look at the output directory in results/bactmap (if your pipeline finished running); alternatively, you can use the preprocessed/bactmap results. There are various directories containing output files:
| Directory | Description |
|---|---|
| bwa/index | Contains the index of the reference sequence |
| fastp | Contains the results of the trimming and adapter removal performed by fastp |
| fastqc | Contains QC metrics for the FASTQ files generated with FastQC |
| multiqc | Contains an HTML file summarising the various outputs |
| pipeline_info | Contains information about the pipeline run |
| pseudogenomes | Contains consensus FASTA files for each sample, which include the sample variants relative to the reference. The alignment we’ll use for the next step can also be found in this directory (aligned_pseudogenomes.fas) |
| rasusa | Contains the subsampled post-trimmed FASTQ files |
| samtools | Contains the sorted BAM files and indices created by bwa and samtools as part of the mapping process |
| snpsites | Contains a variant alignment file created from aligned_pseudogenomes.fas with snp-sites that can be used as input for tree inference tools |
| variants | Contains the filtered VCF files with the variants called for each sample |
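If your own pipeline run completed, you can quickly confirm that these directories were created by listing the output folder from the terminal (a simple optional check):

ls results/bactmap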
13.3.1 The MultiQC summary report
The first thing we’ll check is the HTML report file created by MultiQC. Open your File Explorer, navigate to preprocessed/bactmap/multiqc/ and double-click on multiqc_report.html. This will open the file in your web browser of choice.
General statistics
Let’s go through each section starting with the “General Statistics”:
This is a compilation of statistics collected from the outputs of tools such as fastp, samtools and BCFtools. Sequencing metrics such as the % of duplicated reads and the GC content of the reads are shown alongside the results of the mapping (e.g. the % and number of reads mapped). This is a useful way of quickly identifying samples that are of lower quality or perhaps didn’t map very well due to species contamination.
fastp
There are a number of plots showing the results of the fastp step in the pipeline. These plots are explained in The bacQC pipeline.
Samtools
The plots in this section are created from the results of running samtools stats on the sorted BAM files produced during the mapping process. The first plot shows the number or percentage of reads that mapped to the reference.
The second plot shows the overall alignment metrics for each sample. Hover over each dot to see more detailed information.
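If you want to look at these numbers outside of MultiQC, you can run samtools stats directly on one of the sorted BAM files yourself. The command below is a sketch; the exact BAM file name in the samtools directory may differ in your run:

# print the summary numbers (reads mapped, error rate, average quality, etc.) for one sample
samtools stats results/bactmap/samtools/ERX9450498_ERR9907670.sorted.bam | grep ^SN | cut -f 2-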
BCFtools
The plots in this section provide information about the variants called using bcftools. The first plot shows the number or percentage of each type of variant in each sample.
The second plot shows the quality of each variant called by bcftools. The majority of variants in each sample are high quality.
The third plot shows the distribution of lengths of Indels (insertions are positive values and deletions are negative values). This is useful information to have, but in practice we tend to exclude indels when building alignments for phylogenetic tree building.
The final bcftools plot shows the distribution of the number of reads mapping to each variant position and is one of the metrics used to filter out low quality variants (the fewer the reads mapping to a variant position, the lower the confidence we have that the variant is in fact real).
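These plots are typically generated from the output of bcftools stats, so you can also inspect the raw numbers for a single sample directly. The command below is a sketch; the exact name of the filtered VCF in the variants directory may differ in your run:

# per-sample variant statistics: counts by type, quality distribution and depth distribution
bcftools stats results/bactmap/variants/ERX9450498_ERR9907670.filtered.vcf.gz | less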
Software versions
This section of the report shows the software run as part of nf-core/bactmap and the versions used. This is particularly important when reproducing the analysis on a different system or when writing the methods section of a paper.
13.4 Check how much of the reference was mapped
It’s good practice to do an additional check of our mapping results before proceeding to the next step of our analysis, phylogenetic tree inference. We can do this by checking how much of the reference genome was mapped in each sample, using a tool called seqtk. If a position in the reference genome is not found in the sample it is marked as an N, and any regions of the reference that were not mapped to sufficient quality are marked as -. So we use seqtk comp to count the number of A’s, C’s, G’s and T’s, sum the totals and then divide by the length of the reference sequence. This gives us the proportion (which we can convert to a % by multiplying by 100) of the reference sequence that was mapped in each sample. Ideally, we would like to see more than 90% of the reference mapped, but this will depend on the species and how close the reference is to the samples you’re analysing. However, anything more than 75% should be sufficient for most applications. The main impact of less mapping is fewer phylogenetically informative positions for constructing phylogenetic trees.
We’ll start by activating the seqtk software environment to make seqtk available:
mamba activate seqtk
To run seqtk comp on a single sample (in this example we’ll analyse ERX9450498_ERR9907670), the following commands can be used:
# create output directory
mkdir -p results/bactmap/pseudogenomes_check
# run seqtk comp
seqtk comp results/bactmap/pseudogenomes/ERX9450498_ERR9907670.fas > results/bactmap/pseudogenomes_check/ERX9450498_ERR9907670.tsv
If you open the output file ERX9450498_ERR9907670.tsv in the terminal with cat:
cat results/bactmap/pseudogenomes_check/ERX9450498_ERR9907670.tsv
You should see output like this:
ERX9450498_ERR9907670 4435783 715550 1348517 1343684 715401 0 0 312631 1040588 0 0 0
There are no headers in the output but the important information is contained in the first six columns:
- Column 1 - our sample ID.
- Column 2 - the total length of the sequence (this is the length of the reference).
- Column 3 - the total number of ’A’s in the sequence.
- Column 4 - the total number of ’C’s in the sequence.
- Column 5 - the total number of ’G’s in the sequence.
- Column 6 - the total number of ’T’s in the sequence.
To calculate the percentage of the reference mapped, we take the sum of ’A’s, ’C’s, ’G’s and ’T’s, divide it by the total length of the sequence and multiply by 100:
(715550+1348517+1343684+715401)/4435783 * 100 = 92.95%
This is more than 90% so we can proceed with the analysis of this sample.
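If you prefer not to do the arithmetic by hand, the same calculation can be done directly from the seqtk comp output with a small awk one-liner. This is just a convenience sketch, assuming the column order described above:

# sum columns 3-6 (A, C, G, T), divide by column 2 (sequence length) and convert to a percentage
awk '{printf "%s\t%.2f\n", $1, ($3+$4+$5+$6)/$2*100}' results/bactmap/pseudogenomes_check/ERX9450498_ERR9907670.tsv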
We have calculated the percentage of the reference mapped for a single sample. However, we have five samples that we need to repeat the analysis on. To do this, we’ve provided a script that runs seqtk comp on all the samples in the pseudogenomes directory using a for loop.
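To give you an idea of what that script does, the core of such a loop might look roughly like the sketch below. This is a simplified illustration rather than the exact contents of 04-pseudogenome_check.sh (the real script also uses scripts/seqtk_parser.py to combine the per-sample results into mapping_summary.tsv):

# run seqtk comp on every per-sample pseudogenome FASTA file
mkdir -p results/bactmap/pseudogenomes_check
for fasta in results/bactmap/pseudogenomes/*.fas
do
  # skip the concatenated alignment; we only want the individual pseudogenomes
  [ "$(basename "$fasta")" = "aligned_pseudogenomes.fas" ] && continue
  sample=$(basename "$fasta" .fas)
  echo "Processing ${sample}.fas"
  seqtk comp "$fasta" > results/bactmap/pseudogenomes_check/"${sample}".tsv
done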
- In the folder scripts (inside your analysis directory), you’ll find a script named 04-pseudogenome_check.sh.
- Open the script, which you will notice is composed of two sections:
  - #### Settings #### - where we define some variables for the input and output file names. If you were running this script on your own data, you may want to edit the directories in this section.
  - #### Analysis #### - this is where seqtk comp is run on each sample as detailed in Section 13.4. You should not change the code in this section, although examining it is a good way to learn about running a for loop.
- Activate the software environment: mamba activate seqtk.
- Run the script with bash scripts/04-pseudogenome_check.sh. If the script is running successfully it should print a message on the screen as the samples are processed.
- Once the analysis finishes, open the mapping_summary.tsv file in Excel from your file browser.
- Sort the results by the %ref mapped column and identify the sample which has the lowest percentage of the reference mapped.
We opened the script 04-pseudogenome_check.sh and these are the settings we used:
- fasta_dir="results/bactmap/pseudogenomes" - the name of the directory containing the pseudogenomes produced by bactmap.
- outdir="results/bactmap/pseudogenomes_check" - the name of the directory where we want to save our results.
- parser="scripts/seqtk_parser.py" - the path to a Python script that takes the seqtk TSV files as input and does the calculation we performed above for all the samples.
We then ran the script using bash scripts/04-pseudogenome_check.sh. The script prints a message while it’s running:
Processing ERX9450498_ERR9907670.fas
Processing ERX9450499_ERR9907671.fas
Processing ERX9450502_ERR9907674.fas
...
We opened the mapping_summary.tsv file in Excel and sorted the %ref mapped column in ascending order to identify which sample had the lowest percentage of the reference mapped.
| sample | ref_length | #A | #C | #G | #T | mapped | %ref mapped |
|---|---|---|---|---|---|---|---|
| ERX9450498_ERR9907670 | 4435783 | 715550 | 1348517 | 1343684 | 715401 | 4123152 | 92.95206731258044 |
| ERX9450499_ERR9907671 | 4435783 | 711436 | 1338950 | 1334043 | 711328 | 4095757 | 92.33447623564994 |
| ERX9450502_ERR9907674 | 4435783 | 726446 | 1373546 | 1368600 | 726342 | 4194934 | 94.57031599607105 |
| ERX9450504_ERR9907676 | 4435783 | 728780 | 1377603 | 1372394 | 728397 | 4207174 | 94.84625375046525 |
| ERX9450506_ERR9907678 | 4435783 | 726076 | 1376188 | 1370635 | 726568 | 4199467 | 94.67250764972046 |
We can see that ERX9450520_ERR9907692 only mapped to 79.9% of the reference. Whilst this is less than the other samples, it’s still above 75%, so we’ll include it for now.
13.5 Create final alignment
Now that we’ve mapped our sequence data to the ancestral reference, called and filtered variants, and created consensus pseudogenomes that we checked, we can create the final alignment we will use for inferring a phylogenetic tree. As we are not excluding any samples based on the pseudogenome check we did above, we can use the aligned_pseudogenomes.fas file that was created by bactmap (it’s worth remembering that this alignment includes the reference as well as the pseudogenomes). If any of the pseudogenomes had contained more than 25% missing data and been removed, we could create our final alignment with cat as described below.
13.5.1 Building a final alignment from pseudogenome FASTA files
One of the advantages of working with pseudogenome FASTA files is that the files are all the same length, i.e. the length of the reference. This means that they are effectively already aligned, so we don’t need to do any additional aligning like we might do with gene sequences from different isolates. If you need to create a final alignment from the pseudogenome files, it’s as simple as using cat:
First create a tmp directory and move the pseudogenome files you want to include to the tmp directory:
mkdir tmp
mv *.fas tmp
Now change to the tmp directory and cat the pseudogenome files to create the alignment:
cd tmp
cat *.fas > aligned_pseudogenomes.fas
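Because this only works if every pseudogenome really is the same length as the reference, a quick sanity check before concatenating can be useful. The loop below is an optional sketch, assuming the seqtk environment is still active and you are inside the tmp directory:

# print each file name followed by its total sequence length; all lengths should be identical
for fasta in *.fas
do
  echo -n "$fasta "
  seqtk comp "$fasta" | awk '{sum += $2} END {print sum}'
done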
13.6 Mask final alignment
It’s standard practice to mask certain regions of the TB genome when we build alignments for phylogenetic tree building. In particular, repetitive regions of the genome such as PE/PPE genes and regions with an abundance of SNPs are removed, as they are often difficult for the mapping software to align accurately and may cause misalignments, leading to errors in variant calling which may in turn lead to inaccurate tree topologies. There are two ways to do the masking:
- Mask the regions in the VCF files before creating the pseudogenomes.
- Create the final alignment then apply the masking to all the samples at once.
As we’ve already created our final alignment, we’re going to take the second approach and apply the mask to the aligned_pseudogenomes.fas file. The coordinates we’re going to use come from Goig et al. 2020 and have been transferred to the new ancestral reference sequence. We’ll use a tool called remove_blocks_from_aln.py to do the masking.
Let’s start by activating the remove_blocks environment:
mamba activate remove_blocks
To run remove_blocks_from_aln.py on aligned_pseudogenomes.fas, the following commands can be used:
# create output directory
mkdir -p results/bactmap/masked_alignment
# remove_blocks_from_aln.py
remove_blocks_from_aln.py \
  -a results/bactmap/pseudogenomes/aligned_pseudogenomes.fas \
  -t resources/masking/MTBC0_Goigetal_regions_toDiscard.bed \
  -o results/bactmap/masked_alignment/aligned_pseudogenomes_masked.fas
The options we used are:
- -a - the alignment file we want to mask.
- -t - a BED file containing the coordinates of the regions we wish to mask.
- -o - the name of the masked alignment file.
You should see output like this:
Found 1024 regions
Adjusted 51 sequences
Original alignment length: 4435783 New alignment length:4435783
Done.
The masked final alignment will be saved to the results/bactmap/masked_alignment/ directory.
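As an optional sanity check, you can confirm that the masked alignment still contains all of your samples plus the reference (the sequence count should equal the number of samples + 1):

# count the sequences in the masked alignment
grep -c ">" results/bactmap/masked_alignment/aligned_pseudogenomes_masked.fas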
Alternatively, we’ve provided a script, 05-mask_pseudogenome.sh, in the scripts directory which could be used instead with bash:
bash scripts/05-mask_pseudogenome.sh
13.7 Summary
- The nf-core/bactmap pipeline automates several steps to generate reference-based consensus genomes, such as: indexing the reference genome; quality-trimming reads; mapping the reads to the reference; calling variants; and generating the final consensus genomes.
- Several key output files include:
  - An interactive quality report located in multiqc/multiqc_report.html.
  - FASTA files for each new pseudogenome in pseudogenomes/SAMPLE-NAME.fas.
  - A single FASTA file with all the genomes concatenated in pseudogenomes/aligned_pseudogenomes.fas.
- Key quality control statistics output by the pipeline include the % of reads mapped to the reference and the number and type of called variants, amongst others.
- Additional steps not included in the pipeline are:
  - Calculating the fraction of missing bases in each pseudogenome.
  - Masking repetitive regions of the genome.
References
Goig G, et al. Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability. BMC Biology. 2020. DOI