26 Panaroo
- Use Panaroo to generate a core gene alignment.
- Produce a phylogenetic tree from a core gene alignment.
26.1 Core genome alignment generation with Panaroo
The software Panaroo
was developed to analyse bacterial pan-genomes. It is able to identify orthologous sequences between a set of sequences, which it uses to produce a multiple sequence alignment of the core genome. The output alignment it produces can then be used to build our phylogenetic tree in the next step.
As input to Panaroo
we will use the gene annotations for our newly assembled genomes, which were produced by the assembleBAC pipeline using Bakta
.
First, let’s activate the panaroo
software environment:
mamba activate panaroo
To run Panaroo
on our samples we can use the following commands:
# create output directory
mkdir results/panaroo
# run panaroo
panaroo \
--input results/assemblebac/annotation/*.gff3 \
--out_dir results/panaroo \
--clean-mode strict \
--alignment core \
--core_threshold 0.98 \
--remove-invalid-genes \
--threads 8
The options used are:
--input
- all the input annotation files, in thePanaroo
-compatible GFF3 format. Notice how we used the*
wildcard to match all the files in the folder.--out_dir
- the output directory we want to save the results into.--clean-mode
- determines the stringency ofPanaroo
in including genes within its pan-genome graph for gene clustering and core gene identification. The available modes are ‘strict’, ‘moderate’, and ‘sensitive’. These modes balance eliminating probable contaminants against preserving valid annotations like infrequent plasmids. In our case we used ‘strict’ mode, as we are interested in building a core gene alignment for phylogenetics, so including rare plasmids is less important for our downstream task.--alignment
- whether we want to produce an alignment of core genes or all genes (pangenome alignment). In our case we want to only consider the core genes, to build a phylogeny.--core_threshold
- the fraction of input genomes where a gene has to be found to be considered a “core gene”. In our case we’ve set this to a very high value, to ensure most of our samples have the gene.--remove-invalid-genes
- this is recommended to remove annotations that are incompatible with the annotation format expected byPanaroo
.--threads
- how many CPUs we want to use for parallel computations.
Panaroo
can takle a long time to run, so be prepared to wait a while for its analysis to finish .
Once it finishes, we can see the output it produces:
ls results/panaroo
aligned_gene_sequences/ core_alignment_header.embl gene_presence_absence_roary.csv
alignment_entropy.csv core_gene_alignment.aln pan_genome_reference.fa
combined_DNA_CDS.fasta core_gene_alignment_filtered.aln pre_filt_graph.gml
combined_protein_CDS.fasta final_graph.gml struct_presence_absence.Rtab
combined_protein_cdhit_out.txt gene_data.csv summary_statistics.txt
combined_protein_cdhit_out.txt.clstr gene_presence_absence.Rtab
core_alignment_filtered_header.embl gene_presence_absence.csv
There are several output files generated, which can be generated for more advanced analysis and visualisation (see Panaroo
documentation for details). For our purpose of creating a phylogeny from the core genome alignment, we need the file core_gene_alignment_filtered.aln
, which is a file in FASTA format. We can take a quick look at this file:
head results/panaroo/core_gene_alignment_filtered.aln
>ERX3876945_ERR3864892_T1
atgcaacaatcagacgtcattagtgctgccaaaaaatatatggaatctattcatcaaaat
gattatacaggccatgatattgcgcatgtatatcgtgtcactgctttagctaaatcaatc
gctgaaaatgaaggtgttaatgatactttagtcattgaactcgcatgtttgcttcatgat
accgttgacgaaaaagttgtagatgctaacaaacaatatgttgaattgaagtcattttta
tcttctttatcactatcaaccgaagatcaagagcacattttatttattattaataatatg
agctatcgcaatggcaaaaatgatcatgtcactttatctttagaaggtcaaattgtcagg
gatgcagatcgtcttgatgctataggcgctataggtgttgcacgaacatttcaatttgca
ggacactttggtgaaccaatgtggacagaacatatgtcactagataagattaatgatgat
ttagttgaacagttgccaccatctgcaattaagcatttctttgaaaaattacttaagtta
We can see this contains a sequence named “ERX3876945_ERR3864892_T1”, which corresponds to one of the genomes we included in our dataset. We can look at all the sequence names in the FASTA file:
grep ">" results/panaroo/core_gene_alignment_filtered.aln
>ERX3876945_ERR3864892_T1
>ERX3876948_ERR3864895_T1
>ERX3876909_ERR3864856_T1
>ERX3876942_ERR3864889_T1
>ERX3876935_ERR3864882_T1
>ERX3876905_ERR3864852_T1
>ERX3876940_ERR3864887_T1
>ERX3876954_ERR3864901_T1
... more output omitted to save space ...
We can see each input genomes, assembled and annotated by us, appears once.
Panaroo
(click to see details)
Panaroo
requires GFF files in a non-standard format. They are similar to standard GFF files, but they also include the genome sequence itself at the end of the file. By default, Bakta
(which we used earlier to annotate our assembled genomes) already produces files in this non-standard GFF format.
However, GFF files downloaded from NCBI will not be in this non-standard format. To convert the files to the required format, the Panaroo
developers provide us with a Python script that can do this conversion:
python3 convert_refseq_to_prokka_gff.py -g annotation.gff -f genome.fna -o new.gff
-g
is the original GFF (for example downloaded from NCBI).-f
is the corresponding FASTA file with the genome (also downloaded from NCBI).-o
is the name for the output file.
This is a bit more advanced, and is included here for interested users. We already prepared all the files for constructing a phylogeny, so you don’t need to worry about this for the course.
26.2 Summary
- Panaroo can be used to generate a core gene alignment.
- It can use as input GFF files generated by an annotation software such as Bakta.
- IQ-Tree can be used for tree inference, based on the output from Panaroo’s core gene alignment.