Data & Setup
If you are attending one of our workshops, we will provide a training environment with all of the required software and data.
If you want to set up your own computer to run the analyses demonstrated in this course, you can follow the instructions below.
Software
Linux
Most of the analyses demonstrated in these materials are best run on a High Performance Computing (HPC) cluster. If you already have access to an HPC cluster at your institution, you can skip this step of the setup.
Otherwise, we provide instructions to set up Linux on a local computer.
The recommendation for bioinformatic analysis is to have a dedicated computer running a Linux distribution. The kind of distribution you choose is not critical, but we recommend Ubuntu if you are unsure.
You can follow the installation tutorial on the Ubuntu webpage.
Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.
The Windows Subsystem for Linux (WSL2) lets you run an Ubuntu Linux environment directly on Windows.
There are detailed instructions on how to install WSL in the Microsoft documentation. Briefly:
- Press the Windows key and search for Windows PowerShell, right-click on the app and choose Run as administrator.
- Answer “Yes” when it asks if you want the App to make changes on your computer.
- A terminal will open; run the command:
  wsl --install
- This should start installing “ubuntu”.
- It may ask you to restart your computer.
- After the restart, press the Windows key and search for Ubuntu, click on the App and it should open a new terminal.
- Follow the instructions to create a username and password (you can use the same username and password that you have on Windows, or a different one - it’s your choice).
- You should now have access to an Ubuntu Linux terminal. This (mostly) behaves like a regular Ubuntu terminal, and you can install apps using the sudo apt install command as usual.
After WSL is installed, it is useful to create shortcuts to your files on Windows. Your C:\ drive is located in /mnt/c/ (other drives are available in the same way, under their letter). For example, your desktop will be located in: /mnt/c/Users/<WINDOWS USERNAME>/Desktop/. It may be convenient to set shortcuts to commonly-used directories, which you can do using symbolic links, for example:
- Documents:
  ln -s /mnt/c/Users/<WINDOWS USERNAME>/Documents/ ~/Documents
  - If you use OneDrive to save your documents, use:
    ln -s /mnt/c/Users/<WINDOWS USERNAME>/OneDrive/Documents/ ~/Documents
- Desktop:
  ln -s /mnt/c/Users/<WINDOWS USERNAME>/Desktop/ ~/Desktop
- Downloads:
  ln -s /mnt/c/Users/<WINDOWS USERNAME>/Downloads/ ~/Downloads
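The links above can also be created in one pass. The following is a minimal sketch rather than part of the official instructions: the helper name link_if_present is hypothetical, and WIN_HOME is a placeholder that you need to edit to match your own Windows username. It only creates a link when the Windows folder exists and nothing with that name is already present in your home directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a symbolic link only when it is safe to do so.
#   $1 = existing target directory, $2 = path of the link to create
link_if_present() {
  local target="$1" link="$2"
  if [ ! -d "$target" ]; then
    echo "skip: $target does not exist"
  elif [ -e "$link" ] || [ -L "$link" ]; then
    echo "skip: $link already exists"
  else
    ln -s "$target" "$link" && echo "linked: $link -> $target"
  fi
}

# Assumption: replace the placeholder with your own Windows username.
WIN_HOME="/mnt/c/Users/<WINDOWS USERNAME>"

for dir in Documents Desktop Downloads; do
  link_if_present "$WIN_HOME/$dir" "$HOME/$dir"
done
```

If you keep your documents in OneDrive, the same helper can be pointed at the OneDrive/Documents folder instead.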
Another way to run Linux within Windows (or macOS) is to install a Virtual Machine. However, this is mostly suitable for practice, not for real data analysis.
Detailed instructions to install an Ubuntu VM using Oracle’s VirtualBox are available from the Ubuntu documentation page.
Note: In the “Virtual Hard Disk” configuration step, make sure to assign a large storage partition (at least 100 GB).
Update Ubuntu
After installing Ubuntu (through either of the methods above), open a terminal and run the following commands to update your system and install some essential packages:
sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
sudo apt install -y git
sudo apt install -y default-jre
Conda/Mamba
We recommend using the Conda package manager to install your software, in particular its newer implementation, Mamba.
To install Mamba, run the following commands from the terminal:
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3
rm Miniforge3-$(uname)-$(uname -m).sh
$HOME/miniforge3/bin/mamba init
Restart your terminal (or open a new one) and confirm that your shell prompt now starts with the word (base). Then run the following commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set remote_read_timeout_secs 1000
Software environments
The tools we use have several mutually incompatible software dependencies. Therefore, rather than creating a single software environment with all the tools, we will create separate environments for different applications.
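With one environment per tool, it is easy to lose track of which ones have been created. Once you have worked through the rest of this section, a quick check along these lines can confirm that none were missed. This is a minimal sketch: the function name check_envs is hypothetical, the list of names is only a selection, and it assumes Miniforge was installed in $HOME/miniforge3 as in the commands above, so that named environments are stored under its envs/ directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which named environments exist in an envs directory.
#   $1 = envs directory, remaining arguments = environment names
check_envs() {
  local envs_dir="$1" env
  shift
  for env in "$@"; do
    if [ -d "$envs_dir/$env" ]; then
      echo "found: $env"
    else
      echo "missing: $env"
    fi
  done
}

# Assumption: Miniforge lives in $HOME/miniforge3, as in the install commands above.
check_envs "$HOME/miniforge3/envs" bakta krona gubbins iqtree mlst nextflow
```

Each tool is then used by activating its environment first, for example with mamba activate bakta.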
Pandas
For convenience, we recommend installing the popular Pandas package in the base (default) environment:
mamba install -n base pandas
Bakta
mamba create -n bakta bakta
Krona
mamba create -n krona krona
Gubbins
mamba create -n gubbins gubbins
IQ-Tree
mamba create -n iqtree iqtree snp-sites biopython
mlst
mamba create -n mlst mlst
Nextflow
mamba create -n nextflow nextflow
Also run the following commands to set a basic Nextflow configuration file. Make sure to adjust the resource limits to fit with your workstation or HPC (maximum values for cpus, memory and time).
mkdir -p $HOME/.nextflow
cat <<EOF >> $HOME/.nextflow/config
process {
  resourceLimits = [
    cpus: 8,
    memory: 20.GB,
    time: 12.h
  ]
}
singularity {
  pullTimeout = '4 h'
  cacheDir = '$HOME/.nextflow-singularity-cache/'
}
EOF
pairsnp
mamba create -n pairsnp pairsnp
Panaroo
# the version constraint is quoted so the shell does not treat ">" as an output redirection
mamba create -n panaroo python=3.9 "panaroo>=1.3" snp-sites
PopPUNK
mamba create -n poppunk python=3.10 poppunk
remove_blocks_from_aln
mamba create -n remove_blocks python=2.7
$HOME/miniforge3/envs/remove_blocks/bin/pip install git+https://github.com/sanger-pathogens/remove_blocks_from_aln.git
Seqtk
mamba create -n seqtk seqtk pandas
TB-Profiler
mamba create -n tb-profiler tb-profiler pandas
TreeTime
mamba create -n treetime treetime seqkit biopython
MOB-suite & Pling & mashtree
mamba create -n mob_suite mob_suite
mamba create -n pling pling
mamba create -n mashtree mashtree
Reverse vaccinology
mamba create -n reverse-vaccinology bakta diamond cd-hit pandas
PSORTb
Running PSORTb requires Apptainer and a wrapper script. The container is available from our Dropbox.
wget -O psortb.sif "https://www.dropbox.com/ #add link here"
wget https://raw.githubusercontent.com/brinkmanlab/psortb_commandline_docker/master/psortb_app
chmod +x psortb_app
Accessory genome vaccine workflow
mamba create -n accessory-vaccinology -c bioconda -c conda-forge mash pyseer python=3.6 openssl=1.0
R and RStudio
R and RStudio are available for all major operating systems.
- Windows: download and install all these using default options:
- macOS: download and install all these using default options:
- Linux:
- Go to the R installation folder and look at the instructions for your distribution.
- Download the RStudio installer for your distribution and install it using your package manager.
After installing R, you will need to install a few packages. Open RStudio and on the console type the following commands:
install.packages("BiocManager")
BiocManager::install(c("data.table", "ggraph", "igraph",
                       "tidygraph", "tidyverse", "ape",
                       "phytools", "ggnewscale", "ggtree",
                       "janitor"))
Singularity
We recommend that you install Singularity and use the -profile singularity option when running Nextflow pipelines. On Ubuntu/WSL2, you can install Singularity using the following commands:
sudo apt install -y libfuse2t64 runc fuse2fs uidmap
wget -O singularity.deb https://github.com/sylabs/singularity/releases/download/v4.3.0/singularity-ce_4.3.0-$(lsb_release -cs)_amd64.deb
sudo dpkg -i singularity.deb
rm singularity.deb
If you have a different Linux distribution, you can find more detailed instructions on the Singularity documentation page.
If you have issues running Nextflow pipelines with Singularity, then you can follow the instructions below for Docker instead.
Docker
An alternative for software management when running Nextflow pipelines is to use Docker.
For Ubuntu Linux, here are the installation instructions:
sudo apt install curl
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh
sudo groupadd docker
sudo usermod -aG docker $USER
After the last step, you will need to restart your computer. From now on, you can use -profile docker when you run Nextflow.
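Whether the group change from the usermod step has taken effect can be checked from the terminal. The following is a small sketch, not an official Docker procedure: the function name in_group is hypothetical, and it relies only on the standard id command, which lists the groups of the current session.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the current session belongs to the given group.
in_group() {
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

if in_group docker; then
  echo "docker group is active in this session"
else
  echo "docker group not active yet: log out and back in, or restart"
fi
```

If the second message still appears after a restart, it is worth re-running the usermod command and checking it for typos.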
When using WSL2 on Windows, running Nextflow pipelines with -profile singularity sometimes doesn’t work.
In that case you can use Docker instead, which is another software containerisation solution. To set this up, you can follow the full instructions given in the Microsoft documentation: Get started with Docker remote containers on WSL 2.
We briefly summarise the instructions here (but check that page for details and images):
- Download Docker for Windows.
- Run the installer and install accepting default options.
- Restart the computer.
- Open Docker and go to Settings > General to tick “Use the WSL 2 based engine”.
- Go to Settings > Resources > WSL Integration to enable your Ubuntu WSL installation.
Once Docker is installed and set up, you can use -profile docker when running your Nextflow command.
If you are running Ubuntu in a virtual machine, you can follow the same instructions as for “Ubuntu”.
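Which of the two profiles to pass to Nextflow depends on what is installed on your machine. As a minimal sketch (the function name first_available is hypothetical), the snippet below picks the first container engine found on your PATH, trying Singularity first since that is the recommended option above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first command from the argument list found on PATH.
first_available() {
  local cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

if engine="$(first_available singularity docker)"; then
  echo "use: -profile $engine"
else
  echo "neither Singularity nor Docker was found on PATH"
fi
```

This only checks that the command exists; it does not confirm that the engine can actually run containers.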
Data
The data used in these materials are provided as archive files:
- bact-data.tar contains the main course data.
- bact-outbreak.tar contains the data for the final capstone exercise.
- bact-databases.tar contains a copy of the databases used by some of the programs. Note: we do not recommend that you use this copy in your own work; always download the latest database versions following the instructions given below.
You can download these files from the link below and extract the files from the archive into a directory of your choice.
You can also download them using the command line:
# directory for saving the data - change this to suit your needs
datadir="$HOME/Desktop/bacterial_genomics"
mkdir -p "$datadir"
# download and extract to directory
wget -O $datadir/bact-data.tar "https://www.dropbox.com/scl/fi/s88w1cdiqtygnepbff858/bact-data.tar?rlkey=xifz132zgjt7hj8oj38ef9o00&st=izvooc62&dl=1"
tar -xvf $datadir/bact-data.tar -C $datadir
rm $datadir/bact-data.tar
wget -O $datadir/bact-outbreak.tar "https://www.dropbox.com/scl/fi/tio9qtcuwvv86nwfckezl/bact-outbreak.tar?rlkey=khcb6nsvj3mpsvfbsv97q467e&st=r9vj0dm9&dl=1"
tar -xvf $datadir/bact-outbreak.tar -C $datadir
rm $datadir/bact-outbreak.tar
wget -O $datadir/bact-databases.tar "https://www.dropbox.com/scl/fi/ljwypmwetfu6o6pe3fwff/bact-databases.tar?rlkey=yyg3q7w0s47ildzad5sfftr1x&st=y381n5nj&dl=1"
tar -xvf $datadir/bact-databases.tar -C $datadir
rm $datadir/bact-databases.tar
We also need to include preprocessed data for the outbreak exercise. See the download script in the repo for details.
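The three downloads follow the same pattern: fetch, extract, delete the archive. A possible variant, shown here only as a hedged sketch, deletes the archive only when extraction succeeded, so that a failed extraction does not force a new download. The function name extract_and_clean is hypothetical, and the demonstration at the end uses a tiny archive built in a temporary directory rather than the course data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract a tar archive into a directory and remove the
# archive only if extraction succeeded.
#   $1 = archive path, $2 = destination directory
extract_and_clean() {
  local archive="$1" dest="$2"
  if [ ! -f "$archive" ]; then
    echo "not found: $archive"
    return 1
  fi
  mkdir -p "$dest"
  if tar -xf "$archive" -C "$dest"; then
    rm "$archive"
    echo "extracted: $archive"
  else
    echo "extraction failed, keeping: $archive"
    return 1
  fi
}

# Demonstration with a throwaway archive (not the course data).
demo="$(mktemp -d)"
echo "hello" > "$demo/sample.txt"
tar -cf "$demo/demo.tar" -C "$demo" sample.txt
rm "$demo/sample.txt"
extract_and_clean "$demo/demo.tar" "$demo/out"
```

With the course data, the call would look like extract_and_clean "$datadir/bact-data.tar" "$datadir", run after the corresponding wget command.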
Databases
We include a copy of the public databases used in the exercises in the Dropbox link above. However, for your analyses you should always download the most up-to-date databases.
In the code below we download these databases into a directory called databases. This is optional; you can download the databases wherever is most convenient for you. If you work in a research group, it’s a good idea to have shared storage where everyone can access the same copy of the databases.
# create directory for public DBs
mkdir databases
cd databases
Kraken2
We use a small version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Kraken2 indexes page for the latest versions available.
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
mkdir k2_standard_08gb_20240605
tar -xzvf k2_standard_08gb_20240605.tar.gz -C k2_standard_08gb_20240605
rm k2_standard_08gb_20240605.tar.gz
Bakta
We use the “light” version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Bakta Zenodo repository for the latest versions available.
wget https://zenodo.org/records/10522951/files/db-light.tar.gz
tar -xzvf db-light.tar.gz
mv db-light bakta_light_20240119
rm db-light.tar.gz
# make sure to activate bakta environment
mamba activate bakta
amrfinder_update --force_update --database bakta_light_20240119/amrfinderplus-db/
CheckM2
CheckM2 also provides a command checkm2 database --download to download the latest version of the database from Zenodo.
wget https://zenodo.org/records/5571251/files/checkm2_database.tar.gz
tar -xzvf checkm2_database.tar.gz
mv CheckM2_database checkm2_v2_20210323
rm checkm2_database.tar.gz CONTENTS.json
GPSCs
wget https://gps-project.cog.sanger.ac.uk/GPS_v8_ref.tar.gz
mkdir poppunk
tar -xzvf GPS_v8_ref.tar.gz -C poppunk
rm GPS_v8_ref.tar.gz
wget -O poppunk/GPS_v8_external_clusters.csv https://gps-project.cog.sanger.ac.uk/GPS_v8_external_clusters.csv
Krona
# make sure to activate krona environment
mamba activate krona
ktUpdateTaxonomy.sh krona/
MOB-suite
MOB-suite has a generic database available, which can be downloaded using:
mamba activate mob_suite
mob_init -d mob_suite -v
The MOB-suite developers also provide a collection of Enterobacteriaceae genomes for organisms such as E. coli. These can be downloaded separately from Zenodo, like so:
wget -O mobsuite.zip https://zenodo.org/api/records/3785351/files-archive
unzip mobsuite.zip -d mob_suite
rm mobsuite.zip
SWISS-PROT and Human proteome
# download Swiss-Prot and Human Proteome from UniProt
wget "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
wget "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz"
gunzip *.gz
# make sure to activate reverse-vaccinology environment
mamba activate reverse-vaccinology
# create DIAMOND-formatted databases
diamond makedb --in uniprot_sprot.fasta -d swissprot
diamond makedb --in UP000005640_9606.fasta -d human_proteome
CARD
This database is used by the Nextflow workflow nf-core/funcscan. The database is downloaded by the workflow itself, but if you run this workflow regularly, it might be best to download it once, to save time and bandwidth.
Instructions for this are given in the workflow documentation page. Here is how we did it for our workshop:
mkdir card
wget -O card.tar.bz2 https://card.mcmaster.ca/latest/data
tar -xjvf card.tar.bz2 -C card
rm card.tar.bz2