Data & Setup

Workshop Attendees

If you are attending one of our workshops, we will provide a training environment with all of the required software and data.
If you want to set up your own computer to run the analysis demonstrated on this course, you can follow the instructions below.

Software

Linux

Most of the analyses demonstrated in these materials are better suited to a High Performance Computing (HPC) cluster. If you already have access to an HPC at your institution, you can skip this step of the setup.

Otherwise, we provide instructions to set up Linux on a local computer.

For bioinformatic analysis, we recommend a dedicated computer running a Linux distribution. Which distribution you choose is not critical, but we recommend Ubuntu if you are unsure.

You can follow the installation tutorial on the Ubuntu webpage.

Warning

Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.

The Windows Subsystem for Linux (WSL2) runs a compiled version of Ubuntu natively on Windows.

Detailed instructions on how to install WSL are available on the Microsoft documentation page. In brief:

  • Press the Windows key and search for Windows PowerShell, right-click on the app and choose Run as administrator.
  • Answer “Yes” when it asks if you want the App to make changes on your computer.
  • A terminal will open; run the command: wsl --install.
    • This should start installing Ubuntu.
    • It may ask you to restart your computer.
  • After restarting, press the Windows key and search for Ubuntu, click on the app and it should open a new terminal.
  • Follow the instructions to create a username and password (you can use the same username and password that you have on Windows, or a different one - it’s your choice).
  • You should now have access to an Ubuntu Linux terminal. This (mostly) behaves like a regular Ubuntu terminal, and you can install apps using the sudo apt install command as usual.
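Because the WSL terminal only mostly behaves like a regular Ubuntu terminal, it can be handy to check from a script whether you are inside WSL. The following is a minimal sketch, assuming that the WSL kernel includes the word "microsoft" in /proc/version; is_wsl is a hypothetical helper name, not part of WSL itself.

```shell
# Sketch: report whether this shell is running inside WSL.
# Assumption: WSL kernels include "microsoft" in /proc/version.
is_wsl() {
  grep -qi microsoft /proc/version 2>/dev/null
}

if is_wsl; then
  echo "Running inside WSL"
else
  echo "Not running inside WSL"
fi
```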

After WSL is installed, it is useful to know where your Windows files are. Your C:\ drive is mounted at /mnt/c/ (similarly, other drives are available under their own letter). For example, your desktop is located at /mnt/c/Users/<WINDOWS USERNAME>/Desktop/. It may be convenient to create shortcuts to commonly used directories, which you can do using symbolic links, for example:

  • Documents: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Documents/ ~/Documents
    • If you use OneDrive to save your documents, use: ln -s /mnt/c/Users/<WINDOWS USERNAME>/OneDrive/Documents/ ~/Documents
  • Desktop: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Desktop/ ~/Desktop
  • Downloads: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Downloads/ ~/Downloads
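If a link name such as ~/Documents already exists as a directory, ln -s places the new link inside that directory rather than reporting a problem. The sketch below shows one way to guard against this; link_windows_dir is a hypothetical helper name, and the paths in the usage comment are the same placeholders used in the list above.

```shell
# Hypothetical helper: link a Windows folder into the Linux home directory,
# but only if the source exists and nothing is already at the link path.
# Usage: link_windows_dir <source-dir> <link-path>
link_windows_dir() {
  src="$1"
  dest="$2"
  if [ ! -d "$src" ]; then
    echo "skipping: $src does not exist" >&2
    return 1
  fi
  if [ -e "$dest" ] || [ -L "$dest" ]; then
    echo "skipping: $dest already exists" >&2
    return 1
  fi
  ln -s "$src" "$dest"
}

# Example (replace the placeholder with your Windows username):
# link_windows_dir "/mnt/c/Users/<WINDOWS USERNAME>/Documents" "$HOME/Documents"
```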

Another way to run Linux within Windows (or macOS) is to install a Virtual Machine. However, this is mostly suitable for practicing and not suitable for real data analysis.

Detailed instructions to install an Ubuntu VM using Oracle’s VirtualBox are available from the Ubuntu documentation page.

Note: In the step configuring “Virtual Hard Disk”, make sure to assign a large storage partition (at least 100 GB).

Update Ubuntu

After installing Ubuntu (through either of the methods above), open a terminal and run the following commands to update your system and install some essential packages:

sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
sudo apt install -y git
sudo apt install -y default-jre
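To confirm that the packages installed correctly, you can check that their commands are on your PATH. This is a minimal sketch: missing_commands is a hypothetical helper, and it assumes the default-jre package provides the java command.

```shell
# Print the name of every listed command that is not found on the PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

missing=$(missing_commands git java)
if [ -z "$missing" ]; then
  echo "git and java are both available"
else
  # $missing is left unquoted so the names print on a single line
  echo "Not found on PATH:" $missing
fi
```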

Conda/Mamba

We recommend using the Conda package manager to install your software, in particular its newest implementation, called Mamba.

To install Mamba, run the following commands from the terminal:

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3
rm Miniforge3-$(uname)-$(uname -m).sh
$HOME/miniforge3/bin/mamba init
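The installer filename in the commands above is built from your operating system and CPU architecture. If the download fails, a quick check like the sketch below shows which file is being requested on your machine; on a 64-bit Intel/AMD Linux system it prints Miniforge3-Linux-x86_64.sh.

```shell
# Show the installer filename that the commands above resolve to on this machine.
installer="Miniforge3-$(uname)-$(uname -m).sh"
echo "$installer"
```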

Restart your terminal (or open a new one) and confirm that your shell prompt now starts with the word (base). Then run the following commands:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set remote_read_timeout_secs 1000

Software environments

The tools we will use have several incompatible software dependencies. Therefore, rather than creating a single software environment with all the tools, we will create separate environments for different applications.

Pandas

For convenience, we recommend installing the popular Pandas package in the base (default) environment:

mamba install -n base pandas

Bakta

mamba create -n bakta bakta

Gubbins

mamba create -n gubbins gubbins

IQ-Tree

mamba create -n iqtree iqtree snp-sites biopython

mlst

mamba create -n mlst mlst

Nextflow

mamba create -n nextflow nextflow

Also run these commands to set a basic Nextflow configuration file (copy/paste this entire code):

mkdir -p $HOME/.nextflow
echo "
conda {
  conda.enabled = true
  singularity.enabled = false
  docker.enabled = false
  useMamba = true
  createTimeout = '4 h'
  cacheDir = '$HOME/.nextflow-conda-cache/'
}
singularity {
  singularity.enabled = true
  conda.enabled = false
  docker.enabled = false
  pullTimeout = '4 h'
  cacheDir = '$HOME/.nextflow-singularity-cache/'
}
docker {
  docker.enabled = true
  singularity.enabled = false
  conda.enabled = false
}
" >> $HOME/.nextflow/config
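Note that the command above appends (>>) to $HOME/.nextflow/config, so running it more than once leaves duplicate blocks in the file. One way to stay safe is to back up any existing file before running it. The sketch below assumes nothing beyond standard shell tools; backup_file is a hypothetical helper name.

```shell
# Hypothetical helper: copy a file to a timestamped backup if it exists.
backup_file() {
  file="$1"
  if [ -f "$file" ]; then
    cp "$file" "$file.bak.$(date +%Y%m%d%H%M%S)"
  fi
}

# Back up the Nextflow config (does nothing if the file does not exist yet).
backup_file "$HOME/.nextflow/config"
```

If the file ends up with duplicated blocks, you can restore the backup or delete the file and run the configuration command once more.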

pairsnp

mamba create -n pairsnp pairsnp

Panaroo

mamba create -n panaroo python=3.9 "panaroo>=1.3" snp-sites

PopPUNK

mamba create -n poppunk python=3.10 poppunk

remove_blocks_from_aln

mamba create -n remove_blocks python=2.7
$HOME/miniforge3/envs/remove_blocks/bin/pip install git+https://github.com/sanger-pathogens/remove_blocks_from_aln.git

Seqtk

mamba create -n seqtk seqtk pandas

TB-Profiler

mamba create -n tb-profiler tb-profiler pandas

TreeTime

mamba create -n treetime treetime seqkit biopython
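Once the environments above have been created, you may want to confirm that none were skipped. The sketch below is one possible check, assuming that conda env list prints the environment name in the first column; env_exists is a hypothetical helper name.

```shell
# Return 0 if a conda environment with the given name exists.
# Returns non-zero if conda is not installed or the name is not listed.
env_exists() {
  command -v conda >/dev/null 2>&1 || return 2
  conda env list | awk '{ print $1 }' | grep -qx "$1"
}

# Environment names taken from the commands in this section.
for env in bakta gubbins iqtree mlst nextflow pairsnp panaroo poppunk \
           remove_blocks seqtk tb-profiler treetime; do
  if env_exists "$env"; then
    echo "ok:      $env"
  else
    echo "missing: $env"
  fi
done
```

To use one of the tools, activate its environment first, for example with mamba activate bakta.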

R and RStudio

R and RStudio are available for all major operating systems.

  • Windows: download and install all these using default options:
  • macOS: download and install all these using default options:
  • Linux:
    • Go to the R installation folder and look at the instructions for your distribution.
    • Download the RStudio installer for your distribution and install it using your package manager.

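After installing R, you will need to install a few packages. Open RStudio and type the following commands in the console (ggtree is distributed through Bioconductor rather than CRAN, so it is installed with BiocManager):

install.packages(c("tidyverse", "tidygraph", "ggraph", "igraph", "ggnewscale", "BiocManager"))
BiocManager::install("ggtree")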

Singularity

We recommend that you install Singularity and use the -profile singularity option when running Nextflow pipelines. On Ubuntu/WSL2, you can install Singularity using the following commands:

sudo apt install -y runc cryptsetup-bin uidmap
wget -O singularity.deb https://github.com/sylabs/singularity/releases/download/v4.0.2/singularity-ce_4.0.2-$(lsb_release -cs)_amd64.deb
sudo dpkg -i singularity.deb
rm singularity.deb

If you have a different Linux distribution, you can find more detailed instructions on the Singularity documentation page.

If you have issues running Nextflow pipelines with Singularity, then you can follow the instructions below for Docker instead.

Docker

An alternative for software management when running Nextflow pipelines is to use Docker.

For Ubuntu Linux, here are the installation instructions:

sudo apt install curl
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh
sudo groupadd docker
sudo usermod -aG docker $USER

After the last step, you will need to restart your computer. From now on, you can use -profile docker when you run Nextflow.
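After restarting, you can check that your user was added to the docker group, which is what allows running Docker without sudo. This is a minimal sketch using the standard id command; in_group is a hypothetical helper name.

```shell
# Return 0 if the current user belongs to the named group.
in_group() {
  id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group docker; then
  echo "You are in the docker group"
else
  echo "Not in the docker group yet (try logging out and back in, or restarting)"
fi
```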

When using WSL2 on Windows, running Nextflow pipelines with -profile singularity sometimes doesn’t work.

As an alternative, you can use Docker, which is another software containerisation solution. To set this up, you can follow the full instructions given on the Microsoft Documentation: Get started with Docker remote containers on WSL 2.

We briefly summarise the instructions here (but check that page for details and images):

  • Download Docker for Windows.
  • Run the installer and install accepting default options.
  • Restart the computer.
  • Open Docker and go to Settings > General to tick “Use the WSL 2 based engine”.
  • Go to Settings > Resources > WSL Integration to enable your Ubuntu WSL installation.

Once you have Docker installed and set up, you can use -profile docker when running your Nextflow command.

If you are running Ubuntu in a virtual machine, you can follow the same instructions as for Ubuntu above.

Data

The data used in these materials is provided as an archive file (bact-data.tar). You can download it from the link below and extract the files from the archive into a directory of your choice.

You can also download them using the command line:

# directory for saving the data - change this to suit your needs
datadir="$HOME/Desktop/bacterial_genomics"

# download and extract to directory
mkdir -p "$datadir"
wget -O "$datadir/bact-data.tar" "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
tar -xvf "$datadir/bact-data.tar" -C "$datadir"
rm "$datadir/bact-data.tar"
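The commands above delete the archive even if extraction fails, for example after an incomplete download. A slightly more defensive variant is sketched below; extract_and_clean is a hypothetical helper that removes the archive only when tar succeeds.

```shell
# Hypothetical helper: extract a tar archive into a directory and delete the
# archive only if extraction succeeded.
extract_and_clean() {
  archive="$1"
  outdir="$2"
  mkdir -p "$outdir" &&
    tar -xf "$archive" -C "$outdir" &&
    rm "$archive"
}

# Usage with the variable defined above (run after the wget step):
# extract_and_clean "$datadir/bact-data.tar" "$datadir"
```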

Note for training facility

We also need to include preprocessed data for the outbreak exercise. See the download script in the repo for details.

Databases

We include a copy of public databases used in the exercises in the Dropbox link above. However, for your analyses you should always download the most up-to-date databases.

In the code below we download these databases into a directory called databases. This is optional: you can download the databases wherever is most convenient for you. If you work in a research group, it’s a good idea to have shared storage where everyone can access the same copy of the databases.

# create directory for public DBs
mkdir databases
cd databases
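The database archives below can be large, and each one needs room for both the compressed file and its extracted contents until the archive is removed. Before downloading, you may want to check how much free space you have. The following is a minimal sketch using the POSIX output format of df; free_gb is a hypothetical helper name.

```shell
# Print the free space, in whole gigabytes, on the filesystem holding a path
# (defaults to the current directory).
free_gb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

echo "Free space here: $(free_gb) GB"
```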

Kraken2

We use a small version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Kraken2 indexes page for the latest versions available.

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
mkdir k2_standard_08gb_20240605
tar -xzvf k2_standard_08gb_20240605.tar.gz -C k2_standard_08gb_20240605
rm k2_standard_08gb_20240605.tar.gz

Bakta

We use the “light” version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Bakta Zenodo repository for the latest versions available.

wget https://zenodo.org/records/10522951/files/db-light.tar.gz
tar -xzvf db-light.tar.gz
mv db-light bakta_light_20240119
rm db-light.tar.gz

# make sure to activate bakta environment
mamba activate bakta
amrfinder_update --force_update --database bakta_light_20240119/amrfinderplus-db/

CheckM2

CheckM2 also provides a command checkm2 database --download to download the latest version of the database from Zenodo.

wget https://zenodo.org/records/5571251/files/checkm2_database.tar.gz
tar -xzvf checkm2_database.tar.gz
mv CheckM2_database checkm2_v2_20210323
rm checkm2_database.tar.gz CONTENTS.json

GPSCs

wget https://gps-project.cog.sanger.ac.uk/GPS_v8_ref.tar.gz
mkdir poppunk
tar -xzvf GPS_v8_ref.tar.gz -C poppunk
rm GPS_v8_ref.tar.gz

wget -O poppunk/GPS_v8_external_clusters.csv https://gps-project.cog.sanger.ac.uk/GPS_v8_external_clusters.csv