Data & Setup
If you are attending one of our workshops, we will provide a training environment with all of the required software and data.
If you want to set up your own computer to run the analysis demonstrated on this course, you can follow the instructions below.
Software
Linux
Most of the analyses demonstrated in these materials are better suited to a High Performance Computing (HPC) cluster. If you already have access to an HPC in your institution, you can skip this step of the setup.
Otherwise, we provide instructions to set up Linux on a local computer.
The recommendation for bioinformatic analysis is to have a dedicated computer running a Linux distribution. The kind of distribution you choose is not critical, but we recommend Ubuntu if you are unsure.
You can follow the installation tutorial on the Ubuntu webpage.
Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.
The Windows Subsystem for Linux (WSL2) runs a compiled version of Ubuntu natively on Windows.
There are detailed instructions on how to install WSL in the Microsoft documentation. In brief:
- Press the Windows key and search for Windows PowerShell, right-click on the app and choose Run as administrator.
- Answer “Yes” when it asks if you want the App to make changes on your computer.
- A terminal will open; run the command:
wsl --install
- This should start installing “ubuntu”.
- It may ask for you to restart your computer.
- After restarting, press the Windows key and search for Ubuntu, click on the App and it should open a new terminal.
- Follow the instructions to create a username and password (you can use the same username and password that you have on Windows, or a different one - it’s your choice).
- You should now have access to an Ubuntu Linux terminal. This (mostly) behaves like a regular Ubuntu terminal, and you can install apps using the
sudo apt install
command as usual.
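If you are ever unsure whether a terminal is running inside WSL or on a regular Linux installation, the following is a minimal sketch of one way to check. It assumes that the WSL kernel identifies itself with the word "microsoft" in /proc/version, which is common but not guaranteed; `is_wsl` is a hypothetical helper name.

```shell
# Report whether the current shell appears to be running under WSL.
# Assumption: WSL kernels include "microsoft" in /proc/version.
is_wsl() {
  grep -qi microsoft /proc/version 2>/dev/null
}

if is_wsl; then
  echo "Running inside WSL"
else
  echo "Not running inside WSL"
fi
```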
After WSL is installed, it is useful to create shortcuts to your files on Windows. Your C:\
drive is located in /mnt/c/
(equally, other drives will be available based on their letter). For example, your desktop will be located in: /mnt/c/Users/<WINDOWS USERNAME>/Desktop/
. It may be convenient to set shortcuts to commonly-used directories, which you can do using symbolic links, for example:
- Documents:
ln -s /mnt/c/Users/<WINDOWS USERNAME>/Documents/ ~/Documents
- If you use OneDrive to save your documents, use:
ln -s /mnt/c/Users/<WINDOWS USERNAME>/OneDrive/Documents/ ~/Documents
- Desktop:
ln -s /mnt/c/Users/<WINDOWS USERNAME>/Desktop/ ~/Desktop
- Downloads:
ln -s /mnt/c/Users/<WINDOWS USERNAME>/Downloads/ ~/Downloads
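If a directory such as ~/Documents already exists, running ln -s against it places the new link inside that directory rather than replacing it. As a hedged sketch, the links can be created only when the Windows folder exists and nothing is already present at the link location. Here `link_if_present` and `win_user` are hypothetical names used for illustration.

```shell
# Create a symbolic link only if the target directory exists and
# nothing is already present at the link location.
# link_if_present is a hypothetical helper, not part of WSL or Ubuntu.
link_if_present() {
  target="$1"
  link="$2"
  if [ ! -d "$target" ]; then
    echo "skip: $target does not exist"
  elif [ -e "$link" ] || [ -L "$link" ]; then
    echo "skip: $link already exists"
  else
    ln -s "$target" "$link" && echo "linked: $link -> $target"
  fi
}

# Replace the placeholder with your Windows username before running.
win_user="<WINDOWS USERNAME>"
link_if_present "/mnt/c/Users/$win_user/Documents" "$HOME/Documents"
link_if_present "/mnt/c/Users/$win_user/Desktop" "$HOME/Desktop"
link_if_present "/mnt/c/Users/$win_user/Downloads" "$HOME/Downloads"
```

With the placeholder left unchanged, each call only reports that the target does not exist, so the sketch is safe to try before editing it.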
Another way to run Linux within Windows (or macOS) is to install a Virtual Machine. However, this is mostly suitable for practicing and not suitable for real data analysis.
Detailed instructions to install an Ubuntu VM using Oracle’s VirtualBox are available from the Ubuntu documentation page.
Note: In the step configuring “Virtual Hard Disk” make sure to assign a large storage partition (at least 100GB).
Update Ubuntu
After installing Ubuntu (through either of the methods above), open a terminal and run the following commands to update your system and install some essential packages:
sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
sudo apt install -y git
sudo apt install -y default-jre
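To confirm that the packages installed correctly, you can check that their commands are available on your PATH. This is a minimal sketch; `check_tools` is a hypothetical helper, and it assumes that the default-jre package provides the `java` command.

```shell
# Report which of the given commands are available on PATH.
# Returns non-zero if any of them is missing.
check_tools() {
  rc=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

check_tools git java || echo "Re-run the apt install commands for anything listed as missing."
```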
Conda/Mamba
We recommend using the Conda package manager to install your software, in particular its newer implementation, Mamba.
To install Mamba, run the following commands from the terminal:
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3
rm Miniforge3-$(uname)-$(uname -m).sh
$HOME/miniforge3/bin/mamba init
Restart your terminal (or open a new one) and confirm that your shell prompt now starts with the word (base)
. Then run the following commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set remote_read_timeout_secs 1000
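Because conda config --add places each channel at the top of the list, the channel added last (conda-forge) should end up with the highest priority. As a quick sketch, you can print the resulting order to check it. The wrapper `show_channels` is a hypothetical name, and it only reports a message if conda is not yet on your PATH.

```shell
# Print the configured channel list, highest priority first.
# show_channels is a hypothetical wrapper around a standard conda command.
show_channels() {
  if command -v conda >/dev/null 2>&1; then
    conda config --show channels
  else
    echo "conda not found on PATH; restart the terminal after running mamba init"
  fi
}

show_channels
```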
Software environments
The tools we will use have conflicting software dependencies. Therefore, rather than creating a single software environment with all the tools, we will create separate environments for different applications.
Pandas
For convenience, we recommend installing the popular Pandas package in the base (default) environment:
mamba install -n base pandas
Bakta
mamba create -n bakta bakta
Gubbins
mamba create -n gubbins gubbins
IQ-Tree
mamba create -n iqtree iqtree snp-sites biopython
mlst
mamba create -n mlst mlst
Nextflow
mamba create -n nextflow nextflow
Also run these commands to set a basic Nextflow configuration file (copy/paste this entire code):
mkdir -p $HOME/.nextflow
echo "
conda {
conda.enabled = true
singularity.enabled = false
docker.enabled = false
useMamba = true
createTimeout = '4 h'
cacheDir = '$HOME/.nextflow-conda-cache/'
}
singularity {
singularity.enabled = true
conda.enabled = false
docker.enabled = false
pullTimeout = '4 h'
cacheDir = '$HOME/.nextflow-singularity-cache/'
}
docker {
docker.enabled = true
singularity.enabled = false
conda.enabled = false
}
" >> $HOME/.nextflow/config
pairsnp
mamba create -n pairsnp pairsnp
Panaroo
mamba create -n panaroo python=3.9 "panaroo>=1.3" snp-sites
PopPUNK
mamba create -n poppunk python=3.10 poppunk
remove_blocks_from_aln
mamba create -n remove_blocks python=2.7
$HOME/miniforge3/envs/remove_blocks/bin/pip install git+https://github.com/sanger-pathogens/remove_blocks_from_aln.git
Seqtk
mamba create -n seqtk seqtk pandas
TB-Profiler
mamba create -n tb-profiler tb-profiler pandas
TreeTime
mamba create -n treetime treetime seqkit biopython
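After creating the environments above, it can be useful to check that none were skipped. The following is a sketch only: it assumes Miniforge was installed to $HOME/miniforge3 as in the earlier commands, so that each environment has a directory under $HOME/miniforge3/envs. `missing_envs` is a hypothetical helper.

```shell
# missing_envs is a hypothetical helper: given an environments directory
# and a list of names, print the names that have no directory there.
missing_envs() {
  envs_dir="$1"
  shift
  for name in "$@"; do
    [ -d "$envs_dir/$name" ] || echo "$name"
  done
}

# Prints one line per environment that still needs to be created.
missing_envs "$HOME/miniforge3/envs" \
  bakta gubbins iqtree mlst nextflow pairsnp panaroo poppunk \
  remove_blocks seqtk tb-profiler treetime
```

No output means every listed environment has a directory; it does not check that the packages inside installed correctly.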
R and RStudio
R and RStudio are available for all major operating systems.
- Windows: download and install all these using default options:
- macOS: download and install all these using default options:
- Linux:
- Go to the R installation folder and look at the instructions for your distribution.
- Download the RStudio installer for your distribution and install it using your package manager.
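After installing R, you will need to install a few packages. Open RStudio and type the following commands in the console (ggtree is distributed through Bioconductor rather than CRAN, so it is installed with BiocManager):
install.packages(c("tidyverse", "tidygraph", "ggraph", "igraph", "ggnewscale", "BiocManager"))
BiocManager::install("ggtree")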
Singularity
We recommend that you install Singularity and use the -profile singularity
option when running Nextflow pipelines. On Ubuntu/WSL2, you can install Singularity using the following commands:
sudo apt install -y runc cryptsetup-bin uidmap
wget -O singularity.deb https://github.com/sylabs/singularity/releases/download/v4.0.2/singularity-ce_4.0.2-$(lsb_release -cs)_amd64.deb
sudo dpkg -i singularity.deb
rm singularity.deb
If you have a different Linux distribution, you can find more detailed instructions on the Singularity documentation page.
If you have issues running Nextflow pipelines with Singularity, then you can follow the instructions below for Docker instead.
Docker
An alternative for software management when running Nextflow pipelines is to use Docker.
For Ubuntu Linux, here are the installation instructions:
sudo apt install curl
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh
sudo groupadd docker
sudo usermod -aG docker $USER
After the last step, you will need to restart your computer. From now on, you can use -profile docker
when you run Nextflow.
When using WSL2 on Windows, running Nextflow pipelines with -profile singularity
sometimes doesn’t work.
As an alternative you can instead use Docker, which is another software containerisation solution. To set this up, you can follow the full instructions given on the Microsoft Documentation: Get started with Docker remote containers on WSL 2.
We briefly summarise the instructions here (but check that page for details and images):
- Download Docker for Windows.
- Run the installer and install accepting default options.
- Restart the computer.
- Open Docker and go to Settings > General to tick “Use the WSL 2 based engine”.
- Go to Settings > Resources > WSL Integration to enable your Ubuntu WSL installation.
Once you have Docker set up and installed, you can use -profile docker
when running your Nextflow command.
You can follow the same instructions as for “Ubuntu”.
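Since the choice between -profile singularity and -profile docker depends on what is installed, a small check can suggest which one to try. This is a sketch under the assumption that having the command on your PATH means the tool is usable; it does not test whether the Docker service is actually running. `pick_profile` is a hypothetical helper.

```shell
# pick_profile is a hypothetical helper: prefer Singularity when it is
# installed, fall back to Docker, and report "none" otherwise.
pick_profile() {
  if command -v singularity >/dev/null 2>&1; then
    echo singularity
  elif command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo none
  fi
}

profile="$(pick_profile)"
echo "Suggested Nextflow profile: $profile"
```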
Data
The data used in these materials is provided as an archive file (bact-data.tar
). You can download it from the link below and extract the files from the archive into a directory of your choice.
You can also download them using the command line:
# directory for saving the data - change this to suit your needs
datadir="$HOME/Desktop/bacterial_genomics"
# download and extract to directory
mkdir -p "$datadir"
wget -O "$datadir/bact-data.tar" "https://www.dropbox.com/scl/fi/gdqf3y3toot2hjtivhlpk/bact-data.tar?rlkey=udjh38aqd05eg3r8klw5mguld&dl=1"
tar -xvf "$datadir/bact-data.tar" -C "$datadir"
rm "$datadir/bact-data.tar"
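If a download is interrupted or the link returns an error page, the saved file may not be a valid archive. As a hedged sketch, you can list the archive contents first and extract only if that succeeds. `extract_checked` is a hypothetical helper, and the check only confirms that tar can read the file, not that the contents are complete or correct.

```shell
# extract_checked is a hypothetical helper: list the archive first and
# extract only if tar can read it.
extract_checked() {
  archive="$1"
  dest="$2"
  if tar -tf "$archive" >/dev/null 2>&1; then
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
  else
    echo "error: $archive is missing or not a readable tar archive" >&2
    return 1
  fi
}

# Only runs if the archive has already been downloaded to this location.
datadir="$HOME/Desktop/bacterial_genomics"
if [ -f "$datadir/bact-data.tar" ]; then
  extract_checked "$datadir/bact-data.tar" "$datadir"
fi
```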
We also need to include preprocessed data for the outbreak exercise. See the download script in the repo for details.
Databases
We include a copy of the public databases used in the exercises in the Dropbox link above. However, for your analyses you should always download the most up-to-date databases.
In the code below we download these databases into a directory called databases
. This is optional; you can download the databases wherever is most convenient for you. If you work in a research group, it’s a good idea to have a shared storage where everyone can access the same copy of the databases.
# create directory for public DBs
mkdir databases
cd databases
Kraken2
We use a small version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Kraken2 indexes page for the latest versions available.
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
mkdir k2_standard_08gb_20240605
tar -xzvf k2_standard_08gb_20240605.tar.gz -C k2_standard_08gb_20240605
rm k2_standard_08gb_20240605.tar.gz
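A Kraken2 database directory normally contains three index files: hash.k2d, opts.k2d and taxo.k2d. As a minimal sketch, you can confirm that they were extracted and are not empty before using the database. `check_kraken_db` is a hypothetical helper, and the directory name below is the one used in the commands above.

```shell
# check_kraken_db is a hypothetical helper: confirm that the three
# Kraken2 index files exist and are non-empty in the given directory.
check_kraken_db() {
  db="$1"
  rc=0
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "$db/$f" ]; then
      echo "missing or empty: $db/$f"
      rc=1
    fi
  done
  return "$rc"
}

check_kraken_db k2_standard_08gb_20240605 || echo "Database looks incomplete; try downloading it again."
```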
Bakta
We use the “light” version of the database for teaching purposes, whereas you may want to use the full version in your work. Look at the Bakta Zenodo repository for the latest versions available.
wget https://zenodo.org/records/10522951/files/db-light.tar.gz
tar -xzvf db-light.tar.gz
mv db-light bakta_light_20240119
rm db-light.tar.gz
# make sure to activate bakta environment
mamba activate bakta
amrfinder_update --force_update --database bakta_light_20240119/amrfinderplus-db/
CheckM2
CheckM2 also provides a command checkm2 database --download
to download the latest version of the database from Zenodo.
wget https://zenodo.org/records/5571251/files/checkm2_database.tar.gz
tar -xzvf checkm2_database.tar.gz
mv CheckM2_database checkm2_v2_20210323
rm checkm2_database.tar.gz CONTENTS.json
GPSCs
wget https://gps-project.cog.sanger.ac.uk/GPS_v8_ref.tar.gz
mkdir poppunk
tar -xzvf GPS_v8_ref.tar.gz -C poppunk
rm GPS_v8_ref.tar.gz
wget -O poppunk/GPS_v8_external_clusters.csv https://gps-project.cog.sanger.ac.uk/GPS_v8_external_clusters.csv