2  Data & Setup

Workshop Attendees

If you are attending one of our workshops, we will provide a training environment with all of the required software and data. There is no need for you to set anything up in advance.

These instructions are for those who would like to set up their own computer to run the analyses demonstrated in the materials.

Software

Install Linux

For bioinformatic analysis, we recommend a dedicated computer running a Linux distribution. Which distribution you choose is not critical, but we recommend Ubuntu if you are unsure.

You can follow the installation tutorial on the Ubuntu webpage.

Warning

Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.

The Windows Subsystem for Linux (WSL2) runs a compiled version of Ubuntu natively on Windows.

Detailed instructions on how to install WSL are available on the Microsoft documentation page. In brief:

  • Press the Windows key and search for Windows PowerShell, then right-click on the app and choose Run as administrator.
  • Answer “Yes” when it asks if you want the App to make changes on your computer.
  • A terminal will open; run the command: wsl --install.
    • This should start installing “ubuntu”.
    • It may ask you to restart your computer.
  • After restarting, press the Windows key and search for Ubuntu, then click on the app to open a new terminal.
  • Follow the instructions to create a username and password (you can use the same username and password that you have on Windows, or a different one - it’s your choice).
  • You should now have access to an Ubuntu Linux terminal. This (mostly) behaves like a regular Ubuntu terminal, and you can install apps using the sudo apt install command as usual.

After WSL is installed, it is useful to create shortcuts to your Windows files. Your C:\ drive is located in /mnt/c/ (other drives are available under their own letter). For example, your desktop is located in /mnt/c/Users/<WINDOWS USERNAME>/Desktop/. You can create shortcuts to commonly used directories with symbolic links, for example:

  • Documents: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Documents/ ~/Documents
    • If you use OneDrive to save your documents, use: ln -s /mnt/c/Users/<WINDOWS USERNAME>/OneDrive/Documents/ ~/Documents
  • Desktop: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Desktop/ ~/Desktop
  • Downloads: ln -s /mnt/c/Users/<WINDOWS USERNAME>/Downloads/ ~/Downloads
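
If the shortcut name already exists as a directory, ln -s places the new link inside it instead of replacing it, and a link to a folder that does not exist (for example when Documents lives under OneDrive) is left dangling. The following is a minimal sketch of a guard around the same ln -s command. The function name link_dir is hypothetical and is not part of WSL or Ubuntu.

```shell
# Hypothetical helper (not part of WSL): create a shortcut only when it is safe.
#   $1 = existing Windows folder, e.g. "/mnt/c/Users/<WINDOWS USERNAME>/Documents"
#   $2 = name of the shortcut,    e.g. "$HOME/Documents"
link_dir() {
  target="$1"
  link="$2"
  if [ ! -d "$target" ]; then
    echo "Skipping: $target is not a directory" >&2
    return 1
  fi
  if [ -e "$link" ] || [ -L "$link" ]; then
    echo "Skipping: $link already exists" >&2
    return 1
  fi
  ln -s "$target" "$link"
  echo "Linked $link -> $target"
}
```

After defining the function, it could be called once per folder, for example: link_dir "/mnt/c/Users/<WINDOWS USERNAME>/Downloads" "$HOME/Downloads" (with your own username in place of the placeholder).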

Another way to run Linux within Windows (or macOS) is to install a Virtual Machine. However, this is mostly suitable for practicing and not suitable for real data analysis.

Detailed instructions to install an Ubuntu VM using Oracle’s VirtualBox are available from the Ubuntu documentation page.

Note: In the step configuring “Virtual Hard Disk” make sure to assign a large storage partition (at least 100GB).
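
Once the system is running, you can confirm from a terminal how much space is actually available. This is a minimal sketch using the standard df command. The function name free_gb is hypothetical, and the threshold of 100 is simply the figure from the note above.

```shell
# Print the free space (in whole GB) on the file system that holds a directory.
free_gb() {
  # -P prints one line per file system, -k reports sizes in 1 KB blocks;
  # column 4 of the second line is the available space
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

avail=$(free_gb "$HOME")
if [ "$avail" -lt 100 ]; then
  echo "Only ${avail} GB free in $HOME"
else
  echo "${avail} GB free in $HOME"
fi
```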

Update Ubuntu

After installing Ubuntu (through either of the methods above), open a terminal and run the following commands to update your system and install some essential packages:

sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
sudo apt install -y git
sudo apt install -y default-jre
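
To confirm that the packages were installed, you can check that their commands are available on your PATH. This is a minimal sketch; the function name check_tools is hypothetical, and it relies only on the shell built-in command -v.

```shell
# Report which of the given commands are available on the PATH.
# Returns a non-zero status if any of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# git and java are the commands provided by the packages installed above
check_tools git java || echo "Some tools are missing; re-run the apt commands above."
```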

Conda/Mamba

We recommend using the Conda package manager to install your software, in particular its newer implementation, Mamba.

To install Mamba, run the following commands from the terminal:

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh -b
rm Mambaforge-$(uname)-$(uname -m).sh

Restart your terminal (or open a new one) and confirm that your shell now starts with the word (base). Then run the following commands:

conda config --add channels defaults; conda config --add channels bioconda; conda config --add channels conda-forge
conda config --set remote_read_timeout_secs 1000

Software environments

The tools we will use have several incompatible software dependencies. Therefore, rather than creating a single software environment with all the tools, we will create separate environments for different applications.

Mash

mamba create -y -n mash mash

Assembly

mamba create -y -n assembly flye rasusa bakta medaka

CheckM2

mamba create -y -n checkm2 checkm2

Typing

mamba create -y -n typing mlst perl blast

Phylogeny

mamba create -y -n phylogeny panaroo iqtree figtree snp-sites

Nextflow

mamba create -y -n nextflow nextflow

Also run these commands to configure Nextflow correctly (copy and paste the entire block):

mkdir -p $HOME/.nextflow
echo "
conda {
  conda.enabled = true
  singularity.enabled = false
  docker.enabled = false
  useMamba = true
  createTimeout = '4 h'
  cacheDir = \"$HOME/.nextflow-conda-cache/\"
}
singularity {
  singularity.enabled = true
  conda.enabled = false
  docker.enabled = false
  pullTimeout = '4 h'
  cacheDir = \"$HOME/.nextflow-singularity-cache/\"
}
docker {
  docker.enabled = true
  singularity.enabled = false
  conda.enabled = false
}
" >> $HOME/.nextflow/config

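Note that >> appends to $HOME/.nextflow/config, so running the command above a second time writes the same settings twice. The following is a minimal sketch of one way to guard against that, shown on a temporary file. The function name append_once and the marker comment are hypothetical and are not part of Nextflow.

```shell
# Append text to a file only if a marker line is not already present.
#   $1 = file, $2 = marker (written as the first line of the block), $3 = text
append_once() {
  file="$1"
  marker="$2"
  text="$3"
  mkdir -p "$(dirname "$file")"
  if [ -f "$file" ] && grep -qF -- "$marker" "$file"; then
    echo "Already present in $file, nothing to do"
    return 0
  fi
  printf '%s\n%s\n' "$marker" "$text" >> "$file"
  echo "Added settings to $file"
}

# Demonstration on a temporary file rather than the real config
demo_config="$(mktemp -d)/config"
append_once "$demo_config" "// workshop settings" "conda.useMamba = true"
append_once "$demo_config" "// workshop settings" "conda.useMamba = true"
cat "$demo_config"
```

The second call finds the marker and leaves the file unchanged, so the settings appear only once.
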
Bandage

Generally, this software does not require installation: it can simply be downloaded from the website, unzipped and run. However, we provide command-line instructions that place the executable on the Desktop for easy access.

From the command line:

# install dependencies
sudo apt-get install -y qt5-default

# download the executable
wget -O bandage.zip "https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip"
unzip bandage.zip -d bandage
mv bandage/Bandage ~/Desktop/
rm -r bandage.zip bandage

From the WSL command line:

wget -O bandage.zip "https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip"
unzip bandage.zip -d bandage
mv bandage/Bandage ~/Desktop/
rm -r bandage.zip bandage

If you are using an Ubuntu virtual machine, you can follow the same instructions as for “Ubuntu”.

Singularity

We recommend that you install Singularity and use the -profile singularity option when running Nextflow pipelines. On Ubuntu/WSL2, you can install Singularity using the following commands:

sudo apt install -y runc cryptsetup-bin uidmap
CODENAME=$(lsb_release -cs)
wget -O singularity.deb https://github.com/sylabs/singularity/releases/download/v3.11.4/singularity-ce_3.11.4-${CODENAME}_amd64.deb
sudo dpkg -i singularity.deb
rm singularity.deb
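
The download URL above depends on the codename of your Ubuntu release and is specific to the amd64 architecture. Before downloading, it can help to print the URL and check that a matching file is listed on the Singularity releases page. This is a minimal sketch; the function name singularity_deb_url is hypothetical, and the version is the one used in the commands above.

```shell
# Build the download URL used above for a given Ubuntu codename.
singularity_deb_url() {
  sing_version="3.11.4"
  sing_codename="$1"
  echo "https://github.com/sylabs/singularity/releases/download/v${sing_version}/singularity-ce_${sing_version}-${sing_codename}_amd64.deb"
}

# lsb_release may be absent on non-Ubuntu systems, so fall back to a placeholder
codename=$(lsb_release -cs 2>/dev/null || echo "unknown")
echo "Codename: $codename"
echo "URL:      $(singularity_deb_url "$codename")"
```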

If you have a different Linux distribution, you can find more detailed instructions on the Singularity documentation page.

If you have issues running Nextflow pipelines with Singularity, then you can follow the instructions below for Docker instead.

Docker

An alternative for software management when running Nextflow pipelines is to use Docker.

For Ubuntu Linux, here are the installation instructions:

sudo apt install curl
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh
sudo groupadd docker
sudo usermod -aG docker $USER

After the last step, you will need to restart your computer. From now on, you can use -profile docker when you run Nextflow.
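
After restarting, you can check that your session has picked up the new group membership. This is a minimal sketch using the standard id command; the function name session_in_group is hypothetical.

```shell
# Succeeds if the current session belongs to the given group.
# "id -nG" without a user name lists the groups of the running session,
# which is what matters after the restart.
session_in_group() {
  id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if session_in_group docker; then
  echo "This session is in the docker group"
else
  echo "This session is not in the docker group yet (restart, or re-run usermod)"
fi
```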

When using WSL2 on Windows, running Nextflow pipelines with -profile singularity sometimes doesn’t work.

As an alternative you can instead use Docker, which is another software containerisation solution. To set this up, you can follow the full instructions given on the Microsoft Documentation: Get started with Docker remote containers on WSL 2.

We briefly summarise the instructions here (but check that page for details and images):

  • Download Docker for Windows.
  • Run the installer and install accepting default options.
  • Restart the computer.
  • Open Docker and go to Settings > General to tick “Use the WSL 2 based engine”.
  • Go to Settings > Resources > WSL Integration to enable your Ubuntu WSL installation.

Once Docker is installed and set up, you can use -profile docker when running your Nextflow command.

If you are using an Ubuntu virtual machine, you can follow the same instructions as for “Ubuntu”.

Visual Studio Code

  • Go to the Visual Studio Code download page and download the installer for your operating system. Double-click the downloaded file to install the software, accepting all the default options.
  • After completing the installation, go to your Windows Menu, search for “Visual Studio Code” and launch the application.
  • Go to File > Preferences > Settings, then select Text Editor > Files on the drop-down menu on the left. Scroll down to the section named “EOL” and choose “\n” (this will ensure that the files you edit on Windows are compatible with the Linux operating system).
  • Press Ctrl + Shift + X, which will open an “Extensions” panel on the left.
  • Search for “WSL” and click “Install”.

From now on, you can open VS Code directly from a WSL terminal by typing code . (the word code followed by a space and a dot, which stands for the current directory).
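
Files saved on Windows before the “EOL” setting was changed may still contain Windows line endings (\r\n). The following is a minimal sketch of how to detect and remove them from the terminal, demonstrated on a temporary file. The function name has_crlf is hypothetical; grep and tr are standard tools.

```shell
# Succeeds if the file contains carriage returns (Windows line endings).
has_crlf() {
  cr=$(printf '\r')
  grep -q "$cr" "$1"
}

# Demonstration on a temporary file with Windows line endings
demo=$(mktemp)
printf 'sample1\r\nsample2\r\n' > "$demo"
if has_crlf "$demo"; then
  echo "Windows line endings found"
fi

# Write a converted copy with the carriage returns removed
tr -d '\r' < "$demo" > "$demo.unix"
if has_crlf "$demo.unix"; then
  echo "conversion failed"
else
  echo "converted copy uses Linux line endings"
fi
```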

If you are using an Ubuntu virtual machine, you can follow the same instructions as for “Ubuntu”.

Data

The data used in these materials is provided as a set of zip files. We provide instructions to download and uncompress the data via the command line, which is the recommended way to make sure you have the correct directory structure. However, we also provide the direct links to the zip files, in case you prefer to download them manually.

First create a directory to store the files. Here, we create a directory for the workshop in the “Documents” folder (you can change this if you want to):

# create variable for working directory - change this if you want
workdir="$HOME/Documents/awd_bioinfo"
mkdir -p "$workdir"  # -p also creates missing parent folders and does not fail if the directory exists

Resources

We provide files for databases and public genomes used in different parts of the analysis. These files are required in addition to any other datasets. In summary, the download contains four directories:

  • mash_db - database for the software Mash, covered in the Read content chapter.
  • bakta_db - database for the software Bakta, covered in the Genome assembly chapter.
  • CheckM2_database - database for the CheckM2 program covered in the Assembly quality chapter.
  • vibrio_genomes - public genomes downloaded from NCBI and used in the Phylogenetics chapter.

We recommend downloading this file once and then creating a symbolic link (shortcut) to this folder from each of the analysis directories. This will reduce the storage space required for analysis.

Download this file using the command line:

# make sure you are in the workshop folder
cd $workdir

# download and unzip
wget -O resources.zip "https://www.dropbox.com/sh/t8ivljixrg0z1qz/AAD9fGRSyQHrCizxrBU1VMB-a?dl=1"
unzip resources.zip -d resources
rm resources.zip  # remove original zip file to save space

If you want to download this file manually: download resources.
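
Whichever way you download the file, it is worth checking that the four directories listed above ended up in the right place. This is a minimal sketch; the function name check_resources is hypothetical, and it assumes the directories sit directly inside the resources folder.

```shell
# Check that the four expected directories are present.
# Assumption: they sit directly inside the folder passed as $1.
check_resources() {
  rc=0
  for d in mash_db bakta_db CheckM2_database vibrio_genomes; do
    if [ -d "$1/$d" ]; then
      echo "ok:      $1/$d"
    else
      echo "missing: $1/$d"
      rc=1
    fi
  done
  return "$rc"
}

# Uses $workdir if it is set, otherwise the default location from this page
check_resources "${workdir:-$HOME/Documents/awd_bioinfo}/resources" \
  || echo "Some directories were not found; check where the zip file was extracted."
```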

Ambroise 2023

This dataset includes 5 samples sequenced on an ONT platform, and published in Ambroise et al. 2023. Here are the details about these data:

  • Number of samples: 5
  • Origin: samples from cholera patients from the Democratic Republic of the Congo.
  • Sample preparation: stool samples were collected and used for plate culture in media appropriate to grow Vibrio species; ONT library preparation and barcoding were done using standard kits.
  • Sequencing platform: MinION
  • Basecalling: Guppy version 6 in high accuracy (“hac”) mode (this information is not actually specified in the manuscript, but we are making this assumption, just as an example).

To download the data, you can run the following commands:

# make sure you are in the workshop folder
cd $workdir

# download and unzip
wget -O ambroise.zip "https://www.dropbox.com/sh/xytht4upehuo4c3/AABeYpICT2uAQzGBy4IzsKKwa?dl=1"
unzip ambroise.zip -d ambroise2023
rm ambroise.zip  # remove original zip file to save space

# create link to resources directory
ln -s $PWD/resources/ $PWD/ambroise2023/resources

If you want to download this file manually: download Ambroise 2023.
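
The ln -s command above stores an absolute path ($PWD/resources/), so the link stops working if the workshop folder is later moved or renamed. A relative link is one alternative. The following sketch demonstrates the idea in a throwaway directory; it is an illustration, not a replacement for the command above.

```shell
# Demonstration in a temporary directory, so nothing in your workshop folder changes
demo=$(mktemp -d)
mkdir -p "$demo/awd_bioinfo/resources" "$demo/awd_bioinfo/ambroise2023"

# A relative target is resolved from the directory that contains the link,
# so "../resources" points at awd_bioinfo/resources
ln -s ../resources "$demo/awd_bioinfo/ambroise2023/resources"

# Move the whole workshop folder and check that the link still resolves
mv "$demo/awd_bioinfo" "$demo/moved"
if [ -d "$demo/moved/ambroise2023/resources" ]; then
  echo "link still works after moving the folder"
else
  echo "link is broken"
fi
```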

Scripts only

We also provide a folder containing only the scripts used in the exercises. This is useful if you want to use your own data.

Here are the commands to download these scripts:

# make sure you are in the workshop folder
cd $workdir

# download and unzip
wget -O minimal.zip "https://www.dropbox.com/sh/f421dkyos4us4ty/AABmomHwzL1miVvStaDQA4gma?dl=1"
unzip minimal.zip -d minimal
rm minimal.zip  # remove original zip file to save space

# create link to resources directory
ln -s $PWD/resources/ $PWD/minimal/resources

If you want to download this file manually: download scripts only.