8 Downloading sequence data
- Download sequence data from a public database with the nf-core pipeline fetchngs.
8.1 Pipeline Overview
fetchngs is a bioinformatics analysis pipeline written in Nextflow to automatically download and process raw FASTQ files from public databases. Identifiers can be provided in a file, and any type of accession ID found in the SRA, ENA, DDBJ and GEO databases is supported. If run accessions (SRR/ERR/DRR) are provided, these will be resolved back to the experiment-level accessions (SRX/ERX/DRX) to allow multiple runs from the same experiment to be merged. As well as the FASTQ files, fetchngs will also produce a samplesheet.csv file containing the sample metadata obtained from the ENA. This file can be used as input for other nf-core and Nextflow pipelines like the ones we’ll be using this week.
8.2 Prepare a samples file
fetchngs requires a samples file with the accessions you would like to download. The file requires the suffix .csv but does not need to be in CSV format. Each line must contain a single database ID:
ERR9907668
ERR9907669
ERR9907670
ERR9907671
ERR9907672
Your first task is to create a samples.csv file to be used as input for fetchngs. Use the following accessions:
ERR9907668
ERR9907669
ERR9907670
ERR9907671
ERR9907672
Make sure you save the file in the ~/Course_Materials/M_tuberculosis directory.
We opened a text editor (e.g. nano or vim), copied and pasted the accessions, and saved the file as samples.csv in the ~/Course_Materials/M_tuberculosis directory.
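As an alternative to a text editor, the same file can be written from the command line. The sketch below assumes it is run from inside the ~/Course_Materials/M_tuberculosis directory. The final grep check is our own addition and only recognises run accessions (SRR/ERR/DRR), not every ID type that fetchngs supports.

```shell
# Assumes the current directory is ~/Course_Materials/M_tuberculosis
# Write one accession per line to samples.csv
cat > samples.csv << 'EOF'
ERR9907668
ERR9907669
ERR9907670
ERR9907671
ERR9907672
EOF

# Count the accessions in the file
wc -l < samples.csv

# Illustrative sanity check: list any line that is not a run accession
# (prints nothing if every line is SRR/ERR/DRR followed by digits)
grep -vnE '^[SED]RR[0-9]+$' samples.csv || true
```

If the grep command prints nothing, every line in the file looks like a run accession.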
8.3 Running fetchngs
Now that we have the samples.csv file, we can run the fetchngs pipeline. First, let’s activate the nextflow software environment:
mamba activate nextflow
There are many options that can be used to customise the pipeline but a typical command is shown below:
nextflow run nf-core/fetchngs \
-profile singularity \
--max_memory '16.GB' --max_cpus 8 \
--input SAMPLES \
--outdir results/fetchngs \
--nf_core_pipeline viralrecon \
--download_method sratools \
-resume
The options we used are:
- -profile singularity: indicates we want to use the Singularity program to manage all the software required by the pipeline (another option is to use docker). See Data & Setup for details about their installation.
- --max_memory and --max_cpus: set the available RAM and CPUs. You can check these with the commands free -h and nproc --all, respectively.
- --input: the samples file with the accessions to be downloaded, as explained above.
- --nf_core_pipeline: the name of a supported nf-core pipeline, e.g. ‘viralrecon’. A samplesheet for direct use with that pipeline will be created with the appropriate columns.
- --download_method: forces the pipeline to use sratools instead of a direct FTP download.
- -resume: all Nextflow pipelines can be resumed. It isn’t necessary for the first run of the pipeline but it’s good practice to include it in the command.
In our Nextflow command above we have set --max_memory '16.GB' --max_cpus 8 to limit the resources used in the analysis. This is suitable for the computers we are using in this workshop. However, make sure to set these options to the maximum resources available on the computer where you process your data.
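To adapt these limits to another machine, the values can be read from the system rather than typed by hand. This is a minimal sketch that assumes a Linux system: it reads the total memory from /proc/meminfo, and the variable names cpus and mem_gb are our own.

```shell
# Total number of CPUs on this machine
cpus=$(nproc --all)

# Total RAM in whole gigabytes (MemTotal is reported in kB; Linux only)
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)

# Print the two options in the form used in the nextflow command
echo "--max_memory '${mem_gb}.GB' --max_cpus ${cpus}"
```

The memory value is rounded down to a whole number. On a shared machine you may prefer to set lower values than the totals reported here.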
Your next task is to download sequence data with fetchngs. In the folder scripts (in the M_tuberculosis analysis directory) you will find a script named 01-run_fetchngs.sh. This script contains the code to run fetchngs.
- Edit this script, adjusting it to fit your input file.
- Run the script using bash scripts/01-run_fetchngs.sh. If the script is running successfully, it should start printing the progress of each job in the fetchngs pipeline.
- The fixed script is:
#!/bin/bash
nextflow run nf-core/fetchngs \
-profile singularity \
--max_memory '16.GB' --max_cpus 8 \
--input samples.csv \
--outdir results/fetchngs \
--nf_core_pipeline viralrecon \
--download_method sratools \
-resume
- We ran the script as instructed using:
bash scripts/01-run_fetchngs.sh
- While it was running it printed a message on the screen:
executor > local (20)
[5d/682f49] process > NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR9907668) [100%] 5 of 5 ✔
[9d/ab8a7b] process > NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP (5) [100%] 5 of 5 ✔
[f9/39e27c] process > NFCORE_FETCHNGS:SRA:SRA_FASTQ_FTP (ERX9450498_ERR9907670) [ 80%] 4 of 5
[82/c5f847] process > NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:CUSTOM_SRATOOLSNCBISETTINGS (ncbi-settings) [100%] 1 of 1 ✔
[- ] process > NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_PREFETCH -
[- ] process > NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP -
[2e/4fd490] process > NFCORE_FETCHNGS:SRA:SRA_TO_SAMPLESHEET (ERX9450498_ERR9907670) [100%] 4 of 4
[- ] process > NFCORE_FETCHNGS:SRA:SRA_MERGE_SAMPLESHEET -
[- ] process > NFCORE_FETCHNGS:SRA:MULTIQC_MAPPINGS_CONFIG -
8.4 fetchngs results
Once fetchngs has run, we can look at the various directories it created in results/fetchngs:
Directory | Description |
---|---|
custom | Contains settings to help the pipeline run |
fastq | Paired-end/single-end reads downloaded from the SRA/ENA/DDBJ/GEO for each accession in the samples.csv file |
metadata | Contains the re-formatted ENA metadata for each sample |
samplesheet | Contains the samplesheet with collated metadata and paths to downloaded FASTQ files |
pipeline_info | Contains information about the pipeline run |
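A quick way to confirm the download worked is to count the FASTQ files and preview the samplesheet. The function below is an illustrative sketch: summarise_fetchngs is a hypothetical helper name, and it assumes that the FASTQ files end in .fastq.gz and that the samplesheet is saved as samplesheet/samplesheet.csv inside the output directory.

```shell
# Hypothetical helper: summarise a fetchngs output directory
summarise_fetchngs() {
    local outdir="${1:-results/fetchngs}"

    if [ ! -d "$outdir" ]; then
        echo "No such directory: $outdir" >&2
        return 1
    fi

    # Count the downloaded FASTQ files (assumes a .fastq.gz suffix)
    local n_fastq
    n_fastq=$(find "$outdir/fastq" -name '*.fastq.gz' 2>/dev/null | wc -l)
    echo "FASTQ files: $((n_fastq))"

    # Preview the samplesheet (assumed path) if it exists
    local sheet="$outdir/samplesheet/samplesheet.csv"
    if [ -f "$sheet" ]; then
        echo "Samplesheet rows (excluding header): $(( $(wc -l < "$sheet") - 1 ))"
        head -n 3 "$sheet"
    else
        echo "Samplesheet not found: $sheet"
    fi
}
```

After defining the function, call it as summarise_fetchngs results/fetchngs from the directory where the pipeline was run.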
Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory. This means that if, for whatever reason, the pipeline doesn’t finish successfully you can resume it. However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:
rm -rf work
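Because deleting the work directory also removes the ability to resume, it can be worth checking that the run finished before cleaning up. The sketch below is one cautious way to do this. clean_work is a hypothetical helper, and treating the presence of the samplesheet (assumed to be at samplesheet/samplesheet.csv) as a sign of completion is our own assumption, not a Nextflow feature.

```shell
# Hypothetical helper: delete the work directory only if the run looks complete
clean_work() {
    local outdir="${1:-results/fetchngs}"
    local workdir="${2:-work}"

    # Assumption: a merged samplesheet is a rough sign that the run completed
    if [ ! -f "$outdir/samplesheet/samplesheet.csv" ]; then
        echo "Samplesheet not found; keeping $workdir so the run can be resumed" >&2
        return 1
    fi

    if [ -d "$workdir" ]; then
        # Report how much space will be freed, then delete the directory
        du -sh "$workdir"
        rm -rf "$workdir"
    fi
}
```

Called without arguments, clean_work checks results/fetchngs and removes work in the current directory.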
8.5 Summary
- The nf-core fetchngs pipeline can be used to quickly download sequence data from public databases such as the ENA and GEO.
- The pipeline also produces a samplesheet.csv file that can be used as input for other nf-core and Nextflow pipelines.