21 de novo Assembly and Annotation
- Describe the process of de novo genome assembly and how it differs from reference-based assembly.
- List some of the common software that is used for de novo assembly.
- Describe the purpose of genome annotation.
21.1 Introduction
There are two approaches for genome assembly: reference-based (or comparative) or de novo. In a reference-based assembly, we use a reference genome as a guide to map our sequence data to and thus reassemble our sequence this way (this is what we did in the previous module). Alternatively, we can create a ‘new’ (de novo) assembly that does not rely on a map or reference and more closely reflects the actual genome structure of the isolate that was sequenced. Genome assemblers work by calculating overlaps between reads and (usually) represent these as a graph or network. They then “walk” the graph to determine the original sequence
21.2 Genome assemblers
Several tools are available for de novo genome assembly, depending on whether you use short-read sequence data, long reads, or a combination of both. Two of the most commonly used assemblers for short-read Illumina data are Velvet
and SPAdes
. SPAdes has become the de facto standard de novo genome assembler for Illumina whole genome sequencing data of bacteria and is a major improvement over previous assemblers like Velvet. However, some of its components can be slow and it traditionally did not handle overlapping paired-end reads well. Shovill
is a pipeline which uses SPAdes
at its core, but alters the steps before and after the primary assembly step to get similar results in less time. Shovill also supports other assemblers like SKESA
, Velvet
and Megahit
.
21.3 Genome annotation
Genome annotation is a multi-level process that includes prediction of protein-coding genes (CDSs), as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements. The most commonly used tools for annotating bacterial genomes are Prokka
and, more recently, Bakta
. Both use a tool called prodigal
to predict the protein-coding regions along with other tools for predicting other genomic features such as Aragorn
for tRNA. Once the genomic regions have been predicted, the tools use a database of existing bacterial genome annotations, normally generated from a large collection of genomes such as UniRef, to add this information to your genome.
21.4 Summary
- de novo genome assembly consists of reconstructing a genome from sequencing reads, without prior information about its structure.
- Assembling genomes de novo requires identifying how the millions of sequencing reads overlap with each other, inferring a “path” through an overlap graph.
- There are many software packages for de novo assembly, which can be chosen depending on the type of data being used (short read, long read, or both).
- SPades is a popular software for bacterial de novo assembly from short read data. In particular, the Shovill pipeline uses this software in a more efficient way.
- Genome annotation consists of identifying the location of functional elements in the genome, such as genes.
- Bakta is a recent genome annotation package for bacteria, which relies in part on known annotated sequences available from public databases such as UniRef.
References
Comparative Genomics (link) (accessed 2023)
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016 (link) (accessed 2023)