35 Identifying vaccine candidates with reverse vaccinology
- Understand the principles of reverse vaccinology and its application in vaccine development.
- Learn the key steps involved in a reverse vaccinology workflow, from genome sequencing to candidate selection.
- Gain hands-on experience with bioinformatics tools used in reverse vaccinology, including PSORTb, CD-HIT, and DIAMOND.
35.1 The Traditional vs. Reverse Vaccinology Approach
To understand the innovation of reverse vaccinology, it’s helpful to compare it to the conventional method.
Traditional Vaccinology (Forward): This classic method involves growing large quantities of a pathogen in the lab, inactivating it, and then breaking it down to isolate its constituent parts (like surface proteins or polysaccharides). These isolated components are then tested, often through a long process of trial and error, to see if they can induce a protective immune response. This approach is limited to pathogens that can be cultured in the lab and often only identifies the most abundant components, potentially missing other effective antigens.
Reverse Vaccinology (Backward): This modern approach begins with the complete genomic sequence of the pathogen. Using computational tools (a process known as in silico analysis), the entire genome is scanned to predict which genes code for proteins that are likely to be good vaccine candidates. This allows for a comprehensive and rational selection of targets before any lab work on the pathogen itself even begins.
35.2 The Reverse Vaccinology Workflow
The process can be broken down into four key steps, moving from the digital sequence to a tangible vaccine candidate.
1. Genome Sequencing and In Silico Analysis: The process starts by sequencing the entire genome of the target bacterium. Bioinformatics software then analyzes this genetic information to create a list of potential antigens. The algorithms prioritize genes that code for proteins with desirable characteristics, such as:
   - Surface Location: Proteins that are secreted or located on the pathogen’s outer surface are ideal targets because they are easily accessible to the host’s immune system.
   - Virulence Factors: Proteins critical for the bacterium’s survival or ability to cause disease are excellent candidates.
   - Broad Conservation: The protein should be present in most or all strains of the pathogen to ensure the vaccine provides broad protection.
   - No Human Homology: The selected proteins must be dissimilar to human proteins to prevent the immune system from attacking the body’s own cells (autoimmunity).
2. Cloning and Expression: The genes that were flagged as promising candidates during the in silico screening are then synthesized and cloned into a laboratory expression system, typically E. coli. This turns the bacteria into tiny factories that produce large quantities of the selected proteins for testing.
3. Immunological Screening: The purified proteins are used to immunize lab animals, such as mice. Scientists then analyze the animals’ immune response by:
   - Measuring antibody levels in their blood serum.
   - Testing whether these antibodies can effectively kill the target bacterium in a lab setting (e.g., through a serum bactericidal assay).
4. Candidate Validation: Proteins that elicit a strong, protective immune response in the animal models are identified as the top vaccine candidates. These candidates undergo further testing to confirm their potential before moving into human clinical trials.
For the purposes of this training course, we will focus on the first step: Genome sequencing and in silico analysis.
35.3 Case study: the Meningococcus B (MenB) vaccine
The most prominent success story for reverse vaccinology is the development of the vaccine against Neisseria meningitidis serogroup B (MenB), a leading cause of bacterial meningitis.
For decades, a MenB vaccine remained elusive. Traditional vaccine strategies often target the polysaccharide capsule surrounding the bacteria. However, the MenB capsule is made of polysialic acid, which is structurally identical to a sugar found on human neural cells. As a result, it is poorly immunogenic, and using it in a vaccine could trigger a dangerous autoimmune reaction.
By applying reverse vaccinology, researchers sequenced the MenB genome and identified hundreds of potential protein antigens. After extensive screening, a few key proteins were selected that induced a robust and protective immune response. These proteins formed the basis of the Bexsero (4CMenB) vaccine, which is now used worldwide to prevent MenB disease. This breakthrough would have been incredibly difficult, if not impossible, using conventional methods alone.
35.4 Running a reverse vaccinology workflow
The purpose of this exercise is to familiarize you with the types of analyses involved in a reverse vaccinology workflow. We will not be running a complete analysis here, but rather focusing on the key steps. The end point of this exercise will be a list of potential vaccine candidates. We have provided a mamba environment with all the necessary software pre-installed. Before we start, please activate the environment:
mamba activate reverse-vaccinology
Now navigate to the N_meningitidis directory within the course materials.
35.4.1 Genome annotation
The first step is to identify all the potential genes and translate them into protein sequences. We have used Bakta for this (you should already be familiar with Bakta as assembleBAC runs Bakta as part of the annotation step) and have provided the protein sequences that Bakta predicted. These are in files with the suffix .faa located within the data/faa folder.
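If you would like to inspect the provided protein files before moving on, the short Python sketch below reads FASTA-formatted text and yields one (ID, sequence) pair per protein. It is not part of the course scripts: the function name read_fasta and the toy records are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the header text up to the first whitespace.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Toy records invented for illustration; real .faa files are much larger.
example = [
    ">PROT_00001 hypothetical protein",
    "MKKLLA",
    "VAAL",
    ">PROT_00002 putative porin",
    "MNRT",
]

records = list(read_fasta(example))
print(len(records), "proteins")
```

To use it on a real file, pass an open file handle instead of the toy list, for example the handle returned by open("data/faa/ERX029793.faa").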
35.4.2 Subcellular location prediction
Now we need to predict where each protein resides in the cell. We are interested in outer membrane and secreted proteins, which PSORTb labels OuterMembrane and Extracellular. PSORTb is a bioinformatics tool that predicts the subcellular localization of bacterial proteins from their amino acid sequences. It uses a combination of machine learning algorithms and curated databases to classify proteins into cellular compartments: cytoplasm, cytoplasmic (inner) membrane, periplasm, outer membrane, and extracellular space. The command for running PSORTb is as follows:
./psortb_app -i data/faa/ERX029793.faa -r results/psortb/ERX029793 -n
The options used are:
- -i: the input file containing amino acid sequences in FASTA format.
- -r: the output directory where results will be saved.
- -n: specifies that the input sequences are from a Gram-negative bacterium.
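Once the predictions are available, selecting the surface-exposed proteins is a simple filter. The sketch below assumes the predictions have been reduced to a tab-separated table with the header columns SeqID, Localization and Score. That layout is an assumption made for this example, so check it against your own files before reusing the code. The function name and the toy rows are also invented for illustration.

```python
import csv
from typing import Dict, Iterable

# Localization labels we want to keep. PSORTb reports secreted
# proteins as "Extracellular".
SURFACE_EXPOSED = {"OuterMembrane", "Extracellular"}


def surface_exposed(lines: Iterable[str]) -> Dict[str, str]:
    """Map protein ID -> localization for outer membrane and secreted proteins.

    Assumes a tab-separated table with a header row containing the
    columns SeqID and Localization (an assumption, see the text above).
    """
    selected = {}
    for row in csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        seq_id = row["SeqID"].strip()
        if not seq_id:
            continue
        if row["Localization"] in SURFACE_EXPOSED:
            # Keep only the ID, dropping any description after the first space.
            selected[seq_id.split()[0]] = row["Localization"]
    return selected


# Toy table invented for illustration, not real PSORTb output.
toy_table = [
    "SeqID\tLocalization\tScore",
    "PROT_00001 hypothetical protein\tCytoplasmic\t9.97",
    "PROT_00002 putative porin\tOuterMembrane\t10.00",
    "PROT_00003 putative protease\tExtracellular\t9.71",
    "PROT_00004 hypothetical protein\tUnknown\t2.00",
]

surface = surface_exposed(toy_table)
print(surface)
```

Proteins labelled Unknown are dropped here. Whether to keep them for manual review is a design choice that depends on how conservative you want the screen to be.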
35.4.3 Conservation analysis
To find proteins that are present in most of our strains, we can cluster all proteins from all genomes using CD-HIT. We have provided faa files for all samples in the data/faa folder. CD-HIT will group similar proteins together based on a specified sequence identity threshold. For this exercise, we will use a threshold of 95% identity, meaning that proteins that are at least 95% identical will be grouped into the same cluster. This helps us identify conserved proteins across different strains of Neisseria meningitidis. To run CD-HIT, we can use the following command:
# run cd-hit to cluster proteins at 95% identity
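cd-hit -i data/faa/all_proteins.faa -o results/cd-hit/conserved_clusters.txt -c 0.95 -d 0
The options we used are:
- -i: the input file containing all amino acid sequences in FASTA format.
- -o: the output file. CD-HIT writes the representative sequence of each cluster to this file in FASTA format and the cluster membership to a companion file with the same name plus the suffix .clstr.
- -c: the sequence identity threshold (0.95 for 95% identity).
- -d: the length of the sequence description in the .clstr file (0 keeps the sequence name up to the first space instead of truncating it at the default 20 characters).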
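To turn clusters into a conservation measure, we can count how many genomes contribute at least one protein to each cluster. The sketch below parses the standard CD-HIT .clstr layout and reports that fraction per cluster. It assumes that protein IDs look like a genome-specific prefix followed by an underscore and a number (as in Bakta-style locus tags), which may not hold for your data. The function names and toy lines are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Member lines look like: "0<TAB>350aa, >PROT_ID... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map cluster name (e.g. 'Cluster 0') -> list of member protein IDs."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def prefix_before_last_underscore(protein_id: str) -> str:
    """Hypothetical rule: 'GENA_00010' belongs to genome 'GENA'."""
    return protein_id.rsplit("_", 1)[0]


def conservation(
    clusters: Dict[str, List[str]],
    n_genomes: int,
    genome_of: Callable[[str], str] = prefix_before_last_underscore,
) -> Dict[str, float]:
    """Fraction of genomes with at least one member in each cluster."""
    return {
        name: len({genome_of(member) for member in members}) / n_genomes
        for name, members in clusters.items()
    }


# Toy cluster file invented for illustration.
toy_clstr = [
    ">Cluster 0",
    "0\t350aa, >GENA_00010... *",
    "1\t350aa, >GENB_00012... at 98.57%",
    "2\t349aa, >GENC_00009... at 97.13%",
    ">Cluster 1",
    "0\t120aa, >GENA_00044... *",
]

fractions = conservation(parse_clstr(toy_clstr), n_genomes=3)
print(fractions)
```

Counting distinct genomes rather than cluster members avoids overstating conservation when one genome carries several near-identical copies of a protein.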
35.4.4 Functional annotation and human homology check
For function and safety, we will use DIAMOND, an ultra-fast alternative to BLAST. We will use it to compare our protein sequences against two databases:
- Swiss-Prot: A high-quality, manually curated database for functional annotation.
- Human Proteome: For checking against human proteins to avoid autoimmunity.
The first step will give us the likely function of each protein, while the second step will help us filter out any proteins that are too similar to human proteins and should therefore be avoided as vaccine candidates. The diamond commands we will use are as follows:
# Search against Swiss-Prot for functional annotation
diamond blastp -d databases/swissprot.dmnd -q results/cd-hit/all_proteins.faa -o results/homology_searches/function.tsv --outfmt 6 qseqid stitle evalue
# Search against Human Proteome for homology check
diamond blastp -d databases/human_proteome.dmnd -q results/cd-hit/all_proteins.faa -o results/homology_searches/human_homology.tsv --outfmt 6 qseqid stitle evalue pident
The options we used are:
- -d: the database to search against (Swiss-Prot or Human Proteome).
- -q: the input file containing amino acid sequences in FASTA format.
- -o: the output file where results will be saved.
- --outfmt 6: tabular (tab-separated) output. The names that follow select the columns to report: qseqid (query sequence ID), stitle (title of the matching database sequence), evalue (E-value of the match) and pident (percentage of identical positions).
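The human homology table can then be reduced to a set of proteins to exclude. The sketch below reads rows in the column order requested in the command above (qseqid, stitle, evalue, pident) and flags any protein with a sufficiently strong human hit. The cut-offs (E-value at most 1e-5, identity at least 35%) are illustrative assumptions rather than values prescribed by this course, and the function name and toy rows are invented.

```python
from typing import Iterable, Set


def human_like(
    lines: Iterable[str],
    max_evalue: float = 1e-5,   # illustrative threshold, an assumption
    min_pident: float = 35.0,   # illustrative threshold, an assumption
) -> Set[str]:
    """Return IDs of query proteins with a strong hit to a human protein.

    Expects tab-separated rows without a header, in the order
    qseqid, stitle, evalue, pident.
    """
    flagged = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        qseqid, _stitle, evalue, pident = fields[:4]
        if float(evalue) <= max_evalue and float(pident) >= min_pident:
            flagged.add(qseqid)
    return flagged


# Toy rows invented for illustration, not real DIAMOND output.
toy_hits = [
    "PROT_00002\ttoy human protein A\t1e-40\t62.5",
    "PROT_00003\ttoy human protein B\t0.5\t80.0",
    "PROT_00005\ttoy human protein C\t1e-20\t22.0",
]

to_exclude = human_like(toy_hits)
print(to_exclude)
```

Requiring both a low E-value and a minimum identity is one possible rule. A stricter screen could exclude a protein on either criterion alone.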
35.4.5 Compiling vaccine candidates
Now that we have all the necessary data, we can compile a list of potential vaccine candidates by integrating the results from PSORTb, CD-HIT, and DIAMOND. We have provided a script called combine_results.py that does just that. This script takes the outputs from the previous analyses and combines them into a single CSV file called vaccine_candidates.csv. This file contains all the relevant information for each protein, including its predicted location, conservation across strains, functional annotation, and human homology status. To run the script, simply execute the following command in your terminal:
python scripts/combine_results.py --locations "preprocessed/psortb/locations/*.csv" --conservation results/cd-hit/conserved_clusters_conservation.csv --function results/homology_searches/function.tsv --human results/homology_searches/human_homology.tsv --output vaccine_candidates.csv
Open the vaccine_candidates.csv file to explore the compiled data. The four proteins included in the MenB vaccine are fHbp, NHBA, NadA, and PorA. Are they present in this list? You may need to search by their gene names or their product descriptions. Unfortunately, due to the small number of genomes we are using in this exercise, it is unlikely that the four vaccine components will be present in the final list.
For the purposes of this training course, we selected a small number of Neisseria meningitidis genomes at random. In a real reverse vaccinology study, you would typically analyze a much larger dataset to ensure that the identified vaccine candidates are broadly conserved across diverse strains of the pathogen.
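The selection logic used throughout this section (surface-exposed, conserved, not similar to human proteins) can be expressed as a single filter. The sketch below is not the provided combine_results.py script, whose internals are not shown here. It is a minimal illustration that assumes each criterion has already been summarized per protein ID, and the 0.9 conservation cut-off is an arbitrary example value.

```python
from typing import Dict, List, Set

# Localization labels treated as surface-exposed in this sketch.
WANTED_LOCATIONS = {"OuterMembrane", "Extracellular"}


def select_candidates(
    locations: Dict[str, str],       # protein ID -> predicted localization
    conservation: Dict[str, float],  # protein ID -> fraction of genomes
    human_hits: Set[str],            # protein IDs with a human homolog
    min_conservation: float = 0.9,   # arbitrary example cut-off
) -> List[str]:
    """Return protein IDs that pass all three criteria, sorted by ID."""
    keep = [
        protein_id
        for protein_id, location in locations.items()
        if location in WANTED_LOCATIONS
        and conservation.get(protein_id, 0.0) >= min_conservation
        and protein_id not in human_hits
    ]
    return sorted(keep)


# Toy inputs invented for illustration.
toy_locations = {
    "P1": "OuterMembrane",
    "P2": "Cytoplasmic",
    "P3": "Extracellular",
    "P4": "OuterMembrane",
}
toy_conservation = {"P1": 1.0, "P2": 1.0, "P3": 0.5, "P4": 1.0}
toy_human_hits = {"P4"}

candidates = select_candidates(toy_locations, toy_conservation, toy_human_hits)
print(candidates)
```

In the toy data, P2 fails on location, P3 on conservation and P4 on human homology, so only P1 remains. Proteins with no conservation value are treated as not conserved, which is the cautious default.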
35.5 Summary
- Reverse vaccinology is a genome-based approach to identify potential vaccine candidates by analyzing the genetic makeup of pathogens.
- The workflow involves genome sequencing, subcellular location prediction, conservation analysis, functional annotation, and human homology checks.
- Bioinformatics tools like PSORTb, CD-HIT, and DIAMOND play crucial roles in the reverse vaccinology process.
- The MenB vaccine is a successful example of reverse vaccinology, demonstrating its potential to revolutionize vaccine development.