A guide to in silico vaccine discovery for eukaryotic pathogens
BRIEFINGS IN BIOINFORMATICS. VOL 14. NO 6. 753–774
Advance Access published on 24 October 2012
doi:10.1093/bib/bbs066
Stephen J. Goodswen, Paul J. Kennedy and John T. Ellis
Submitted: 21st April 2012; Received (in revised form): 23rd August 2012
Abstract
Keywords: reverse vaccinology; eukaryotic pathogens; in silico vaccine discovery; apicomplexans; immunoinformatics
INTRODUCTION
After almost a century of laboratory culture-based
approaches to vaccine discovery, researchers are
beginning to capitalize on the vast potential of omics data (genomes, transcriptomes and proteomes)
to make an in silico approach to vaccine discovery
possible, without the need to cultivate the pathogen.
Eukaryotic pathogens are extremely complicated systems with multifaceted life cycles. The key challenge
of this in silico approach is how best to transform
mere biological abstractions of complex systems (in
the form of digital information) into the knowledge
required to identify vaccine candidates.
In 2000, Rino Rappuoli [1] first proposed the idea
of mining biological data to predict antigens that are
most likely to be vaccine candidates. Effectively, the
wet laboratory in the traditional culture-based
approach to cultivate, dissect and identify antigens
is replaced by a computer. His approach has been
widely accepted as a way of discovering vaccines
and is referred to as ‘Reverse Vaccinology’ because,
in its basic form, the approach starts
with the genome of the pathogen rather than the
pathogen itself. There are several successful applications of reverse vaccinology to the discovery of subunit vaccines against prokaryotic pathogens [2–6].
The key to subunit vaccine development is the
successful identification of pathogen molecules that
evoke a safe immune response, as opposed to using the
entire pathogen. The candidate molecules
from a eukaryotic pathogen expected to induce immunity comprise proteins that are as follows: (i) present on the surface of the pathogen, (ii) excreted/
secreted from the pathogen and (iii) homologous to
Corresponding author. John T. Ellis, School of Medical and Molecular Sciences, Ithree Institute, University of Technology Sydney.
Tel.: +61 2 9514 4161; E-mail: .
Stephen Goodswen did his research for MSc at CSIRO while enrolled at the University of New England. He is now pursuing a PhD
at the University of Technology Sydney focusing on an in silico vaccine discovery pipeline for parasitic protozoa.
Paul Kennedy obtained his PhD in Computing Science at the University of Technology, Sydney, in 1999 where he currently directs
the Knowledge Infrastructure Laboratory in the Centre for Quantum Computation and Intelligent Systems. His interests involve data
mining of biomedical data, particularly visualization and classification of childhood cancer patients using their systems biology.
John Ellis has research interests focused on translational research that includes development of vaccines and diagnostics for parasitic
diseases of economic importance. For the past 20 years, he has studied parasitic protozoa of both veterinary and medical importance and
in recent times has broadened his interests to environmental protozoology and groundwater.
© The Author 2012. Published by Oxford University Press. For Permissions, please email:
In this article, a framework for an in silico pipeline is presented as a guide to high-throughput vaccine candidate discovery for eukaryotic pathogens, such as helminths and protozoa. Eukaryotic pathogens are mostly parasitic and
cause some of the most damaging and difficult-to-treat diseases in humans and livestock. Consequently, these parasitic pathogens have a significant impact on the economy and on human health. The pipeline is based on the principle of reverse vaccinology and is constructed from freely available bioinformatics programs. There are several successful
applications of reverse vaccinology to the discovery of subunit vaccines against prokaryotic pathogens but not yet
against eukaryotic pathogens. The overriding aim of the pipeline, which focuses on eukaryotic pathogens, is to generate, through computational processes of elimination and evidence gathering, a ranked list of proteins based on a
scoring system. These proteins are either surface components of the target pathogen or are secreted by the pathogen and are of a type known to be antigenic. No perfect predictive method is yet available; therefore, the
highest-scoring proteins from the list require laboratory validation.
reviews and suggests freely available bioinformatics
programs that can complete each explicit stage of
an in silico vaccine discovery pipeline.
PIPELINE OVERVIEW
As a proof of concept, a vaccine discovery pipeline
was constructed and evaluated using data from the
eukaryotic pathogen Toxoplasma gondii, which is an
important model system for the phylum
Apicomplexa [8–10]. The focus here, however, is
on the construction of the pipeline, and no attempt
is made to propose scientific findings for T. gondii, as
it is beyond the scope of the present article. Despite
the similarity of eukaryotic pathogens, realistically
there can be no ‘off-the-shelf’ pipeline for vaccine
discovery that would instantly work for all pathogens. A generic pipeline, nevertheless, comprising
the same linked programs can theoretically be used.
The challenge from a user’s perspective is that these
programs critically need appropriate training sets specific to the pathogen of interest.
A pipeline here simply refers to a chain of data
processing stages. Freely available bioinformatics programs are suggested for each stage described herein.
An ideal objective of the pipeline is to have a seamless transition from start to end in which the output
of each stage is the input of the next one. The transition between the stages can be achieved by writing
simple parsing and reformatting programs. A critical
aspect of these programs that tie the pipeline together is extracting the pertinent data from the
stage outputs and providing logic to accept or
reject the data from the pipeline. Example stage outputs are provided throughout the present article, and
the parts of the output that are useful are indicated.
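A stage transition of this kind can be sketched as follows. This is a minimal illustration only, written in Python rather than the Perl used for the pipeline presented. The tab-delimited output format (protein identifier, predicted label, score), the labels 'secreted' and 'membrane', the 0.5 score threshold and all function names are hypothetical assumptions and do not correspond to the output of any particular prediction program. The sketch parses the output of one stage, applies accept/reject logic, and reformats the accepted proteins as FASTA input for the next stage.

```python
from typing import Dict, FrozenSet, List

Record = Dict[str, str]


def parse_stage_output(text: str) -> List[Record]:
    """Parse a hypothetical tab-delimited stage output: id, label, score."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment/header lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        protein_id, label, score = line.split("\t")
        records.append({"id": protein_id, "label": label, "score": score})
    return records


def accept(record: Record, wanted_labels: FrozenSet[str], min_score: float) -> bool:
    """Keep a protein only if its label is wanted and its score is high enough."""
    return record["label"] in wanted_labels and float(record["score"]) >= min_score


def to_fasta(protein_ids: List[str], sequences: Dict[str, str]) -> str:
    """Reformat the accepted proteins as FASTA text for the next stage."""
    return "".join(f">{pid}\n{sequences[pid]}\n" for pid in protein_ids)


def run_transition(
    raw_output: str,
    sequences: Dict[str, str],
    wanted_labels: FrozenSet[str] = frozenset({"secreted", "membrane"}),  # assumed labels
    min_score: float = 0.5,  # assumed threshold
) -> str:
    """Parse one stage's output, filter it, and build the next stage's input."""
    kept = [
        r["id"]
        for r in parse_stage_output(raw_output)
        if accept(r, wanted_labels, min_score)
    ]
    return to_fasta(kept, sequences)


if __name__ == "__main__":
    raw = (
        "# id\tlabel\tscore\n"
        "P1\tsecreted\t0.91\n"
        "P2\tcytoplasmic\t0.88\n"
        "P3\tmembrane\t0.40\n"
    )
    seqs = {"P1": "MKT", "P2": "MAA", "P3": "MLL"}
    print(run_transition(raw, seqs), end="")
```

In this toy input, only P1 passes both the label and the score test, so only P1 is written to the next stage's input. In practice, each program in a pipeline has its own output layout, so a separate parser would be needed for every transition.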
The stage transitions in the pipeline presented
were written in the Perl computer language. There
were five underlying criteria for selecting the various
programs used to complete each stage—public availability, operating platform, high-throughput functionality, cell type and software support. Each
criterion is now described in more detail: (i) public
availability—the program had to be freely downloadable and have stand-alone capability and (ii)
type of operating platform—the numerous programs
potentially available can be classified into three platform categories: web interface, Microsoft Windows
and Linux. The web interface programs are by far the
most prevalent because of their immediate accessi (...truncated)