Immunoinformatics and epitope prediction in the age of genomic medicine
Backert and Kohlbacher Genome Medicine (2015) 7:119
DOI 10.1186/s13073-015-0245-0
REVIEW
Open Access
Linus Backert1*
and Oliver Kohlbacher1,2,3
Abstract
Immunoinformatics involves the application of computational methods to immunological problems. Prediction of
B- and T-cell epitopes has long been the focus of immunoinformatics, given the potential translational implications,
and many tools have been developed. With the advent of next-generation sequencing (NGS) methods, an
unprecedented wealth of information has become available that requires more-advanced immunoinformatics tools.
Based on information from whole-genome sequencing, exome sequencing and RNA sequencing, it is possible to
characterize with high accuracy an individual’s human leukocyte antigen (HLA) allotype (i.e., the individual set of
HLA alleles of the patient), as well as changes arising in the HLA ligandome (the collection of peptides presented
by the HLA) owing to genomic variation. This has allowed new opportunities for translational applications of
epitope prediction, such as epitope-based design of prophylactic and therapeutic vaccines, and personalized cancer
immunotherapies. Here, we review a wide range of immunoinformatics tools, with a focus on B- and T-cell epitope
prediction. We also highlight fundamental differences in the underlying algorithms and discuss the various metrics
employed to assess prediction quality, comparing their strengths and weaknesses. Finally, we discuss the new
challenges and opportunities presented by high-throughput data sets for the field of epitope prediction.
Keywords: Immunoinformatics, Bioinformatics, Next-generation sequencing, Machine learning, HLA, Vaccine design,
Personalized medicine
From genomics to epitope prediction
Immunoinformatics deals with the application of computational methods to immunological problems and is
thus considered a part of bioinformatics. Historically,
tools for the prediction of HLA-binding peptides were
the first tools developed specifically for immunoinformatics applications (Box 1). These tools paved the way for
more-complex applications. The availability of sufficient experimental data has been crucial to the development of
immunoinformatics tools. High-throughput human
leukocyte antigen (HLA) binding assays led to major progress in this area. More recently, next-generation sequencing (NGS) has facilitated many of the novel applications
and challenges that we will review here. A first area where
the availability of cost-effective sequencing is having a
large impact is our knowledge of the major histocompatibility complex (MHC, HLA in human) itself. The number
of known HLA alleles, as registered in the International
ImMunoGeneTics information system (IMGT) database,
has increased from 1000 in 1998 to more than 13,000 in
2015 [1]. Initially, tools for prediction of HLA binding
(often also, slightly inaccurately, called epitope prediction) were trained on data for each HLA allele independently, but the number of new alleles renders this
approach more and more impractical. This growth has necessitated the development
of novel predictors, so-called pan-specific binding predictors. In general, the availability of large-scale data has improved the
performance of immunoinformatics tools, and, for many,
although not for all, applications, there is now a wealth of
data available. This increase in data volume often translates to an increased accuracy of these tools, primarily because many tools are based on machine learning methods,
which profit greatly from additional data. In this context,
the availability of comprehensive and well-curated immunological databases is essential.
* Correspondence:
1 Applied Bioinformatics, Center of Bioinformatics and Department of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany
Full list of author information is available at the end of the article
Here, we will first review how immunoinformatics tools
can be used to infer HLA allotypes from NGS data, and
© 2015 Backert and Kohlbacher. Open Access This article is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Box 1. The adaptive immune system
The adaptive immune system is the component of the immune
system that can learn to recognize specific threats (e.g.,
pathogens). This immunological memory results in long-lasting
immunity and rapid immune responses. Humoral immunity is
mediated by the recognition of antigens by B cells, whereas
cell-mediated immunity is based on the presentation of
antigens on human leukocyte antigen (HLA) and the recognition
of these antigens by T cells. B cells recognize antigens through
B-cell receptors (BCRs), which are membrane-bound antibodies;
recognition results in the secretion of antibodies that bind to the antigen
and deactivate or eliminate it.
Processing and presentation of peptide epitopes are essential
steps in cell-mediated immunity. In general, the HLA class I
pathway processes proteins originating from inside the cell,
whereas the class II pathway presents extracellular proteins
(Fig. 2). The HLA system is encoded by 21 genes, which are
construct vaccines based only on the genomic sequence of
a pathogen [2], and the availability of personal genomic
data enables personalized approaches to cancer immunotherapy [3]. It is in these areas that we expect the combination of NGS data and novel computational tools to
impact healthcare in a most profound way.
Immunoinformatics methods and databases for
epitope prediction
The availability of the sequence data of HLA-binding
peptides in the early 1990s [4] led to a search for commonalities among these sequences, that is, allele-specific motifs that convey binding. It quickly became
clear that the interaction between HLA and peptides is
rather complex, and thus increasingly sophisticated
pattern-recognition methods were developed. Learning
patterns from data is a field in computer science that is
typically called machine learning (ML), and, in particular, supervised ML has been applied to HLA-ligand
binding.
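The idea of an allele-specific motif can be illustrated with a minimal Python sketch. It derives a position-specific scoring matrix (PSSM) from a few aligned 9-mer peptides and scores new peptides against it. The peptide list, the pseudocount, the uniform background frequency and the function names are illustrative assumptions, not data or parameters taken from any tool reviewed here. Real predictors are trained on far larger sets of measured binders and generally use more involved models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(peptides, pseudocount=1.0):
    """Build a log-odds matrix (one dict per position) from equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        # Pseudocounts keep unseen residues from producing log(0).
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score(pssm, peptide):
    """Sum the per-position log-odds scores of a peptide."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length does not match the matrix")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Hypothetical toy "binders": leucine at position 2, valine at position 9.
TOY_BINDERS = ["ALAKAAAAV", "GLDEFGHIV", "KLMNPQRSV", "TLWYACDEV", "SLFGHIKMV"]

if __name__ == "__main__":
    matrix = build_pssm(TOY_BINDERS)
    for candidate in ("QLTTTTTTV", "QDTTTTTTK"):
        print(candidate, round(score(matrix, candidate), 2))
```

In this toy setting, a peptide carrying the two conserved residues scores higher than one lacking them. A simple matrix of this kind treats positions as independent, which is one reason more expressive supervised ML models were later applied to the problem.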
located on chromosome 6 and are highly polymorphic. HLA
class I entails three different loci, HLA-A, HLA-B and HLA-C, and
HLA class II encompasses HLA-DR, HLA-DP and HLA-DQ. Owing
to the possession of a diploid genome, each individual can thus
have between three and six different HLA class I allotypes. (...truncated)