PGAdb-builder: A web service tool for creating pan-genome allele database for molecular fine typing
www.nature.com/scientificreports
Received: 15 June 2016
Accepted: 12 October 2016
Published: 08 November 2016
Yen-Yi Liu1,*, Chien-Shun Chiou1,* & Chih-Chieh Chen2,3
With the advance of next generation sequencing techniques, whole genome sequencing (WGS) is
expected to become the optimal method for molecular subtyping of bacterial isolates. To use WGS as
a general subtyping method for disease outbreak investigation and surveillance, the layout of WGS-based typing must be comparable among laboratories. Whole genome multilocus sequence typing
(wgMLST) is an approach that achieves this requirement. To apply wgMLST as a standard subtyping
approach, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first
be established. We present a free web service tool, PGAdb-builder (http://wgmlstdb.imst.nsysu.edu.tw),
for the construction of bacterial PGAdb. The effectiveness of PGAdb-builder was tested by constructing
a pan-genome allele database for Salmonella enterica serovar Typhimurium, with the database being
applied to create a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium
isolates. In discerning between epidemiologically related and non-related isolates, the wgMLST-based
approach performed as well as the SNP-based approach used in Leekitcharoenphon’s study.
Molecular subtyping of bacterial isolates has been fundamental for epidemiologic study of infectious diseases.
Subtyping methods used for disease outbreak investigation and surveillance across regions and countries must
be standardized so that the results can be compared across laboratories. Pulsed-field gel electrophoresis (PFGE) is a good example; it has been standardized and successfully implemented as a common subtyping tool in the foodborne disease surveillance network, PulseNet1. Although PFGE is highly discriminatory
for most bacterial organisms, it is labor-intensive and time-consuming and is sometimes insufficient for discerning among
strains of highly clonal organisms. Multilocus variable-number tandem repeat analysis (MLVA) exhibits a much
higher level of discrimination than PFGE in discerning among very closely related strains; however, MLVA is very
organism-specific, and comparing its results across laboratories is difficult2,3. With the advance of next-generation
sequencing (NGS) techniques, whole genome sequencing (WGS) has become a practical and powerful subtyping
tool for disease outbreak detection4,5.
To use WGS as a standard subtyping tool for disease surveillance and the investigation of common outbreaks
across regions or countries, the layout of fingerprints (genotypes) generated from WGS data must be comparable among laboratories. Currently, NGS platforms generally produce millions of short sequences (reads) for a
bacterial strain. These reads can be assembled into longer sequences (contigs) using various assemblers6–8 and
then annotated. A number of algorithms and approaches have been developed for analyzing WGS
data9–14. Single nucleotide polymorphism (SNP) analysis is an approach frequently used to analyze WGS data for evolutionary study and disease outbreak investigation15–17. To apply the SNP approach, a reference genome sequence is
required for selecting SNPs from the WGS data of strains. When different reference sequences are used, different SNP
sets are generally obtained, making the SNP profiles incomparable across laboratories. Whole genome multilocus
sequence typing (wgMLST)14,18, an extended concept of the traditional MLST19, is considered an ideal approach
to sort out WGS data and generate genetic layouts that are portable and comparable among laboratories. To use
wgMLST as a standard subtyping tool, a pan-genome allele database (PGAdb) for the population of a bacterial
organism must first be established. In a PGAdb, genes (loci) and their sequence variants (alleles) are designated
using a standardized numbering system. An allelic sequence consists of a series of numbers and is therefore
portable and comparable across laboratories.
1Central Regional Laboratory, Center for Diagnostics and Vaccine Development, Centers for Disease Control,
Taichung 40855, Taiwan. 2Institute of Medical Science and Technology, National Sun Yat-sen University, Kaohsiung
80424, Taiwan. 3Medical Science and Technology Center, National Sun Yat-sen University, Kaohsiung 80424, Taiwan.
*These authors contributed equally to this work. Correspondence and requests for materials should be addressed to
C.-C.C. (email: )
Scientific Reports | 6:36213 | DOI: 10.1038/srep36213
Figure 1. The schematic workflow of PGAdb-builder.
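To illustrate why a numbered allelic sequence is portable, the following minimal Python sketch converts the genes of two isolates into allele numbers against a shared database and counts the loci at which they differ. It is an illustration only and not the PGAdb-builder implementation: the toy database, the nucleotide sequences, and the function names allelic_profile and differing_loci are hypothetical, and the sketch assumes that each gene has already been assigned to its locus.

```python
from typing import Dict, Optional

# An allelic sequence (profile): locus ID -> allele number, or None when the
# locus is absent from the isolate or its sequence is not in the database.
Profile = Dict[str, Optional[int]]

# Hypothetical toy database: locus ID -> {allele nucleotide sequence -> allele number}.
TOY_PGADB: Dict[str, Dict[str, int]] = {
    "SAL00000001": {"ATGGCA": 1, "ATGGCT": 2},
    "SAL00000002": {"ATGTTTAA": 1, "ATGTTCAA": 2, "ATGCTCAA": 3},
}


def allelic_profile(genes: Dict[str, str], pgadb: Dict[str, Dict[str, int]]) -> Profile:
    """Look up the allele number of each locus for one isolate.

    `genes` maps locus ID -> nucleotide sequence found in the isolate
    (assumption: locus assignment has been done beforehand).
    """
    profile: Profile = {}
    for locus in sorted(pgadb):
        # An empty string never matches a stored allele, so a missing locus yields None.
        profile[locus] = pgadb[locus].get(genes.get(locus, ""))
    return profile


def differing_loci(a: Profile, b: Profile) -> int:
    """Count loci where both isolates have a known allele and the numbers differ."""
    return sum(
        1
        for locus in a
        if a[locus] is not None and b.get(locus) is not None and a[locus] != b[locus]
    )


# Two isolates, for example typed in two different laboratories with the same database.
isolate_a = {"SAL00000001": "ATGGCT", "SAL00000002": "ATGTTTAA"}
isolate_b = {"SAL00000001": "ATGGCT", "SAL00000002": "ATGCTCAA"}

profile_a = allelic_profile(isolate_a, TOY_PGADB)
profile_b = allelic_profile(isolate_b, TOY_PGADB)
n_diff = differing_loci(profile_a, profile_b)
```

Because both profiles are expressed in the numbering of the same database, they can be exchanged and compared as plain lists of integers without re-analysing the underlying sequence data.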
We present a web service tool, PGAdb-builder, that can be used for the construction of bacterial pan-genome
allele databases. In this paper, we demonstrate the function of PGAdb-builder by constructing a
S. Typhimurium PGAdb and generating a wgMLST tree for a panel of epidemiologically well-characterized
S. Typhimurium isolates, which were sequenced previously by DTU Food20.
Methods and Implementation
The workflow of PGAdb-builder is illustrated in Fig. 1. The PGAdb-builder server comprises two
functional modules: Build_PGAdb, which creates a PGAdb, and Build_wgMLSTtree, which uses the PGAdb
to generate allelic sequences from uploaded genome contigs and constructs a wgMLST tree depicting the
genetic relatedness of the strains. The details of the Build_PGAdb and Build_wgMLSTtree modules are described
below.
Build_PGAdb.
The Build_PGAdb module annotates the uploaded genome contigs by using
the Prokka pipeline21, a rapid bacterial genome annotation tool. Subsequently, the output gff file created in
the annotation process is processed with the Roary pipeline22, a tool that can rapidly process a large-scale
collection of genomes, to place proteins into orthologous clusters. In this module, paralogous genes are excluded
from the pan-genome allele dataset. Each orthologous cluster consists of a protein family with 95% (adjustable
between 90% and 99%) sequence identity. Each protein family is defined as a locus (gene). The orthologous
proteins in each cluster are converted to nucleotide sequences by reference to the ffn file created in the
annotation process to establish a pan-genome allele dataset. In this step, sequences in a locus that differ from
each other by one or more nucleotides are defined as different alleles. The loci of the pan-genome allele dataset are then encoded with a prefix string of three alphabetic letters followed by an eight-digit serial number
(e.g., SAL00000001, SAL00000002, …), and the alleles in each locus are numbered with consecutive integers
from 1 to n (e.g., 1, 2, 3, …, n).
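The encoding step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used by PGAdb-builder: the function name build_allele_dataset and the input format (one list of nucleotide sequences per orthologous cluster) are hypothetical, alleles are numbered in order of first appearance, and any two non-identical sequences are treated as different alleles, which also separates sequences of different length.

```python
from typing import Dict, List


def build_allele_dataset(
    clusters: List[List[str]], prefix: str = "SAL"
) -> Dict[str, Dict[str, int]]:
    """Assign locus IDs and allele numbers to orthologous clusters.

    clusters: one list of nucleotide sequences per locus (hypothetical input format).
    prefix:   three alphabetic letters placed before the eight-digit serial number.
    Returns:  locus ID -> {allele sequence -> allele number}.
    """
    if len(prefix) != 3 or not prefix.isalpha():
        raise ValueError("prefix must consist of exactly three alphabetic letters")

    dataset: Dict[str, Dict[str, int]] = {}
    for serial, sequences in enumerate(clusters, start=1):
        locus_id = f"{prefix}{serial:08d}"  # e.g. SAL00000001
        alleles: Dict[str, int] = {}
        for seq in sequences:
            seq = seq.upper()
            # Assumption: a sequence not seen before at this locus is a new allele
            # and receives the next integer (1, 2, 3, ..., n).
            if seq not in alleles:
                alleles[seq] = len(alleles) + 1
        dataset[locus_id] = alleles
    return dataset


# Toy input: two clusters; the first contains a duplicated sequence and one variant.
toy_clusters = [
    ["ATGAAA", "ATGAAA", "ATGAAG"],
    ["ATGCCC"],
]
toy_dataset = build_allele_dataset(toy_clusters)
```

In this toy input the first locus receives two alleles, because the duplicated sequence is counted once, and the second locus receives a single allele.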
Build_wgMLSTtree.
The Build_wgMLSTtree module compares the uploaded genome contigs of strains by
using a PGAdb da (...truncated)