PGAdb-builder: A web service tool for creating pan-genome allele database for molecular fine typing (pdf)

www.nature.com/scientificreports | OPEN
Received: 15 June 2016. Accepted: 12 October 2016. Published: 08 November 2016.

Yen-Yi Liu1,*, Chien-Shun Chiou1,* & Chih-Chieh Chen2,3

With the advance of next-generation sequencing techniques, whole genome sequencing (WGS) is expected to become the optimal method for molecular subtyping of bacterial isolates. To use WGS as a general subtyping method for disease outbreak investigation and surveillance, the layout of WGS-based typing must be comparable among laboratories. Whole genome multilocus sequence typing (wgMLST) is an approach that meets this requirement. To apply wgMLST as a standard subtyping approach, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first be established. We present a free web service tool, PGAdb-builder (http://wgmlstdb.imst.nsysu.edu.tw), for the construction of bacterial PGAdb. The effectiveness of PGAdb-builder was tested by constructing a pan-genome allele database for Salmonella enterica serovar Typhimurium and applying the database to create a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium isolates. In discerning between epidemiologically related and nonrelated isolates, the wgMLST-based approach performed as well as the SNP-based approach used in Leekitcharoenphon’s study.

Molecular subtyping of bacterial isolates has been fundamental for epidemiologic study of infectious diseases. Subtyping methods used for disease outbreak investigation and surveillance across regions and countries must be standardized so that the results can be compared across laboratories. Pulsed-field gel electrophoresis (PFGE) is a good example; it has been standardized and successfully implemented as a common subtyping tool in the foodborne disease surveillance network, PulseNet1.
Although PFGE is highly discriminatory for most bacterial organisms, it is labor-intensive and time-consuming and is sometimes insufficient for discerning among strains of highly clonal organisms. Multilocus variable-number tandem repeat analysis (MLVA) exhibits a much higher level of discrimination than PFGE in discerning among very closely related strains; however, MLVA is very organism-specific, and comparing its results across laboratories is difficult2,3. With the advance of next-generation sequencing (NGS) techniques, whole genome sequencing (WGS) has become a practical and powerful subtyping tool for disease outbreak detection4,5. To use WGS as a standard subtyping tool for disease surveillance and the investigation of common outbreaks across regions or countries, the layout of fingerprints (genotypes) generated from WGS data must be comparable among laboratories. Currently, NGS platforms generally produce millions of short sequences (reads) for a bacterial strain. The millions of reads can be further assembled into longer sequences (contigs) and annotated using various assemblers6–8. A number of algorithms and approaches have been developed for analyzing WGS data9–14. Single nucleotide polymorphism (SNP) analysis is an approach frequently used to analyze WGS data for evolutionary study and disease outbreak investigation15–17. To apply the SNP approach, a reference genome sequence is required for selecting SNPs from the WGS data of strains. When different reference sequences are used, different SNP sets are generally obtained, making the SNP profiles incomparable across laboratories. Whole genome multilocus sequence typing (wgMLST)14,18, an extension of the traditional MLST concept19, is considered an ideal approach to sort out WGS data and generate genetic layouts that are portable and comparable among laboratories. To use wgMLST as a standard subtyping tool, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first be established.
In a PGAdb, genes (loci) and their sequence variants (alleles) are designated using a standardized numbering system. An allelic sequence consists of a series of numbers and is therefore portable and comparable across laboratories. We present a web service tool, PGAdb-builder, that can be used for the construction of bacterial pan-genome allele databases. In this paper, we demonstrate the function of PGAdb-builder by constructing an S. Typhimurium PGAdb and generating a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium isolates, which were sequenced previously by DTU Food20.

1Central Regional Laboratory, Center for Diagnostics and Vaccine Development, Centers for Disease Control, Taichung 40855, Taiwan. 2Institute of Medical Science and Technology, National Sun Yat-sen University, Kaohsiung 80424, Taiwan. 3Medical Science and Technology Center, National Sun Yat-sen University, Kaohsiung 80424, Taiwan. *These authors contributed equally to this work. Correspondence and requests for materials should be addressed to C.-C.C.

Scientific Reports | 6:36213 | DOI: 10.1038/srep36213

Figure 1. The schematic work flow of PGAdb-builder.

Methods and Implementation

The flowchart for the proposed PGAdb-builder is illustrated in Fig. 1. The PGAdb-builder server comprises two functional modules: Build_PGAdb, which creates a PGAdb database, and Build_wgMLSTtree, which uses the PGAdb to generate allelic sequences from uploaded genome contigs and constructs a wgMLST tree of genetic relatedness. The details of the Build_PGAdb and Build_wgMLSTtree modules are described below.

Build_PGAdb. The Build_PGAdb module executes the annotation of uploaded genome contigs by using the Prokka pipeline21, a rapid bacterial genome annotation tool.
Subsequently, the output gff file created in the annotation process is processed to place proteins into orthologous clusters by using the Roary pipeline22, a tool that can rapidly process a large-scale collection of genomes. In this module, paralogous genes are excluded from the pan-genome allele dataset. Each orthologous cluster consists of a protein family with 95% (adjustable between 90% and 99%) sequence identity. Each protein family is defined as a locus (gene). The orthologous proteins in each cluster are converted to nucleotide sequences by reference to the ffn file created in the annotation process to establish a pan-genome allele dataset. In this step, sequences in a locus that differ from each other by one or more nucleotides are defined as different alleles. The loci of a pan-genome allele dataset are then encoded with a prefix string of three alphabetic letters followed by an eight-digit serial number (e.g., SAL00000001, SAL00000002…), and the alleles in each locus are numbered with consecutive integers from 1 to n (e.g., 1, 2, 3, … n).

Build_wgMLSTtree. The Build_wgMLSTtree module compares the uploaded genome contigs of strains by using a PGAdb da (...truncated)
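The locus and allele numbering scheme described for Build_PGAdb can be illustrated with a minimal Python sketch. It is not the PGAdb-builder implementation: the function names `build_allele_database` and `allelic_profile`, the in-memory input format (one list of nucleotide sequences per orthologous cluster), and the use of `None` for a locus that is absent or carries an unlisted allele are all assumptions made for illustration.

```python
def build_allele_database(clusters, prefix="SAL"):
    """Assign locus IDs and allele numbers (hypothetical helper, not the tool's code).

    clusters: list of lists; each inner list holds the nucleotide sequences
    of one orthologous cluster (one locus).
    Returns {locus_id: {sequence: allele_number}}.
    """
    if len(prefix) != 3 or not prefix.isalpha():
        raise ValueError("prefix must be three alphabetic letters")

    database = {}
    for serial, sequences in enumerate(clusters, start=1):
        # Three-letter prefix followed by an eight-digit serial number.
        locus_id = f"{prefix}{serial:08d}"
        alleles = {}
        for sequence in sequences:
            sequence = sequence.upper()
            # Any nucleotide difference makes a new allele; identical
            # sequences share one number. Numbering runs from 1 to n.
            if sequence not in alleles:
                alleles[sequence] = len(alleles) + 1
        database[locus_id] = alleles
    return database


def allelic_profile(database, strain_loci):
    """Translate one strain's locus sequences into allele numbers.

    strain_loci: {locus_id: sequence} for the loci found in the strain.
    Assumption: a missing locus or an unlisted allele is reported as None.
    """
    profile = {}
    for locus_id, alleles in database.items():
        sequence = strain_loci.get(locus_id)
        profile[locus_id] = alleles.get(sequence.upper()) if sequence else None
    return profile


# Toy data: two loci, the first with two distinct alleles.
example_clusters = [["ATGAAA", "ATGAAA", "ATGAAC"], ["ATGCCC"]]
example_db = build_allele_database(example_clusters)
example_profile = allelic_profile(example_db, {"SAL00000001": "atgaac"})
print(example_db)
print(example_profile)
```

Because each strain is reduced to a list of integers keyed by locus ID, two laboratories that share the same database can compare profiles directly without exchanging sequence data.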