snpTree - a web-server to identify and construct SNP trees from whole genome sequence data
Leekitcharoenphon et al. BMC Genomics
Pimlapas Leekitcharoenphon 0 1
Rolf S Kaas 0 1
Martin Christen Frølund Thomsen 1
Carsten Friis 0
Simon Rasmussen 1
Frank M Aarestrup 0
0 National Food Institute, Building 204, Technical University of Denmark, 2800 Kgs Lyngby, Denmark
1 Center for Biological Sequence Analysis, Building 208, Department of Systems Biology, Technical University of Denmark, 2800 Kgs Lyngby, Denmark
Background: The advances in and decreasing cost of whole genome sequencing (WGS) will soon make this technology available for routine infectious disease epidemiology. In epidemiological studies, outbreak isolates have very little diversity and require extensive genomic analysis to differentiate and classify them. One of the successful and broadly used methods is analysis of single nucleotide polymorphisms (SNPs). Currently, there are different tools and methods to identify SNPs, each with various options and cut-off values. Furthermore, all current methods require bioinformatic skills. Thus, we lack a standard, simple and automatic tool to determine SNPs and construct phylogenetic trees from WGS data.
Results: Here we introduce snpTree, a server for automatic online SNP analysis. The tool is composed of different SNP analysis suites and perl and python scripts. snpTree can identify SNPs and construct phylogenetic trees from WGS reads as well as from assembled genomes or contigs. WGS data in fastq format are aligned to reference genomes by BWA, while contigs in fasta format are processed by Nucmer. SNPs are concatenated based on their position in the reference genome, and a tree is constructed from the concatenated SNPs using FastTree and a perl script. The online server was implemented in HTML, Java and python script. The server was evaluated using four published bacterial WGS data sets (V. cholerae, S. aureus CC398, S. Typhimurium and M. tuberculosis). The evaluation results for the first three cases were consistent and concordant for both raw reads and assembled genomes. In the last case, the original publication involved extensive filtering of SNPs, which could not be repeated using snpTree.
Conclusions: The snpTree server is an easy-to-use option for rapid, standardised and automatic SNP analysis in epidemiological studies, also for users with limited bioinformatic experience. The web server is freely accessible at http://www.cbs.dtu.dk/services/snpTree-1.0/.
Background
The dramatic decrease in cost for whole-genome
sequencing (WGS) has made this technology economically feasible
as a routine tool for scientific research, including infectious
disease epidemiology. In addition, WGS has major
applications for health service providers working with infectious
diseases [1], as it can deliver high-resolution genomic
epidemiology as the ultimate typing method for bacteria.
The ideal microbial typing technique should differentiate
epidemiologically unrelated strains, group
epidemiologically related (outbreak) strains [2], and give
information that helps to understand the evolutionary
history of multiple strains within a clonal lineage [1,2].
Although some current technologies, such as MLST or PFGE,
are highly informative, they have limited resolution
when applied to closely related isolates, and different
methods often have to be applied in different situations [1,2].
Outbreak isolates in particular normally have very little
diversity and require extensive genomic methods to
differentiate and categorize them [3]. Single nucleotide
polymorphisms (SNPs) show relatively low mutation
rates and are evolutionarily stable. Moreover, SNP
analysis has successfully been used for determining broad
patterns of evolution in many recent studies [4-6].
Currently, there are a number of non-commercial NGS
genotype analysis software packages available, such as SOAP2
[7], GATK [8] and SAMtools [9]. Nonetheless, all of these
packages require bioinformatic skills and the configuration
of various options and settings, and none of them has a
user-friendly web interface.
Here we introduce snpTree, a server for automatic
online SNP analysis and SNP tree construction from
sequencing reads as well as from assembled genomes or
contigs. The server is a pipeline that integrates available
SNP analysis software, such as SAMtools [9] and
MUMmer [10], with customized scripts. The performance of the
server was evaluated with four published bacterial WGS
data sets: Vibrio cholerae [3], Staphylococcus aureus CC398
[6], Salmonella Typhimurium [11] and Mycobacterium
tuberculosis [12].
Implementation
The snpTree server was created to handle both WGS data
and assembled genomes to generate a phylogenetic tree
based on SNP data. The overall process is shown in
Figure 1. For raw reads (Figure 1A), snpTree uses an
in-house toolbox (Genobox) for mapping and genotyping,
which consists of available programs for next-generation
sequencing analysis such as the Burrows-Wheeler Aligner
(BWA) [13] and SAMtools [9], a software package for SNP
calling and genotyping. The source code of Genobox is
available at https://github.com/srcbs/GenoBox. For contigs
or assembled genomes (Figure 1B), MUMmer [10] is used
for both reference genome alignment and SNP
identification.
The web server contains more than 2,000 completed
reference genomes collected from the NCBI Genome
database (accessed in April 2012).
SNPs identification from WGS
Prior to mapping raw reads to a suitable reference genome,
the sequence data in fastq format are filtered and trimmed
according to the following criteria [14]: (i) reads with Ns
are removed; (ii) if a read matches a minimum of 25 nt of
a sequencing primer/adaptor, the read is trimmed at the
5′ coordinate of the match; (iii) the 3′ tail bases are trimmed if
their quality score is less than 20; (iv) the minimum average
quality of the read should be 20 and the read length after
trimming should be at least 20 nt.
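The four criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Genobox implementation: the function name `filter_and_trim`, the exact-match adaptor search, the order in which the steps are applied, and computing the average quality after trimming are all choices made here for clarity.

```python
from typing import List, Optional, Tuple

# Thresholds taken from the criteria in the text.
MIN_ADAPTOR_MATCH = 25   # nt of adaptor that must match (criterion ii)
MIN_BASE_QUALITY = 20    # 3' tail bases below this are trimmed (criterion iii)
MIN_MEAN_QUALITY = 20    # minimum average read quality (criterion iv)
MIN_LENGTH = 20          # minimum read length after trimming (criterion iv)


def filter_and_trim(
    seq: str, quals: List[int], adaptor: str = ""
) -> Optional[Tuple[str, List[int]]]:
    """Return the trimmed (sequence, qualities), or None if the read is discarded.

    Hypothetical helper: `quals` holds one Phred score per base.
    """
    # (i) Discard reads that contain ambiguous bases.
    if "N" in seq.upper():
        return None

    # (ii) Assumption: look for an exact match of the first 25 nt of the
    # adaptor and cut the read at the start (5' coordinate) of the match.
    if len(adaptor) >= MIN_ADAPTOR_MATCH:
        hit = seq.find(adaptor[:MIN_ADAPTOR_MATCH])
        if hit != -1:
            seq, quals = seq[:hit], quals[:hit]

    # (iii) Trim low-quality bases from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < MIN_BASE_QUALITY:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # (iv) Assumption: length and average quality are checked after trimming.
    if len(seq) < MIN_LENGTH:
        return None
    if sum(quals) / len(quals) < MIN_MEAN_QUALITY:
        return None
    return seq, quals
```

For example, a 30 nt read whose last five bases have quality 10 is returned as a 25 nt read, whereas the same read with fifteen low-quality tail bases is discarded because only 15 nt remain.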
Trimmed reads are aligned against a reference
genome using BWA [13] with a default minimum mapping
quality of 30 (Figure 1A). BWA is based on an
effective data compression algorithm, the
Burrows-Wheeler transform (BWT), which is fast, memory-efficient
and especially useful for aligning short reads [15].
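To give an idea of what the Burrows-Wheeler transform does, the following toy Python function computes it by sorting all rotations of a string. This is a textbook construction for illustration only; the function name `bwt` and the `$` terminator are choices made here, and BWA builds its index with far more efficient methods than this quadratic sketch.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input text")
    s = text + terminator
    # Build every cyclic rotation of the terminated string and sort them.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # prints annb$aa
```

The transform tends to group identical characters together, which is what makes the transformed text compressible and, with additional index structures, searchable for short read sequences.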
SNP calling and filtering are accomplished with
SAMtools, a software package for parsing and
manipulating alignments in the generic alignment format (SAM/
BAM format) [9]. The snpTree server allows users to set
two parameters to filter SNPs: a minimum
coverage and a minimum distance between SNPs
(prune). The default for both cut-offs is 10.
Additionally, all heterozygous SNPs are filtered out because
these are likely mapping errors in haploid chromosomes.
The identified SNPs are written to a VCF file.
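The three filters described above (coverage, pruning distance and heterozygosity) can be sketched as follows. This is a minimal illustration and not the snpTree code: the `SnpCall` record and `filter_snps` function are hypothetical, and the pruning rule is an assumption, namely that every SNP lying closer than the minimum distance to a neighbouring SNP on the same sequence is dropped, evaluated after the coverage and heterozygosity filters.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical minimal record for one called SNP."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int            # read coverage at the position
    heterozygous: bool    # True if the genotype call is heterozygous


def filter_snps(
    calls: List[SnpCall], min_depth: int = 10, min_distance: int = 10
) -> List[SnpCall]:
    """Apply coverage, heterozygosity and pruning filters (defaults of 10)."""
    # Keep homozygous calls with sufficient coverage, sorted by position.
    kept = sorted(
        (c for c in calls if c.depth >= min_depth and not c.heterozygous),
        key=lambda c: (c.chrom, c.pos),
    )
    pruned = []
    for i, call in enumerate(kept):
        too_close = False
        # Only the immediate neighbours need checking in a sorted list.
        for j in (i - 1, i + 1):
            if 0 <= j < len(kept):
                other = kept[j]
                if (other.chrom == call.chrom
                        and abs(other.pos - call.pos) < min_distance):
                    too_close = True
        if not too_close:
            pruned.append(call)
    return pruned
```

With the defaults, two SNPs at positions 100 and 105 would both be removed by pruning, a SNP covered by only five reads would fail the coverage cut-off, and any heterozygous call would be discarded regardless of its coverage.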
SNPs identification from assembled genomes
A pipeline has been developed around the software
package MUMmer version 3.23 [10] (Figure 1B). An
application named Nucmer, which (...truncated)