SNPServer: a real-time SNP discovery tool
David Savage (2), Jacqueline Batley (2), Tim Erwin (1,2), Erica Logan (1,2), Christopher G. Love (1,2), Geraldine A. C. Lim (1,2), Emmanuel Mongin (1,2), Gary Barker (0), German C. Spangenberg (1,2), David Edwards (1,2)
(0) School of Biological Sciences, University of Bristol, Bristol BS8 1UG, UK
(1) Victorian Bioinformatics Consortium, Plant Biotechnology Centre, Primary Industries Research Victoria, La Trobe University, Bundoora 3086, Victoria, Australia
(2) Plant Biotechnology Centre
SNPServer is a flexible, real-time tool for the discovery of single nucleotide polymorphisms (SNPs) within DNA sequence data. The program uses BLAST to identify related sequences and CAP3 to cluster and align them. The alignments are passed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels). Alternatively, lists of related sequences or pre-assembled sequences may be entered for SNP discovery. SNPServer and autoSNP use redundancy to differentiate between candidate SNPs and sequence errors. For each candidate SNP, two measures of confidence are calculated: the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. SNPServer is available at http://hornbill.cspp.latrobe.edu.au/snpdiscovery.html.
Single nucleotide polymorphisms (SNPs) and small insertions/
deletions (indels) are the most frequently found DNA
sequence variations (1). The development of
high-throughput methods for the detection of SNPs has led to a
revolution in their use as molecular markers (2). As such, they
represent one of the most powerful tools for the analysis of
genomes and are increasingly becoming the marker of choice
in genetic analysis. SNPs are used routinely in agriculture as
markers in crop-breeding programmes (3). They also have
many uses in human genetics, such as the detection of alleles
associated with genetic diseases and inferences of population
history (4,5). Furthermore, SNPs are invaluable as a tool
for genome mapping, offering the potential for generating
very high-density genetic maps that can be used to develop
haplotyping systems for genes or regions of interest (6).
The simplicity and the low mutation rate of SNPs also
make them excellent markers for studying complex genetic
traits and as a tool for the understanding of genome
evolution (7).
As with the majority of molecular markers, one of the
limitations of SNPs is the initial cost associated with their
development. However, with the growth of high-throughput
sequencing technology, large amounts of data have been
submitted to the various DNA databases that may be suitable for
data mining and SNP discovery (8). Methods used to identify
SNPs in aligned sequence data have traditionally relied on
sequence trace file analysis to filter out sequence errors by
their dubious trace quality (9–11). The major drawbacks to this
approach are the requirement for sequence trace files, which
are rarely complete for large sequence datasets collated from a
variety of sources, and the high level of sequence error
associated with the reverse transcription process. These problems
are overcome by the use of autoSNP software for the detection
of SNPs within sequence data with associated measurements
of confidence (12).
For each candidate polymorphism, autoSNP calculates two measures of
confidence in its validity as a SNP. The
frequency of occurrence of a polymorphism at a particular
locus provides a primary measure of confidence in the SNP
representing a true polymorphism and is referred to as the
SNP redundancy score. The co-segregation of multiple
SNPs within an alignment to define a haplotype provides a
second measure of confidence in SNP validity and is referred
to as the co-segregation score. Here we introduce the real-time
autoSNP web server, the SNPServer. This builds on the
use of autoSNP software by providing a web interface for
sequence input, comparison and assembly, and permits the
rapid discovery of SNPs related to any specified sequence
of interest.
METHODS
Sequence input, assembly and clustering
The real-time autoSNP web server, SNPServer, acts as a
web interface and wrapper for the three programs, BLAST,
CAP3 and autoSNP, that make up the SNP discovery pipeline
(Figure 1). The complete pipeline accepts a single sequence as
an input. This entry sequence is compared with a specified
nucleotide sequence database using BLAST (13) to identify
related sequences. The resulting sequences may then be
selected for assembly with CAP3 (14) and subsequent SNP
discovery using autoSNP (12). Alternatively, users may enter
a list of sequences in FASTA format for assembly, or a
pre-calculated sequence assembly in ACE format. Complete
options for BLAST sequence comparisons, CAP3 assembly
and SNP discovery may be specified at the user interface.
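
The three entry points described above (a single query sequence, a FASTA list, or an ACE assembly) can be pictured as one pipeline that is joined at different stages. The following is a minimal sketch only, not the SNPServer implementation: the names `Submission` and `run_pipeline` are hypothetical, and the three stages are passed in as plain callables standing in for BLAST, CAP3 and autoSNP rather than invoking the real programs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Submission:
    """Hypothetical job record; exactly one field should be set."""
    query: Optional[str] = None   # single entry sequence, to be searched
    fasta: Optional[str] = None   # list of related sequences in FASTA format
    ace: Optional[str] = None     # pre-calculated assembly in ACE format


def run_pipeline(
    job: Submission,
    search: Callable[[str], str],     # query -> FASTA of selected hits (BLAST stand-in)
    assemble: Callable[[str], str],   # FASTA -> ACE assembly (CAP3 stand-in)
    discover: Callable[[str], List],  # ACE -> candidate SNPs (autoSNP stand-in)
) -> List:
    """Run only the stages that the supplied input still needs."""
    provided = [v for v in (job.query, job.fasta, job.ace) if v is not None]
    if len(provided) != 1:
        raise ValueError("provide exactly one of query, fasta or ace")
    ace = job.ace
    if ace is None:
        # A FASTA list skips the similarity search; a single query does not.
        fasta = job.fasta if job.fasta is not None else search(job.query)
        ace = assemble(fasta)
    return discover(ace)
```

In this sketch a submitted ACE file goes straight to SNP discovery, a FASTA list is assembled first, and a single query runs all three stages. The manual selection of BLAST hits before assembly is assumed to happen inside `search`.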
SNP discovery is performed using a redundancy-based
approach with a modified version of the autoSNP PERL script
(12,15). Alignment data generated by CAP3 (or from a
user-submitted ACE file) are used to load the sequences in each
assembly into a 2D array. Spacing characters (-) added during
sequence alignment are considered as a fifth element in
addition to the four nucleotides A, C, G and T. This permits the
identification of insertion/deletion polymorphisms between
sequences. Each row (representing a single base locus in the
assembly) is assessed for differing nucleotides. Minimum
redundancy scores specified by the user and associated with
alignment width (the number of sequences included in the
contig) determine the number of different nucleotides at a
base position required for classification as a SNP. Where
a SNP is recorded, a SNP score is allocated equal to the
minimum number of reads that share a common
polymorphism. Where several SNPs are present in an alignment, a
co-segregation score is calculated for each SNP. This is measured
as the frequency of haplotype-specifying SNP patterns
occurring in the alignment. This figure is then normalized to the
number of sequences in the alignment to produce a weighted
co-segregation score. HTML format files are generated to
allow the user to input data, select comparison, assembly
and SNP discovery parameters, and browse the SNP results
(Figure 2).
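
The redundancy-based approach described above can be illustrated with a short sketch. It is written in Python rather than the PERL of autoSNP and is one plausible reading of the text, not the published code. The function names, the fixed `min_redundancy` threshold (the text ties the threshold to alignment width) and the exact scoring rules are assumptions: the SNP score is taken as the smallest read count among the qualifying alleles, and the weighted co-segregation score as the number of other SNPs that split the reads into the same groups, divided by the number of reads.

```python
from collections import Counter

ALPHABET = set("ACGT-")  # four nucleotides plus the gap as a fifth element


def call_snps(alignment, min_redundancy=2):
    """Return candidate SNPs as (column, {allele: count}, snp_score) tuples."""
    if len({len(read) for read in alignment}) != 1:
        raise ValueError("all aligned reads must have the same length")
    snps = []
    for col in range(len(alignment[0])):
        # Characters outside ALPHABET (e.g. padding beyond a read) are ignored.
        counts = Counter(r[col] for r in alignment if r[col] in ALPHABET)
        # Alleles seen in fewer reads than the threshold are treated as
        # likely sequence errors and dropped.
        alleles = {b: n for b, n in counts.items() if n >= min_redundancy}
        if len(alleles) >= 2:
            snps.append((col, alleles, min(alleles.values())))
    return snps


def read_partition(alignment, col, alleles):
    """Group read indices by the qualifying allele they carry at `col`."""
    groups = {}
    for i, read in enumerate(alignment):
        if read[col] in alleles:
            groups.setdefault(read[col], set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())


def cosegregation_scores(alignment, snps):
    """Map each SNP column to its assumed weighted co-segregation score."""
    patterns = [read_partition(alignment, col, al) for col, al, _ in snps]
    tally = Counter(patterns)
    n_reads = len(alignment)
    return {col: (tally[p] - 1) / n_reads
            for (col, _, _), p in zip(snps, patterns)}


alignment = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTAC-T",
    "ACCTAC-T",
    "ACGTATGT",
]
snps = call_snps(alignment)
scores = cosegregation_scores(alignment, snps)
```

In this toy alignment, column 2 (G/C) and column 6 (G/gap) each have two alleles supported by at least two reads, so both are reported with a SNP score of 2, and they divide the reads into the same two groups. The lone T at column 5 is seen in a single read and is discarded as a probable sequence error.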
ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges for this
article was provided by the Victorian Department of Primary
Industries.
Conflict of interest statement. None declared.
(...truncated)