A pipeline for high throughput detection and mapping of SNPs from EST databases
A. M. Anithakumari
Jifeng Tang
Herman J. van Eck
Richard G. F. Visser
Jack A. M. Leunissen
Ben Vosman
C. Gerard van der Linden
H. J. van Eck, R. G. F. Visser, B. Vosman, C. G. van der Linden (✉): Wageningen UR Plant Breeding, Wageningen University & Research Centre, PO Box 386, 6700 AJ Wageningen, The Netherlands
A. M. Anithakumari: Graduate School Experimental Plant Sciences, Wageningen UR Plant Breeding, PO Box 386, 6700 AJ Wageningen, The Netherlands
J. Tang (formerly): Wageningen UR Laboratory of Bioinformatics, Wageningen, The Netherlands
J. A. M. Leunissen: Wageningen UR Laboratory of Bioinformatics, Wageningen University & Research Centre, Wageningen, The Netherlands
J. Tang: Keygene N.V., Wageningen, The Netherlands
Single nucleotide polymorphisms (SNPs) represent the most abundant type of genetic variation that can be used as molecular markers. The SNPs that are hidden in sequence databases can be unlocked using bioinformatic tools. For efficient application of these SNPs, the sequence set should be as error-free as possible, target single loci and be suitable for the SNP scoring platform of choice. We have developed a pipeline to effectively mine SNPs from public EST databases with or without quality information using QualitySNP software, select reliable SNPs and prepare the loci for analysis on the Illumina GoldenGate genotyping platform. The applicability of the pipeline was demonstrated using publicly available potato EST data, genotyping individuals from two diploid mapping populations and subsequently mapping the SNP markers (putative genes) in both populations. Over 7000 reliable SNPs were identified that met the criteria for genotyping on the GoldenGate platform. Of the 384 SNPs on the SNP array, approximately 12% dropped out. For the two potato mapping populations, 165 and 185 segregating SNP loci could be mapped on the respective genetic maps, illustrating the effectiveness of our pipeline for SNP selection and validation.
Genetic variation is the basis for the biodiversity of
life (Schlotterer 2004). Variations in the DNA
sequence of genes and their regulatory regions
underlie most of the phenotypic variation that has
been exploited in modern crops (Bryan et al. 2000;
Masouleh et al. 2009). Breeding strategies aiming to
improve crop agronomical performance have gained
momentum in the last few decades by the use of
molecular marker technologies that visualize DNA
polymorphisms (Collard et al. 2005). Molecular
markers have proven to be extremely useful in
breeding, genome-wide screens for variation,
genotype identification and/or fingerprinting, and
evolutionary and ecological studies.
In breeding programs that are aimed at transferring
genes or alleles within or between different species
with the aid of molecular markers, several steps can be
discerned. The first step in this process is the
identification of one or more markers closely linked to, or
located within, the genes controlling the traits to be
introgressed. For this, a high-density map of markers on
the genome and/or markers in genes that are likely to be
involved in the trait of interest can be an invaluable
tool. SNPs are very well
suited for this purpose. Their astonishing abundance
has been reported in several discovery projects in many
species including humans (Sachidanandam et al.
2001), model species such as Arabidopsis thaliana
(Jander et al. 2002) and Drosophila melanogaster
(Hoskins et al. 2001) and in crop plants such as barley
(Rostoks et al. 2005), maize (Ching et al. 2002), rice
(Shen et al. 2004; McNally et al. 2006), soybean (Zhu
et al. 2003) and wheat (Ablett et al. 2006).
Recent technological advancements in discovery
and detection platforms have made SNP markers
attractive for high-throughput use not only in model
species, but also in crop plants (Rafalski 2002). In
species for which no genome sequence is available,
large-scale SNP discovery has generally relied on
sequence variation found in libraries of expressed
sequence tags (ESTs) (Somers et al. 2003) or on
resequencing (Choi et al. 2007).
Several software tools are available for SNP
discovery from nucleotide databases, including
PolyBayes, AutoSNP, and QualitySNP (Marth 1999;
Barker et al. 2003; Tang et al. 2006). QualitySNP is
especially useful in extracting reliable SNPs from EST
sequence databases that lack quality information, and
is in many cases capable of distinguishing paralogs
from allelic sequences effectively (Tang et al. 2006).
Along with the development of tools to mine a large
number of SNPs from nucleotide databases, new SNP
genotyping platforms were developed that can analyze
a large number of SNPs in parallel in a large set of
individuals (Syvanen 2005). An increasing number
of reports indicate that the GoldenGate system of
Illumina is a reliable and cost-effective SNP
genotyping platform. It is capable of multiplexing from 96 to
1536 SNPs in a single reaction (Fan et al. 2003).
In this paper we describe a bioinformatics pipeline
starting from SNP discovery in ESTs to genotyping
using the Illumina GoldenGate assay. Following SNP
discovery, the SNP loci are further screened for
suitability to be analyzed with the Illumina
GoldenGate Genotyping platform. We demonstrate the
applicability of this pipeline for potato, which is the
third most important food crop in the world. Potato is
a heterozygous crop, and commercial varieties are
generally tetraploid. For potato, approximately
200,000 ESTs, mainly from three cultivars, are
publicly available. We show here that SNPs identified by
QualitySNP from this collection of ESTs can
effectively be turned into markers that can be mapped in
different diploid potato mapping populations,
showing the versatility of the pipeline and the produced
SNP markers. Our results indicate that the pipeline
produces a large number of SNP markers, and that the
selection of SNPs for genotyping on the Illumina
GoldenGate genotyping platform yields a large
number of reliable, functional, co-dominant markers that
can be easily placed on a genetic map.
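The two stages outlined above (SNP discovery, followed by screening of the SNP loci for assay suitability) can be illustrated with a simple filter. This is a minimal sketch under stated assumptions, not the QualitySNP or Illumina implementation: the record fields, the `confidence` score, its 0.9 threshold and the 60-bp flank length are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SnpCandidate:
    """Hypothetical record for one candidate SNP in an EST contig."""
    contig: str
    position: int                    # 0-based position in the contig consensus
    contig_length: int
    confidence: float                # assumed reliability score in [0, 1]
    other_snp_positions: tuple = ()  # other polymorphic positions in the contig


def flanks_are_clean(snp, flank=60):
    """True if `flank` bases exist on both sides and hold no other SNP."""
    left = snp.position - flank
    right = snp.position + flank
    if left < 0 or right >= snp.contig_length:
        return False  # not enough flanking sequence for assay design
    return not any(
        left <= p <= right and p != snp.position
        for p in snp.other_snp_positions
    )


def select_for_assay(candidates, min_confidence=0.9, flank=60):
    """Keep candidates that pass both the (assumed) score and flank checks."""
    return [
        s for s in candidates
        if s.confidence >= min_confidence and flanks_are_clean(s, flank)
    ]


candidates = [
    SnpCandidate("c1", 100, 300, 0.95, (100, 250)),  # passes both checks
    SnpCandidate("c2", 30, 300, 0.99),               # too close to contig start
    SnpCandidate("c3", 100, 300, 0.95, (120,)),      # second SNP in the flank
    SnpCandidate("c4", 100, 300, 0.50),              # low confidence
]
selected = select_for_assay(candidates)
```

Applying the reliability check before the flank check mirrors the order of the stages in the text; in practice the actual criteria would come from the SNP-calling software and the genotyping platform's design rules.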
Materials and methods
Mapping populations
(a) SH × RH: A cross between two diploid
heterozygous potato clones, SH83-92-488 and RH89-039-16
(SH × RH), resulted in an F1 mapping population
of 135 individuals (van Os et al. 2006). Using a
selective mapping strategy (Vision et al. 2000), the 57
individuals that captured the highest number of
recombination events were selected.
(b) C × E: This diploid backcross population
consisting of 250 genotypes was obtained from the cross
between clones C [USW5337.3 (Hanneman 1967)]
and E [originally named 77.2102.37
(Jacobsen 1980)]. Clone C is a hybrid between S.
phureja PI225696.1 and S. tuberosum dihaploid
USW42. Clone E is the result of a cross between
clone C and the S. vernei-S. tuberosum backcross
clone VH3-4211 (Jacobsen 1978). A set of 94
randomly selected individuals was used for this
study, along with the parents of the cross.
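Choosing a subset of individuals that captures as many recombination events as possible can be illustrated with a greedy heuristic. The sketch below is an illustration only and is not the procedure of Vision et al. (2000): it assumes that each individual is described by a hypothetical set of marker intervals in which it carries a crossover, and it repeatedly picks the individual that adds the most intervals not yet covered.

```python
def select_individuals(breakpoints, n_select):
    """Greedily pick individuals that cover the most crossover intervals.

    breakpoints: dict mapping an individual id to the set of marker-interval
                 ids in which that individual carries a crossover
                 (hypothetical input format).
    n_select:    maximum number of individuals to pick.
    Returns the chosen ids (in pick order) and the set of covered intervals.
    """
    remaining = dict(breakpoints)
    covered = set()
    chosen = []
    while remaining and len(chosen) < n_select:
        # Iterating over sorted ids makes ties resolve to the smallest id.
        best = max(sorted(remaining),
                   key=lambda ind: len(remaining[ind] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


example = {
    "p1": {1, 2, 3},
    "p2": {3, 4},
    "p3": {1, 2},
    "p4": {5},
}
chosen, covered = select_individuals(example, n_select=2)
```

A greedy pick is not guaranteed to be optimal, but it is simple and makes explicit the trade-off between sample size and the number of recombination events retained.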
DNA extracti (...truncated)