Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP

Article PDF: https://bioinformatics.oxfordjournals.org/content/19/3/421.full.pdf

Gary Barker [2], Jacqueline Batley [1], Helen O'Sullivan [2], Keith J. Edwards [0], David Edwards [1]

[0] School of Biological Sciences, University of Bristol, BS8 1UG, UK
[1] Agriculture Victoria Plant Biotechnology Centre, La Trobe University, Bundoora, Victoria 3086, Australia
[2] Institute of Arable Crop Research, Long Ashton, Bristol, BS41 9AF, UK

Summary: AutoSNP is a program to detect single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels) in expressed sequence tag (EST) data. The program uses d2cluster and cap3 to cluster and align EST sequences, and uses redundancy to differentiate between candidate SNPs and sequence errors. Candidate polymorphisms are identified as occurring in multiple reads within an alignment. For each candidate SNP, two measures of confidence are calculated, the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment.

Availability: The program was written in PERL and is freely available to non-commercial users by request from the authors.

Contact: -

Single nucleotide polymorphisms (SNPs) are increasingly becoming the marker of choice in genetic analysis. They are used routinely in agriculture as markers in breeding programs and have many uses in human genetics, such as the detection of alleles associated with genetic diseases and the identification of individuals. SNPs are invaluable for genome mapping, offering the potential for generating very high density genetic maps (Rafalski, 2002). The low mutation rate of SNPs also makes them excellent markers for studying complex genetic traits and as a tool for the understanding of genome evolution (Syvanen, 2001). As with the majority of molecular markers, one of the limitations of SNPs is the initial cost associated with their development.
However, with the development of high-throughput sequencing technology, large amounts of data have been submitted to the various DNA databases that may be suitable for data mining and SNP discovery (Taillon-Miller et al., 1998). In particular, EST sequencing programs have provided a wealth of information, identifying novel genes from a broad range of organisms. EST sequence data may provide the richest source of biologically useful SNPs due to the relatively high redundancy of gene sequence, the diversity of genotypes represented within databases and the fact that each SNP would be associated with an expressed gene.

Methods used to identify SNPs in aligned sequence data have previously relied on sequence trace file analysis to filter out sequence errors by their dubious trace quality (Kwok et al., 1994; Marth et al., 1999; Garg et al., 1999). The major drawbacks to this approach are the requirement for sequence trace files, which are rarely complete for large sequence datasets collated from a variety of sources, and the high level of sequence error associated with the reverse transcription process. We have attempted to overcome these difficulties by developing software for the automated detection of SNPs within EST data, with associated measurements of confidence in the validity of candidate SNPs. A conservative approach was followed to limit the error associated with cloning and sequencing, so that only polymorphisms represented by two or more sequences were considered. While this discards a significant amount of variation in the EST data, it permits the ready identification of large numbers of candidate SNPs with a high level of confidence in their validity.

PROGRAM OPTIONS

The AutoSNP script is run from the command line. On start-up, the user is asked to supply the name of a FASTA format input file together with similarity cut-offs for d2cluster and cap3. Default values are 80% similarity for d2cluster and 95% for cap3.
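The conservative redundancy rule (only polymorphisms represented by two or more sequences are considered) can be illustrated with a short sketch. It is written in Python rather than the Perl used by the authors, and it is not AutoSNP's own code: the function name `candidate_columns`, the `MISSING` symbols and the input format (a list of equal-length gapped strings) are assumptions made for illustration. A column is flagged when at least two different alleles are each seen in two or more reads, so a variant seen in a single read is treated as a possible sequencing error.

```python
from collections import Counter

# Assumption: these symbols mark positions a read does not cover or cannot call.
MISSING = {".", "N", "n"}


def candidate_columns(alignment, min_redundancy=2):
    """Return {column index: {allele: read count}} for candidate polymorphisms.

    alignment: list of equal-length gapped strings, one per aligned EST read.
    A gap character ("-") is counted as an allele, so indels are reported too.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned reads must have the same length")
    candidates = {}
    for col, bases in enumerate(zip(*alignment)):
        counts = Counter(b.upper() for b in bases if b not in MISSING)
        # Keep only alleles backed by enough reads to pass the redundancy rule.
        supported = {a: n for a, n in counts.items() if n >= min_redundancy}
        if len(supported) >= 2:
            candidates[col] = supported
    return candidates


reads = [
    "ACGTAC",
    "ACGTAC",
    "ATGTAC",
    "ATGTCC",
    "ACGTAC",
]
# Column 1 has C in three reads and T in two reads, so it is a candidate.
# Column 4 has a C in only one read, so it is discarded.
print(candidate_columns(reads))
```

Raising `min_redundancy` makes the filter stricter at the cost of discarding more real variation, which is the trade-off the text describes.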
PROGRAM FLOW AND DEPENDENCIES

Initial clustering is carried out by d2cluster (Burke et al., 1999). AutoSNP reads the output table created by d2cluster, and uses the information to build sequence cluster files in FASTA format. These clusters are then passed to the sequence assembly program cap3 (Huang and Madan, 1999). AutoSNP reads the ACE format output file from each cap3 run, and generates gapped FASTA format alignment files which are finally passed to the SNP detection and co-segregation subroutines.

Fig. 1. An AutoSNP report summary. This report depicts 11 candidate SNPs, identifying their base position in the sequence alignment along with two measures of confidence in SNP validity. The Min. informative score measures the minimum number of sequences that represent a polymorphism. The cosegregation score is a measure of the number of SNPs in the alignment which share the same pattern of polymorphism between aligned sequences. The weighted cosegregation score takes account of missing data in the alignment of ESTs that may otherwise bias the cosegregation score. The key relates the aligned sequences to the original GenBank sequence and also identifies the maize line (where available) derived from the GenBank annotation. The full SNP report includes the complete sequence alignment along with the above SNP summary.

PROGRAM OUTPUT

The primary output of AutoSNP is a set of linked HTML format SNP reports, prefaced by an index page containing statistical information relating to the sequence contig assembly and candidate SNP/indel identification. The SNP report pages have three components: (i) a key to the sequences in the alignment, (ii) a summary table showing the candidate SNPs/indels, together with confidence scores, and (iii) a full vertical alignment of the sequences, with the SNPs highlighted (Figure 1). Each SNP report also has a hyperlink to the underlying sequence alignment in FASTA format.
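The confidence scores in the summary table can be pictured with a small sketch. The source does not give the exact formulas, so everything here is an assumption made for illustration, written in Python rather than the authors' Perl: `same_pattern`, `cosegregation_scores`, `min_informative_score` and the `MISSING` symbols are hypothetical names. Two candidate SNP columns are taken to cosegregate when they split the reads into the same groups, compared only on reads that have data at both positions. The Min. informative score is taken to be the read count of the least represented allele. The weighted cosegregation score is left out because its weighting is not described.

```python
from collections import Counter

# Assumption: these symbols mark positions a read does not cover or cannot call.
MISSING = {".", "N"}


def same_pattern(col_a, col_b):
    """True if two columns split the reads into the same groups.

    Only reads with data in both columns are compared, and at least two
    alleles must be present among those reads.
    """
    forward, backward = {}, {}
    shared = 0
    for a, b in zip(col_a, col_b):
        if a in MISSING or b in MISSING:
            continue
        shared += 1
        # Each allele in one column must pair with exactly one allele in the other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return shared > 0 and len(forward) > 1


def cosegregation_scores(alignment, snp_columns):
    """For each SNP column, count the other SNP columns sharing its pattern."""
    columns = {c: [row[c] for row in alignment] for c in snp_columns}
    return {
        c: sum(
            1
            for other in snp_columns
            if other != c and same_pattern(columns[c], columns[other])
        )
        for c in snp_columns
    }


def min_informative_score(column):
    """Read count of the least represented allele (0 if the column is monomorphic)."""
    counts = Counter(b for b in column if b not in MISSING)
    return min(counts.values()) if len(counts) > 1 else 0


reads = ["ACGTAC", "ACGTAC", "ATGTCC", "ATGTCC", "ACGAAC", "ACGAAC"]
# Columns 1 and 4 separate reads 3-4 from the rest; column 3 separates reads 5-6.
print(cosegregation_scores(reads, [1, 3, 4]))
print(min_informative_score([row[1] for row in reads]))
```

In this toy alignment, columns 1 and 4 support each other while column 3 follows a different grouping of reads, which is the kind of haplotype signal the cosegregation score is meant to capture.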
In addition to the main report, several supporting files are produced which hold information such as the frequency distribution of cap3 sequence contig sizes, the number of SNPs associated with each size of sequence contig, nucleotide substitution ratios and tables of indel sequence and size frequency.

PERFORMANCE WITH THE MAIZE TEST DATA

An input file containing 102 551 maize ESTs was downloaded from ZmDB (http://www.zmdb.iastate.edu/), and the AutoSNP program was executed on a 1 GHz Intel Pentium III PC with 520 MB RAM running RedHat Linux 7.0. The d2cluster program took 6 days to organize the sequences into primary clusters. The cap3 assembly and SNP detection took a further 22 h to complete. Of the 13 247 clusters produced by cap3, 3479 were found to contain one or more candidate SNPs. A total of 14 832 candidate polymorphisms were identified (http://www.cerealsdb.uk.net/discover.htm). Indel size frequencies, nucleotide substitution ratios and segregation of candidate polymorphisms with haplotypes indicate that the majority of SNPs and indels identified using this approach represent true genetic variation in maize.

ACKNOWLEDGEMENTS

IACR-Long Ashton receives grant aided support from the Biotechnology and Biological Sciences Research Council of the United Kingdom. David (...truncated)