SEAN: SNP prediction and display program utilizing EST sequence clusters

Bioinformatics, Feb 2006

Summary: SEAN is an application that predicts single nucleotide polymorphisms (SNPs) using multiple sequence alignments produced from expressed sequence tag (EST) clusters. The algorithm uses rules of sequence identity and SNP abundance to determine the quality of the prediction. A Java viewer is provided to display the EST alignments and predicted SNPs.

BIOINFORMATICS APPLICATIONS NOTE, Vol. 22 no. 4 2006, pages 495–496, doi:10.1093/bioinformatics/btk006

Sequence analysis

Derek Huntley1,*, Angela Baldo4, Saurabh Johri2 and Marek Sergot3

1Centre for Bioinformatics, Division of Molecular Biosciences, 2Centre for Molecular Microbiology and Infection, Division of Investigative Sciences and 3Department of Computing, Imperial College, London SW7 2AZ, UK and 4USDA-ARS Plant Genetic Resources Unit, New York State Agricultural Experiment Station, Geneva, NY 14456, USA

*To whom correspondence should be addressed.

Availability: SEAN is freely available from http://zebrafish.doc.ic.ac.uk/Sean

INTRODUCTION

Expressed sequence tags (ESTs) are an important resource for identifying polymorphisms in transcribed regions. Estimates of polymorphism frequency are, for example, in the range of 1 every 1.3 kb in humans (Sachidanandam et al., 2001) and 1 every 7 kb in cultivated tomatoes (Nesbitt and Tanksley, 2002). SEAN provides a method to predict and visualize single nucleotide polymorphisms (SNPs) using EST sequence clusters.

EST data have previously been used for SNP prediction by programs such as AutoSNP (Barker et al., 2003), PolyPhred (Nickerson et al., 1997), PolyBayes (Marth et al., 1999), TRACE_DIFF (Bonfield et al., 1998) and HarvEST (HarvEST Home Page available at http://harvest.ucr.edu). Whereas HarvEST provides pre-built SNP prediction libraries, AutoSNP, PolyPhred and PolyBayes, like SEAN, enable the prediction of SNPs from a user's own EST dataset.
SEAN, like AutoSNP, uses the redundancy of the SNP in an alignment as a measure of confidence but reinforces this with a measure of sequence identity in the surrounding aligned sequences. Unlike the other tools listed, SEAN also allows library data to be included to further support SNP predictions. A Java viewer is included that enables the visualization of the alignments and SNP predictions for user inspection.

The search strategy for SEAN is based on the work of Picoult-Newberg et al. (1999). The sequence assembly program Phrap (available at http://www.phrap.org) is used to build a consensus from the clustered sequences. Using the output file produced by the Phrap ‘ace’ flag, the sequence alignment, including the consensus, is built and then parsed to find potential SNPs.

Five output files are produced by SEAN: three reference files and two Java configuration files. The first two reference files contain the sequence alignments (only those regions that align with the consensus are in the first file; the full alignments are in the second) together with a list of the potential SNPs and their locations and the consensus sequence in FASTA format. The third reference file lists the contigs produced by Phrap and their details: sequences, average sequence length and number of predicted SNPs. The two Java configuration files, one for the alignments only and one for the complete sequences, are for a Java viewer that has been developed to enable visual inspection of the alignments and predicted SNPs.

There is an option to include cultivar and library information for an improved SNP prediction. If this is used, an additional output file details each predicted SNP position within the consensus and the number of occurrences of each base within each library at that position. This gives additional evidence of the quality of the predicted SNP.
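The per-library tally described for the optional output file can be pictured with a short Python sketch. This is an illustration only, not SEAN's code or its file format: the function name `bases_by_library`, the representation of each aligned read as a `(library, sequence)` pair indexed by consensus position, and the decision to ignore gap and padding characters are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def bases_by_library(
    reads: List[Tuple[str, str]], position: int
) -> Dict[str, Dict[str, int]]:
    """Count the bases seen at one consensus position, grouped by library.

    `reads` holds hypothetical (library_name, aligned_sequence) pairs in
    which every sequence is indexed by consensus position.
    """
    tally: Dict[str, Counter] = defaultdict(Counter)
    for library, sequence in reads:
        if position >= len(sequence):
            continue  # this read does not reach the position
        base = sequence[position]
        if base in {"-", " "}:
            continue  # assumption: gaps and padding are not counted
        tally[library][base] += 1
    return {library: dict(counts) for library, counts in tally.items()}


# Toy data: two libraries, four aligned reads, tally taken at position 2.
example_reads = [
    ("libA", "ACGT"),
    ("libA", "ACTT"),
    ("libB", "AC-T"),
    ("libB", "ACTT"),
]
example_tally = bases_by_library(example_reads, 2)
print(example_tally)
```

A variant base that appears in several reads from the same library, and not only in single reads scattered across libraries, is the kind of supporting evidence such a tally makes visible.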
The viewer has been developed using the Neomorphic Genomic Software Development Kit (NGSDK) (available at http://www.affymetrix.com). It displays the sequences as solid bars, with the position of any potential SNP shown by a red point at the top of the display and the corresponding positions in the relevant sequences highlighted in red. If the SNP predictions have been generated using the library and cultivar data, the SNPs predicted with lower confidence are coloured green to distinguish them. The display can be zoomed, and at full horizontal zoom the bars are overlaid with their nucleotide sequence. SEAN requires Perl and Phrap for the analysis component and Java (1.3+) for the viewer.

IMPLEMENTATION

The sequence alignment generated by SEAN is parsed one base position at a time to find potential SNPs by comparing the base at each position with the corresponding consensus base. To eliminate poor-quality sequence, when a base difference is found the surrounding sequence is compared with the consensus over a defined window, by default 15 bp either side of the base but configurable when running SEAN. If the sequences in the windows are identical, the base and its position are flagged and stored as a predicted SNP.

Received on June 29, 2005; revised on November 23, 2005; accepted on December 11, 2005. Advance Access publication December 15, 2005. Associate Editor: Chris Stoeckert

VALIDATION

In silico validation of SEAN has been carried out by searching mouse and human UniGene (Boguski and Schuler, 1995) clusters and confirming the predicted SNPs using the relevant dbSNP databases (Sherry et al., 2001). UniGene clusters were selected with the minimum number of four sequences required for SNP prediction and a maximum number of 500.
This provided 27 169 human and 29 360 mouse clusters, from which 128 408 human and 328 714 mouse SNPs were predicted. dbSNP contained 9 123 517 human and 506 198 mouse SNPs and confirmed 32 150 human predicted SNPs (25%) and 8528 mouse (24%).

SEAN has been used to successfully identify SNPs among public ESTs from tomato cultivars. Among 53 contigs re-sequenced in two or three cultivars, 21 confirmed the SNPs predicted by SEAN (Labate and Baldo, 2005). Five additional SNPs were visible in the SEAN viewer but were not predicted because they fell within 15 bp of each other. Overall, the efficiency of SNP discovery/confirmation was increased 10-fold by using SEAN to target SNP-containing regions, relative to sequencing arbitrary regions of the genome (Labate and Baldo, 2005). Further validation results are documented on the website (SEAN SNP prediction and display programs available at http://zebrafish.doc.ic.ac.uk/Sean/).

ACKNOWLEDGEMENTS

The authors gratefully acknowledge Joanne Labate for the confirmation data of SNPs in cultivated tomato and constructive suggestions for improvements in the SEAN prediction package and viewer. The authors also thank Elizabeth Fisher for the o (...truncated)
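As an illustration of the window check described under IMPLEMENTATION, the following Python sketch flags a base that differs from the consensus only when the flanking bases on both sides match the consensus exactly. It is not SEAN's code (the analysis component of SEAN requires Perl and Phrap). The function names, the representation of aligned reads as strings of the same length as the consensus and padded with spaces where a read does not cover it, the skipping of gap characters, and the requirement for a full window on both sides are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def flanks_match(read: str, consensus: str, pos: int, window: int) -> bool:
    """True if `window` bases on each side of `pos` equal the consensus."""
    start, end = pos - window, pos + window + 1
    if start < 0 or end > len(consensus):
        return False  # assumption: a full window is required on both sides
    return (
        read[start:pos] == consensus[start:pos]
        and read[pos + 1:end] == consensus[pos + 1:end]
    )


def candidate_snps(
    consensus: str, reads: List[str], window: int = 15
) -> Dict[Tuple[int, str], int]:
    """Map (position, variant base) to the number of reads supporting it."""
    support: Counter = Counter()
    for read in reads:
        if len(read) != len(consensus):
            raise ValueError("reads must be padded to the consensus length")
        for pos, (base, cons_base) in enumerate(zip(read, consensus)):
            if base in {" ", "-"} or base == cons_base:
                continue  # uncovered, gapped or identical to the consensus
            if flanks_match(read, consensus, pos, window):
                support[(pos, base)] += 1
    return dict(support)


# Toy alignment with a 3 bp window so the example stays short.
toy_consensus = "ACGTACGTAC"
toy_reads = [
    "ACGTTCGTAC",  # A->T at position 4 with clean flanks: counted
    "ACGTTCGTAC",  # the same variant in a second read: counted
    "ACGTTGGTAC",  # two adjacent differences: neither passes the check
    "TCGTACGTAC",  # difference at position 0, no full left window: skipped
]
toy_result = candidate_snps(toy_consensus, toy_reads, window=3)
print(toy_result)  # {(4, 'T'): 2}
```

The third toy read shows why two true SNPs lying within one window of each other are not reported under this rule: each one breaks the identity check for the other, which is consistent with the five tomato SNPs that were visible in the viewer but not predicted. The support count returned for each candidate corresponds to the redundancy of the SNP in the alignment; any threshold applied to it would be a further assumption.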



Huntley, Derek, Baldo, Angela, Johri, Saurabh, Sergot, Marek. SEAN: SNP prediction and display program utilizing EST sequence clusters, Bioinformatics, 2006, pp. 495-496, Volume 22, Issue 4, DOI: 10.1093/bioinformatics/btk006