Sequence type analysis and recombinational tests (START)

Bioinformatics, Dec 2001

Summary: The 32-bit Windows application START is implemented using Visual Basic and C++ and performs analyses to aid in the investigation of bacterial population structure using multilocus sequence data. These analyses include data summary, lineage assignment, and tests for recombination and selection. Availability: START is available at http://outbreak.ceid.ox.ac.uk/software.htm. Contact: keith.jolley{at}ceid.ox.ac.uk


K. A. Jolley, E. J. Feil, M.-S. Chan, M. C. J. Maiden

Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK

Multilocus Sequence Typing (MLST) is a nucleotide sequence-based typing method that indexes the variation present in bacterial housekeeping genes, where most of the variation is selectively neutral (Maiden et al., 1998). Internal fragments of seven housekeeping genes, approximately 450–500 bp in length, are sequenced, and novel alleles are assigned arbitrary numbers sequentially to provide an allelic profile of seven integers that defines the Sequence Type (ST) of each isolate. The technique is designed primarily for global or long-term epidemiology and surveillance, and has the advantage over other typing methods, such as genetic fingerprinting, of electronic portability and unambiguous characterization of isolates. MLST schemes have been developed for a range of bacterial pathogens, and databases for these organisms can be interrogated at the MLST web-site (http://www.mlst.net/), thus facilitating rapid comparisons of isolates typed using the method. A further advantage of MLST is that it provides large quantities of data that may be analyzed by a number of evolutionary approaches to yield insights into the structure of bacterial populations and the selective pressures which act upon them. With the increasing availability of MLST data, the need for software to describe and analyze datasets has become apparent.
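The numbering scheme described above (each novel allele sequence receives the next integer at its locus, and each novel seven-integer profile receives the next ST number) can be illustrated with a minimal Python sketch. This is not code from START: the class `SequentialRegistry`, the function `type_isolate`, the placeholder locus names and the four-base toy sequences are all hypothetical and serve only to show the bookkeeping.

```python
from typing import Dict, Hashable, Tuple


class SequentialRegistry:
    """Hands out 1, 2, 3, ... to keys in the order they are first seen."""

    def __init__(self) -> None:
        self._numbers: Dict[Hashable, int] = {}

    def number_for(self, key: Hashable) -> int:
        # A novel key gets the next unused integer; a known key keeps its number.
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]


# Hypothetical placeholder names for the seven loci of an MLST scheme.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")


def type_isolate(
    fragments: Dict[str, str],
    alleles: Dict[str, SequentialRegistry],
    sequence_types: SequentialRegistry,
) -> Tuple[Tuple[int, ...], int]:
    """Return (allelic profile, ST) for one isolate's sequenced fragments."""
    profile = tuple(
        alleles[locus].number_for(fragments[locus].upper()) for locus in LOCI
    )
    return profile, sequence_types.number_for(profile)


# One allele registry per locus, and a single registry for profiles (STs).
alleles = {locus: SequentialRegistry() for locus in LOCI}
sequence_types = SequentialRegistry()

# Toy fragments (real fragments are roughly 450-500 bp long).
base = {locus: "ACGT" for locus in LOCI}
variant = dict(base, locus3="ACGA")  # differs from `base` at one locus only

profile_a, st_a = type_isolate(base, alleles, sequence_types)
profile_b, st_b = type_isolate(variant, alleles, sequence_types)
profile_c, st_c = type_isolate(base, alleles, sequence_types)

print(profile_a, st_a)
print(profile_b, st_b)
print(profile_c, st_c)
```

In this toy run the third isolate carries the same seven alleles as the first, so it receives the same ST, which is what makes profiles directly comparable between laboratories.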
Sequence Type Analysis and Recombinational Tests (START) was written to address the need for such software through the inclusion of multiple analytical techniques in an easy-to-use and intuitive interface for Windows 95/98/NT/2000 operating systems. Techniques available within the START program are divided into four categories: data summary, lineage assignment, tests for recombination and tests for selection.

Two input files are required for many of the tests: allelic profiles, consisting of isolate identifiers and allele numbers, and allele sequences. Profile data can be entered into the program directly from the keyboard, by pasting from the clipboard or by loading a tab-delimited text file, while allele sequences need to be in FASTA format. The program utilizes an embedded web-browser for output, which uses HTML to enable easy formatting of tables and inclusion of diagrams generated by the lineage assignment algorithms, as well as printing and saving of results. Graphical output from analyses is produced in the form of Windows Metafiles (WMF) embedded within the page, and these may be saved for manipulation within a graphics package. With all tests it is possible to select subsets of isolates to analyze.

There are five data summary functions available within START. The allele and profile frequency functions display the relative abundance of each allele sequence or ST within the dataset. The polymorphism frequency function produces a gene sequence and a table showing the positions of all polymorphic sites within the dataset and, where a polymorphism is unique, highlights the corresponding isolate name and/or allele number. The codon usage and GC content functions produce appropriate frequency tables broken down by locus.

To aid in the assignment of STs to lineages, BURST (Feil, in preparation) and UPGMA methods are implemented, along with a function to create a distance matrix.
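As an illustration of the profile input and the distance matrix mentioned above, the following Python sketch parses tab-delimited allelic profiles and counts, for every pair of isolates, the number of loci at which their allele numbers differ. It rests on assumptions not stated in the source: the column layout (isolate identifier first, then one allele number per locus), the choice of locus differences as the distance measure, and the function names `parse_profiles`, `locus_differences` and `distance_matrix`, which are hypothetical and not part of START.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def parse_profiles(text: str) -> Dict[str, Profile]:
    """Parse tab-delimited lines: isolate identifier, then allele numbers.

    The column layout is an assumption made for this sketch.
    """
    profiles: Dict[str, Profile] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        isolate, *alleles = line.rstrip("\n").split("\t")
        profiles[isolate] = tuple(int(a) for a in alleles)
    return profiles


def locus_differences(a: Profile, b: Profile) -> int:
    """Number of loci at which two allelic profiles carry different alleles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same number of loci")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: Dict[str, Profile]) -> Tuple[List[str], List[List[int]]]:
    """Return isolate names and the symmetric matrix of pairwise locus differences."""
    names = list(profiles)
    matrix = [
        [locus_differences(profiles[row], profiles[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy seven-locus profiles; the isolate names and allele numbers are invented.
example = (
    "iso1\t1\t1\t1\t1\t1\t1\t1\n"
    "iso2\t1\t1\t2\t1\t1\t1\t1\n"
    "iso3\t1\t4\t2\t1\t1\t1\t3\n"
)

profiles = parse_profiles(example)
names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, row)
```

Under this measure, a distance of 1 marks a pair of single-locus variants and a distance of 2 a pair of double-locus variants. A matrix of this kind could also serve as input to a clustering method such as UPGMA.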
BURST is a clustering algorithm designed for use on microbial MLST data. It examines the relationships within clonal complexes, in which isolates are grouped based on the number of locus differences within their profiles. A putative founder genotype may be identified based on its number of single- and double-locus variants, and a summary graphical representation displayed. Figure 1 is part of the output obtained from the analysis of 156 MLST profiles using the housekeeping genes abcZ, adk, aroE, fumC, gdh, pdhC and pgm, from a carriage population of Neisseria meningitidis (Jolley et al., 2000). This shows one of twelve clonal complexes identified by the algorithm, grouped around a recognized hyper-invasive genotype, ST-44, and the inter-relationships within the complex. These functions can also be used to estimate recombinational parameters (Feil et al., 2001).

Fig. 1. BURST analysis in START showing one of the clonal groupings obtained from a carriage sample of N. meningitidis. The group comprises 19 isolates with seven unique STs centred around ST-44. The three STs within the inner ring of the diagram are single-locus variants of ST-44, while those in the outer ring are double-locus variants.

START includes a number of tests which can be used to investigate the extent and significance of recombination. These are Sawyer's Runs Test (Sawyer, 1989), the Maximum Chi-Squared (χ²) Test (Maynard-Smith, 1992), the Homoplasy Test (Maynard-Smith and Smith, 1998) and the Index of Association (I_A) (Maynard-Smith et al., 1993).

The ratio of non-synonymous (dN) to synonymous (dS) substitutions per nucleotide site is an indicator of the kind of selective pressure acting on a gene as a whole. START uses the method of Nei and Gojobori (1986) to estimate these parameters, providing values for each locus in the dataset.

The package integrates these methods for analysis of MLST datasets and includes full on-line help and example data.

K.A.J., E.J.F. and M-S.C. are supported by the Wellcome Trust. We thank John Maynard-Smith for the code used in the I_A and Homoplasy Tests. MM is a Wellcome Trust Senior Fellow in Biodiversity Research. (...truncated)


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/17/12/1230.full.pdf
Article home page: http://bioinformatics.oxfordjournals.org/content/17/12/1230.abstract

K. A. Jolley, E. J. Feil, M.-S. Chan, M. C. J. Maiden. Sequence type analysis and recombinational tests (START), Bioinformatics, 2001, pp. 1230-1231, 17/12, DOI: 10.1093/bioinformatics/17.12.1230