Automated genome sequence analysis and annotation.

Full text (PDF):

https://bioinformatics.oxfordjournals.org/content/15/5/391.full.pdf

Center for Genome Research, Cambridge, MA 02139, USA

Motivation: Large-scale genome projects generate a rapidly increasing number of sequences, most of them biochemically uncharacterized. Research in bioinformatics contributes to the development of methods for the computational characterization of these sequences. However, the installation and application of these methods require experience and are time consuming.

Results: We present here an automatic system for preliminary functional annotation of protein sequences that has been applied to the analysis of sets of sequences from complete genomes, both to refine overall performance and to make new discoveries comparable to those made by human experts. The GeneQuiz system includes a Web-based browser that allows examination of the evidence leading to an automatic annotation and offers additional information, views of the results, and links to biological databases that complement the automatic analysis. System structure and operating principles concerning the use of multiple sequence databases, underlying sequence analysis tools, lexical analyses of database annotations and decision criteria for functional assignments are detailed. The system makes automatic quality assessments of results based on prior experience with the underlying sequence analysis tools; overall error rates in functional assignment are estimated at 2.5-5% for cases annotated with highest reliability ('clear' cases). Sources of over-interpretation of results are discussed with proposals for improvement. A conservative definition for reporting new findings that takes account of database maturity is presented along with examples of possible kinds of discoveries (new function, family and superfamily) made by the system. System performance in relation to sequence database coverage, database dynamics and database search methods is analysed, demonstrating the inherent advantages of an integrated automatic approach using multiple databases and search methods applied in an objective and repeatable manner.

Availability: The GeneQuiz system is publicly available for analysis of protein sequences through a Web server at http://www.sander.ebi.ac.uk/gqsrv/submit

Supplementary information: http://www.sander.ebi.ac.uk/genequiz/

Functional analyses of protein sequences can now be performed on a computer using a variety of software tools that allow the user to exploit the biochemical knowledge accumulated in sequence databases. For example, the correlation of sequence similarity with similarity of function provides a basis for transferring functional knowledge from a biochemically characterized protein to a homologous, but otherwise uncharacterized one. Given a protein sequence, analysis of the conservation patterns in the corresponding protein family can allow the association of regions of the sequence or of individual residues with structural or functional motifs and may even allow the construction of a three-dimensional (3D) model by homology to a known structure in the family. Such theoretically obtained functional and structural insights may be used to direct the comparatively much more lengthy, difficult and expensive experimentation on the real protein.

Although these methods are available to the researcher, their application can be cumbersome for various reasons. First, computer programs may be difficult to install and maintain. Some of them require the combined installation of huge nucleotide and protein databases that currently contain hundreds of thousands of sequences requiring gigabytes of disk storage space.
The installation and maintenance of such programs and/or databases require suitably powerful computer hardware as well as special skills, so that the effort may be disproportionate for an experimental group working on a small number of proteins. Fortunately, for small requirements, some of these tools are available for interactive (Web server) or semi-interactive (Web or mail server) use over the Internet. However, the user will be constrained by the range of software available in this manner, by the choice of databases or even program parameters provided by any service, and by the turnaround time of the remote service or the speed of Internet access.

Even if access to appropriate software and databases is available, a second major difficulty is the need for specialist skills in using these programs effectively, both in choosing appropriate parameter settings and in evaluating the significance of the results. This expert knowledge can only be acquired through repeated use of the tools, often comparing and combining results from several methods. Again, a researcher interested only in a small number of proteins may not have this experience.

If a group is interested in analysing a great number of uncharacterized sequences, as from the large-scale sequencing projects, then installation of the programs and databases and investment in the necessary expertise are worthwhile, indeed essential. However, a third problem arises: applying the methods and evaluating the results for a large number of sequences requires a considerable amount of computer and human expert time, as well as tight quality control to ensure uniformity of application and interpretation. Moreover, methods and databases improve over time and frequent re-analysis may bring new results.
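
To make the idea of homology-based function transfer under uniform decision criteria concrete, here is a minimal Python sketch. It is not the GeneQuiz implementation: the `Hit` record, the E-value thresholds, the list of uninformative terms and the "marginal" label are all illustrative assumptions. Only the notion of a highest-reliability 'clear' class comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical record for one database match (not a GeneQuiz data structure).
@dataclass(frozen=True)
class Hit:
    subject_id: str
    description: str  # free-text annotation of the matched database entry
    evalue: float     # significance of the match; smaller is better

# Assumed values, chosen only for illustration.
UNINFORMATIVE_TERMS = ("hypothetical", "unknown function", "uncharacterized")
CLEAR_EVALUE = 1e-10
MARGINAL_EVALUE = 1e-3

def is_informative(description: str) -> bool:
    """True if the annotation text says something about function."""
    text = description.lower()
    return not any(term in text for term in UNINFORMATIVE_TERMS)

def assign_function(hits: Iterable[Hit]) -> Optional[Tuple[str, str]]:
    """Transfer the annotation of the most significant informative hit.

    Returns (description, reliability) or None when no hit is both
    informative and significant enough.
    """
    informative = [h for h in hits if is_informative(h.description)]
    if not informative:
        return None
    best = min(informative, key=lambda h: h.evalue)
    if best.evalue <= CLEAR_EVALUE:
        return best.description, "clear"
    if best.evalue <= MARGINAL_EVALUE:
        return best.description, "marginal"
    return None

# Example: the top hit is uninformative, so the next one is used.
example_hits = [
    Hit("db|X1", "hypothetical protein", 1e-40),
    Hit("db|X2", "alcohol dehydrogenase", 1e-25),
]
print(assign_function(example_hits))
```

Because the same fixed rule is applied to every query, results are repeatable across thousands of sequences, and re-running the function against updated database hits is a cheap way to pick up new assignments.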
These three problems, (i) flexible installation and maintenance of a set of methods and databases, (ii) need for expertise in the use and evaluation of the methods and (iii) fast and uniform analysis of the results, were partially addressed by the development of the first GeneQuiz system (Scharf et al., 1994; Casari et al., 1996). GeneQuiz is a semi-automated protein sequence analysis system, the principal purpose of which is to infer a specific and reliable functional assignment together with a broad cellular role for a query protein by analysis of annotations from sequence database matches. The system also applies a selected suite of analysis tools to the query sequence, integrating the results into a coherent display to complement the functional assignments. The GeneQuiz system is able to process large numbers of sequences quickly and repeatably in a consistent manner, and makes use of regularly updated combined sequence databases. Thus, the system can be used for occasional analyses of a few query protein sequences, or it can be systematically applied to the large numbers of open reading frames (ORFs) identified in a genome sequencing project. A high degree of automation is required to cope with the analysis of the huge number of sequences generated by genome sequencing projects, and to ensure consistent and reprod (...truncated)