Automated genome sequence analysis and annotation.
Center for Genome Research, Cambridge, MA 02139, USA
Miguel Andrade, Sebastian Hoersch, Angelo Franchini, Christos Ouzounis
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK and Protein Design Group, CNB-CSIC, Campus U. Autónoma, Cantoblanco, Madrid, Spain
Motivation: Large-scale genome projects generate a rapidly
increasing number of sequences, most of them biochemically
uncharacterized. Research in bioinformatics contributes to
the development of methods for the computational
characterization of these sequences. However, the installation and
application of these methods require experience and are time
consuming.
Results: We present here an automatic system for
preliminary functional annotation of protein sequences that has
been applied to the analysis of sets of sequences from
complete genomes, both to refine overall performance and to
make new discoveries comparable to those made by human
experts. The GeneQuiz system includes a Web-based
browser that allows examination of the evidence leading to
an automatic annotation and offers additional information,
views of the results, and links to biological databases that
complement the automatic analysis. System structure and
operating principles concerning the use of multiple sequence
databases, underlying sequence analysis tools, lexical
analyses of database annotations and decision criteria for
functional assignments are detailed. The system makes
automatic quality assessments of results based on prior
experience with the underlying sequence analysis tools;
overall error rates in functional assignment are estimated at
2.5-5% for cases annotated with highest reliability (clear
cases). Sources of over-interpretation of results are
discussed with proposals for improvement. A conservative
definition for reporting new findings that takes account of
database maturity is presented along with examples of
possible kinds of discoveries (new function, family and
superfamily) made by the system. System performance in
relation to sequence database coverage, database dynamics
and database search methods is analysed, demonstrating the
inherent advantages of an integrated automatic approach
using multiple databases and search methods applied in an
objective and repeatable manner.
Availability: The GeneQuiz system is publicly available for
analysis of protein sequences through a Web server at
http://www.sander.ebi.ac.uk/gqsrv/submit
Contact:
Supplementary information: http://www.sander.ebi.ac.uk/genequiz/

Introduction
Functional analyses of protein sequences can now be
performed on a computer using a variety of software tools that
allow the user to exploit the biochemical knowledge
accumulated in sequence databases. For example, the correlation
of sequence similarity with similarity of function provides a
basis for transferring functional knowledge from a
biochemically characterized protein to a homologous, but
otherwise uncharacterized one. Given a protein sequence, analysis
of the conservation patterns in the corresponding protein
family can allow the association of regions of the sequence
or of individual residues with structural or functional motifs
and may even allow the construction of a three-dimensional
(3D) model by homology to a known structure in the family.
Such computationally derived functional and structural insights
may be used to direct the far lengthier, more difficult and
more expensive experimental work on the real protein.
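
The idea of transferring function by sequence similarity can be illustrated with a minimal sketch. It is not the GeneQuiz procedure: the `Hit` record, the function names and both thresholds (`min_identity`, `min_columns`) are hypothetical choices made for illustration, and the alignments are assumed to have been computed beforehand by some search tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One pairwise alignment between the query and a database protein."""
    aligned_query: str               # query row of the alignment, '-' marks a gap
    aligned_target: str              # target row, same length as aligned_query
    target_function: Optional[str]   # annotation of the target, None if uncharacterized


def aligned_pairs(hit: Hit) -> List[Tuple[str, str]]:
    """Residue pairs from alignment columns that contain no gap."""
    return [(q, t) for q, t in zip(hit.aligned_query, hit.aligned_target)
            if q != "-" and t != "-"]


def percent_identity(hit: Hit) -> float:
    """Identical residues as a percentage of gap-free alignment columns."""
    pairs = aligned_pairs(hit)
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


def transfer_function(hits: List[Hit],
                      min_identity: float = 40.0,   # illustrative threshold only
                      min_columns: int = 80) -> Optional[str]:
    """Return the annotation of the most similar characterized hit, or None."""
    best: Optional[Hit] = None
    best_identity = 0.0
    for hit in hits:
        if hit.target_function is None:
            continue  # an uncharacterized match carries no function to transfer
        identity = percent_identity(hit)
        long_enough = len(aligned_pairs(hit)) >= min_columns
        if long_enough and identity >= min_identity and identity > best_identity:
            best, best_identity = hit, identity
    return best.target_function if best is not None else None
```

Requiring both a minimum identity and a minimum alignment length reflects the common caution that short, highly similar segments are weak evidence of shared function; real systems typically rely on statistical significance scores rather than raw identity.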
Although these methods are available to the researcher,
their application can be cumbersome for various reasons.
First, computer programs may be difficult to install and
maintain. Some of them require the combined installation of
huge nucleotide and protein databases that currently contain
hundreds of thousands of sequences requiring gigabytes of
disk storage space. The installation and maintenance of such
programs and/or databases require suitably powerful
computer hardware as well as special skills, so that the effort may
be disproportionate for an experimental group working on a
small number of proteins. Fortunately, for small-scale
needs, some of these tools are available for interactive (Web
server) or semi-interactive (Web or mail server) use over the
Internet. However, the user is then limited to the range
of software offered in this way and to the choice
of databases and program parameters provided by each
service, and is subject to the turnaround time of the remote
service and the speed of Internet access.
Even if access to appropriate software and databases is
available, a second major difficulty is the need for specialist
skills in using these programs effectively, both through the
appropriate choice of controlling parameter settings and in
evaluating the significance of the results. This expert
knowledge can only be acquired through repeated use of the tools,
often comparing and combining results from several
methods. Again, a researcher interested only in a small
number of proteins may not have this experience.
If a group is interested in analysing a great number of
uncharacterized sequences, as from the large-scale sequencing
projects, then installation of the programs and databases and
investment in the necessary expertise are worthwhile, indeed
essential. However, a third problem arises: applying the
methods and evaluating the results for a
large number of sequences require a considerable amount of
computer and human expert time, as well as tight quality
control to ensure uniform application and interpretation.
Moreover, methods and databases improve over time and
frequent re-analysis may bring new results.
The first GeneQuiz system (Scharf et al., 1994; Casari et al., 1996)
was developed as a partial solution to these three problems,
addressing (i) flexible installation and maintenance of a set
of methods and databases, (ii) the need for expertise in the
use and evaluation of the methods and (iii) fast and uniform
analysis of the results.
GeneQuiz is a semi-automated protein sequence analysis
system, the principal purpose of which is to infer a specific
and reliable functional assignment together with a broad
cellular role for a query protein by analysis of annotations
from sequence database matches. The system also applies a
selected suite of analysis tools to the query sequence,
integrating the results into a coherent display to complement the
functional assignments.
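
To make the notion of inferring a function from the annotations of database matches more concrete, the following is a minimal sketch under stated assumptions. The list of uninformative words, the E-value cut-offs and the label "tentative" are hypothetical and are not taken from the GeneQuiz implementation; only the label "clear" for the most reliable cases appears in the text above.

```python
from typing import List, Optional, Tuple

# Hypothetical vocabulary of words marking an annotation as uninformative.
UNINFORMATIVE_WORDS = {"hypothetical", "unknown", "uncharacterized", "unnamed"}


def is_informative(description: str) -> bool:
    """True if the database description appears to name a real function."""
    words = [w.strip(",.;()").lower() for w in description.split()]
    return bool(words) and not any(w in UNINFORMATIVE_WORDS for w in words)


def assign_function(matches: List[Tuple[str, float]],
                    clear_evalue: float = 1e-10,      # illustrative cut-off
                    tentative_evalue: float = 1e-3    # illustrative cut-off
                    ) -> Tuple[Optional[str], str]:
    """Pick a function and a reliability label from (description, E-value) matches."""
    informative = [(d, e) for d, e in matches if is_informative(d)]
    if not informative:
        return None, "none"
    # The most significant informative match supplies the candidate function.
    description, evalue = min(informative, key=lambda m: m[1])
    if evalue <= clear_evalue:
        return description, "clear"
    if evalue <= tentative_evalue:
        return description, "tentative"
    return None, "none"
```

The sketch skips a stronger but uninformative match (for example a "hypothetical protein") in favour of a weaker match that carries a usable description, which is the kind of decision a lexical analysis of annotations is meant to automate.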
The GeneQuiz system is able to process large numbers of
sequences quickly and repeatably in a consistent manner, and
makes use of regularly updated combined sequence
databases. Thus, the system can be used for occasional analyses
of a few query protein sequences, or it can be systematically
applied to the large numbers of open reading frames (ORFs)
identified in a genome sequencing project.
A high degree of automation is required to cope with the
analysis of the huge number of sequences generated by
genome sequencing projects, and to ensure consistent and
reprod (...truncated)