The Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with Gene Ontology
Evelyn Camon
Michele Magrane
Daniel Barrell
Vivian Lee
Emily Dimmer
John Maslen
David Binns
Nicola Harte
Rodrigo Lopez
Rolf Apweiler
European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60 000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP sites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. Researchers wishing to query or contribute to the GOA project are encouraged to contact the GOA team by email.
The UniProt Knowledgebase (1) [which includes Swiss-Prot
(2), TrEMBL (2) and PIR-PSD (3)] is the world's most highly
annotated protein sequence database, having archived and
annotated more than a million proteins through a combination
of manual and electronic techniques. Over the next few years,
it is estimated that this figure will increase to over 4 million
proteins, the majority of which will lack biochemical and
functional characterization. To prepare for this increase, those
involved in bioinformatics have responded by developing new
protocols for the capture, sharing and analysis of the
functional annotation of the various data sets held. One way
to maximize the annotation of these data while safeguarding
quality is to draw on the expertise of in-house and specialist
community resources.
Successful integration relies on each database using the
same language to characterize proteins and distributing data in
parsable formats. In this regard, one of the most important and
well-used ontologies within the bioinformatics community is
the Gene Ontology (GO) (4). GO is a dynamic controlled
vocabulary of over 16 000 terms used to describe molecular
function, process and location of action of a protein in a
generic cell. The success of GO is based largely on its
open-source approach and on the involvement of various
biological communities rich in expertise throughout its development. Now,
7 years on, the hype around GO has not lessened. The GO
Consortium continues to develop strategies to improve the GO
data set by staying abreast of future opportunities for
integration with other useful open biological ontologies
(OBOs) (5).
In support of standardized nomenclature, the UniProt group
became a member of the GO Consortium annotation effort in
2001. It initiated the Gene Ontology Annotation (GOA)
project (6,7) to provide assignments of GO terms to all
well-characterized proteins, and in particular to those of the human
proteome.
The initial aims and objectives of the GOA project have
already been achieved. GOA has organized, shared and
integrated protein knowledge using the GO structured
vocabulary. This was facilitated with the help of UniProt
curation, which led to the successful mapping to GO from
existing references, resources and publications. Initially, GOA
focused on producing GO annotations and improving update
cycles. More recently, GOA has successfully supplemented
the GOA-SPTr data set with annotations from GO Consortium
initiators, Mouse Genome Database (MGD) (8), FlyBase (9)
and Saccharomyces Genome Database (SGD) (10). In 2004,
the GOA group will report on the manual evaluation of
electronically extracted GO terms from literature as part of the
BioCreative competition. It is also hoped that the information
in GOA will accelerate the discovery of proteins of
pharmaceutical interest.
GO ANNOTATION PROCESS
High-quality GO annotations (GOA) are generated through a
combination of electronic and manual techniques, the latter of
which employs a team of skilled biologists.
ELECTRONIC GO ANNOTATION
The large-scale assignment of GO terms to UniProt entries has
been made possible by successfully converting a proportion of
the pre-existing knowledge held within the flat files into GO
terms (7). For example, UniProt description lines (DE) may
contain Enzyme Commission (EC) numbers. Using an
existing mapping of EC numbers to the GO molecular function
ontology (ec2go) and a mapping of protein accession numbers
to EC numbers, GOA can produce a UniProt to GO
association. In a similar fashion, the GOA group maintains a
Swiss-Prot keyword to GO mapping (spkw2go). This mapping
file is routinely used to generate a large number of annotations
to the GO process, function and component ontologies (see
contents of current release on the GOA home page).
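The mapping-based assignment described above amounts to joining two lookup tables: one from protein accessions to an intermediate key (an EC number or a Swiss-Prot keyword) and one from that key to GO terms. The following is a minimal sketch of that join, not the GOA pipeline itself. The function name compose_mappings, the accession strings and the table contents are hypothetical and chosen only for illustration; the real ec2go and spkw2go files have their own formats, which are not reproduced here.

```python
from collections import defaultdict


def compose_mappings(protein_to_keys, key_to_go):
    """Join a protein -> key table with a key -> GO table.

    protein_to_keys: dict mapping an accession to a list of keys
                     (for example EC numbers or keywords).
    key_to_go:       dict mapping a key to a list of GO identifiers.
    Returns a dict mapping each accession that received at least one
    GO term to a sorted list of unique GO identifiers.
    """
    annotations = defaultdict(set)
    for accession, keys in protein_to_keys.items():
        for key in keys:
            # Keys absent from the mapping simply yield no annotation.
            for go_id in key_to_go.get(key, ()):
                annotations[accession].add(go_id)
    return {acc: sorted(ids) for acc, ids in annotations.items()}


# Toy input (hypothetical accessions; identifiers are illustrative and
# are not read from the actual ec2go mapping file).
protein_to_ec = {
    "ACC_1": ["EC:1.1.1.1"],
    "ACC_2": ["EC:9.9.9.9"],  # no entry in the mapping below
}
ec_to_go = {"EC:1.1.1.1": ["GO:0004022"]}

ec_annotations = compose_mappings(protein_to_ec, ec_to_go)
```

The same function would serve a keyword-based mapping by passing a protein-to-keyword table and a keyword-to-GO table. Proteins whose keys have no GO equivalent are left unannotated rather than given a default term, which matches the idea that only a proportion of the existing knowledge can be converted.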
Bi-directional database cross-references also help to
integrate GO annotations. For example, the majority of UniProt
entries will cross-reference an InterPro identification number
and vice versa. InterPro is a key database maintained at the
EBI (11,12). It provides an integrated documentation resource
for proteins, families and domains. A single InterPro entry
provides comprehensive annotation describing a set of related
proteins, some of which may have identical functions, be
involved in the same processes and act in the same locations.
During the curation of each InterPro entry, high-level GO
terms are manually curated, based on a review of the literature
available on the related proteins. This annotation is used to
generate an InterPro2go mapping and also serves as a
biological summary in the InterPro entry. So far, the
application of the InterPro2go mapping in the electronic
assignment of GO terms to gene products (...truncated)