The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology

https://nar.oxfordjournals.org/content/32/suppl_1/D262.full.pdf

Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, Rodrigo Lopez, Rolf Apweiler

European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60 000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments.
In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. Researchers wishing to query or contribute to the GOA project are encouraged to contact the GOA team by email.

The UniProt Knowledgebase (1) [which includes Swiss-Prot (2), TrEMBL (2) and PIR-PSD (3)] is the world's most highly annotated protein sequence database, having archived and annotated more than a million proteins through a combination of manual and electronic techniques. Over the next few years, it is estimated that this figure will increase to over 4 million proteins, the majority of which will lack biochemical and functional characterization. To prepare for this increase, those involved in bioinformatics have responded by developing new protocols for the capture, sharing and analysis of the functional annotation of the various data sets held. One way to maximize the annotation of these data while safeguarding quality is to draw on the expertise of in-house and specialist community resources. Successful integration relies on each database using the same language to characterize proteins and distributing its data in parsable formats.

In this regard, one of the most important and well-used ontologies within the bioinformatics community is the Gene Ontology (GO) (4). GO is a dynamic controlled vocabulary of over 16 000 terms used to describe the molecular function, process and location of action of a protein in a generic cell. The success of GO is based largely on its open-source approach and on the involvement, throughout its development, of various biological communities rich in expertise. Now, 7 years on, the hype around GO has not lessened. The GO Consortium continues to develop strategies to improve the GO data set by staying abreast of future opportunities for integration with other useful open biological ontologies (OBOs) (5).
In support of standardized nomenclature, the UniProt group became a member of the GO Consortium annotation effort in 2001. It initiated the Gene Ontology Annotation (GOA) project (6,7) to provide assignments of GO terms to all well-characterized proteins and, in particular, to those of the human proteome. The initial aims and objectives of the GOA project have already been achieved. GOA has organized, shared and integrated protein knowledge using the GO structured vocabulary. This was facilitated by UniProt curation, which led to the successful mapping to GO from existing references, resources and publications. Initially, GOA focused on producing GO annotations and improving update cycles. More recently, GOA has successfully supplemented the GOA-SPTr data set with annotations from the GO Consortium initiators: the Mouse Genome Database (MGD) (8), FlyBase (9) and the Saccharomyces Genome Database (SGD) (10). In 2004, the GOA group will report on the manual evaluation of electronically extracted GO terms from literature as part of the BioCreative competition. It is also hoped that the information in GOA will accelerate the discovery of proteins of pharmaceutical interest.

GO ANNOTATION PROCESS

High-quality GO annotations (GOA) are generated through a combination of electronic and manual techniques, the latter of which employs a team of skilled biologists.

ELECTRONIC GO ANNOTATION

The large-scale assignment of GO terms to UniProt entries has been made possible by successfully converting a proportion of the pre-existing knowledge held within the flat files into GO terms (7). For example, UniProt description lines (DE) may contain Enzyme Commission (EC) numbers. Using an existing mapping of EC numbers to the GO molecular function ontology (ec2go) and a mapping of protein accession numbers to EC numbers, GOA can produce a UniProt to GO association. In a similar fashion, the GOA group maintains a Swiss-Prot keyword to GO mapping (spkw2go).
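
The EC-based transfer described above amounts to joining two mappings: protein accession to EC number, and EC number to GO term. The following is a minimal sketch of that join, not the GOA pipeline itself. It assumes mapping lines of the form `EC:<number> > GO:<term name> ; GO:<id>`; the function names, the sample mapping line and the accessions (including the made-up `Q00000`) are illustrative assumptions rather than data taken from a GOA release.

```python
import re
from collections import defaultdict

GO_ID = re.compile(r"GO:\d{7}")


def parse_external2go(lines):
    """Parse mapping lines into {external_id: {go_id, ...}}.

    Assumed line layout (an assumption of this sketch):
        EC:1.1.1.1 > GO:term name ; GO:0004022
    Lines starting with '!' are treated as comments.
    """
    mapping = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        external_id, arrow, target = line.partition(" > ")
        _name, semicolon, go_id = target.rpartition(";")
        go_id = go_id.strip()
        # Skip anything that does not match the assumed layout.
        if arrow and semicolon and GO_ID.fullmatch(go_id):
            mapping[external_id.strip()].add(go_id)
    return mapping


def transfer_annotations(protein_to_external, external_to_go):
    """Join protein -> external id with external id -> GO id.

    Returns sorted (accession, go_id, source_id) tuples; proteins whose
    external ids have no GO mapping yield no association.
    """
    associations = set()
    for accession, external_ids in protein_to_external.items():
        for external_id in external_ids:
            for go_id in external_to_go.get(external_id, ()):
                associations.add((accession, go_id, external_id))
    return sorted(associations)


# Illustrative inputs only (hypothetical, not from a GOA release).
EC2GO_SAMPLE = [
    "! sample mapping line for illustration",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
]
PROTEIN_TO_EC = {
    "P07327": ["EC:1.1.1.1"],   # EC number as it might appear in a DE line
    "Q00000": ["EC:9.9.9.9"],   # made-up accession with an unmapped EC number
}

ec2go = parse_external2go(EC2GO_SAMPLE)
associations = transfer_annotations(PROTEIN_TO_EC, ec2go)
for accession, go_id, source in associations:
    print(f"{accession}\t{go_id}\t{source}")
```

Keeping the join separate from the parser means the same `transfer_annotations` step could, under the same assumptions, be reused with any other external-identifier-to-GO mapping, such as a keyword-based one.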
The spkw2go mapping file is routinely used to generate a large number of annotations to the GO process, function and component ontologies (see contents of current release on the GOA home page).

Bi-directional database cross-references also help to integrate GO annotations. For example, the majority of UniProt entries cross-reference an InterPro identification number and vice versa. InterPro is a key database maintained at the EBI (11,12). It provides an integrated documentation resource for proteins, families and domains. A single InterPro entry provides comprehensive annotation describing a set of related proteins, some of which may have identical functions, be involved in the same processes and act in the same locations. During the curation of each InterPro entry, high-level GO terms are manually assigned, based on a review of the literature available on the related proteins. This annotation is used to generate an InterPro2go mapping and also serves as a biological summary in the InterPro entry. So far, the application of the InterPro2go mapping in the electronic assignment of GO terms to gene products (...truncated)