GIFtS: annotation landscape analysis with GeneCards
BMC Bioinformatics
BioMed Central
Research article
Open Access
Arye Harel*1, Aron Inger1, Gil Stelzer1, Liora Strichman-Almashanu1,
Irina Dalah1, Marilyn Safran1,2 and Doron Lancet1
Address: 1Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 76100, Israel and 2Department of Biological Services
(Bioinformatics Unit), Weizmann Institute of Science, Rehovot 76100, Israel
* Corresponding author
Published: 23 October 2009
BMC Bioinformatics 2009, 10:348
doi:10.1186/1471-2105-10-348
Received: 22 February 2009
Accepted: 23 October 2009
This article is available from: http://www.biomedcentral.com/1471-2105/10/348
© 2009 Harel et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Gene annotation is a pivotal component in computational genomics, encompassing
prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative
measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a
gene-centric compendium of rich annotative information for over 50,000 human gene entries,
building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes,
publications and many more.
Results: We present the GeneCards Inferred Functionality Score (GIFtS) which allows a
quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity
of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates
browsing the human genome by searching for the annotation level of a specified gene, retrieving a
list of genes within a specified range of GIFtS value, obtaining random genes with a specific GIFtS
value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories.
The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into
two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS
peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors
provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also
provides measures which enable the evaluation of the databases that serve as GeneCards sources.
An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each
source, and the average GIFtS value of genes associated with that source. Three typical source
prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising
mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree
of accumulated knowledge for a given gene measured by GIFtS was correlated (for GIFtS>30) with
the number of publications for a gene, and with the seniority of this entry in the HGNC database.
Conclusion: GIFtS can be a valuable tool for computational procedures which analyze lists of large
sets of genes resulting from wet-lab or computational research. GIFtS may also assist the scientific
community with identification of groups of uncharacterized genes for diverse applications, such as
delineation of novel functions and charting unexplored areas of the human genome.
Background
In the quest for revealing the function of DNA sequences, scientists have used a variety of
approaches, from molecular techniques targeting specific genes, to systematic analyses of
thousands of functional units encompassed by the transcriptome, proteome, and metabolome.
This heterogeneous mass of knowledge is time-dependent, with new information constantly
arising from a variety of sources. Thus, a quantitative tool for assessing annotation depth is
important for directing ongoing research and for analyzing the emerging results. Efforts in this
field have included the Genome Annotation Scores (GAS) algorithm [1], which demonstrates a
quantitative methodology of assigning annotation scores at the whole genome level, the GO
Annotation Quality (GAQ) score, which gives a quantitative measure of GO annotations [2], and
the Gene Characterization Index (GCI), which scores the extent to which a gene's functionality
is described, based largely on the quantification of human perception, and applied only to
protein-encoding genes [3]. We now introduce the GeneCards Inferred Functionality Scores
(GIFtS) tool [4], which utilizes the wealth of gene annotation within GeneCards [5] to quantify
the degree of functional knowledge about >50,000 GeneCards entries. GeneCards is a
comprehensive gene-centric compendium of annotative information about human genes,
automatically mined from nearly 70 data sources [6-12]. Thus, GIFtS can provide quantitative
annotation estimates for a very large number of genes, and at a significant depth, made
possible by the exploitation of dozens of annotation resources.
Results
GIFtS definition and applications
We devised the GeneCards Inferred Functionality Score (GIFtS) which allows a quantitative
assessment of a gene's annotation status, with potential relevance to the degree of available
functional knowledge. A GIFtS value for a gene is defined as the number of GeneCards sources,
out of a total of 68 (see additional file 1: Table S1), that include information about this gene
(see Methods). Data sources have heterogeneous sizes, as estimated by the number of human
gene entries for which the source contains information (Fig. 1), with an average of
11,404 ± 10,970 entries per source. One of GeneCards' main aims is to incorporate overlapping
sources, and perform integration of data for different annotation fields. Considerable
attention is also directed to conflicts among sources, one clear example being the GeneLoc [8]
member of the GeneCards suite, which handles conflicts in genomic coordinates from Ensembl [13]
and NCBI [14]. The overlap problem is particularly relevant to data extracted from genome-wide
sources such as Entrez Gene, Ensembl, GO [15], UniProt [16] and InterPro [17], which are all
closely linked and share some of the information presented, which may introduce a degree of
redundancy.
[Figure 1 plot: source size (number of genes, logarithmic scale) versus source rank]
Figure 1
Source size. The number of human gene entries in each
one of the sources contributing to the GIFtS score. Sources
are shown by their rank according to their size (see additional file 1: Table S1). A (...truncated)