GIFtS: annotation landscape analysis with GeneCards

Oct 2009

Background: Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more.

Results: We present the GeneCards Inferred Functionality Score (GIFtS), which allows a quantitative assessment of a gene's annotation status by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS values, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provides measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each source and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a given gene, as measured by GIFtS, was correlated (for GIFtS>30) with the number of publications for the gene and with the seniority of its entry in the HGNC database.

Conclusion: GIFtS can be a valuable tool for computational procedures which analyze large lists of genes resulting from wet-lab or computational research. GIFtS may also assist the scientific community with identification of groups of uncharacterized genes for diverse applications, such as delineation of novel functions and charting unexplored areas of the human genome.



BMC Bioinformatics (BioMed Central)
Research article, Open Access

Arye Harel*1, Aron Inger, Gil Stelzer1, Liora Strichman-Almashanu1, Irina Dalah1, Marilyn Safran1,2 and Doron Lancet1

Address: 1 Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 76100, Israel and 2 Department of Biological Services (Bioinformatics Unit), Weizmann Institute of Science, Rehovot 76100, Israel

* Corresponding author

Received: 22 February 2009
Accepted: 23 October 2009
Published: 23 October 2009

BMC Bioinformatics 2009, 10:348 doi:10.1186/1471-2105-10-348

This article is available from: http://www.biomedcentral.com/1471-2105/10/348

© 2009 Harel et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background

In the quest for revealing the function of DNA sequences, scientists have used a variety of approaches, from molecular techniques targeting specific genes, to systematic analyses of thousands of functional units encompassed by the transcriptome, proteome, and metabolome. This heterogeneous mass of knowledge is time-dependent, with new information constantly arising from a variety of sources. Thus, a quantitative tool for assessing annotation depth is important for directing ongoing research and for analyzing the emerging results.

Efforts in this field have included the Genome Annotation Scores (GAS) algorithm [1], which demonstrates a quantitative methodology of assigning annotation scores at the whole genome level; the GO Annotation Quality (GAQ) score, which gives a quantitative measure of GO annotations [2]; and the Gene Characterization Index (GCI), which scores the extent to which a gene's functionality is described, based largely on the quantification of human perception, and applied only to protein-encoding genes [3]. We now introduce the GeneCards Inferred Functionality Scores (GIFtS) tool [4], which utilizes the wealth of gene annotation within GeneCards [5] to quantify the degree of functional knowledge about >50,000 GeneCards entries. GeneCards is a comprehensive gene-centric compendium of annotative information about human genes, automatically mined from nearly 70 data sources [6-12]. Thus, GIFtS can provide quantitative annotation estimates for a very large number of genes, and at a significant depth, made possible by the exploitation of dozens of annotation resources.

Results

GIFtS definition and applications

We devised the GeneCards Inferred Functionality Score (GIFtS), which allows a quantitative assessment of a gene's annotation status, with potential relevance to the degree of functional knowledge about that gene. A GIFtS value for a gene is defined as the number of GeneCards sources, out of a total of 68 (see additional file 1: Table S1), that include information about this gene (see Methods). Data sources have heterogeneous sizes, as estimated by the number of human gene entries for which the source contains information (Fig. 1), having an average of 11,404 ± 10,970 entries per source. One of GeneCards' main aims is to incorporate overlapping sources and to integrate data for different annotation fields. Considerable attention is also directed to conflicts among sources, one clear example being the GeneLoc [8] member of the GeneCards suite, which handles conflicts in genomic coordinates from Ensembl [13] and NCBI [14]. The overlap problem is particularly applicable to data extracted from genome-wide sources such as Entrez Gene, Ensembl, GO [15], UniProt [16] and InterPro [17], which are all closely linked and share some of the information presented; this may introduce a degree of redundancy.

Figure 1. Source size. The number of human gene entries in each one of the sources contributing to the GIFtS score. Sources are shown by their rank according to their size (see additional file 1: Table S1). Axes: source rank (x) versus source size in number of genes (y).

A (...truncated)
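To make the definition in the Results text concrete (a gene's GIFtS value as the number of sources that include information about it), here is a minimal Python sketch. It is an illustration under stated assumptions, not the GeneCards implementation: the source names, gene sets, and all function names are hypothetical, the toy data uses three sources rather than 68, and the paper's Methods may apply weighting or scaling that is not shown in this preview. The spread of source sizes is computed with the sample standard deviation, which is also an assumption about how the "±" value was obtained.

```python
from statistics import mean, stdev

# Hypothetical toy data: source name -> set of gene symbols it annotates.
# These names and memberships are invented for illustration only.
SOURCES = {
    "SourceA": {"TP53", "BRCA1", "GENE_X"},
    "SourceB": {"TP53", "BRCA1"},
    "SourceC": {"TP53"},
}


def gifts(gene, sources):
    """Count the sources that contain any information about `gene`."""
    return sum(1 for genes in sources.values() if gene in genes)


def all_gifts(sources):
    """Return {gene: score} for every gene seen in at least one source."""
    every_gene = set().union(*sources.values())
    return {g: gifts(g, sources) for g in every_gene}


def genes_in_range(scores, low, high):
    """Genes whose score lies in the inclusive range [low, high], sorted."""
    return sorted(g for g, s in scores.items() if low <= s <= high)


def source_size_stats(sources):
    """Mean and sample standard deviation of the number of genes per source."""
    sizes = [len(genes) for genes in sources.values()]
    return mean(sizes), stdev(sizes)


scores = all_gifts(SOURCES)
well_annotated = genes_in_range(scores, 2, 3)
avg_size, sd_size = source_size_stats(SOURCES)
```

With the toy data, `TP53` appears in all three sources and so receives the highest count, while `GENE_X` appears in only one. The range query mirrors, in spirit only, the tool feature of retrieving genes within a specified GIFtS range.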


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2105-10-348.pdf
Article home page: http://www.biomedcentral.com/1471-2105/10/348

Arye Harel, Aron Inger, Gil Stelzer, Liora Strichman-Almashanu, Irina Dalah, Marilyn Safran, Doron Lancet. GIFtS: annotation landscape analysis with GeneCards. BMC Bioinformatics 2009, 10:348. DOI: 10.1186/1471-2105-10-348