Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage of Genes and Biological Terms

Sep 2011

Functional analysis of large sets of genes and proteins is becoming increasingly necessary as experimental biomolecular data accumulate at omic scale. Enrichment analysis is by far the most popular methodology for deriving the functional implications of sets of cooperating genes. The problem with these techniques lies in the redundancy of the resulting information, which in most cases generates many trivial results that risk masking key biological events. We present and describe a computational method, called GeneTerm Linker, that filters and links enriched output data, identifying sets of associated genes and terms and producing metagroups of coherent biological significance. The method uses fuzzy reciprocal linkage between genes and terms to unravel their functional convergence and associations. The algorithm is tested with a small set of well-known interacting proteins from yeast and with a large collection of reference sets from three heterogeneous resources: multiprotein complexes (CORUM), cellular pathways (SGD) and human diseases (OMIM). Statistical Precision, Recall and balanced F-score are calculated, showing robust results even when different levels of random noise are included in the test sets. Although we could not find an equivalent method, we present a comparative analysis with a widely used method that combines enrichment and functional annotation clustering. A web application for using the proposed method is provided at http://gtlinker.cnb.csic.es.


Citation: Fontanillo C, Nogales-Cadenas R, Pascual-Montano A, De Las Rivas J (2011) Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage of Genes and Biological Terms. PLoS ONE 6(9): e24289. doi:10.1371/journal.pone.0024289

Authors: Celia Fontanillo, Ruben Nogales-Cadenas, Alberto Pascual-Montano, Javier De Las Rivas

Affiliations: 1 Cancer Research Center (CiC-IBMCC, CSIC/USAL), Campus Miguel de Unamuno, Salamanca, Spain; 2 National Center of Biotechnology (CNB, CSIC), Campus de Cantoblanco UAM, Madrid, Spain

Editor: Debashish Bhattacharya, Rutgers University, United States of America

Funding: Dr. De Las Rivas receives financial support provided by EU FP7-HEALTH-2007-B (project 223411), by the Spanish Ministry of Science and Innovation MICINN-ISCiii (projects PI061153 and PS09/00843), and by the Regional Government, Junta de Castilla y Leon JCyL (project CSI07A09). Dr. Pascual-Montano receives financial support provided by MICINN grant BIO2010-17527. Dr. Nogales-Cadenas thanks the Juan de la Cierva Program (MICINN-JDC 2010) and Dr. Fontanillo thanks the CSIC JAE-PREDOC Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

These authors contributed equally to this work.

Genome- and proteome-wide analyses performed using high-throughput techniques are providing many collections of genes and proteins that are associated with studies performed on specific sets of samples in defined biological contexts. One of the major challenges of current computational biology is to provide robust automatic methods for meaningful functional annotation of the long lists of genes or proteins derived from such high-throughput studies. Functional enrichment analysis (EA) is at present the most popular methodology for deriving the functional implications of sets of cooperating genes. It uses statistical testing to find significant annotations in groups of genes. A recent review of enrichment tools categorizes them into three major classes: singular (SEA), modular (MEA) and gene-set (GSEA) [1]. Modular analysis (MEA) can be considered a second generation of functional enrichment, since it uses concurrent gene annotations, which improves coverage [2,3,4].
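The statistical testing mentioned above can be illustrated with a minimal sketch of a singular enrichment test. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis but is not stated in the text to be the test used by any particular tool. The function names, the gene identifiers and the two toy terms are hypothetical and serve only as an illustration.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the query list, k: query genes annotated to the term.
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def enrich(query, annotations, background_size):
    """Return (term, hits, p-value) tuples sorted by increasing p-value.

    Assumes every query gene and every annotated gene belongs to the background.
    """
    query = set(query)
    results = []
    for term, genes in annotations.items():
        hits = len(query & set(genes))
        if hits == 0:
            continue  # terms with no query genes cannot be enriched
        p = hypergeom_pvalue(background_size, len(genes), len(query), hits)
        results.append((term, hits, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: 20 background genes, two made-up terms.
toy_annotations = {
    "term_A": {"g1", "g2", "g3", "g4"},
    "term_B": {f"g{i}" for i in range(1, 11)},
}
toy_query = ["g1", "g2", "g3", "g5"]

for term, hits, p in enrich(toy_query, toy_annotations, background_size=20):
    print(f"{term}\thits={hits}\tp={p:.4f}")
```

In practice the p-values from many terms would also need a multiple-testing correction, which this sketch omits.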
Gene set enrichment analysis (GSEA) has become a popular tool for extracting biological insight from complete ranked gene lists without the need to pre-select top genes [5]. Functional enrichment analysis, however, does not address several key problems associated with biological annotations: (i) Redundancy of the biological terms, which are repeated across many annotation resources (e.g. cell cycle GO:0007049, cell cycle KEGG hsa04110, etc.) or are split into very similar terms with the same biological meaning (e.g. GO:0007049 cell cycle and GO:0022402 cell cycle process). (ii) Bias in the annotation space due to the highly frequent use of certain promiscuous, unspecific terms (e.g. GO:0050789 regulation of biological process includes more than 44% of all human genes annotated to GO-BP). (iii) Inadequate functional annotation of many well-known genes (e.g. the NRAS human gene product P01111 is not annotated to GO:0043410 positive regulation of MAPKKK cascade, although the role of this gene in MAPK signaling is well known, since it is paralogous to the gene HRAS, which has a central role in that pathway).

To overcome these limitations we have developed a new computational method that finds significant and coherent metagroups of genes and terms, performing several steps to eliminate redundant and non-informative data. The method takes the output of an enrichment analysis and produces a simple result that includes genes and co-annotations associated in metagroups. These metagroups are ranked by their significance and coherence, as a way to find the most relevant functions present in the query gene list. The algorithm is tested with a small set of well-known interacting proteins and with a large reference set of data from three heterogeneous resources: mammalian multiprotein complexes (CORUM), yeast cellular pathways (SGD) and human diseases (OMIM).
Statistical Precision, Recall and balanced F-score are calculated for each test, and we observe robust results even when different percentages of randomly selected genes are introduced into the queries. The computational method can be applied to the output of any enrichment analysis. We provide a web application for the method (http://gtlinker.cnb.csic.es) that needs only a gene list as input, because in a first step it runs an enrichment analysis tool [3] implemented within the same workflow.

Analysis of the distributions of terms/genes in different Annotation Spaces

Functional annotation and enrichment analysis rely on biological databases that include groups of genes associated with specific biological functions, such as metabolic and signaling pathways, cellular processes and apparatus, organisms, etc. Some of the biological databases most used in functional profiling are: GO (repository of gene and gene product ontological attributes across species) [6], KEGG (atlas of biological pathways) [7] and UniProt (catalog of structural and functional information on proteins) [8]. In these databases the functions are annotated with specific terms that define and (...truncated)


Article home page: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0024289

Celia Fontanillo, Ruben Nogales-Cadenas, Alberto Pascual-Montano, Javier De Las Rivas. Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage of Genes and Biological Terms, 2011, Volume 6, Issue 9, DOI: 10.1371/journal.pone.0024289