Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage of Genes and Biological Terms
Fontanillo C, Nogales-Cadenas R, Pascual-Montano A, De Las Rivas J (2011) Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage
of Genes and Biological Terms. PLoS ONE 6(9): e24289. doi:10.1371/journal.pone.0024289
Celia Fontanillo
Ruben Nogales-Cadenas
Alberto Pascual-Montano
Javier De Las Rivas
Debashish Bhattacharya, Rutgers University, United States of America
1 Cancer Research Center (CiC-IBMCC, CSIC/USAL), Campus Miguel de Unamuno, Salamanca, Spain, 2 National Center of Biotechnology (CNB, CSIC), Campus de Cantoblanco UAM, Madrid, Spain
Functional analysis of large sets of genes and proteins is becoming increasingly necessary as experimental biomolecular data accumulate at omic scale. Enrichment analysis is by far the most popular methodology for deriving functional implications from sets of cooperating genes. The problem with these techniques lies in the redundancy of the resulting information, which in most cases generates many trivial results that risk masking key biological events. We present and describe a computational method, called GeneTerm Linker, that filters and links enriched output data, identifying sets of associated genes and terms and producing metagroups of coherent biological significance. The method uses fuzzy reciprocal linkage between genes and terms to unravel their functional convergence and associations. The algorithm is tested with a small set of well-known interacting proteins from yeast and with a large collection of reference sets from three heterogeneous resources: multiprotein complexes (CORUM), cellular pathways (SGD) and human diseases (OMIM). Statistical Precision, Recall and balanced F-score are calculated, showing robust results even when different levels of random noise are included in the test sets. Although we could not find an equivalent method, we present a comparative analysis with a widely used method that combines enrichment and functional annotation clustering. A web application for using the proposed method is provided at http://gtlinker.cnb.csic.es.
Funding: Dr. De Las Rivas receives financial support provided by EU FP7-HEALTH-2007-B (project 223411), by Spanish Ministry of Science and Innovation
MICINN-ISCiii (projects PI061153 and PS09/00843), and by the Regional Government, Junta de Castilla y Leon JCyL (project CSI07A09). Dr. Pascual-Montano receives
financial support provided by MICINN grant BIO2010-17527. Dr. Nogales-Cadenas thanks the Juan de la Cierva Program (MICINN-JDC 2010) and Dr. Fontanillo
thanks the CSIC JAE-PREDOC Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
These authors contributed equally to this work.
Genome- and proteome-wide analyses performed using
high-throughput techniques are providing many collections of genes
and proteins that are associated with studies performed on specific
sets of samples in defined biological contexts. One of the major
challenges of current computational biology is to provide robust
automatic methods for a meaningful functional annotation of the
long lists of genes or proteins derived from such high-throughput
studies. Functional enrichment analysis (EA) is at present the most
popular available methodology to derive functional implications of
sets of cooperating genes. It uses statistical testing to find
significant annotations in groups of genes. A recent review of
enrichment tools categorizes them into three major classes: singular
(SEA), modular (MEA) and gene-set (GSEA) [1]. Modular analysis
(MEA) can be considered a second generation of functional
enrichment since it uses concurrent gene annotation improving
coverage [2,3,4]. Gene set enrichment analysis (GSEA) has
become a popular tool to extract biological insight from complete
ranked gene lists without the need of pre-selecting top genes [5].
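The statistical testing behind singular enrichment analysis can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, which is a common choice for this purpose; the text above does not state which test any particular tool uses. The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population_size, annotated_in_population,
                                query_size, annotated_in_query):
    """One-sided hypergeometric p-value: probability of drawing at least
    `annotated_in_query` annotated genes when `query_size` genes are sampled
    without replacement from a population of `population_size` genes, of
    which `annotated_in_population` carry the annotation."""
    # The overlap cannot exceed either the query size or the annotated count.
    upper = min(annotated_in_population, query_size)
    total = comb(population_size, query_size)
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, query_size - k)
        for k in range(annotated_in_query, upper + 1)
    )
    return tail / total


# Toy example (hypothetical numbers): 10 genes in total, 4 annotated to a
# term, and a query of 3 genes that are all annotated to that term.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 3)
print(f"p-value = {p_value:.4f}")
```

In practice such a test is repeated for every annotation term, so the resulting p-values are usually corrected for multiple testing before terms are reported as enriched.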
Functional enrichment analysis, however, does not address
several key problems associated with biological annotations: (i)
Redundancy of the biological terms, which are repeated in many
different annotation resources (e.g. cell cycle GO:0007049, cell cycle
KEGG hsa04110, etc.) or split into very similar terms
with the same biological meaning (e.g. GO:0007049 cell cycle and
GO:0022402 cell cycle process). (ii) Bias in the annotation space due
to the highly frequent use of certain promiscuous terms that are
unspecific (e.g. GO:0050789 regulation of biological process includes
more than 44% of all human genes annotated to GO-BP). (iii)
Inadequate functional annotation of many well-known genes
(e.g. the human NRAS gene product P01111 is not annotated to
GO:0043410 positive regulation of MAPKKK cascade, but the role of
this gene in MAPK signaling is well known, since it is
paralogous to the gene HRAS, which has a central role in this
pathway).
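The annotation bias described in point (ii) can be made concrete with a small sketch that measures what fraction of all annotated genes each term covers and flags terms above a cutoff. This is only an illustration of the idea, not the filtering procedure of the method presented here; the function names, the toy term and gene labels, and the 0.4 cutoff are all hypothetical.

```python
def term_coverage(annotations):
    """Map each term to the fraction of all annotated genes it covers.

    `annotations` is a dict of term -> set of gene identifiers."""
    all_genes = set().union(*annotations.values()) if annotations else set()
    if not all_genes:
        return {}
    return {term: len(genes) / len(all_genes)
            for term, genes in annotations.items()}


def promiscuous_terms(annotations, max_fraction=0.4):
    """Return, sorted by name, the terms covering more than `max_fraction`
    of the annotated genes (the cutoff is an assumption for illustration)."""
    coverage = term_coverage(annotations)
    return sorted(term for term, frac in coverage.items()
                  if frac > max_fraction)


# Toy annotation space (hypothetical labels): 10 genes, 3 terms.
toy_annotations = {
    "broad_term": {"g1", "g2", "g3", "g4", "g5", "g6", "g10"},
    "specific_term_a": {"g1", "g2"},
    "specific_term_b": {"g7", "g8", "g9"},
}
print(term_coverage(toy_annotations))
print(promiscuous_terms(toy_annotations))
```

A term flagged in this way is not necessarily wrong, but it carries little discriminating information when it is reported as enriched.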
To overcome these limitations and challenges we have
developed a new computational method that finds significant
and coherent metagroups of genes and terms, performing several
steps to eliminate redundant and non-informative data. The
method takes the output of an enrichment analysis and produces a
simple result that includes genes and co-annotations associated in
metagroups. These metagroups are ranked by analysis of their
significance and coherence, as a way to find the most relevant
functions present in the query gene list. The algorithm is tested
with a small set of well-known interacting proteins and with a large
reference set of data from three heterogeneous resources:
mammalian multiprotein complexes (CORUM), yeast cellular
pathways (SGD) and human diseases (OMIM). Statistical Precision,
Recall and balanced F-score are calculated for each test, and we
observe robust results even when different percentages of
randomly selected genes are introduced in the queries. The computational
method can be applied to the output result of any enrichment
analysis. We provide a web application to use the method (http://gtlinker.cnb.csic.es)
that only needs a gene list as input, because in
a first step it runs an enrichment analysis tool [3] implemented
within the same workflow.
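The evaluation measures named above can be sketched with their standard set-based definitions, comparing the genes returned for a query against a reference set. How the retrieved gene set is derived from the metagroups is not specified in this section, so the function below simply assumes that both sets are given; its name and the toy gene labels are hypothetical.

```python
def precision_recall_f(retrieved, reference):
    """Return (precision, recall, balanced F-score) for a retrieved gene
    set compared against a reference gene set."""
    retrieved, reference = set(retrieved), set(reference)
    true_positives = len(retrieved & reference)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The balanced F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example (hypothetical labels): 3 of the 4 retrieved genes belong to
# a reference set of 5 genes.
precision, recall, f_score = precision_recall_f(
    {"A", "B", "C", "D"}, {"A", "B", "C", "E", "F"}
)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

Adding randomly selected genes to a query, as in the noise tests mentioned above, would be expected to lower precision if those genes are retrieved, while leaving recall unaffected.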
Analysis of the distributions of terms/genes in different Annotation Spaces
Functional annotation and enrichment analysis rely on the use
of biological databases that include groups of genes associated with
specific biological functions, such as metabolic and signaling
pathways, cellular processes and apparatus, organisms, etc. Some
of the biological databases most used in functional profiling are
GO (repository of gene and gene product ontological attributes
across species) [6], KEGG (atlas of biological pathways) [7] and
UniProt (catalog of structural and functional information on
proteins) [8]. In these databases the functions are annotated with
specific terms that define and (...truncated)