Inferring gene ontologies from pairwise similarity data
BIOINFORMATICS
Vol. 30 ISMB 2014, pages i34–i42
doi:10.1093/bioinformatics/btu282
Michael Kramer1, Janusz Dutkowski1, Michael Yu1, Vineet Bafna2 and Trey Ideker1,*
1Department of Medicine and 2Department of Computer Science and Engineering, University of California San Diego,
La Jolla, CA 92093, USA
ABSTRACT
Motivation: While the manually curated Gene Ontology (GO) is widely
used, inferring a GO directly from -omics data is a compelling new
problem. Recognizing that ontologies are a directed acyclic graph
(DAG) of terms and hierarchical relations, algorithms are needed that:
(1) analyze a full matrix of gene–gene pairwise similarities from
-omics data;
(2) infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and
(3) respect biological pleiotropy, by which a term in the hierarchy
can relate to multiple higher level terms.
Methods addressing these requirements are just beginning to
emerge; none has been evaluated for GO inference.
Methods: We consider two algorithms [Clique Extracted Ontology
(CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive
thresholding of a similarity matrix. We evaluate each method’s ability
to reconstruct the GO biological process ontology from a similarity
matrix based on (a) semantic similarities for GO itself or (b) three
-omics datasets for yeast.
Results: For task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches
(<20% precision, <20% recall). For task (b) using -omics data, CliXO
outperforms other methods using two -omics datasets and achieves
30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or
standard clustering (20–25% precision, recall).
Conclusion: This study provides an algorithmic foundation for building
gene ontologies by capturing hierarchical and pleiotropic structure
embedded in biomolecular data.
Contact:
*To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial
re-use, please contact
1 INTRODUCTION
Ontologies have proven very useful for capturing and organizing
knowledge as a hierarchical set of terms and their interrelationships. In biology, one of the most successful and widely used
ontologies is from the Gene Ontology (GO) project, a major
effort to represent gene functions in cellular level processes
across organisms (Ashburner et al., 2000; Gene Ontology
Consortium, 2001). GO is ‘the default source of functional annotations for virtually every experimental system and the gold
standard for measuring the success of bioinformatic methods’
(Dolinski and Botstein, 2013). It is extensively used by researchers in a wide variety of situations, such as understanding
the function of genes discovered in a Genome Wide Association
Study (Holmans et al., 2009; Wang et al., 2010) or computationally predicting functions for uncharacterized genes (Pena-Castillo
et al., 2008; Yan et al., 2010).
An important feature of GO is that the ontology structure is
constructed by a diverse team of scientists according to their best
abilities to curate the published scientific literature. As the
amount of cell biological literature increases, however, curating
the ontology structure has become a painstaking effort that is
proving difficult to scale up and systematize (Alterovitz et al.,
2010). Moreover, human curation necessarily favors biological
entities that have been well studied and misses the large proportion of cell biology that is not yet known or has not yet been
curated. For these reasons, it is not possible to directly learn
about an uncharacterized gene or discover a new function
using GO, and one cannot quickly assemble an ontology
model for a new organism, let alone a specific cell type or disease
state.
Recently, it has been shown by some of us that a GO can be
inferred directly from molecular data as a complement to further
curation efforts (www.nexontology.org) (Dutkowski et al., 2013).
For ontology curators, this approach ‘is extremely valuable in
three ways. First . . . it finds connections missed by curators.
Second, it will save huge amounts of curation time by pointing
curators to the data that matter. Third, it provides a quality-control check on the GO that is unbiased by the vagaries of
publication policies, as it is based only on the data themselves’
(Dolinski and Botstein, 2013). Furthermore, the ability to rapidly
generate ontologies from data opens up new possibilities for the
use of ontologies in general. ‘For example, data-driven ontologies generated from diseased and normal samples could be compared. This would be a novel way to look at what goes awry in
particular disease states, providing the context and perspective of
complex, interrelated biological processes’ (Dolinski and
Botstein, 2013). Such an ontology model may also serve as the
basis for an intelligent, predictive agent, as one of us has
described elsewhere (Carvunis and Ideker, 2014). Despite these
possibilities, inspired by an initial attempt (Dutkowski et al.,
2013), it remains an open question how best to infer an ontology
from molecular data algorithmically.
To understand the challenges involved in inferring an ontology
from data, we first must recognize that ontologies contain both
syntactic information (terms and their structural relations) as
well as semantic information (relations between terms have
defined meanings—in GO these include ‘is a’, ‘part of’ and ‘regulates’ relations). Both our previous work and this work will focus
on inferring the syntactic information—the ontology terms, their
relations and the annotations of genes to terms. This syntactic
information is what biologists most commonly draw on when using GO as a gold standard.
and Michener, 1958; Sørensen, 1948; Ward, 1963). These methods, however, rely on iterative joining of pairs of terms, resulting
in forced construction of a binary tree. Clusters cannot overlap
(i.e. have multiple parents for a single node) or have >2 children,
and the number of clusters inferred is fixed at n − 1, where n is the
number of terminal nodes.
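The constraint described above can be seen in a minimal sketch of pairwise agglomerative joining. This is an illustration only: the average-linkage rule, the function name `average_linkage_tree` and the four-gene similarity matrix `SIM` are assumptions chosen for the example, not the implementation or data used in this study.

```python
from itertools import combinations


def average_linkage_tree(sim):
    """Naive agglomerative clustering on a symmetric similarity matrix.

    Returns a list of merges (new_id, left_id, right_id). Leaves are
    numbered 0..n-1 and internal nodes are numbered from n upward.
    """
    n = len(sim)
    # Each active cluster maps its node id to the leaves it contains.
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n

    def avg_sim(a, b):
        # Mean similarity over all leaf pairs spanning the two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the single most similar pair of active clusters.
        a, b = max(combinations(sorted(clusters), 2),
                   key=lambda pair: avg_sim(*pair))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((next_id, a, b))
        next_id += 1
    return merges


# Hypothetical similarities for four genes (illustrative values only).
SIM = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

merges = average_linkage_tree(SIM)
for new_id, left, right in merges:
    print(f"cluster {new_id} = join({left}, {right})")
```

Because every step joins exactly two clusters, n leaves always yield n − 1 internal nodes, each with two children, and every node acquires exactly one parent. A gene that belongs to two processes therefore cannot be placed under both, whatever the data say.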
There have recently been a handful of algorithms which construct hierarchies with overlapping clusters, by creating a first
level of overlapping clusters with terminal nodes and then combining these base clusters into higher level clusters (Becker et al.,
2 (...truncated)