Inferring gene ontologies from pairwise similarity data

BIOINFORMATICS Vol. 30 ISMB 2014, pages i34–i42
doi:10.1093/bioinformatics/btu282

Michael Kramer1, Janusz Dutkowski1, Michael Yu1, Vineet Bafna2 and Trey Ideker1,*
1Department of Medicine and 2Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA

*To whom correspondence should be addressed.

ABSTRACT

Motivation: While the manually curated Gene Ontology (GO) is widely used, inferring a GO directly from -omics data is a compelling new problem. Recognizing that ontologies are a directed acyclic graph (DAG) of terms and hierarchical relations, algorithms are needed that: (1) analyze a full matrix of gene–gene pairwise similarities from -omics data; (2) infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and (3) respect biological pleiotropy, by which a term in the hierarchy can relate to multiple higher level terms. Methods addressing these requirements are just beginning to emerge; none has been evaluated for GO inference.

Methods: We consider two algorithms [Clique Extracted Ontology (CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive thresholding of a similarity matrix. We evaluate each method’s ability to reconstruct the GO biological process ontology from a similarity matrix based on (a) semantic similarities for GO itself or (b) three -omics datasets for yeast.

Results: For task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches (<20% precision, <20% recall). For task (b) using -omics data, CliXO outperforms other methods using two -omics datasets and achieves 30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or standard clustering (20–25% precision, recall).

Conclusion: This study provides an algorithmic foundation for building gene ontologies by capturing hierarchical and pleiotropic structure embedded in biomolecular data.

Contact:

© The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact

1 INTRODUCTION

Ontologies have proven very useful for capturing and organizing knowledge as a hierarchical set of terms and their interrelationships. In biology, one of the most successful and widely used ontologies is from the Gene Ontology (GO) project, a major effort to represent gene functions in cellular level processes across organisms (Ashburner et al., 2000; Gene Ontology Consortium, 2001). GO is ‘the default source of functional annotations for virtually every experimental system and the gold standard for measuring the success of bioinformatic methods’ (Dolinski and Botstein, 2013). It is extensively used by researchers in a wide variety of situations, such as understanding the function of genes discovered in a Genome Wide Association Study (Holmans et al., 2009; Wang et al., 2010) or computationally predicting functions for uncharacterized genes (Pena-Castillo et al., 2008; Yan et al., 2010). An important feature of GO is that the ontology structure is constructed by a diverse team of scientists according to their best abilities to curate the published scientific literature.
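As a rough illustration of the idea summarized in the abstract (maximal cliques in a network induced by progressive thresholding of a similarity matrix), the following minimal sketch may help. It is not the CliXO algorithm or the authors' implementation: the function names, the toy similarity values and the choice of cutoffs are all hypothetical, and the clique search is a plain Bron-Kerbosch enumeration chosen only for brevity.

```python
from itertools import combinations


def threshold_graph(sim, genes, cutoff):
    """Adjacency sets linking gene pairs whose similarity is >= cutoff."""
    adj = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if sim[frozenset((a, b))] >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def cliques_by_threshold(sim, genes, cutoffs):
    """Map each multi-gene maximal clique to the highest cutoff at which it appears."""
    terms = {}
    for cutoff in sorted(cutoffs, reverse=True):
        for clique in maximal_cliques(threshold_graph(sim, genes, cutoff)):
            if len(clique) > 1 and clique not in terms:
                terms[clique] = cutoff
    return terms


# Hypothetical toy similarities for four genes (illustrative values only).
toy_sim = {
    frozenset("AB"): 0.9, frozenset("CD"): 0.9,
    frozenset("AC"): 0.6, frozenset("BC"): 0.6,
    frozenset("AD"): 0.2, frozenset("BD"): 0.2,
}
toy_terms = cliques_by_threshold(toy_sim, list("ABCD"), [0.9, 0.6, 0.2])
for clique, cutoff in toy_terms.items():
    print(sorted(clique), cutoff)
```

In this toy case, gene C ends up in two distinct candidate terms ({A, B, C} and {C, D}), which is the kind of overlap that a strictly tree-shaped clustering cannot represent.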
As the amount of cell biological literature increases, however, curating the ontology structure has become a painstaking effort that is proving difficult to scale up and systematize (Alterovitz et al., 2010). Moreover, human curation necessarily favors biological entities that have been well studied and misses the large proportion of cell biology that is not yet known or has not yet been curated. For these reasons, it is not possible to directly learn about an uncharacterized gene or discover a new function using GO, and one cannot quickly assemble an ontology model for a new organism, let alone a specific cell type or disease state. Recently, it has been shown by some of us that a GO can be inferred directly from molecular data as a complement to further curation efforts (www.nexontology.org) (Dutkowski et al., 2013). For ontology curators, this approach ‘is extremely valuable in three ways. First . . . it finds connections missed by curators. Second, it will save huge amounts of curation time by pointing curators to the data that matter. Third, it provides a quality-control check on the GO that is unbiased by the vagaries of publication policies, as it is based only on the data themselves’ (Dolinski and Botstein, 2013). Furthermore, the ability to rapidly generate ontologies from data opens up new possibilities for the use of ontologies in general. ‘For example, data-driven ontologies generated from diseased and normal samples could be compared. This would be a novel way to look at what goes awry in particular disease states, providing the context and perspective of complex, interrelated biological processes’ (Dolinski and Botstein, 2013). Such an ontology model may also serve as the basis for an intelligent, predictive agent, as one of us has described elsewhere (Carvunis and Ideker, 2014).
Despite these possibilities, inspired by an initial attempt (Dutkowski et al., 2013), it remains an open question how best to infer an ontology algorithmically from molecular data. To understand the challenges involved in inferring an ontology from data, we first must recognize that ontologies contain both syntactic information (terms and their structural relations) and semantic information (relations between terms have defined meanings; in GO these include ‘is a’, ‘part of’ and ‘regulates’ relations). Both our previous work and this work focus on inferring the syntactic information: the ontology terms, their relations and the annotations of genes to terms. This syntactic information is the information most commonly used by biologists using GO as a gold standard.

[...] and Michener, 1958; Sørensen, 1948; Ward, 1963). These methods, however, rely on iterative joining of pairs of terms, resulting in forced construction of a binary tree. Clusters cannot overlap (i.e. have multiple parents for a single node) or have >2 children, and the number of clusters inferred is fixed at n − 1, where n is the number of terminal nodes. There have recently been a handful of algorithms which construct hierarchies with overlapping clusters, by creating a first level of overlapping clusters with terminal nodes and then combining these base clusters into higher level clusters (Becker et al., 2 (...truncated)