Integrating Information in Biological Ontologies and Molecular Networks to Infer Novel Terms

Article PDF: https://www.nature.com/articles/srep39237.pdf

Abstract

Currently, most terms and term-term relationships in Gene Ontology (GO) are defined manually, which creates cost, consistency and completeness issues. Recent studies have demonstrated the feasibility of inferring GO automatically from biological networks, which represents an important complementary approach to GO construction. These methods (NeXO and CliXO) are unsupervised, which means that 1) they cannot use the information contained in the existing GO, 2) the way they integrate biological networks may not optimize accuracy, and 3) they are not customized to infer the three different sub-ontologies of GO. Here we present a semi-supervised method called Unicorn that extends these previous methods to tackle the three problems. Unicorn uses a sub-tree of an existing GO sub-ontology as a training set to learn the parameters for integrating multiple networks. Cross-validation results show that Unicorn reliably inferred the left-out parts of each specific GO sub-ontology. In addition, when trained with an old version of GO together with biological networks, Unicorn successfully re-discovered some terms and term-term relationships present only in a new version of GO. Unicorn also successfully inferred some novel terms that were not contained in GO but have biological meanings well supported by the literature.

Availability: Source code of Unicorn is available at http://yiplab.cse.cuhk.edu.hk/unicorn/.

Introduction

Gene Ontology (GO) [1] is the most widely used biological ontology. It systematically summarizes current knowledge of gene products and their relationships across a wide range of species. GO contains standardized terms in three sub-categories, namely biological processes (BP), cellular components (CC), and molecular functions (MF). These terms are organized hierarchically in directed acyclic graphs (DAGs), which are tree-like structures that allow a node to have multiple parents, corresponding to the specialization of a term from multiple general terms.
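As a minimal sketch of the DAG structure just described, the Python snippet below stores each term's direct parents and collects all ancestors of a term. The term names and the `PARENTS` mapping are hypothetical and chosen only for illustration; they are not taken from GO.

```python
# Toy DAG: each term maps to the list of its direct parents.
# Term names are hypothetical placeholders, not real GO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # C specializes two more general terms
    "D": ["C"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


ancestors_of_d = ancestors("D")
ancestors_of_c = ancestors("C")
```

Because `C` has two parents, `D` inherits ancestors along both branches, which is what distinguishes a DAG from a plain tree.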
A gene can be annotated by multiple GO terms. If a gene is annotated by a GO term, it is also automatically annotated by all ancestors of that term. GO has been extensively used in various applications, such as assessing functional similarity of genes [2,3,4], predicting gene functions [5,6,7], and interpreting biological data [8,9,10].

Most of the term-term relationships in GO are defined manually, assisted by text mining of the literature. There are several limitations to this manual curation process. First, with the rapid expansion of biological knowledge, both the number and complexity of biological publications have become difficult to handle even with the help of text mining. Second, the same biological concept can be described in different ways in different publications, which makes it challenging for different curators to represent the concept in a consistent manner. Finally, there is considerably more research on a subset of well-studied genes and their relationships, leading to unbalanced levels of detail in different parts of GO.

One complementary approach to GO construction is to infer terms and term-term relationships automatically from biological networks. This approach is attractive given the large amount and variety of network data already available, and the relatively low cost of creating new networks and expanding existing ones using high-throughput experimental methods.

The feasibility of inferring GO automatically from biological networks has been recently demonstrated [11]. In that study, a method called Network-eXtracted Ontology (NeXO) was proposed to cluster genes hierarchically based on their connections in the networks and subsequently transform the resulting clustering tree into a DAG. Using four types of molecular networks as input, NeXO was able to recover around 40% of the terms in GO, based on an alignment of the terms in the NeXO and GO DAGs.
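To illustrate the general idea of clustering genes hierarchically from network connections, the sketch below uses plain average-linkage agglomerative clustering, where every merged cluster becomes a candidate term. This is a stand-in chosen for simplicity and is not the NeXO procedure itself, whose details are not given here; the gene names, similarity values, and function name are all hypothetical.

```python
from itertools import combinations


def average_linkage_terms(genes, sim):
    """Agglomerative clustering; returns merged clusters in merge order.

    `sim` maps frozenset({g1, g2}) to a similarity score.
    Pairs missing from `sim` are treated as similarity 0.
    """
    clusters = [frozenset([g]) for g in genes]
    terms = []

    def linkage(a, b):
        # Average pairwise similarity between the members of two clusters.
        total = sum(sim.get(frozenset((x, y)), 0.0) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        terms.append(merged)
    return terms


# Hypothetical toy network with four genes.
genes = ["g1", "g2", "g3", "g4"]
sim = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g3", "g4")): 0.8,
    frozenset(("g1", "g3")): 0.3,
    frozenset(("g2", "g3")): 0.2,
    frozenset(("g1", "g4")): 0.1,
    frozenset(("g2", "g4")): 0.1,
}
terms = average_linkage_terms(genes, sim)
```

On this toy input the two tightly connected pairs are merged first and then joined into a single root cluster, giving a small tree of nested gene sets. A tree of this kind has only one parent per cluster, so a further step is needed to obtain a DAG.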
Later, another method called Clique eXtracted Ontology (CliXO) was proposed to further improve the accuracy of the automatically constructed ontology [12]. This method identifies cliques of different sizes in an integrated biological network by progressively loosening the stringency for an edge to be drawn between two genes in the network. Each identified clique forms a term that annotates its constituent genes, and a new term becomes a parent of an existing term if the clique corresponding to the new term is a superset of the clique corresponding to the existing term. A major novelty of CliXO was its ability to use quantitative measures in the biological networks, such as the confidence score of the existence of an edge, in the ontology inference process. The best DAG constructed by CliXO achieved about 40% in both precision and recall when compared to the actual GO DAG.

These two studies clearly show that existing biological networks, though incomplete and noisy, contain useful information that can be used to automatically infer GO with reasonable accuracy. On the other hand, one limitation of both NeXO and CliXO is that they infer DAGs purely from the input network (either a single biological network or a network integrated from multiple biological networks), which implies that 1) they are unsupervised methods that cannot make use of the information contained in the existing GO, 2) the way the biological networks are integrated is not guaranteed to optimize the accuracy of ontology construction, and 3) given a fixed set of input networks, neither method can infer different DAGs specifically for the three different sub-ontologies of GO.

Here we extend these previous works by describing a semi-supervised method called Unicorn (Unification of Discordant Networks), which integrates multiple biological networks in a way tailored for inferring a particular sub-ontology of GO.
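For readers who want a concrete picture of the clique-based construction described above for CliXO, the following simplified sketch lowers an edge-weight threshold step by step, records each new maximal clique as a term, and links a term to its smallest proper supersets as parents. It is an illustration under stated assumptions and not the published algorithm, which is more elaborate; the function names, edge weights, and thresholds are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch); adj maps node -> neighbour set."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def clique_terms(weights, thresholds):
    """Return (terms, parents) from cliques found at progressively looser thresholds."""
    genes = sorted({g for pair in weights for g in pair})
    terms = []
    for t in sorted(thresholds, reverse=True):  # strictest threshold first
        adj = {g: set() for g in genes}
        for (a, b), w in weights.items():
            if w >= t:
                adj[a].add(b)
                adj[b].add(a)
        for clique in maximal_cliques(adj):
            if len(clique) >= 2 and clique not in terms:
                terms.append(clique)
    # A parent is a proper superset with no other term strictly in between.
    parents = {
        c: [p for p in terms
            if c < p and not any(c < q < p for q in terms)]
        for c in terms
    }
    return terms, parents


# Hypothetical edge confidence scores between four genes.
weights = {
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.6,
    ("g2", "g3"): 0.6,
    ("g3", "g4"): 0.3,
}
terms, parents = clique_terms(weights, thresholds=[0.8, 0.5, 0.2])
```

In this toy case the pair `{g1, g2}` appears at the strictest threshold and is later absorbed into the triangle `{g1, g2, g3}`, which becomes its parent term.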
The key idea is that each existing GO sub-ontology contains parts that are highly accurate and complete, which can be used as a training set to determine the best way to integrate biological networks for inferring the whole sub-ontology. The DAG inferred by Unicorn is then expected to supplement the parts of the sub-ontology that are less well constructed. Because the training data come from a particular sub-ontology, the way the biological networks are integrated is specific to that sub-ontology. Unicorn is semi-supervised because it considers both the training part of GO and the natural distribution of edge weights in the biological networks during data processing and integration.

One major challenge of integrating different biological networks is that their edge weights differ in both distribution and semantics, such as expression correlations in a co-expression network and similarity scores in a functional network. Unicorn uses a novel discretization procedure to turn edge weights into nominal values such that they are highly correlated with the gene-gene similarity values based on the training set of the GO sub-ontology. The resulting discretized values in the different networks can then be integrated easily. We tested Un (...truncated)