Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction

Article PDF: http://www.biomedcentral.com/content/pdf/1471-2105-14-285.pdf

Daniela Stojanova (1, 3, 4), Michelangelo Ceci (0, 2), Donato Malerba (0, 2), Saso Dzeroski (1, 3, 4, 5)

0, 2: Dipartimento di Informatica, Universita degli Studi di Bari Aldo Moro, via Orabona 4, Bari, Italy
1, 4: Department of Knowledge Technologies, Jozef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
3: Jozef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
5: Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Jamova 39, 1000 Ljubljana, Slovenia

Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers.
Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function.

Conclusions: Our newly developed method for HMC takes into account network information in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks. The best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.

In the era of high-throughput computational biology, discovering the biological functions of the genes/proteins within an organism is a central goal. Many studies have applied machine learning to infer functional properties of proteins, or directly predict one or more functions for unknown proteins [1-3]. The prediction of multiple biological functions with a single model, by using learning methods for multi-label prediction, has made considerable progress in recent years [3]. A major step forward is the learning of models which take into account the possible structural relationships among functional classes [4,5].
This is motivated by the presence of ontologies and catalogs such as Gene Ontology (GO) [6] and MIPS-FUN (FUN henceforth) [7], which are organized hierarchically (possibly in the form of Directed Acyclic Graphs (DAGs), where classes may have multiple parents), with general functions including other, more specific functions (see Figure 1(a)). In this context, the hierarchical constraint must be observed: a gene annotated with a function must also be annotated with all the ancestor functions from the hierarchy. To tackle this problem, hierarchical multi-label classifiers that are able to take the hierarchical organization of the classes into account during both the learning and the prediction phase have recently been used [8].

The topic of using protein-protein interaction (PPI) networks in the identification and prediction of protein functions has attracted increasing attention in recent years. The motivation for this stream of research is best summarized by the statement that when two proteins are found to interact in a high throughput assay, we also tend to use this as evidence of functional linkage [5]. As a confirmation, numerous studies have demonstrated the guilt-by-association (GBA) principle, which states that proteins sharing similar functional annotations tend to interact more frequently than proteins which do not share them. Interactions reflect the relation or dependence between proteins. In the context of networks of such interactions, gene functions show some form of autocorrelation [9]. While correlation denotes any statistical relationship between two different variables (properties) of the same objects (in a collection of independently selected objects), autocorrelation denotes the statistical relationship between values of the same variable (e.g., protein function) on different but related (dependent) objects (e.g., interacting proteins).
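The two notions described above, the hierarchical constraint and autocorrelation over a network, can be made concrete with a small sketch. It is illustrative only and rests on several assumptions: the class codes, gene names and edges are invented toy data, the helper names `close_under_ancestors` and `morans_i` are hypothetical, and global Moran's I for a single binary function is used as a standard stand-in measure. It is not the HMC-specific definition of autocorrelation that this work introduces.

```python
def close_under_ancestors(labels, parents):
    """Return the labels plus all their ancestors in the class hierarchy.

    `parents` maps a class to its parent classes (several parents are
    allowed, so the hierarchy may be a tree or a DAG).
    """
    closed = set()
    stack = list(labels)
    while stack:
        cls = stack.pop()
        if cls not in closed:
            closed.add(cls)
            stack.extend(parents.get(cls, ()))
    return closed


def morans_i(values, edges):
    """Global Moran's I of one variable on an undirected, unweighted graph.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    with w_ij = 1 for connected nodes and W the sum of all weights.
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {node: x - mean for node, x in values.items()}
    denom = sum(d * d for d in dev.values())
    if denom == 0 or not edges:
        return 0.0  # constant variable or empty graph: no signal to measure
    # Each undirected edge counts twice (w_ij = w_ji = 1).
    numer = 2 * sum(dev[a] * dev[b] for a, b in edges)
    total_weight = 2 * len(edges)
    return (n / total_weight) * (numer / denom)


# Toy hierarchy (hypothetical codes): class "1.1.2" has parent "1.1",
# which in turn has parent "1".
toy_parents = {"1.1": ["1"], "1.1.2": ["1.1"], "2.1": ["2"]}
annotation = close_under_ancestors({"1.1.2"}, toy_parents)

# Toy PPI network: two triangles joined by one edge. Genes g1-g3 carry
# the function (1), genes g4-g6 do not (0).
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"),
             ("g4", "g5"), ("g5", "g6"), ("g4", "g6"),
             ("g3", "g4")]
toy_function = {"g1": 1, "g2": 1, "g3": 1, "g4": 0, "g5": 0, "g6": 0}
clustered = morans_i(toy_function, toy_edges)  # 5/7, about 0.714

# A 4-cycle with alternating labels: every neighbor disagrees.
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
cycle_function = {"a": 1, "b": 0, "c": 1, "d": 0}
alternating = morans_i(cycle_function, cycle_edges)  # -1.0
```

In the toy network, interacting genes mostly share the annotation, so the measure is clearly positive, which is the pattern the guilt-by-association principle predicts. The alternating cycle gives the opposite extreme. Applying the ancestor closure before computing any such measure keeps the annotations consistent with the hierarchical constraint.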
Although autocorrelation has never been investigated in the context of Hierarchical Multi-label Classification (HMC), it is not a new phenomenon in protein studies. For example, it has been used for predicting protein properties using sequence-derived structural and physicochemical features of protein sequences [10]. In this work, we introduce a definition of autocorrelation for the case of HMC and propose a method that leverages it to improve the accuracy of gene function prediction.

Figure 1 Example of a hierarchy. (a) A part of the FUN hierarchy [7]. (b) An example of input data: the FUN class hierarchy of an example and the corresponding class vector and attribute set. (c) An example of a predictive clustering tree for HMC. The internal nodes contain tests on attribute values, and the leaves contain vectors of probabilities associated with the class values.

Motivation and contributions

The method developed in this work, named NHMC, addresses the task of hierarchical multi-label classification where, in addition to attributes describing the genes, such as microarray-derived expression values, phenotype and sequence data, the network autocorrelation of the class values (gene functions) is also considered. The main goal is gene function prediction in the context of gene interaction networks, where network autocorrelation (...truncated)