Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction
Daniela Stojanova (1, 3, 4), Michelangelo Ceci (0, 2), Donato Malerba (0, 2), Saso Dzeroski (1, 3, 4, 5)
0 Dipartimento di Informatica, Universita degli Studi di Bari Aldo Moro, via Orabona 4, Bari, Italy
1 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
2 Dipartimento di Informatica, Universita degli Studi di Bari Aldo Moro, via Orabona 4, Bari, Italy
3 Jozef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
4 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
5 Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Jamova 39, 1000 Ljubljana, Slovenia
Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers.
Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets, using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function.
Conclusions: Our newly developed method for HMC takes network information into account in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks. The best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
In the era of high-throughput computational biology,
discovering the biological functions of the genes/proteins
within an organism is a central goal. Many studies have
applied machine learning to infer functional properties
of proteins, or directly predict one or more functions for
unknown proteins [1-3]. The prediction of multiple
biological functions with a single model, by using learning
methods for multi-label prediction, has made
considerable progress in recent years [3].
A major step forward is the learning of models which
take into account the possible structural relationships
among functional classes [4,5]. This is motivated by the
presence of ontologies and catalogs such as Gene
Ontology (GO) [6] and MIPS-FUN (FUN henceforth) [7], which
are organized hierarchically (possibly in the form of
Directed Acyclic Graphs (DAGs), where classes may have
multiple parents), such that general functions include
more specific functions (see Figure 1(a)). In this
context, the hierarchical constraint must be observed: a gene
annotated with a function must also be annotated with all the
ancestor functions of that function in the hierarchy. To tackle
this problem, hierarchical multi-label classifiers that are
able to take the hierarchical organization of the classes
into account during both the learning and the prediction
phase have recently been used [8].
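As an illustration of the hierarchical constraint described above, the following minimal Python sketch closes a set of annotations under the ancestor relation of a class hierarchy given as a DAG. The class names (`A`, `A.1`, `B`, `C`) and the `PARENTS` mapping are hypothetical and are not taken from GO or FUN; the sketch only shows the constraint itself, not any particular classifier.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy hierarchy (assumption, not from GO or FUN):
# each class maps to its set of parents. "C" has two parents,
# which makes the hierarchy a DAG rather than a tree.
PARENTS: Dict[str, Set[str]] = {
    "A": set(),
    "A.1": {"A"},
    "A.1.1": {"A.1"},
    "B": set(),
    "C": {"A.1", "B"},
}


def close_under_ancestors(labels: Iterable[str],
                          parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given labels plus every ancestor reachable via parent links."""
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node in closed:
            continue  # already visited through another path in the DAG
        closed.add(node)
        stack.extend(parents.get(node, ()))
    return closed


def satisfies_hierarchy_constraint(labels: Iterable[str],
                                   parents: Dict[str, Set[str]]) -> bool:
    """True if every annotated class also has all of its ancestors annotated."""
    label_set = set(labels)
    return close_under_ancestors(label_set, parents) == label_set


if __name__ == "__main__":
    print(sorted(close_under_ancestors({"C"}, PARENTS)))
    print(satisfies_hierarchy_constraint({"A.1"}, PARENTS))
```

In this toy hierarchy, a gene annotated only with `C` would also have to carry `A.1`, `A` and `B`, whereas the annotation set `{"A.1"}` alone violates the constraint because `A` is missing.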
The topic of using protein-protein interaction (PPI)
networks in the identification and prediction of protein
functions has attracted increasing attention in recent years.
The motivation for this stream of research is best
summarized by the statement that when two proteins are found
to interact in a high-throughput assay, we also tend to
use this as evidence of functional linkage [5]. As
confirmation, numerous studies have demonstrated the
guilt-by-association (GBA) principle, which states that proteins
sharing similar functional annotations tend to interact
more frequently than proteins which do not share them.
Interactions reflect the relation or dependence between
proteins. In the context of networks of such interactions,
gene functions show some form of autocorrelation [9].
While correlation denotes any statistical relationship
between two different variables (properties) of the
same objects (in a collection of independently selected
objects), autocorrelation denotes the statistical
relationships between the same variable (e.g., protein function)
on different but related (dependent) objects (e.g.,
interacting proteins). Although autocorrelation has never been
investigated in the context of Hierarchical Multi-label
Classification (HMC), it is not a new phenomenon in
protein studies. For example, it has been used for predicting
protein properties using sequence-derived structural and
physicochemical features of protein sequences [10]. In this
work, we introduce a definition of autocorrelation for the
case of HMC and propose a method that exploits it
to improve the accuracy of gene function prediction.
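To make the notion of autocorrelation on a network concrete, the sketch below computes Moran's I, a standard autocorrelation statistic, for a single binary function label over an undirected, unweighted interaction network. This is only an illustrative assumption: it is not the HMC-specific definition introduced in this work, and the protein names and edges are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple


def morans_i(values: Dict[Hashable, float],
             edges: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Moran's I of one variable over an undirected, unweighted network.

    values: node -> value of the variable (e.g., 1.0 if a protein has a
            given function, 0.0 otherwise).
    edges:  undirected interactions; each edge (u, v) is counted in both
            directions with weight 1.
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {node: x - mean for node, x in values.items()}
    denom = sum(d * d for d in dev.values())

    num = 0.0
    weight_total = 0.0
    for u, v in edges:
        num += 2.0 * dev[u] * dev[v]   # contributions of w_uv and w_vu
        weight_total += 2.0

    if denom == 0.0 or weight_total == 0.0:
        raise ValueError("Moran's I is undefined for constant values or no edges")
    return (n / weight_total) * (num / denom)


if __name__ == "__main__":
    # Hypothetical labels: p1 and p2 have the function, p3 and p4 do not.
    has_function = {"p1": 1.0, "p2": 1.0, "p3": 0.0, "p4": 0.0}

    # Interacting proteins share the label: positive autocorrelation.
    print(morans_i(has_function, [("p1", "p2"), ("p3", "p4")]))   # 1.0

    # Interacting proteins differ in the label: negative autocorrelation.
    print(morans_i(has_function, [("p1", "p3"), ("p2", "p4")]))   # -1.0
```

Values close to 1 indicate that interacting proteins tend to share the label, which is the situation the guilt-by-association principle describes; values close to -1 indicate that interacting proteins tend to differ.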
Figure 1 Example of a hierarchy. (a) A part of the FUN hierarchy [7]. (b) An example of input data: the FUN class hierarchy of an example and
the corresponding class vector and attribute set. (c) An example of a predictive clustering tree for HMC. The internal nodes contain tests on attribute
values, and the leaves contain vectors of probabilities associated with the class values.
Motivation and contributions
The method developed in this work, named NHMC,
addresses the task of hierarchical multi-label classification
where, in addition to attributes describing the genes, such
as microarray-derived expression values, phenotype and
sequence data, the network autocorrelation of the class
values (gene functions) is also considered. The main goal
is gene function prediction in the context of gene
interaction networks, where network autocorrelation (...truncated)