Predicting gene function using hierarchical multi-label decision tree ensembles
Leander Schietgat
0
Celine Vens
0
Jan Struyf
0
Hendrik Blockeel
0
Dragi Kocev
1
Sašo Džeroski
1
0 Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium
1 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability.
Results: We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use.
Conclusions: Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
Background
The completion of several genome projects in the past
decade has generated the full genome sequence of many
organisms. Identifying open reading frames (ORFs) in
the sequences and assigning biological functions to
them has now become a key challenge in modern
biology. This last step, which is the focus of our paper, is
often guided by automatic discovery processes which
interact with the laboratory experiments.
More precisely, machine learning techniques are used
to predict gene functions from a predefined set of
possible functions (e.g., the functions in the Gene Ontology).
Afterwards, the predictions with highest confidence can
be tested in the lab. There are two characteristics of the
function prediction task that distinguish it from
common machine learning tasks: (1) a single gene may have
multiple functions; and (2) the functions are organized
in a hierarchy: a gene that is related to some function is
automatically related to all its ancestor functions (this is
called the hierarchy constraint). This particular problem
setting is known in machine learning as hierarchical
multi-label classification (HMC) and recently, many
approaches have been proposed to deal with it [1-19].
These approaches differ with respect to a number of
characteristics: which learning algorithm they are based
on, whether the hierarchy constraint is always met and
whether they can deal with hierarchies structured as a
directed acyclic graph (DAG), such as the Gene
Ontology, or are restricted to hierarchies structured as a
rooted tree, like MIPS's FunCat.
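The hierarchy constraint can be illustrated with a minimal sketch. The toy hierarchy, the function names and the helpers `ancestors`, `close_under_hierarchy` and `satisfies_constraint` below are hypothetical and chosen for illustration only; they are not part of any method discussed in this article. The sketch assumes the hierarchy is given as a mapping from each function to its set of parents, which covers both trees (at most one parent) and DAGs (several parents allowed).

```python
from typing import Dict, Set

# Hypothetical toy hierarchy (an assumption, not real FunCat or GO terms):
# each function maps to the set of its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "metabolism": set(),
    "transport": set(),
    "amino acid metabolism": {"metabolism"},
    # Two parents: only possible in a DAG-structured hierarchy.
    "amino acid transport": {"transport", "amino acid metabolism"},
}


def ancestors(label: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all ancestors of a label by walking parent links."""
    seen: Set[str] = set()
    stack = list(parents[label])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def close_under_hierarchy(labels: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Extend a label set with every ancestor of every label in it."""
    closed = set(labels)
    for label in labels:
        closed |= ancestors(label, parents)
    return closed


def satisfies_constraint(labels: Set[str], parents: Dict[str, Set[str]]) -> bool:
    """A label set meets the constraint if all parents of each label are present."""
    return all(parents[label] <= labels for label in labels)
```

In this sketch, a gene annotated only with "amino acid transport" violates the constraint until its annotation is closed under the hierarchy, after which it carries all four functions.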
Decision trees are a well-known type of classifier that
can be learned efficiently from large datasets, produce
accurate predictions and can lead to knowledge that
provides insight into the biology behind the predictions,
as demonstrated by Clare et al. [3]. They have been
applied to several machine learning tasks [20]. In earlier
work [14], we have investigated how they can be
extended to the HMC setting: we presented an HMC
decision tree learner that takes into account the
hierarchy constraint and that is able to process DAG
structured hierarchies.
In this article, we show that our HMC decision tree
method outperforms previously published approaches
applied to S. cerevisiae and A. thaliana. Our
comparisons primarily use precision-recall curves. This
evaluation method is well-suited for the HMC tasks
considered here, due to the large class skew present in
these tasks.
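As a rough illustration of how the points of a precision-recall curve can be obtained for a single gene function, the sketch below sweeps a threshold over predicted scores. The function name `precision_recall_points` and the toy scores are assumptions made for this example; the sketch does not reproduce the exact evaluation procedure used in our experiments (for instance, how curves are combined across functions).

```python
from typing import List, Sequence, Tuple


def precision_recall_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct score threshold.

    scores: predicted confidence that each gene has the function.
    labels: 1 if the gene truly has the function, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("recall is undefined without positive examples")

    # Rank genes from most to least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points: List[Tuple[float, float]] = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every gene whose score equals the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Toy data (hypothetical): two positives among five genes.
points = precision_recall_points([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0])
```

Neither precision nor recall counts true negatives, which is why the curve stays informative under class skew: a classifier that predicts no function at all would be correct on most gene-function pairs, yet its recall would be zero.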
Moreover, we show that by upgrading our method to
an ensemble technique, classification performance
improves further. Ensemble techniques are learning
methods that construct a set of classifiers and classify
new data instances by taking a vote over their
predictions. Experiments show that ensembles of decision
trees outperform Bayesian-corrected support vector
machines [10], a statistical learning method for gene
function prediction, on S. cerevisiae data, and methods
participating in the MouseFunc challenge [21,22] on
M. musculus data.
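The voting step of an ensemble can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this article: each base classifier is assumed to be a callable that returns a per-function probability, the vote is assumed to be a plain average of these probabilities, and the names `Classifier`, `ensemble_predict` and the three stub classifiers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Assumed interface: a classifier maps a feature vector to a probability
# for each gene function.
Classifier = Callable[[Sequence[float]], Dict[str, float]]


def ensemble_predict(
    classifiers: List[Classifier], x: Sequence[float]
) -> Dict[str, float]:
    """Average the per-function probabilities predicted by all classifiers."""
    if not classifiers:
        raise ValueError("the ensemble needs at least one classifier")
    totals: Dict[str, float] = {}
    for clf in classifiers:
        for label, prob in clf(x).items():
            totals[label] = totals.get(label, 0.0) + prob
    n = len(classifiers)
    return {label: total / n for label, total in totals.items()}


# Stub classifiers standing in for learned trees (hypothetical values).
stubs: List[Classifier] = [
    lambda x: {"a": 1.0, "b": 0.0},
    lambda x: {"a": 0.5, "b": 0.0},
    lambda x: {"a": 0.0, "b": 0.0},
]
averaged = ensemble_predict(stubs, [0.0])
```

The averaged probabilities can then be thresholded to obtain the predicted set of functions; varying the threshold trades precision against recall.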
Related work
A number of machine learning approaches have been
proposed in the area of functional genomics. They have
been applied in the context of gene function prediction
in S. cerevisiae, A. thaliana or M. musculus. We have
grouped them according to the learning approach they
use.
Network based methods
Several approaches predict functions of unannotated
genes based on known functions of genes that are
nearby in a functional association network or
protein-protein interaction network [2,4,5,8,15-17]. GENEFAS
[4], for example, predicts functions of unannotated yeast
genes based on known functions of genes that are
nearby in a functional association network.
GENEMANIA [15] calculates, for each gene function, a composite
functional association network from multiple networks
derived from different genomic and proteomic data
sources.
These approaches are based on label propagation and
do not return a global predictive model. However, a
number of approaches were proposed to combine
predictions of functional networks with those of a
predictive model. Kim et al. [16] combine them with
predictions from a Naive Bayes classifier. The
combination is based on a simple aggregation function. The
Funckenstein system [17] uses logistic regression to
combine predictions made by a functional association
network with predictions from a random forest.
Kernel based methods
Deng et al. [1] predict gene functions with Markov
random fields using protein interaction data. They learn a
model for each gene function separately and ignore the
hierarchical relationships between the functions.
Lanckriet et al. [6] represent the data by means of a kernel
function and construct support vector machines for
each gene function separately. They only predict
top-level classes in the hierarchy. Lee et al. [13] have
combined the Markov random field approach of [1] with the
SVM approach of [6] by computing diffusion kernels
and using them in kernel logistic regression.
Obozinski et al. [19] present a two-step approach in
which SVMs are first learned independently for each
gene function separately (allowing violations of the
hierarchy constraint) and are then reconcilia (...truncated)