Adaptive partitioning schemes for bipartite ranking
Hendrik Blockeel.
N. Vayatis, ENS Cachan & UniverSud, CMLA UMR CNRS No. 8536, 61, avenue du Président Wilson, 94235 Cachan cedex,
France
Recursive partitioning methods are among the most popular techniques in machine learning. The purpose of this paper is to investigate how to adapt this methodology to the bipartite ranking problem. Following in the footsteps of the TREERANK approach developed in Clémençon and Vayatis (Proceedings of the 2008 Conference on Algorithmic Learning Theory, 2008 and IEEE Trans. Inf. Theory 55(9):4316-4336, 2009), we present tree-structured algorithms designed for learning to rank instances based on classification data. The main contributions of the present work are the following: a practical implementation of the TREERANK algorithm, and well-founded solutions to the crucial issues related to the splitting rule and to the choice of the right size for the ranking tree. From the angle embraced in this paper, splitting is viewed as a cost-sensitive classification task with data-dependent cost. Hence, up to straightforward modifications, any classification algorithm may serve as a splitting rule. Also, we propose to implement a cost-complexity pruning method after the growing stage in order to produce a right-sized ranking sub-tree with large AUC. In particular, performance bounds are established for pruning schemes inspired by recent work on nonparametric model selection. Finally, we propose indicators for variable importance and variable dependence, together with various simulation studies illustrating the potential of our method.
1 Introduction
The goal of bipartite ranking procedures is to order all possible values $x \in \mathcal{X}$ of a random
variable $X$ over a measurable space $\mathcal{X}$. The available output information on each realization
of $X$ is modeled by a random binary label $Y \in \{-1, +1\}$. Consider the classification dataset
$\{(X_i, Y_i) : 1 \le i \le n\}$ obtained by sampling the random pair $(X, Y)$. The scoring approach
to ranking binary classification data consists of building a scoring function $s : \mathcal{X} \to \mathbb{R}$ which
takes higher values when the event $Y = +1$ is more likely to be observed. This problem
arises in a large variety of applications, ranging from the design of search engines in
information retrieval to medical diagnosis through credit-risk screening or anomaly detection in
signal processing.
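To make the scoring approach concrete, the following minimal sketch computes the empirical AUC of a scoring function, that is, the fraction of (positive, negative) pairs that the function orders correctly, ties counting for one half. The function name `empirical_auc`, the toy sample and the identity scoring function are illustrative assumptions and are not taken from the original work.

```python
from itertools import product


def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.

    Hypothetical helper: labels are assumed to take values in {-1, +1}.
    """
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one instance of each label")
    total = 0.0
    for s_pos, s_neg in product(pos, neg):
        if s_pos > s_neg:
            total += 1.0
        elif s_pos == s_neg:
            total += 0.5
    return total / (len(pos) * len(neg))


# Toy one-dimensional sample (made up for illustration).
xs = [0.1, 0.4, 0.35, 0.8]
ys = [-1, -1, +1, +1]


def s(x):
    """Illustrative scoring function: the identity on the real line."""
    return x


auc = empirical_auc([s(x) for x in xs], ys)
```

On this toy sample, three of the four (positive, negative) pairs are correctly ordered, so `auc` equals 0.75. The quadratic loop over pairs is kept for readability; a sort-based computation would be preferable on large samples.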
Several approaches have been considered in order to develop ranking algorithms
under binary label information. Standard methods build a scoring rule based on the
plug-in approach (such as logistic regression models, see for instance Hastie and Tibshirani
1990). Machine learning methods are mostly based on the maximization of a
performance functional, like the AUC criterion, which depends on pairs of observations (refer
to RankSVM (Joachims 2002), RankNet (Burges et al. 2005), RankBoost (Freund et al.
2003), RankRLS (Pahikkala et al. 2007)). A natural direction to explore is also the
adaptation of decision trees in the spirit of CART (Breiman et al. 1984) for ranking purposes.
The number of papers introducing modifications of decision trees is considerable (see
for instance Provost and Domingos 2003; Ferri et al. 2003; Flach and Matsubara 2007;
Hüllermeier and Vanderlooy 2008, 2009; Yu et al. 2008 and references therein). The main
ideas underlying these works are: (i) the use of classification decision trees as estimators of
the regression function, also known as Probabilistic Estimation Trees (PET), (ii) the choice
of a splitting rule adapted to the bipartite ranking problem. Indeed, adapting successful
classification or regression methods to ranking may require significant innovations since the
ranking problem is of different nature. We point out that popular classification rules are
based on the concept of local learning (see Friedman 1996). For classification procedures
such as those obtained through recursive partitioning of the input space $\mathcal{X}$, the predicted
label of a given instance $x \in \mathcal{X}$ only depends on the data lying in the subregion of the partition
containing x. In contrast, the notion of ranking/ordering would rather involve comparing the
subregions to each other.
Following this line of thought, we have proposed, in our previous work (Clémençon and
Vayatis 2008, 2009), a different description of ranking decision trees. We characterize the
output of a decision tree algorithm not only by a partition of the feature space and the
local properties of the cells composing the partition, but also by a permutation over the cells.
The permutation indicates how to rank new observations (points lying in the same cell being
ranked equal). These two ingredients (partition and permutation) define a piecewise constant
real-valued function, a so-termed scoring rule. We also developed, and thoroughly
investigated, a specific recursive partitioning method, called the TREERANK algorithm. This
algorithm produces scoring rules in a simple top-down approach. An important contribution of
this work also consists in the connection established between the partitioning of the feature
space through this algorithm and the approximation/estimation of the optimal ROC curve by
splines of order 1. In Clémençon and Vayatis (2009), it was proved that, under general
assumptions, the resulting piecewise linear ROC curve converges to the optimal one not only
in the AUC sense but also in a stronger sense (with respect to the supremum norm).
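The description of a ranking tree as a partition together with a permutation over its cells can be sketched as follows. This is a simplified illustration on a one-dimensional feature space and not the TREERANK implementation: the cells are assumed to be half-open intervals listed from highest-ranked to lowest-ranked, and the helper names (`make_scoring_rule`, `roc_knots`, `auc_from_knots`) as well as the toy sample are hypothetical. The empirical ROC curve of such a piecewise constant scoring rule is piecewise linear, with one knot per cell.

```python
def make_scoring_rule(cells):
    """Piecewise constant scoring rule from an ordered list of cells.

    cells: half-open intervals (low, high), ordered from highest-ranked
    to lowest-ranked (the ordering plays the role of the permutation).
    """
    k = len(cells)

    def score(x):
        for rank, (low, high) in enumerate(cells):
            if low <= x < high:
                return k - rank  # points in the same cell are ranked equal
        raise ValueError("x lies outside the partition")

    return score


def roc_knots(cells, xs, ys):
    """Knots (false positive rate, true positive rate) of the empirical ROC."""
    n_pos = sum(1 for y in ys if y == +1)
    n_neg = sum(1 for y in ys if y == -1)
    knots = [(0.0, 0.0)]
    tp = fp = 0
    for low, high in cells:  # add cells in decreasing order of rank
        tp += sum(1 for x, y in zip(xs, ys) if low <= x < high and y == +1)
        fp += sum(1 for x, y in zip(xs, ys) if low <= x < high and y == -1)
        knots.append((fp / n_neg, tp / n_pos))
    return knots


def auc_from_knots(knots):
    """Area under the piecewise linear curve through the knots (trapezoids)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(knots, knots[1:])
    )


# Toy partition of [0, 1) and toy sample (made up for illustration).
cells = [(0.5, 1.0), (0.25, 0.5), (0.0, 0.25)]
xs = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
ys = [+1, +1, -1, +1, -1, -1, -1, +1]

score = make_scoring_rule(cells)
knots = roc_knots(cells, xs, ys)
auc = auc_from_knots(knots)
```

With this toy sample the knots are (0, 0), (0.25, 0.5), (0.5, 0.75) and (1, 1), and the area under the piecewise linear curve is 0.65625, which coincides with the pairwise AUC of the scoring rule when ties count for one half.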
However, due to the very principle of recursive partitioning, the TREERANK algorithm suffers
from the same drawback as the popular CART method (see Breiman et al. 1984): it may be
fooled by an XOR configuration, yielding inappropriate first splits and compromising the
results of the tree growing procedure. In classification, given the local aspect of the decision
rule, a bad start may nevertheless be compensated by growing the tree further at the cost of
a certain amount of artificial complexity. With ranking, this drawback may have much more
dramatic consequences due to the global nature of the ranking task. In some sense, ranking
errors are stacked as one grows the tree and the performance of the TREERANK algorithm is
very sensitive to the chosen splitting rule. Recursive splitting is achieved by means of the
optimization of an entropic measure which accounts for AUC maximization on a given cell
of the partition induced by the tree. This is called the Optimization step of the TREERANK
algorithm and it is the critical step from both the computational and the approximation viewpoints.
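The XOR drawback mentioned above can be illustrated on a four-point toy configuration. The sketch below is an assumption-laden simplification: it uses axis-parallel threshold splits and measures the quality of a split directly by the AUC of the resulting two-cell scoring rule, rather than by the entropic measure used in the Optimization step. The helper `two_cell_auc` and the data are hypothetical. For a two-cell rule with false positive rate alpha and true positive rate beta, the piecewise linear ROC curve through (0, 0), (alpha, beta), (1, 1) has area (1 + beta - alpha) / 2.

```python
def two_cell_auc(points, labels, feature, threshold):
    """AUC of the two-cell rule induced by the split x[feature] >= threshold.

    The better of the two possible orderings of the cells is returned.
    """
    n_pos = labels.count(+1)
    n_neg = labels.count(-1)
    in_cell = [p[feature] >= threshold for p in points]
    beta = sum(1 for c, y in zip(in_cell, labels) if c and y == +1) / n_pos
    alpha = sum(1 for c, y in zip(in_cell, labels) if c and y == -1) / n_neg
    auc = (1.0 + beta - alpha) / 2.0
    return max(auc, 1.0 - auc)


# XOR configuration: the label is +1 exactly when the coordinates differ.
xor_points = [(0, 0), (1, 1), (0, 1), (1, 0)]
xor_labels = [-1, -1, +1, +1]

# Best achievable AUC over the two axis-parallel first splits.
best_first_split = max(
    two_cell_auc(xor_points, xor_labels, feature, 0.5) for feature in (0, 1)
)
```

Here `best_first_split` equals 0.5: each axis-parallel first split leaves the same proportion of positive instances on both sides, so no first split of this type improves on random ranking, even though a deeper tree can separate the two classes.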
The present paper proposes to solve the practical issues inherent to the nature of the
TREERANK algorithm. The primary goal of this paper is to propose pragmatic strategies for
performing the Optimization step of the TREERANK algorithm efficiently. Technically, the
question addressed is how to split the cells in a flexible manner, so that accurate
approximants of bilevel sets of the regression function may be obtained. Partition-based splitting
rules, both fixed and adaptive, are considered for this pu (...truncated)