Adaptive partitioning schemes for bipartite ranking (pdf)

https://link.springer.com/content/pdf/10.1007%2Fs10994-010-5190-y.pdf

Editor: Hendrik Blockeel.

N. Vayatis, ENS Cachan & UniverSud, CMLA UMR CNRS No. 8536, 61, avenue du Président Wilson, 94235 Cachan cedex, France

Recursive partitioning methods are among the most popular techniques in machine learning. The purpose of this paper is to investigate how to adapt this methodology to the bipartite ranking problem. Following in the footsteps of the TREERANK approach developed in Clémençon and Vayatis (Proceedings of the 2008 Conference on Algorithmic Learning Theory, 2008 and IEEE Trans. Inf. Theory 55(9):4316-4336, 2009), we present tree-structured algorithms designed for learning to rank instances based on classification data. The main contributions of the present work are the practical implementation of the TREERANK algorithm and well-founded solutions to two crucial issues: the splitting rule and the choice of the right size for the ranking tree. From the angle embraced in this paper, splitting is viewed as a cost-sensitive classification task with data-dependent cost. Hence, up to straightforward modifications, any classification algorithm may serve as a splitting rule. We also propose to implement a cost-complexity pruning method after the growing stage in order to produce a right-sized ranking sub-tree with large AUC. In particular, performance bounds are established for pruning schemes inspired by recent work on nonparametric model selection. Finally, we propose indicators for variable importance and variable dependence, plus various simulation studies illustrating the potential of our method.

1 Introduction

The goal of bipartite ranking procedures is to order all possible values $x \in \mathcal{X}$ of a random variable $X$ over a measurable space $\mathcal{X}$. The available output information on each realization of $X$ is modeled by a random binary label $Y \in \{-1, +1\}$. Consider the classification dataset $\{(X_i, Y_i) : 1 \le i \le n\}$ obtained by sampling the random pair $(X, Y)$.
The scoring approach to ranking binary classification data consists of building a scoring function $s : \mathcal{X} \to \mathbb{R}$ which takes higher values when the event $Y = +1$ is more likely to be observed. This problem arises in a large variety of applications, ranging from the design of search engines in information retrieval to medical diagnosis, credit-risk screening and anomaly detection in signal processing.

Several approaches have been considered in order to develop ranking algorithms under binary label information. Standard methods build a scoring rule based on the plug-in approach (such as logistic regression models, see for instance Hastie and Tibshirani 1990). Machine learning methods are mostly based on the maximization of a performance functional, like the AUC criterion, which depends on pairs of observations (refer to RankSVM (Joachims 2002), RankNet (Burges et al. 2005), RankBoost (Freund et al. 2003), RankRLS (Pahikkala et al. 2007)). Another natural direction to explore is the adaptation of decision trees in the spirit of CART (Breiman et al. 1984) for ranking purposes. The number of papers introducing modifications of decision trees is considerable (see for instance Provost and Domingos 2003; Ferri et al. 2003; Flach and Matsubara 2007; Hüllermeier and Vanderlooy 2008, 2009; Yu et al. 2008 and references therein). The main ideas underlying these works are: (i) the use of classification decision trees as estimators of the regression function, also known as Probabilistic Estimation Trees (PET), (ii) the choice of a splitting rule adapted to the bipartite ranking problem. Indeed, adapting successful classification or regression methods to ranking may require significant innovations since the ranking problem is of a different nature. We point out that popular classification rules are based on the concept of local learning (see Friedman 1996).
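To make concrete the remark above that the AUC criterion depends on pairs of observations, the following minimal sketch computes the empirical AUC of a scoring function on a labeled sample as the fraction of (positive, negative) pairs that are ordered correctly, with ties counted as one half. The function name `empirical_auc` and the toy data are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def empirical_auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    scores: score s(x_i) of each instance (higher = more likely positive).
    labels: label y_i in {-1, +1} of each instance.
    Tied pairs contribute 1/2, a common convention (assumed here).
    """
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    total = 0.0
    for s_pos, s_neg in product(pos, neg):
        if s_pos > s_neg:
            total += 1.0   # pair ordered correctly
        elif s_pos == s_neg:
            total += 0.5   # tie
    return total / (len(pos) * len(neg))


# Toy sample, made up for illustration only.
scores = [0.9, 0.8, 0.4, 0.3, 0.3]
labels = [+1, -1, +1, -1, +1]
auc = empirical_auc(scores, labels)
```

The double loop is quadratic in the sample size; it is written this way only to expose the pairwise structure of the criterion, not as an efficient implementation.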
For classification procedures such as those obtained through recursive partitioning of the input space $\mathcal{X}$, the predicted label of a given instance $x \in \mathcal{X}$ only depends on the data lying in the subregion of the partition containing $x$. In contrast, the notion of ranking/ordering involves comparing the subregions to each other. Following this line of thought, we proposed, in our previous work (Clémençon and Vayatis 2008, 2009), a different description of ranking decision trees. We characterize the output of a decision tree algorithm not only by a partition of the feature space and the local properties of the cells composing the partition, but also by a permutation over the cells. The permutation indicates how to rank new observations (points lying in the same cell being ranked equal). These two ingredients (partition and permutation) define a piecewise constant real-valued function, a so-termed scoring rule.

We also developed, and thoroughly investigated, a specific recursive partitioning method, called the TREERANK algorithm. This algorithm produces scoring rules in a simple top-down approach. An important contribution of that work is the connection established between the partitioning of the feature space through this algorithm and the approximation/estimation of the optimal ROC curve by splines of order 1. In Clémençon and Vayatis (2009), it was proved that, under general assumptions, the resulting piecewise linear ROC curve converges to the optimal one not only in the AUC sense but also in a stronger sense (with respect to the supremum norm). However, due to the very principle of recursive partitioning, the TREERANK algorithm suffers from the same drawback as the popular CART method (see Breiman et al. 1984): it may be fooled by an XOR configuration, yielding inappropriate first splits and compromising the results of the tree growing procedure.
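The partition-plus-permutation description given above can be illustrated with a small sketch. It is not the TREERANK algorithm: the partition (intervals of the real line) and the permutation are supplied by hand, and the names `make_scoring_rule` and `roc_points` are hypothetical. The sketch only shows how the two ingredients define a piecewise constant scoring rule, and why the associated empirical ROC curve has one knot per cell, which is what makes it piecewise linear once the knots are joined.

```python
from bisect import bisect_right


def make_scoring_rule(breakpoints, cell_ranks):
    """Build a piecewise constant scoring rule on the real line.

    breakpoints: sorted cut points; they define len(breakpoints) + 1 cells.
    cell_ranks: cell_ranks[k] is the rank of cell k (1 = most likely
                positive); must be a permutation of 1..K.
    """
    n_cells = len(breakpoints) + 1
    if sorted(cell_ranks) != list(range(1, n_cells + 1)):
        raise ValueError("cell_ranks must be a permutation of 1..K")

    def score(x):
        cell = bisect_right(breakpoints, x)      # index of the cell containing x
        return n_cells - cell_ranks[cell] + 1    # best-ranked cell gets top score

    return score


def roc_points(score, xs, ys):
    """Knots (false positive rate, true positive rate) of the empirical ROC curve."""
    n_pos = sum(1 for y in ys if y == +1)
    n_neg = len(ys) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the threshold over the distinct score values, highest first.
    for t in sorted({score(x) for x in xs}, reverse=True):
        tp = sum(1 for x, y in zip(xs, ys) if y == +1 and score(x) >= t)
        fp = sum(1 for x, y in zip(xs, ys) if y == -1 and score(x) >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Three cells: (-inf, 0), [0, 1), [1, +inf); the middle cell is ranked first.
score = make_scoring_rule([0.0, 1.0], [3, 1, 2])

# Toy sample, made up for illustration only.
xs = [-1.0, -0.5, 0.2, 0.5, 0.7, 1.5, 2.0]
ys = [-1, -1, +1, +1, -1, +1, -1]
points = roc_points(score, xs, ys)
```

Two points lying in the same cell always receive the same score, so the ROC curve can only bend at cell boundaries in score space; refining the partition adds knots to the curve.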
In classification, given the local aspect of the decision rule, a bad start may nevertheless be compensated by growing the tree further at the cost of a certain amount of artificial complexity. With ranking, this drawback may have much more dramatic consequences due to the global nature of the ranking task. In some sense, ranking errors are stacked as one grows the tree, and the performance of the TREERANK algorithm is very sensitive to the chosen splitting rule. Recursive splitting is achieved by optimizing an entropic measure which accounts for AUC maximization on a given cell of the partition induced by the tree. This is called the Optimization step of the TREERANK algorithm and it is the critical step from both the computational and the approximation viewpoints.

The present paper proposes to solve the practical issues inherent to the nature of the TREERANK algorithm. Its primary goal is to propose pragmatic strategies for performing the Optimization step of the TREERANK algorithm efficiently. Technically, the question addressed is how to split the cells in a flexible manner, so that accurate approximants of bilevel sets of the regression function may be obtained. Partition-based splitting rules, both fixed and adaptive, are considered for this pu (...truncated)