Random KNN feature selection - a fast and stable alternative to Random Forests (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-12-450.pdf

Random KNN feature selection - a fast and stable alternative to Random Forests

BMC Bioinformatics Random KNN feature selection - a fast and stable alternative to Random Forests Shengqiao Li 1 2 E James Harner 1 Donald A Adjeroh 0 0 The Lane Department of Computer Science and Electrical Engineering, West Virginia University , Morgantown, WV 26506 , USA 1 The Department of Statistics, West Virginia University , Morgantown, WV 26506 , USA 2 Health Effects Laboratory Division, the National Institute for Occupational Safety and Health , Morgantown, WV 26505 , USA Background: Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such small n, large p problems. However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs. Results: We present RKNN-FS, an innovative feature selection procedure for small n, large p problems. RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large scale problems, involving thousands of variables and multiple classes. Conclusions: Given the superiority of Random KNN in classification performance when compared with Random Forests, RKNN-FS's simplicity and ease of implementation, and its superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets. - Background Selection of a subset of important features (variables) is crucial for modeling high dimensional data in bioinformatics. For example, microarray gene expression data may include p 10, 000 genes. But the sample size, n, is much smaller, often less than 100. A model cannot be built directly since the model complexity is larger than the sample size. Technically, linear discriminant analysis can only fit a linear model up to n parameters. Such a model would provide a perfect fit, but it has no predictive power. This small n, large p problem has attracted a lot of research attention, aimed at removing nonessential or noisy features from the data, and thus determining a relatively small number of features which can mostly explain the observed data and the related biological processes. Though much work has been done, feature selection still remains an active research area. The significant interest is attributed to its many benefits. As enumerated in [1], these include (i) reducing the complexity of computation for prediction; (ii) removing information redundancy (cost savings); (iii) avoiding the issue of overfitting; and (iv) easing interpretation. In general, the generalization error becomes lower as fewer features are included, and the higher the number of samples per feature, the better. This is sometimes referred to as the Occams razor principle [2]. Here we give a brief summary on feature selection. For a recent review, see [3]. Basically, feature selection techniques can be grouped into three classes: Class I: Internal variable selection. This class mainly consists of Decision Trees (DT) [4], in which a variable is selected and split at each node by maximizing the purity of its descendant nodes. The variable selection process is done in the tree building process. The decision tree has the advantage of being easy to interpret, but it suffers from the instability of its hierarchical structures. Errors from ancestors pass to multiple descendant nodes and thus have an inflated effect. Even worse, a minor change in the root may change the tree structure significantly. An improved method based on decision trees is Random Forests [5], which grows a collection of trees by bootstrapping the samples and using a random selection of the variables. This approach decreases the prediction variance of a single tree. However, Random Forests may not remove certain variables, as they may appear in multiple trees. But Random Forests also provides a variable ranking mechanism that can be used to select important variables. Class II: Variable filtering. This class encompasses a variety of filters that are principally used for the classification problem. A specific type of model may not be invoked in the filtering process. A filter is a statistic defined on a random variable over multiple populations. With the choice of a threshold, some variables can be removed. Such filters include t-statistics, F-statistics, Kullback-Leibler divergence, Fishers discriminant ratio, mutual information [6], information-theoretic networks [7], maximum entropy [8], maximum information compression index [9], relief [10,11], correlation-based filters [12,13], relevance and redundancy analysis [14], etc. Class III: Wrapped methods. These techniques wrap a model into a search algorithm [15,16]. This class includes foreward/backword, stepwise selection using a defined criterion, for instance, partial F-statistics, Aikaikes Information Criterion (AIC) [17], Bayesian Information Criterion (BIC) [18], etc. In [19], sequential projection pursuit (SPP) was combined with partial least square (PLS) analysis for variable selection. Wrapped feature selection based on Random Forests has also been studied [20,21]. There are two measures of importance for the variables with Random Forests, namely, mean decrease accuracy (MDA) and mean decrease Gini (MDG). Both measures are, however, biased [22]. One study shows that MDG is more robust than MDA [23]; however another study shows the contrary [24]. Our experiments show that both methods give very similar results. In this paper we present results only for MDA. The software package varSelRF in R developed in [21] will be used in this paper for comparisons. We call this method RF-FS or RF when there is no confusion. Given the hierarchical structure of the trees in the forest, stability is (...truncated)