Random KNN feature selection - a fast and stable alternative to Random Forests
BMC Bioinformatics
Shengqiao Li 1 2
E James Harner 1
Donald A Adjeroh 0
0 The Lane Department of Computer Science and Electrical Engineering, West Virginia University , Morgantown, WV 26506 , USA
1 The Department of Statistics, West Virginia University , Morgantown, WV 26506 , USA
2 Health Effects Laboratory Division, the National Institute for Occupational Safety and Health , Morgantown, WV 26505 , USA
Background: Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such small n, large p problems. However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs.
Results: We present RKNN-FS, an innovative feature selection procedure for small n, large p problems. RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large scale problems, involving thousands of variables and multiple classes.
Conclusions: Given the superiority of Random KNN in classification performance when compared with Random Forests, RKNN-FS's simplicity and ease of implementation, and its superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets.
Background
Selection of a subset of important features (variables) is
crucial for modeling high dimensional data in
bioinformatics. For example, microarray gene expression data
may include p ≥ 10,000 genes. But the sample size, n, is
much smaller, often less than 100. A model cannot be
built directly since the model complexity is larger than
the sample size. Technically, linear discriminant analysis
can only fit a linear model up to n parameters. Such a
model would provide a perfect fit, but it has no
predictive power. This small n, large p problem has
attracted a lot of research attention, aimed at removing
nonessential or noisy features from the data, and thus
determining a relatively small number of features which
can mostly explain the observed data and the related
biological processes.
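The perfect-fit claim above can be illustrated with a small sketch. It assumes a toy setting that is not taken from the paper: n = 12 samples, p = n purely random Gaussian features, and random ±1 labels. The helper names (solve_linear_system, predict) are hypothetical. A linear model with n parameters reproduces the training labels exactly even though the features carry no information about them.

```python
import random

def solve_linear_system(a, b):
    """Solve the square system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def predict(rows, w):
    """Linear predictions for a list of feature rows."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in rows]

rng = random.Random(0)
n = 12  # toy sample size (assumption); the model gets one parameter per sample
x_train = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
y_train = [rng.choice([-1.0, 1.0]) for _ in range(n)]  # labels are pure noise

w = solve_linear_system(x_train, y_train)
train_residual = max(abs(pred - yi)
                     for pred, yi in zip(predict(x_train, w), y_train))

# Fresh data from the same (uninformative) distribution.
x_test = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(200)]
y_test = [rng.choice([-1.0, 1.0]) for _ in range(200)]
test_acc = sum((pred > 0) == (yi > 0)
               for pred, yi in zip(predict(x_test, w), y_test)) / 200
print(f"max training residual: {train_residual:.2e}, test accuracy: {test_acc:.2f}")
```

Because the labels are independent of the features, the accuracy on fresh data can only be expected to hover around chance, which is the sense in which such a fit has no predictive power.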
Though much work has been done, feature selection
still remains an active research area. The significant
interest is attributed to its many benefits. As
enumerated in [1], these include (i) reducing the complexity of
computation for prediction; (ii) removing information
redundancy (cost savings); (iii) avoiding the issue of
overfitting; and (iv) easing interpretation. In general, the
generalization error becomes lower as fewer features are
included, and the higher the number of samples per
feature, the better. This is sometimes referred to as the
Occam's razor principle [2]. Here we give a brief
summary of feature selection. For a recent review, see [3].
Basically, feature selection techniques can be grouped
into three classes: Class I: Internal variable selection.
This class mainly consists of Decision Trees (DT) [4], in
which a variable is selected and split at each node by
maximizing the purity of its descendant nodes. The
variable selection process is done in the tree building
process. The decision tree has the advantage of being easy
to interpret, but it suffers from the instability of its
hierarchical structures. Errors from ancestors pass to
multiple descendant nodes and thus have an inflated effect.
Even worse, a minor change in the root may change the
tree structure significantly. An improved method based
on decision trees is Random Forests [5], which grows a
collection of trees by bootstrapping the samples and
using a random selection of the variables. This approach
decreases the prediction variance of a single tree.
However, Random Forests may not remove certain variables,
as they may appear in multiple trees. Nevertheless, Random
Forests provides a variable ranking mechanism that
can be used to select important variables.
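The two sources of randomness described above, bootstrapping the samples and randomly selecting variables, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Random Forests implementation: depth-one stumps stand in for full trees, the toy data (20 samples, 20 variables, only variable 0 informative) are invented, and the names fit_stump, fit_forest, predict_forest, and mtry are hypothetical. In Random Forests proper the random variable subset is redrawn at every node; with a single node per stump the two coincide.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(x, y, candidate_features):
    """Return (feature, threshold, left_label, right_label) for the purest split."""
    best_score, best_split = None, None
    for f in candidate_features:
        values = sorted({row[f] for row in x})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between consecutive observed values
            left = [yi for row, yi in zip(x, y) if row[f] <= t]
            right = [yi for row, yi in zip(x, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best_score is None or score < best_score:
                best_score = score
                best_split = (f, t,
                              Counter(left).most_common(1)[0][0],
                              Counter(right).most_common(1)[0][0])
    return best_split

def fit_forest(x, y, n_trees, mtry, rng):
    """Grow n_trees stumps, each on a bootstrap sample and a random variable subset."""
    n, p = len(x), len(x[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample rows with replacement
        feats = rng.sample(range(p), mtry)            # random subset of the variables
        stump = fit_stump([x[i] for i in rows], [y[i] for i in rows], feats)
        if stump is not None:
            forest.append(stump)
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Toy data (assumption): only variable 0 separates the two classes.
rng = random.Random(1)
n, p = 20, 20
y = [i % 2 for i in range(n)]
x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for row, yi in zip(x, y):
    row[0] = rng.gauss(3.0 if yi == 1 else -3.0, 0.5)

forest = fit_forest(x, y, n_trees=200, mtry=5, rng=rng)
train_acc = sum(predict_forest(forest, row) == yi for row, yi in zip(x, y)) / n
print(f"stumps: {len(forest)}, training accuracy: {train_acc:.2f}")
```

Averaging many such randomized learners is what reduces the variance of any single one; the sketch leaves out everything else (full tree growth, out-of-bag error, variable importance).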
Class II: Variable filtering. This class encompasses a
variety of filters that are principally used for the
classification problem. A specific type of model may not be
invoked in the filtering process. A filter is a statistic
defined on a random variable over multiple populations.
With the choice of a threshold, some variables can be
removed. Such filters include t-statistics, F-statistics,
Kullback-Leibler divergence, Fisher's discriminant ratio,
mutual information [6], information-theoretic networks
[7], maximum entropy [8], maximum information
compression index [9], relief [10,11], correlation-based filters
[12,13], relevance and redundancy analysis [14], etc.
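As a concrete instance of such a filter, the sketch below scores each variable with a two-sample t-statistic and keeps those whose absolute value exceeds a threshold. It is only an illustration: Welch's form of the statistic, the threshold of 5.0, the toy data (10 samples per class, 50 variables, two of them shifted), and the names welch_t and filter_features are all assumptions rather than choices made in the paper.

```python
import math
import random

def welch_t(a, b):
    """Welch's two-sample t-statistic for two lists of values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

def filter_features(x, y, threshold):
    """Return the indices of variables whose |t| between classes 0 and 1 exceeds threshold."""
    kept = []
    for f in range(len(x[0])):
        class0 = [row[f] for row, yi in zip(x, y) if yi == 0]
        class1 = [row[f] for row, yi in zip(x, y) if yi == 1]
        if abs(welch_t(class0, class1)) > threshold:
            kept.append(f)
    return kept

# Toy data (assumption): variables 0 and 1 differ between classes, the rest are noise.
rng = random.Random(2)
n_per_class, p = 10, 50
y = [0] * n_per_class + [1] * n_per_class
x = [[rng.gauss(0, 1) for _ in range(p)] for _ in y]
for row, yi in zip(x, y):
    if yi == 1:
        row[0] += 6.0
        row[1] += 6.0

kept = filter_features(x, y, threshold=5.0)
print("variables kept:", kept)
```

No classifier is fitted at any point, which is what distinguishes a filter from the wrapped methods; the price is that each variable is judged in isolation.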
Class III: Wrapped methods. These techniques wrap a
model into a search algorithm [15,16]. This class
includes forward/backward and stepwise selection using a
defined criterion, for instance, partial F-statistics,
Akaike's Information Criterion (AIC) [17], Bayesian
Information Criterion (BIC) [18], etc. In [19], sequential
projection pursuit (SPP) was combined with partial least
squares (PLS) analysis for variable selection. Wrapped
feature selection based on Random Forests has also
been studied [20,21]. There are two measures of
importance for the variables with Random Forests, namely,
mean decrease accuracy (MDA) and mean decrease Gini
(MDG). Both measures are, however, biased [22]. One
study shows that MDG is more robust than MDA [23];
however, another study shows the contrary [24]. Our
experiments show that both methods give very similar
results. In this paper we present results only for MDA.
The software package varSelRF in R developed in [21]
will be used in this paper for comparisons. We call this
method RF-FS or RF when there is no confusion. Given
the hierarchical structure of the trees in the forest,
stability is (...truncated)