Entropy-based gene ranking without selection bias for the predictive classification of microarray data
BMC Bioinformatics
Cesare Furlanello, Maria Serafini, Stefano Merler, Giuseppe Jurman
Address: ITC-irst, Trento, Italy
Background: We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process).
Results: With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles.
Conclusions: Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
Background
The study of gene expression patterns is expected to
enable significant advances in disease diagnosis and
prognosis. The main objectives of a discovery process based on
microarray data are the understanding of the molecular
pathways of diseases, their early detection, and the
development of measures of individual responsiveness to
existing or new therapies. In particular, the perspective of
providing new targets for therapy and of developing
clinical biomarkers has given a strong impulse to methods for
ranking genes in terms of their importance as predictor
variables in the construction of classification models from
arrays [1-6].
In this paper, we address the problem of developing a
practical methodology for gene ranking based on the
support vector machine classifier (SVM), a machine learning
method that is considered particularly suitable in the
classification of microarray data [7-9]. A typical prediction
task for the methodology would be the identification of
patients resistant to a therapy or the definition of a
'terminal signature', a set of genes and a decision rule
identifying short-term survivors who might benefit from specific
therapies [10,11]. For example, recent results have shown
that the clinical outcomes of high grade gliomas [12] and
of cutaneous T cell lymphoma [11] may be better
identified by gene expression-based classification than by
histological classification or measures of tumor burden.
The methodology described in this paper is designed to
obtain a list of candidate genes, ranked for importance in
discriminating between classes, and the corresponding
SVM classification model. The method also provides an
honest estimate of the model accuracy on novel cases
(predictive accuracy).
Feature elimination for SVM
We have developed entropy-based recursive feature
elimination (E-RFE) as a non-parametric procedure for
gene ranking that accelerates the standard recursive feature
elimination (RFE) method for SVMs [6] without reducing
accuracy. The RFE procedure for SVM has
been evaluated in experimental analyses [13] and it is
considered a relevant method for gene selection and
classification on microarrays. However, RFE for SVM has high
computational costs. At each model building step, a pair
(classifier, ranked gene set) is constructed from samples in
a training set and evaluated on a test set, where training
and test are subsets of the data available for development
at this step. The contribution of each variable is defined
through a function of the corresponding weight
coefficient that appears in the formula defining the SVM model.
The elimination of a single variable at each step (as in the
basic RFE procedure) is, however, inefficient. In a typical
microarray study, thousands of genes have very low SVM
weights in the initial steps. An alternative is to remove
several genes at once, either a fixed fraction of the genes
(decimation) or a number set by a parametric rule (e.g. the
square root function). Such basic parametric acceleration
techniques, as well as gradient-based methods, have been
proposed in machine learning studies [6,14,15], showing that
accuracy close to that of basic RFE may be obtained.
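As a rough illustration of why the elimination schedule matters, the sketch below counts how many model-fitting steps each schedule needs before every gene has been eliminated and ranked. The 10% decimation fraction, the reading of the square-root rule as the square root of the number of remaining genes, and the helper name `steps_to_rank` are assumptions made for this example, not values taken from the paper.

```python
import math


def steps_to_rank(n_genes, n_removed):
    """Count the fitting steps needed to eliminate all genes.

    n_removed(remaining) is a hypothetical callback returning how many
    genes to drop at the current step; at least one gene is always dropped.
    """
    remaining, steps = n_genes, 0
    while remaining > 0:
        drop = max(1, min(remaining, n_removed(remaining)))
        remaining -= drop
        steps += 1
    return steps


# Elimination schedules (the fraction and the sqrt reading are assumptions).
schedules = {
    "basic": lambda r: 1,            # one gene per step, as in basic RFE
    "decimation": lambda r: r // 10, # fixed fraction of the remaining genes
    "square_root": math.isqrt,       # parametric rule on the remaining count
}

n = 10000
counts = {name: steps_to_rank(n, rule) for name, rule in schedules.items()}
for name, steps in counts.items():
    print(f"{name:12s} {steps} fitting steps")
```

One-at-a-time elimination needs as many fits as there are genes by construction, whereas the chunked schedules need far fewer; each fit here stands for one SVM training run.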
The aim of our E-RFE procedure is to provide a more
flexible feature elimination mechanism in which the ranking
is obtained by adaptively discarding chunks of genes
which contribute least to the SVM classifier. In our E-RFE
method, we cautiously discard, according to the entropy
of the weight distribution, several (possibly many) genes
at each step, driving the weight distribution toward a
high-entropy structure of a few equally important variables (see
Methods for details). The procedure should accommodate
the different SVM weight distributions arising from
supervised classification tasks on different microarray
data.
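The exact E-RFE rule is given in the Methods section. Purely as an illustration of what an entropy measure of a weight distribution can look like, the sketch below rescales absolute weights to [0, 1], bins them, and computes the Shannon entropy of the bin frequencies. The rescaling, the default number of bins, and the function name `weight_entropy` are assumptions of this example and are not taken from the paper.

```python
import math


def weight_entropy(weights, n_bins=None):
    """Shannon entropy (in bits) of a histogram of rescaled |weights|.

    Assumed scheme: absolute weights are mapped linearly to [0, 1] and
    split into equal-width bins (default: integer sqrt of the gene count).
    """
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    if n_bins is None:
        n_bins = max(1, math.isqrt(len(mags)))
    if hi == lo:
        return 0.0  # every weight falls into a single bin
    counts = [0] * n_bins
    for m in mags:
        # Clip so that the maximum value lands in the last bin.
        idx = min(int((m - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(mags)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


# Many near-zero weights plus a few large ones: mass piles up in one bin.
skewed = [0.001] * 95 + [1.0] * 5
# Weights spread evenly over their range: mass is shared across bins.
spread = [i / 99 for i in range(100)]

print(weight_entropy(skewed), weight_entropy(spread))
```

Under these assumptions, a low entropy indicates that most genes share similarly small weights, which is the situation in which removing a large chunk at once is plausible; a higher entropy suggests a more cautious step.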
The selection bias problem
As shown in the Results section, the E-RFE method
achieves a speed-up factor of 100 with respect to RFE. It
also produces a faster and more flexible gene elimination
curve than parametric versions of RFE. Finally, feature
elimination with E-RFE does not significantly degrade
accuracy with respect to the slower, one-step RFE.
These results have allowed us to adopt E-RFE for SVM as
the basis for a complete methodology scheme for gene
selection designed to control the "selection bias". This
bias is a methodological flaw that is easily introduced
into gene selection procedures that depend on the
optimization of a classification rule ("wrapper" algorithms).
While this flaw can be reproduced with any wrapper
algorithm, the selection bias is a specific risk for RFE-SVM gene
selection procedures.
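The effect can be reproduced on pure-noise data, where no gene carries any class information. The sketch below is an illustration under simplifying assumptions: a nearest-centroid classifier and a class-mean-difference ranking stand in for SVM and RFE, the data are synthetic Gaussian noise, and all function names are hypothetical. It compares leave-one-out accuracy when genes are selected once on all samples (biased) with accuracy when selection is repeated inside each fold (unbiased).

```python
import random


def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute class-mean difference."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]

    def score(j):
        return abs(sum(x[j] for x in pos) / len(pos)
                   - sum(x[j] for x in neg) / len(neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(X, y, genes, sample):
    """Nearest-centroid prediction restricted to the selected genes."""
    dist = {}
    for label in (0, 1):
        rows = [x for x, l in zip(X, y) if l == label]
        dist[label] = sum(
            (sample[j] - sum(r[j] for r in rows) / len(rows)) ** 2
            for j in genes
        )
    return min(dist, key=dist.get)


def loo_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy with selection inside or outside the loop."""
    genes_all = top_genes(X, y, k)  # sees every sample, including test ones
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, k) if select_inside else genes_all
        hits += predict(X_tr, y_tr, genes, X[i]) == y[i]
    return hits / len(X)


rng = random.Random(0)
n, p, k = 20, 2000, 5  # samples, genes, selected genes (arbitrary choices)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0, 1] * (n // 2)  # labels are unrelated to the data

biased = loo_accuracy(X, y, k, select_inside=False)
unbiased = loo_accuracy(X, y, k, select_inside=True)
print(f"selection outside CV: {biased:.2f}, inside CV: {unbiased:.2f}")
```

Because the labels are unrelated to the data, any accuracy well above chance in the first estimate is an artifact of having used the test samples during gene selection.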
To separate the feature-selection process from the
performance assessment, the bias has to be corrected in the
estimates of prediction error whenever the selected model
is tested on data previously used to find the best features
[16]. Such reuse of test data occurred in several early microarray
studies that discovered very few genes yielding classification
models with negligible or zero error rates ("perfect" or
"near-perfect" classification with very few genes on arrays of
dozens of subjects and up to 20 000 genes). The flaw
unfortunately leaked into the original work on RFE, and it
is still being replicated in different supervised machine
learning approaches [17]. A typical contamination pattern
is the follo (...truncated)