Entropy-based gene ranking without selection bias for the predictive classification of microarray data (pdf)

http://www.biomedcentral.com/content/pdf/1471-2105-4-54.pdf

BMC Bioinformatics

Cesare Furlanello, Maria Serafini, Stefano Merler, Giuseppe Jurman

Address: ITC-irst, Trento, Italy

Background: We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process).

Results: With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles.

Conclusions: Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.

Background

The study of gene expression patterns is expected to enable significant advances in disease diagnosis and prognosis.
The main objectives of a discovery process based on microarray data are the understanding of the molecular pathways of diseases, their early detection, and the development of measures of individual responsiveness to existing or new therapies. In particular, the prospect of providing new targets for therapy and of developing clinical biomarkers has given a strong impetus to methods for ranking genes in terms of their importance as predictor variables in the construction of classification models from arrays [1-6].

In this paper, we address the problem of developing a practical methodology for gene ranking based on the support vector machine classifier (SVM), a machine learning method that is considered particularly suitable for the classification of microarray data [7-9]. A typical prediction task for the methodology would be the identification of patients resistant to a therapy or the definition of a 'terminal signature', a set of genes and a decision rule identifying short-term survivors who might benefit from specific therapies [10,11]. For example, recent results have shown that the clinical outcomes of high grade gliomas [12] and of cutaneous T cell lymphoma [11] may be better identified by gene expression-based classification than by histological classification or measures of tumor burden.

The methodology described in this paper is designed to obtain a list of candidate genes, ranked for importance in discriminating between classes, and the corresponding SVM classification model. The method also provides an honest estimate of the model accuracy on novel cases (predictive accuracy).

Feature elimination for SVM

We have developed the entropy-based recursive feature elimination (E-RFE) as a non-parametric procedure for gene ranking, which accelerates the standard recursive feature elimination (RFE) method for SVMs [6] without reducing accuracy.
The RFE procedure for SVM has been evaluated in experimental analyses [13] and it is considered a relevant method for gene selection and classification on microarrays. However, RFE for SVM has high computational costs. At each model building step, a pair (classifier, ranked gene set) is constructed from samples in a training set and evaluated on a test set, where training and test are subsets of the data available for development at this step. The contribution of each variable is defined through a function of the corresponding weight coefficient that appears in the formula defining the SVM model. The elimination of a single variable at each step (as in the basic RFE procedure) is, however, inefficient: in a typical microarray study, thousands of genes have very low SVM weights in the initial steps. An alternative is to remove several genes simultaneously, either as a fixed fraction of the genes (decimation) or according to a parametric rule (e.g. the square root function). These basic parametric acceleration techniques, as well as gradient-based methods, have been proposed in machine learning studies [6,14,15], showing that accuracy close to basic RFE may be obtained.

The aim of our E-RFE procedure is to provide a more flexible feature elimination mechanism in which the ranking is obtained by adaptively discarding chunks of genes which contribute least to the SVM classifier. In our E-RFE method, we cautiously discard, according to the entropy of the weight distribution, several (possibly many) genes at each step to drive the weight distribution toward a high-entropy structure of few equally important variables (see Methods for details). The procedure should accommodate the different SVM weight distributions arising from supervised classification tasks on different microarray data.

The selection bias problem

As shown in the Results section, the E-RFE method achieves a speed-up factor of 100 with respect to RFE.
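
To give a rough sense of where a speed-up of this order can come from, the sketch below counts how many times a model would have to be refitted under the elimination schedules mentioned above: one gene per step (basic RFE), a square root rule, and removal of a fixed fraction. It is only an illustration under stated assumptions. The function name count_steps, the choice of one half as the fixed fraction, and the starting size of 20 000 genes are hypothetical, and the entropy-based rule of E-RFE (described in Methods) is not implemented here.

```python
import math


def count_steps(n_genes, n_remove, n_keep=1):
    """Count the refitting steps needed to shrink n_genes down to n_keep.

    n_remove(n) is a hypothetical schedule: it returns how many genes
    to discard when n genes are still in the model.
    """
    steps = 0
    n = n_genes
    while n > n_keep:
        # Always remove at least one gene, never go below n_keep.
        drop = max(1, min(n_remove(n), n - n_keep))
        n -= drop
        steps += 1
    return steps


# Illustrative schedules (assumptions, not the authors' settings).
schedules = {
    "one gene per step (basic RFE)": lambda n: 1,
    "square root rule": lambda n: math.isqrt(n),
    "fixed fraction (one half)": lambda n: n // 2,
}

for name, rule in schedules.items():
    print(f"{name}: {count_steps(20000, rule)} refitting steps")
```

Each refitting step corresponds to one SVM training run, so the step count is a proxy for computational cost. Removing more genes per step lowers the count but risks discarding informative genes early, which is the trade-off that an adaptive rule such as E-RFE is meant to handle.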
E-RFE also produces a faster and more flexible gene elimination curve than parametric versions of RFE. Finally, feature elimination with E-RFE does not significantly degrade accuracy with respect to the slower, one-step RFE. These results have allowed us to adopt E-RFE for SVM as the basis for a complete methodology for gene selection designed to control the "selection bias". This bias is a methodological flaw that is easily introduced within gene selection procedures that depend on the optimization of a classification rule ("wrapper" algorithms). While this flaw can be reproduced with any wrapper algorithm, the selection bias is a specific risk for RFE-SVM gene selection procedures. To separate the feature-selection process from the performance assessment, the bias has to be corrected in the estimates of prediction error whenever the selected model is tested on data previously used to find the best features [16]. This flaw occurred in several early studies on microarrays that discovered very few genes yielding classification models with negligible or zero error rates ("perfect" or "near-perfect" classification with very few genes on arrays of dozens of subjects and up to 20 000 genes). The flaw unfortunately leaked into the original work on RFE, and it is still being replicated in different supervised machine learning approaches [17]. A typical contamination pattern is the follo (...truncated)