Improving the Performance of SVM-RFE to Select Genes in Microarray Data

http://www.biomedcentral.com/content/pdf/1471-2105-7-S2-S12.pdf

BMC Bioinformatics
Proceedings

Yuanyuan Ding* and Dawn Wilkins*

From The Third Annual Conference of the MidSouth Computational Biology and Bioinformatics Society, Baton Rouge, Louisiana. 2-4 March, 2006

Address: Computer & Information Science Department, The University of Mississippi, University, MS, USA

Background: Recursive Feature Elimination is a common and well-studied method for reducing the number of attributes used for further analysis or development of prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle in using it is the amount of computational power required.

Results: Here we introduce a variant of RFE which employs ideas from simulated annealing. The goal of the algorithm is to improve the computational performance of recursive feature elimination by eliminating chunks of features at a time with as little effect on the quality of the reduced feature set as possible. The algorithm has been tested on several large gene expression data sets. The RFE algorithm is implemented using a Support Vector Machine to assist in identifying the least useful gene(s) to eliminate.

Conclusion: The algorithm is simple and efficient and generates a set of attributes that is very similar to the set produced by RFE.

Background

In many machine learning applications, a prediction is to be made from a data set of historical information. Gene expression data sets have been constructed with the goal of predicting whether or not disease is present (e.g. colon cancer), or which type of disease exists in the patient. One of the primary difficulties in working with gene expression data sets is the large number of attributes (genes). A major focus of gene expression analysis is in the area of feature selection or dimension reduction.
Most of the algorithms for elucidating models for prediction are less effective when the number of genes is too large. There are many approaches for reducing the size of the feature set, and among them is recursive feature elimination (RFE). The idea of RFE is to start with all features, select the "least useful" feature (using some metric or heuristic), remove that feature, and repeat until some stopping condition is met. There are many variations of RFE, which differ in how the feature to be removed is selected and in when to stop. RFE is well-studied for use in gene expression studies [1-3]. Finding an optimal subset of features is combinatorially prohibitive, so RFE reduces the complexity of feature selection by being "greedy": once a feature is selected for removal, it is never reintroduced. Most studies have found that RFE selects very good gene sets, but with 12000 or more genes to select from and a large number of samples (patients), RFE takes considerable computation time.

Recursive feature elimination is extremely computationally expensive when only the single least useful feature is removed during each iteration. A modified version of RFE, RFE-Annealing, is proposed here, aimed at greatly reducing the computational time required to perform the RFE ranking process while maintaining comparable performance with respect to prediction accuracy. Instead of removing only one feature at a time, RFE-Annealing removes a set of features each time, with the number of features removed decreasing in each iteration.

As its name implies, the process is similar to the well-known method of simulated annealing [4-6]. Simulated annealing has its roots in metallurgy and thermodynamics. Annealing is used in metallurgy to create materials with fewer defects. In thermodynamics, annealing is used to find an optimal state which has minimum energy. The basic process has two steps: heat to a high temperature and then cool very slowly.
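
The one-feature-at-a-time RFE loop described earlier in this section can be sketched as follows. This is only an illustration under stated assumptions: the function name `recursive_feature_elimination`, the `score_features` callback, and the toy usefulness scores are hypothetical and do not come from the paper, which scores genes with an SVM rather than a fixed table.

```python
def recursive_feature_elimination(features, score_features, n_keep):
    """Greedy RFE sketch: repeatedly drop the lowest-scoring feature.

    features       -- iterable of feature names (e.g. gene identifiers)
    score_features -- hypothetical callback; given the remaining features,
                      returns one usefulness score per feature (higher = better)
    n_keep         -- stop once this many features remain
    """
    remaining = list(features)
    eliminated = []  # features in the order they were removed
    while len(remaining) > n_keep:
        # Re-score on every pass, because scores may change as features go.
        scores = score_features(remaining)
        worst = min(range(len(remaining)), key=lambda j: scores[j])
        # Greedy step: a removed feature is never reintroduced.
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy usage with made-up, fixed usefulness scores (assumption for illustration).
usefulness = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
kept, dropped = recursive_feature_elimination(
    usefulness, lambda feats: [usefulness[f] for f in feats], n_keep=2
)
print(kept, dropped)  # ['a', 'c'] ['b', 'd']
```

Because the scoring step is repeated once per removed feature, reducing n features to a small set costs on the order of n scoring passes, which is the expense the text attributes to one-at-a-time elimination.
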
The annealing schedule, which is the heart of the process, defines how to reduce the temperature during the cooling phase.

Simulated annealing is a metaheuristic (not problem specific) that is used in many combinatorial optimization problems. Combinatorial search techniques have difficulty distinguishing between local minima (or maxima) and global minima (or maxima). Simulated annealing is used in a search by selecting a random state to start and using the annealing schedule to guide the search. Early in the search there is a higher probability of making a move in the search space to a solution that is worse than the one before. This is appropriate since the initial solution was random and the optimal solution may be far off. As the search proceeds, the probability of making a move to a worse solution is decreased slowly. The temperature decrease corresponds to the decreasing probability of moving to a worse solution.

RFE-Annealing uses the annealing schedule idea to remove a large number of genes in the initial iterations (when it is easy to identify unimportant genes). In later iterations, the number of genes removed is reduced so that important genes are not removed. The simple schedule of removing 1/(i + 1) of the remaining genes during iteration i is used. That is, half are removed in the first iteration, one-third in the second, one-fourth in the third, and so on. Details of the algorithm can be found in the Methods section.

Vladimir Vapnik invented Support Vector Machines (SVMs) in 1979 [7]. SVMs often achieve superior classification performance compared to other learning algorithms across most domains and tasks. They are efficient enough to handle very large-scale classification in both number of samples and number of variables [8]. SVMs are generated in two steps. First, the data vectors are mapped to a high-dimensional space. Second, the SVM tries to find a hyperplane in this new space with maximum margin separating the classes of data.
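
To make the RFE-Annealing removal schedule concrete, the sketch below lists how many genes remain after each iteration when a fraction 1/(i + 1) of the remaining genes is removed at iteration i. Ignoring rounding, the surviving fraction after k iterations telescopes to (1/2)(2/3)...(k/(k + 1)) = 1/(k + 1), so roughly n/(k + 1) genes remain. The function name `annealing_schedule`, the rounding rule (round down, but remove at least one gene), and the stopping target `n_target` are assumptions made for illustration; the paper's Methods section may handle these details differently.

```python
def annealing_schedule(n_features, n_target=1):
    """Feature counts remaining after each RFE-Annealing iteration (sketch).

    At iteration i (starting from 1), about 1/(i + 1) of the remaining
    features are removed. Rounding down, removing at least one feature,
    and never going below n_target are assumptions, not the paper's rules.
    """
    if n_target < 1 or n_features < n_target:
        raise ValueError("need 1 <= n_target <= n_features")
    sizes = [n_features]
    remaining = n_features
    i = 1
    while remaining > n_target:
        n_remove = max(1, remaining // (i + 1))          # at least one per pass
        n_remove = min(n_remove, remaining - n_target)   # do not overshoot
        remaining -= n_remove
        sizes.append(remaining)
        i += 1
    return sizes


print(annealing_schedule(12))  # [12, 6, 4, 3, 2, 1]
```

Under these assumptions, each entry after the first corresponds to one retraining of the classifier, so the length of the list gives a rough measure of cost compared with the one retraining per gene needed when a single gene is removed at a time.
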
Sometimes it is not possible to find a separating hyperplane even in a very high-dimensional space. In this case, a trade-off is introduced between the size of the separating margin and penalties for every vector that is within the margin [9]. The margin denotes the distance from the boundary to the closest data point in the feature space. In its simplest, linear form, an SVM is a hyperplane that separates two classes of examples (positive and negative) with maximum margin (see Figure 1). The SVM creates and outputs a weight vector, where each dimension (feature) is assigned a weight. The weight vector is used to determine the least important feature, which is defined to be the one with the smallest weight in the weight vector. The least important feature is selected for removal in each iteration of the recursive feature elimination procedure.

Results and Discussion

Results

Data Sets

We analyzed three well-known data sets: 1) the data of Bhattacharjee et al. [10], which is a set of 12,600 gene expression measurements (Affymetrix oligonucleotide arrays) per patient from 203 patients, comprising normal subjects and four subtypes of lung carcinomas; 2) a colon cancer dat (...truncated)