Improving the Performance of SVM-RFE to Select Genes in Microarray Data
BMC Bioinformatics
Proceedings
Yuanyuan Ding* and Dawn Wilkins*
from The Third Annual Conference of the MidSouth Computational Biology and Bioinformatics Society, Baton Rouge, Louisiana. 2-4 March, 2006
Address: Computer & Information Science Department, The University of Mississippi, University, MS, USA
Background: Recursive Feature Elimination is a common and well-studied method for reducing the number of attributes used for further analysis or development of prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle in using it is the amount of computational power required.
Results: Here we introduce a variant of RFE which employs ideas from simulated annealing. The goal of the algorithm is to improve the computational performance of recursive feature elimination by eliminating chunks of features at a time with as little effect on the quality of the reduced feature set as possible. The algorithm has been tested on several large gene expression data sets. The RFE algorithm is implemented using a Support Vector Machine to assist in identifying the least useful gene(s) to eliminate.
Conclusion: The algorithm is simple and efficient and generates a set of attributes that is very similar to the set produced by RFE.
Background
In many machine learning applications, a prediction is to
be made from a data set of historical information. Gene
expression data sets have been constructed with the goal
of predicting whether or not disease is present (e.g. colon
cancer), or which type of disease exists in the patient. One
of the primary difficulties in working with gene expression
data sets is the large number of attributes (genes). A major
focus of gene expression analysis is in the area of feature
selection or dimension reduction. Most of the algorithms
for elucidating models for prediction are less effective
when the number of genes is too large. There are many
approaches for reducing the size of the feature set, and
among them is recursive feature elimination (RFE). The
idea of RFE is to start with all features, select the "least
useful" feature (using some metric or heuristic), remove that
feature, and repeat until some stopping condition is met.
There are many variations of RFE based on how the
feature to be removed is selected, and when to stop.
RFE is well-studied for use in gene expression studies
[1-3]. Finding an optimal subset of features is
combinatorially prohibitive, so RFE reduces the complexity of feature
selection by being "greedy". That is, once a feature is
selected for removal, it is never reintroduced. Most studies
have found RFE to select very good gene sets, but with
12000 or more genes to select from, when the number of
samples (patients) is large, RFE takes considerable
computation time. Recursive feature elimination is extremely
computationally expensive when only one least useful
feature is removed during each iteration. A modified version
of RFE, RFE-Annealing, is proposed here, aimed at greatly
reducing the computational time required to perform the
RFE ranking process while maintaining comparable
performance with respect to prediction accuracy. Instead of
removing only one feature at a time, RFE-Annealing
removes a set of features each time, with the number of
features removed decreasing in each iteration. As its name
implies, the process is similar to the well-known method
of simulated annealing [4-6].
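
The basic one-feature-at-a-time RFE loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_rfe`, the `usefulness` callback, and the "keep `n_keep` features" stopping condition are all assumptions made for the example. Any metric or heuristic that scores the remaining features could be plugged in.

```python
from typing import Callable, Dict, List, Tuple


def greedy_rfe(
    features: List[str],
    usefulness: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
) -> Tuple[List[str], List[str]]:
    """Remove the least useful feature one at a time.

    `usefulness` is a hypothetical callback that scores every remaining
    feature (higher = more useful). Returns (eliminated, remaining), where
    `eliminated` is in order of removal.
    """
    remaining = list(features)
    eliminated: List[str] = []
    while len(remaining) > n_keep:               # assumed stopping condition
        scores = usefulness(remaining)           # re-score the current subset
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)                  # greedy: never reintroduced
        eliminated.append(worst)
    return eliminated, remaining


# Toy usage with fixed, made-up scores for four genes.
toy_scores = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.3}
eliminated, kept = greedy_rfe(
    list(toy_scores),
    lambda rem: {g: toy_scores[g] for g in rem},
    n_keep=2,
)
```

The scorer is called once per removed feature, which is why the loop becomes expensive when each call means retraining a model on thousands of genes.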
Simulated annealing has its roots in metallurgy and
thermodynamics. Annealing is used in metallurgy to create
materials with fewer defects. In thermodynamics,
annealing is used to find an optimal state which has minimum
energy. The basic process has two steps: heat to a high
temperature and then cool very slowly. The annealing
schedule, which is the heart of the process, defines how to
reduce the temperature during the cooling phase.
Simulated annealing is a metaheuristic (not problem specific)
that is used in many combinatorial optimization
problems. Combinatorial search techniques have difficulty
distinguishing between local minima (or maxima) and global
minima (or maxima). Simulated annealing is used in a
search by selecting a random state to start and using the
annealing schedule to guide the search. Early in the search
there is a higher probability of making a move in the
search space to a solution that is worse than the one
before. This is appropriate since the initial solution was
random and the optimal solution may be far off. As the
search proceeds, the probability of making a move to a
worse solution is decreased slowly. The temperature
decrease corresponds to the decreasing probability of
moving to a worse solution. RFE-Annealing uses the
annealing schedule idea to remove a large number of
genes in the initial iterations (when it is easy to identify
unimportant genes). In later iterations, the number of
genes removed is reduced so that important genes are not
removed. The simple schedule of removing $\frac{1}{i+1}$ of the
remaining genes during iteration $i$ is used. That is, half are
removed in the first iteration, one-third in the second,
one-fourth in the third, and so on. Details of the
algorithm can be found in the Methods section.
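
To make the schedule concrete, the sketch below tracks how many genes remain after each iteration when a fraction 1/(i + 1) of the remaining genes is removed in iteration i. The rounding rule (integer floor, but always at least one gene) and the stopping point `n_keep` are assumptions for illustration; the text above does not specify them.

```python
from typing import List


def annealing_schedule(n_genes: int, n_keep: int = 1) -> List[int]:
    """Number of genes remaining after each iteration of the 1/(i+1) schedule."""
    remaining = n_genes
    history: List[int] = []
    i = 1
    while remaining > n_keep:
        n_remove = max(1, remaining // (i + 1))       # assumed rounding: floor, min 1
        n_remove = min(n_remove, remaining - n_keep)  # never go below n_keep
        remaining -= n_remove
        history.append(remaining)
        i += 1
    return history


small = annealing_schedule(12)      # half, then a third, then a fourth, ...
large = annealing_schedule(12600)   # same schedule for 12,600 genes
```

Ignoring rounding, the fraction left after iteration i is (1/2)(2/3)...(i/(i + 1)) = 1/(i + 1), so roughly n/(i + 1) of the original n genes remain. The chunks shrink steadily, and far fewer iterations are needed than when one gene is removed at a time.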
Vladimir Vapnik invented Support Vector Machines
(SVMs) in 1979 [7]. SVMs often achieve superior
classification performance compared to other learning
algorithms across most domains and tasks. They are efficient
enough to handle very large-scale classification in both
number of samples and number of variables [8]. SVMs are
generated in two steps. First, the data vectors are mapped
to a high-dimensional space. Second, the SVM tries to find
a hyperplane in this new space with maximum margin
separating the classes of data. Sometimes it is not possible
to find a separating hyperplane even in a very
high-dimensional space. In this case, a trade-off is introduced between
the size of the separating margin and penalties for every
vector that is within the margin [9]. The margin denotes
the distance from the boundary to the closest data point
in the feature space.
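
The trade-off mentioned above is commonly written as the soft-margin optimization problem below. The notation is the standard textbook one and does not come from this text: $w$ and $b$ define the hyperplane, $\phi$ is the mapping to the high-dimensional space, $y_j \in \{-1, +1\}$ is the class label of sample $x_j$, $\xi_j$ is the slack for that sample, and $C$ is the penalty parameter. Whether reference [9] uses exactly this form is an assumption.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{j=1}^{n} \xi_j
\quad \text{subject to} \quad
y_j \bigl( w \cdot \phi(x_j) + b \bigr) \ge 1 - \xi_j,
\qquad \xi_j \ge 0, \quad j = 1, \dots, n
```

In this form the margin is $1/\lVert w \rVert$, so minimizing $\lVert w \rVert^{2}$ widens it, while a larger $C$ puts more weight on the penalties $\xi_j$ incurred by vectors that fall inside the margin.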
In its simplest, linear form, an SVM is a hyperplane that
separates two classes of examples (positive and negative)
with maximum margin (see Figure 1). The SVM creates
and outputs a weight vector, where each dimension
(feature) is assigned a weight. The weight vector is used to
determine the least important feature, which is defined to
be the one with the smallest weight magnitude in the weight vector.
The least important feature is selected for removal in each
iteration of the recursive feature elimination procedure.
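
The selection step can be illustrated with a small helper that picks the lowest-ranked features from a weight vector and drops them from the data. This is a sketch under stated assumptions: "smallest weight" is read as smallest squared weight, which is the usual SVM-RFE ranking criterion, and the names `least_important` and `drop_features` are hypothetical. The weight vector is hard-coded rather than produced by a trained SVM.

```python
from typing import List, Sequence


def least_important(weights: Sequence[float], k: int) -> List[int]:
    """Indices of the k features with the smallest squared weight (assumed criterion)."""
    order = sorted(range(len(weights)), key=lambda j: weights[j] ** 2)
    return sorted(order[:k])


def drop_features(rows: Sequence[Sequence[float]], drop: Sequence[int]) -> List[List[float]]:
    """Return a copy of the data matrix without the listed feature columns."""
    dropped = set(drop)
    return [[v for j, v in enumerate(row) if j not in dropped] for row in rows]


# Made-up weight vector for four genes; the sign does not matter, only the size.
weights = [0.8, -0.05, -1.2, 0.1]
to_remove = least_important(weights, k=2)
reduced = drop_features([[1, 2, 3, 4], [5, 6, 7, 8]], to_remove)
```

With `k=1` this corresponds to removing a single feature per iteration; a larger `k` removes a chunk of features from one trained model.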
Results and Discussion
Results
Data Sets
We analyzed three well-known data sets: 1) the data of
Bhattacharjee et al. [10], which is a set of 12,600 gene
expression measurements (Affymetrix oligonucleotide
arrays) per patient from 203 patients with normal subjects
and four subtypes of lung carcinomas; 2) a colon cancer
dat (...truncated)