Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions
Fernandez-Reyes D (2008) Estimation of Relevant Variables on High-Dimensional Biological Patterns
Using Iterated Weighted Kernel Functions. PLoS ONE 3(3): e1806. doi:10.1371/journal.pone.0001806
Sergio Rojas-Galeano
Emily Hsieh
Dan Agranoff
Sanjeev Krishna
Delmiro Fernandez-Reyes
Gustavo Stolovitzky, IBM Thomas J. Watson Research Center, United States of America
1 Division of Parasitology, National Institute for Medical Research, London, United Kingdom, 2 Department of Computer Science, University College London, London, United Kingdom, 3 Department of Infectious Diseases and Immunity, Faculty of Medicine, Imperial College London, London, United Kingdom, 4 Division of Cellular and Molecular Medicine, Centre for Infection, St George's University of London, London, United Kingdom
Background: The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.
Methodology/Principal Findings: Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities several orders of magnitude higher, as shown with gene expression microarray datasets in which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.
Conclusions: Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
Funding: The study was funded by The Medical Research Council, United Kingdom. The sponsor of the study had no direct role in study design, data collection,
data analysis, data interpretation, or writing of the report.
Competing Interests: The authors have declared that no competing interests exist.
High-throughput genomic and proteomic screening of
biological samples produces large data arrays [1-3] characterizing
instances of two different conditions in a very high dimensional
space; that is, the space consisting of a vast number of
observations or variables that are acquired for each sample.
This is the case for mass spectrometry profiles of complex protein
mixtures with hundreds of measures of mass-to-charge ratios for
polypeptide chains detected in samples such as serum, or
genomic microarray studies profiling tens of thousands of genes
expressed in tissue samples. The computational analysis of these
biological datasets involves the discovery of informative patterns
between sample instances and the identification of the specific
biomarkers of disease. These analyses facilitate the design of new
diagnostic tests or can be used to focus further biological research
on specific drug or vaccine candidate molecules. Intuitively, such
patterns should not span the entire spectrum of observations but
ought to be encoded in a few relevant variables, with the
remainder representing noise. The search for such a subset of
relevant variables would imply an exhaustive test of all possible
combinations, a task that even for the dimensionality of serum
proteomic datasets would prove unfeasible. The computational
complexity of such searches increases exponentially with the
number of variables; the problem is known to be NP-complete and
hence computationally intractable [4,5]. Consequently,
heuristic methods that aim to select an approximately best
variable subset must be considered.
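As a rough illustration of the exponential growth described above, the following sketch counts how many candidate subsets an exhaustive search would have to score. The function names are hypothetical and the variable counts are arbitrary examples, not figures taken from the datasets in this study.

```python
from math import comb


def count_nonempty_subsets(n_variables: int) -> int:
    """Number of non-empty variable subsets: 2^n - 1."""
    return 2 ** n_variables - 1


def count_subsets_of_size(n_variables: int, subset_size: int) -> int:
    """Number of subsets when the subset size is fixed in advance: C(n, k)."""
    return comb(n_variables, subset_size)


# Illustrative variable counts only (assumed, not from the paper).
print(count_nonempty_subsets(20))  # 1048575
for n in (50, 200):
    # Number of decimal digits in the subset count, to show the growth.
    print(n, len(str(count_nonempty_subsets(n))))
```

Even when the subset size is fixed, `count_subsets_of_size` grows too quickly for exhaustive evaluation at realistic dimensionalities, which is why heuristic search is needed.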
There are two approaches to variable selection: filter and
wrapper methods [6]. Filter methods rank the complete set of
variables with a given criterion, independently from the applied
classifier. They have been widely used in the analysis of proteomic
signatures of diseases such as prostate cancer, sleeping sickness and
tuberculosis [7-9]. Several variants that have also been applied
to genomic cancer datasets include lists of permutations of
significant variables that are filtered by genetic algorithms (GA)
coupled with support vector machines (SVMs) [10-13]. Wrapper
methodologies, on the other hand, implicitly use the classifier to
evaluate variables according to their contribution to its predictive
power. Although variable selection using wrapper strategies may
incur extra computational costs, this is compensated by the ability
to explore complex associations between variables detected within
the intrinsic patterns incorporated in the discrimination rules.
Recursive feature elimination (RFE) uses SVM functions to
iteratively rank variables and discard the least relevant ones via a greedy search
and has been applied to cancer microarray datasets [14-18]. The
main drawback of this approach lies in the greedy strategy that
may disrupt relationships between variables discarded in different
stages of the algorithm, leading to sub-optimal selected subsets. To
sidestep this difficulty, an alternative approach combines weighted
kernels with SVMs [19-22]; this approach assigns a weight to each
variable to indicate its relevance. In [19] the weight vector is
computed using a gradient-descent formulation, which uses
bounds on the expected generalization error of the SVM.
However, the applicability of this method is restricted by
assumptions that require the kernel and objective functions to be
continuous and differentiable and the dataset to be
separable. In previous work [22] we adapted the
weighted-kernel SVM by using a GA instead of gradient descent,
with the aim of improving model selection on weighted radial basis
kernels rather than selecting variables. In a similar direction, a
recent technique using evolutionary strategies to adjust both
scaling and orientation of generalized Gaussian kernels in SVMs
has been reported [23]; the evolved matrices, however, must be
constrained to meet the requirements of proper kernels and,
again, the aim is to improve classification performance
rather than to select variables.
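To make the idea of a weighted kernel concrete, the sketch below implements one common form of a per-variable weighted radial basis kernel, k(x, z) = exp(-gamma * sum_j w_j (x_j - z_j)^2). This particular form, the function name `weighted_rbf_kernel` and the parameter `gamma` are assumptions made for illustration; the exact formulations used in [19-22] may differ.

```python
import numpy as np


def weighted_rbf_kernel(X, Z, weights, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * sum_j w_j * (x_j - z_j)**2).

    Assumed illustrative form: each variable j has a non-negative
    weight w_j, so a weight of zero removes that variable from the kernel.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("weights must be non-negative")
    # Pairwise differences, shape (n_x, n_z, n_variables).
    diff = X[:, None, :] - Z[None, :, :]
    # Weighted squared distance for every pair of rows.
    sq_dist = np.einsum("ijk,k->ij", diff ** 2, w)
    return np.exp(-gamma * sq_dist)


# Two samples that differ in all three variables (made-up values).
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 5.0, -3.0]])

# Only the first variable is weighted; the other two are ignored.
K = weighted_rbf_kernel(X, X, weights=[1.0, 0.0, 0.0], gamma=0.5)
print(K)
```

With the weights above, the large differences in the second and third variables have no effect on the kernel value, which is how a weight vector can express the relevance of each variable to the discriminant.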
The wrapper method we describe in (...truncated)