Prediction error estimation: a comparison of resampling methods
BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 15 2005, pages 3301–3307
doi:10.1093/bioinformatics/bti499
Data and text mining
Annette M. Molinaro1,3,∗, Richard Simon2 and Ruth M. Pfeiffer1
1 Biostatistics Branch, Division of Cancer Epidemiology and Genetics and 2 Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH, Rockville, MD 20852, USA and 3 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA
ABSTRACT
Motivation: In genomic studies, thousands of features are collected
on relatively few samples. One of the goals of these studies is to build
classifiers to predict the outcome of future observations. There are
three inherent steps to this process: feature selection, model selection
and prediction assessment. With a focus on prediction assessment,
we compare several methods for estimating the ‘true’ prediction error
of a prediction model in the presence of feature selection.
Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out
cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+
bootstrap have the smallest bias for diagonal discriminant analysis,
nearest neighbor and classification trees. LOOCV and 10-fold CV have
the smallest bias for linear discriminant analysis. Additionally, LOOCV,
5- and 10-fold CV, and the .632+ bootstrap have the lowest mean
square error. The .632+ bootstrap is quite biased in small sample
sizes with strong signal-to-noise ratios. Differences in performance
among resampling methods are reduced as the number of specimens
available increases.
Contact:
Supplementary Information: A complete compilation of results and
R code for simulations and analyses are available in Molinaro et al.
(2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
1 INTRODUCTION
In genomic experiments one frequently encounters high-dimensional
data and small sample sizes. Microarrays simultaneously monitor
expression levels for several thousands of genes. Proteomic profiling studies using SELDI-TOF (surface-enhanced laser desorption
and ionization time-of-flight) measure size and charge of proteins
and protein fragments by mass spectroscopy, and result in up to
15 000 intensity levels at prespecified mass values for each spectrum.
Sample sizes in such experiments are typically <100.
In many studies, observations are known to belong to predetermined classes and the task is to build predictors or classifiers for
new observations whose class is unknown. Deciding which genes or
proteomic measurements to include in the prediction is called feature selection and is a crucial step in developing a class predictor. Including too many noisy variables reduces accuracy of the prediction and may lead to over-fitting of data, resulting in promising but often non-reproducible results (Ransohoff, 2004).
∗ To whom correspondence should be addressed.
Published by Oxford University Press 2005
Another difficulty is model selection with numerous classification
models available. An important step in reporting results is assessing
the chosen model’s error rate, or generalizability. In the absence of
independent validation data, a common approach to estimating predictive accuracy is based on some form of resampling the original
data, e.g. cross-validation. These techniques divide the data into a
learning set and a test set, and range in complexity from the popular learning-test split to v-fold cross-validation, Monte-Carlo v-fold
cross-validation and bootstrap resampling. Few comparisons of
standard resampling methods have been performed to date, and all of
them exhibit limitations that make their conclusions inapplicable to
most genomic settings. Early comparisons of resampling techniques
in the literature are focussed on model selection as opposed to prediction error estimation (Breiman and Spector, 1992; Burman, 1989). In
two recent assessments of resampling techniques for error estimation
(Braga-Neto and Dougherty, 2004; Efron, 2004), feature selection
was not included as part of the resampling procedures, causing the
conclusions to be inappropriate for the high-dimensional setting.
We have performed an extensive comparison of resampling methods to estimate prediction error using simulated (large signal-to-noise
ratio), microarray (intermediate signal-to-noise ratio) and proteomic
data (low signal-to-noise ratio), encompassing increasing sample
sizes with large numbers of features. The impact of feature selection
on the performance of various cross-validation methods is highlighted. The results elucidate the ‘best’ resampling techniques for
future research involving high-dimensional data to avoid overly
optimistic assessment of the performance of a model.
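The role of feature selection inside the resampling loop can be illustrated with a minimal sketch. It is not the authors' simulation code: the data are pure noise with labels unrelated to the features, the classifier is a simple nearest-centroid rule swapped in for illustration, and the helper names (`select_features`, `cv_error`, `select_inside`) are hypothetical. Because the features carry no signal, an honest error estimate should sit near 0.5; selecting features once on the full data before cross-validating typically yields a much lower, overly optimistic figure.

```python
import numpy as np

def select_features(X, y, k):
    """Return indices of the k features with the largest absolute t-like statistic."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(-np.abs(t))[:k]

def fit_centroids(X, y):
    """Class means of the two classes (illustrative nearest-centroid rule)."""
    return np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])

def predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def cv_error(X, y, v=10, k=10, select_inside=True, seed=0):
    """v-fold CV error; features are re-selected per fold only if select_inside."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), v)
    global_idx = select_features(X, y, k)  # used only when select_inside is False
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        idx = select_features(X[train], y[train], k) if select_inside else global_idx
        centroids = fit_centroids(X[train][:, idx], y[train])
        wrong += np.sum(predict(centroids, X[test][:, idx]) != y[test])
    return wrong / len(y)

# Assumed toy setting: 40 samples, 1000 noise features, balanced labels.
rng = np.random.default_rng(1)
n, p = 40, 1000
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)

err_inside = cv_error(X, y, select_inside=True)    # selection repeated in each fold
err_outside = cv_error(X, y, select_inside=False)  # selection done once on all data
print(err_inside, err_outside)
```

The only difference between the two calls is whether the test fold influenced which features were chosen, which is the design point at issue when feature selection is left out of the resampling procedure.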
2 METHODS
In the prediction problem, one observes $n$ independent and identically distributed (i.i.d.) random variables $O_1, \ldots, O_n$ with unknown distribution $P$.
Each observation in $O$ consists of an outcome $Y$ with range $\mathcal{Y}$ and an $l$-vector
of measured covariates, or features, $X$ with range $\mathcal{X}$, such that $O_i = (X_i, Y_i)$,
$i = 1, \ldots, n$. In microarray experiments $X$ includes gene expression measurements, while in proteomic data, it includes the intensities at the mass
over charge (m/z) values. X may also contain covariates such as a patient’s
age and/or histopathologic measurements. The outcome Y may be a continuous measure such as months to disease or a categorical measure such as
disease status.
Received on April 6, 2005; revised on April 28, 2005; accepted on May 12, 2005
Advance Access publication May 19, 2005
The rule in Equation (1) is constructed and evaluated upon the distribution
$P$; as such, $\tilde{\theta}$ is referred to as the asymptotic risk. In reality, however, $P$
is unknown, so the rule based upon the observations $O_1, \ldots, O_n$ has an
expected loss, or conditional risk (also known as the generalization error),
defined as:
$$\tilde{\theta}_n = R(\psi(\cdot \mid P_n), P) = \int L(y, \psi(x \mid P_n)) \, dP(x, y). \tag{2}$$
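As a concrete instance, suppose the outcome is categorical and the loss is the indicator (0-1) loss, $L(y, \psi) = I(y \neq \psi)$. This choice of loss is an assumption made here for illustration. Under it, the conditional risk in Equation (2) is the misclassification probability of the rule built on the observed sample:

$$\tilde{\theta}_n = \int I\big(y \neq \psi(x \mid P_n)\big) \, dP(x, y) = \Pr_P\big(Y \neq \psi(X \mid P_n)\big),$$

where $(X, Y)$ denotes a new observation drawn from $P$ independently of $O_1, \ldots, O_n$.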
There are two motivations for evaluating the conditional risk: model selection and performance assessment. In model selection, the goal is to find the
model that minimizes the conditional risk over a collection of potential models. In performance assessment, the goal is to estimate the generalization
error for a given model, i.e. assess how well it predicts the outcome of an
observation not included in O.
In an ideal setting an independent dataset would be available for the purposes of model selection and estimating the generalization error. Typically,
however, one must use the observed sample O for model building, selection and performance assessment. The simplest method for estimating the
conditional risk is with the resubstitution or apparent error:
$$\hat{\theta}_n^{RS} = R(\psi(\cdot \mid P_n), P_n) = \int L(y, \psi(x \mid P_n)) \, dP_n(x, y). \tag{3}$$
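Since $P_n$ is the empirical distribution, the integral in Equation (3) is the average loss over the same $n$ observations used to build the rule. The following minimal sketch computes it under assumed conditions that are not taken from the paper: pure-noise features, a 0-1 loss, and an illustrative nearest-centroid rule with t-statistic feature selection. A large independent sample, which is only available because the data are simulated, approximates the conditional risk for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 1000, 10  # assumed sample size, feature count, features kept

def draw(m):
    """Simulate m observations: noise features, balanced labels unrelated to X."""
    X = rng.normal(size=(m, p))
    y = np.repeat([0, 1], m // 2)
    return X, y

X, y = draw(n)               # observed sample O_1, ..., O_n
X_new, y_new = draw(2000)    # independent sample standing in for P

# Build the rule on the observed sample: select features, then fit centroids.
a, b = X[y == 0], X[y == 1]
se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
t = (a.mean(axis=0) - b.mean(axis=0)) / se
idx = np.argsort(-np.abs(t))[:k]
centroids = np.stack([a[:, idx].mean(axis=0), b[:, idx].mean(axis=0)])

def rule(Z):
    """Predict the class whose centroid is closest on the selected features."""
    d = ((Z[:, idx][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Resubstitution error: average 0-1 loss over the observations that built the rule.
resub_error = np.mean(rule(X) != y)
# Approximation of the conditional risk from independent data.
new_error = np.mean(rule(X_new) != y_new)
print(resub_error, new_error)
```

With labels unrelated to the features, the error on independent data stays close to 0.5, while the resubstitution figure is typically far lower because the rule is evaluated on the observations it was fitted to.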
Here each o (...truncated)