Prediction error estimation: a comparison of resampling methods (pdf)



BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 15 2005, pages 3301–3307
doi:10.1093/bioinformatics/bti499

Data and text mining

Prediction error estimation: a comparison of resampling methods

Annette M. Molinaro1,3,∗, Richard Simon2 and Ruth M. Pfeiffer1

1 Biostatistics Branch, Division of Cancer Epidemiology and Genetics and 2 Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH, Rockville, MD 20852, USA and 3 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA

ABSTRACT

Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the ‘true’ prediction error of a prediction model in the presence of feature selection.

Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.

Contact:

Supplementary Information: A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
∗ To whom correspondence should be addressed.
Published by Oxford University Press 2005

1 INTRODUCTION

In genomic experiments one frequently encounters high dimensional data and small sample sizes. Microarrays simultaneously monitor expression levels for several thousands of genes. Proteomic profiling studies using SELDI-TOF (surface-enhanced laser desorption and ionization time-of-flight) measure size and charge of proteins and protein fragments by mass spectroscopy, and result in up to 15 000 intensity levels at prespecified mass values for each spectrum. Sample sizes in such experiments are typically <100.

In many studies, observations are known to belong to predetermined classes and the task is to build predictors or classifiers for new observations whose class is unknown. Deciding which genes or proteomic measurements to include in the prediction is called feature selection and is a crucial step in developing a class predictor. Including too many noisy variables reduces accuracy of the prediction and may lead to over-fitting of data, resulting in promising but often non-reproducible results (Ransohoff, 2004).

Another difficulty is model selection with numerous classification models available. An important step in reporting results is assessing the chosen model’s error rate, or generalizability. In the absence of independent validation data, a common approach to estimating predictive accuracy is based on some form of resampling the original data, e.g. cross-validation. These techniques divide the data into a learning set and a test set, and range in complexity from the popular learning-test split to v-fold cross-validation, Monte-Carlo v-fold cross-validation and bootstrap resampling.

Few comparisons of standard resampling methods have been performed to date, and all of them exhibit limitations that make their conclusions inapplicable to most genomic settings.
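The learning-set/test-set partitions named above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' R code; the function names (`split_sample`, `v_fold`, `bootstrap`), the sample size of 40 and the one-third test fraction are assumptions chosen only for the example.

```python
import numpy as np

def split_sample(n, test_fraction, rng):
    """Single learning-test split: returns (learning_idx, test_idx)."""
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    return perm[n_test:], perm[:n_test]

def v_fold(n, v, rng):
    """v-fold CV: yields (learning_idx, test_idx); every index is tested exactly once."""
    folds = np.array_split(rng.permutation(n), v)
    for k in range(v):
        learning = np.concatenate([folds[j] for j in range(v) if j != k])
        yield learning, folds[k]

def bootstrap(n, rng):
    """Bootstrap: learning set drawn with replacement; test set is the left-out observations."""
    learning = rng.integers(0, n, size=n)
    test = np.setdiff1d(np.arange(n), learning)
    return learning, test

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
n = 40                           # hypothetical sample size
learn, test = split_sample(n, 1 / 3, rng)
cv_splits = list(v_fold(n, 10, rng))
boot_learn, boot_test = bootstrap(n, rng)
```

Setting `v = n` in `v_fold` gives leave-one-out cross-validation, and repeating `split_sample` with fresh random permutations is one way to approximate the Monte-Carlo variant.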
Early comparisons of resampling techniques in the literature are focussed on model selection as opposed to prediction error estimation (Breiman and Spector, 1992; Burman, 1989). In two recent assessments of resampling techniques for error estimation (Braga-Neto and Dougherty, 2004; Efron, 2004), feature selection was not included as part of the resampling procedures, causing the conclusions to be inappropriate for the high-dimensional setting.

We have performed an extensive comparison of resampling methods to estimate prediction error using simulated (large signal-to-noise ratio), microarray (intermediate signal-to-noise ratio) and proteomic data (low signal-to-noise ratio), encompassing increasing sample sizes with large numbers of features. The impact of feature selection on the performance of various cross-validation methods is highlighted. The results elucidate the ‘best’ resampling techniques for future research involving high dimensional data to avoid overly optimistic assessment of the performance of a model.

Received on April 6, 2005; revised on April 28, 2005; accepted on May 12, 2005
Advance Access publication May 19, 2005

2 METHODS

In the prediction problem, one observes $n$ independent and identically distributed (i.i.d.) random variables $O_1, \ldots, O_n$ with unknown distribution $P$. Each observation in $O$ consists of an outcome $Y$ with range $\mathcal{Y}$ and an $l$-vector of measured covariates, or features, $X$ with range $\mathcal{X}$, such that $O_i = (X_i, Y_i)$, $i = 1, \ldots, n$. In microarray experiments $X$ includes gene expression measurements, while in proteomic data, it includes the intensities at the mass over charge (m/z) values. $X$ may also contain covariates such as a patient’s age and/or histopathologic measurements. The outcome $Y$ may be a continuous measure such as months to disease or a categorical measure such as disease status.
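To make concrete what it means to include feature selection in the resampling procedure, the sketch below runs v-fold cross-validation on observations $(X_i, Y_i)$ in two ways: re-selecting features within each learning set, or selecting them once from all $n$ observations. Everything here is an illustrative assumption and not the authors' simulation: the pure-noise data, the mean-difference ranking in `select_features`, and the nearest-centroid classifier are stand-ins chosen for brevity.

```python
import numpy as np

def select_features(X, y, k):
    """Rank features by absolute standardized difference in class means; return top-k indices."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    score = np.abs(m1 - m0) / (X.std(axis=0) + 1e-12)
    return np.argsort(score)[-k:]

def nearest_centroid_predict(X_learn, y_learn, X_test):
    """Assign each test observation to the class with the closer learning-set centroid."""
    c0 = X_learn[y_learn == 0].mean(axis=0)
    c1 = X_learn[y_learn == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, v, k, rng, select_inside=True):
    """v-fold CV misclassification rate, with feature selection inside or outside the folds."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), v)
    # Selecting outside the loop uses all n observations, so test data leak into the model.
    fixed = None if select_inside else select_features(X, y, k)
    errors = 0
    for test in folds:
        learn = np.setdiff1d(np.arange(n), test)
        feats = select_features(X[learn], y[learn], k) if select_inside else fixed
        pred = nearest_centroid_predict(X[learn][:, feats], y[learn], X[test][:, feats])
        errors += int((pred != y[test]).sum())
    return errors / n

# Hypothetical data: 40 samples, 1000 pure-noise features, so the true error rate is 50%.
X = np.random.default_rng(1).normal(size=(40, 1000))
y = np.repeat([0, 1], 20)
err_inside = cv_error(X, y, v=10, k=10, rng=np.random.default_rng(2), select_inside=True)
err_outside = cv_error(X, y, v=10, k=10, rng=np.random.default_rng(2), select_inside=False)
print(f"selection inside folds: {err_inside:.3f}, outside folds: {err_outside:.3f}")
```

On noise data of this kind, the estimate with selection outside the folds tends to fall well below the 50% true error rate, while the estimate with selection inside the folds tends to stay near it. That gap is the optimism that arises when feature selection is left out of the resampling.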
The rule in Equation (1) is constructed and evaluated upon the distribution $P$; as such, $\tilde{\theta}$ is referred to as the asymptotic risk. However, in reality $P$ is unknown; thus the rule based upon the observations $O_1, \ldots, O_n$ has an expected loss, or conditional risk (also known as the generalization error), defined as:

$$\tilde{\theta}_n = R(\psi(\cdot \mid P_n), P) = \int L(y, \psi(x \mid P_n)) \, dP(x, y). \tag{2}$$

There are two impetuses for evaluating the conditional risk: model selection and performance assessment. In model selection, the goal is to find the model that minimizes the conditional risk over a collection of potential models. In performance assessment, the goal is to estimate the generalization error for a given model, i.e. to assess how well it predicts the outcome of an observation not included in $O$.

In an ideal setting an independent dataset would be available for the purposes of model selection and estimating the generalization error. Typically, however, one must use the observed sample $O$ for model building, selection and performance assessment. The simplest method for estimating the conditional risk is with the resubstitution or apparent error:

$$\hat{\theta}_n^{RS} = R(\psi(\cdot \mid P_n), P_n) = \int L(y, \psi(x \mid P_n)) \, dP_n(x, y). \tag{3}$$

Here each o (...truncated)