Optimally splitting cases for training and testing high dimensional classifiers (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1755-8794-4-31.pdf

Optimally splitting cases for training and testing high dimensional classifiers

BMC Medical Genomics Optimally splitting cases for training and testing high dimensional classifiers Kevin K Dobbin 0 Richard M Simon 1 0 Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia , Athens, GA , USA 1 Biometric Research Branch, National Cancer Institute, National Institutes of Health , Rockville, MD , USA Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy - with higher accuracy and smaller n resulting in more assigned to the training set. The commonly used strategy of allocating 2/3rd of cases for training was close to optimal for reasonable sized datasets (n 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determing the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. - Background The split sample approach is a widely used study design in high dimensional settings. This design divides the collection into a training set and a test set as a means of estimating classification accuracy. A classifier is developed on the training set and applied to each sample in the test set. In practice, statistical prediction models have often been developed without separating the data used for model development from the data used for estimation of prediction accuracy [1]. When the number of candidate predictors (p) is larger than the number of cases as in microarray data, such separation is essential to avoid large bias in estimation of prediction accuracy [2]. This paper addresses the question of how to optimally split a sample into a training set and a test set for a high dimensional gene expression study, that is, how many samples to allocate to each group. Two approaches to evaluating splits of the data are examined. The first approach is based on simulations designed to understand qualitatively the relationships among dataset characteristics and optimal split proportions. We use these results also to evaluate commonly used rules-of-thumb for allocation of the data to training and test sets. Our second approach involves development of a non-parametric method that does not rely on distributional assumptions and can be applied directly to any existing dataset without stipulating any parameter values. The nonparametric method can be used with any predictor development method (e.g., nearest neighbor, support vector machine). This paper addresses the situation in which the accuracy of a predictor will be assessed by its performance on a separate test set. An alternative approach is to apply resampling-based methods to the whole dataset. Because re-sampling strategies have been commonly mis-used, often resulting in highly biased estimates of prediction accuracy [2,3], many journals and reviewers mis-trust cross-validation and require validation on a sample not used for model development. Another advantage of the split sample method, particularly in large collaborative studies in which multiple groups will be developing predictors, is that the test set can be kept under lock and key by a honest broker [4]. The question addressed in this paper has not to our knowledge been addressed before. Sample splitting has been addressed in other contexts, such as comparing different k-fold cross validations [5] or developing hold out estimation theory [6] and bounds on Bayes error [7]. Mukherjee et al. [8], Fu et al. [9], and Dobbin and Simon [10] developed methods for planning the size of a training set, but these methods do not address the allocation of cases in an existing dataset to training and test portions. Since many gene expression based classifiers are developed retrospectively, there is often little control of the sample size. In the next section we describe the parametric modeling approach and the nonparametric approach that can be applied to specific datasets. We also present the results of application of these methods to synthetic and real world datasets. In the Conclusions section, recommendations for dividing a sample into a training set and test set are discussed. Approach The classifier taken forward from a split-sample study is often the one developed on the full dataset. This full-dataset classifier comes from combining the training and test sets together. The full-dataset classifier has an unknown accuracy which is estimated by applying the classifier derived on the training set to the test set. The optimal split will then be the one that minimizes the mean squared error (MSE) with respect to this full-dataset classifier. The MSE naturally penalizes for bias (from using a training set smaller than n) and variance. MSE decomposition In the supplemental material [Additional file 1: Supplemental Section 1.2], it is shown that under mild assumptions the MSE is proportional to MSE A + V + B. Here we have symbols A, V and B to depict the decomposition, and these are used throughout the discussion below. Here is a description of each term in Equation (1). Figure 1 shows the breakdown visually. A = Accuracy Variance Term The first term in Equation (1) reflects the variance in the true accuracy of a classifier developed on a training set T selected from the full dataset S. Not all training sets T S will result in predictors with exactly the same accuracy. The variation in actual (true) accuracy among all these different predictors is the A term. V = Binomial Variance Term The second term in Equation (1) is the variance in the estimated accuracy that results from applying the classifier to the test set. This is a binomial varianc (...truncated)