Optimally splitting cases for training and testing high dimensional classifiers
BMC Medical Genomics
Kevin K Dobbin 0
Richard M Simon 1
0 Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, USA
1 Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?
Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.
Conclusions: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
Background
The split sample approach is a widely used study design
in high dimensional settings. This design divides the
collection into a training set and a test set as a means of
estimating classification accuracy. A classifier is
developed on the training set and applied to each sample in
the test set. In practice, statistical prediction models
have often been developed without separating the data
used for model development from the data used for
estimation of prediction accuracy [1]. When the number of
candidate predictors (p) is larger than the number of
cases as in microarray data, such separation is essential
to avoid large bias in estimation of prediction accuracy
[2]. This paper addresses the question of how to
optimally split a sample into a training set and a test set
for a high dimensional gene expression study, that is,
how many samples to allocate to each group.
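The bias that motivates this separation can be seen in a small simulation. The sketch below is illustrative only and is not taken from the paper. It generates pure-noise data, so no classifier can truly exceed 50% accuracy, and compares an accuracy estimate computed on the same cases used for gene selection and fitting with a split-sample estimate. The nearest-centroid rule, the gene-ranking criterion, and all function names and parameter values are assumptions chosen for brevity.

```python
import random


def nearest_centroid_accuracy(train, test, n_genes):
    """Select genes and fit a nearest-centroid rule on `train`; score it on `test`.

    Each case is a (label, expression_values) pair with label 0 or 1.
    """
    p = len(train[0][1])
    g0 = [x for y, x in train if y == 0]
    g1 = [x for y, x in train if y == 1]
    m0 = [sum(x[j] for x in g0) / len(g0) for j in range(p)]
    m1 = [sum(x[j] for x in g1) / len(g1) for j in range(p)]
    # Rank genes by absolute mean difference (a crude stand-in for a t-statistic).
    top = sorted(range(p), key=lambda j: abs(m1[j] - m0[j]), reverse=True)[:n_genes]
    correct = 0
    for y, x in test:
        d0 = sum((x[j] - m0[j]) ** 2 for j in top)
        d1 = sum((x[j] - m1[j]) ** 2 for j in top)
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == y)
    return correct / len(test)


def null_dataset(rng, n, p):
    """n cases with p pure-noise genes; labels alternate 0, 1, 0, 1, ..."""
    return [(i % 2, [rng.gauss(0.0, 1.0) for _ in range(p)]) for i in range(n)]


def compare_designs(n=40, p=300, n_genes=10, reps=20, seed=0):
    """Return (mean accuracy without separation, mean split-sample accuracy)."""
    rng = random.Random(seed)
    resub, split = [], []
    for _ in range(reps):
        data = null_dataset(rng, n, p)
        # No separation: genes are selected and accuracy is estimated on the same cases.
        resub.append(nearest_centroid_accuracy(data, data, n_genes))
        # Split sample: the first half trains, the second half tests.
        # Labels alternate, so both halves are class-balanced.
        train, test = data[: n // 2], data[n // 2:]
        split.append(nearest_centroid_accuracy(train, test, n_genes))
    return sum(resub) / reps, sum(split) / reps
```

Under these assumptions, `compare_designs()` should return an accuracy well above 0.5 for the design without separation and an accuracy near 0.5 for the split-sample design, even though the data contain no signal.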
Two approaches to evaluating splits of the data are
examined. The first approach is based on simulations
designed to understand qualitatively the relationships
among dataset characteristics and optimal split
proportions. We use these results also to evaluate commonly
used rules-of-thumb for allocation of the data to
training and test sets. Our second approach involves
development of a non-parametric method that does not rely
on distributional assumptions and can be applied
directly to any existing dataset without stipulating any
parameter values. The nonparametric method can be
used with any predictor development method (e.g.,
nearest neighbor, support vector machine).
This paper addresses the situation in which the
accuracy of a predictor will be assessed by its
performance on a separate test set. An alternative
approach is to apply resampling-based methods to the
whole dataset. Because resampling strategies have
been commonly misused, often resulting in highly
biased estimates of prediction accuracy [2,3], many
journals and reviewers mistrust cross-validation and
require validation on a sample not used for model
development. Another advantage of the split sample
method, particularly in large collaborative studies in
which multiple groups will be developing predictors, is
that the test set can be kept under lock and key by an
honest broker [4].
The question addressed in this paper has not to our
knowledge been addressed before. Sample splitting has
been addressed in other contexts, such as comparing
different k-fold cross validations [5] or developing hold
out estimation theory [6] and bounds on Bayes error
[7]. Mukherjee et al. [8], Fu et al. [9], and Dobbin and
Simon [10] developed methods for planning the size of
a training set, but these methods do not address the
allocation of cases in an existing dataset to training and
test portions. Since many gene expression based
classifiers are developed retrospectively, there is often little
control of the sample size.
In the next section we describe the parametric
modeling approach and the nonparametric approach that can
be applied to specific datasets. We also present the
results of application of these methods to synthetic and
real world datasets. In the Conclusions section,
recommendations for dividing a sample into a training set and
test set are discussed.
Approach
The classifier taken forward from a split-sample study
is often the one developed on the full dataset. This
full-dataset classifier comes from combining the
training and test sets together. The full-dataset classifier
has an unknown accuracy which is estimated by
applying the classifier derived on the training set to the test
set. The optimal split will then be the one that
minimizes the mean squared error (MSE) with respect to
this full-dataset classifier. The MSE naturally penalizes
for bias (from using a training set smaller than n) and
variance.
MSE decomposition
In the supplemental material [Additional file 1:
Supplemental Section 1.2], it is shown that under mild
assumptions the MSE is proportional to
$$\mathrm{MSE} \propto A + V + B. \tag{1}$$
The symbols A, V and B denote the three components of
the decomposition and are used throughout the
discussion below. Each term in Equation (1) is
described next. Figure 1 shows the breakdown visually.
A = Accuracy Variance Term. The first term in
Equation (1) reflects the variance in the true accuracy of a
classifier developed on a training set T selected from
the full dataset S. Not all training sets T ⊂ S will result
in predictors with exactly the same accuracy. The
variation in actual (true) accuracy among all these different
predictors is the A term.
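The A term can be made concrete with a short simulation. The sketch below is illustrative and rests on assumptions that are not part of the paper: a Gaussian two-class model with identity covariance, a nearest-centroid linear rule, and hypothetical function names and parameter values. It fixes one full dataset S, draws many class-balanced training sets T from S, computes the exact true accuracy of the rule fitted to each T, and returns the sample variance of those accuracies.

```python
import math
import random


def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_accuracy_of_centroid_rule(train, mu):
    """Exact accuracy, for N(+mu, I) vs N(-mu, I) with equal priors,
    of the nearest-centroid rule fitted to `train`."""
    p = len(mu)
    x0 = [x for y, x in train if y == 0]
    x1 = [x for y, x in train if y == 1]
    m0 = [sum(x[j] for x in x0) / len(x0) for j in range(p)]
    m1 = [sum(x[j] for x in x1) / len(x1) for j in range(p)]
    w = [b - a for a, b in zip(m0, m1)]
    c = sum(wj * (a + b) / 2.0 for wj, a, b in zip(w, m0, m1))
    norm = math.sqrt(sum(wj * wj for wj in w))
    s = sum(wj * mj for wj, mj in zip(w, mu))
    return 0.5 * phi((s - c) / norm) + 0.5 * phi((s + c) / norm)


def accuracy_variance_term(per_class_train, n=40, p=20, n_informative=5,
                           delta=0.4, draws=300, seed=2):
    """Sample variance of true accuracy over random training sets T drawn from one fixed S."""
    rng = random.Random(seed)
    mu = [delta] * n_informative + [0.0] * (p - n_informative)
    # One fixed full dataset S, with n // 2 cases per class.
    S = {
        y: [(y, [rng.gauss(m if y == 1 else -m, 1.0) for m in mu])
            for _ in range(n // 2)]
        for y in (0, 1)
    }
    accs = []
    for _ in range(draws):
        T = rng.sample(S[0], per_class_train) + rng.sample(S[1], per_class_train)
        accs.append(true_accuracy_of_centroid_rule(T, mu))
    mean = sum(accs) / draws
    return sum((a - mean) ** 2 for a in accs) / (draws - 1)
```

For example, `accuracy_variance_term(6)` can be compared with `accuracy_variance_term(18)`. In this sketch the variance shrinks as the training set grows, and it is essentially zero when T equals S because every draw then yields the same classifier.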
V = Binomial Variance Term. The second term in
Equation (1) is the variance in the estimated accuracy
that results from applying the classifier to the test set.
This is a binomial varianc (...truncated)