Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems
Kenneth R Hess (2), Caimiao Wei (2), Yuan Qi (2), Takayuki Iwamoto (0), W Fraser Symmans (1), Lajos Pusztai (0)
(0) Breast Medical Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
(1) Pathology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
(2) Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.
Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets.
Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
Background
Gene expression data are commonly used to develop
multi-gene prediction models for various clinical
classification problems. Several gene expression-based
multivariate prognostic and treatment sensitivity predictors
have been developed for breast cancer and numerous
other gene signatures have been reported to predict
specific biological states including pathway activity and
mutation status of p53, BRCA, PIK3 and other genes in
cancer [1-9]. However, many genomic predictors yielded
low accuracy in independent validation [10-14]. It also
seems apparent that some classification problems are
easier to solve than others in the mRNA expression
space. For example, it is straightforward to construct
accurate classifiers for breast cancer that predict
estrogen-receptor (ER) status or histologic grade because of the
large-scale gene expression differences that exist
between ER-positive and ER-negative or low-grade versus
high-grade cancers [14-17]. Many of the empirically
developed first-generation prognostic and predictive
gene signatures for breast cancer derive their predictive
value from recognizing molecular equivalents of ER
status and tumor grade. This is because prognosis, drug
response rates and even p53, PI3K or BRCA mutation
status are not evenly distributed between ER-positive
and ER-negative breast cancers [18]. When clinically more
homogeneous subtypes of breast cancer are analyzed, it
has been difficult to develop outcome predictors with
good performance metrics [19].
Supervised classification models are developed
through comparison of groups of samples that differ in
the clinical outcome of interest. The first step typically
involves identification of informative probe sets/genes
(i.e. features) that are differentially expressed between the
groups. Next, these informative features are considered
as variables to train a multivariate classification model.
Intuitively, the predictive performance of classifiers must
depend on the number of informative features, the
magnitude of difference in feature expression levels between
the groups of interest, and the number of informative
cases in each group. These critical parameters are
expected to vary from classification problem to
classification problem and from data set to data set. However,
it is not well understood how each of these components
influences the success of the classifier development
process or what the minimum requirements for developing
successful predictors might be.
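The two-step process described above (feature selection followed by model training) can be sketched as follows. This is a minimal illustration only, not the method used in this study: the Welch t-statistic ranking, the nearest-centroid classifier, the simulated data and all function names are assumptions chosen for brevity.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic comparing two lists of log2 expression values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def select_features(X, y, k):
    """Step 1: rank features by |t| between the two classes and keep the top k."""
    scored = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((abs(welch_t(pos, neg)), j))
    scored.sort(reverse=True)
    return sorted(j for _, j in scored[:k])


def train_centroids(X, y, features):
    """Step 2: fit a nearest-centroid model on the selected features only."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.mean(r[j] for r in rows) for j in features]
    return centroids


def predict(row, features, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[label]))
    return min(centroids, key=dist)


def simulate(n_per_class, n_features, n_informative, shift, rng):
    """Toy log2 data (hypothetical): the first n_informative features are shifted in class 1."""
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            row = [rng.gauss(8.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = simulate(20, 50, 5, 3.0, rng)
X_test, y_test = simulate(20, 50, 5, 3.0, rng)

selected = select_features(X_train, y_train, k=5)
model = train_centroids(X_train, y_train, selected)
predictions = [predict(row, selected, model) for row in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(selected, accuracy)
```

In this toy setting the three parameters named in the text map directly onto the arguments of simulate: n_informative is the number of informative features, shift is the magnitude of the expression difference on the log2 scale, and n_per_class is the number of informative cases.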
The goal of this analysis was to take public breast
cancer gene expression datasets, spike these with a series of
artificial gene signatures and assess how well these
spiked-in gene signatures could be recovered and used
to develop a multi-gene classifier to predict spiked-in
status of a sample. The artificial gene signatures
consisted of real probe sets whose expression values were
increased (i.e. spiked) by a constant. The extent of
perturbation varied over a broad range for three key
parameters: (i) the number of samples perturbed (i.e.
informative cases), (ii) the number of probe sets
included in the artificial signature (i.e. signature size),
and (iii) the fold increase in mean expression value for
the spiked probes (i.e. signature strength). To place our
findings into context, we also calculated gene signature
size and strength for nine different real-life clinical
prediction problems in six different data sets.
Methods
Data sets
We used 3 publicly available human breast cancer
gene expression data sets, each generated with
Affymetrix U133A gene chips. These included the Microarray
Quality Control Consortium (MAQC II) breast cancer
data set (n = 233, Gene Expression Omnibus [GEO]
accession number GSE16716) [20], the TRANSBIG data set
(n = 199, GSE7390) [3] and the Wang et al. data set (n
= 286, GSE2034) [2]. Each data set was analyzed
separately using an identical analysis plan to assess the
consistency of findings. The individual Affymetrix CEL files were
MAS5 normalized to a median target array intensity of
600 and expression values were transformed to log base
2 values using the Bioconductor software (http://www.bioconductor.org).
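The final scaling and log transformation can be illustrated with a short sketch. It covers only the rescaling of one array to a median of 600 followed by log base 2 conversion; the MAS5 probe-set summarization itself is not reproduced, and the function name and the example intensities are hypothetical.

```python
import math
import statistics


def scale_and_log2(intensities, target=600.0):
    """Rescale one array so its median equals `target`, then take log2.

    Assumes all intensities are positive (hypothetical input, not real CEL data).
    """
    factor = target / statistics.median(intensities)
    return [math.log2(value * factor) for value in intensities]


raw_array = [150.0, 300.0, 1200.0]      # made-up probe set intensities
log2_array = scale_and_log2(raw_array)  # median of the scaled array is 600
print(log2_array)
```

Because every array is brought to the same median before the log transformation, a given log2 difference between two arrays corresponds to the same fold difference on the original scale.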
Perturbing of probe set expression values
We randomly selected s samples (s = 10, 15, 20, 25, 30,
40, 60, 80, 100) to be perturbed in each data set. In the
classification exercise described below, these s perturbed
samples represent one class and the remaining samples
in the data set represent the other class. For each set of s
samples, we randomly selected g probe sets (g = 10, 15, 20, 25,
30, 50, 100, 250, and 500) to represent the informative
features (i.e. spiked gene signature). In the s selected samples,
we altered the normalized, log2-transformed expression values of
each of the g probe sets by adding the same constant c (c = 0, 0.5, 1,
1.2, 1.5, 2, 3, 4). This is equivalent to multiplying the
original scale value by 2^c. So, c = 0 corresponds to
unperturbed data, c = 1.0 corresponds to a 2-fold increase and
c = 2 corresponds to a 4-fold increase. (...truncated)
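The perturbation step can be sketched as follows. This is an illustrative reconstruction from the description above, not the code used in the study: the function name, the random seed and the simulated expression matrix are assumptions.

```python
import random


def spike_signature(expr, s, g, c, rng):
    """Add constant c (log2 scale) to g random probe sets in s random samples.

    expr is a list of samples, each a list of log2 expression values.
    Returns the perturbed copy, the class labels (1 = perturbed) and the
    sorted indices of the spiked probe sets.
    """
    n_samples, n_probes = len(expr), len(expr[0])
    spiked_samples = set(rng.sample(range(n_samples), s))
    spiked_probes = sorted(rng.sample(range(n_probes), g))
    perturbed = [row[:] for row in expr]  # leave the input matrix untouched
    for i in spiked_samples:
        for j in spiked_probes:
            perturbed[i][j] += c          # equals a 2**c fold change on the raw scale
    labels = [1 if i in spiked_samples else 0 for i in range(n_samples)]
    return perturbed, labels, spiked_probes


rng = random.Random(1)
# Hypothetical log2 expression matrix: 50 samples x 200 probe sets.
expr = [[rng.gauss(8.0, 1.0) for _ in range(200)] for _ in range(50)]
spiked, labels, probes = spike_signature(expr, s=10, g=20, c=1.0, rng=rng)
fold_change = 2 ** 1.0  # c = 1 corresponds to a 2-fold increase
print(sum(labels), len(probes), fold_change)
```

The returned labels define the two classes for the classification exercise: the s perturbed samples form one class and all remaining samples form the other.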