Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems

http://www.biomedcentral.com/content/pdf/1471-2105-12-463.pdf

Kenneth R Hess (2), Caimiao Wei (2), Yuan Qi (2), Takayuki Iwamoto (0), W Fraser Symmans (1), Lajos Pusztai (0)

(0) Breast Medical Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
(1) Pathology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
(2) Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas, USA

Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.

Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets.

Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features.
Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.

Background

Gene expression data are commonly used to develop multi-gene prediction models for various clinical classification problems. Several gene expression-based multivariate prognostic and treatment sensitivity predictors have been developed for breast cancer, and numerous other gene signatures have been reported to predict specific biological states including pathway activity and mutation status of p53, BRCA, PIK3 and other genes in cancer [1-9]. However, many genomic predictors yielded low accuracy in independent validation [10-14]. It also seems apparent that some classification problems are easier to solve than others in the mRNA expression space. For example, it is straightforward to construct accurate classifiers for breast cancer that predict estrogen-receptor (ER) status or histologic grade because of the large-scale gene expression differences that exist between ER-positive and ER-negative cancers or between low-grade and high-grade cancers [14-17]. Many of the empirically developed first-generation prognostic and predictive gene signatures for breast cancer derive their predictive value from recognizing molecular equivalents of ER status and tumor grade. This is because prognosis, drug response rates and even p53, PI3K or BRCA mutation status are not evenly distributed between ER-positive and ER-negative breast cancers [18]. When clinically more homogeneous subtypes of breast cancer are analyzed, it has been difficult to develop outcome predictors with good performance metrics [19].

Supervised classification models are developed through comparison of groups of samples that differ in the clinical outcome of interest. The first step typically involves identification of informative probe sets/genes (i.e. features) that are differentially expressed between the groups. Next, these informative features are considered as variables to train a multivariate classification model. Intuitively, the predictive performance of classifiers must depend on the number of informative features, the magnitude of difference in feature expression levels between the groups of interest, and the number of informative cases in each group. These critical parameters are expected to vary from classification problem to classification problem and from data set to data set. However, it is not well understood how each of these components influences the success of the classifier development process and what the minimum requirements for developing successful predictors might be.

The goal of this analysis was to take public breast cancer gene expression data sets, spike these with a series of artificial gene signatures and assess how well these spiked-in gene signatures could be recovered and used to develop a multi-gene classifier to predict the spiked-in status of a sample. The artificial gene signatures consisted of real probe sets whose expression values were increased (i.e. spiked) by a constant. The extent of perturbation varied over a broad range for three key parameters: (i) the number of samples perturbed (i.e. informative cases), (ii) the number of probe sets included in the artificial signature (i.e. signature size), and (iii) the fold increase in mean expression value for the spiked probes (i.e. signature strength). To place our findings into context, we also calculated gene signature size and strength for nine different real-life clinical prediction problems in six different data sets.

Methods

Data sets

We used 3 publicly available human breast cancer gene expression data sets, each generated with Affymetrix U133A gene chips.
These included the Microarray Quality Control Consortium (MAQC II) breast cancer data (n = 233, Gene Expression Omnibus [GEO] accession number GSE16716) [20], the TRANSBIG data set (n = 199, GSE7390) [3] and the Wang et al. data set (n = 286, GSE2034) [2]. Each data set was analyzed separately using an identical analysis plan to assess the consistency of findings. The individual Affymetrix CEL files were MAS5 normalized to a median target array intensity of 600, and expression values were transformed to log base 2 values using the Bioconductor software (http://www.bioconductor.org).

Perturbing of probe set expression values

We randomly selected s samples (s = 10, 15, 20, 25, 30, 40, 60, 80, 100) to be perturbed in each data set. In the classification exercise described below, these s perturbed samples represent one class and the remaining samples in the data set represent the other class. For each set of s samples, we randomly selected g probe sets (g = 10, 15, 20, 25, 30, 50, 100, 250, and 500) to represent the informative features (i.e. the spiked gene signature). We altered the normalized, log2-transformed expression values of each of the g probe sets by adding the same constant c (c = 0, 0.5, 1, 1.2, 1.5, 2, 3, 4). This is equivalent to multiplying the original-scale value by 2^c. So, c = 0 corresponds to unperturbed data, c = 1 corresponds to a 2-fold increase and c = 2 corresponds to a 4-fold increase. (...truncated)
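
The perturbation scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code (the text mentions only Bioconductor): the function name `spike_signature`, its arguments, the use of NumPy, and the synthetic expression matrix are all hypothetical and chosen for illustration. The sketch picks s samples and g probe sets at random, adds the constant c to the corresponding log2 values, and returns the class labels that a classifier would then be trained to predict.

```python
import numpy as np


def spike_signature(expr_log2, s, g, c, seed=0):
    """Add constant c to g random probe sets in s random samples.

    expr_log2 : array of shape (n_probes, n_samples), log2-scale values.
    Returns a perturbed copy, a boolean label per sample (True = spiked),
    and the indices of the spiked probe sets. (Hypothetical helper.)
    """
    rng = np.random.default_rng(seed)
    n_probes, n_samples = expr_log2.shape

    # Draw the perturbed samples and the signature probe sets without replacement.
    samples = rng.choice(n_samples, size=s, replace=False)
    probes = rng.choice(n_probes, size=g, replace=False)

    # Adding c on the log2 scale multiplies the original-scale value by 2**c.
    spiked = expr_log2.copy()
    spiked[np.ix_(probes, samples)] += c

    labels = np.zeros(n_samples, dtype=bool)
    labels[samples] = True
    return spiked, labels, probes


# Synthetic stand-in for a log2 expression matrix (NOT real data):
# 1000 probe sets by 233 samples, matching only the MAQC II sample count.
toy_rng = np.random.default_rng(42)
toy_expr = toy_rng.normal(loc=8.0, scale=1.0, size=(1000, 233))

spiked_expr, is_spiked, signature = spike_signature(toy_expr, s=25, g=10, c=1.0)

# Fold change on the original scale for the spiked cells.
cells = np.ix_(signature, np.flatnonzero(is_spiked))
fold_change = 2.0 ** (spiked_expr[cells] - toy_expr[cells])
print("spiked samples:", int(is_spiked.sum()))
print("mean fold change in spiked cells:", float(fold_change.mean()))
```

Working on a copy keeps the unperturbed matrix available, so the same data set can be reused across the grid of s, g and c values.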