Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier
BMC Cancer
Martin Lauss 0
Attila Frigyesi 2
Tobias Ryden 1
Mattias Höglund 0
0 Department of Oncology, Clinical Sciences, Lund University and Lund University Hospital , SE-221 85 LUND , Sweden
1 Centre for Mathematical Sciences, Lund University , Box 118, SE-221 00 Lund , Sweden
2 Department of Anesthesiology and Intensive Care, Lund University Hospital , SE-221 85 Lund , Sweden
Background: Genome-wide gene expression data are a rich source for the identification of gene signatures suitable for clinical purposes, and a number of statistical algorithms have been described for both the identification and the evaluation of such signatures. Some of these algorithms are fairly complex and hence sensitive to over-fitting, whereas others are simpler and more straightforward. Here we present a new type of simple algorithm, based on ROC analysis and the use of metagenes, that we believe will be a good complement to existing algorithms.
Results: The proposed approach rests on the use of metagenes, instead of collections of individual genes, and on feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation, and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates the tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using leave-one-out cross-validation (LOOCV) and balanced accuracies.
Conclusions: We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms, including discriminant analysis and more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html.
Background
One of the most promising clinical applications of
genome-wide expression studies is the construction of
robust and reliable disease classifiers. Correct
identification and sub-classification of diseases such as cancer is
a prerequisite for proper and efficient treatment. To
date a large number of different algorithms for disease
classification have been described. They range in
complexity from neural network approaches [1] to the
simpler nearest-neighbor classification algorithms [2]. Even
though some of the more complex approaches, such as
neural networks and self-organizing maps (SOM) [3],
have proved to be very efficient, these methods often
rely on the tuning of several parameters and hence are
prone to over-fitting. Furthermore, simple classifiers
seem to perform remarkably well when compared to
more sophisticated ones [4]. In the present investigation
our aim has been to design a simple predictor system
useful for cancer subtype classification. Features to be
included in the predictor signatures are selected based
on their classification capacity as determined by a
receiver operating characteristic (ROC) analysis and area
under the curve (AUC) estimates [5,6]. After selection
of the appropriate number of genes for the predictor
signature, the mean expression level of all included genes
is calculated. This collapses the ensemble of genes into
a single vector, a uni-variate gene expression average or
metagene, which is used as the classifier. Merging genes
in this way exploits two features of gene expression data:
genes are often co-regulated and hence correlated, and
averaging over many genes reduces the effect of random
noise in any single gene. Most of
the commonly used algorithms, such as SVM [7] and
PAM [8], attach specifications such as support vectors
and weights to the individual features included in the
predictor gene signatures, which potentially complicates
their application to independent data [9]. Hence, in this
investigation we evaluate the results in an alternative way:
only the genes of the signature obtained in the training set
are carried over, and new parameters are then established in the
validation set to evaluate the performance of the
classifier. We show that the proposed metagene classifier
produces excellent accuracies, similar to those obtained
with an SVM approach, in several types of cancer data
sets using a variety of tumor classification criteria.
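The selection and averaging steps described above can be illustrated with a minimal Python sketch. It is not the implementation in the rocc package. The function names, the toy data, the fixed signature size of two genes, the restriction to genes that are up-regulated in the class of interest, and the choice of balanced accuracy as the criterion for the cut-off are all assumptions made for illustration.

```python
from statistics import mean

def auc(values, labels):
    """AUC as the fraction of (positive, negative) sample pairs in which
    the positive sample has the higher value; ties count as 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_genes(expr, labels):
    """Gene names sorted by AUC, highest first.
    expr maps gene name -> list of per-sample expression values."""
    return sorted(expr, key=lambda g: auc(expr[g], labels), reverse=True)

def metagene(expr, genes):
    """Per-sample mean expression over the selected genes."""
    n_samples = len(expr[genes[0]])
    return [mean(expr[g][i] for g in genes) for i in range(n_samples)]

def classify(scores, threshold):
    """Assign class 1 to samples whose metagene value exceeds the cut-off."""
    return [1 if s > threshold else 0 for s in scores]

def balanced_accuracy(pred, labels):
    """Mean of sensitivity and specificity."""
    tp = sum(1 for p, y in zip(pred, labels) if y == 1 and p == 1)
    tn = sum(1 for p, y in zip(pred, labels) if y == 0 and p == 0)
    return 0.5 * (tp / labels.count(1) + tn / labels.count(0))

def best_threshold(scores, labels):
    """Cut-off (midpoint between adjacent sorted scores) with the highest
    balanced accuracy on the training samples (assumed criterion)."""
    s = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(s, s[1:])]
    return max(candidates,
               key=lambda t: balanced_accuracy(classify(scores, t), labels))

# Hypothetical toy training set: three genes, six samples (1 = class of interest)
expr = {
    "g1": [5.0, 6.0, 5.5, 1.0, 2.0, 1.5],
    "g2": [4.0, 1.0, 5.0, 2.0, 3.0, 1.0],
    "g3": [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],
}
labels = [1, 1, 1, 0, 0, 0]

ranked = rank_genes(expr, labels)      # genes ordered by AUC
scores = metagene(expr, ranked[:2])    # metagene of the two top-ranked genes
cutoff = best_threshold(scores, labels)
predictions = classify(scores, cutoff)
```

In this sketch the signature size is fixed at two genes; repeating the last three steps for an increasing number of top-ranked genes and keeping the best-performing size would correspond more closely to the procedure described in the text.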
Implementation
Data sets
To establish the classifier we used bladder cancer
datasets produced by Sanchez-Carbayo et al. [10]
(Supplementary Table 10 in [10]; "SanchezC"), Stransky et al.
[11] (ArrayExpress: E-TABM-147; "Stransky"), and Blaveri
et al. [12] (Supplementary Table 4 in [12]; "Blaveri"). The
remaining datasets were obtained from Gene Expression
Omnibus (GEO) [13], except for the vandeVijver breast
cancer dataset [14]. The following datasets were
downloaded from GEO: for breast cancer, GSE2034 (WangY) and
GSE2990 (Sotiriou); for neuroblastoma, GSE3960
(WangQ), GSE12460 (JanoueixL), and GSE19274 (Attiyeh);
and for lung cancer, GSE8569 (Angulo) and GSE11969 (Takeuchi). For a
detailed description of the datasets, see Additional file 1.
Normal urothelium samples, recurring tumors from the
same patient, cell lines, and technical replicates were not
included in the final bladder cancer data sets. The
SanchezC dataset was quantile-normalized using the
normalizeBetweenArrays function of the R package limma [15].
Robust Multi-array Average (RMA) normalization was performed
separately for the two sample sets of the Stransky dataset (on
U95A and U95Av2, respectively) using the affy package
[16]. The resulting RMA expression values were de-logged,
the sample sets combined, and the combined data quantile-normalized
using limma. The SanchezC and Stransky datasets were
both transformed to log2 scale. To obtain gene-centered
values, the mean expression of each gene was subtracted from
its expression values, separately for each dataset.
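As an illustration of the last two steps (log2 transformation followed by gene centering within each dataset), a minimal Python sketch is given here. It does not reproduce the limma or affy functions used by the authors; the function names and the toy values are hypothetical.

```python
from math import log2
from statistics import mean

def log2_transform(expr):
    """Convert linear-scale intensities to log2 scale (values must be > 0).
    expr maps gene name -> list of per-sample values."""
    return {gene: [log2(v) for v in values] for gene, values in expr.items()}

def gene_center(expr):
    """Subtract each gene's mean across the samples of one dataset."""
    centered = {}
    for gene, values in expr.items():
        gene_mean = mean(values)
        centered[gene] = [v - gene_mean for v in values]
    return centered

# Hypothetical toy datasets measured on different intensity scales
dataset_a = {"GENE1": [2.0, 4.0, 8.0], "GENE2": [16.0, 16.0, 16.0]}
dataset_b = {"GENE1": [64.0, 256.0, 1024.0]}

# Each dataset is centered on its own, so gene means are never shared
centered_a = gene_center(log2_transform(dataset_a))
centered_b = gene_center(log2_transform(dataset_b))
```

Centering each dataset separately removes dataset-specific offsets in a gene's overall level, so that values from different platforms are expressed relative to the gene's own mean in that dataset.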
Missing values in the Blaveri dataset were imputed using
k-nearest neighbors (k = 10) for genes with no more
than 20% missing data; genes with >20% missing
data were omitted [17]. GeneSymbols in all datasets were
updated to the official HGNC
GeneSymbols from the HGNC webpage [18]. The expression
values of GeneSymbols with multiple reporters were
merged by taking the median expression value. All
reporters in the datasets without a GeneSymbol were
discarded. The final SanchezC dataset contained 90 patients
and 12761 genes, the Stransky dataset 56 patients and
8955 genes, and the Blaveri dataset 74 patients and 4430
genes. The SanchezC and Stransky datasets share a total
of 8518 GeneSymbols and were used to explore the AUC
characteristics. For classification, Ta and T1 cases were
considered non-muscle invasive (NMI), and T2 cases as
muscle-invasive (MI). Grade was discriminated between
Grade 2 and Grade 3 in SanchezC, and between Grade 1+2 and
Grade 3 in Sanchez. Randomized versions of the datasets
were generated using the mean and st (...truncated)