EMT is the dominant program in human colon cancer
Loboda et al.
Open Access
Andre Loboda1, Michael V Nebozhyn1, James W Watters1, Carolyne A Buser3, Peter Martin Shaw2, Pearl S Huang3,
Laura Van't Veer7, Rob AEM Tollenaar8, David B Jackson6, Deepak Agrawal5, Hongyue Dai4, Timothy J Yeatman5*
Background: Colon cancer has been classically described by clinicopathologic features that permit the prediction
of outcome only after surgical resection and staging.
Methods: We performed an unsupervised analysis of microarray data from 326 colon cancers to identify the first
principal component (PC1) of the most variable set of genes. PC1 deciphered two primary, intrinsic molecular
subtypes of colon cancer that predicted disease progression and recurrence.
Results: Here we report that the dominant pattern of intrinsic gene expression in colon cancer (PC1) was
tightly correlated (Pearson R = 0.92, P < 10^-135) with the EMT signature both in gene identity and directionality. In
a global microRNA screen, we further identified the microRNA most anti-correlated with PC1 as miR-200, known to
regulate EMT.
Conclusions: These data demonstrate that the biology underpinning the native, molecular classification of human
colon cancer, previously thought to be highly heterogeneous, was clarified through the lens of comprehensive
transcriptome analysis.
Background
Colon cancer has long been postulated to be a
molecularly heterogeneous disease. This heterogeneity has been
proposed as the reason why it has been difficult to
identify unifying molecular hypotheses explaining the
biology and behavior of the disease. Molecular profiling of
colon cancer has been a relatively effective approach for
identifying prognosis of early and intermediate stage
disease. We and others have identified biologically complex
signatures that affect multiple programs such as
adhesion, invasion, and angiogenesis and correlate well with
cancer progression and recurrence. These signatures
appear to support Weinberg's hypothesis [1] of multiple
programs leading to cancer development and
progression. These signatures have generally been developed
using supervised machine learning techniques that train
their models on pre-determined good vs. poor prognosis
patient populations [2-6]. Colon cancer, unlike breast
cancer where luminal and basal intrinsic subtypes have
been identified [7-13], or bladder cancer where intrinsic
signatures of recurrence have been established [14,15],
has yet to be classified by unsupervised, molecular
profiling approaches. We believed it was important to
attempt to uncover unbiased, native biological traits that
might underpin colon cancer.
* Correspondence:
5Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, FL 33612, USA
Full list of author information is available at the end of the article
Methods
Colon Cancer Samples
326 human colon cancer samples derived from the
Moffitt Cancer Center were previously assessed using a
single Affymetrix U133Plus2.0 platform and single standard
operating procedure. Formalin-fixed, paraffin-embedded (FFPE)
blocks were obtained for 69 of these cases and used to
extract tumor RNA after macrodissection. Tumor RNA
was submitted for global microRNA analysis using an
Applied Biosystems platform covering ~700 unique
microRNA species. The gene expression data were then
compared directly to the microRNA data derived from
the same samples. All patient samples and clinical
information for the 326 colon samples were obtained
through a protocol approved by The University of South
Florida Institutional Review Board.
Identification of the cell line derived EMT signature
The EMT signature was derived from a microarray
dataset of 93 lung cancer cell lines by performing a t-test
comparing cell lines exhibiting a mesenchymal-like gene
expression pattern (high levels of VIM and low levels of
CDH1) vs. cell lines with an epithelial-like gene expression
pattern (low levels of VIM and high levels of CDH1).
Genes with p-value < 0.01 by t-test were selected and
split into those up-regulated in mesenchymal-like cell
lines and those up-regulated in epithelial-like cell lines.
Each of the two sets (up- and down-regulated) was then
restricted to approximately 200 unique gene symbols
based on the absolute value of the fold change.
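The selection steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes log2-scale expression values (so the fold change is a difference of group means), uses the default equal-variance t-test in SciPy because the text does not say which variant was used, and the function name derive_signature and its parameters are hypothetical.

```python
import numpy as np
from scipy import stats


def derive_signature(expr, is_mesenchymal, gene_symbols, p_cutoff=0.01, n_genes=200):
    """Return (up, down) gene-symbol lists for a two-group comparison.

    expr           : array (n_probes, n_samples), assumed log2 expression
    is_mesenchymal : boolean array over samples (True = mesenchymal-like)
    gene_symbols   : one gene symbol per probe row
    """
    mes = expr[:, is_mesenchymal]
    epi = expr[:, ~is_mesenchymal]

    # Per-probe two-sample t-test (equal-variance form is an assumption).
    _, p_values = stats.ttest_ind(mes, epi, axis=1)

    # On a log2 scale the fold change is the difference of group means.
    log_fc = mes.mean(axis=1) - epi.mean(axis=1)
    significant = p_values < p_cutoff

    def top_symbols(direction_mask):
        # Rank significant probes in one direction by |fold change|,
        # then keep the first n_genes unique gene symbols.
        idx = np.flatnonzero(significant & direction_mask)
        idx = idx[np.argsort(-np.abs(log_fc[idx]))]
        symbols = []
        for i in idx:
            symbol = gene_symbols[i]
            if symbol not in symbols:
                symbols.append(symbol)
            if len(symbols) == n_genes:
                break
        return symbols

    up_in_mesenchymal = top_symbols(log_fc > 0)
    up_in_epithelial = top_symbols(log_fc < 0)
    return up_in_mesenchymal, up_in_epithelial
```

The mesenchymal-like and epithelial-like labels would be assigned beforehand from VIM and CDH1 levels. The text gives no numeric thresholds for "high" and "low", so that step is left out of the sketch.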
Identification of PC1
Unsupervised analysis of the most variable genes
expressed in the colon cancer data set (n = 326) was
undertaken to discover new, intrinsic biology of colon
cancer. Principal component analysis was performed on
the entire gene expression data set of 326 CRC samples
using the princomp function in Matlab (MathWorks
Inc.). The 1st principal component (PC1), corresponding
to the highest eigenvalue of the covariance matrix, was
selected as describing the inherent variability
of the data.
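As an illustration of what PC1 is, the NumPy sketch below extracts the first principal component of a samples-by-genes matrix. It is not the Matlab call used in the study. It takes the singular value decomposition of the centered data, which gives the same leading direction as the top eigenvector of the covariance matrix while avoiding a genes-by-genes covariance matrix. The function name first_principal_component is hypothetical.

```python
import numpy as np


def first_principal_component(expr):
    """Compute PC1 of an expression matrix.

    expr : array (n_samples, n_genes)
    Returns (scores, loadings, explained_fraction):
      scores   - PC1 value per sample
      loadings - unit-length weight per gene
      explained_fraction - share of total variance carried by PC1
    """
    # Center each gene so the SVD corresponds to the covariance matrix.
    centered = expr - expr.mean(axis=0)

    # Singular values come back sorted in descending order, so the first
    # right singular vector is the top covariance eigenvector.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

    loadings = vt[0]
    scores = centered @ loadings
    explained_fraction = singular_values[0] ** 2 / np.sum(singular_values ** 2)
    return scores, loadings, explained_fraction
```

The sign of a principal component is arbitrary, so scores and loadings may both come out negated relative to another implementation. Correlations with an external signature should be read with that in mind.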
Derivation of colon signatures
We identified a collection of gene sets that were associated
with different endpoints related to tumor histology.
A signature was created for each of the following comparisons:
right/left (RT/LT) colon, comparing 60 samples collected
in RT colon vs. 18 samples collected in LT colon;
mucinous/non-mucinous colon carcinoma, comparing 35
mucinous colon carcinomas vs. 165 non-mucinous;
MSI/MSS, comparing 6 MSI vs. 73 MSS samples;
carcinoma vs. adenoma, comparing 22 pure
adenocarcinoma samples vs. 5 pure adenomas;
poor/well differentiation, comparing 32 poorly
differentiated samples vs. 19 well-differentiated;
colon/rectum, comparing 50 samples collected in
colon vs. 19 samples collected in rectum;
Stage 2/Stage 1, comparing 59 stage 2 samples vs.
32 stage 1 samples; and Stage 3/Stage 2, comparing
71 stage 3 samples vs. 59 stage 2 samples.
Each comparison was carried out on non-metastatic
samples with known stage, histology, and collection site.
For each comparison, two gene sets (up- and
down-regulated) were identified: probes with t-test
p-value < 0.01 were split by the sign of the fold change,
and unique gene symbols were selected among the 100
probes most differentially expressed by absolute value
of fold change. Performance of these gene sets was
evaluated by back substitution. The score for each
comparison was computed as the mean of the probes that
mapped by gene symbol to the up-regulated subset minus
the mean of the probes that mapped by gene symbol to
the down-regulated subset. The gene sets were found to
have ROC AUC > 0.7 and 1-way ANOVA p-value < 1e-6 when
applied to distinguish the same samples that were used
to identify them.
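The score and the back-substitution check described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, expression values are assumed to be on a log scale, and the ROC AUC is computed directly as the fraction of (positive, negative) sample pairs ranked correctly, which is one standard way to obtain it and not necessarily the authors' routine. The 1-way ANOVA step is not shown.

```python
import numpy as np


def signature_score(expr, up_rows, down_rows):
    """Mean of up-regulated probes minus mean of down-regulated probes.

    expr      : array (n_probes, n_samples)
    up_rows   : row indices of probes mapped to the up-regulated set
    down_rows : row indices of probes mapped to the down-regulated set
    Returns one score per sample.
    """
    return expr[up_rows].mean(axis=0) - expr[down_rows].mean(axis=0)


def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]

    # Compare every positive with every negative; ties count as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Because the gene sets are scored on the same samples used to select them, an AUC obtained this way measures internal consistency rather than performance on independent data.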
Scoring of signatures in the data set
The signature score for a given gene set was obtained by
averaging the expression levels of the probes that
mapped by gene symbol to that gene set. MYC and
RAS signatures were obtained from Nevins et al. [16,17].
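A minimal sketch of this averaging for a single sample is shown below. The function name, the probe identifiers and the probe-to-symbol mapping are hypothetical placeholders, not real array annotations.

```python
from statistics import mean


def gene_set_score(sample_expr, probe_to_symbol, gene_set):
    """Average the expression of all probes whose gene symbol is in gene_set.

    sample_expr     : dict mapping probe ID -> expression level (one sample)
    probe_to_symbol : dict mapping probe ID -> gene symbol
    gene_set        : set of gene symbols defining the signature
    """
    values = [
        level
        for probe, level in sample_expr.items()
        if probe_to_symbol.get(probe) in gene_set
    ]
    if not values:
        raise ValueError("no probes map to this gene set")
    return mean(values)


# Hypothetical probe IDs and values, for illustration only.
sample = {"probe_a": 2.0, "probe_b": 4.0, "probe_c": 10.0}
mapping = {"probe_a": "VIM", "probe_b": "VIM", "probe_c": "CDH1"}
vim_score = gene_set_score(sample, mapping, {"VIM"})
```

Averaging over probes, not over genes, means that a gene represented by several probes contributes proportionally more to the score.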
Standard microarray data processing
The microarray data were processed using the RMA
normalization method as implemented in Affymetrix
Power Tools usin (...truncated)