An algorithm for classifying tumors based on genomic aberrations and selecting representative tumor models
Lu et al. BMC Medical Genomics 2010, 3:23
http://www.biomedcentral.com/1755-8794/3/23
Open Access
RESEARCH ARTICLE
Xin Lu*1, Ke Zhang2, Charles Van Sant3, John Coon4 and Dimitri Semizarov*1
Abstract
Background: Cancer is a heterogeneous disease caused by genomic aberrations and characterized by significant
variability in clinical outcomes and response to therapies. Several subtypes of common cancers have been identified
based on alterations of individual cancer genes, such as HER2 and EGFR. However, cancer is a complex disease
driven by the interaction of multiple genes, so the copy number status of individual genes is not sufficient to define
cancer subtypes and predict responses to treatments. A classification based on genome-wide copy number patterns
would be better suited for this purpose.
Method: To develop a more comprehensive cancer taxonomy based on genome-wide patterns of copy number
abnormalities, we designed an unsupervised classification algorithm that identifies genomic subgroups of tumors. This
algorithm is based on a modified genomic Non-negative Matrix Factorization (gNMF) algorithm and includes several
additional components, namely a pilot hierarchical clustering procedure to determine the number of clusters, a
multiple random initiation scheme, a new stop criterion for the core gNMF, as well as a 10-fold cross-validation stability
test for quality assessment.
Result: We applied our algorithm to identify genomic subgroups of three major cancer types: non-small cell lung
carcinoma (NSCLC), colorectal cancer (CRC), and malignant melanoma. High-density SNP array datasets for patient
tumors and established cell lines were used to define genomic subclasses of the diseases and identify cell lines
representative of each genomic subtype. The algorithm was compared with several traditional clustering methods and
showed improved performance. To validate our genomic taxonomy of NSCLC, we correlated the genomic classification
with disease outcomes. Overall survival time and time to recurrence were shown to differ significantly between the
genomic subtypes.
Conclusions: We developed an algorithm for cancer classification based on genome-wide patterns of copy number
aberrations and demonstrated its superiority to existing clustering methods. The algorithm was applied to define
genomic subgroups of three cancer types and identify cell lines representative of these subgroups. Our data enabled
the assembly of representative cell line panels for testing drug candidates.
Background
Cancer is a disease of the genome that is characterized by
substantial variability in the clinical course, outcome, and
response to therapies. A key factor underlying this variability is the genomic heterogeneity of human tumors:
individual tumors of the same histopathological subtype
and anatomical origin typically carry different aberrations in their cellular DNA. Many of the most efficacious
recent drugs target specific genetic aberrations rather
than histological disease subtypes, for example trastuzumab and lapatinib for treating HER2-positive breast
cancers [1], tamoxifen for treating ER-positive breast cancers [2,3], and gefitinib and erlotinib for non-small cell
lung cancer with EGFR mutations [4-8].
* Correspondence: ,
Global Pharmaceutical Research and Development, Abbott Laboratories, 100
Abbott Park Road, Building AP-10, Dep. R4CD, Abbott Park, IL 60064, USA
Full list of author information is available at the end of the article
© 2010 Lu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Several subtypes of common cancers have been identified based on the aberrations of individual cancer genes,
for example HER2-amplified breast cancer [1,9,10],
EGFR-mutated and EGFR-amplified non-small-cell lung
cancer [5,8], and others. However, cancer is a complex
disease driven by the interaction of multiple genes and
pathways [11,12]. Therefore, the copy number status of
individual genes may not be sufficient to define cancer
subtypes and predict the response to treatments. A more
comprehensive cancer taxonomy needs to be designed
based on genome-wide patterns of DNA copy number
abnormalities.
Previous ground-breaking studies have reported
molecular classifications for key cancer types based on
their global patterns of gene expression [13-16]. As
high-density array technology became a reliable tool for
copy number profiling, multiple gene copy number datasets were generated, revealing the genomic heterogeneity
of key cancer types at the gene copy number level [17].
Various clustering methodologies have been applied to
comparative genomic hybridization (CGH) data sets to
classify cancers based on their copy number patterns and
identify copy number aberration hotspots [17-23]. Taxonomies based on gene copy number have a number of
advantages over gene expression-based classifications. In
particular, copy number alterations are stable events, not
affected by cell cycle or cytokine stimulation. Additionally, they show greater consistency between primary
human tumors and cultured cell lines.
Here we developed a copy number-based methodology
for cancer classification in order to enable identification
of genomic subgroups of major cancer types and facilitate
rational selection of tumor models representative of individual subgroups. The methodology is based on the previously published genomic non-negative matrix
factorization (gNMF) algorithm [23-26], with several
major modifications to enhance the performance. We
applied the algorithm to three major tumor types: non-small cell lung carcinoma (NSCLC), colorectal carcinoma
(CRC), and malignant melanoma, identified distinct
genomic subtypes for each cancer, and identified cell lines
representative of each subtype. Our data enabled the
assembly of representative cell line panels for testing drug
candidates.
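For readers unfamiliar with non-negative matrix factorization, the core idea can be illustrated with a minimal sketch. This is not the modified gNMF used in this article: it is plain NMF with the standard multiplicative updates for the Euclidean objective, and the function names, the toy matrix, the fixed iteration count, and the rule of assigning each sample to the factor with its largest coefficient in H are all assumptions made for the example. The data matrix V (features by samples) is approximated by the product of two non-negative matrices, W (features by rank) and H (rank by samples).

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=300, seed=0, eps=1e-9):
    """Plain NMF with multiplicative updates (illustrative, not the paper's gNMF)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Random strictly positive starting point; a single start is used here.
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def assign_clusters(h):
    """Assumed rule: each sample (column of H) joins its largest factor."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


if __name__ == "__main__":
    # Hypothetical toy matrix: 4 features x 6 samples forming two clean blocks.
    toy = [[1, 1, 1, 0, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [0, 0, 0, 1, 1, 1],
           [0, 0, 0, 1, 1, 1]]
    w, h = nmf(toy, rank=2)
    print(assign_clusters(h), frobenius_error(toy, w, h))
```

In this sketch the rank (number of clusters) must be supplied by the caller and only one random start is used, which are the two points that the pilot clustering and the multiple random initiation scheme described in this article are meant to address.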
Methods
Development of a tumor classification methodology based
on genome-wide copy number profiles
The overall flow of our tumor classification methodology
is illustrated in Fig. 1. After data pre-processing, a sample
quality control procedure was applied to eliminate contaminated samples. For the remaining samples, a pilot
hierarchical clustering was first applied to the segment
smoothed tumor and cell line CGH data to determine the
range of possible numbers of clusters, because the number of clusters needs to be fed into the gNMF algorithm,
but is usually unknown for a given data set. The modified
gNMF algorithm was then applied to the same set of segme (...truncated)