Specific Genomic Regions Are Differentially Affected by Copy Number Alterations across Distinct Cancer Types, in Aggregated Cytogenetic Data
PLoS ONE 7(8): e43689. doi:10.1371/journal.pone.0043689
Nitin Kumar
Haoyang Cai
Christian von Mering
Michael Baudis
Patrick Tan, Duke-National University of Singapore Graduate Medical School, Singapore
1 Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland, 2 Swiss Institute of Bioinformatics, Quartier Sorge, Lausanne, Switzerland
Background: Regional genomic copy number alterations (CNA) are observed in the vast majority of cancers. Besides specifically targeting well-known, canonical oncogenes, CNAs may also play more subtle roles in terms of modulating genetic potential and broad gene expression patterns of developing tumors. Any significant differences in the overall CNA patterns between different cancer types may thus point towards specific biological mechanisms acting in those cancers. In addition, differences among CNA profiles may prove valuable for cancer classifications beyond existing annotation systems.
Principal Findings: We have analyzed molecular-cytogenetic data from 25579 tumor samples, which were classified into 160 cancer types according to the International Classification of Diseases (ICD) coding system. When correcting for differences in the overall CNA frequencies between cancer types, related cancers were often found to cluster together according to similarities in their CNA profiles. Based on a randomization approach, distance measures from the cluster dendrograms were used to identify those specific genomic regions that contributed significantly to this signal. This approach identified 43 non-neutral genomic regions whose propensity for the occurrence of copy number alterations varied with the type of cancer at hand. Only a subset of these identified loci overlapped with previously implied, highly recurrent (hot-spot) cytogenetic imbalance regions.
Conclusions: Thus, for many genomic regions, a simple null hypothesis of independence between cancer type and relative copy number alteration frequency can be rejected. Since a subset of these regions displays relatively low overall CNA frequencies, they may point towards second-tier genomic targets that are adaptively relevant but not necessarily essential for cancer development.
These authors contributed equally to this work.
Genetic changes such as point mutations, regional copy number
alterations/aberrations (CNA) and structural changes (e.g. gene
fusion events) are all hallmarks of cancer. CNAs arise as somatic
changes in the tumor cell genome through a variety of mechanisms
and can be observed in virtually all types of cancer, to a varying
extent. So far, the most widely used methods for the detection of
CNAs have been chromosomal and array-based Comparative
Genomic Hybridization (CGH) techniques [1–4]. Localized,
recurring CNAs (hot-spots) have been shown to target canonical
oncogenes (e.g. duplications/amplifications of the MYC, MYCN,
REL loci) or tumor suppressor genes (e.g. deletions of the
CDKN2A/B, TP53, ATM loci). Some regional CNAs such as
gains on 8q and losses on 3p are present across multiple cancer
types, whereas other imbalances may be largely restricted to a
limited number of cancer entities [5].
Datasets integrated across multiple cancer types have previously
been analyzed to identify regional hot-spots of frequent CNAs
[5,6]. In a given set of individual tumor samples, the number and
distribution of CNAs vary considerably [5], and this genetic
heterogeneity has been used to detect and report co-occurring
CNAs [7].
In principle, specific patterns and similarities in the individual
and/or disease specific CNA profiles might point to distinct
oncogenomic mechanisms acting in different cancer types and
specimens, given a sufficiently large number of data points.
Indeed, clustering of CNA patterns has been used to identify
oncogenomic similarities [5,8–11]. The adaptation of clustering
techniques to the analysis of CNA patterns has been the subject of
previous studies [12–14]. With a few exceptions [5,14], however,
sample-based clustering has been the main focus of such studies so
far. In contrast, we here explore the clustering of cancer types, not
of individual cancer samples.
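The step from sample-level to cancer-type-level data can be illustrated with a minimal sketch. It assumes that each sample is reduced to one 0/1 flag per genomic region and carries a single cancer-type label; the function name `frequency_profiles`, the labels and the toy data are hypothetical, and the sketch does not distinguish gains from losses.

```python
from collections import defaultdict


def frequency_profiles(samples):
    """Aggregate per-sample CNA calls into per-cancer-type frequency profiles.

    `samples` is a list of (cancer_type, calls) pairs, where `calls` holds one
    0/1 flag per genomic region (1 = region altered in that sample).
    Returns {cancer_type: [fraction of samples altered, one value per region]}.
    """
    counts = {}
    totals = defaultdict(int)
    for cancer_type, calls in samples:
        if cancer_type not in counts:
            counts[cancer_type] = [0] * len(calls)
        row = counts[cancer_type]
        if len(row) != len(calls):
            raise ValueError("all samples must cover the same regions")
        for i, flag in enumerate(calls):
            row[i] += flag
        totals[cancer_type] += 1
    return {t: [c / totals[t] for c in row] for t, row in counts.items()}


# Hypothetical toy data: three regions, two cancer types.
toy_samples = [
    ("type_A", [1, 0, 0]),
    ("type_A", [1, 1, 0]),
    ("type_B", [0, 0, 1]),
    ("type_B", [0, 0, 1]),
]
profiles = frequency_profiles(toy_samples)
```

Each cancer type is then represented by a single vector of regional CNA frequencies, which is the object being clustered in a type-based (rather than sample-based) analysis.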
Both descriptive and clustering-based analyses of CNA across
multiple cancer types suffer from a bias towards the more
frequently occurring events. Due to the heterogeneity of the
overall CNA signal, with greatly varying average frequencies of
CNAs per cancer type (Figure 1a), clustering results may be
distorted depending on the disease entities analyzed. This
variation in overall CNA occurrence frequencies across cancer
types may simply be due to differences in the average time points
of clinical detection or to different progression characteristics, and
should be corrected for prior to clustering analyses. To the best of
our knowledge, no implementation has so far been reported for a
comprehensive, very large-scale clustering analysis of
frequency-normalized cancer CNA profiles.
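To make the idea of frequency normalization concrete, the following sketch rescales each cancer-type profile by its own mean CNA frequency. This particular scheme (division by the profile mean) is an assumption chosen for illustration and is not necessarily the normalization used in this study; `normalize_profile` and the example values are hypothetical.

```python
def normalize_profile(freqs):
    """Scale a frequency profile so that its mean across regions is 1.

    Assumption: dividing by the profile mean is one simple way to remove
    differences in overall CNA load between cancer types.
    """
    mean = sum(freqs) / len(freqs)
    if mean == 0:
        return list(freqs)  # no CNAs at all: nothing to rescale
    return [f / mean for f in freqs]


# Two hypothetical cancer types with the same relative pattern
# but a four-fold difference in overall CNA frequency.
high_load = [0.8, 0.4, 0.4, 0.0]
low_load = [0.2, 0.1, 0.1, 0.0]

norm_high = normalize_profile(high_load)
norm_low = normalize_profile(low_load)
```

After rescaling, the two profiles are approximately identical, so a distance-based clustering would no longer separate them merely because one cancer type carries more CNAs overall.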
Here, we focus on the identification of genomic regions that
contribute meaningfully to the clustering of cancer types. From
here on, we refer to these as non-neutral regions. As the
starting point of our analysis, we use hierarchical clustering to
arrange cancer types on the basis of their CNA frequency profiles.
We then employ a permutation approach to estimate the relative
contribution of individual genomic regions to the quality of the
clustering and to the derived relationship tree. The clustering
quality is inferred from an intrinsic measure (summed branch
lengths, i.e. a tree height statistic), and genomic regions for which the
null hypothesis is rejected are termed non-neutral. Identified regions are
compared to canonical CNA hot-spots (i.e. those that occur most
frequently across the entire dataset).
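The permutation idea described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this study: the tree height is approximated by the sum of merge heights from average-linkage clustering on Euclidean distances, one region at a time is shuffled across cancer types, and the reported value is the fraction of permutations whose tree is no taller than the observed one. The function names, the number of permutations and the toy profiles are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tree_height(profiles):
    """Sum of merge heights from average-linkage agglomerative clustering.

    Assumption: this sum stands in for the summed branch lengths of the tree.
    """
    n = len(profiles)
    dist = {(i, j): euclidean(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]
    total = 0.0
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of both clusters.
                pairs = [(min(i, j), max(i, j))
                         for i in clusters[a] for j in clusters[b]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        total += d
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return total


def permutation_p_value(profiles, region, n_perm=200, seed=0):
    """Empirical fraction of permuted trees no taller than the observed tree.

    Only the values of `region` are shuffled across cancer types; all other
    regions stay fixed.
    """
    rng = random.Random(seed)
    observed = tree_height(profiles)
    column = [p[region] for p in profiles]
    hits = 0
    for _ in range(n_perm):
        shuffled = column[:]
        rng.shuffle(shuffled)
        permuted = [p[:region] + [v] + p[region + 1:]
                    for p, v in zip(profiles, shuffled)]
        if tree_height(permuted) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical normalized profiles: six cancer types, three regions.
toy = [
    [2.0, 2.0, 0.1],
    [1.9, 2.1, 0.9],
    [2.1, 1.8, 0.4],
    [0.1, 0.2, 0.8],
    [0.2, 0.1, 0.2],
    [0.0, 0.3, 0.6],
]
p_region0 = permutation_p_value(toy, region=0)
p_region2 = permutation_p_value(toy, region=2)
```

A small value for a region would indicate that its observed assignment to cancer types yields a more compact tree than most random reassignments. Which tail of the distribution is relevant, and which significance threshold applies, are further assumptions of this sketch. For realistic data sizes, a library routine such as SciPy's hierarchical clustering would replace the hand-written loop.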
Our current analysis is based on data from a total of 25579
samples, which are classified into 160 different cancer entities
(Table S1) according to the International Classification of Diseases
for Oncology (ICD-O-3). Our approach is unique in that it a)
focuses less on the clustering as such but more on the individual
genomic regions that best support the clustering, b) uses an
intrinsic quality measure coupled to a permutation strategy for
validation, c) performs CNA frequency normalization prior to
analysis, and d) is based on a very large data set, processed in a
standardized setup. We aim for the identification of potential
cancer-specific driver/modulator regions, which may not have
been detected in earlier, largely hot-spot-focused approaches. All
of the underlying cancer data is available through our Progenetix
repository (www.progenetix.org; [15]).
The average overall frequency of CNAs across the entire
genome varies among different cancer types (Figure 1a). Since the
relative weight of CNAs at individual genom (...truncated)