A segmental maximum a posteriori approach to genome-wide copy number profiling
BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 6 2008, pages 751–758
doi:10.1093/bioinformatics/btn003
Genome analysis
Robin Andersson1, Carl E. G. Bruder2, Arkadiusz Piotrowski2, Uwe Menzel3, Helena Nord3,
Johanna Sandgren4, Torgeir R. Hvidsten1, Teresita Diaz de Ståhl3, Jan P. Dumanski2,3
and Jan Komorowski1,5,*
Received and revised on December 19, 2007; accepted on January 2, 2008
Advance Access publication January 19, 2008
Associate Editor: Alex Bateman
ABSTRACT
Motivation: Copy number profiling methods aim at assigning DNA
copy numbers to chromosomal regions using measurements from
microarray-based comparative genomic hybridizations. Among the
proposed methods to this end, Hidden Markov Model (HMM)-based
approaches seem promising since DNA copy number transitions are
naturally captured in the model. Current discrete-index HMM-based
approaches do not, however, take into account heterogeneous
information regarding the genomic overlap between clones. Moreover, the majority of existing methods are restricted to chromosome-wise analysis.
Results: We introduce a novel Segmental Maximum A Posteriori
approach, SMAP, for DNA copy number profiling. Our method is
based on discrete-index Hidden Markov Modeling and incorporates
genomic distance and overlap between clones. We exploit a priori
information through user-controllable parameterization that enables
the identification of copy number deviations of various lengths and
amplitudes. The model parameters may be inferred at a genome-wide scale to avoid the overfitting that often results
from chromosome-wise model inference. We report superior performance of SMAP on synthetic data when compared with two recent
methods. When applied to our new experimental data, SMAP readily
recognizes already known genetic aberrations including both large-scale regions with aberrant DNA copy number and changes affecting
only single features on the array. We highlight the differences
between the prediction of SMAP and the compared methods and
show that SMAP accurately determines copy number changes and
benefits from overlap consideration.
Availability: SMAP is available from Bioconductor and within the
Linnaeus Centre for Bioinformatics Data Warehouse.
Contact:
Supplementary information: Supplementary data are available at
http://www.lcb.uu.se/papers/r_andersson/SMAP/
*To whom correspondence should be addressed.
1 INTRODUCTION
The study of human genetic variation at the level of nucleotide
sequence changes constitutes a major challenge and has,
therefore, received considerable attention in the genomic era.
The primary type of variation explored so far has been at
the level of single nucleotide polymorphisms (SNPs). Larger
variations at the level of gains and deletions, also called copy
number variation (CNV), have received less attention.
The genome-wide detection of CNVs has been difficult due to
the lack of high-resolution and high-throughput techniques.
A fundamental step towards identifying such variation has
been the development of microarray-based comparative genomic hybridization (array-CGH) (Mantripragada et al., 2004;
Pinkel et al., 1998; Solinas-Toldo et al., 1997). Recently, two
landmark studies have reported the presence of CNVs in the
human genome using different approaches (Iafrate et al., 2004;
Sebat et al., 2004). Both studies convincingly demonstrate the
presence in normal individuals of genomic imbalances that
overlap with genes and segmental duplications and may
contribute to phenotypic variation and disease susceptibility.
These initial findings have now been followed by a number of
additional reports that further strengthen the evidence for the
importance of CNVs (Redon et al., 2006). The identification of
DNA copy number alterations is also very important in studies
of cancer, indicating that failures in the mechanisms that
maintain the integrity of the genome contribute to tumor
initiation/progression. Structural rearrangements (translocations, inversions) or gains may cause activation of oncogenes,
whereas deletions may underlie haploinsufficiency or inactivate
tumor suppressor genes. All these aberrations may also influence
the expression of so-called phenotype modifier genes. Although
not critical for tumor initiation as such, these genes may greatly
change the clinical picture and outcome of a disease. Discovery
and functional assessment of genomic regions affected by copy
number alterations are thus essential for understanding the
biology of cancer and for diagnostic applications.
In a typical array-CGH experiment, total genomic DNA from
test and reference samples are labeled differently and hybridized
to a microarray. The intensity ratio between the test and
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/)
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
1The Linnaeus Centre for Bioinformatics, Uppsala University, 751 24 Uppsala, Sweden, 2Department of Genetics,
University of Alabama at Birmingham, Birmingham AL 35294-0024, USA, 3Department of Genetics and Pathology,
Rudbeck Laboratory, Uppsala University, 4Department of Surgical Sciences, Uppsala University Hospital,
751 85 Uppsala, Sweden and 5Interdisciplinary Center for Mathematical and Computational Modelling,
Warsaw University, 02-106 Warsaw, Poland
four-state model. Extrapolating this to a normal array-CGH
project of, for example, 100 experiments with 24 chromosomes
each yields an expected execution time of 60 000 CPU hours, i.e.
2500 days or 6.8 years.
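The extrapolation above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (100 experiments, 24 chromosomes each, 60 000 CPU hours). The per-chromosome run time it derives is implied by those figures rather than quoted here, and the variable names are illustrative.

```python
# Figures quoted in the text.
experiments = 100
chromosomes_per_experiment = 24
total_cpu_hours = 60_000

# One chromosome-wise analysis per chromosome per experiment.
runs = experiments * chromosomes_per_experiment

# Run time per chromosome implied by the quoted total (an inference, not a quoted value).
hours_per_run = total_cpu_hours / runs

# Convert the total to days and years of sequential computation.
total_days = total_cpu_hours / 24
total_years = total_days / 365.25

print(runs, hours_per_run, total_days, round(total_years, 1))
```

Under these figures, each chromosome-wise run would take about 25 CPU hours, which is what makes the per-chromosome strategy impractical at project scale.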
The HMM-based methods described above infer the number
of hidden states through model selection and perform copy
number profiling/segmentation separately for each chromosome. Such approaches may easily overfit the model parameters
to local effects in the chromosomes. Interpretation of results
becomes questionable in cases in which inferred means and
variances of the Gaussian distributions associated with a certain
state differ between chromosomes. In some situations, however,
one might prefer chromosome-wise models over genome-wide
ones. Segmentation methods with chromosome-wise models are
appropriate to detect relative copy number alterations between
loci or mosaicism in the same chromosome when the actual copy
number is not of interest (Rueda and Dı́az-Uriarte, 2007).
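The trade-off between chromosome-wise and genome-wide inference can be illustrated with a toy example. The sketch below is not SMAP or any of the cited HMMs. It assumes made-up log2 ratios, uses the median as a simple stand-in for the inferred mean of the "normal" state, and calls a gain with a hypothetical threshold of 0.3.

```python
from statistics import median

# Hypothetical log2 ratios; about 0.58 corresponds to a single-copy gain, log2(3/2).
log2_ratios = {
    "chrA": [0.0, 0.0, 0.0, 0.0, 0.58, 0.58, 0.0, 0.0],  # focal gain
    "chrB": [0.58] * 8,                                   # whole chromosome gained
    "chrC": [0.0] * 8,                                    # no alteration
}

def baseline_chromosome_wise(data):
    """Estimate the 'normal' level separately for each chromosome."""
    return {chrom: median(vals) for chrom, vals in data.items()}

def baseline_genome_wide(data):
    """Estimate one 'normal' level from all chromosomes pooled."""
    pooled = [v for vals in data.values() for v in vals]
    level = median(pooled)
    return {chrom: level for chrom in data}

def count_gains(data, baselines, threshold=0.3):
    """Count features whose deviation from the baseline exceeds the threshold."""
    return {
        chrom: sum(v - baselines[chrom] > threshold for v in vals)
        for chrom, vals in data.items()
    }

gains_chromosome_wise = count_gains(log2_ratios, baseline_chromosome_wise(log2_ratios))
gains_genome_wide = count_gains(log2_ratios, baseline_genome_wide(log2_ratios))

print("chromosome-wise:", gains_chromosome_wise)
print("genome-wide:    ", gains_genome_wide)
```

With these toy values, the chromosome-wise baseline absorbs the whole-chromosome gain on chrB, so no feature there is called, whereas the pooled baseline exposes it. The focal gain on chrA, a relative change within one chromosome, is found either way.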
A number of discrete-index HMM-based methods with
genome-wide parameter estimation have been proposed to
avoid overfitting the HMM parameters to chromosomal
characteristics. Shah et al. (2006) proposed a four-state
HMM in which the parameters are estimated by pooling
across samples using block Gibbs sampling. Engler et al. (2006)
suggested a three-state Gaussian mixture HMM in which the
H (...truncated)