A segmental maximum a posteriori approach to genome-wide copy number profiling

https://academic.oup.com/bioinformatics/article-pdf/24/6/751/49047959/bioinformatics_24_6_751.pdf

BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 6 2008, pages 751–758
doi:10.1093/bioinformatics/btn003

Genome analysis

A segmental maximum a posteriori approach to genome-wide copy number profiling

Robin Andersson1, Carl E. G. Bruder2, Arkadiusz Piotrowski2, Uwe Menzel3, Helena Nord3, Johanna Sandgren4, Torgeir R. Hvidsten1, Teresita Diaz de Ståhl3, Jan P. Dumanski2,3 and Jan Komorowski1,5,*

Received and revised on December 19, 2007; accepted on January 2, 2008
Advance Access publication January 19, 2008
Associate Editor: Alex Bateman

ABSTRACT

Motivation: Copy number profiling methods aim at assigning DNA copy numbers to chromosomal regions using measurements from microarray-based comparative genomic hybridizations. Among the proposed methods to this end, Hidden Markov Model (HMM)-based approaches seem promising since DNA copy number transitions are naturally captured in the model. Current discrete-index HMM-based approaches do not, however, take into account heterogeneous information regarding the genomic overlap between clones. Moreover, the majority of existing methods are restricted to chromosome-wise analysis.

Results: We introduce a novel Segmental Maximum A Posteriori approach, SMAP, for DNA copy number profiling. Our method is based on discrete-index Hidden Markov Modeling and incorporates genomic distance and overlap between clones. We exploit a priori information through user-controllable parameterization that enables the identification of copy number deviations of various lengths and amplitudes. The model parameters may be inferred at a genome-wide scale to avoid overfitting of model parameters often resulting from chromosome-wise model inference. We report superior performance of SMAP on synthetic data when compared with two recent methods.
When applied to our new experimental data, SMAP readily recognizes already known genetic aberrations, including both large-scale regions with aberrant DNA copy number and changes affecting only single features on the array. We highlight the differences between the predictions of SMAP and the compared methods and show that SMAP accurately determines copy number changes and benefits from overlap consideration.

Availability: SMAP is available from Bioconductor and within the Linnaeus Centre for Bioinformatics Data Warehouse.

Contact:

Supplementary information: Supplementary data are available at http://www.lcb.uu.se/papers/r_andersson/SMAP/

*To whom correspondence should be addressed.

1 INTRODUCTION

The study of human genetic variation at the level of nucleotide sequence changes constitutes a major challenge and has, therefore, received considerable attention in the genomic era. The primary type of variation explored so far has been at the level of single nucleotide polymorphisms (SNPs). Larger variations at the level of gains and deletions, also called copy number variation (CNV), have received less attention. The genome-wide detection of CNVs has been difficult due to the lack of high-resolution and high-throughput techniques. A fundamental step towards identifying such variation has been the development of microarray-based comparative genomic hybridization (array-CGH) (Mantripragada et al., 2004; Pinkel et al., 1998; Solinas-Toldo et al., 1997). Recently, two landmark studies have reported the presence of CNVs in the human genome using different approaches (Iafrate et al., 2004; Sebat et al., 2004). Both studies convincingly demonstrate the presence in normal individuals of genomic imbalances that overlap with genes and segmental duplications and may contribute to phenotypic variation and disease susceptibility.
These initial findings have now been followed by a number of additional reports that further strengthen the evidence for the importance of CNVs (Redon et al., 2006).

The identification of DNA copy number alterations is also very important in studies of cancer, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor initiation/progression. Structural rearrangements (translocations, inversions) or gains may cause activation of oncogenes, whereas deletions may underlie haploinsufficiency or inactivate tumor suppressor genes. All these aberrations may also influence the expression of so-called phenotype modifier genes. Although not critical for tumor initiation as such, these genes may greatly change the clinical picture and outcome of a disease. Discovery and functional assessment of genomic regions affected by copy number alterations are thus essential for understanding the biology of cancer and for diagnostic applications.

In a typical array-CGH experiment, total genomic DNA from test and reference samples are labeled differently and hybridized to a microarray. The intensity ratio between the test and [...]

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

1The Linnaeus Centre for Bioinformatics, Uppsala University, 751 24 Uppsala, Sweden, 2Department of Genetics, University of Alabama at Birmingham, Birmingham AL 35294-0024, USA, 3Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, 4Department of Surgical Sciences, Uppsala University Hospital, 751 85 Uppsala, Sweden and 5Interdisciplinary Center for Mathematical and Computational Modelling, Warsaw University, 02-106 Warsaw, Poland

[...] four-state model.
Extrapolating this to a normal array-CGH project of, for example, 100 experiments with 24 chromosomes each yields an expected execution time of 60 000 CPU hours, i.e. 2500 days or 6.8 years.

The HMM-based methods described above infer the number of hidden states through model selection and perform copy number profiling/segmentation separately for each chromosome. Such approaches may easily overfit the model parameters to local effects in the chromosomes. Interpretation of results becomes questionable in cases in which inferred means and variances of the Gaussian distributions associated with a certain state differ between chromosomes. In some situations, however, one might prefer chromosome-wise models over genome-wide ones. Segmentation methods with chromosome-wise models are appropriate to detect relative copy number alterations between loci or mosaicism in the same chromosome when the actual copy number is not of interest (Rueda and Díaz-Uriarte, 2007).

A number of discrete-index HMM-based methods with genome-wide parameter estimation have been proposed to avoid overfitting the HMM parameters to chromosomal characteristics. Shah et al. (2006) proposed a four-state HMM in which the parameters are estimated by pooling across samples using block Gibbs sampling. Engler et al. (2006) suggested a three-state Gaussian mixture HMM in which the H (...truncated)