CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics

Gai et al. BMC Bioinformatics 2010, 11:74
http://www.biomedcentral.com/1471-2105/11/74

SOFTWARE, Open Access

Xiaowu Gai1†, Juan C Perin1†, Kevin Murphy2, Ryan O’Hara1, Monica D’arcy1, Adam Wenocur1, Hongbo M Xie1, Eric F Rappaport3,4, Tamim H Shaikh4,5, Peter S White1,2*

Abstract

Background: Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist.

Results: We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references.
Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV.

Conclusions: To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects.

Availability and Implementation: Available on the web at: http://sourceforge.net/projects/cnv

* Correspondence:
† Contributed equally
1 Center for Biomedical Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA, 19104, USA

Background

Genome copy number changes (copy number variations, or CNVs) include inherited, de novo, and somatically acquired deviations from a diploid state within a particular chromosome segment. CNVs likely contribute substantially to inherited and/or acquired risk for a variety of human diseases, including cancer and neuropsychiatric disorders [1,2]. In addition, CNVs are widely distributed in the genomes of apparently healthy individuals and thus constitute significant amounts of population-based genomic variation [3-8]. New genotyping technologies such as SNP-based arrays provide high-resolution coverage of entire genomes as well as an opportunity for rapidly determining CNV content in sample collections of interest [4,6,7,9-11]. Accordingly, numerous recent studies have described constellations of structural variants in various healthy and disease cohorts [1,2,12,13].
However, interpretation of the exact extent, character, distribution, and effect of these CNVs has been limited by the emerging nature of computational methods for accurate detection, and further challenged by the difficulty in assessing the biological importance of particular CNVs in context with other genomic features and study findings.

© 2010 Gai et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Detection of CNVs in high-density SNP arrays requires genotypes that yield high-quality intensity and, optimally, allelic ratio data for each locus surveyed. A number of algorithms have been utilized for the detection of CNVs from such genotyping data sets. Software from array vendors such as Illumina and Affymetrix provides basic CNV calls along with graphical interfaces that allow visual inspection of a region of interest. However, these tools generally lack the ability to quickly manage, annotate, and assess CNVs from a sizable number of samples. Moreover, visual inspection becomes challenging when interpreting small or complex rearrangements, or CNVs predicted from genome array data of marginal quality. A number of third-party commercial and open-source tools, including QuantiSNP [14] and PennCNV [15], employ Hidden Markov Models [16] to predict CNVs, and these approaches have been developed and adopted for a number of recent genome-wide studies of structural variation. Equally promising are segmentation algorithms such as GLAD [17] and Circular Binary Segmentation (CBS) [18] that have been successfully applied for analysis of data from array-based comparative genomic hybridization (aCGH) platforms.
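As a rough illustration of the segmentation idea mentioned above, the following Python sketch splits a vector of probe log ratios by repeatedly looking for the interior run ("arc") whose mean differs most from the mean of the remaining probes. It is a simplified stand-in, not the CNV Workshop implementation and not the published CBS algorithm [18]: it omits the permutation-based significance test and the variance normalisation, and it accepts a split whenever the score exceeds a fixed threshold. The names `best_arc` and `segment`, the threshold value, and the toy data are all assumptions made for illustration.

```python
def best_arc(values):
    """Return (score, start, end) for the arc values[start:end] whose mean
    differs most from the mean of the points outside the arc."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)  # placeholder meaning "no split found"
    for start in range(n):
        for end in range(start + 1, n + 1):
            k = end - start
            if k == n:
                continue  # the arc must leave some points outside it
            arc_sum = prefix[end] - prefix[start]
            inside = arc_sum / k
            outside = (total - arc_sum) / (n - k)
            # Two-sample mean difference scaled by the segment sizes
            # (variance term deliberately omitted in this sketch).
            score = abs(inside - outside) * (k * (n - k) / n) ** 0.5
            if score > best[0]:
                best = (score, start, end)
    return best


def segment(values, offset=0, threshold=1.0, min_len=3):
    """Recursively split `values` into half-open (start, end) index ranges.

    `threshold` is a hypothetical fixed cut-off standing in for the
    permutation test used by real CBS implementations. Stretches shorter
    than 2 * min_len are not split further.
    """
    n = len(values)
    if n < 2 * min_len:
        return [(offset, offset + n)]
    score, start, end = best_arc(values)
    if score <= threshold or end - start == n:
        return [(offset, offset + n)]
    pieces = []
    for lo, hi in ((0, start), (start, end), (end, n)):
        if hi > lo:
            pieces.extend(segment(values[lo:hi], offset + lo, threshold, min_len))
    return pieces


if __name__ == "__main__":
    # Toy, noise-free log ratios: a five-probe loss inside a normal region.
    log_ratios = [0.0] * 10 + [-0.6] * 5 + [0.0] * 10
    for start, end in segment(log_ratios):
        seg = log_ratios[start:end]
        print(start, end, round(sum(seg) / len(seg), 3))
```

The exhaustive search over all arcs is quadratic in the number of probes, which is acceptable for a toy example but would need the speed-ups used by production implementations on real arrays with hundreds of thousands of probes.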
These segmentation approaches are particularly attractive as they have been shown to outperform certain HMM-based approaches for aCGH data [19,20]. Regardless of the approach, these algorithms typically overcall CNV events [12,15,21,22], thus requiring post-prediction methods that consider data quality metrics for distinguishing true events from false positives.

Currently, researchers interested in analyzing genotypes for CNV content for the first time, or in setting up production systems for high-throughput analysis and interpretation, are challenged by the considerable variety and limited scope of most existing methods and tools. This is especially true in the use of SNP arrays for clinical diagnostic applications, where reliability and performance are of critical importance. At the same time, assessing the importance of particular CNVs in context with other genomic features and study findings is a complex task even without robust quality assessment of CNV predictions, especially given limited current knowledge of the distributions of CNVs across the genome and in populations. Contextual genomic and phenotypic annotations need to be considered, while projects involving sizable cohorts also require an infrastructure for managing, accessing, batch-processing, and visualizing annotated CNV predictions.

To address these challenges, we describe the integrated platform CNV Workshop. This package incorp (...truncated)