CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics
Gai et al. BMC Bioinformatics 2010, 11:74
http://www.biomedcentral.com/1471-2105/11/74
SOFTWARE
Open Access
Xiaowu Gai1†, Juan C Perin1†, Kevin Murphy2, Ryan O’Hara1, Monica D’arcy1, Adam Wenocur1, Hongbo M Xie1,
Eric F Rappaport3,4, Tamim H Shaikh4,5, Peter S White1,2*
Abstract
Background: Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes
and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing
availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research
and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants.
However, few informatics tools for accurate and efficient CNV detection and assessment currently exist.
Results: We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV
detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection,
annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and
genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection
algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects
and incorporates a secure user authentication layer and user/admin roles. To assist with determination of
pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation
layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration
with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV.
Conclusions: To our knowledge, CNV Workshop represents the first cohesive and convenient platform for
detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV
Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease
cohorts and is an ideal platform for coordinating multiple associated projects.
Availability and Implementation: Available on the web at: http://sourceforge.net/projects/cnv
* Correspondence:
† Contributed equally
1 Center for Biomedical Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA, 19104, USA
Background
Genome copy number changes (copy number variations,
or CNVs) include inherited, de novo, and somatically
acquired deviations from a diploid state within a particular chromosome segment. CNVs likely contribute substantially to inherited and/or acquired risk for a variety
of human diseases, including cancer and neuropsychiatric disorders [1,2]. In addition, CNVs are widely
distributed in the genomes of apparently healthy individuals and thus constitute significant amounts of population-based genomic variation [3-8]. New genotyping
technologies such as SNP-based arrays provide high-resolution coverage of entire genomes as well as an
opportunity for rapidly determining CNV content in
sample collections of interest [4,6,7,9-11]. Accordingly,
numerous recent studies have described constellations
of structural variants in various healthy and disease
cohorts [1,2,12,13]. However, interpretation of the exact
extent, character, distribution, and effect of these CNVs
has been limited by the emerging nature of
computational methods for accurate detection, and
further challenged by the difficulty in assessing the biological importance of particular CNVs in context with
other genomic features and study findings.
© 2010 Gai et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Detection of CNVs in high-density SNP arrays requires
genotypes that yield high-quality intensity and, optimally,
allelic ratio data for each locus surveyed. A number of
algorithms have been utilized for the detection of CNVs
from such genotyping data sets. Software from array vendors such as Illumina and Affymetrix provides basic CNV
calls along with graphical interfaces that allow visual
inspection of a region of interest. However, these tools
generally lack the ability to quickly manage, annotate,
and assess CNVs from a sizable number of samples.
Moreover, visual inspection becomes challenging for
interpreting small or complex rearrangements, or CNVs
predicted from genome array data of marginal quality. A
number of third-party commercial and open-source tools, including QuantiSNP [14] and PennCNV [15],
employ Hidden Markov Models
[16] to predict CNVs, and these approaches have been
developed and adopted for a number of recent genome-wide studies of structural variation. Equally promising
are segmentation algorithms such as GLAD [17] and Circular Binary Segmentation (CBS) [18], which have been successfully applied for analysis of data from array-based
comparative genomic hybridization (aCGH) platforms.
These segmentation approaches are particularly attractive
as they have been shown to outperform certain HMM-based approaches for aCGH data [19,20]. Regardless of
the approach, these algorithms typically overcall CNV
events [12,15,21,22], thus requiring post-prediction
methods that consider data quality metrics for distinguishing true events from false positives. Currently,
researchers interested in analyzing genotypes for CNV
content for the first time, or in setting up production systems for high-throughput analysis and interpretation, are
challenged by the considerable variety and limited scope
of most existing methods and tools. This is especially
true in the use of SNP arrays for clinical diagnostic applications, where reliability and performance are of critical
importance.
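To make the segmentation idea concrete, the following is a minimal Python sketch of a circular segmentation step applied to per-probe intensity values such as log R ratios. It is an illustration under stated assumptions, not the CBS implementation of [18] and not the CNV Workshop code: the function names (best_arc, segment), the fixed threshold of 6.0, and the minimum segment size of 3 probes are all hypothetical choices, and the published CBS method judges significance with a permutation-based reference distribution rather than a fixed cutoff.

```python
from itertools import accumulate
from math import sqrt


def best_arc(values, min_size):
    """Find the arc [i, j) whose mean differs most from the rest of the data.

    Returns (t, i, j), where t is a two-sample t-like statistic.
    Returns (0.0, None, None) if no admissible arc exists.
    """
    n = len(values)
    psum = [0.0, *accumulate(values)]                 # prefix sums
    psq = [0.0, *accumulate(v * v for v in values)]   # prefix sums of squares
    best = (0.0, None, None)
    for i in range(n):
        for j in range(i + min_size, n + 1):
            k = j - i
            if n - k < min_size:
                continue  # the complement of the arc is too small
            if 0 < i < min_size or 0 < n - j < min_size:
                continue  # would leave a flanking piece below min_size
            s_in = psum[j] - psum[i]
            s_out = psum[n] - s_in
            # Pooled within-group sum of squares (arc vs. complement).
            ss = psq[n] - s_in ** 2 / k - s_out ** 2 / (n - k)
            sd = max(sqrt(max(ss, 0.0) / max(n - 2, 1)), 1e-6)
            t = abs(s_in / k - s_out / (n - k)) / (sd * sqrt(1 / k + 1 / (n - k)))
            if t > best[0]:
                best = (t, i, j)
    return best


def segment(values, threshold=6.0, min_size=3, offset=0):
    """Recursively split values into (start, end, mean) segments.

    threshold is an assumed fixed cutoff; it stands in for the
    permutation test used by the published CBS algorithm.
    """
    n = len(values)
    if n >= 2 * min_size:
        t, i, j = best_arc(values, min_size)
        if i is not None and t > threshold:
            pieces = []
            for a, b in ((0, i), (i, j), (j, n)):
                if b > a:
                    pieces += segment(values[a:b], threshold, min_size, offset + a)
            return pieces
    return [(offset, offset + n, sum(values) / n)]


# Toy data: 16 probes with an interior run of reduced intensity (probes 6-10).
lrr = [0.02, -0.01, 0.0, 0.01, -0.02, 0.01,
       -0.5, -0.52, -0.48, -0.51, -0.49,
       0.0, 0.02, -0.01, 0.01, -0.02]
for start, end, seg_mean in segment(lrr):
    print(start, end, round(seg_mean, 3))
```

Comparing an interior arc against everything outside it, rather than testing a single split point, is what lets this style of segmentation pick up a short aberration flanked by normal signal on both sides. Segments returned by such a procedure would still need the post-prediction quality filtering described above before being treated as CNV calls.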
At the same time, assessing the importance of particular CNVs in context with other genomic features and
study findings is a complex task even without robust
quality assessment of CNV predictions, especially given
limited current knowledge of the distributions of CNVs
across the genome and in populations. Contextual genomic and phenotypic annotations need to be considered,
while projects involving sizable cohorts also require an
infrastructure for managing, accessing, batch-processing,
and visualizing annotated CNV predictions.
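As a small illustration of contextual annotation, the sketch below attaches gene names to predicted CNVs by interval overlap. It is a hypothetical example only: the Interval type, the annotate function, the 0-based half-open coordinate convention, and the placeholder gene names and positions are assumptions made for illustration and do not describe the CNV Workshop schema or any real gene coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive
    name: str


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def annotate(cnvs, genes):
    """Map each CNV name to the names of the genes it overlaps."""
    return {cnv.name: [g.name for g in genes if overlaps(cnv, g)] for cnv in cnvs}


# Placeholder data; names and coordinates are invented for the example.
cnvs = [Interval("chr1", 100, 500, "cnv1"), Interval("chr2", 0, 50, "cnv2")]
genes = [
    Interval("chr1", 450, 900, "GENE_A"),
    Interval("chr1", 500, 600, "GENE_B"),  # starts where cnv1 ends: no overlap
    Interval("chr2", 10, 20, "GENE_C"),
]
print(annotate(cnvs, genes))
```

A linear scan over all genes is adequate for a handful of CNVs; for sizable cohorts, an indexed database query or an interval tree would be the more natural choice.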
To address these challenges, we describe the integrated platform CNV Workshop. This package
incorp (...truncated)