CGDSNPdb: a database resource for error-checked and imputed mouse SNPs
Lucie N. Hutchins [1], Yueming Ding [1], Jin P. Szatkiewicz [1], Randy Von Smith [1], Hyuna Yang [1], Fernando Pardo-Manuel de Villena [0, 1], Gary A. Churchill [1], Joel H. Graber [1]
[0] Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC 27599, USA
[1] Center for Genome Dynamics, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609
The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes. CGDSNPdb serves as the primary interface to two unique data sets, the 'imputed genotype resource' in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600 000 SNPs and over 900 000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login. Database URL: http://cgd.jax.org/cgdsnpdb/
Introduction
Single nucleotide polymorphisms (SNPs) are variable single
base positions within a genome that represent the simplest
and possibly most common type of genetic variation.
Accordingly, SNPs have emerged as a powerful tool for
tracking heredity and genetic variation, and have become
especially popular for phenotype genome-wide association
studies (1, 2). The critical role of the laboratory mouse has
led to several efforts aimed at large-scale collection and
analysis of mouse SNPs (3–7).
The Center for Genome Dynamics Single Nucleotide
Polymorphism database (CGDSNPdb) was designed to
bring together multiple sources of mouse SNP data, while
checking them for accuracy and consistency among sources.
CGDSNPdb is distinguished by the inclusion of two unique
data sets:
(i) The Imputed SNP Genotype Resource (IGR) (8)
generated by a Hidden Markov Model (HMM) that assigns
probable genotype and associated confidence levels
for over 8 million SNPs in 74 strains of mice.
(ii) Data collected from over 140 strains of laboratory mice
(filtered to 72 inbred strains in the current release,
version 1.3) with the Mouse Diversity Genotyping Array
[MusDiv; (9)], a high density microarray with probes
that target 623 124 SNPs and over 900 000 invariant
genomic regions targeting features such as exons and
copy number variations. MusDiv SNP data will also be
submitted to dbSNP following publication of an analysis
manuscript (in preparation).
The CGDSNPdb search engine facilitates a number of
different queries, including search by chromosome region(s),
nearby gene annotations, or SNP identifiers. Results can
be returned as dynamic HTML or in flat-text
comma-separated-value (CSV) format.
Annotations in CGDSNPdb include characteristics of
the SNP (e.g. presence in CpG dinucleotide, major/minor
allele frequencies), along with functional characteristics
of protein-coding genes affected by the SNP (e.g.
changes in amino-acid physical and chemical characteristics,
changes in codon usage, and overlapping or closest
neighboring genes). All annotations were generated using
an automated analysis pipeline with subsequent quality
controls, described below.
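Two of the SNP-level annotations named above, CpG context and major/minor allele frequencies, can be illustrated with a minimal sketch. This is not the CGDSNPdb pipeline: the function names, the input layout (a strain-to-base dictionary) and the use of `N` for a missing call are assumptions made only for illustration.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def in_cpg(upstream_base, alleles, downstream_base):
    """Return True if any allele forms a CpG dinucleotide with a neighbor.

    Hypothetical helper: a C allele followed by G, or a G allele preceded
    by C, places the SNP inside a CpG on the given strand.
    """
    up, down = upstream_base.upper(), downstream_base.upper()
    alleles = {a.upper() for a in alleles}
    return ("C" in alleles and down == "G") or ("G" in alleles and up == "C")


def allele_frequencies(genotypes):
    """Return ((major, freq), (minor, freq)) over strains with a base call.

    `genotypes` maps strain name -> single base; anything outside A/C/G/T
    (for example 'N') is treated as a missing call and ignored.
    Returns None if no strain has a call.
    """
    counts = Counter(
        g.upper() for g in genotypes.values() if g.upper() in BASES
    )
    total = sum(counts.values())
    if total == 0:
        return None
    ranked = counts.most_common()
    major = ranked[0]
    # A monomorphic site has no minor allele.
    minor = ranked[1] if len(ranked) > 1 else (None, 0)
    return (major[0], major[1] / total), (minor[0], minor[1] / total)
```

Frequencies here are computed over strains with a call rather than over all strains; whether the database uses the same denominator is not stated in the text.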
CGDSNPdb was constructed primarily as a resource to
support the imputation and mouse diversity array projects;
however, it is being made available as a somewhat smaller,
but high-confidence, collection of mouse SNPs.
Database updates will be driven by the availability of new
or updated genome assemblies, updated releases of major
external SNP data sets, new SNP data sources, and
maintenance. Future growth of the database will be targeted
primarily at large-scale projects such as the mouse
genomes project (http://www.sanger.ac.uk/resources/mouse/
genomes/) as well as data sets that can increase the
represented strain diversity. Minor releases of CGDSNPdb may
also be generated for improved data visualization or
underlying quality control procedures. This manuscript provides a
high-level overview of the main components of CGDSNPdb,
as of version 1.3 (January 2010).
Implementation details
The database
CGDSNPdb was implemented using the open source MySQL
relational database management system. The database
consists of the core tables, containing all pertinent data
for the SNP, including all data from the source download,
and gene related tables that facilitate associations between
SNPs and neighboring genes. Database schemas are
available as Supplementary Figures S1 and S2. Original SNP files
from sources and genome assembly files for flanking
sequence data are retained and stored separately from the
database.
Automated load programs, written in Perl and C++ with
SQL queries, integrate data sets from various external
sources into the database. These data files have, in general,
been obtained directly from the generating source rather
than another accumulative resource, such as dbSNP at NCBI
(10) or the Mouse Phenome Database (MPD) (11). The load
process (Supplementary Figure S3) includes a number of
quality control checks that identify problems or ambiguities
with SNPs and correct them, if possible. Quality control
checks include comparison of the provided SNP call in
C57BL/6J with the same position within the reference
genome, genomic comparison of the provided flanking
sequences (typically 50 nt upstream and downstream, with a
requirement of at least 60% sequence identity in each
direction), identification and resolution of duplicate entries
(as defined by chromosome and position), and comparison
of genotype calls (strain and genotype) among the
different data sources. Genomic coordinates provided by the SNP
source were assumed to be correct, and only challenged
and further tested if the checks of the SNP or flanking
sequence failed. No minimum length was required for
the flanking sequences, but SNP
correction was only possible if flanking sequences were
provided with the source data.
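The checks described above can be sketched roughly as follows. This is an illustrative approximation rather than the authors' load programs (which were written in Perl and C++): the function names, the 0-based coordinate convention, the record layout and the ungapped identity measure are all assumptions. Only the 60% per-flank threshold and the definition of a duplicate (same chromosome and position) come from the text.

```python
from collections import defaultdict

MIN_IDENTITY = 0.60  # per-flank threshold stated in the text


def identity(a, b):
    """Fraction of matching positions in an ungapped, left-aligned comparison."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0  # no flank supplied: nothing to verify against
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / n


def check_snp(genome, pos, b6_call, up_flank, down_flank):
    """Compare one SNP record against a reference chromosome string.

    `pos` is a 0-based index into `genome` (an assumption of this sketch).
    """
    ref_ok = genome[pos].upper() == b6_call.upper()
    ref_up = genome[max(0, pos - len(up_flank)):pos]
    ref_down = genome[pos + 1:pos + 1 + len(down_flank)]
    # Reverse both upstream strings so the bases adjacent to the SNP line up.
    up_id = identity(up_flank[::-1], ref_up[::-1])
    down_id = identity(down_flank, ref_down)
    return {
        "ref_ok": ref_ok,
        "up_identity": up_id,
        "down_identity": down_id,
        "flanks_ok": up_id >= MIN_IDENTITY and down_id >= MIN_IDENTITY,
    }


def find_duplicates(records):
    """Group record ids that share the same (chromosome, position)."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[(rec["chrom"], rec["pos"])].append(rec["id"])
    return {site: ids for site, ids in by_site.items() if len(ids) > 1}
```

A record with `ref_ok` false or `flanks_ok` false would be a candidate for the further coordinate testing the text mentions; how such records are then corrected is not covered by this sketch.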
MusDiv (9) SNPs were subjected to additional quality
tests intended to assist in the interpretation of the
microarray hybridization patterns. The MusDiv SNP
hybridization probes are primarily 25-mers, with the SNP typically
centered within the probe. Forward and reverse sense probes
were included for both the reference C57BL/6J
base and the known variant. Alignment of the flanking and
probe sequences to the reference C57BL/6J genome was
made using PASS (12), as it provided the best tradeoff of
speed and alignment sensitivity (data not shown),
especially for the analysis of near matches necessary for the
mouse diversity array probes. PASS was used to align all
four variant classes of probes, identifying all (...truncated)