CGDSNPdb: a database resource for error-checked and imputed mouse SNPs



Database, Vol. 2010, Article ID baq008, doi:10.1093/database/baq008

Original article

Lucie N. Hutchins¹, Yueming Ding¹, Jin P. Szatkiewicz¹, Randy Von Smith¹, Hyuna Yang¹, Fernando Pardo-Manuel de Villena¹,², Gary A. Churchill¹ and Joel H. Graber¹,*

¹Center for Genome Dynamics, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609 and ²Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC 27599, USA

*Corresponding author: Tel: +1 207 288 6000; Fax: +1 207 288 6847; Email:

Submitted 20 August 2009; Revised 21 January 2010; Accepted 11 March 2010

The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes.
CGDSNPdb serves as the primary interface to two unique data sets, the ‘imputed genotype resource’ in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600 000 SNPs and over 900 000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login.

Database URL: http://cgd.jax.org/cgdsnpdb/

Introduction

Single nucleotide polymorphisms (SNPs) are variable single base positions within a genome that represent the simplest and possibly most common type of genetic variation. Accordingly, SNPs have emerged as a powerful tool for tracking heredity and genetic variation, and have become especially popular for phenotype genome-wide association studies (1, 2). The critical role of the laboratory mouse has led to several efforts aimed at large-scale collection and analysis of mouse SNPs (3–7).

The Center for Genome Dynamics Single Nucleotide Polymorphism database (CGDSNPdb) was designed to bring together multiple sources of mouse SNP data, while checking them for accuracy and consistency among sources. CGDSNPdb is distinguished by the inclusion of two unique data sets:

The Imputed SNP Genotype Resource (IGR) (8) generated by a Hidden Markov Model (HMM) that assigns probable genotype and associated confidence levels for over 8 million SNPs in 74 strains of mice.
Data collected from over 140 strains of laboratory mice (filtered to 72 inbred strains in the current release, version 1.3) with the Mouse Diversity Genotyping Array [MusDiv; (9)], a high density microarray with probes that target 623 124 SNPs and over 900 000 invariant genomic regions targeting features such as exons and copy number variations. MusDiv SNP data will also be submitted to dbSNP following publication of an analysis manuscript (in preparation).

The CGDSNPdb search engine facilitates a number of different queries, including search by chromosome region(s), nearby gene annotations, or SNP identifiers. Results can be returned as dynamic HTML or in flat-text comma-separated-value (CSV) format. Annotations in CGDSNPdb include characteristics of the SNP (e.g. presence in CpG dinucleotide, major/minor allele frequencies), along with functional characteristics of protein-coding genes affected by the SNP (e.g. changes in amino-acid physical and chemical characteristics, changes in codon usage, and overlapping or closest

© The Author(s) 2010. Published by Oxford University Press. This is Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
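The SNP-level annotations named in the Introduction (presence in a CpG dinucleotide, major/minor allele frequencies) can be made concrete with a small example. The following Python sketch is illustrative only and is not taken from the CGDSNPdb code base: the function names, the use of 'N' for a missing call, and the choice to compute frequencies across strains with a called base are all assumptions.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def in_cpg(left_base, alleles, right_base):
    """Return True if any allele forms a forward-strand CpG with a neighbor.

    left_base and right_base are the reference bases immediately
    upstream and downstream of the SNP (hypothetical inputs).
    """
    left, right = left_base.upper(), right_base.upper()
    for allele in alleles:
        allele = allele.upper()
        # A C allele followed by G, or a G allele preceded by C, gives CpG.
        if (allele == "C" and right == "G") or (allele == "G" and left == "C"):
            return True
    return False


def allele_frequencies(genotypes):
    """Return (major, major_freq, minor, minor_freq) over called strains.

    genotypes maps strain name -> base call; anything other than
    A, C, G or T (for example 'N') is treated as missing (assumption).
    """
    counts = Counter(
        call.upper() for call in genotypes.values() if call.upper() in BASES
    )
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, None, 0.0
    # Sort by descending count, then by base, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    major, major_n = ranked[0]
    if len(ranked) == 1:
        return major, 1.0, None, 0.0
    minor, minor_n = ranked[1]
    return major, major_n / total, minor, minor_n / total


if __name__ == "__main__":
    # Made-up calls for illustration; not data from the database.
    calls = {"C57BL/6J": "C", "A/J": "C", "DBA/2J": "T",
             "129S1/SvImJ": "C", "CAST/EiJ": "N"}
    print(allele_frequencies(calls))
    print(in_cpg("A", ["C", "T"], "G"))
```

In this sketch the minor allele is simply the second most frequent called base, which is adequate for biallelic SNPs; how CGDSNPdb itself treats missing calls or additional alleles is not stated in the text.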
Implementation details

The database

CGDSNPdb was implemented using the open source MySQL relational database management system. The database consists of core tables, which contain all pertinent data for each SNP (including all data from the source download), and gene-related tables that facilitate associations between SNPs and neighboring genes. Database schemas are available as Supplementary Figures S1 and S2. Original SNP files from sources and genome assembly files for flanking sequence data are retained and stored separately from the database.

Automated load programs, written in Perl and C++ with SQL queries, integrate data sets from various external sources into the database. These data files have, in general, been obtained directly from the generating source rather than from another accumulative resource, such as dbSNP at NCBI (10) or the Mouse Phenome Database (MPD) (11). The load process (Supplementary Figure S3) includes a number of quality control checks that identify problems or ambiguities with SNPs and correct them, if possible. Quality control checks include comparison of the provided SNP call in C57BL/6J with the same position within the reference genome, genomic comparison of the provided flanking sequences (typically 50 nt upstream and downstream, with a requirement of at least 60% sequence identity in each direction), identification and reso (...truncated)
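The first two quality control checks named above (agreement of the C57BL/6J call with the reference base, and at least 60% identity of each flanking sequence with the genome) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' load code, which the text says is written in Perl and C++ with SQL. It assumes an ungapped, position-by-position comparison, a 0-based SNP position, and a reference held in memory as a string; all function and parameter names are hypothetical, and it only flags problems rather than correcting them.

```python
def flank_identity(provided, reference):
    """Fraction of provided flank bases that match the reference, ungapped.

    Positions missing from the reference (for example near a
    chromosome end) count as mismatches (assumption).
    """
    if not provided:
        return 0.0
    matches = sum(
        1 for p, r in zip(provided.upper(), reference.upper()) if p == r
    )
    return matches / len(provided)


def check_snp(genome, pos, b6_call, up_flank, down_flank, min_identity=0.60):
    """Return a list of problems found for one SNP (empty list = passed).

    genome       reference sequence as a string (hypothetical input)
    pos          0-based position of the SNP in genome
    b6_call      base reported for C57BL/6J by the data source
    up_flank     provided upstream flank, ending just before the SNP
    down_flank   provided downstream flank, starting just after the SNP
    """
    problems = []

    # Check 1: the C57BL/6J call should equal the reference base.
    if genome[pos].upper() != b6_call.upper():
        problems.append("C57BL/6J call differs from reference base")

    # Check 2: each flank must reach the identity threshold on its own.
    ref_up = genome[max(0, pos - len(up_flank)):pos]
    ref_down = genome[pos + 1:pos + 1 + len(down_flank)]
    # Reverse the upstream strings so the comparison is anchored at the
    # base adjacent to the SNP even if the reference flank is shorter.
    if flank_identity(up_flank[::-1], ref_up[::-1]) < min_identity:
        problems.append("upstream flank below identity threshold")
    if flank_identity(down_flank, ref_down) < min_identity:
        problems.append("downstream flank below identity threshold")
    return problems


if __name__ == "__main__":
    # Toy 21-base "genome" with the SNP at index 10; not real sequence data.
    genome = "ACGTACGTAC" + "G" + "TTGACCATGA"
    print(check_snp(genome, 10, "G", "ACGTACGTAC", "TTGACCATGA"))
    print(check_snp(genome, 10, "A", "TTTTTTGTAC", "TTGACCATGA"))
```

A production pipeline would more plausibly use a sequence aligner that tolerates insertions and deletions; the ungapped comparison here is chosen only to keep the thresholding logic visible.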