Development of an integrated genome informatics, data management and workflow infrastructure: A toolbox for the study of complex disease genetics
Oliver S. Burren 0
Barry C. Healy 0
Alex C. Lam 0
Helen Schuilenburg 0
Geoffrey E. Dolman 0
Vincent H. Everett 0
Davide Laneri 0
Sarah Nutland 0
Helen E. Rance 0
Felicity Payne 0
Deborah Smyth 0
Chris Lowe 0
Bryan J. Barratt 0
Rebecca C.J. Twells 0
Daniel B. Rainbow 0
Linda S. Wicker 0
John A. Todd 0
Neil M. Walker 0
Luc J. Smink 0
0 Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Wellcome Trust/MRC Building, Addenbrooke's Hospital, Cambridge, CB2 2XY, UK
The genetic dissection of complex disease remains a significant challenge. Sample tracking and the recording, processing and storage of high-throughput laboratory data, together with public domain data, require the integration of databases, genome informatics and genetic analyses in an easily updated and scalable format. To find genes involved in multifactorial diseases such as type 1 diabetes (T1D), chromosome regions are defined based on functional candidate gene content, linkage information from humans and mapping information from animal models. For each region, genomic information is extracted from Ensembl, converted and loaded into ACeDB for manual gene annotation. Homology information is examined using ACeDB tools and the gene structure is verified. Manually curated genes are extracted from ACeDB and read into the feature database, which holds relevant local genomic feature data and an audit trail of laboratory investigations. Public domain information, manually curated genes, polymorphisms, primers, and linkage and association analyses, with links to our genotyping database, are shown in Gbrowse. This system scales to include genetic, statistical, quality control (QC) and biological data such as expression analyses of RNA or protein, all linked from an integrative genomics display. Our system is applicable to any genetic study of complex disease, of either large or small scale.
type 1 diabetes; complex disease; genome informatics; data management; genetics
The availability of the genome sequences for human and
mouse,1-3 and for other species, has provided one of the
essential reagents for identifying the primary or causal
polymorphisms contributing to the inherited risk of common
multifactorial disease. The other prerequisite is substantial
numbers of samples of affected individuals and controls, in
the order of thousands.
The large amount of data from the Human Genome
Project (HGP) has necessitated the use of comprehensive data
repositories such as EMBL, GenBank and DDBJ, and specific
subsets of genomic information such as the Single Nucleotide
Polymorphism Database (dbSNP) and the database of
Expressed Sequence Tags (dbEST).4-6 Increasingly, however,
other information relevant to genomics and genetics has
become available, such as protein domains,7,8 Gene Ontology
(GO; The Gene Ontology Consortium, 2001) and pathways
(KEGG).9 This expansion of data provided the need and
opportunity for databases which integrate genome sequence,
homologies, SNPs, proteins, protein domains and annotations,
and allow visualisation in a single integrated view.5,10-13 These
tools have aided scientists in establishing the content of regions
of interest with regard to genes, SNPs, homologies and any
other features of the genome. Data warehousing strategies,
such as EnsMart, have made answering complex biological
queries possible without the need for computing skills and a
large computer setup.12
An essential prerequisite in our effort to find genes involved
in type 1 diabetes (T1D) in both human and mouse has been
the development of a modular informatics infrastructure based
on freely available tools such as Gbrowse,14 ACeDB15,16 and
Ensembl. All local genomic data are stored in a feature
database, and the genotyping data are stored in a separate genotyping
database. Both are custom relational databases implemented in
MySQL.17 Local features can be visualised and integrated
with public domain data using Gbrowse. All parts of our
system are linked together with Perl and Bioperl.18 This,
together with the Gbrowse feature that allows web pages to be
linked to genomic features, has allowed the integration of
different types of genetic and genomic data using a single
visualisation platform. Our solution will be of interest to any
research group working on complex disease, providing
flexibility and scalability from single gene-based analyses to
genome-wide investigations.
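
As an illustration of how locally stored features might be passed to a genome browser, the following minimal sketch keeps a few features in a relational table and writes them out as GFF3-style lines, a tab-delimited format that Gbrowse can load. It is not the schema or code used in this work: the table layout, the column names, the function names and the example features are all hypothetical, Python is used for brevity rather than the Perl and Bioperl used in our system, and SQLite stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def build_feature_db():
    """Create an in-memory stand-in for a feature database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE feature (
               id        INTEGER PRIMARY KEY,
               seqid     TEXT NOT NULL,     -- chromosome or contig name
               source    TEXT NOT NULL,     -- who produced the feature
               ftype     TEXT NOT NULL,     -- e.g. gene, SNP, primer
               start_pos INTEGER NOT NULL,  -- 1-based, inclusive
               end_pos   INTEGER NOT NULL,
               strand    TEXT NOT NULL,
               name      TEXT NOT NULL
           )"""
    )
    return conn


def add_feature(conn, seqid, source, ftype, start, end, strand, name):
    """Insert one local genomic feature."""
    conn.execute(
        "INSERT INTO feature (seqid, source, ftype, start_pos, end_pos, strand, name) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (seqid, source, ftype, start, end, strand, name),
    )


def to_gff3(conn):
    """Return all features as GFF3-style lines, sorted by sequence and start."""
    lines = ["##gff-version 3"]
    rows = conn.execute(
        "SELECT seqid, source, ftype, start_pos, end_pos, strand, name, id "
        "FROM feature ORDER BY seqid, start_pos"
    )
    for seqid, source, ftype, start, end, strand, name, fid in rows:
        # Column 9 holds attributes; ID and Name are standard GFF3 tags.
        attrs = f"ID={ftype}{fid};Name={name}"
        # Score (column 6) and phase (column 8) are left undefined (".").
        lines.append(
            "\t".join([seqid, source, ftype, str(start), str(end), ".", strand, ".", attrs])
        )
    return lines


if __name__ == "__main__":
    db = build_feature_db()
    # Hypothetical features; names and coordinates are placeholders only.
    add_feature(db, "chr1", "local", "gene", 100, 500, "+", "example_gene")
    add_feature(db, "chr1", "local", "SNP", 150, 150, "+", "example_snp")
    print("\n".join(to_gff3(db)))
```

Keeping the export as a separate step from storage mirrors the modular arrangement described above, in which the relational store and the display are separate components and can be updated independently.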
Materials and methods
Databases
The barcode management system. The barcode management
system (BMS) was developed on a Dell Latitude C600(TM)
with a Pentium(TM) III processor and 256 MB of RAM under
Microsoft Windows 2000(TM) (SP3). Coding and compilation
were carried out using Microsoft Visual Basic (VB) 6.0(TM) and
Microsoft Access 2000(TM). Piccolink (RF600) handheld radi (...truncated)