Development of an integrated genome informatics, data management and workflow infrastructure: A toolbox for the study of complex disease genetics

Human Genomics, Jan 2004

The genetic dissection of complex disease remains a significant challenge. Sample tracking and the recording, processing and storage of high-throughput laboratory data, together with public domain data, require the integration of databases, genome informatics and genetic analyses in an easily updated and scalable format. To find genes involved in multifactorial diseases such as type 1 diabetes (T1D), chromosome regions are defined based on functional candidate gene content, linkage information from humans and animal model mapping information. For each region, genomic information is extracted from Ensembl, converted and loaded into ACeDB for manual gene annotation. Homology information is examined using ACeDB tools and the gene structure is verified. Manually curated genes are extracted from ACeDB and read into the feature database, which holds relevant local genomic feature data and an audit trail of laboratory investigations. Public domain information, manually curated genes, polymorphisms, primers, and linkage and association analyses, with links to our genotyping database, are shown in Gbrowse. This system scales to include genetic, statistical, quality control (QC) and biological data such as expression analyses of RNA or protein, all linked from a genomics integrative display. Our system is applicable to any genetic study of complex disease, of either large or small scale.
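As a rough illustration of the feature database described above, the following sketch stores genomic features alongside an audit trail and retrieves the features overlapping a region. It is not the authors' schema: the table names, column names, and the `GENE_A`/`SNP_1` records are hypothetical, and Python's built-in `sqlite3` is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the schema used in the original system.
SCHEMA = """
CREATE TABLE feature (
    id        INTEGER PRIMARY KEY,
    region    TEXT    NOT NULL,   -- chromosome region under study
    kind      TEXT    NOT NULL,   -- e.g. 'gene', 'snp', 'primer'
    name      TEXT    NOT NULL,
    start_bp  INTEGER NOT NULL,
    end_bp    INTEGER NOT NULL,
    source    TEXT    NOT NULL    -- e.g. 'manual_curation', 'public'
);
CREATE TABLE audit (
    id         INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(id),
    action     TEXT    NOT NULL,
    logged_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_feature(conn, region, kind, name, start_bp, end_bp, source, action):
    """Insert a feature and record the action in the audit trail."""
    cur = conn.execute(
        "INSERT INTO feature (region, kind, name, start_bp, end_bp, source) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (region, kind, name, start_bp, end_bp, source),
    )
    feature_id = cur.lastrowid
    conn.execute(
        "INSERT INTO audit (feature_id, action) VALUES (?, ?)",
        (feature_id, action),
    )
    conn.commit()
    return feature_id


def features_overlapping(conn, region, start_bp, end_bp):
    """Return (name, kind, start, end) for features overlapping an interval."""
    rows = conn.execute(
        "SELECT name, kind, start_bp, end_bp FROM feature "
        "WHERE region = ? AND start_bp <= ? AND end_bp >= ? "
        "ORDER BY start_bp",
        (region, end_bp, start_bp),
    )
    return rows.fetchall()
```

Writing the audit row in the same function as the feature insert keeps the two records together, which is one simple way to ensure that every stored feature has a traceable origin.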

The full text can be downloaded as a PDF:

http://www.humgenomics.com/content/pdf/1479-7364-1-2-98.pdf

Oliver S. Burren, Barry C. Healy, Alex C. Lam, Helen Schuilenburg, Geoffrey E. Dolman, Vincent H. Everett, Davide Laneri, Sarah Nutland, Helen E. Rance, Felicity Payne, Deborah Smyth, Chris Lowe, Bryan J. Barratt, Rebecca C.J. Twells, Daniel B. Rainbow, Linda S. Wicker, John A. Todd, Neil M. Walker, Luc J. Smink

Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Wellcome Trust/MRC Building, Addenbrooke's Hospital, Cambridge, CB2 2XY, UK

Keywords: type 1 diabetes; complex disease; genome informatics; data management; genetics

The availability of the genome sequences for human and mouse,1-3 and for other species, has provided one of the essential reagents for identifying the primary or causal polymorphisms contributing to the inherited risk of common multifactorial disease. The other prerequisite is a substantial number of samples from affected individuals and controls, on the order of thousands.

The large amount of data from the Human Genome Project (HGP) has necessitated the use of comprehensive data repositories such as EMBL, GenBank and DDBJ, and of specific subsets of genomic information such as the Single Nucleotide Polymorphism Database (dbSNP) and the database of Expressed Sequence Tags (dbEST).4-6 Increasingly, however, other information relevant to genomics and genetics has become available, such as protein domains,7,8 Gene Ontology (GO; The Gene Ontology Consortium, 2001) and pathways (KEGG).9 This expansion of data created both the need and the opportunity for databases that integrate genome sequence, homologies, SNPs, proteins, protein domains and annotations, and allow visualisation in a single integrated view.5,10-13 These tools have helped scientists to establish the content of regions of interest with regard to genes, SNPs, homologies and any other features of the genome. Data warehousing strategies, such as EnsMart, have made it possible to answer complex biological queries without computing skills or a large computer setup.12

An essential prerequisite in our effort to find genes involved in type 1 diabetes (T1D) in both human and mouse has been the development of a modular informatics infrastructure based on freely available tools such as Gbrowse,14 ACeDB15,16 and Ensembl.
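To make the idea of displaying local features in a genome browser more concrete, the sketch below writes a feature as one tab-separated, nine-column GFF-style line, the general kind of flat-file input that genome browsers such as Gbrowse can load. The `Feature` class, its field names, and the example values are hypothetical, and the form of the final group column is an assumption; the source does not state which fields or conventions the authors used.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical in-memory record for one genomic feature."""
    seqid: str        # reference sequence, e.g. a chromosome or contig name
    source: str       # origin of the annotation, e.g. 'curated'
    kind: str         # feature type, e.g. 'gene' or 'snp'
    start: int        # 1-based start coordinate
    end: int          # 1-based end coordinate (inclusive)
    strand: str       # '+', '-' or '.'
    group_class: str  # class used in the group column (assumed convention)
    name: str         # feature name


def to_gff_line(feature):
    """Format a feature as a nine-column, tab-separated GFF-style line."""
    if feature.start > feature.end:
        raise ValueError("start must not exceed end")
    if feature.strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    group = f'{feature.group_class} "{feature.name}"'
    columns = [
        feature.seqid,
        feature.source,
        feature.kind,
        str(feature.start),
        str(feature.end),
        ".",             # score: not used in this sketch
        feature.strand,
        ".",             # frame/phase: not used in this sketch
        group,
    ]
    return "\t".join(columns)


def write_gff(features, handle):
    """Write features to an open text handle, sorted by start coordinate."""
    for feature in sorted(features, key=lambda f: (f.seqid, f.start)):
        handle.write(to_gff_line(feature) + "\n")
```

A flat-file exchange format of this kind keeps the curation, storage and display components loosely coupled, which fits the modular design described above.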
All local genomic data are stored in a feature database; the genotyping data are stored in a separate genotyping database. Both are custom relational databases (MySQL).17 Local features can be visualised and integrated with public domain data using Gbrowse. All parts of our system are linked together with Perl and Bioperl.18 This, together with the Gbrowse feature that allows web pages to be linked to genomic features, has allowed the integration of different types of genetic and genomic data using a single visualisation platform. Our solution will be of interest to any research group working on complex disease, providing flexibility and scalability from single gene-based analyses to genome-wide investigations.

Materials and methods

Databases

The barcode management system. The barcode management system (BMS) was developed on a Dell Latitude C600(TM) with a Pentium(TM) III processor and 256 MB of RAM under Microsoft Windows 2000(TM) (SP3). Coding and compilation were carried out using Microsoft Visual Basic (VB) 6.0(TM) and Microsoft Access 2000(TM). Piccolink (RF600) handheld radi (...truncated)


This is a preview of a remote PDF: http://www.humgenomics.com/content/pdf/1479-7364-1-2-98.pdf

Oliver S Burren, Barry C Healy, Alex C Lam, Helen Schuilenburg, Geoffrey E Dolman, Vincent H Everett, Davide Laneri, Sarah Nutland, Helen E Rance, Felicity Payne, Deborah Smyth, Chris Lowe, Bryan J Barratt, Rebecca CJ Twells, Daniel B Rainbow, Linda S Wicker, John A Todd, Neil M Walker, Luc J Smink. Development of an integrated genome informatics, data management and workflow infrastructure: A toolbox for the study of complex disease genetics. Human Genomics, 2004, 1, pp. 98-109.