CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities

Published online 30 September 2014
Nucleic Acids Research, 2015, Vol. 43, Database issue D1117–D1123
doi: 10.1093/nar/gku895

Manuel Sánchez-Castillo1,†, David Ruau1,†, Adam C. Wilkinson1, Felicia S.L. Ng1, Rebecca Hannah1, Evangelia Diamanti1, Patrick Lombard2, Nicola K. Wilson1,* and Berthold Gottgens1,*

1 Department of Haematology, Wellcome Trust-MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Cambridge University, Cambridge CB2 0XY, UK
2 Wellcome Trust-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, CB2 1QR, UK

Received August 14, 2014; Revised September 15, 2014; Accepted September 16, 2014

ABSTRACT

CODEX (http://codex.stemcells.cam.ac.uk/) is a user-friendly database for the direct access and interrogation of publicly available next-generation sequencing (NGS) data, specifically aimed at experimental biologists. In an era of multi-centre genomic dataset generation, CODEX provides a single database where these samples are collected, uniformly processed and vetted. The main drive of CODEX is to provide the wider scientific community with instant access to high-quality NGS data, which, irrespective of the publishing laboratory, is directly comparable. CODEX allows users to immediately visualize or download processed datasets, or compare user-generated data against the database’s cumulative knowledge-base. CODEX contains four types of NGS experiments: transcription factor chromatin immunoprecipitation coupled to high-throughput sequencing (ChIP-Seq), histone modification ChIP-Seq, DNase-Seq and RNA-Seq. These are largely encompassed within two specialized repositories, HAEMCODE and ESCODE, which are focused on haematopoiesis and embryonic stem cell samples, respectively. To date, CODEX contains over 1000 samples, including 221 unique TFs and 93 unique cell types. CODEX therefore provides one of the most complete resources of publicly available NGS data for the direct interrogation of transcriptional programmes that regulate cellular identity and fate in the context of mammalian development, homeostasis and disease.

INTRODUCTION

One of the fundamental questions in biology is how a single fertilized egg cell faithfully develops into a multicellular organism containing specialized organs capable of homeostasis and regeneration, while the genomic content within each cell remains essentially unchanged. Cell-type specific transcriptional and chromatin landscapes are critical determinants of the global gene expression patterns that define cell identities and fate choices (1). As key regulators of these processes, transcription factors (TFs) are thought to act combinatorially to confer context-specific activities responsible for orchestrating global gene expression patterns that drive stem cell self-renewal, proliferation, homeostasis, cell differentiation and specification (2). A unified understanding of these complex processes is still in its infancy.

Two of the most studied systems of mammalian development are the haematopoietic system and embryonic stem (ES) cells (3–5). The haematopoietic system is also of particular interest in the context of disease, where transcriptional dysregulation is known to drive numerous haematological malignancies (6,7).

Recent advances in next-generation sequencing (NGS) have allowed genome-wide analysis of TF binding and histone modifications (by chromatin immunoprecipitation coupled to high-throughput sequencing; ChIP-Seq), identification of open regions of chromatin (by DNase-Seq) and transcriptomic analysis (by RNA-Seq) (8). Such technologies have the potential to drive key advances in our understanding of mammalian development, homeostasis and disease.
* To whom correspondence should be addressed. Tel: +44 1223 336829; Fax: +44 1223 762670. Correspondence may also be addressed to Nicola K. Wilson. Tel: +44 1223 336822; Fax: +44 1223 762670.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Present Address: David Ruau, Head of Scientific Computing Solutions, da Vinci Building, Melbourn Science Park, Cambridge Road, Melbourn, SG8 6HB, UK.

© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Both large international consortia (such as ENCODE and BLUEPRINT) (9,10) and numerous individual laboratories are effectively generating and releasing such genome-wide datasets into the public domain. Current repositories for raw NGS data include the Gene Expression Omnibus (GEO) (11), ArrayExpress (12) and the DNA Data Bank of Japan (DDBJ) (13). These datasets provide a wealth of information, for both large-scale whole-genome meta-analyses and the study of single genomic loci.
However, the multi-centre nature of this huge data generation effort has had several unintended side effects: (i) the bioinformatic processing and analysis necessary to provide informative and biologically relevant insights from such experiments are not uniformly standardized or integrated, (ii) no public repository provides instant NGS data visualization, (iii) the large size of such NGS datasets (raw RNA-Seq datasets can be 100 GB) is prohibitive for the in-house processing necessary for visualization and/or further analysis without dedicated computer hardware or bioinformatics expertise and finally (iv) annotation of publicly available NGS data is often incomplete or non-intuitive, limiting simple data interpretation. These current failures significantly reduce the utility of such data to the wider research community.

In an effort to bridge this gap between the vast amounts of publicly available NGS raw data and end-user friendly information, we have developed CODEX (http://codex.stemcells.cam.ac.uk/), a database of NGS experiments including ChIP-Seq, RNA-Seq and DNase-Seq. CODEX provides uniformly processed data as well as online resources for NGS data visualization and bioinformatics analysis. Most importantly, CODEX uses a standardized bioinformatics-processing pipeline for all NGS datasets, and the details of each sample are manually curated to provide key information. CODEX currently includes over 1000 uniformly processed NGS datasets that can be easily viewed, interrogated and compared by the general scientific community, for both quick and informative comparisons as well as large-scale meta-analyses. The current focus of CODEX is to unify NGS data for the haematopoietic system and ES cells. CODEX therefore encompasses (...truncated)