CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities
Published online 30 September 2014
Nucleic Acids Research, 2015, Vol. 43, Database issue D1117–D1123
doi: 10.1093/nar/gku895
Manuel Sánchez-Castillo1,†, David Ruau1,†, Adam C. Wilkinson1, Felicia S.L. Ng1,
Rebecca Hannah1, Evangelia Diamanti1, Patrick Lombard2, Nicola K. Wilson1,* and
Berthold Göttgens1,*
1 Department of Haematology, Wellcome Trust-MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Cambridge University, Cambridge CB2 0XY, UK
2 Wellcome Trust-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, CB2 1QR, UK
Received August 14, 2014; Revised September 15, 2014; Accepted September 16, 2014
ABSTRACT
CODEX (http://codex.stemcells.cam.ac.uk/) is a user-friendly database for the direct access and interrogation of publicly available next-generation sequencing
(NGS) data, specifically aimed at experimental biologists. In an era of multi-centre genomic dataset generation, CODEX provides a single database where
these samples are collected, uniformly processed
and vetted. The main drive of CODEX is to provide
the wider scientific community with instant access to
high-quality NGS data, which, irrespective of the publishing laboratory, is directly comparable. CODEX
allows users to immediately visualize or download
processed datasets, or compare user-generated data
against the database’s cumulative knowledge-base.
CODEX contains four types of NGS experiments:
transcription factor chromatin immunoprecipitation
coupled to high-throughput sequencing (ChIP-Seq),
histone modification ChIP-Seq, DNase-Seq and RNA-Seq. These are largely encompassed within two
specialized repositories, HAEMCODE and ESCODE,
which are focused on haematopoiesis and embryonic stem cell samples, respectively. To date, CODEX
contains over 1000 samples, including 221 unique
TFs and 93 unique cell types. CODEX therefore provides one of the most complete resources of publicly
available NGS data for the direct interrogation of transcriptional programmes that regulate cellular identity
and fate in the context of mammalian development,
homeostasis and disease.
INTRODUCTION
One of the fundamental questions in biology is how a single fertilized egg cell faithfully develops into a multicellular
organism containing specialized organs capable of homeostasis and regeneration, while the genomic content within
each cell remains essentially unchanged. Cell-type specific
transcriptional and chromatin landscapes are critical determinants of the global gene expression patterns that define cell identities and fate choices (1). As key regulators of
these processes, transcription factors (TFs) are thought to
act combinatorially to confer context-specific activities responsible for orchestrating global gene expression patterns
that drive stem cell self-renewal, proliferation, homeostasis,
cell differentiation and specification (2). A unified understanding of these complex processes is still in its infancy.
Two of the most studied systems of mammalian development are the haematopoietic system and embryonic stem
(ES) cells (3–5). The haematopoietic system is also of particular interest in the context of disease, where transcriptional
dysregulation is known to drive numerous haematological
malignancies (6,7).
Recent advances in next-generation sequencing (NGS)
have allowed genome-wide analysis of TF binding and
histone modifications (by chromatin immunoprecipitation
coupled to high-throughput sequencing; ChIP-Seq), identification of open regions of chromatin (by DNase-Seq) and
transcriptomic analysis (by RNA-Seq) (8). Such technologies have the potential to drive key advances in our understanding of mammalian development, homeostasis and disease. Both large international consortia (such as ENCODE
and BLUEPRINT) (9,10) and numerous individual laboratories are effectively generating and releasing such genome-wide datasets into the public domain. Current repositories
for raw NGS data include the Gene Expression Omnibus
(GEO) (11), ArrayExpress (12) and the DNA Data Bank
of Japan (DDBJ) (13). These datasets provide a wealth
of information, for both large-scale whole-genome meta-analyses and the study of single genomic loci.
* To whom correspondence should be addressed. Tel: +44 1223 336829; Fax: +44 1223 762670; Email:
Correspondence may also be addressed to Nicola K. Wilson. Tel: +44 1223 336822; Fax: +44 1223 762670; Email:
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Present Address: David Ruau, Head of Scientific Computing Solutions, da Vinci Building, Melbourn Science Park, Cambridge Road, Melbourn, SG8 6HB, UK.
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
However, the multi-centre nature of this huge data generation effort has had several unintended side effects: (i) the
bioinformatic processing and analysis necessary to provide
informative and biologically relevant insights from such experiments are not uniformly standardized or integrated, (ii)
no public repository provides instant NGS data visualization, (iii) the large size of such NGS datasets (raw RNA-Seq datasets can be 100 GB) is prohibitive for the in-house
processing necessary for visualization and/or further analysis without dedicated computer hardware or bioinformatics expertise and finally (iv) annotation of publicly available
NGS data is often incomplete or non-intuitive, limiting simple data interpretation. These current failures significantly
reduce the utility of such data to the wider research community.
In an effort to bridge this gap between the vast amounts
of publicly available NGS raw data and end-user friendly
information, we have developed CODEX (http://codex.stemcells.cam.ac.uk/), a database of NGS experiments including ChIP-Seq, RNA-Seq and DNase-Seq. CODEX
provides uniformly processed data as well as online resources for NGS data visualization and bioinformatics
analysis. Most importantly, CODEX uses a standardized
bioinformatics-processing pipeline for all NGS datasets,
and the details of each sample are manually curated to
provide key information. CODEX currently includes over
1000 uniformly processed NGS datasets that can be easily
viewed, interrogated and compared by the general scientific
community, for both quick and informative comparisons as
well as large-scale meta-analyses.
The current focus of CODEX is to unify NGS data for
the haematopoietic system and ES cells. CODEX therefore encompasses (...truncated)