Whole-genome bisulfite sequencing maps from multiple human tissues reveal novel CpG islands associated with tissue-specific regulation
Human Molecular Genetics, 2016, Vol. 25, No. 1
69–82
doi: 10.1093/hmg/ddv449
Advance Access Publication Date: 28 October 2015
Original Article
1School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA and 2Department of Genetics, Physical
Anthropology and Animal Physiology, University of the Basque Country UPV/EHU, Barrio Sarriena s/n, 48940
Leioa, Spain
*To whom correspondence should be addressed at: School of Biology, Georgia Institute of Technology, 950 Atlantic Drive, Atlanta, GA 30332, USA.
Tel: +1 4043856084; Fax: +1 4048942295; Email:
Abstract
CpG islands (CGIs) are one of the most widely studied regulatory features of the human genome, with critical roles in
development and disease. Despite such significance and the original epigenetic definition, currently used CGI sets are typically
predicted from DNA sequence characteristics. Although CGIs are deeply implicated in practical analyses of DNA methylation,
recent studies have shown that such computational annotations suffer from inaccuracies. Here we used whole-genome bisulfite
sequencing from 10 diverse human tissues to identify a comprehensive, experimentally obtained, single-base resolution CGI
catalog. In addition to the unparalleled annotation precision, our method is free from potential bias due to arbitrary sequence
features or probe affinity differences. Besides clarifying substantial false positives in the widely used University of
California Santa Cruz (UCSC) annotations, our study identifies numerous novel epigenetic loci. In particular, we reveal
a significant impact of transposable elements on the epigenetic regulatory landscape of the human genome and demonstrate
the ubiquitous presence of transcription initiation at CGIs, including alternative promoters in gene bodies and non-coding RNAs in
intergenic regions. Moreover, coordinated DNA methylation and chromatin modifications mark tissue-specific enhancers at
novel CGIs. Enrichment of specific transcription factor binding from ChIP-seq supports mechanistic roles of CGIs in the
regulation of tissue-specific transcription. The new CGI catalog provides a comprehensive and integrated list of genomic
hotspots of epigenetic regulation.
Introduction
Since their initial discovery almost three decades ago (1–3), numerous studies have established the critical importance of CpG
islands (CGIs) in fundamental regulatory and developmental processes (4–8). Originally defined as hypomethylated stretches of
CpG-rich sequences (1–3), CGIs punctuate otherwise heavily
methylated, CpG-depleted mammalian genomes (9–13). Cell
type- and tissue-specific CGI methylation is a key regulatory
signal for genomic imprinting (14), gene expression regulation
(4) and developmental programming (5,7,11,15). Aberrant CGI
methylation is implicated in numerous diseases, particularly
cancers (16,17) and neurodevelopmental disorders (18).
Even though CGIs were originally experimentally defined (1),
subsequent annotations of CGIs relied on sequence-based computational algorithms, due to the lack of actual DNA methylation
data (2,19–21). These computational algorithms have been
Received: May 21, 2015. Revised: October 2, 2015. Accepted: October 21, 2015
© The Author 2015. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Isabel Mendizabal1,2 and Soojin V. Yi1,*
Indeed, important efforts have previously been made to generate an accurate CGI data set (5,22,24). However, these early
studies lacked DNA methylation maps with nucleotide-level
resolution. They were also limited to only a few tissue types.
Here, we utilize whole-genome bisulfite sequencing data sets
(11,15,29–34) generated from diverse cell types, including embryonic stem cells (ESCs), germ cells, fetal tissues and six adult somatic tissues spanning all three germ layers (Fig. 1A). From this
comprehensive collection of whole-genome methylation maps,
we identified more than 50 000 experimentally supported CGIs
(‘eCGIs’). The eCGI catalog presented here is the most comprehensive experimentally defined bona fide CGI catalog to date, revealing a large number of novel CGIs that were previously
undetected. This experimental definition allows for the discovery
of hypomethylated CpG clusters associated with constitutively
expressed genes, thereby expanding the list of CGI genes. Moreover, in contrast to the housekeeping nature of classical promoter
CGIs, many novel eCGIs show promoter- and enhancer-like chromatin features and associate with facultative transcription
factors (TFs) to putatively regulate tissue-specific coding and
non-coding transcription.
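The Introduction contrasts sequence-based CGI prediction with the original epigenetic definition (hypomethylated, CpG-rich sequence). The following is a minimal sketch of that contrast, not the procedure used in this study. The function names, the label strings and the methylation cutoff `max_mean_meth` are hypothetical. The sequence thresholds (length of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6) follow commonly used sequence-based criteria and are illustrative defaults only.

```python
from typing import Sequence


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


def passes_sequence_criteria(seq: str, min_len: int = 200,
                             min_gc: float = 0.5,
                             min_obs_exp: float = 0.6) -> bool:
    """Sequence-only test (illustrative thresholds, see lead-in)."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_obs_exp)


def classify_region(seq: str, cpg_methylation: Sequence[float],
                    max_mean_meth: float = 0.3) -> str:
    """Combine the sequence test with measured per-CpG methylation.

    cpg_methylation holds fractional methylation levels (0 to 1) for the
    CpGs in the region, e.g. from bisulfite sequencing in one tissue.
    max_mean_meth is a hypothetical cutoff chosen for illustration.
    """
    if not cpg_methylation:
        raise ValueError("no methylation measurements for this region")
    mean_meth = sum(cpg_methylation) / len(cpg_methylation)
    hypomethylated = mean_meth <= max_mean_meth
    by_sequence = passes_sequence_criteria(seq)
    if by_sequence and hypomethylated:
        return "sequence and methylation agree"
    if by_sequence:
        return "sequence-only (candidate false positive)"
    if hypomethylated:
        return "methylation-only (candidate false negative)"
    return "neither"
```

For example, `classify_region("CG" * 100, [0.9] * 100)` describes a region that satisfies the sequence criteria but is heavily methylated, the kind of locus the text refers to as a false positive of computational annotation. A real analysis would also need to handle read coverage, multiple tissues and boundary definition, none of which is attempted here.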
Figure 1. (A) Tissues analyzed for eCGI identification, including embryonic, gonad, germ line and fetal tissues, as well as six adult somatic tissues of distinct developmental
origins. These were selected to maximize cell type diversity with respect to gene expression patterns (68) while avoiding tissues with high cellular heterogeneity. Ovaries
comprise germ-line cells and endoderm-derived tissue. The adrenal gland has both ectodermal (medulla) and mesodermal (cortex) origins. (B) The genomic distribution of
eCGIs. (C) The correlation between the numbers of protein-coding genes and eCGIs on each chromosome. (D) Distribution of eCGIs and cCGIs across tissues.
extremely valuable for almost two decades. However, whether
computationally identified CGIs truly represent hypomethylated
CpG clusters has recently been called into question by genome-wide methylation surveys. For example, substantial numbers of
computationally defined CGIs are consistently hypermethylated
in several tissues (5,22,23) (i.e. false positives). Moreover, many
hypomethylated CpG-rich sequences (representing the very definition of CGIs) are missing from the computationally annotated
CGI sets (5,24) (i.e. false negatives). Furthermore, a considerable
fraction of CGIs has undergone CpG loss during recent evolution,
suggesting that they are constitutively methylated and are not
bona fide CGIs (25). With the development of techniques to identify different types of hypomethylated genomic regions (26–28), it
is possible that the term CpG island itself may even be replaced
by another term in the future. Nevertheless, CGIs are still
one of the most widely analyzed genomic elements in epigenetic
analyses, and many commercial toolkits preferentially target
them (23). Consequently, re-visiting the epigenetic definition of
CGIs and providing an experimentally defined CGI catalog that
overcomes the limitations of computational predictions will
off (...truncated)