A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure
BIOINFORMATICS
ORIGINAL PAPER
Gene expression
Vol. 29 no. 22 2013, pages 2884–2891
doi:10.1093/bioinformatics/btt498
Advance Access publication August 29, 2013
Tamar Sofer1,*, Elizabeth D. Schifano2, Jane A. Hoppin3, Lifang Hou4 and
Andrea A. Baccarelli5,6
1 Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, SPH2, 4th floor, Boston, MA 02115, USA
2 Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT 06269, USA
3 NIEHS, Epidemiology Branch, MD A3-05, PO Box 12233, Research Triangle Park, NC 27709, USA
4 Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 680 N Lake Shore Drive, Suite 1400, Chicago, IL 60611, USA
5 Department of Environmental Health and 6 Department of Epidemiology, Harvard School of Public Health, 401 Park Drive, Landmark Ctr Room 415E, Boston, MA 02215, USA
Associate Editor: Martin Bishop
Motivation: DNA methylation is a heritable, modifiable chemical process that affects gene transcription and is associated with other molecular markers (e.g. gene expression) and biomarkers (e.g. cancer or other diseases). Current technology measures methylation in hundreds of thousands, or millions, of CpG sites throughout the genome. It is evident that neighboring CpG sites are often highly correlated with each other, and current literature suggests that clusters of adjacent CpG sites are co-regulated.
Results: We develop the Adjacent Site Clustering (A-clustering) algorithm to detect sets of neighboring CpG sites that are correlated with each other. To detect methylation regions associated with exposure, we propose an analysis pipeline for high-dimensional methylation data in which CpG sites within regions identified by A-clustering are modeled as multivariate responses to environmental exposure using a generalized estimating equation approach that assumes exposure equally affects all sites in the cluster. We develop a correlation-preserving simulation scheme, and study the proposed methodology via simulations. We study the clusters detected by the algorithm on a high-dimensional dataset of peripheral blood methylation of pesticide applicators.
Availability: We provide the R package Aclust, which efficiently implements the A-clustering algorithm and the analysis pipeline, and produces analysis reports. The package is available at http://www.hsph.harvard.edu/tamar-sofer/packages/
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
Received on March 28, 2013; revised on August 19, 2013; accepted
on August 21, 2013
1 INTRODUCTION
Methylation is a heritable and modifiable chemical process by which, most often, a methyl group attaches to a cytosine base that is followed by guanine on the same DNA strand (CpG dinucleotide, or CpG site). It is sensitive to environmental exposure, such as smoking, air pollution and chemicals (Anttila et al., 2003; Hou et al., 2012; Sofer et al., 2013). Modern array-based platforms measure methylation in hundreds of thousands of CpG sites, and sequencing methods measure methylation in millions of sites. Methylation is often measured as a continuous variable known as a β value, representing the proportion of methylated CpG sites out of the total in the measured tissue. Interestingly, sets of related-by-location CpG sites, whether associated with a gene or not, may be jointly affected by environmental exposure. It is of interest to identify sets of CpG sites that are affected by an exposure, and to do so quickly and in a computationally efficient manner.
*To whom correspondence should be addressed.
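As a small illustration of the methylation proportion described above (commonly called a beta value), the following sketch computes it from methylated and unmethylated signal at each site. The function name `methylation_proportion` and the input numbers are hypothetical, and no platform-specific offset or normalization is applied; this is an assumption-laden toy, not the preprocessing used in the paper.

```python
import numpy as np

def methylation_proportion(methylated, unmethylated):
    """Return methylated / (methylated + unmethylated) for each CpG site."""
    methylated = np.asarray(methylated, dtype=float)
    unmethylated = np.asarray(unmethylated, dtype=float)
    total = methylated + unmethylated
    # A site with no signal at all has an undefined proportion; report NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, methylated / total, np.nan)

# Hypothetical signal for three CpG sites (illustrative numbers only).
beta = methylation_proportion([80, 10, 0], [20, 90, 0])
```

The result lies in [0, 1] for every site with non-zero total signal, which is why it can be treated as a continuous response in the models that follow.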
Methylation occurs throughout the genome, and it differs between tissues and cell types. Although methylation is known to be associated with the control of genes, the mechanisms are still debated (Jones, 2012). The distribution of CpG sites varies across the genome. Areas densely populated with CpG sites are called CpG islands [CGIs; Gardiner-Garden et al. (1987)]. CGIs are often found in the promoter area of genes, and they exhibit low methylation. Higher CGI methylation is associated with gene silencing. Within gene bodies, CpG sites are usually hypermethylated and are found in lower density. However, there are many exceptions to these general rules, such as CGIs within gene bodies or promoter areas without CGIs. There are other, predefined regions associated with CGIs. In addition to the island itself, there are north and south shores and shelves, located up- and downstream from the island, respectively, which are defined according to their distance (in base pairs) from the island (Irizarry et al., 2009; Sandoval et al., 2011). Shores lie within 2 kb of the island, and shelves lie 2–4 kb from the island. To eliminate confusion, we term the collection of shelves, shores and island associated with a single CGI a ‘resort’. The definition of these regions is independent of any actual observed behavior of the sets of associated CpG sites. Further, Jacoby et al. (2012) report finding clusters of methylated CpG sites within specific cell types; these clusters are not related to CGIs. In other words, these regions do not necessarily correspond to regions that are co-regulated. Therefore, it is useful to employ computational tools for discovery of regions with CpG sites
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
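The distance-based island, shore and shelf definitions from the Introduction can be sketched as a small lookup. This is only an illustration: the function name `resort_region`, the label "outside resort", the convention that "north" means a smaller genomic coordinate than the island, and the treatment of a site exactly 2 kb or 4 kb away are all assumptions, not taken from the paper or from any annotation file.

```python
def resort_region(pos, island_start, island_end,
                  shore_width=2000, shelf_width=2000):
    """Label a CpG position relative to one CpG island (coordinates in bp)."""
    if island_start <= pos <= island_end:
        return "island"
    # Distance from the nearest island boundary, and which side the site is on.
    if pos < island_start:
        side, dist = "north", island_start - pos
    else:
        side, dist = "south", pos - island_end
    if dist <= shore_width:                  # within 2 kb of the island
        return f"{side} shore"
    if dist <= shore_width + shelf_width:    # 2-4 kb from the island
        return f"{side} shelf"
    return "outside resort"

# Hypothetical island spanning positions 5000-6000.
labels = [resort_region(p, 5000, 6000) for p in (500, 2000, 3500, 5500, 8000, 9000)]
```

Because the labels depend only on genomic distance, they say nothing about whether the sites in a region actually behave alike, which is the point made in the text.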
with different implementations of the A-clustering algorithm and a subsequent sensitivity analysis. We then compare the analysis pipeline with the Bump Hunting method of Jaffe et al. (2012) and briefly review single-site analysis results. Next, we compare clustering results between two implementation options of A-clustering and dbp-merge on a dataset of peripheral blood methylation of pesticide applicators. We conclude with a discussion in Section 5.
2 MODEL
Suppose there are $i = 1, \ldots, n$ subjects with $j = 1, \ldots, m$ sites with measured methylation. Denote the exposure measure of subject $i$ by $E_i$, and the $1 \times p$ vector of covariates of subject $i$ by $\mathbf{x}_i^T$. We model the methylation of a site $j$ as a linear function of exposure and covariates, according to

$$y_{ij} = \alpha_j + E_i \beta_{Ej} + \mathbf{x}_i^T \boldsymbol{\beta}_{xj} + \epsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m,$$

where this is a general model that lets the $j$th site have a unique baseline methylation value $\alpha_j$, as well as unique exposure effect $\beta_{Ej}$ and covariates’ effects $\boldsymbol{\beta}_{xj}$ on its methylation level. The vector of covariates $\mathbf{x}_i^T$ includes biological covariates. Note that it can potentially include confounders and technical biases, such as variables derived using a Surrogate Variables Analysis (SVA) procedure (Leek and Storey, 2007), but in this article we limit the discussion to the clustering and association analysis and assume that technical biases were already removed (...truncated)
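To make the per-site linear model above concrete, the following sketch simulates data with a site-specific intercept, exposure effect and covariate effects, and recovers them by ordinary least squares fitted to each site separately. All names (`alpha`, `beta_E`, `beta_x`) and all numeric values are hypothetical, and the errors are drawn independently across sites for simplicity, which is an assumption that ignores the between-site correlation the paper is concerned with. This is not the paper's estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 200, 4, 2                  # subjects, sites, covariates (illustrative)

E = rng.normal(size=n)               # exposure, one value per subject
X = rng.normal(size=(n, p))          # covariates, one row per subject
alpha = rng.uniform(0.2, 0.8, size=m)           # site-specific baselines
beta_E = np.array([0.05, 0.05, 0.0, -0.02])     # site-specific exposure effects
beta_x = rng.normal(scale=0.01, size=(p, m))    # site-specific covariate effects
eps = rng.normal(scale=0.02, size=(n, m))       # independent errors (assumption)

# Methylation matrix: rows are subjects, columns are sites.
Y = alpha + np.outer(E, beta_E) + X @ beta_x + eps

# Fit every site at once: the design is [1, E_i, x_i], one coefficient column per site.
design = np.column_stack([np.ones(n), E, X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
alpha_hat, beta_E_hat = coef[0], coef[1]
```

Fitting each column of `Y` separately corresponds to the fully general model, in which no parameter is shared between sites.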