A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure

BIOINFORMATICS ORIGINAL PAPER
Gene expression
Vol. 29 no. 22 2013, pages 2884–2891
doi:10.1093/bioinformatics/btt498
Advance Access publication August 29, 2013

Tamar Sofer1,*, Elizabeth D. Schifano2, Jane A. Hoppin3, Lifang Hou4 and Andrea A. Baccarelli5,6

1Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, SPH2, 4th floor, Boston, MA 02115, USA
2Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT 06269, USA
3NIEHS, Epidemiology Branch, MD A3-05, PO Box 12233, Research Triangle Park, NC 27709, USA
4Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 680 N Lake Shore Drive, Suite 1400, Chicago, IL 60611, USA
5Department of Environmental Health and 6Department of Epidemiology, Harvard School of Public Health, 401 Park Drive, Landmark Ctr Room 415E, Boston, MA 02215, USA

Associate Editor: Martin Bishop

Motivation: DNA methylation is a heritable and modifiable chemical process that affects gene transcription and is associated with other molecular markers (e.g. gene expression) and biomarkers (e.g. cancer or other diseases). Current technology measures methylation in hundreds of thousands, or millions, of CpG sites throughout the genome. Neighboring CpG sites are often highly correlated with each other, and current literature suggests that clusters of adjacent CpG sites are co-regulated.

Results: We develop the Adjacent Site Clustering (A-clustering) algorithm to detect sets of neighboring CpG sites that are correlated with each other.
To detect methylation regions associated with exposure, we propose an analysis pipeline for high-dimensional methylation data in which CpG sites within regions identified by A-clustering are modeled as multivariate responses to environmental exposure, using a generalized estimating equation approach that assumes exposure equally affects all sites in the cluster. We develop a correlation-preserving simulation scheme and study the proposed methodology via simulations. We study the clusters detected by the algorithm on a high-dimensional dataset of peripheral blood methylation of pesticide applicators.

Availability: We provide the R package Aclust, which efficiently implements A-clustering and the analysis pipeline and produces analysis reports. The package is available at http://www.hsph.harvard.edu/tamar-sofer/packages/

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on March 28, 2013; revised on August 19, 2013; accepted on August 21, 2013

*To whom correspondence should be addressed.

1 INTRODUCTION

Methylation is a heritable and modifiable chemical process by which, most often, a methyl group attaches to a cytosine base that is followed by guanine on the same DNA strand (CpG dinucleotide, or CpG site). It is sensitive to environmental exposure, such as smoking, air pollution and chemicals (Anttila et al., 2003; Hou et al., 2012; Sofer et al., 2013). Modern array-based platforms measure methylation in hundreds of thousands of CpG sites, and sequencing methods measure methylation in millions of sites. Methylation is often measured as a continuous variable known as a β value, representing the proportion of methylated CpG sites out of the total in the measured tissue. Interestingly, sets of related-by-location CpG sites, whether associated with a gene or not, may be jointly affected by environmental exposure.
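As a rough illustration of the adjacent-site idea (grouping neighboring CpG sites whose methylation values are correlated across subjects), the following Python sketch greedily merges consecutive sites. It is a simplified stand-in and not the A-clustering algorithm as implemented in the Aclust package: the function names, the correlation threshold `min_corr` and the distance cutoff `max_gap_bp` are assumptions chosen for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cluster_adjacent_sites(positions, values, min_corr=0.8, max_gap_bp=1000):
    """Greedily group consecutive CpG sites on one chromosome.

    positions: sorted genomic coordinates, one per site.
    values: values[j] holds the methylation values of site j across subjects.
    min_corr, max_gap_bp: hypothetical tuning parameters for this sketch.
    Returns a list of clusters, each a list of site indices.
    """
    if not positions:
        return []
    clusters, current = [], [0]
    for j in range(1, len(positions)):
        close = positions[j] - positions[j - 1] <= max_gap_bp
        if close and pearson(values[j - 1], values[j]) >= min_corr:
            current.append(j)  # extend the current cluster with the neighbor
        else:
            clusters.append(current)  # close the cluster and start a new one
            current = [j]
    clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Made-up data: 5 sites, 4 subjects.
    positions = [100, 250, 400, 5000, 5100]
    values = [
        [0.10, 0.20, 0.30, 0.40],
        [0.12, 0.22, 0.33, 0.41],
        [0.11, 0.19, 0.31, 0.42],
        [0.80, 0.10, 0.50, 0.30],
        [0.30, 0.70, 0.20, 0.60],
    ]
    print(cluster_adjacent_sites(positions, values))
```

In this toy input the first three sites are close together and strongly correlated, so they form one cluster, while the last two sites remain singletons (the fourth is far from the third, and the fifth is negatively correlated with the fourth).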
It is of interest to identify such exposure-affected sets of CpG sites quickly and in a computationally efficient manner.

Methylation occurs throughout the genome, and it differs between tissues and cell types. Although it is known that methylation is associated with the control of genes, the mechanisms are still debated (Jones, 2012). The distribution of CpG sites varies across the genome. Areas densely populated with CpG sites are called CpG islands [CGIs; Gardiner-Garden et al. (1987)]. CGIs are often found in the promoter area of genes, and they exhibit low methylation. Higher CGI methylation is associated with gene silencing. Within gene bodies, CpG sites are usually hypermethylated and are found in lower density. However, there are many exceptions to these general rules, such as CGIs within gene bodies or promoter areas without CGIs.

There are other, predefined, regions associated with CGIs. In addition to the island itself, there are north and south shores and shelves, located up- and downstream from the island, respectively, which are defined according to their distance (in base pairs) from the island (Irizarry et al., 2009; Sandoval et al., 2011). Shores lie within 2 kb of the island, and shelves lie 2–4 kb from the island. To avoid confusion, we term the collection of shelves, shores and island associated with a single CGI a ‘resort’. The definition of these regions is independent of any actual observed behavior of the sets of associated CpG sites. Further, Jacoby et al. (2012) report finding clusters of methylated CpG sites within specific cell types; these clusters are not related to CGIs. In other words, the predefined regions do not necessarily correspond to regions that are co-regulated. Therefore, it is useful to employ computational tools for discovery of regions with CpG sites [...]

© The Author 2013. Published by Oxford University Press. All rights reserved.
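The distance-based definition of a resort given above (island, shores within 2 kb, shelves 2–4 kb away) can be made concrete with a short Python sketch. The function name and its arguments are hypothetical, and the sketch assumes that "upstream" (north) means a smaller genomic coordinate than the island start, which holds only for one strand orientation.

```python
def classify_resort_region(cpg_pos, island_start, island_end,
                           shore_bp=2000, shelf_bp=4000):
    """Label a CpG position relative to a single CpG island.

    Assumption for this sketch: coordinates increase downstream, so positions
    before the island are "north" and positions after it are "south".
    """
    if island_start > island_end:
        raise ValueError("island_start must not exceed island_end")
    if island_start <= cpg_pos <= island_end:
        return "island"
    upstream = cpg_pos < island_start
    # Distance in base pairs from the nearest island boundary.
    distance = island_start - cpg_pos if upstream else cpg_pos - island_end
    side = "north" if upstream else "south"
    if distance <= shore_bp:
        return f"{side} shore"
    if distance <= shelf_bp:
        return f"{side} shelf"
    return "outside resort"


if __name__ == "__main__":
    # Made-up island spanning positions 10000 to 11000.
    for pos in (10500, 9000, 7000, 12500, 14500, 20000):
        print(pos, classify_resort_region(pos, 10000, 11000))
```

Because the labels depend only on distance from the island, two sites with the same label need not behave alike, which is the point made in the text about predefined regions and co-regulation.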
[...] with different implementations of the A-clustering algorithm and a subsequent sensitivity analysis. We then compare the analysis pipeline with the Bump Hunting method of Jaffe et al. (2012) and briefly review single-site analysis results. We then compare clustering results between two implementation options of A-clustering and dbp-merge on a dataset of peripheral blood methylation of pesticide applicators. We conclude with a discussion in Section 5.

2 MODEL

Suppose there are $i = 1, \ldots, n$ subjects with $j = 1, \ldots, m$ sites with measured methylation. Denote the exposure measure of subject $i$ by $E_i$, and the $1 \times p$ vector of covariates of subject $i$ by $x_i^T$. We model the methylation of site $j$ as a linear function of exposure and covariates, according to

$$y_{ij} = \alpha_j + E_i \beta_{Ej} + x_i^T b_{xj} + \epsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m.$$

This is a general model that lets the $j$th site have a unique baseline methylation value $\alpha_j$, as well as a unique exposure effect $\beta_{Ej}$ and covariate effects $b_{xj}$ on its methylation level. The vector of covariates $x_i^T$ includes biological covariates. Note that it can potentially include confounders and technical biases, such as variables derived using a Surrogate Variables Analysis (SVA) procedure (Leek and Storey, 2007), but in this article we limit the discussion to the clustering and association analysis and assume that technical biases were already removed (...truncated)