Base-resolution methylation patterns accurately predict transcription factor bindings in vivo

Nucleic Acids Research, Mar 2015

Detecting in vivo transcription factor (TF) binding is important for understanding gene regulatory circuitries. ChIP-seq is a powerful technique to empirically define TF binding in vivo. However, the multitude of distinct TFs makes genome-wide profiling for them all labor-intensive and costly. Algorithms for in silico prediction of TF binding have been developed, based mostly on histone modification or DNase I hypersensitivity data in conjunction with DNA motif and other genomic features. However, technical limitations of these methods prevent them from being applied broadly, especially in clinical settings. We conducted a comprehensive survey involving multiple cell lines, TFs, and methylation types and found that there are intimate relationships between TF binding and methylation level changes around the binding sites. Exploiting the connection between DNA methylation and TF binding, we proposed a novel supervised learning approach to predict TF–DNA interaction using data from base-resolution whole-genome methylation sequencing experiments. We devised beta-binomial models to characterize methylation data around TF binding sites and the background. Along with other static genomic features, we adopted a random forest framework to predict TF–DNA interaction. After conducting comprehensive tests, we saw that the proposed method accurately predicts TF binding and performs favorably versus competing methods.

Tianlei Xu (1,2), Ben Li (1), Meng Zhao (1), Keith E. Szulwach (0), R. Craig Street (0), Li Lin (0), Bing Yao (0), Feiran Zhang (0), Peng Jin (0), Hao Wu (1), Zhaohui S. Qin (1,3)

(0) Department of Human Genetics, Emory University, School of Medicine, 615 Michael Street, Atlanta, GA 30322, USA
(1) Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
(2) Department of Mathematics and Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
(3) Department of Biomedical Informatics, Emory University, 36 Eagle Row, Atlanta, GA 30322, USA

A fundamental goal of functional genomic research is to understand gene regulation. Gene expression can be controlled by epigenetic mechanisms via the coordinated binding of transcription factors (TFs), histone modifications, and DNA methylation (1). An important first step toward deciphering the complexities of gene regulatory networks is detecting the activities of functional elements, such as TF binding sites in the genome. Advances in high-throughput sequencing technologies such as ChIP-seq (2–4) and ChIP-exo (5) allow the comprehensive genome-wide profiling of protein–DNA binding sites. In recent years, enormous efforts have been made to map TF binding sites under different biological contexts, for example by consortia such as ENCODE (6) and modENCODE (7). Despite these successes, the application of ChIP-seq is still limited by the availability of high-quality antibodies and the requirement for fresh cells or tissues. The multitude of distinct proteins makes genome-wide profiling for all of them labor-intensive and costly. Furthermore, individual profiling of TF binding is a challenge in clinical settings because the amount of biological material available is often limited. For these reasons, developing in silico approaches that predict in vivo TF binding sites without relying on ChIP-seq is desirable.

Traditionally, DNA sequence motifs have been used to predict TF binding (8,9). However, such an approach only works well for proteins with binding motifs that are highly specific. For proteins with weak binding motif patterns, the predictions suffer from low specificity.
In addition, the DNA motif alone is insufficient to determine whether a TF will bind to DNA in vivo, which means cell type-specific binding cannot be determined; additional information is needed to make that prediction. Recent studies revealed that TF binding is associated with nucleosome positions (10), histone marks (4,11), and hypersensitivity to cleavage by DNase I (12,13). Based on these findings, a number of statistical methods and software tools have been developed to integrate motif information with other data types and genome annotations to achieve better prediction results (10,14–20). All these methods use histone or DNase I data, as well as genome annotations and DNA motifs, for prediction. A common limitation is that histone modification and DNase I hypersensitivity studies require large amounts of fresh starting material (from at least 10^6 cells). This makes the existing prediction methods (Supplemental Materials Table S1) practically inapplicable to clinical samples.

DNA methylation is an important epigenetic modification with essential roles in many biological processes (21,22). Methylation of cytosine at carbon five (5-methylcytosine, or 5mC) regulates gene expression, determines cell development, and affects the pathogenesis of numerous diseases (22,23). Exploiting next-generation sequencing technologies, a powerful experimental assay called bisulfite sequencing (BS-seq) was developed that measures DNA methylation at base resolution genome-wide (24–26). The experiment starts by treating DNA molecules with sodium bisulfite, which induces deamination and conversion of unmethylated cytosine to uracil, while methylated cytosine is protected by the methyl group and remains unchanged. The uracil is then amplified as thymine during PCR amplification. The bisulfite-treated and PCR-amplified DNA segments then go through high-throughput sequencing.
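As a rough illustration of the conversion chemistry described above, the following Python sketch mimics bisulfite treatment followed by PCR on a single forward-strand sequence. The function name `bisulfite_convert`, its parameters, and the toy sequence are hypothetical and chosen only for illustration; they are not part of the paper's pipeline, and real reads also involve the reverse strand, incomplete conversion, and sequencing error.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence expected after bisulfite treatment and PCR.

    seq: DNA string for one strand (hypothetical input).
    methylated_positions: set of 0-based indices of methylated cytosines
        (hypothetical input).
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            # Unmethylated C is deaminated to U, which is read as T after PCR.
            converted.append("T")
        else:
            # Methylated C is protected; A, G and T are unaffected.
            converted.append(base)
    return "".join(converted)


# Toy example: only the cytosine at index 1 is methylated.
converted = bisulfite_convert("ACGTCCGA", {1})
print(converted)
```

In this toy case the methylated cytosine survives as C, while the two unmethylated cytosines are read as T.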
After alignment and preprocessing, BS-seq data can be analyzed by counting, for each CpG site, the number of sequencing reads in which either a thymine or a cytosine is observed. The count of thymine represents the number of sequenced DNA strands that are unmethylated, and the count of cytosine represents the number of DNA strands that are methylated at this CpG site.

5mC is known to interfere with DNA–protein interactions, thereby directing transcriptional states (27). For example, a recent publication reported that 5mC is strongly correlated with TF binding, where the binding sites are usually hypomethylated (28). Regulation of DNA–protein interactions can occur either through affinity of methyl-CpG-binding proteins for 5mC, or through the refractory effects of 5mC on some DNA–protein interactions. The latter is known to directly influence binding of a number of TFs, such as CTCF (29). Furthermore, more recent observations have implicated the iterative oxidation of 5mC to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) in pathways that serve to offset 5mC levels and facilitate TF binding (30). All these findings indicate that DNA methylation levels offer clues as to whether TF binding occurred at a particular locus, which may be exploited as an alternative to the DNase I or histone data for the purpose of predictin (...truncated)
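The read-counting step described in the BS-seq paragraph above (cytosine reads counted as methylated, thymine reads as unmethylated) can be sketched in Python as follows. This is a minimal illustration under assumed inputs: the helper `methylation_level`, the `site_counts` dictionary, and the coordinates and counts in it are all made up for the example and do not come from the paper or its data.

```python
def methylation_level(c_count, t_count):
    """Fraction of reads at one CpG site that support methylation.

    c_count: reads showing cytosine (methylated strands).
    t_count: reads showing thymine (unmethylated strands).
    Returns None when the site has no coverage.
    """
    total = c_count + t_count
    if total == 0:
        return None
    return c_count / total


# Hypothetical per-site counts: (chromosome, position) -> (C reads, T reads).
site_counts = {
    ("chr1", 10468): (18, 2),   # mostly methylated
    ("chr1", 10470): (3, 17),   # mostly unmethylated
    ("chr1", 10483): (0, 0),    # no coverage
}

levels = {
    site: methylation_level(c, t) for site, (c, t) in site_counts.items()
}

for site, level in levels.items():
    print(site, level)
```

Keeping the raw cytosine and thymine counts, rather than only the ratio, preserves the coverage information that count-based models such as the beta-binomial mentioned in the abstract rely on.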


This is a preview of a remote PDF: https://nar.oxfordjournals.org/content/43/5/2757.full.pdf
Article home page: http://nar.oxfordjournals.org/content/43/5/2757.abstract

Tianlei Xu, Ben Li, Meng Zhao, Keith E. Szulwach, R. Craig Street, Li Lin, Bing Yao, Feiran Zhang, Peng Jin, Hao Wu, Zhaohui S. Qin. Base-resolution methylation patterns accurately predict transcription factor bindings in vivo, Nucleic Acids Research, 2015, pp. 2757-2766, 43/5, DOI: 10.1093/nar/gkv151