Base-resolution methylation patterns accurately predict transcription factor bindings in vivo
Nucleic Acids Research
Tianlei Xu 1 2
Ben Li 1
Meng Zhao 1
Keith E. Szulwach 0
R. Craig Street 0
Li Lin 0
Bing Yao 0
Feiran Zhang 0
Peng Jin 0
Hao Wu 1
Zhaohui S. Qin 1 3
0 Department of Human Genetics, Emory University, School of Medicine, 615 Michael Street, Atlanta, GA 30322, USA
1 Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
2 Department of Mathematics and Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
3 Department of Biomedical Informatics, Emory University, 36 Eagle Row, Atlanta, GA 30322, USA
Detecting in vivo transcription factor (TF) binding is important for understanding gene regulatory circuitries. ChIP-seq is a powerful technique to empirically define TF binding in vivo. However, the multitude of distinct TFs makes genome-wide profiling for them all labor-intensive and costly. Algorithms for in silico prediction of TF binding have been developed, based mostly on histone modification or DNase I hypersensitivity data in conjunction with DNA motif and other genomic features. However, technical limitations of these methods prevent them from being applied broadly, especially in clinical settings. We conducted a comprehensive survey involving multiple cell lines, TFs, and methylation types and found that there are intimate relationships between TF binding and methylation level changes around the binding sites. Exploiting the connection between DNA methylation and TF binding, we proposed a novel supervised learning approach to predict TF-DNA interaction using data from base-resolution whole-genome methylation sequencing experiments. We devised beta-binomial models to characterize methylation data around TF binding sites and the background. Along with other static genomic features, we adopted a random forest framework to predict TF-DNA interaction. After conducting comprehensive tests, we saw that the proposed method accurately predicts TF binding and performs favorably versus competing methods.
A fundamental goal of functional genomic research is to
understand gene regulation. Gene expression can be
controlled by epigenetic mechanisms via the coordinated
binding of transcription factors (TFs), histone modifications,
and DNA methylation (1). An important first step toward
deciphering the complexities of gene regulatory networks is
detecting the activities of functional elements, such as TF
binding sites in the genome.
Advances in high-throughput sequencing technologies
such as ChIP-seq (2–4) and ChIP-exo (5) allow the
comprehensive genome-wide profiling of protein-DNA binding
sites. In recent years, enormous efforts have been made to
map TF binding sites under different biological contexts;
for example, by consortiums like ENCODE (6) and
modENCODE (7). In spite of the successes, the application of
ChIP-seq is still limited by the availability of high-quality
antibodies and a requirement for fresh cells/tissues. The
multitude of distinct proteins makes genome-wide profiling
for all of them labor-intensive and costly. Furthermore,
individual profiling of TF binding is a challenge in clinical
settings because the amount of biological materials
available is often limited. For these reasons, developing in silico
approaches to predict in vivo TF binding sites that do not
rely on ChIP-seq is desirable.
Traditionally, DNA sequence motifs have been used to
predict TF binding (8,9). However, such an approach only
works well for proteins with binding motifs that are highly
specific. For proteins with weak binding motif patterns, the
predictions suffer from low specificity. In addition, the DNA
motif is insufficient to determine whether a TF will bind to
DNA in vivo, which means cell type-specific binding cannot
be determined; additional information is needed to make
that prediction. Recent studies revealed that TF binding is
associated with nucleosome positions (10), histone marks
(4,11), and hypersensitivity to cleavage by DNase I (12,13).
Based on these findings, a number of statistical methods and
software tools have been developed to integrate motif
information with other data types and genome annotations to
achieve better prediction results (10,14–20). All these
methods use histone or DNase I data, as well as the genome
annotations and DNA motifs for prediction. One of the
common limitations is that the histone modification or DNase I
hypersensitivity studies require large amounts of fresh
starting material (at least 10^6 cells). This makes the
existing prediction methods (Supplemental Materials Table S1)
practically inapplicable to clinical samples.
DNA methylation is an important epigenetic
modification with essential roles in many biological
processes (21,22). Methylation of cytosine at carbon five
(5-methylcytosine, or 5mC) regulates gene expression,
determines cell development, and affects numerous disease
pathogeneses (22,23). Exploiting next-generation
sequencing technologies, a powerful experimental assay called
bisulfite sequencing (BS-seq) was developed that measures DNA
methylation at base resolution genome-wide (24–26). The
experiment starts by treating DNA molecules with sodium
bisulfite, which induces deamination and conversion of
unmethylated cytosine to uracil, while methylated cytosine is
protected by the methyl group and remains unchanged. The
uracil is then replaced by thymine during PCR amplification.
The bisulfite-treated and PCR-amplified DNA segments
then go through high-throughput sequencing. After
alignment and preprocessing, BS-seq data can be analyzed by
counting the number of sequencing reads for each CpG
site where either a thymine or a cytosine is observed. The
count of thymine represents the number of sequenced DNA
strands that are unmethylated, and the count of cytosine
represents the number of DNA strands that are methylated
at this CpG site.
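The read-counting step described above can be illustrated with a minimal sketch. It assumes the per-site cytosine and thymine read counts have already been extracted from aligned BS-seq data; the record type, field names, and example coordinates are hypothetical and are not taken from any particular pipeline. The methylation level of a CpG site is estimated here as the fraction of reads showing cytosine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CpGCounts:
    """Hypothetical per-site record: read counts at one CpG after alignment."""
    chrom: str
    pos: int
    c_reads: int  # reads showing cytosine (methylated strands)
    t_reads: int  # reads showing thymine (unmethylated strands)


def methylation_level(site: CpGCounts) -> Optional[float]:
    """Return the fraction of reads that are methylated, or None if uncovered."""
    total = site.c_reads + site.t_reads
    if total == 0:
        return None  # no coverage, so the level cannot be estimated
    return site.c_reads / total


# Illustrative (made-up) counts for three CpG sites.
sites = [
    CpGCounts("chr1", 10468, c_reads=9, t_reads=1),
    CpGCounts("chr1", 10470, c_reads=0, t_reads=12),
    CpGCounts("chr1", 10483, c_reads=0, t_reads=0),
]
levels = [methylation_level(s) for s in sites]
```

A simple ratio of this kind ignores the sampling variability of low-coverage sites, which is one reason a count-based model of the two read totals may be preferable to working with the ratio alone.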
5mC is known to interfere with DNA-protein
interactions, thereby directing transcriptional states (27). For
example, a recent publication reported that 5mC is strongly
correlated with TF binding, where the binding sites are
usually hypomethylated (28). Regulation of DNA-protein
interactions can occur either through affinity of
methyl-CpG-binding proteins for 5mC, or through the refractory
effects of 5mC on some DNA-protein interactions. The
latter is known to directly influence binding of a number of
TFs, such as CTCF (29). Furthermore, more recent
observations have implicated the iterative oxidation of 5mC to
5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC)
and 5-carboxylcytosine (5caC) in pathways that serve to
offset 5mC levels and facilitate TF binding (30). All these
findings indicate that DNA methylation levels offer clues as
to whether TF binding occurred at a particular locus, which
may be exploited as an alternative to the DNase I or histone
data for the purpose of predictin (...truncated)