Learning protein–DNA interaction landscapes by integrating experimental data through computational models
BIOINFORMATICS
Vol. 30 no. 20 2014, pages 2868–2874
doi:10.1093/bioinformatics/btu408
ORIGINAL PAPER
Genome analysis
Advance Access publication June 27, 2014
Jianling Zhong (1), Todd Wasson (2) and Alexander J. Hartemink (1,3,*)
(1) Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, (2) Knowledge Systems and
Informatics, Lawrence Livermore National Laboratory, Livermore, CA 94550 and (3) Department of Computer Science,
Duke University, Durham, NC 27708, USA
ABSTRACT
Motivation: Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors (TFs), nucleosomes and polymerases. A critical step in
deciphering transcriptional regulation is to infer, and eventually predict,
the precise locations of these interactions, along with their strength
and frequency. While recent datasets yield great insight into these
interactions, individual data sources often provide only partial information regarding one aspect of the complete interaction landscape. For
example, chromatin immunoprecipitation (ChIP) reveals the binding
positions of a protein, but only for one protein at a time. In contrast,
nucleases like MNase and DNase can be used to reveal binding positions for many different proteins at once, but cannot easily determine
the identities of those proteins. Currently, few statistical frameworks
jointly model these different data sources to reveal an accurate, holistic view of the in vivo protein–DNA interaction landscape.
Results: Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of
the in vivo protein–DNA interaction landscape. We show that our
framework learns an interaction landscape with increased accuracy,
explaining multiple sets of data in accordance with thermodynamic
principles of competitive DNA binding. The resulting model of genomic
occupancy provides a precise mechanistic vantage point from which
to explore the role of protein–DNA interactions in transcriptional
regulation.
Availability and implementation: The C source code for COMPETE
and Python source code for MCMC-based inference are available at
http://www.cs.duke.edu/amink.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
Received and revised on May 16, 2014; accepted on June 20, 2014
*To whom correspondence should be addressed.
1 INTRODUCTION
As an essential component of transcriptional regulation, the
interaction between DNA-binding factors (DBFs) and DNA
has been studied extensively. To map genome-wide protein–DNA
interactions experimentally, two basic categories of techniques have been developed: ChIP-based methods (numerous
studies in many organisms, but a few examples for yeast are
Harbison et al., 2004; Ren et al., 2000; Rhee and Pugh, 2011);
and nuclease digestion-based methods that profile chromatin
with either DNase (Hesselberth et al., 2009) or MNase
(Henikoff et al., 2011). ChIP methods can be used to reveal
high-resolution DNA interaction sites for a single antibody-targeted factor, especially the recently developed ChIP-exo methods
(Rhee and Pugh, 2011) that use lambda exonuclease to obtain
precise positions of protein binding. Nuclease digestion methods
can be used to efficiently assay genome-wide DNA occupancy of
all proteins at once, but without explicit information about protein identities. These and other experimental efforts over the past
decade have generated a large amount of data regarding the
chromatin landscape and its role in transcriptional regulation.
We now need computational models that can effectively integrate
these data to generate deeper insights into transcriptional
regulation.
A popular set of computational models uses these data to
search for overrepresented DNA sequences bound by certain
DBFs; these are often applied in the setting of motif discovery
(Foat et al., 2006; Harbison et al., 2004; MacIsaac et al., 2006;
Tanay, 2006). More recently, models have been applied to
DNase-seq data to identify ‘digital footprints’ of DBFs (Chen
et al., 2010; Hesselberth et al., 2009; Luo and Hartemink, 2012;
Pique-Regi et al., 2011). However, many of these approaches
share certain drawbacks. First, protein binding is typically treated as a binary event amenable to classification: either a protein
binds at a particular site on the DNA sequence or it does not.
However, both empirical and theoretical work has demonstrated
that proteins bind DNA with continuous occupancy levels [as
reviewed by Biggin (2011)]. Second, most computational methods model the binding events for one kind of protein at a time
instead of taking into consideration the interactions among different kinds of DBFs, especially nucleosomes. Although the
studies of Kaplan et al. (2011), Segal et al. (2008) and Teif and
Rippe (2012) are notable exceptions, they all consider small genomic regions and include only a few TFs, and Segal et al. (2008) did
not consider the role of nucleosomes. Third, and most importantly, almost all current methods fail to integrate different kinds
of datasets. This is suboptimal because data from one kind of
experiment only reveal partial information about the in vivo protein–DNA interaction landscape. For example, ChIP datasets
only contain binding information for one specific protein
under one specific condition; nuclease digestion datasets provide
binding information for all proteins, but do not reveal the
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
Associate Editor: Janet Kelso
2 METHODS
2.1
where $t$ is an index over the $N_i$ DBF binding sites in configuration $i$, $X_t$
denotes a weight associated with DBF $t$, while $S_t$ and $E_t$ denote the start
and end position of the DBF binding site, respectively. $P(S_t, E_t \mid \mathrm{DBF}_t)$ is
the probability of observing the DNA sequence between $S_t$ and $E_t$, given
that DBF $t$ is bound there. To simplify notation, we have treated each
unbound nucleotide as being bound by a special kind of ‘empty’ DBF. If
we use $p_i$ to denote the probability of configuration $i$ after normalization
by the partition function, we can write the probability that DBF $t$ binds
at a specific position $j$ as $\sum_{i \in I(t,j)} p_i$, where $I(t, j)$ is the subset of binding
configurations in the ensemble that have DBF $t$ bound at sequence position $j$.
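The occupancy sum just described can be illustrated with a minimal sketch on a toy sequence. Several things here are assumptions made only for illustration and are not taken from the text: the weight of a configuration is assumed to be the product, over its bound DBFs, of X_t times P(S_t, E_t | DBF_t); the emission probability is assumed to factor over positions as a simple position weight matrix; and the factor names (`tf_a`, `tf_b`), weights, matrices and sequence are all hypothetical. The first function enumerates every configuration explicitly and sums the normalized weights of those with the chosen DBF covering position j (0-based in the code). The second obtains the same quantity from forward and backward sums over prefixes and suffixes, which avoids enumerating the ensemble.

```python
# Toy illustration only: all names, weights and matrices are hypothetical.
UNIFORM = {base: 0.25 for base in "ACGT"}


def col(favoured):
    """One matrix column that prefers a single base (assumed form)."""
    return {base: (0.85 if base == favoured else 0.05) for base in "ACGT"}


# name -> (weight X, per-position emission matrix); 'empty' covers one base.
dbfs = {
    "empty": (1.0, [UNIFORM]),
    "tf_a": (2.0, [col("A"), col("C"), col("G")]),
    "tf_b": (1.5, [col("C"), col("G")]),
}
seq = "TACGCGT"


def emission(pwm, site):
    """P(site | DBF), assuming independent positions."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p


def enumerate_configs(seq, dbfs, start=0):
    """Yield (weight, placements) for every tiling of seq[start:]."""
    if start == len(seq):
        yield 1.0, []
        return
    for name, (x, pwm) in dbfs.items():
        end = start + len(pwm)
        if end > len(seq):
            continue
        w = x * emission(pwm, seq[start:end])
        for rest_w, rest in enumerate_configs(seq, dbfs, end):
            yield w * rest_w, [(name, start, end)] + rest


def occupancy_brute_force(seq, dbfs, name, j):
    """Sum of normalized weights of configurations with `name` covering j."""
    configs = list(enumerate_configs(seq, dbfs))
    z = sum(w for w, _ in configs)  # partition function
    bound = sum(
        w for w, placed in configs
        if any(n == name and s <= j < e for n, s, e in placed)
    )
    return bound / z


def occupancy_forward_backward(seq, dbfs, name, j):
    """Same quantity from prefix (forward) and suffix (backward) sums."""
    n = len(seq)
    fwd = [0.0] * (n + 1)  # fwd[e]: total weight of tilings of seq[:e]
    bwd = [0.0] * (n + 1)  # bwd[s]: total weight of tilings of seq[s:]
    fwd[0] = 1.0
    bwd[n] = 1.0
    for e in range(1, n + 1):
        for x, pwm in dbfs.values():
            s = e - len(pwm)
            if s >= 0:
                fwd[e] += fwd[s] * x * emission(pwm, seq[s:e])
    for s in range(n - 1, -1, -1):
        for x, pwm in dbfs.values():
            e = s + len(pwm)
            if e <= n:
                bwd[s] += x * emission(pwm, seq[s:e]) * bwd[e]
    x, pwm = dbfs[name]
    total = 0.0
    # Every start s whose site [s, s + len(pwm)) covers position j.
    for s in range(max(0, j - len(pwm) + 1), min(j, n - len(pwm)) + 1):
        e = s + len(pwm)
        total += fwd[s] * x * emission(pwm, seq[s:e]) * bwd[e]
    return total / fwd[n]  # fwd[n] is the partition function


if __name__ == "__main__":
    print(occupancy_brute_force(seq, dbfs, "tf_a", 2))
    print(occupancy_forward_backward(seq, dbfs, "tf_a", 2))
```

Because every position is covered by exactly one DBF (possibly the ‘empty’ one) in each configuration, the occupancies of all DBFs at a given position sum to one, which is a convenient sanity check. Enumeration grows exponentially with sequence length, so it is useful only for checking the dynamic program on short sequences.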
This model can be formulated analogously to a hidden Markov model
(HMM) (Rabiner, 1989), in which the states correspond to the binding of
different DBFs and the observations are the DNA sequence. The various
probabilities, along with the partition function, can then be calculated
efficiently using the forward–backward algorithm. For TFs, we have
chosen to represent PðSt ; (...truncated)