A protein–protein interaction guided method for competitive transcription factor binding improves target predictions
Kirsti Laurila (1), Olli Yli-Harja (1) and Harri Lähdesmäki (0, 1)
(0) Department of Information and Computer Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland
(1) Department of Signal Processing, Tampere University of Technology, P.O. Box 527, FI-33101 Tampere
An important milestone in revealing cells' functions is to build a comprehensive understanding of transcriptional regulation processes. These processes are largely regulated by transcription factors (TFs) binding to DNA sites. Several TF binding site (TFBS) prediction methods have been developed, but they usually model the binding of a single TF at a time, although a few methods for predicting the binding of multiple TFs also exist. In this article, we propose a probabilistic model that predicts the binding of several TFs simultaneously. Our method explicitly models the competitive binding between TFs and uses prior knowledge of existing protein–protein interactions (PPIs), which mimics the situation in the nucleus. Modeling DNA binding for multiple TFs improves the accuracy of binding site prediction remarkably when compared with other programs and with the cases where individual binding prediction results of separate TFs have been combined. Traditional TFBS prediction methods usually predict an overwhelming number of false positives. Our competitive binding prediction method largely overcomes this lack of specificity. In addition, previously unpredictable binding sites can be detected with the help of PPIs. Source code is available at http://www.cs.tut.fi/~harrila/.
INTRODUCTION
A significant proportion of a cell's functions is determined
by the transcription of genes. Thus, it is important to
understand transcriptional regulation, which is to a
large extent controlled by transcription factors (TFs)
binding to DNA. DNA sites that are bound by a TF
can be identified by experimental methods, such as
electromobility shift assay (EMSA). Moreover, recent
high-throughput methods, including chromatin
immunoprecipitation-chip (ChIP-chip) and -sequencing (ChIP-seq),
have increased our knowledge of TF binding sites
(TFBSs) remarkably. However, these experimental
techniques are laborious and limited by the specificity of
antibodies, and they allow only one protein to be studied
at a time under given conditions. Hence,
computational TFBS prediction methods have an important role in
revealing genome-wide transcriptional regulation.
Most of the existing TFBS prediction methods consider
the binding of a single TF at a time. These methods produce
many false positive predictions because individual sequence
motif models are sensitive but not very specific. Even
though searching for all possible binding sites of one TF
is important, it gives only a limited view of the
transcriptional regulation processes of a cell. Rather than
a single TF regulating the expression of a gene,
several TFs participate in the process in a combinatorial
manner, under certain conditions and at the same time.
Further, other DNA binding TFs are also present in the
nucleus even though they may not regulate the gene of
interest directly. If these TFs have accessible binding
sites on the promoter of the studied gene, they can bind
to DNA and block the binding of the other TFs. For
example, in regulation of collagen type I (1) and in
differentiation processes of hematopoietic stem cells (2), specific
TFs can block the binding of other TFs that are
participating in the regulation. Therefore, the
transcription regulation process by TFs can be thought of as a
competition between TFs. Those TFs that have the
highest affinities for the sequence will, on average,
win the competition for the binding site, but even those
TFs that have lower affinities for this site have their
chance, as determined by the steady state of the physical
binding competition. Competition for binding sites is also
affected by explicit interactions between regulatory
TFs. For these reasons, studying the binding of all
different TFs simultaneously is biologically more realistic than
combining the predictions made for individual TFs.
*To whom correspondence should be addressed. Email:
Correspondence may also be addressed to Harri Lähdesmäki. Tel: +358 3 3115 11; Fax: +358 33 115 4989; Email:
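The idea that higher-affinity TFs win a site on average, while lower-affinity TFs still have their chance, can be illustrated with a textbook equilibrium calculation. The sketch below is not the model proposed in this article. It assumes a single site with mutually exclusive binding, where each TF has a statistical weight equal to its concentration times its affinity and the unbound state has weight 1. The function name, TF names and numbers are hypothetical.

```python
def competitive_occupancy(concentrations, affinities):
    """Equilibrium occupancy of one shared site under mutually exclusive binding.

    Assumption (illustrative only): the weight of TF i is c_i * K_i and the
    unbound state has weight 1, so P(i bound) = c_i*K_i / (1 + sum_j c_j*K_j).
    """
    weights = {tf: concentrations[tf] * affinities[tf] for tf in concentrations}
    z = 1.0 + sum(weights.values())  # normalizer over all states, incl. unbound
    occupancy = {tf: w / z for tf, w in weights.items()}
    occupancy["unbound"] = 1.0 / z
    return occupancy


# Hypothetical example: two TFs at equal concentration, TF_A binds 9x tighter.
conc = {"TF_A": 1.0, "TF_B": 1.0}
aff = {"TF_A": 9.0, "TF_B": 1.0}
occ = competitive_occupancy(conc, aff)
for name, p in occ.items():
    print(f"{name}: {p:.3f}")
```

Under these assumed numbers, the stronger binder occupies the site most of the time, yet the weaker binder keeps a nonzero share, which is the behavior the competition argument relies on.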
A few schemes for predicting the TFBSs of multiple TFs at
the same time already exist. These methods basically use
two different approaches (3). The methods in the first
category search for closely located binding sites, as it is
known that TFs interact with each other in the regulation
process, and thus the TFBSs should be near each other
to allow interactions. These proximal TFBSs can then be
used for further searching and grouping to find
regulating factors, as has been done in (4–6). The other
methods search for so-called cis-regulatory modules.
These modules are clusters of binding sites for TFs
that are known to affect expression together and
possibly to interact with each other. Methods for searching
for cis-regulatory modules are presented, for example, in
(7), where hidden Markov models and expectation
maximization are used, and in (8), which applies a Gibbs
sampler to the model.
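The first category of methods, searching for closely located binding sites, can be illustrated with a minimal sketch. It is not the algorithm of any of the cited methods. It only assumes that predicted site positions for two TFs are already available as integer start coordinates and reports the pairs that lie within a chosen distance. The function name, positions and threshold are hypothetical.

```python
def proximal_pairs(sites_a, sites_b, max_distance):
    """Return (pos_a, pos_b) pairs whose start positions differ by at most max_distance."""
    pairs = []
    for a in sorted(sites_a):
        for b in sorted(sites_b):
            if abs(a - b) <= max_distance:
                pairs.append((a, b))
    return pairs


# Hypothetical predicted site positions on one promoter (illustrative values).
sites_tf_a = [120, 480, 910]
sites_tf_b = [150, 700, 925]
pairs = proximal_pairs(sites_tf_a, sites_tf_b, max_distance=50)
print(pairs)
```

The quadratic scan is adequate for a single promoter. A sorted two-pointer sweep would be a natural replacement for genome-scale input.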
In this article, we present a new method for predicting
the binding of several TFs simultaneously. Our method performs
Bayesian inference for integrated probabilistic sequence
specificity models and TFBSs and uses prior
knowledge of existing protein–protein interactions (PPIs) in
prediction. Modeling results on a carefully constructed set of
binding sites in the mouse genome show remarkable
improvement compared with the cases where the
individual prediction results of separate TFs have been
combined. In particular, the number of false binding sites
decreases significantly, and previously unpredictable
binding sites can be identified. A comparison with
a widely used multiple TFBS prediction method,
MSCAN (6), also shows that our model performs
better.
MATERIALS AND METHODS
MultiTF-PPI: a probabilistic model for competitive TF binding with PPIs
We formulate a PPI-guided probabilistic model for
competitive TF binding prediction, MultiTF-PPI. The goal of
our method is to develop a biologically realistic model that
mimics the situation in the cell. Thus, we take into account
the existence of several TFs in the regulatory process and
their cooperation in the form of explicit and implicit
interactions. As knowledge of existing PPIs is not
always available, we also provide a version of our
multiple TF predictor without PPIs, MultiTF. In our
modeling scheme, we explicitly model the simultaneous
binding of several TFs to the same DNA sequence,
which corresponds to the situation where a large number of
TFs compete for binding to the same sites on a
promoter. The proposed MultiTF-PPI method builds on an
idea similar to our previously developed probabilistic TF
binding prediction method (9), which was developed for
analyzing the binding of a single TF together with additional
sequence-level information. Here, we apply Bayesian
(...truncated)