Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources
Citation: Lahdesmaki H, Rust AG, Shmulevich I (
Harri Lähdesmäki 0
Alistair G. Rust 0
Ilya Shmulevich 0
David Jones, University College London, United Kingdom
0 Institute for Systems Biology, Seattle, Washington, United States of America
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an online web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, a web tool, source code and supplementary data are available at: http://www.probtf.org.
Transcriptional regulation is a central control mechanism for
many biological processes. Transcriptional regulation generally
involves DNA-binding proteins, transcription factors (TFs), that
control gene expression by binding to short regulatory sequence
motifs in gene promoters [1]. DNA-binding specificities of TFs are
encoded in their DNA-binding domains that specialize them to
recognize and bind specific types of binding sites. This mechanism
is the basis of control in complex transcriptional regulatory
networks. Revealing these regulatory mechanisms is one of the key
problems in understanding genome-wide transcriptional
regulation. Although experimental studies and computational
approaches are extending our knowledge of TF binding specificities,
relatively little is known about genome-wide binding of TFs to
gene promoters. Thus, TF binding prediction remains an
important problem in computational biology.
Computational approaches to TF binding site analysis can be
divided into two categories, discovery and prediction. Motif discovery
focuses on searching for novel binding motifs from a collection of
short sequences that are assumed to contain a common regulatory
motif. Several algorithms have been proposed for motif discovery
(for a recent review and comparison, see [2,3]). Accurate motif
discovery is difficult in general, but incorporating additional
information to guide the search for novel sequence signals can
improve performance. Such additional data sources include,
among others, information about co-regulated genes [4],
evolutionary conservation [5,6], physical binding locations as measured
by chromatin immunoprecipitation on chip (ChIP-chip) [7–9],
information on the structural class of TFs [10], and nucleosome
occupancies [11,12].
TF binding prediction, in turn, makes use of given DNA-binding
specificities to predict putative TF binding sites. The binding
preferences can either be the output of a motif discovery algorithm
or they can be experimentally measured, such as those reported in
curated databases (TRANSFAC [13] and JASPAR [14]).
Regardless of the data source, binding site prediction typically
requires some information about binding specificities and is
therefore dependent on previous analysis. Current knowledge of
binding preferences already allows useful predictions to be made
genome-wide. Moreover, several novel techniques for
measuring DNA-binding specificities have recently been developed
[15–19]. For example, Berger et al. [16] have developed a protein
binding microarray (PBM) technology to measure binding
preferences to all k-mers, k currently being 10 base pairs. These
new techniques are rapidly expanding currently available
databases by providing estimates of binding specificities of virtually
any TF in a high-throughput manner. Consequently, they also
offer an approach for rapid and sensitive identification of all TF
binding sites genome-wide. In particular, high-throughput
screening of TF binding specificities combined with accurate TF binding
prediction provides a viable, condition-independent alternative to
somewhat complex ChIP-chip experiments [20]. At the same time,
however, there is a growing need for accurate TF binding
prediction methods.
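To make the notion of predicting binding sites from given DNA-binding specificities concrete, the following minimal sketch scores every window of a sequence against a position specific frequency matrix using a log-odds ratio relative to a background model. It illustrates the conventional scanning idea only, not the probabilistic framework proposed in this paper. The matrix values, the uniform zeroth-order background, and the function names are all hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical 4-position frequency matrix with consensus "TGCA".
# Each entry gives the assumed probability of a nucleotide at that position.
PSFM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

# Assumed uniform zeroth-order background; real analyses often use
# higher-order Markov models estimated from genomic sequence.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(window, psfm, background):
    """Log-likelihood ratio of the motif model versus the background."""
    return sum(
        math.log(column[base] / background[base])
        for column, base in zip(psfm, window)
    )


def scan(sequence, psfm, background):
    """Return (start position, log-odds score) for every window."""
    width = len(psfm)
    return [
        (start, log_odds(sequence[start:start + width], psfm, background))
        for start in range(len(sequence) - width + 1)
    ]


def best_site(sequence, psfm, background):
    """Return the highest-scoring window as (start position, score)."""
    return max(scan(sequence, psfm, background), key=lambda item: item[1])
```

For example, calling `best_site("CCTGCACC", PSFM, BACKGROUND)` on this toy input selects the window starting at position 2 (0-based), which matches the assumed consensus exactly. Turning such scores into binding calls still requires a threshold or a significance estimate.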
Although motif discovery methods are relatively well-developed,
the TF binding prediction problem has attracted less attention.
Most of the previous binding site prediction tools have been
formulated as hypothesis testing methods, where a significance
value of TF binding at a specific sequence position is obtained by
comparing a test statistic to a null distribution [21–28], and
possibly correcting the significance level for multiple testing.
Traditional scanning methods for TF binding site prediction are
known to perform relatively poorly in that they typically have an
excessively high false positive rate (see [29]). This reported poor
performance is not directly a shortcoming of previous prediction
methods but has more to do with the fact that models to represent
binding motifs and background sequences alone do not contain
sufficient information for accurate binding site detection. This
suggests that one possible approach to improve binding site
prediction is to develop better motif (and background) models than
the currently used position specific frequency model (PSFM) for
binding sites and Markovian models for background. For example,
observed dependencies between binding site nucleotides [30,16]
can be incorporated into motif models [31]. However, the use of
more complex models, such as general (...truncated)