Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0001820&type=printable

Citation: Lähdesmäki H, Rust AG, Shmulevich I. Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources.

Harri Lähdesmäki, Alistair G. Rust, Ilya Shmulevich

Institute for Systems Biology, Seattle, Washington, United States of America

Editor: David Jones, University College London, United Kingdom

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods.
We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. A test data set, the web tool, source code and supplementary data are available at: http://www.probtf.org.

Transcriptional regulation is a central control mechanism for many biological processes. Transcriptional regulation generally involves DNA-binding proteins, transcription factors (TFs), that control gene expression by binding to short regulatory sequence motifs in gene promoters [1]. DNA-binding specificities of TFs are encoded in their DNA-binding domains, which specialize them to recognize and bind specific types of binding sites. This mechanism is the basis of control in complex transcriptional regulatory networks. Revealing these regulatory mechanisms is one of the key problems in understanding genome-wide transcriptional regulation. Although experimental studies and computational approaches are extending our knowledge of TF binding specificities, relatively little is known about genome-wide binding of TFs to gene promoters. Thus, TF binding prediction remains an important problem in computational biology.

Computational approaches to TF binding site analysis can be divided into two categories: discovery and prediction. Motif discovery focuses on searching for novel binding motifs from a collection of short sequences that are assumed to contain a common regulatory motif. Several algorithms have been proposed for motif discovery (for a recent review and comparison, see [2,3]). Accurate motif discovery is difficult in general, but incorporating additional information to guide the search for novel sequence signals can improve performance.
Such additional data sources include, among others, information about co-regulated genes [4], evolutionary conservation [5,6], physical binding locations as measured by chromatin immunoprecipitation on chip (ChIP-chip) [7-9], information on the structural class of TFs [10], and nucleosome occupancies [11,12].

TF binding prediction, in turn, makes use of given DNA-binding specificities to predict putative TF binding sites. The binding preferences can either be the output of a motif discovery algorithm or they can be experimentally measured, such as those reported in curated databases (TRANSFAC [13] and JASPAR [14]). Regardless of the data source, binding site prediction typically requires some information about binding specificities and is therefore dependent on previous analysis. Current knowledge of binding preferences already allows useful predictions to be made genome-wide.

Moreover, several novel measurement techniques to measure DNA-binding specificities have recently been developed [15-19]. For example, Berger et al. [16] have developed a protein binding microarray (PBM) technology to measure binding preferences to all k-mers, k currently being 10 base pairs. These new techniques are rapidly expanding currently available databases by providing estimates of binding specificities of virtually any TF in a high-throughput manner. Consequently, they also offer an approach for rapid and sensitive identification of all TF binding sites genome-wide. In particular, high-throughput screening of TF binding specificities combined with accurate TF binding prediction provides a viable, condition-independent alternative to somewhat complex ChIP-chip experiments [20]. At the same time, however, there is a growing need for accurate TF binding prediction methods. Although motif discovery methods are relatively well-developed, the TF binding prediction problem has attracted less attention.
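
To make the prediction task concrete, the following is a minimal sketch of how given binding specificities can be used to score candidate sites in a sequence. It illustrates the conventional scanning idea (a motif frequency matrix compared against a background model through a log-odds score), not the probabilistic method proposed in this article. The matrix values, the uniform zeroth-order background, and the names `PSFM`, `BACKGROUND`, `log_odds`, `scan` and `best_site` are all illustrative assumptions and do not come from the paper or from ProbTF.

```python
import math

# Hypothetical 4-position frequency matrix: one dict per motif position,
# giving the probability of each nucleotide. Values are made up.
PSFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed background: independent, uniformly distributed nucleotides.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(window, psfm=PSFM, background=BACKGROUND):
    """Log-likelihood ratio of the motif model versus the background."""
    return sum(
        math.log(column[base] / background[base])
        for column, base in zip(psfm, window)
    )


def scan(sequence, psfm=PSFM, background=BACKGROUND):
    """Score every motif-length window; return a list of (start, score)."""
    width = len(psfm)
    return [
        (start, log_odds(sequence[start:start + width], psfm, background))
        for start in range(len(sequence) - width + 1)
    ]


def best_site(sequence):
    """Return the (start, score) pair with the highest log-odds score."""
    return max(scan(sequence), key=lambda hit: hit[1])
```

In a scanning method, each window score would then be compared with a threshold derived from a null distribution. A real application would also need to handle the reverse strand and ambiguous nucleotides, which this sketch omits.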
Most of the previous binding site prediction tools have been formulated as hypothesis testing methods, where a significance value of TF binding at a specific sequence position is obtained by comparing a test statistic to a null distribution [21-28], and possibly correcting the significance level for multiple testing. Traditional scanning methods for TF binding site prediction are known to perform relatively poorly in that they typically have an excessively high false positive rate (see [29]). This reported poor performance is not directly a shortcoming of previous prediction methods but has more to do with the fact that models to represent binding motifs and background sequences alone do not contain sufficient information for accurate binding site detection.

This suggests that one possible approach to improve binding site prediction is to develop better motif (and background) models than the currently used position-specific frequency model (PSFM) for binding sites and Markovian models for background. For example, observed dependencies between binding site nucleotides [30,16] can be incorporated into motif models [31]. However, the use of more complex models, such as general (...truncated)