Learning protein–DNA interaction landscapes by integrating experimental data through computational models

https://academic.oup.com/bioinformatics/article-pdf/30/20/2868/48929839/bioinformatics_30_20_2868.pdf

BIOINFORMATICS Vol. 30 no. 20 2014, pages 2868–2874
doi:10.1093/bioinformatics/btu408
ORIGINAL PAPER
Genome analysis
Advance Access publication June 27, 2014

Learning protein–DNA interaction landscapes by integrating experimental data through computational models

Jianling Zhong¹, Todd Wasson² and Alexander J. Hartemink¹,³,*

¹Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, ²Knowledge Systems and Informatics, Lawrence Livermore National Laboratory, Livermore, CA 94550 and ³Department of Computer Science, Duke University, Durham, NC 27708, USA

ABSTRACT

Motivation: Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors (TFs), nucleosomes and polymerases. A critical step in deciphering transcriptional regulation is to infer, and eventually predict, the precise locations of these interactions, along with their strength and frequency. While recent datasets yield great insight into these interactions, individual data sources often provide only partial information regarding one aspect of the complete interaction landscape. For example, chromatin immunoprecipitation (ChIP) reveals the binding positions of a protein, but only for one protein at a time. In contrast, nucleases like MNase and DNase can be used to reveal binding positions for many different proteins at once, but cannot easily determine the identities of those proteins. Currently, few statistical frameworks jointly model these different data sources to reveal an accurate, holistic view of the in vivo protein–DNA interaction landscape.

Results: Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of the in vivo protein–DNA interaction landscape.
We show that our framework learns an interaction landscape with increased accuracy, explaining multiple sets of data in accordance with thermodynamic principles of competitive DNA binding. The resulting model of genomic occupancy provides a precise mechanistic vantage point from which to explore the role of protein–DNA interactions in transcriptional regulation.

Availability and implementation: The C source code for COMPETE and Python source code for MCMC-based inference are available at http://www.cs.duke.edu/amink.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

Received and revised on May 16, 2014; accepted on June 20, 2014

*To whom correspondence should be addressed.

1 INTRODUCTION

As an essential component of transcriptional regulation, the interaction between DNA-binding factors (DBFs) and DNA has been studied extensively. To map genome-wide protein–DNA interactions experimentally, two basic categories of techniques have been developed: ChIP-based methods (numerous studies in many organisms, but a few examples for yeast are Harbison et al., 2004; Ren et al., 2000; Rhee and Pugh, 2011); and nuclease digestion-based methods that profile chromatin with either DNase (Hesselberth et al., 2009) or MNase (Henikoff et al., 2011). ChIP methods can be used to reveal high-resolution DNA interaction sites for a single antibody-targeted factor, especially the recently developed ChIP-exo methods (Rhee and Pugh, 2011) that use lambda exonuclease to obtain precise positions of protein binding. Nuclease digestion methods can be used to efficiently assay genome-wide DNA occupancy of all proteins at once, but without explicit information about protein identities. These and other experimental efforts over the past decade have generated a large amount of data regarding the chromatin landscape and its role in transcriptional regulation.
We now need computational models that can effectively integrate these data to generate deeper insights into transcriptional regulation. A popular set of computational models use these data to search for overrepresented DNA sequences bound by certain DBFs; these are often applied in the setting of motif discovery (Foat et al., 2006; Harbison et al., 2004; MacIsaac et al., 2006; Tanay, 2006). More recently, models have been applied to DNase-seq data to identify ‘digital footprints’ of DBFs (Chen et al., 2010; Hesselberth et al., 2009; Luo and Hartemink, 2012; Pique-Regi et al., 2011).

However, many of these approaches share certain drawbacks. First, protein binding is typically treated as a binary event amenable to classification: either a protein binds at a particular site on the DNA sequence or it does not. However, both empirical and theoretical work has demonstrated that proteins bind DNA with continuous occupancy levels [as reviewed by Biggin (2011)]. Second, most computational methods model the binding events for one kind of protein at a time instead of taking into consideration the interactions among different kinds of DBFs, especially nucleosomes. Although the work of Kaplan et al. (2011), Segal et al. (2008) and Teif and Rippe (2012) are notable exceptions, these all consider small genomic regions and include only a few TFs; Segal et al. (2008) did not consider the role of nucleosomes. Third, and most importantly, almost all current methods fail to integrate different kinds of datasets. This is suboptimal because data from one kind of experiment only reveal partial information about the in vivo protein–DNA interaction landscape. For example, ChIP datasets only contain binding information for one specific protein under one specific condition; nuclease digestion datasets provide binding information for all proteins, but do not reveal the

© The Author 2014. Published by Oxford University Press. All rights reserved.
For Permissions, please e-mail:
Associate Editor: Janet Kelso

2 METHODS

2.1

where $t$ is an index over the $N_i$ DBF binding sites in configuration $i$, $X_t$ denotes a weight associated with DBF $t$, while $S_t$ and $E_t$ denote the start and end position of the DBF binding site, respectively. $P(S_t, E_t \mid \mathrm{DBF}_t)$ is the probability of observing the DNA sequence between $S_t$ and $E_t$, given that DBF $t$ is bound there. To simplify notation, we have treated each unbound nucleotide as being bound by a special kind of ‘empty’ DBF. If we use $p_i$ to denote the probability of configuration $i$ after normalization by the partition function, we can write the probability that DBF $t$ binds at a specific position $j$ as $\sum_{i \in I(t,j)} p_i$, where $I(t, j)$ is the subset of configurations in the ensemble that have DBF $t$ bound at sequence position $j$.

This model can be formulated analogously to a hidden Markov model (HMM) (Rabiner, 1989), in which the states correspond to the binding of different DBFs and the observations are the DNA sequence. The various probabilities, along with the partition function, can then be calculated efficiently using the forward–backward algorithm. For TFs, we have chosen to represent $P(S_t,$ (...truncated)
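The ensemble computation described in this section (configuration probabilities normalized by a partition function, and per-position binding probabilities obtained with forward–backward recursions) can be illustrated with a small sketch. This is not the authors' COMPETE implementation. It assumes that the unnormalized weight of a configuration is the product, over its bound DBFs, of $X_t \, P(S_t, E_t \mid \mathrm{DBF}_t)$, which matches the symbol definitions above but is not stated explicitly in this excerpt. The function names (`occupancy`, `make_pwm_scorer`, `uniform_background`), the toy position weight matrix, the uniform background of 0.25 per nucleotide and the weight 5.0 are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical description of a DBF: (weight X_t, footprint width in bp,
# function returning P(site sequence | this DBF is bound there)).
DBF = Tuple[float, int, Callable[[str], float]]


def uniform_background(site: str) -> float:
    """Assumed model for the 'empty' DBF: each unbound base has probability 0.25."""
    return 0.25 ** len(site)


def make_pwm_scorer(pwm: List[Dict[str, float]]) -> Callable[[str], float]:
    """Return a function scoring a site as a product of per-column probabilities."""
    def score(site: str) -> float:
        p = 1.0
        for column, base in zip(pwm, site):
            p *= column[base]
        return p
    return score


def occupancy(seq: str, dbfs: Dict[str, DBF]):
    """Return (partition function Z, per-position occupancy for each DBF)."""
    n = len(seq)

    def term(name: str, start: int) -> float:
        # Contribution X_t * P(S_t, E_t | DBF_t) of one bound site.
        weight, width, prob = dbfs[name]
        return weight * prob(seq[start:start + width])

    # fwd[j]: total weight of all ways to tile seq[0:j] with DBFs.
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for j in range(1, n + 1):
        for name, (_, width, _) in dbfs.items():
            if j - width >= 0:
                fwd[j] += fwd[j - width] * term(name, j - width)

    # bwd[j]: total weight of all ways to tile seq[j:n] with DBFs.
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for j in range(n - 1, -1, -1):
        for name, (_, width, _) in dbfs.items():
            if j + width <= n:
                bwd[j] += term(name, j) * bwd[j + width]

    z = fwd[n]  # partition function: sum of weights over all configurations

    # Probability that DBF t covers position j: sum over all sites of t
    # that overlap j of (prefix weight * site term * suffix weight) / Z.
    occ = {name: [0.0] * n for name in dbfs}
    for name, (_, width, _) in dbfs.items():
        for start in range(0, n - width + 1):
            p_site = fwd[start] * term(name, start) * bwd[start + width] / z
            for j in range(start, start + width):
                occ[name][j] += p_site
    return z, occ


# Toy example: one 3-bp TF preferring "ACG" competing with the 'empty' DBF.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
dbfs = {
    "empty": (1.0, 1, uniform_background),
    "tf": (5.0, 3, make_pwm_scorer(pwm)),
}
seq = "TTACGTT"
z, occ = occupancy(seq, dbfs)
```

Because every nucleotide is covered by exactly one DBF in each configuration (counting the ‘empty’ DBF), the occupancies at each position sum to one, which is a convenient sanity check. The sketch multiplies raw probabilities, so it is only suitable for short sequences; a genome-scale computation would need scaling or log-space arithmetic to avoid underflow.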