A penalized Bayesian approach to predicting sparse protein–DNA binding landscapes

BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
Vol. 30 no. 5 2014, pages 636–643
doi:10.1093/bioinformatics/btt585
Advance Access publication October 9, 2013

Matthew Levinson and Qing Zhou*
Department of Statistics, University of California, Los Angeles, CA 90095, USA
*To whom correspondence should be addressed.
Associate Editor: John Hancock
Received on June 21, 2013; revised on September 18, 2013; accepted on October 4, 2013
© The Author 2013. Published by Oxford University Press. All rights reserved.

ABSTRACT

Motivation: Cellular processes are controlled, directly or indirectly, by the binding of hundreds of different DNA binding factors (DBFs) to the genome. One key to deeper understanding of the cell is discovering where, when and how strongly these DBFs bind to the DNA sequence. Direct measurement of DBF binding sites (BSs; e.g. through ChIP-Chip or ChIP-Seq experiments) is expensive, noisy and not available for every DBF in every cell type. Naive and most existing computational approaches to detecting which DBFs bind in a set of genomic regions of interest often perform poorly, due to the high false discovery rates and restrictive requirements for prior knowledge.

Results: We develop SparScape, a penalized Bayesian method for identifying DBFs active in the considered regions and predicting a joint probabilistic binding landscape. Using a sparsity-inducing penalization, SparScape is able to select a small subset of DBFs with enriched BSs in a set of DNA sequences from a much larger candidate set. This substantially reduces the false positives in prediction of BSs. Analysis of ChIP-Seq data in mouse embryonic stem cells and simulated data show that SparScape dramatically outperforms the naive motif scanning method and the comparable computational approaches in terms of DBF identification and BS prediction.

Availability and implementation: SparScape is implemented in C++ with OpenMP (optional at compilation) and is freely available at www.stat.ucla.edu/~zhou/Software.html for academic use.

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Many complex processes in the cell, particularly gene regulation, are controlled by the binding of various factors to the DNA sequence. A key to understanding these processes is determining where each of these DNA binding factors (DBFs), including transcription factors (TFs), nucleosomes, RNA and other proteins and protein complexes, binds in the genome in a certain cell type and set of conditions. This collection of binding sites (BSs) for all DBFs over regions of interest is sometimes called a binding landscape. More formally, we define a binding landscape as the base pair–specific probability of binding for each of a library of DBFs over a set of genomic regions.

Considering DBFs one at a time leads to many false positives, both in determining which DBFs have significantly enriched BSs in a set of genomic regions and in predicting the exact locations of BSs, and results in a limited view of the processes controlled by these DBFs. This has motivated recent work on jointly predicting binding landscapes for a set of DBFs. Currently, joint landscapes at single base pair resolution for all DBFs have only been predicted in lower eukaryotes such as yeast with fewer DBFs and much smaller genomes (Wasson and Hartemink, 2009). In higher eukaryotes, predictions have been limited to a (usually small) pre-selected set of DBFs known to bind the regions of interest (He et al., 2009, 2010; Kaplan et al., 2011; Laurila et al., 2009; Raveh-Sadka et al., 2009). Of these methods, only that of He et al. (2009) does not require the DBF concentrations as prior knowledge, something made possible by considering at most two DBFs at a time, and only Kaplan et al. (2011) used ChIP-Seq data as a source of direct information. See Arnold et al. (2011), Ernst et al. (2010), Marbach et al. (2012), Ramsey et al. (2010), Teif and Rippe (2010) and Won et al. (2010) for recent examples of alternate approaches to answering related questions.

Owing to computational limits, it is impossible to predict a joint base pair–specific binding landscape for all DBFs with unknown concentrations over the entire genome in higher eukaryotes with large genomes and many DBFs. We are thus limited to exploring a subset of the genome. One motivating type of genomic subset is a set of regions known to be co-bound by a small group of DBFs based on ChIP-Seq data. In such a genomic subset, we do not expect most DBFs to have a significant number of BSs. Thus, the false positive BSs in a predicted binding landscape can be substantially reduced if only the DBFs with significantly enriched binding in the regions of interest are considered. However, it is limiting to require the complete set of DBFs enriched in the considered regions to be known a priori as is done in existing work on similar questions in higher eukaryotes.

In this article, we develop a method that offers a principled way to select an, often small, subset of DBFs active in the regions of interest and to reduce the false-positive signal in the predicted probabilistic binding landscape, eliminating the need for prior knowledge of the set of enriched DBFs or DBF concentrations. In the motivating genomic subset, our method allows for the discovery of unknown cofactors that commonly bind near the DBFs with ChIP data (ChIP DBFs). The predicted joint binding landscape provides a global and quantitative view of the binding pattern among the DBFs. This is an initial step to the study of combinatorial regulatory logic among multiple DBFs.

2 MODEL AND ESTIMATION

2.1 Overview of SparScape

2.2 The SparScape model

Consider the sequence $S$ of a set of genomic regions with total length $|S|$, and the set of ChIP windows $D$ in these regions for all ChIP DBFs. Let $K$ be the number of candidate DBFs, and denote the set of binding model parameters for all $K$ DBFs, including the nucleosome, and the background model. Under the standard steric hindrance constraint, we define a binding configuration as a partition of the sequence $S$ into unbound background sites and BSs for the $K$ DBFs. Denote a configuration by $A = (a_1, a_2, \ldots, a_{|A|})$, where $a_i$ is the index of one of the $K + 1$ models and represents a subsequence of base pairs bound by a DBF (or is a single unbound base pair) in the current configuration. More specifically, it represents a single unbound site covering $L_0 = 1$ bp when $a_i = 0$, a nucleosome covering $L_1 = 147$ bp when $a_i = 1$, and a non-nucleosome DBF from the candidate library covering $L_k$ bp when $a_i = k \in \{2, \ldots, K\}$, where $L_k$ is the length of the motif for the $k$th DBF. Figure 1a illustrates an example configuration. Let be the probability that a BS for a ChIP DBF is entirely within one of its ChIP window (...truncated)
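The binding configuration defined in Section 2.2 can be made concrete with a small sketch. This is an illustration only and not part of SparScape: the function names, the dictionary representation of the lengths and the example motif lengths (8 bp and 12 bp) are assumptions chosen for the example. The sketch stores a configuration as the list of model indices, checks that the covered lengths sum to the sequence length (the partition requirement implied by steric hindrance), and expands the configuration into one label per base pair.

```python
from typing import Dict, List

BACKGROUND_LEN = 1    # L_0: a single unbound base pair (index 0)
NUCLEOSOME_LEN = 147  # L_1: a nucleosome (index 1)


def element_lengths(motif_lengths: Dict[int, int]) -> Dict[int, int]:
    """Map every model index to the number of base pairs it covers.

    motif_lengths is a hypothetical input: index k >= 2 -> motif length L_k.
    """
    if any(k < 2 for k in motif_lengths):
        raise ValueError("indices 0 and 1 are reserved for background and nucleosome")
    lengths = {0: BACKGROUND_LEN, 1: NUCLEOSOME_LEN}
    lengths.update(motif_lengths)
    return lengths


def covered_length(config: List[int], lengths: Dict[int, int]) -> int:
    """Total number of base pairs covered by the elements of a configuration."""
    return sum(lengths[a] for a in config)


def is_valid_configuration(config: List[int], lengths: Dict[int, int], seq_len: int) -> bool:
    """A configuration partitions the sequence: known indices, exact coverage."""
    if any(a not in lengths for a in config):
        return False
    return covered_length(config, lengths) == seq_len


def per_base_labels(config: List[int], lengths: Dict[int, int]) -> List[int]:
    """Expand a configuration into one model index per base pair."""
    labels: List[int] = []
    for a in config:
        labels.extend([a] * lengths[a])
    return labels


if __name__ == "__main__":
    # Hypothetical library: DBF 2 has an 8 bp motif, DBF 3 a 12 bp motif.
    lengths = element_lengths({2: 8, 3: 12})
    config = [0, 0, 2, 0, 1, 3, 0]
    print(is_valid_configuration(config, lengths, covered_length(config, lengths)))
    print(len(per_base_labels(config, lengths)))
```

Because elements are laid end to end, no two BSs can overlap in this representation, which is how the sketch encodes the steric hindrance constraint.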