A penalized Bayesian approach to predicting sparse protein–DNA binding landscapes
BIOINFORMATICS
ORIGINAL PAPER
Sequence analysis
Vol. 30 no. 5 2014, pages 636–643
doi:10.1093/bioinformatics/btt585
Advance Access publication October 9, 2013
Matthew Levinson and Qing Zhou*
Department of Statistics, University of California, Los Angeles, CA 90095, USA
Associate Editor: John Hancock
ABSTRACT
Received on June 21, 2013; revised on September 18, 2013; accepted
on October 4, 2013
1 INTRODUCTION
Many complex processes in the cell, particularly gene regulation,
are controlled by the binding of various factors to the DNA
sequence. A key to understanding these processes is determining
where each of these DNA binding factors (DBFs), including
transcription factors (TFs), nucleosomes, RNA and other proteins and protein complexes, binds in the genome in a certain cell
type and set of conditions. This collection of binding sites (BSs)
for all DBFs over regions of interest is sometimes called a binding landscape. More formally, we define a binding landscape as
the base pair–specific probability of binding for each of a library
of DBFs over a set of genomic regions.
*To whom correspondence should be addressed.
Considering DBFs one at a time leads to many false positives,
both in determining which DBFs have significantly enriched BSs
in a set of genomic regions and in predicting the exact locations
of BSs, and results in a limited view of the processes controlled
by these DBFs. This has motivated recent work on jointly predicting binding landscapes for a set of DBFs. Currently, joint
landscapes at single base pair resolution for all DBFs have only
been predicted in lower eukaryotes such as yeast with fewer
DBFs and much smaller genomes (Wasson and Hartemink,
2009). In higher eukaryotes, predictions have been limited to a
(usually small) pre-selected set of DBFs known to bind the regions of interest (He et al., 2009, 2010; Kaplan et al., 2011;
Laurila et al., 2009; Raveh-Sadka et al., 2009). Of these methods,
only that of He et al. (2009) does not require the DBF concentrations as prior knowledge, something made possible by considering at most two DBFs at a time, and only Kaplan et al.
(2011) used ChIP-Seq data as a source of direct information. See
Arnold et al. (2011), Ernst et al. (2010), Marbach et al. (2012),
Ramsey et al. (2010), Teif and Rippe (2010) and Won et al.
(2010) for recent examples of alternate approaches to answering
related questions.
Owing to computational limits, it is impossible to predict a
joint base pair–specific binding landscape for all DBFs with unknown concentrations over the entire genome in higher eukaryotes with large genomes and many DBFs. We are thus limited to
exploring a subset of the genome. One motivating type of genomic subset is a set of regions known to be co-bound by a small
group of DBFs based on ChIP-Seq data. In such a genomic
subset, we do not expect most DBFs to have a significant
number of BSs. Thus, the false positive BSs in a predicted binding landscape can be substantially reduced if only the DBFs with
significantly enriched binding in the regions of interest are considered. However, it is limiting to require the complete set of
DBFs enriched in the considered regions to be known a priori
as is done in existing work on similar questions in higher
eukaryotes.
In this article, we develop a method that offers a principled
way to select an often small subset of DBFs active in the regions
of interest and to reduce the false-positive signal in the predicted
probabilistic binding landscape, eliminating the need for prior
knowledge of the set of enriched DBFs or DBF concentrations.
In the motivating genomic subset, our method allows for the
discovery of unknown cofactors that commonly bind near the
DBFs with ChIP data (ChIP DBFs). The predicted joint binding
landscape provides a global and quantitative view of the binding
pattern among the DBFs. This is an initial step toward the study of
combinatorial regulatory logic among multiple DBFs.
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
Motivation: Cellular processes are controlled, directly or indirectly, by
the binding of hundreds of different DNA binding factors (DBFs) to the
genome. One key to deeper understanding of the cell is discovering
where, when and how strongly these DBFs bind to the DNA sequence.
Direct measurement of DBF binding sites (BSs; e.g. through ChIP-Chip or ChIP-Seq experiments) is expensive, noisy and not available
for every DBF in every cell type. Naive and most existing computational approaches to detecting which DBFs bind in a set of genomic
regions of interest often perform poorly because of high false discovery
rates and restrictive requirements for prior knowledge.
Results: We develop SparScape, a penalized Bayesian method for
identifying DBFs active in the considered regions and predicting a joint
probabilistic binding landscape. Using a sparsity-inducing penalization, SparScape is able to select a small subset of DBFs with enriched
BSs in a set of DNA sequences from a much larger candidate set. This
substantially reduces the false positives in prediction of BSs. Analysis
of ChIP-Seq data in mouse embryonic stem cells and simulated data
shows that SparScape dramatically outperforms the naive motif scanning method and the comparable computational approaches in terms
of DBF identification and BS prediction.
Availability and implementation: SparScape is implemented in C++
with OpenMP (optional at compilation) and is freely available at
'www.stat.ucla.edu/~zhou/Software.html' for academic use.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
2 MODEL AND ESTIMATION
2.1 Overview of SparScape
2.2 The SparScape model
Consider the sequence S of a set of genomic regions with total
length $|S|$, and the set of ChIP windows D in these regions for all
ChIP DBFs. Let K be the number of candidate DBFs, and
denote the set of binding model parameters for all K DBFs,
including the nucleosome, and the background model. Under
the standard steric hindrance constraint, we define a binding
configuration as a partition of the sequence S into unbound
background sites and BSs for the K DBFs. Denote a configuration by $A = (a_1, a_2, \ldots, a_{|A|})$, where $a_i$ is the index of one of the
$K + 1$ models and represents a subsequence of base pairs bound
by a DBF (or is a single unbound base pair) in the current configuration. More specifically, it represents a single unbound site
covering $L_0 = 1$ bp when $a_i = 0$, a nucleosome covering
$L_1 = 147$ bp when $a_i = 1$, and a non-nucleosome DBF from
the candidate library covering $L_k$ bp when $a_i = k \in \{2, \ldots, K\}$,
where $L_k$ is the length of the motif for the $k$th DBF. Figure 1a
illustrates an example configuration.
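
To make the definition of a configuration concrete, the following minimal Python sketch expands a configuration into one model index per base pair and checks that it partitions the sequence exactly. It is an illustration only and is not taken from the SparScape implementation: the function name `expand_configuration`, the toy motif lengths for models 2 and 3, and the toy configuration are all hypothetical. Only the background length of 1 bp and the nucleosome length of 147 bp come from the text.

```python
from typing import Dict, List


def expand_configuration(config: List[int],
                         lengths: Dict[int, int],
                         seq_len: int) -> List[int]:
    """Expand A = (a_1, ..., a_|A|) into one model index per base pair.

    config  : model indices a_i (0 = unbound, 1 = nucleosome, k >= 2 = DBF k)
    lengths : L_k for every model index k
    seq_len : total sequence length |S|
    """
    labels: List[int] = []
    for a in config:
        if a not in lengths:
            raise ValueError(f"unknown model index {a}")
        # Each element covers L_a consecutive base pairs, so elements
        # cannot overlap (the steric hindrance constraint).
        labels.extend([a] * lengths[a])
    # A valid configuration partitions S: covered lengths must sum to |S|.
    if len(labels) != seq_len:
        raise ValueError(
            f"configuration covers {len(labels)} bp, expected {seq_len}"
        )
    return labels


# Hypothetical toy library with K = 3: L_0 = 1 and L_1 = 147 follow the
# text; the motif lengths L_2 = 6 and L_3 = 8 are assumed for illustration.
lengths = {0: 1, 1: 147, 2: 6, 3: 8}

# Hypothetical configuration: background, background, DBF 2, background,
# nucleosome, background, DBF 3, background.
config = [0, 0, 2, 0, 1, 0, 3, 0]
seq_len = sum(lengths[a] for a in config)

labels = expand_configuration(config, lengths, seq_len)
```

Representing a configuration by its element indices rather than by per-base labels keeps it compact; the per-base expansion is useful mainly for checking validity or for tallying which positions each model covers.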
Let be the probability that a BS for a ChIP DBF is entirely
within one of its ChIP window (...truncated)