info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling
BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 20 2009, pages 2715–2722
doi:10.1093/bioinformatics/btp490
Gene expression
Matthieu Defrance∗ and Jacques van Helden∗
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles CP 263,
Campus Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium
Received on October 2, 2008; revised on August 6, 2009; accepted on August 11, 2009
Advance Access publication August 18, 2009
Associate Editor: David Rocke
1 INTRODUCTION
Gene expression is regulated at the transcriptional level by
transcription factors (TFs) that bind to DNA at specific locations.
Several algorithmic approaches have been developed for de novo
identification of regulatory signals from a set of sequences. Motif
discovery methods can be used to construct motifs that represent the
specificity of the binding between a TF and its binding sites (TFBSs).
Depending on the motif representation used, motif discovery
methods can be divided into two broad categories: enumerative
methods that select overrepresented words (exact or degenerate),
and heuristics that target the discovery of more complex motifs
such as position-specific scoring matrices (PSSMs). In the first
category, motifs can be represented by words (van
Helden et al., 1998), spaced words (van Helden et al., 2000) or
words with multiple gaps and errors (i.e. with degenerate positions)
(Pavesi et al., 2001; Sinha and Tompa, 2002, 2003). For the
second category, which aims at discovering PSSMs, many
algorithms have been proposed. These include the greedy algorithm
consensus (Hertz et al., 1990), expectation maximization algorithms
like MEME (Bailey and Elkan, 1994) and several algorithms based
on a Gibbs sampling strategy: Gibbs (Lawrence et al., 1993; Liu
et al., 1995; Neuwald et al., 1995), AlignACE (Hughes et al.,
2000; Roth et al., 1998), MotifSampler (Thijs et al., 2002) or
BioProspector (Liu et al., 2001). The latter two support higher order
Markovian background models. More recently, Shida (2006a, b)
proposed a Gibbs sampling method with a variable stochastic
factor (temperature) that speeds up the convergence of the
sampling.
Most methods that target PSSM motif discovery rank the
predicted motifs by computing some score a posteriori, such as the
information content (IC) (consensus, MotifSampler), the log-likelihood
ratio (LLR) (MotifSampler, Gibbs), the E-value of the log-likelihood
(MEME) or the E-value of the IC (consensus).
The IC (Hertz and Stormo, 1999), also called relative entropy,
presents the advantage of measuring both the specificity of a motif
(low variability within each column) and its contrast relative to the
background model. IC has been claimed to be a good measure of
DNA binding affinity (Stormo, 1998).
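The two properties mentioned above (column specificity and contrast with the background) can be illustrated with a minimal sketch. It assumes the usual column-wise form of the IC, i.e. the sum over columns i and residues j of f(i,j) ln(f(i,j)/p(j)), with natural logarithms and no pseudo-counts; the function names and example sites are hypothetical and are not taken from the info-gibbs implementation.

```python
import math

ALPHABET = "ACGT"

def column_frequencies(sites):
    """Per-column residue frequencies of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        column = [s[i] for s in sites]
        freqs.append({a: column.count(a) / len(sites) for a in ALPHABET})
    return freqs

def information_content(sites, background):
    """Sum over columns of f * ln(f / p); residues with f == 0 contribute 0.

    Assumption: independent columns and a Bernoulli (order-0) background.
    """
    total = 0.0
    for col in column_frequencies(sites):
        for a, f in col.items():
            if f > 0.0:
                total += f * math.log(f / background[a])
    return total

# Illustrative inputs (invented for this sketch).
uniform = {a: 0.25 for a in ALPHABET}
a_rich = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}

conserved = ["ACGT", "ACGT", "ACGT"]         # no variability in any column
variable = ["ACGT", "CGTA", "GTAC", "TACG"]  # every column is uniform
poly_a = ["AAAA", "AAAA", "AAAA"]

print(information_content(conserved, uniform))
print(information_content(variable, uniform))
print(information_content(poly_a, uniform), information_content(poly_a, a_rich))
```

Under these assumptions, a fully conserved motif scores highest against a uniform background, a motif whose columns match the background scores zero, and the same conserved poly-A motif scores lower against an A-rich background because it contrasts less with it.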
The program consensus (Hertz et al., 1990) optimizes the IC, but
is sensitive to the order in which the sequences are incorporated. Genetic
algorithms like GAME that try to optimize this score directly have
recently emerged (Chan et al., 2008; Wei and Jensen, 2006), but
the time and memory complexity of genetic algorithms is higher
than that of more specific algorithms like Gibbs sampling. Furthermore,
they require the user to specify a set of parameters (probabilities of mutation
and crossing over, population size, selection operator, etc.), which
are difficult to relate to properties of the input sequences and output
motifs (size, number of sites, conservation, etc.).
The scoring function used to sample motifs during the
discovery process strongly affects the resulting motifs. Jensen and
co-workers (2004) emphasized the impact of the input parameters
and the scoring functions on the quality of discovered motifs. They
implemented the software BioOptimizer (Jensen and Liu, 2004),
which takes as input a motif returned by some pattern discovery
algorithm (BioProspector, Consensus, AlignACE, MEME), and
improves it by local optimization of a scoring function based on
the log-posterior distribution.
In this article, we present a motif-finding algorithm called info-gibbs that combines the qualities of Gibbs sampling (time and
memory efficiency, interpretability of parameters) with a
scoring scheme based on either the IC or the LLR of the motif.
The strategy is to directly compute the IC or LLR of the motif at
each step of the sampling. Compared with existing methods, info-gibbs performs well in terms of computation time and
prediction quality on both simulated and real datasets.
ABSTRACT
Motivation: Discovering cis-regulatory elements in genome
sequence remains a challenging issue. Several methods rely on
the optimization of some target scoring function. The information
content (IC) or relative entropy of the motif has proven to be a good
estimator of transcription factor DNA binding affinity. However, these
information-based metrics are usually used as a posteriori statistics
rather than during the motif search process itself.
Results: We introduce here info-gibbs, a Gibbs sampling algorithm
that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the
motif while keeping computation time low. The method compares
well with existing methods like MEME, BioProspector, Gibbs or
GAME on both synthetic and biological datasets. Our study shows
that motif discovery techniques can be enhanced by directly focusing
the search on the motif IC or the motif LLR.
Availability: http://rsat.ulb.ac.be/rsat/info-gibbs
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
∗ To whom correspondence should be addressed.
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email:
2 METHODS
The IC can be extended to a more general formulation when the positions
are not independent. This can be written as follows:
IC(M,B) = \sum_{u \in A^l} P(u|M) \log \frac{P(u|M)}{P(u|B)}    (5)
where A is the alphabet, l the length of the motif, P(u|M) the probability
to generate the fragment u given the matrix M and P(u|B) the probability to
generate the same fragment given the background model B. When a Bernoulli
background model is used, Equation (5) can be simplified to Equation (4).
For Markov models of higher order, this formula can be rewritten (see
Supplementary Material) and computed in acceptable time for the Markov
orders typically used in practice (m ≤ 5).
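The reduction of Equation (5) under a Bernoulli background can be checked numerically on a small example. The sketch below enumerates every fragment u of length l, evaluates the general sum, and compares it with a column-wise sum of f ln(f/p) terms. It assumes that P(u|M) is the product of the matrix frequencies along u, that the column-wise sum is the simplified form the text refers to, and that natural logarithms are used; the function names, the matrix and the background are hypothetical values chosen for illustration.

```python
import itertools
import math

ALPHABET = "ACGT"

def ic_general(matrix, background):
    """General sum over all fragments u in A^l, with a Bernoulli background."""
    total = 0.0
    for u in itertools.product(ALPHABET, repeat=len(matrix)):
        p_m = math.prod(col[a] for col, a in zip(matrix, u))  # P(u|M)
        p_b = math.prod(background[a] for a in u)             # P(u|B)
        if p_m > 0.0:  # fragments the matrix cannot generate contribute 0
            total += p_m * math.log(p_m / p_b)
    return total

def ic_columnwise(matrix, background):
    """Assumed simplified form: sum over columns and residues of f * ln(f / p)."""
    return sum(
        f * math.log(f / background[a])
        for col in matrix
        for a, f in col.items()
        if f > 0.0
    )

# Illustrative frequency matrix (one dict per column) and background.
matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

print(ic_general(matrix, background), ic_columnwise(matrix, background))
```

The enumeration visits 4^l fragments, so it is only practical for short motifs; it is meant as a sanity check of the identity, not as a way to score motifs during sampling.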
Problem statement: given a set of sequences Φ = {φ1, ..., φz} and a motif
length l, the problem can be defined as follows: find a set of sequence
fragments (sites) W = {w1, w2, ..., wn} that has maximal IC. When considering
the search space as a set of potential sites S = {s1, ..., sz} (e.g. all allowed
positions in sequences), the problem can also be viewed as finding a
subset W = {w1 ,...,wn } of S that has maximal IC (...truncated)