info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling (pdf)

https://bioinformatics.oxfordjournals.org/content/25/20/2715.full.pdf

BIOINFORMATICS ORIGINAL PAPER
Vol. 25 no. 20 2009, pages 2715–2722
doi:10.1093/bioinformatics/btp490

Gene expression

info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling

Matthieu Defrance∗ and Jacques van Helden∗
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles CP 263, Campus Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium

Received on October 2, 2008; revised on August 6, 2009; accepted on August 11, 2009
Advance Access publication August 18, 2009
Associate Editor: David Rocke

∗ To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved.

ABSTRACT

Motivation: Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself.

Results: We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR.

Availability: http://rsat.ulb.ac.be/rsat/info-gibbs

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Gene expression is regulated at the transcriptional level by transcription factors (TFs) that bind to DNA at specific locations. Several algorithmic approaches have been developed for de novo identification of regulatory signals from a set of sequences. Motif discovery methods can be used to construct motifs that represent the specificity of the binding between a TF and its binding sites (TFBSs).

Depending on the motif representation used, motif discovery methods can be divided into two broad categories: enumerative methods that select overrepresented words (exact or degenerate), and heuristics that target the discovery of more complex motifs like position-specific scoring matrices (PSSMs). In the first category, motifs can be represented by words (van Helden et al., 1998), spaced words (van Helden et al., 2000) or words with multiple gaps and errors, i.e. with degenerate positions (Pavesi et al., 2001; Sinha and Tompa, 2002, 2003).

For the second category, which aims at discovering PSSMs, many algorithms have been proposed. These include the greedy algorithm consensus (Hertz et al., 1990), expectation maximization algorithms like MEME (Bailey and Elkan, 1994) and several algorithms based on a Gibbs sampling strategy: Gibbs (Lawrence et al., 1993; Liu et al., 1995; Neuwald et al., 1995), AlignACE (Hughes et al., 2000; Roth et al., 1998), MotifSampler (Thijs et al., 2002) or BioProspector (Liu et al., 2001). The latter two support higher order Markovian background models. More recently, Shida (2006a, b) proposed a Gibbs sampling method with a variable stochastic factor (temperature) that enhances the convergence speed of Gibbs sampling.

Most of the methods that target PSSM motif discovery rank the predicted motifs by computing some score a posteriori, such as information content (IC) (consensus, MotifSampler), log-likelihood ratio (LLR) (MotifSampler, Gibbs), E-value of the log-likelihood (MEME) or E-value of the IC (consensus). The IC (Hertz and Stormo, 1999), also called relative entropy, has the advantage of measuring both the specificity of a motif (low variability within each column) and its contrast relative to the background model. IC has been claimed to be a good measure of DNA binding affinity (Stormo, 1998).

The program consensus (Hertz et al., 1990) optimizes the IC, but is sensitive to the order of incorporation of the sequences. Genetic algorithms like GAME that try to optimize this score directly have recently emerged (Chan et al., 2008; Wei and Jensen, 2006), but the time and memory complexity of genetic algorithms is higher than that of more specific algorithms like Gibbs sampling. Furthermore, they require the user to specify a set of parameters (probabilities of mutation and crossing over, population size, selection operator, etc.) that are difficult to relate to properties of the input sequences and output motifs (size, number of sites, conservation, etc.).

The scoring function used to sample motifs during the discovery process strongly affects the resulting motifs. Jensen and co-workers (2004) emphasized the impact of the input parameters and the scoring functions on the quality of discovered motifs. They implemented the software BioOptimizer (Jensen and Liu, 2004), which takes as input a motif returned by some pattern discovery algorithm (BioProspector, consensus, AlignACE, MEME), and improves it by local optimization of a scoring function based on the log-posterior distribution.

In this article, we present a motif finding algorithm called info-gibbs, which combines the qualities of Gibbs sampling (time and memory efficiency, interpretability of parameters) and uses either the IC or the LLR of the motif as its scoring scheme. The strategy is to directly compute the IC or LLR of the motif at each step of the sampling. Compared with existing methods, info-gibbs performs well in terms of computation time and prediction quality on both simulated and real datasets.

2 METHODS

The IC can be extended to a more general formulation when the positions are not independent. This can be written as follows:

$$IC(M,B) = \sum_{u \in A^l} P(u|M) \log \frac{P(u|M)}{P(u|B)} \qquad (5)$$

where A is the alphabet, l the length of the motif, P(u|M) the probability of generating the fragment u given the matrix M, and P(u|B) the probability of generating the same fragment given the background model B. When a Bernoulli background model is used, Equation (5) can be simplified to Equation (4). For Markov models of higher order, this formula can be rewritten (see Supplementary Material) and computed in acceptable time for the Markov orders typically used in practice (m ≤ 5).

Problem statement: given a set of sequences {φ_1, ..., φ_z} and a motif length l, the problem can be defined as follows: find a set of sequence fragments (sites) W = {w_1, w_2, ..., w_n} that has maximal IC. When considering the search space as a set of potential sites S = {s_1, ..., s_z} (e.g. all allowed positions in sequences), the problem can also be viewed as finding a subset W = {w_1, ..., w_n} of S that has maximal IC (...truncated)
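As an illustration of the relationship between Equation (5) and its Bernoulli simplification, the following minimal Python sketch evaluates the IC of a small set of aligned sites in two ways: column by column, and by enumerating every word u in A^l as Equation (5) is written. Several things here are assumptions and not taken from the article: the column-wise form is the standard relative entropy summed over columns (Equation (4) is not reproduced in this excerpt), the logarithm is natural, the pseudo-count of 1 per column is arbitrary, and the function names, example sites and background frequencies are hypothetical. It is not the info-gibbs implementation.

```python
import itertools
import math

ALPHABET = "ACGT"


def frequency_matrix(sites, pseudo=1.0):
    """Column-wise residue frequencies of aligned sites.

    The pseudo-count (spread evenly over the alphabet) is an assumption
    made here so that no frequency is exactly zero.
    """
    length = len(sites[0])
    matrix = []
    for i in range(length):
        counts = {r: pseudo / len(ALPHABET) for r in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({r: counts[r] / total for r in ALPHABET})
    return matrix


def ic_columnwise(matrix, background):
    """Relative entropy summed over independent columns (Bernoulli background)."""
    return sum(
        f * math.log(f / background[r])
        for column in matrix
        for r, f in column.items()
        if f > 0
    )


def ic_enumerated(matrix, background):
    """Equation (5) evaluated literally by summing over every word u in A^l."""
    total = 0.0
    for u in itertools.product(ALPHABET, repeat=len(matrix)):
        p_m = math.prod(matrix[i][r] for i, r in enumerate(u))
        p_b = math.prod(background[r] for r in u)
        if p_m > 0:
            total += p_m * math.log(p_m / p_b)
    return total


# Hypothetical example data: four aligned sites of length 7.
sites = ["TGACTCA", "TGAGTCA", "TGACTCT", "AGACTCA"]
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

M = frequency_matrix(sites)
print(ic_columnwise(M, background), ic_enumerated(M, background))
```

The two printed values should agree up to floating-point error, because with independent columns and a Bernoulli background the sum over all 4^l words factorizes into a sum over columns. The enumeration grows exponentially with l, so it is only practical as a check on short motifs.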