EXTREME: an online EM algorithm for motif discovery

https://bioinformatics.oxfordjournals.org/content/30/12/1667.full.pdf

BIOINFORMATICS, ORIGINAL PAPER, Sequence analysis
Vol. 30 no. 12 2014, pages 1667–1673
doi:10.1093/bioinformatics/btu093
Advance Access publication February 14, 2014

Daniel Quang (1,2) and Xiaohui Xie (1,2,*)
(1) Department of Computer Science, University of California, Irvine, CA 92697, USA
(2) Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA
(*) To whom correspondence should be addressed.

Associate Editor: John Hancock

Received on October 23, 2013; revised on January 24, 2014; accepted on February 7, 2014

1 INTRODUCTION

Transcription factors (TFs) are proteins that play an important role in transcriptional regulation by promoting or blocking the recruitment of RNA polymerase II. They can bind specifically to recognition sequences on the genome or to other TFs in a complex. High-throughput assays generate a rich amount of information on the sequence preference of TFs. ChIP-Seq (Johnson et al., 2007) can provide the genome-wide binding sites of a single TF. DNase-Seq, which sequences open chromatin regions in the genome, can provide single nucleotide resolution for the binding sites of many TFs (Hesselberth et al., 2009). When sequenced deep enough, binding sites appear as dips, or footprints (FPs), in the DNase-Seq signal. FPs only identify the locations of the TF binding sites; they do not identify the proteins that are bound there. These assays can provide functional information for thousands to millions of base pair regions in the genome.

The task of identifying the sequence preference of a TF is called motif discovery. Motif discovery algorithms can be classified as either search-based or probabilistic. Search-based algorithms infer motifs as consensus sequences. Probabilistic algorithms infer motifs as position frequency matrices (PFMs), which specify the frequency of nucleotides for each position in the binding site.
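As an illustration of the two motif representations, the sketch below builds a PFM from a handful of aligned binding sites and derives a consensus sequence from it. The sites and the helper names `position_frequency_matrix` and `consensus` are hypothetical and chosen only for this example.

```python
from collections import Counter

ALPHABET = "ACGT"


def position_frequency_matrix(sites):
    """Return one {letter: frequency} dict per column of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = []
    for j in range(width):
        # Count how often each nucleotide occurs at position j.
        counts = Counter(site[j] for site in sites)
        pfm.append({a: counts[a] / len(sites) for a in ALPHABET})
    return pfm


def consensus(pfm):
    """Pick the most frequent letter in every column."""
    return "".join(max(ALPHABET, key=col.get) for col in pfm)


# Hypothetical aligned binding sites, made up for illustration.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pfm = position_frequency_matrix(sites)
print(consensus(pfm))   # TGACTCA
print(pfm[1])           # column 2 keeps the minority T: G = 0.75, T = 0.25
```

The consensus string discards the minority letters at positions 2 and 4, whereas the PFM retains them; this is the extra information about binding specificity that a PFM carries.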
While PFMs provide more information about a TF’s binding specificity than consensus sequences, inferring PFMs is not always practical. Probabilistic motif discovery programs usually use algorithms such as expectation-maximization (EM) (Dempster et al., 1977) for inference. These algorithms scale poorly with dataset size. Search-based algorithms are therefore preferred for large datasets. DREME (Bailey, 2011) is an example of a search-based algorithm designed for large datasets.

MEME is a popular probabilistic motif discovery program (Bailey and Elkan, 1994). It uses the EM algorithm to infer PFMs. Since its inception in 1994, it has gone through several versions. However, MEME scales poorly with large datasets. One strategy to improve MEME’s performance is to discard many of the sequences. This is the strategy used by MEME-ChIP (Machanick and Bailey, 2011). However, discarding sequences can decrease the chance of discovering motifs corresponding to infrequent cofactors. Another strategy, used in STEME, applies suffix trees to accelerate MEME (Reid and Wernisch, 2011). However, STEME is only practical for finding motifs of up to width 8 on large datasets because its efficiency tails off quickly as the motif width increases. Other strategies for accelerating MEME involve specialized hardware such as parallel pattern matching chips on PCI cards (Sandve et al., 2006). However, these implementations require hardware not available to most researchers.

To overcome these issues, we propose an online implementation of the MEME algorithm that we have named EXTREME. The online EM algorithm closely follows the original EM algorithm (hereafter referred to as the batch EM algorithm) (Cappé and Moulines, 2009). Normally, the online EM algorithm is designed for cases where not all data can be stored at once.
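To make the batch/online distinction concrete, the following is a minimal sketch and not EXTREME's implementation. It assumes a toy model in which each fixed-width word is either a motif site drawn from a PFM or a uniform background word. The function names (`online_em`, `sample_kmers`), the step-size schedule `(n + 1) ** -alpha`, the burn-in, the pseudocount and the synthetic data are all illustrative assumptions. Where batch EM would recompute expected counts over the whole dataset before each parameter update, the online variant blends in the statistics of one word at a time and updates the parameters immediately.

```python
import random

ALPHABET = "ACGT"


def online_em(kmers, seed, lam=0.5, alpha=0.6, burn_in=20, pseudo=0.01):
    """One pass of online EM for a motif-vs-uniform-background mixture.

    Returns (lam, pfm): the motif fraction and one {letter: prob} dict per column.
    """
    width = len(seed)
    # Start the PFM near the seed word: 0.7 on the seed letter, 0.1 elsewhere.
    pfm = [{a: (0.7 if a == s else 0.1) for a in ALPHABET} for s in seed]
    # Running averages of the E-step statistics, consistent with the start values.
    s_z = lam
    s_counts = [{a: lam * col[a] for a in ALPHABET} for col in pfm]
    p_background = 0.25 ** width

    for n, kmer in enumerate(kmers, start=1):
        # E-step on a single word: posterior probability that it is a motif site.
        p_motif = 1.0
        for col, letter in zip(pfm, kmer):
            p_motif *= col[letter]
        z = lam * p_motif / (lam * p_motif + (1.0 - lam) * p_background)

        # Stochastic-approximation update with a decaying step size.
        gamma = (n + 1) ** -alpha
        s_z += gamma * (z - s_z)
        for stats, letter in zip(s_counts, kmer):
            for a in ALPHABET:
                target = z if a == letter else 0.0  # z times the indicator I(letter, a)
                stats[a] += gamma * (target - stats[a])

        # M-step, skipped during burn-in so that the first noisy statistics
        # do not overwrite the starting point.
        if n > burn_in:
            lam = s_z
            pfm = [{a: (stats[a] + pseudo) / (s_z + 4 * pseudo) for a in ALPHABET}
                   for stats in s_counts]
    return lam, pfm


def sample_kmers(n, motif, fraction, rng):
    """Toy data: a noisy copy of `motif` with probability `fraction`, else a random word."""
    kmers = []
    for _ in range(n):
        if rng.random() < fraction:
            word = "".join(c if rng.random() < 0.9 else rng.choice(ALPHABET)
                           for c in motif)
        else:
            word = "".join(rng.choice(ALPHABET) for _ in motif)
        kmers.append(word)
    return kmers


rng = random.Random(0)
data = sample_kmers(4000, "ACGTAC", 0.3, rng)
lam, pfm = online_em(data, seed="ACGTAC")
print(round(lam, 2), "".join(max(ALPHABET, key=col.get) for col in pfm))
```

Each word is visited exactly once, so the cost of the pass grows linearly with the number of words, and the parameters start improving before the whole dataset has been seen.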
Although most computers have enough memory to store entire sequence datasets at once, the online EM algorithm is still advantageous for motif discovery because, for large sample sizes, it is computationally more efficient than the batch EM algorithm. We show that many of the features of the original MEME algorithm can be adapted to the online methodology. Furthermore, we show that EXTREME can achieve similar results to MEME in a fraction of the execution time. We also show that using the entire dataset is necessary to discover infrequent motifs, which is not always practical to do with MEME. To the best of our knowledge, this is the first application of the online EM algorithm to motif discovery.

ABSTRACT

Motivation: Identifying regulatory elements is a fundamental problem in the field of gene transcription. Motif discovery (the task of identifying the sequence preference of transcription factor proteins, which bind to these elements) is an important step in this challenge. MEME is a popular motif discovery algorithm. Unfortunately, MEME’s running time scales poorly with the size of the dataset. Experiments such as ChIP-Seq and DNase-Seq are providing a rich amount of information on the binding preference of transcription factors. MEME cannot discover motifs in data from these experiments in a practical amount of time without a compromising strategy such as discarding a majority of the sequences.

Results: We present EXTREME, a motif discovery algorithm designed to find DNA-binding motifs in ChIP-Seq and DNase-Seq data. Unlike MEME, which uses the expectation-maximization algorithm for motif discovery, EXTREME uses the online expectation-maximization algorithm to discover motifs. EXTREME can discover motifs in large datasets in a practical amount of time without discarding any sequences. Using EXTREME on ChIP-Seq and DNase-Seq data, we discover many motifs, including some novel and infrequent motifs that can only be discovered by using the entire dataset. Conservation analysis of one of these novel infrequent motifs confirms that it is evolutionarily conserved and possibly functional.

Availability and implementation: All source code is available at the GitHub repository http://github.com/uci-cbcl/EXTREME.

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author 2014. Published by Oxford University Press. All rights reserved.

2 MATERIALS AND METHODS

2.1 MEME

The original MEME algorithm applies the batch EM algorithm to infer PFMs. Here, we provide a brief overview of MEME’s model and how MEME applies the batch EM algorithm to infer parameters.

where $X_{i,j}$ is the letter in the $j$th position of subsequence $X_i$, and $I(k,a)$ is an indicator function

$$I(k,a) = \begin{cases} 1 & \text{if } a = k \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

2.1.2 Batch EM

The model parameters are iteratively improved in the batch EM algorithm. In the E-step, the expected counts of all nucleotides at each position are calculated based on the current guess of the parameters. In the M-step, the parameters are updated based on the values calculated in the E-step. MEME repeats the E and M steps until the c (...truncated)