EXTREME: an online EM algorithm for motif discovery
BIOINFORMATICS
ORIGINAL PAPER
Sequence analysis
Vol. 30 no. 12 2014, pages 1667–1673
doi:10.1093/bioinformatics/btu093
Advance Access publication February 14, 2014
Daniel Quang1,2 and Xiaohui Xie1,2,*
1Department of Computer Science, University of California, Irvine, CA 92697, USA and 2Center for Complex Biological
Systems, University of California, Irvine, CA 92697, USA
Associate Editor: John Hancock
ABSTRACT
Received on October 23, 2013; revised on January 24, 2014; accepted
on February 7, 2014
1 INTRODUCTION
Transcription factors (TFs) are proteins that play an important
role in transcriptional regulation by promoting or blocking the
recruitment of RNA polymerase II. They can bind specifically to
recognition sequences on the genome or to other TFs in a complex. High-throughput assays generate a rich amount of information on the sequence preference of TFs. ChIP-Seq (Johnson
et al., 2007) can provide the genome-wide binding sites of a single
TF. DNase-Seq, which sequences open chromatin regions in the
genome, can provide single nucleotide resolution for the binding
sites of many TFs (Hesselberth et al., 2009). When the library is sequenced
deeply enough, binding sites appear as dips, or footprints (FPs), in
the DNase-Seq signal. FPs only identify the locations of the TF
binding sites; they do not identify the proteins that are bound
there. These assays can provide functional information for thousands to millions of base pair regions in the genome.
*To whom correspondence should be addressed.
The task of identifying the sequence preference of a TF is
called motif discovery. Motif discovery algorithms can be classified as either search-based or probabilistic. Search-based algorithms infer motifs as consensus sequences. Probabilistic
algorithms infer motifs as position frequency matrices (PFMs),
which specify the frequency of nucleotides for each position in
the binding site.
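To make the distinction concrete, the following minimal sketch builds a PFM from a handful of aligned binding sites and then collapses it to a consensus sequence. The site sequences and the function names `position_frequency_matrix` and `consensus` are hypothetical and chosen only for illustration; they are not taken from any motif discovery program.

```python
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Return one {nucleotide: frequency} dict per motif position."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    pfm = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        pfm.append({a: counts[a] / len(sites) for a in ALPHABET})
    return pfm

def consensus(pfm):
    """Collapse a PFM to its most frequent letter at each position."""
    return "".join(max(ALPHABET, key=lambda a: col[a]) for col in pfm)

# Hypothetical aligned binding sites (illustration only).
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pfm = position_frequency_matrix(sites)
print(consensus(pfm))
print(pfm[3])
```

The consensus keeps only the most frequent letter per position, whereas the PFM also records that, in this toy example, the second and fourth positions tolerate an alternative nucleotide.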
While PFMs provide more information about a TF’s binding
specificity than consensus sequences, inferring PFMs is not
always practical. Probabilistic motif discovery programs usually
use algorithms such as expectation-maximization (EM)
(Dempster et al., 1977) for inference. These algorithms scale
poorly with dataset size. Search-based algorithms are therefore
preferred for large datasets. DREME (Bailey, 2011) is an example of a search-based algorithm designed for large datasets.
MEME is a popular probabilistic motif discovery program
(Bailey and Elkan, 1994). It uses the EM algorithm to infer
PFMs. Since its inception in 1994, it has gone through several
versions. However, MEME scales poorly with large datasets.
One strategy to improve MEME’s performance is to discard
many of the sequences. This is the strategy used by MEME-ChIP (Machanick and Bailey, 2011). However, discarding
sequences can decrease the chance of discovering motifs corresponding to infrequent cofactors. Another strategy, as used in
STEME, applies suffix trees to accelerate MEME (Reid and
Wernisch, 2011). However, STEME is only practical for finding
motifs of up to width 8 on large datasets because its efficiency
tails off quickly as the motif width increases. Other strategies for
accelerating MEME involve specialized hardware such as parallel pattern matching chips on PCI cards (Sandve et al., 2006).
However, these implementations require hardware not available
to most researchers.
To overcome these issues, we propose an online implementation of the MEME algorithm that we have named EXTREME.
The online EM algorithm sticks closely to the original EM algorithm (hereafter referred to as the batch EM algorithm) (Cappé
and Moulines, 2009). Normally, the online EM algorithm is designed for cases where not all data can be stored at once.
Although most computers have enough memory to store entire
sequence datasets at once, the online EM algorithm is still advantageous for motif discovery because, for large sample sizes,
the online EM algorithm is more efficient, from a computational
point of view, than the batch EM algorithm. We show that many
of the features of the original MEME algorithm can be adapted
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
Motivation: Identifying regulatory elements is a fundamental problem
in the field of gene transcription. Motif discovery—the task of identifying the sequence preference of transcription factor proteins, which
bind to these elements—is an important step in this challenge.
MEME is a popular motif discovery algorithm. Unfortunately,
MEME’s running time scales poorly with the size of the dataset.
Experiments such as ChIP-Seq and DNase-Seq are providing a rich
amount of information on the binding preference of transcription factors. MEME cannot discover motifs in data from these experiments in
a practical amount of time without a compromising strategy such as
discarding a majority of the sequences.
Results: We present EXTREME, a motif discovery algorithm designed
to find DNA-binding motifs in ChIP-Seq and DNase-Seq data. Unlike
MEME, which uses the expectation-maximization algorithm for motif
discovery, EXTREME uses the online expectation-maximization algorithm to discover motifs. EXTREME can discover motifs in large datasets in a practical amount of time without discarding any sequences.
Using EXTREME on ChIP-Seq and DNase-Seq data, we discover
many motifs, including some novel and infrequent motifs that can
only be discovered by using the entire dataset. Conservation analysis
of one of these novel infrequent motifs confirms that it is evolutionarily
conserved and possibly functional.
Availability and implementation: All source code is available at the
GitHub repository http://github.com/uci-cbcl/EXTREME.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
to the online methodology. Furthermore, we show that
EXTREME can achieve similar results to MEME in a fraction
of the execution time. We also show that using the entire dataset
is necessary to discover infrequent motifs, which is not always
practical to do with MEME. To the best of our knowledge, this
is the first application of the online EM algorithm to motif
discovery.
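The difference between the batch and online updates can be illustrated with a toy sketch. It is not EXTREME's implementation: it assumes a simplified two-component mixture in which each fixed-width subsequence comes either from a motif PFM or from a fixed uniform background, and it follows the general online EM recipe of Cappé and Moulines (2009), in which running averages of the expected sufficient statistics are nudged toward each new observation with a decaying step size. The function `online_em`, its parameters (`alpha`, `n0`, `pseudo`) and the synthetic data are all hypothetical.

```python
import random

ALPHABET = "ACGT"

def online_em(subseqs, theta, lam, alpha=0.6, n0=20, pseudo=1e-3):
    """Toy online EM for a motif vs. uniform-background mixture."""
    width = len(theta)
    # Running averages of the expected sufficient statistics.
    s_z = lam
    s_counts = [{a: lam * theta[j][a] for a in ALPHABET} for j in range(width)]
    for n, x in enumerate(subseqs, start=1):
        # E-step on a single subsequence: posterior probability of "motif".
        p_motif, p_bg = lam, 1.0 - lam
        for j, a in enumerate(x):
            p_motif *= theta[j][a]
            p_bg *= 0.25
        z = p_motif / (p_motif + p_bg)
        # Stochastic-approximation update with a decaying step size.
        gamma = (n + n0) ** -alpha
        s_z += gamma * (z - s_z)
        for j in range(width):
            for a in ALPHABET:
                target = z if x[j] == a else 0.0
                s_counts[j][a] += gamma * (target - s_counts[j][a])
        # M-step: re-estimate the parameters from the running statistics.
        lam = s_z
        theta = [{a: (s_counts[j][a] + pseudo) / (s_z + 4 * pseudo)
                  for a in ALPHABET} for j in range(width)]
    return theta, lam

# Synthetic data (assumption): 30% exact motif copies, 70% random 7-mers.
random.seed(0)
motif = "TGACTCA"
data = [motif if random.random() < 0.3
        else "".join(random.choice(ALPHABET) for _ in motif)
        for _ in range(2000)]
# Seed the PFM with a weak preference for the motif letters.
seed_theta = [{a: 0.4 if a == c else 0.2 for a in ALPHABET} for c in motif]
theta_hat, lam_hat = online_em(data, seed_theta, lam=0.3)
consensus = "".join(max(ALPHABET, key=lambda a: col[a]) for col in theta_hat)
print(consensus, round(lam_hat, 2))
```

In this sketch the parameters are refreshed after every subsequence, so a single pass over the data performs as many parameter updates as there are subsequences. A batch EM iteration, by contrast, would scan the whole dataset before making one update.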
2 MATERIALS AND METHODS
2.1 MEME
The original MEME algorithm applies the batch EM algorithm to infer
PFMs. Here, we provide a brief overview of MEME’s model and how
MEME applies the batch EM algorithm to infer parameters.
where $X_{i,j}$ is the letter in the $j$th position of subsequence $X_i$, and $I(k,a)$ is
an indicator function

$$I(k, a) = \begin{cases} 1 & \text{if } a = k \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$
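As a small illustration of how the indicator function of Equation (6) can be used, the sketch below tallies weighted letter counts at each position of a set of equal-width subsequences. The helper `letter_counts` and the per-subsequence weights are hypothetical; the weights stand in for whatever membership probabilities a model assigns and are not derived from MEME's equations here.

```python
def indicator(k, a):
    """I(k, a) from Equation (6): 1 if a equals k, otherwise 0."""
    return 1 if a == k else 0

def letter_counts(subseqs, weights, alphabet="ACGT"):
    """Weighted count of each letter k at each position j."""
    width = len(subseqs[0])
    counts = [{k: 0.0 for k in alphabet} for _ in range(width)]
    for x, w in zip(subseqs, weights):
        for j in range(width):
            for k in alphabet:
                # Only the letter actually observed at position j contributes.
                counts[j][k] += w * indicator(k, x[j])
    return counts

# Hypothetical subsequences and weights (illustration only).
counts = letter_counts(["ACG", "ACT"], [1.0, 0.5])
print(counts[0])
print(counts[2])
```

With all weights set to 1 the result is a plain count matrix; fractional weights give expected counts under the assumed membership probabilities.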
2.1.2 Batch EM
The model parameters are iteratively improved in the batch EM
algorithm. In the E-step, the expected counts of all nucleotides at each
position are calculated based on the current guess of the parameters. In
the M-step, the parameters are updated based on the values calculated in
the E-step. MEME repeats the E and M steps until the c (...truncated)