Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs
BMC Bioinformatics
Bartek Wilczynski
Norbert Dojer
Mateusz Patelak
Jerzy Tiuryn
Institute of Informatics, University of Warsaw, Warsaw, Poland
Background: Finding functional regulatory elements in DNA sequences is a very important problem in computational biology, and providing a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on a genome-wide scale. Major obstacles in this respect are that the amount of non-coding DNA is vast, and that the methods for predicting functional transcription factor binding sites tend to produce results with a high percentage of false positives. This makes the problem of finding regions significantly enriched in binding sites difficult.
Results: We develop a novel method for predicting regulatory regions in DNA sequences, which is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms.
Conclusion: We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Also, our method is robust enough to perform well in predicting CRMs, despite differences in tissue specificity and even across species, provided that the evolutionary distances between compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.
Background
Deciphering mechanisms of gene regulation is currently
one of the key problems in molecular biology. The
number of sequenced and annotated genomes is
increasing rapidly, but we do not fully understand the
regulatory networks underlying gene regulation. A few
datasets approaching a genome-wide understanding of
gene regulation in relatively simple organisms such as
E. coli [1] or S. cerevisiae [2] exist, but especially for
higher eukaryotes our understanding of gene regulation
is far from complete. Experimental reconstruction of
regulatory interactions is possible for relatively small
systems [3], but it is impossible to scale this approach to
all the available genomes. Therefore, computational
methods are currently the best tool for improving our
understanding of genome-wide gene regulation.
Biological background
The process of transcriptional regulation is facilitated by
proteins called transcription factors which bind to DNA
sequences to help or prevent the initiation of
transcription by RNA polymerase. This binding is selective, i.e.
trans-factors bind only to specific DNA sequence motifs
(called cis-elements) [4]. In higher eukaryotes, many
genes need to exhibit complex spatio-temporal
expression patterns. The key to achieving such complexity is
combinatorial transcriptional regulation [5], i.e. different
combinations of similar cis-elements may yield different
expression profiles. Sequence elements, whose main
function is driving complex expression patterns, are
often referred to as cis-regulatory modules (CRMs).
Throughout this paper, we will use this term, but it
should be noted that our method is limited to finding
CRMs that are relatively close to the transcription start site
(TSS) of a gene of interest (in the range of 10 kb up- or
down-stream of its TSS) whereas in general the term
"CRM" may also be used to refer to distant enhancers
which cannot be found using our method.
Previous work
The earliest computational approaches to discovering CRMs
in non-coding DNA were based on two observations:
(1) CRMs contain an unusually high concentration of binding
sites [6], and
(2) CRMs are more conserved across species than other
non-coding sequences [7].
These early approaches sparked a number of studies which
utilize different computational approaches to find CRMs
based on these two presumed properties [8-19]. However,
in the light of more recent analyses of the statistical
properties of CRMs [20], neither assumption appears to
be a reliable foundation for CRM prediction. After analyzing
over 500 experimentally verified CRMs from D.
melanogaster, Li et al. claim that the clustering of motifs may reliably
predict only a few CRMs (most notably the ones involved in
the early blastoderm formation). Similarly, evolutionary
conservation of CRMs appears to be less stringent and much
more nuanced than previously thought. Firstly, CRMs are
significantly more conserved than the rest of non-coding
DNA only if measured by the density of short (7 bp) blocks
conserved between species, rather than by simple sequence
identity over larger windows. This is supported by recent
findings that the evolution of CRMs is driven by gain and
loss of whole binding sites rather than point mutations [21].
Secondly, even though the set of investigated CRMs was
statistically conserved, the authors conclude that most
CRMs are not distinguishable from other non-coding
sequences based solely on conservation. These findings are
not specific to D. melanogaster and are supported by a very
recent study [22] based on comparing TF binding signatures
in human and mouse liver.
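The difference between the two ways of measuring conservation mentioned above can be illustrated with a small sketch. It is not the procedure used in [20]. It assumes a gapped pairwise alignment given as two equal-length strings, and it uses a hypothetical definition of block density: the fraction of alignment columns lying inside runs of at least k = 7 identical, ungapped columns. The function names are ours.

```python
def percent_identity(a, b):
    """Fraction of alignment columns where both sequences carry the same base."""
    assert len(a) == len(b), "aligned sequences must have equal length"
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def conserved_block_density(a, b, k=7):
    """Fraction of columns inside runs of >= k identical, ungapped columns.

    Assumption: this is an illustrative definition of "block density",
    not the exact statistic of the cited study.
    """
    assert len(a) == len(b), "aligned sequences must have equal length"
    covered = 0  # columns that belong to a sufficiently long run
    run = 0      # length of the current run of identical columns
    for x, y in zip(a, b):
        if x == y and x != "-":
            run += 1
        else:
            if run >= k:
                covered += run
            run = 0
    if run >= k:  # close a run that reaches the end of the alignment
        covered += run
    return covered / len(a)


# Toy alignment: a mismatch at every fourth column.
seq_a = "ACGTACGTACGTAC"
seq_b = "ACGAACGAACGAAC"
identity = percent_identity(seq_a, seq_b)        # 11 of 14 columns match
density = conserved_block_density(seq_a, seq_b)  # no run of 7 or more
```

In this toy alignment the sequence identity is high (11 of 14 columns), yet no conserved block of 7 bp exists, so the two measures can rank the same region very differently.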
However, there are two published studies addressing
these issues at least partially. Hallikas et al. [23] propose
the EEL algorithm for finding alignments of significant
motif occurrences instead of the sequences themselves.
This method is very efficient and does not rely on raw
sequence similarity, but it assumes that the motifs in
conserved CRMs occur in exactly the same order. On the
other hand, the BLISS method [24] approaches the same
problem by analysis of a matrix containing occurrences
of all motifs along both homologous sequences after
Gaussian smoothing. This relaxes the assumption of
conserved motif occurrence order, but at a very high
computational cost. These two approaches fall into the
category of non-tissue-specific methods. The approach
reported in the present paper also falls into this category.
The other group of methods, which could be called
tissue-specific, is tuned for a particular type of CRM,
either using a set of several known specific motifs [9] or
learning such motifs from known tissue-specific
CRMs [8].
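To make the role of the motif-order assumption concrete, the following toy sketch compares two ways of counting motifs shared by a pair of homologous windows. It implements neither EEL nor BLISS. Motif hits are assumed to be given as (position, motif name) pairs, the motif names M1, M2, M3 and their positions are invented for illustration, and the function names are ours. The order-preserving count is the length of the longest common subsequence of motif names, while the order-free count is the size of the multiset intersection.

```python
from collections import Counter


def shared_motifs_unordered(hits_a, hits_b):
    """Number of motif occurrences shared by both windows, ignoring order."""
    count_a = Counter(name for _, name in hits_a)
    count_b = Counter(name for _, name in hits_b)
    # Counter & Counter keeps the minimum count of each motif name.
    return sum((count_a & count_b).values())


def shared_motifs_ordered(hits_a, hits_b):
    """Longest common subsequence of motif names, read in positional order."""
    xs = [name for _, name in sorted(hits_a)]
    ys = [name for _, name in sorted(hits_b)]
    # Standard dynamic programme: dp[i][j] is the LCS length of xs[:i], ys[:j].
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(xs)][len(ys)]


# Hypothetical hits: same three motifs, but in reversed order in species B.
window_a = [(10, "M1"), (40, "M2"), (75, "M3")]
window_b = [(12, "M3"), (50, "M2"), (90, "M1")]
unordered = shared_motifs_unordered(window_a, window_b)  # all three are shared
ordered = shared_motifs_ordered(window_a, window_b)      # only one fits in order
```

When the binding sites are rearranged between species, the order-preserving count drops to a single motif although all three are present in both windows, which is the situation an order-free comparison is meant to handle.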
Contributions of the present paper
We present a novel approach to finding CRMs in
non-coding sequences associated with homologous genes. It
is based on a simple method of scoring the likelihood that
a conserved combination of binding sites occurs
in a fixed-size window. This measure is constructed in
such a way that it does not rely on strict criteria for
either sequence conservation or motif clustering.
We show that we are able to use the same parameters to
discover motifs in human, rat, mouse and fruit fly using
a universal, non-tissue-specific set of known mo (...truncated)