Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs

http://www.biomedcentral.com/content/pdf/1471-2105-10-82.pdf

BMC Bioinformatics

Bartek Wilczynski, Norbert Dojer, Mateusz Patelak, Jerzy Tiuryn
Institute of Informatics, University of Warsaw, Warsaw, Poland

Background: Finding functional regulatory elements in DNA sequences is a very important problem in computational biology, and a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on a genome-wide scale. The major obstacles are that the amount of non-coding DNA is vast and that methods for predicting functional transcription factor binding sites tend to produce a high percentage of false positives. This makes it difficult to find regions that are significantly enriched in binding sites.

Results: We develop a novel method for predicting regulatory regions in DNA sequences. It is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms.

Conclusion: We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Our method is also robust enough to perform well in predicting CRMs despite differences in tissue specificity, and even across species, provided that the evolutionary distances between the compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.

Background

Deciphering the mechanisms of gene regulation is currently one of the key problems in molecular biology.
The number of sequenced and annotated genomes is increasing rapidly, but we do not fully understand the regulatory networks underlying gene regulation. A few datasets approach a genome-wide understanding of gene regulation in relatively simple organisms such as E. coli [1] or S. cerevisiae [2], but our understanding of gene regulation is far from complete, especially for higher eukaryotes. Experimental reconstruction of regulatory interactions is possible for relatively small systems [3], but it is impossible to scale this approach to all the available genomes. Therefore, computational methods are currently the best tool for improving our understanding of genome-wide gene regulation.

Biological background

Transcriptional regulation is mediated by proteins called transcription factors, which bind to DNA sequences to help or prevent the initiation of transcription by RNA polymerase. This binding is selective, i.e. trans-factors bind only to specific DNA sequence motifs (called cis-elements) [4]. In higher eukaryotes, many genes need to exhibit complex spatio-temporal expression patterns. The key to achieving such complexity is combinatorial transcriptional regulation [5], i.e. different combinations of similar cis-elements may yield different expression profiles. Sequence elements whose main function is to drive complex expression patterns are often referred to as cis-regulatory modules (CRMs).

Throughout this paper we will use this term, but it should be noted that our method is limited to finding CRMs that are relatively close to the transcription start site (TSS) of a gene of interest (within 10 kb upstream or downstream of its TSS), whereas in general the term "CRM" may also refer to distant enhancers, which cannot be found using our method.
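The 10 kb limit stated above can be made concrete with a small sketch. This is only an illustration of the stated search range, not the authors' implementation: the function names, the 0-based half-open coordinate convention, and the clipping to the sequence ends are all assumptions introduced here.

```python
from typing import Tuple

# Hypothetical helpers; names and coordinate convention (0-based, half-open)
# are assumptions for illustration, not taken from the paper.

def search_window(tss: int, seq_length: int, flank: int = 10_000) -> Tuple[int, int]:
    """Return [start, end) covering `flank` bp on each side of the TSS,
    clipped to the boundaries of the sequence."""
    if not 0 <= tss < seq_length:
        raise ValueError("TSS must lie within the sequence")
    start = max(0, tss - flank)
    end = min(seq_length, tss + flank + 1)  # +1 so the TSS base itself is included
    return start, end


def is_reachable(element_start: int, element_end: int, tss: int,
                 seq_length: int, flank: int = 10_000) -> bool:
    """True if the element [element_start, element_end) lies entirely
    inside the search window around the TSS."""
    start, end = search_window(tss, seq_length, flank)
    return start <= element_start and element_end <= end
```

Under these assumptions, a module located 5 kb from the TSS falls inside the window, while an enhancer 70 kb away does not, which is the distinction the paragraph above draws between proximal CRMs and distant enhancers.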
Previous work

The earliest computational approaches to discovering CRMs in non-coding DNA were based on two observations: CRMs contain an unusually high concentration of binding sites [6], and CRMs are more conserved across species than other non-coding sequences [7]. These early approaches sparked a number of studies that use different computational approaches to find CRMs based on these two presumed properties [8-19].

However, in the light of more recent analyses of the statistical properties of CRMs [20], neither assumption appears to be a reliable foundation for CRM prediction. After analyzing over 500 experimentally verified CRMs from D. melanogaster, Li et al. claim that the clustering of motifs may reliably predict only a few CRMs (most notably the ones involved in early blastoderm formation). Similarly, evolutionary conservation of CRMs appears to be less stringent and much more nuanced than previously thought. Firstly, CRMs are significantly more conserved than the rest of non-coding DNA only if conservation is measured by the density of short (7 bp) blocks conserved between species, rather than by simple sequence identity over larger windows. This is supported by recent findings that the evolution of CRMs is driven by the gain and loss of whole binding sites rather than by point mutations [21]. Secondly, even though the set of investigated CRMs was statistically conserved, the authors conclude that most CRMs are not distinguishable from other non-coding sequences based solely on conservation. These findings are not specific to D. melanogaster and are supported by a very recent study [22] that compares TF binding signatures in human and mouse liver.

There are, however, two published studies that address these issues at least partially. Hallikas et al. [23] propose the EEL algorithm, which aligns significant motif occurrences instead of the sequences themselves.
This method is very efficient and does not rely on raw sequence similarity, but it assumes that the motifs in conserved CRMs occur in exactly the same order. The BLISS method [24], on the other hand, approaches the same problem by analyzing a matrix that contains the occurrences of all motifs along both homologous sequences after Gaussian smoothing. This relaxes the assumption of conserved motif order, but at a very high computational cost.

These two approaches fall into the category of non-tissue-specific methods, as does the approach reported in the present paper. The other group of methods, which could be called tissue-specific, is tuned to a particular type of CRM, either by using a set of several known specific motifs [9] or by learning such motifs from known tissue-specific CRMs [8].

Contributions of the present paper

We present a novel approach to finding CRMs in non-coding sequences associated with homologous genes. It is based on a simple method of scoring the likelihood of the occurrence of a conserved combination of binding sites in a fixed-size window. This measure is constructed in such a way that it does not rely on strict criteria for either sequence conservation or motif clustering. We show that we are able to use the same parameters to discover motifs in human, rat, mouse and fruit fly using a universal, non-tissue-specific set of known mo (...truncated)