Transcriptional programs: Modelling higher order structure in transcriptional control (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-10-218.pdf

Transcriptional programs: Modelling higher order structure in transcriptional control

John E Reid 1 Sascha Ott 0 Lorenz Wernisch 1 0 Systems Biology Centre , Coventry House , University of Warwick , Coventry CV4 7AL , UK 1 MRC Biostatistics Unit, Institute of Public Health, University Forvie Site , Robinson Way, Cambridge CB2 0SR , UK Background: Transcriptional regulation is an important part of regulatory control in eukaryotes. Even if binding motifs for transcription factors are known, the task of finding binding sites by scanning sequences is plagued by false positives. One way to improve the detection of binding sites from motifs is by taking cooperativity of transcription factor binding into account. We propose a non-parametric probabilistic model, similar to a document topic model, for detecting transcriptional programs, groups of cooperative transcription factors and co-regulated genes. The analysis results in transcriptional programs which generalise both transcriptional modules and TF-target gene incidence matrices and provide a higher-level summary of these structures. The method is independent of prior specification of training sets of genes, for example, via gene expression data. The analysis is based on known binding motifs. Results: We applied our method to putative regulatory regions of 18,445 Mus musculus genes. We discovered just 68 transcriptional programs that effectively summarised the action of 149 transcription factors on these genes. Several of these programs were significantly enriched for known biological processes and signalling pathways. One transcriptional program has a significant overlap with a reference set of cell cycle specific transcription factors. Conclusion: Our method is able to pick out higher order structure from noisy sequence analyses. The transcriptional programs it identifies potentially represent common mechanisms of regulatory control across the genome. It simultaneously predicts which genes are co-regulated and which sets of transcription factors cooperate to achieve this co-regulation. The programs we discovered enable biologists to choose new genes and transcription factors to study in specific transcriptional regulatory systems. - Background Organisms ranging in complexity from bacteria to higher eukaryotes are able to react and adapt to environmental and cellular signals. These responses are often encoded as complex gene regulatory networks. In these networks the expression of a gene's products is regulated by the activity of other genes. Although regulation can occur at many levels, we focus on transcriptional regulation, one of the most important and pervasive methods of regulation in eukaryotes. Transcriptional regulation occurs when certain gene products, transcription factors (TFs), bind to the DNA at binding sites (TFBSs) and affect the transcription of the regulated gene by modulation of the RNA polymerase complex. TFBSs often appear in clusters or cis-regulatory modules (CRMs), presumably to enable interactions between TFs binding there. Combinatorics of transcriptional regulation TFs do not work in isolation from each other. Particularly in higher organisms, combinatorial operations are often necessary for the response of a cell to external stimuli or developmental programs. Such a response is frequently implemented as a transcriptional switch where a combination of presence or absence of certain TFs regulates the expression of a certain gene. Several well characterised examples of the coordination of TFs are known. For instance, a set of well studied TFs in Drosophila melanogaster that govern spatial patterns of development in its embryo is described in [1]; higher eukaryotes are known to use CRMs to integrate cellular signalling information [2]; the development of the anterior pituitary gland is regulated by combinatorial actions of specific activating and restricting factors [3] which determine cell type. Conversely, cellular processes often involve the coordinated expression of sets of genes. Hence there is reason to suppose that not only do particular sets of transcription factors regulate particular genes but that these sets are also reused across the genome: that is, co-regulated genes are often targets of the same TFs. Genomic data commonly available today, such as sequence data, expression data or TF localisation data, do not permit direct inference of the higher order structure in transcriptional regulation. Most analyses of these data operate at the individual TF level. When the data permit it and the biologist is interested in this level of detail, it is certainly appropriate. However, genomic data is often noisy or incomplete. In this case a summary or view of higher order structure in transcriptional regulation is easier to interpret. Identification of binding sites by sequence analysis The databases TRANSFAC [4] and JASPAR [5] hold the most widely used collections of position specific scoring matrices (PSSMs). Each PSSM is a probabilistic model of the DNA binding specificities of a particular TF: given the PSSM and a stretch of DNA the likelihood of that TF binding to different positions in the sequence can be computationally predicted. There are several problems with this approach: algorithms that find putative binding sites are known to generate many false positives; the regions in which regulatory TFBSs are located are not normally known in advance; and, unfortunately, JASPAR and TRANSFAC do not contain PSSMs for all TFs of interest. We chose to use the PSSMs in TRANSFAC for our analysis. Our model Our model aims to discover cooperative effects between transcription factors in noisy sequence analysis data. We use a model that has had success in the field of document modelling where the task is to infer the latent topics that best summarise a corpus of documents. Each document is modelled as a mixture of several topics drawn from a shared pool of unknown topics and each topic is modelled as a collection of words. Only the documents are given as input to the model. To explain the use of this model in the context of transcriptional regulation we draw an analogy: in our model a document is analogous to a gene; a word is analogous to a transcription factor and the occurrence of a word in a document is analogous to a binding site in a gene's CRM. To complete the picture, a topic is analogous to what we term a transcriptional program (TP). A TP captures the notion of a set of transcription factors that act in a coordinated manner across a set of target genes. So in the same way that a document's topics define its context, a gene's transcriptional programs summarise its transcriptional regulation. Figure 1 shows how transcriptional programs can summarise regulatory relationships. Hierarchical Dirichlet processes (HDPs) are a natural framework to use in document-topic modelling and hence for our work in transcriptional regulation. In our framework, transcriptional programs are modelled as distributions over transcription factors. Each gene's transcriptional regulation (...truncated)