hPDI: a database of experimental human protein–DNA interactions


BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 2 2010, pages 287–289
doi:10.1093/bioinformatics/btp631

Databases and ontologies

Zhi Xie1, Shaohui Hu2,3, Seth Blackshaw1,3,4,5, Heng Zhu2,3 and Jiang Qian1,∗

1 Department of Ophthalmology, 2 Department of Pharmacology and Molecular Sciences, 3 The Center for High-Throughput Biology, 4 Institute for Cell Engineering and 5 Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA

Received on August 26, 2009; revised and accepted on November 3, 2009
Advance Access publication November 9, 2009
Associate Editor: Martin Bishop

1 INTRODUCTION

Protein–DNA interactions (PDIs) mediate a large range of functions essential for cellular differentiation, development and function. Transcription factors (TFs), which regulate gene expression, are a major class of DNA-binding proteins; their DNA-binding specificities have been extensively studied for decades and the results are mainly collected in the TRANSFAC and JASPAR databases (Sandelin et al., 2004; Wingender et al., 1996). In addition, yeast and bacterial one-hybrid techniques (Y1H and B1H, respectively) and the recently developed protein-binding microarray technology provide efficient and comprehensive methods for identifying specific PDIs. Consequently, PDIs have been characterized for a relatively comprehensive set of yeast TFs, as well as for a few TF subfamilies in Caenorhabditis elegans, Drosophila and mouse (Badis et al., 2008; Berger et al., 2008; Deplancke et al., 2006; Grove et al., 2009; Newburger and Bulyk, 2009; Noyes et al., 2008; Zhu et al., 2009). Despite the long history of studies and recent advances in this field, PDIs of the vast majority of human TFs, which comprise a total of ∼1400 proteins (Messina et al., 2004), remain uncharacterized.
Furthermore, sequence-specific PDIs of the larger universe of unconventional DNA-binding proteins (uDBPs), aside from TFs, have not been extensively explored, although a few recent studies have suggested that uDBPs, such as protein kinases and metabolic enzymes, do in fact show sequence-specific DNA binding, as reviewed in Hu et al. (2009).

∗ To whom correspondence should be addressed.

Our recent PDI study, which used unbiased protein microarray assays probed with 460 sequence-diverse DNA motifs, has made significant progress towards the goal of identifying PDIs in humans (Hu et al., 2009). Preferred target sites for 493 human TFs have been identified. Comparison of significant consensus sequences (consensus logos) between our study and TRANSFAC has shown considerable agreement. Furthermore, we found that >500 proteins not predicted to act as TFs unexpectedly showed sequence-specific DNA-binding activity. A number of newly identified PDIs have also been confirmed both in vitro and in vivo.

Here, we present a database hosting the experimentally determined DNA-binding sequences obtained from protein microarray assays for both human TFs and uDBPs. The database is available via a web interface that enables users to browse, query and download any PDIs of interest.

2 DATABASE CONTENT

The human protein DNA Interactome (hPDI) database currently holds a collection of over 17 000 preferred DNA-binding sequences for 493 human TFs and 520 uDBPs. TFs containing known DNA-binding domains (DBDs) cover all the major subfamilies, including zf-C2H2, Homeobox, Nuclear hormone receptor, bHLH, Forkhead, bZIP, Ets, HMG box, RHD, STAT, GATA and IRF. In addition, a number of proteins that do not have known DBDs but are annotated as ‘regulation of gene expression’ in the GO database are also classified as TFs (Ashburner et al., 2000). Consensus logos have been generated for 201 TFs, namely those binding to at least three and fewer than 30 dsDNA oligonucleotide probe sequences.
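The selection criterion stated above (at least three and fewer than 30 bound probe sequences) and the relationship between a set of bound sequences, a consensus and a position weight matrix can be illustrated with a short sketch. This is not the hPDI pipeline. It assumes the bound sequences have already been aligned to equal length, and the function names, the pseudocount of 1 and the uniform background of 0.25 are all illustrative assumptions that do not come from the paper.

```python
import math
from collections import Counter

BASES = "ACGT"
MIN_SEQS = 3   # "at least three" bound probe sequences (from the text)
MAX_SEQS = 30  # "fewer than 30" bound probe sequences (from the text)


def eligible_for_logo(n_bound):
    """Return True if a protein binds enough, but not too many, probes."""
    return MIN_SEQS <= n_bound < MAX_SEQS


def position_frequency_matrix(aligned):
    """Count A/C/G/T per column of pre-aligned, equal-length sequences."""
    seqs = [s.upper() for s in aligned]
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("need a non-empty list of equal-length sequences")
    pfm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pfm.append([counts[b] for b in BASES])
    return pfm


def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores (pseudocount and background are assumptions)."""
    pwm = []
    for counts in pfm:
        total = sum(counts) + pseudocount * len(BASES)
        pwm.append(
            [math.log2((c + pseudocount) / total / background) for c in counts]
        )
    return pwm


def consensus(pfm):
    """Pick the most frequent base per column (ties resolved in ACGT order)."""
    return "".join(
        BASES[max(range(len(BASES)), key=lambda i: counts[i])] for counts in pfm
    )


if __name__ == "__main__":
    # Hypothetical aligned sequences, invented for illustration only.
    bound = ["TGACGT", "TGACGA", "TGATGT", "TGACGT"]
    if eligible_for_logo(len(bound)):
        pfm = position_frequency_matrix(bound)
        print(consensus(pfm))
        print(position_weight_matrix(pfm)[0])
```

Keeping the count matrix separate from the log-odds matrix means the same counts can be passed either to a logo renderer or to a scoring routine; the choice of pseudocount and background would need to follow whatever convention the downstream tool expects.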
Consensus logos are generated using ‘WebLogo’ (Crooks et al., 2004). Among these logos, 166 are novel: they belong to TFs with no previously known binding sites listed in TRANSFAC. It should be noted that the consensus logos from TRANSFAC are generated from the TRANSFAC SITE database, where only DNA-binding sequences bound by human proteins are used.

PDIs of the uDBPs identified in our recent study (Hu et al., 2009) are archived in the database by protein class, including protein kinases, chromatin-associated proteins, RNA-binding proteins, transcriptional co-regulators, nucleic acid-binding proteins other than TFs and RNA-binding proteins, proteins associated with DNA repair and replication, mitochondrial proteins and all other categories. Using the same criteria as for the TFs, consensus logos for 236 uDBPs have been generated and archived. The class/family coverage of proteins in hPDI is summarized in Table 1.

Table 1. Statistics of the hPDI database

TF family                          No.
Total                              493
Zf-C2H2                             95
Homeobox                            44
Other sub-families                  36
HLH                                 22
Nuclear hormone_receptor            17
zf-CCHC                             12
Myb                                 11
HMG_box                             11
Ets                                 10
MH                                   8
bZIP_1                               6
Forkhead                             6
IRF                                  6
TFs without identified DBDs        209

uDBP classes                       No.
Total                              520
RNA-binding proteins               207
All other categories               132
Mitochondrial proteins              97
Chromatin-associated proteins       73
Other nucleic acid binding          50
DNA repair and replication          50
Transcriptional co-regulators       43
Protein kinases                     14

Overall statistics                 No.
No. of PDIs                        17 718
No. of DNA-binding proteins        1013
No. of DNA-binding logos           437
Mean sequences bound per protein   17

TFs without identified DBDs are defined as proteins annotated as ‘regulation of transcription’ in the GO database but without known DBDs defined by the Pfam database. Some proteins may belong to more than one protein class.

ABSTRACT

Summary: The human protein DNA Interactome (hPDI) database holds experimental protein–DNA interaction data for humans identified by protein microarray assays. The unique characteristic of hPDI is that it contains consensus DNA-binding sequences not only for nearly 500 human transcription factors but also for >500 unconventional DNA-binding proteins, which were previously completely uncharacterized. Users can browse, search and download a subset of the data or the entire dataset via a web interface. This database is freely accessible for any academic purposes.

Availability: http://bioinfo.wilmer.jhu.edu/PDI/

© The Author 2009. Published by Oxford University Press. All rights reserved.

3 DATABASE ARCHITECTURE AND DATA RETRIEVAL

We have developed a web interface for the hPDI database. Perl CGI scripts, served by an Apache web server, connect to the database and dynamically generate user-friendly HTML front-end query pages. Users may perform the following tasks on the web.

(1) Protein view: Users can search for a protein of interest. The protein view pages provide the relevant information on the protein, such as annotation, protein class, DNA-binding sequences/logos and the position weight matrix (PWM).

(2) (...truncated)