hPDI: a database of experimental human protein–DNA interactions
BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 2 2010, pages 287–289
doi:10.1093/bioinformatics/btp631
Databases and ontologies
Zhi Xie1 , Shaohui Hu2,3 , Seth Blackshaw1,3,4,5 , Heng Zhu2,3 and Jiang Qian1,∗
1 Department of Ophthalmology, 2 Department of Pharmacology and Molecular Sciences, 3 The Center for
High-Throughput Biology, 4 Institute for Cell Engineering and 5 Department of Neuroscience, Johns Hopkins University
School of Medicine, Baltimore, MD 21231, USA
Received on August 26, 2009; revised and accepted on November 3, 2009
Advance Access publication November 9, 2009
Associate Editor: Martin Bishop
1 INTRODUCTION
Protein–DNA interactions (PDIs) mediate a large range of functions
essential for cellular differentiation, development and function.
A major class of DNA-binding proteins are the transcription factors
(TFs) that regulate gene expression; DNA-binding specificities
of TFs have been extensively studied for decades and the results
are mainly collected in the TRANSFAC and JASPAR databases
(Sandelin et al., 2004; Wingender et al., 1996). In addition, yeast and
bacterial one-hybrid techniques (Y1H and B1H, respectively) and
the recently developed protein-binding microarray technology
provide efficient and comprehensive methods for identifying
specific PDIs. Consequently, PDIs have been characterized for a
relatively comprehensive set of yeast TFs, as well as for a few TF
subfamilies in Caenorhabditis elegans, Drosophila and mouse (Badis
et al., 2008; Berger et al., 2008; Deplancke et al., 2006; Grove
et al., 2009; Newburger and Bulyk, 2009; Noyes et al., 2008;
Zhu et al., 2009). Despite the long history of studies and recent
advances in this field, PDIs remain uncharacterized for the vast
majority of human TFs, which comprise a total of ∼1400 proteins (Messina
et al., 2004). Furthermore, sequence-specific PDIs of the larger
universe of unconventional DNA-binding proteins (uDBPs), i.e. those
other than TFs, have not been extensively explored, although a few recent
studies have suggested that uDBPs, such as protein kinases and
metabolic enzymes, do in fact exhibit sequence-specific
DNA binding, as reviewed in Hu et al. (2009).
∗ To whom correspondence should be addressed.
Our recent PDI study, using unbiased protein microarray
assays probed with 460 sequence-diverse DNA motifs, has made
significant progress towards the goal of identifying PDIs in humans
(Hu et al., 2009). Preferred target sites for 493 human TFs have
been identified. Comparison of significant consensus sequences
(consensus logos) between our study and TRANSFAC has shown
considerable agreement. Furthermore, we found that >500 proteins
not predicted to act as TFs unexpectedly showed sequence-specific
DNA-binding activity. A number of newly identified PDIs have
also been confirmed both in vitro and in vivo. Here, we present
a database hosting the experimentally determined DNA-binding
sequences obtained from protein microarray assays for both human
TFs and uDBPs. The database is available via a web interface
that enables users to browse, query and download any PDIs of
interest.
2 DATABASE CONTENT
The human protein DNA Interactome (hPDI) database currently
holds a collection of over 17 000 preferred DNA-binding sequences
for 493 human TFs and 520 uDBPs. TFs containing known DNA-binding
domains (DBDs) cover all the major subfamilies, including
zf-C2H2, Homeobox, Nuclear hormone receptor, bHLH, Forkhead,
bZIP, Ets, HMG box, RHD, STAT, GATA and IRF. In addition,
a number of proteins that do not have known DBDs but are
annotated as ‘regulation of gene expression’ by the GO database are
also annotated as TFs (Ashburner et al., 2000). Consensus logos
have been generated for the 201 TFs that bind to at least three
and <30 dsDNA oligonucleotide probe sequences. Consensus logos
are generated using ‘WebLogo’ (Crooks et al., 2004). Among these
logos, 166 are novel: they belong to TFs that have no previously known
binding sites listed in TRANSFAC. It should be noted that the consensus
logos from TRANSFAC are generated from the TRANSFAC SITE
database where only DNA-binding sequences bound by human
proteins are used.
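The selection criterion described above (a protein qualifies for a consensus logo if it binds at least three and fewer than 30 probe sequences) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (a mapping from protein name to its list of bound probe sequences), the function names, and the toy protein names and sequences are all assumptions made for the example.

```python
# Hypothetical sketch of the logo-eligibility filter described in the text.
# The data layout and all names below are assumptions, not part of hPDI.

MIN_PROBES = 3   # "at least three" bound probe sequences
MAX_PROBES = 30  # exclusive upper bound: fewer than 30 bound probe sequences


def logo_eligible(bound_probes):
    """Return True if the number of bound probes satisfies the stated criterion."""
    return MIN_PROBES <= len(bound_probes) < MAX_PROBES


def select_for_logos(pdi_table):
    """Keep only the proteins whose bound-probe count qualifies for a logo."""
    return {
        name: probes
        for name, probes in pdi_table.items()
        if logo_eligible(probes)
    }


# Toy input with made-up protein names and sequences.
toy_pdis = {
    "PROT_A": ["ACGTACGT", "ACGTTCGT"],              # 2 probes: too few
    "PROT_B": ["TTGACGCA", "TTGACGTA", "TTGACGAA"],  # 3 probes: eligible
    "PROT_C": ["GATAAGCA"] * 30,                     # 30 probes: too many
}

eligible = select_for_logos(toy_pdis)
print(sorted(eligible))
```

On the toy input only PROT_B passes the filter. The sequences retained for each eligible protein would then be handed to a logo generator such as WebLogo, a step not shown here.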
PDIs of the uDBPs identified in our recent study (Hu et al.,
2009) are archived in the database by protein class,
including protein kinases, chromatin-associated proteins,
RNA-binding proteins, transcriptional co-regulators, nucleic
acid-binding proteins other than TFs and RNA-binding proteins,
proteins associated with DNA repair and replication, mitochondrial
proteins and all other categories. Using the same criteria as for the TFs,
consensus logos for 236 uDBPs have been generated and archived.
The class/family coverage of proteins in hPDI is summarized in
Table 1.
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email:
ABSTRACT
Summary: The human protein DNA Interactome (hPDI) database
holds experimental protein–DNA interaction data for humans
identified by protein microarray assays. The unique characteristic
of hPDI is that it contains consensus DNA-binding sequences
not only for nearly 500 human transcription factors but also for
>500 unconventional DNA-binding proteins, which were previously
completely uncharacterized. Users can browse, search and download
a subset of the data or the entire dataset via a web interface. This
database is freely accessible for any academic purposes.
Availability: http://bioinfo.wilmer.jhu.edu/PDI/
Contact:
Table 1. Statistics of the hPDI database

TF families                         No.
Total                               493
Zf-C2H2                              95
Homeobox                             44
Other sub-families                   36
HLH                                  22
Nuclear hormone_receptor             17
zf-CCHC                              12
Myb                                  11
HMG_box                              11
Ets                                  10
MH                                    8
bZIP_1                                6
Forkhead                              6
IRF                                   6
TFs without identified DBDs         209

uDBP classes                        No.
Total                               520
RNA-binding proteins                207
All other categories                132
Mitochondrial proteins               97
Chromatin-associated proteins        73
Other nucleic acid binding           50
DNA repair and replication           50
Transcriptional co-regulators        43
Protein kinases                      14

Overall statistics                  No.
No. of PDIs                      17 718
No. of DNA-binding proteins        1013
No. of DNA-binding logos            437
Mean sequences bound per protein     17

TFs without identified DBDs are defined as proteins annotated as ‘regulation of transcription’ in the GO database but without known DBDs defined by the Pfam database. Some proteins
may belong to more than one protein class.
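The figures in Table 1 can be cross-checked against the counts quoted in the text with a few lines of arithmetic. The sketch below uses only numbers taken from the table and the text. The variable names are illustrative, and the pairing of each count with its family or class label assumes that labels and numbers appear in the same order in the table.

```python
# Cross-check of Table 1 against the text. All numbers come from the article;
# the label-to-count pairing assumes labels and counts are listed in matching order.

tf_family_counts = {
    "Zf-C2H2": 95, "Homeobox": 44, "Other sub-families": 36, "HLH": 22,
    "Nuclear hormone_receptor": 17, "zf-CCHC": 12, "Myb": 11, "HMG_box": 11,
    "Ets": 10, "MH": 8, "bZIP_1": 6, "Forkhead": 6, "IRF": 6,
    "TFs without identified DBDs": 209,
}

udbp_class_counts = {
    "RNA-binding proteins": 207, "All other categories": 132,
    "Mitochondrial proteins": 97, "Chromatin-associated proteins": 73,
    "Other nucleic acid binding": 50, "DNA repair and replication": 50,
    "Transcriptional co-regulators": 43, "Protein kinases": 14,
}

tf_total = sum(tf_family_counts.values())         # compare with the TF total (493)
udbp_class_sum = sum(udbp_class_counts.values())  # compare with the uDBP total (520)

n_proteins = 493 + 520   # TFs + uDBPs
n_logos = 201 + 236      # TF logos + uDBP logos, as quoted in the text
mean_per_protein = 17718 / n_proteins  # PDIs per DNA-binding protein

print("TF family rows sum to:", tf_total)
print("uDBP class rows sum to:", udbp_class_sum)
print("DNA-binding proteins:", n_proteins)
print("DNA-binding logos:", n_logos)
print("Mean sequences per protein: %.2f" % mean_per_protein)
```

Under this reading the TF family rows add up to the stated total of 493, and the protein and logo counts match the table (1013 and 437). The uDBP class rows add up to 666, more than the 520 uDBPs, which is consistent with the table note that some proteins belong to more than one class. The mean of about 17.49 sequences per protein agrees with the reported value of 17.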
3 DATABASE ARCHITECTURE AND DATA RETRIEVAL
We have developed a web interface for the hPDI database. Perl
CGI is used to connect to the database and dynamically generate
user-friendly HTML front-end query pages, served by an Apache web
server. Users may perform the following tasks through the web interface.
(1) Protein view: Users can search for a protein of interest. The
protein view pages provide the relevant information about
the protein, such as annotation, protein class, DNA-binding
sequences/logos and the position weight matrix (PWM).
(2) (...truncated)