iProClass: an integrated database of protein family, function and structure information
390–392
Nucleic Acids Research, 2003, Vol. 31, No. 1
DOI: 10.1093/nar/gkg044
© 2003 Oxford University Press
Hongzhan Huang, Winona C. Barker¹, Yongxing Chen¹ and Cathy H. Wu*
Department of Biochemistry and Molecular Biology and ¹National Biomedical Research Foundation, Georgetown
University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA
Received September 15, 2002; Accepted September 27, 2002
*To whom correspondence should be addressed. Tel: +1 202 687 2121; Fax: +1 202 687 1662; Email:
ABSTRACT
The iProClass database provides comprehensive,
value-added descriptions of proteins and serves as
a framework for data integration in a distributed
networking environment. The protein information in
iProClass includes family relationships as well as
structural and functional classifications and
features. The current version consists of about
830 000 non-redundant PIR-PSD, SWISS-PROT, and
TrEMBL proteins organized with more than 36 000
PIR superfamilies, 145 000 families, 4000 domains,
1300 motifs and 550 000 FASTA similarity clusters. It
provides rich links to over 50 databases of protein
sequences, families, functions and pathways,
protein–protein interactions, post-translational
modifications, protein expressions, structures and
structural classifications, genes and genomes,
ontologies, literature and taxonomy. Protein and
superfamily summary reports present extensive
annotation information and include membership
statistics and graphical display of domains and
motifs. iProClass employs an open and modular
architecture for interoperability and scalability. It is
implemented in the Oracle object-relational database
system and is updated biweekly. The database is
freely accessible from the web site at
http://pir.georgetown.edu/iproclass/ and searchable by
sequence or text string. The data integration in
iProClass supports exploration of protein
relationships. Such knowledge is fundamental to the
understanding of protein evolution, structure and function
and crucial to functional genomic and proteomic
research.
INTRODUCTION
The completion of the draft human genome sequences marked
the beginning of a new era of biological research, in which
scientists have begun systematically to explore gene functions
and other complex regulatory processes by studying organisms
at the global scale of genomes, transcriptomes and proteomes.
With the accelerated accumulation of molecular data, advanced
bioinformatics infrastructures must be developed in order to
fully explore these valuable data and to generate new
hypotheses and derive scientific knowledge. One major
challenge lies in the volume, complexity and dynamic nature
of the data, which are being collected and maintained in
heterogeneous and distributed sources. The iProClass database
(1) was designed to offer a comprehensive, integrated view of
protein information to facilitate knowledge discovery.
OVERVIEW AND CURRENT CONTENTS
The iProClass database (Fig. 1) contains value-added descriptions
of proteins, including family relationships at both global
(superfamily/family) and local (domain, motif, site) levels, as
well as structural and functional classifications and features.
The database was first released in October 2000 and contained
about 200 000 proteins from the PIR Protein Sequence
Database (PIR-PSD) (2) and SWISS-PROT (3). It is updated
biweekly and currently consists of about 830 000 non-redundant
protein sequences from the PIR-PSD, SWISS-PROT, and
TrEMBL (3) databases. The protein entries are
organized with more than 36 000 PIR superfamilies (4),
145 000 families, 3700 Pfam (5) and PIR homology domains,
1300 ProSite (6) motifs, 550 000 FASTA (7) similarity
clusters, and links to over 50 molecular biology databases.
Database cross-references in iProClass are represented by
rich links, which include both the links and related summary
information. This approach effectively combines data warehouse
and hypertext navigation methods for data integration to
provide timely information from distributed sources. iProClass
collects information from and links to databases for protein
sequences (PIR-PSD, PIR-NREF, SWISS-PROT, TrEMBL,
GenPept, RefSeq), families (InterPro, Pfam, ProSite, Blocks,
Prints, COG, MetaFam, PIR-ASDB, ProClass), functions and
pathways (EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc,
EcoCyc), interactions (DIP, BIND), post-translational
modifications (RESID, PhosphoSite DB), protein expression and
proteomes (PMG), structures and structural classifications
(PDB, PDBSum, SCOP, CATH, FSSP, MMDB), genes and
genomes (GenBank, EMBL, DDBJ, LocusLink, TIGR, SGD,
FlyBase, MGI, GDB, OMIM, MIPS, GenProtEC), ontologies
(GO), literature (PubMed) and taxonomy (NCBI Taxonomy).
The information content is continually enhanced by: (i) adding
links to more databases, (ii) adding executive summary
information from the linked databases and (iii) increasing the
number of occurrences of links to the databases that iProClass
already links to. The composite annotations collected from
multiple sources are presented with attribution to the underlying databases.
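The rich-link idea described above, in which a cross-reference carries a summary line together with the link and stays attributed to its source, can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class and field names (RichLink, ProteinEntry, annotations_by_source), the example.org URLs and the linked identifiers are hypothetical and are not taken from the iProClass schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class RichLink:
    """A cross-reference that stores summary text alongside the link (hypothetical)."""
    database: str    # name of the source database, kept for attribution
    identifier: str  # entry identifier within that database
    url: str         # hypertext link to the source entry
    summary: str     # short summary line copied from the source


@dataclass
class ProteinEntry:
    """A protein record holding rich links to external databases (hypothetical)."""
    entry_id: str
    links: List[RichLink] = field(default_factory=list)

    def add_link(self, link: RichLink) -> None:
        self.links.append(link)

    def annotations_by_source(self) -> Dict[str, List[str]]:
        """Group summary lines by database so each one stays attributed."""
        grouped: Dict[str, List[str]] = {}
        for link in self.links:
            grouped.setdefault(link.database, []).append(link.summary)
        return grouped


# Placeholder identifiers, URLs and summaries, for illustration only.
entry = ProteinEntry("A31997")
entry.add_link(RichLink("Pfam", "PF_EXAMPLE",
                        "https://example.org/pfam/PF_EXAMPLE",
                        "example domain summary"))
entry.add_link(RichLink("PDB", "PDB_EXAMPLE",
                        "https://example.org/pdb/PDB_EXAMPLE",
                        "example structure summary"))
print(entry.annotations_by_source())
```

Storing the summary with the link corresponds to the data warehouse side of the approach, while the URL preserves hypertext navigation to the current source entry.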
iProClass presents comprehensive views for protein
sequences and superfamilies in two types of summary reports.
The protein sequence report covers information on family,
structure, function, gene, genetics, disease, ontology, taxonomy and literature, with cross-references to relevant molecular databases and executive summary lines, as well as a
graphical display of domain and motif sequence regions and a
link to related sequences in pre-computed FASTA clusters. The
superfamily report provides PIR superfamily membership
information with length, taxonomy and keyword statistics,
complete member listing separated into major kingdoms,
family relationships at the whole protein and domain and motif
levels with direct mapping to other classifications, structure
and function cross-references, graphical display of domain and
motif architecture of members, and a link to dynamically
generated multiple sequence alignments and phylogenetic trees
for superfamilies with curated seed members.
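As an illustration of the membership statistics summarized in the superfamily report, the following sketch computes length, kingdom and keyword statistics for a list of members. It is a minimal example under stated assumptions, not the iProClass implementation: the Member record, the superfamily_statistics function and the sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Member:
    """One superfamily member (hypothetical record layout)."""
    entry_id: str
    length: int                 # sequence length in residues
    kingdom: str                # major taxonomic group
    keywords: Tuple[str, ...]   # annotation keywords


def superfamily_statistics(members: List[Member]) -> Dict[str, Any]:
    """Summarize length, taxonomy and keyword statistics for the members."""
    if not members:
        raise ValueError("a superfamily needs at least one member")
    lengths = [m.length for m in members]
    return {
        "count": len(members),
        "length_min": min(lengths),
        "length_max": max(lengths),
        "length_mean": mean(lengths),
        "by_kingdom": Counter(m.kingdom for m in members),
        "keywords": Counter(k for m in members for k in m.keywords),
    }


# Made-up members, for illustration only.
members = [
    Member("M1", 120, "Eukaryota", ("kinase", "transferase")),
    Member("M2", 150, "Bacteria", ("kinase",)),
    Member("M3", 180, "Eukaryota", ("transferase",)),
]
stats = superfamily_statistics(members)
print(stats["count"], stats["length_min"], stats["length_max"])
print(stats["by_kingdom"].most_common())
```

Grouping members by the kingdom field in the same way would give the member listing separated into major kingdoms.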
DATABASE ACCESS AND USAGE
The iProClass database employs an open and modular database
architecture to provide a framework for data integration in a
distributed networking environment. The modular structure
makes the system scalable, customizable, and extendable for
adding new components. The database is implemented in the
Oracle object-relational system and freely accessible from our
web site at http://pir.georgetown.edu/iproclass/. Direct report
retrieval is based on unique identifiers such as PIR or SWISS-PROT sequence ID
(e.g. http://pir.georgetown.edu/cgi-bin/ipcEntry?id=A31997) or PIR superfamily ID
(e.g. http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF000130). Matching lists
of proteins or superfamilies are retrievable by sequence search
[BLAS (...truncated)