iProClass: an integrated database of protein family, function and structure information

Nucleic Acids Research, 2003, Vol. 31, No. 1, 390–392
DOI: 10.1093/nar/gkg044
© 2003 Oxford University Press

Hongzhan Huang, Winona C. Barker¹, Yongxing Chen¹ and Cathy H. Wu*

Department of Biochemistry and Molecular Biology and ¹National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA

*To whom correspondence should be addressed. Tel: +1 2026872121; Fax: +1 2026871662

Received September 15, 2002; Accepted September 27, 2002

ABSTRACT

The iProClass database provides comprehensive, value-added descriptions of proteins and serves as a framework for data integration in a distributed networking environment. The protein information in iProClass includes family relationships as well as structural and functional classifications and features. The current version consists of about 830 000 non-redundant PIR-PSD, SWISS-PROT, and TrEMBL proteins organized with more than 36 000 PIR superfamilies, 145 000 families, 4000 domains, 1300 motifs and 550 000 FASTA similarity clusters. It provides rich links to over 50 databases of protein sequences, families, functions and pathways, protein–protein interactions, post-translational modifications, protein expressions, structures and structural classifications, genes and genomes, ontologies, literature and taxonomy. Protein and superfamily summary reports present extensive annotation information and include membership statistics and graphical display of domains and motifs. iProClass employs an open and modular architecture for interoperability and scalability. It is implemented in the Oracle object-relational database system and is updated biweekly. The database is freely accessible from the web site at http://pir.georgetown.edu/iproclass/ and searchable by sequence or text string. The data integration in iProClass supports exploration of protein relationships. Such knowledge is fundamental to the understanding of protein evolution, structure and function and crucial to functional genomic and proteomic research.

INTRODUCTION

The completion of the draft human genome sequences marked the beginning of a new era of biological research, in which scientists have begun systematically to explore gene functions and other complex regulatory processes by studying organisms at the global scale of genomes, transcriptomes and proteomes. With the accelerated accumulation of molecular data, advanced bioinformatics infrastructures must be developed in order to fully explore these valuable data and to generate new hypotheses and derive scientific knowledge. One major challenge lies in the volume, complexity and dynamic nature of the data, which are being collected and maintained in heterogeneous and distributed sources. The iProClass database (1) was designed to offer a comprehensive, integrated view of protein information to facilitate knowledge discovery.

OVERVIEW AND CURRENT CONTENTS

The iProClass database (Fig. 1) contains value-added descriptions of proteins, including family relationships at both global (superfamily/family) and local (domain, motif, site) levels, as well as structural and functional classifications and features. The database was first released in October 2000 and contained about 200 000 proteins from the PIR Protein Sequence Database (PIR-PSD) (2) and SWISS-PROT (3). It is updated biweekly and currently consists of about 830 000 non-redundant protein sequences from the PIR-PSD, SWISS-PROT, and TrEMBL (3) databases. The protein entries are organized with more than 36 000 PIR superfamilies (4), 145 000 families, 3700 Pfam (5) and PIR homology domains, 1300 ProSite (6) motifs, 550 000 FASTA (7) similarity clusters, and links to over 50 molecular biology databases. Database cross-references in iProClass are represented by rich links, which include both the links and related summary information. This approach effectively combines data warehouse and hypertext navigation methods for data integration to provide timely information from distributed sources.

iProClass collects information from and links to databases for protein sequences (PIR-PSD, PIR-NREF, SWISS-PROT, TrEMBL, GenPept, RefSeq), families (InterPro, Pfam, ProSite, Blocks, Prints, COG, MetaFam, PIR-ASDB, ProClass), functions and pathways (EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc, EcoCyc), interactions (DIP, BIND), post-translational modifications (RESID, PhosphoSite DB), protein expression and proteomes (PMG), structures and structural classifications (PDB, PDBSum, SCOP, CATH, FSSP, MMDB), genes and genomes (GenBank, EMBL, DDBJ, LocusLink, TIGR, SGD, FlyBase, MGI, GDB, OMIM, MIPS, GenProtEC), ontologies (GO), literature (PubMed) and taxonomy (NCBI Taxonomy). The information content is continually enhanced by: (i) adding links to more databases, (ii) adding executive summary information from the linked databases and (iii) increasing the number of occurrences of links to the databases that iProClass already links to. The composite annotations collected from multiple sources are presented with attribution to the underlying databases.

iProClass presents comprehensive views for protein sequences and superfamilies in two types of summary reports. The protein sequence report covers information on family, structure, function, gene, genetics, disease, ontology, taxonomy and literature, with cross-references to relevant molecular databases and executive summary lines, as well as a graphical display of domain and motif sequence regions and a link to related sequences in pre-computed FASTA clusters.
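
The rich-link idea described in this overview (a cross-reference that carries a short summary alongside the link, with attribution to the source database) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the iProClass schema: the class name RichLink, its fields, the category labels, the accession strings and the example.org URLs are all hypothetical placeholders, and only the protein ID A31997 is taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RichLink:
    """A cross-reference that carries summary text alongside the link itself.

    All field names are hypothetical; the paper does not describe the schema.
    """
    category: str   # kind of information, e.g. "structure" or "family"
    database: str   # source database, kept so the annotation stays attributed
    accession: str  # identifier within the source database (placeholder here)
    url: str        # hypertext link to the source record (placeholder here)
    summary: str    # short summary line copied from the source record


def group_by_category(links):
    """Group rich links by category, preserving input order within a group."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link.category].append(link)
    return dict(grouped)


def format_report(protein_id, links):
    """Render a plain-text report with one attributed line per rich link."""
    lines = [f"Protein {protein_id}"]
    for category, members in sorted(group_by_category(links).items()):
        lines.append(f"  {category}:")
        for link in members:
            lines.append(
                f"    [{link.database}:{link.accession}] {link.summary} <{link.url}>"
            )
    return "\n".join(lines)


# Made-up example data: accessions, URLs and summaries are placeholders.
example_links = [
    RichLink("structure", "PDB", "XXXX",
             "https://example.org/pdb/XXXX", "placeholder structure summary"),
    RichLink("family", "Pfam", "PF_EXAMPLE",
             "https://example.org/pfam/PF_EXAMPLE", "placeholder domain summary"),
]
example_report = format_report("A31997", example_links)
print(example_report)
```

Keeping the summary text next to the link corresponds to the warehouse side of the approach (a reader sees key facts without leaving the report), while the URL preserves hypertext navigation to the current source record.
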
The superfamily report provides PIR superfamily membership information with length, taxonomy and keyword statistics, complete member listing separated into major kingdoms, family relationships at the whole protein and domain and motif levels with direct mapping to other classifications, structure and function cross-references, graphical display of domain and motif architecture of members, and a link to dynamically generated multiple sequence alignments and phylogenetic trees for superfamilies with curated seed members.

DATABASE ACCESS AND USAGE

The iProClass database employs an open and modular database architecture to provide a framework for data integration in a distributed networking environment. The modular structure makes the system scalable, customizable, and extendable for adding new components. The database is implemented in the Oracle object-relational system and freely accessible from our web site at http://pir.georgetown.edu/iproclass/. Direct report retrieval is based on unique identifiers such as PIR or SWISS-PROT sequence ID (e.g. http://pir.georgetown.edu/cgi-bin/ipcEntry?id=A31997) or PIR superfamily ID (e.g. http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF000130). Matching lists of proteins or superfamilies are retrievable by sequence search [BLAS (...truncated)