The Protein Information Resource (PIR)

https://nar.oxfordjournals.org/content/28/1/41.full.pdf

Winona C. Barker [2], John S. Garavelli [2], Hongzhan Huang [2], Peter B. McGarvey [2], Bruce C. Orcutt [2], Geetha Y. Srinivasarao [2], Chunlin Xiao [2], Lai-Su L. Yeh [2], Robert S. Ledley [2], Joseph F. Janda [2], Friedhelm Pfeiffer [1,2], Hans-Werner Mewes [1,2], Akira Tsugita [0,2], Cathy Wu [2]

[0] Japan International Protein Information Database, Amakubo 1-16-1, Tsukuba 305-0005, Japan
[1] GSF-Forschungszentrum für Umwelt und Gesundheit, Munich Information Center for Protein Sequences am Max-Planck-Institut für Biochemie, Am Klopferspitz 18, D-82152 Martinsried, Germany
[2] Protein Information Resource, National Biomedical Research Foundation, 3900 Reservoir Road, NW, Washington, DC 20007, USA

The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.

The accelerating pace of genome sequencing projects has greatly increased the volume and complexity of available molecular data.
To realize the fullest possible value from the data and to gain a better understanding of the genome, databases and the computational tools for analyzing them are required to allow biologically relevant features in the sequences to be identified and to provide insight on their structure and function. For over 30 years, the Protein Information Resource (PIR) has been providing the scientific community with databases and tools for the organization and analysis of protein sequence data (1,2). Together with MIPS and JIPID, we have undertaken a major restructuring to meet the challenges presented by the rapid growth of largely uncharacterized sequence data and the opportunities provided by the nearly universal access of scientists to the resources available on the WWW. Among the key developments are complete protein family organization for the PIR-International Protein Sequence Database (PSD) and integrated WWW interfaces for user-friendly sequence analysis, database searching and information retrieval.

THE PIR-INTERNATIONAL PROTEIN DATABASES

PIR, MIPS and JIPID constitute the PIR-International consortium that maintains the PIR-International Protein Sequence Database (PSD), the largest publicly distributed and freely available protein sequence database. The database has the following distinguishing features.

It is a comprehensive, annotated, and non-redundant protein sequence database, containing over 142 000 sequences as of September 1999. Included are sequences from the completely sequenced genomes of 16 prokaryotes, six archaebacteria, 17 viruses and phages, >100 eukaryote organelles and Saccharomyces cerevisiae.

The collection is well organized with >99% of entries classified by protein family and >57% classified by protein superfamily.

PSD annotation includes concurrent cross-references to other sequence, structure, genomic and citation databases, including the public nucleic acid sequence databases, ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase, MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR. Where these databases are publicly and freely accessible and provide suitable WWW access, the cross-references presented on the PIR WWW site are hot-linked so that searchers can consult the most current data.

The PIR is the only sequence database to provide context cross-references between its own database entries. These cross-references assist searchers in exploring relationships such as subunit associations in molecular complexes, enzyme-substrate interactions, activation and regulation cascades, as well as in browsing entries with shared features and annotations.

Interim updates are made publicly available on a weekly basis, and full releases have been published quarterly since 1984.

In addition to the PSD, PIR-International distributes or provides WWW access to other sequence and auxiliary databases (Table 1), briefly described below, and maintains several internal data collections used for sequence annotation and integrity checks.

Table 1
Contents | URL
Annotated and classified protein sequences | http://pir.georgetown.edu/pirwww/dbinfo/textpsd.html
Sequences not yet in the PIR-International PSD | http://pir.georgetown.edu/pirwww/dbinfo/patchx.html
Sequences as originally reported in a publication or submission | http://pir.georgetown.edu/pirwww/dbinfo/archive.html
Sequences from three-dimensional structure database PDB | http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
Representative sequences from each protein family | http://pir.georgetown.edu/pirwww/dbinfo/fambase.html
Sequence alignments of superfamilies, families and homology domains | http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
Post-translational modifications with PSD feature information | http://pir.georgetown.edu/pirwww/dbinfo/resid.html
Non-redundant sequences organized according to superfamilies and motifs | http://pir.georgetown.edu/gfserver/proclass.html
Sequence alignments of superfamilies | http://www.mips.biochem.mpg.de/proj/protfam/protfam

PATCHX (3) is a non-redundant database, assembled by MIPS, of publicly available protein sequences not yet in the PIR-International PSD. PIR+PATCHX, a combination of the PSD and PATCHX containing ~300 000 sequences available for similarity searches, is the most complete non-redundant collection of protein sequences available in the public domain. ARCHIVE is a database of protein sequences as originally reported in a publication or submission, the only such collection of as-published, unmerged sequences. The NRL_3D (4) sequence-structure database is produced from sequence and annotation in the Protein Data Bank (PDB) of three-dimensional structures (5). FAMBASE is a collection of representative sequences from each protein family that can be used in a similarity search to reduce search time and improve sensitivity for identifying distant families. PIR-ALN (6) is a curated database of sequence alignments of superfamilies, families and homology domains, with annotation information derived from PSD and consensus patterns calculated from the alignments. RESID (7) is a database of post-translational modifications with descriptive, chemical, structural and bibliographic inform (...truncated)