The Protein Information Resource (PIR)
Winona C. Barker (2), John S. Garavelli (2), Hongzhan Huang (2), Peter B. McGarvey (2), Bruce C. Orcutt (2), Geetha Y. Srinivasarao (2), Chunlin Xiao (2), Lai-Su L. Yeh (2), Robert S. Ledley (2), Joseph F. Janda (2), Friedhelm Pfeiffer (1, 2), Hans-Werner Mewes (1, 2), Akira Tsugita (0, 2), Cathy Wu (2)
(0) Japan International Protein Information Database, Amakubo 1-16-1, Tsukuba 305-0005, Japan
(1) GSF-Forschungszentrum für Umwelt und Gesundheit, Munich Information Center for Protein Sequences am Max-Planck-Institut für Biochemie, Am Klopferspitz 18, D-82152 Martinsried, Germany
(2) Protein Information Resource, National Biomedical Research Foundation, 3900 Reservoir Road, NW, Washington, DC 20007, USA
The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.
The accelerating pace of genome sequencing projects has
greatly increased the volume and complexity of available
molecular data. To realize the fullest possible value from the
data and to gain a better understanding of the genome, databases
and the computational tools for analyzing them are required to
allow biologically relevant features in the sequences to be
identified and to provide insight into their structure and function.
For over 30 years, the Protein Information Resource (PIR) has
been providing the scientific community with databases and
tools for the organization and analysis of protein sequence data
(1,2). Together with MIPS and JIPID, we have undertaken a
major restructuring to meet the challenges presented by the
rapid growth of largely uncharacterized sequence data and the
opportunities provided by the nearly universal access of scientists
to the resources available on the WWW. Among the key
developments are complete protein family organization for the
PIR-International Protein Sequence Database (PSD) and
integrated WWW interfaces for user-friendly sequence
analysis, database searching and information retrieval.
THE PIR-INTERNATIONAL PROTEIN DATABASES
PIR, MIPS and JIPID constitute the PIR-International consortium
that maintains the PIR-International Protein Sequence Database
(PSD), the largest publicly distributed and freely available
protein sequence database. The database has the following
distinguishing features.
It is a comprehensive, annotated, and non-redundant protein
sequence database, containing over 142 000 sequences as of
September 1999. Included are sequences from the
completely sequenced genomes of 16 prokaryotes, six
archaebacteria, 17 viruses and phages, >100 eukaryote
organelles and Saccharomyces cerevisiae.
The collection is well organized with >99% of entries
classified by protein family and >57% classified by protein
superfamily.
PSD annotation includes concurrent cross-references to
other sequence, structure, genomic and citation databases,
including the public nucleic acid sequence databases,
ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase, MIPS/Yeast,
SGD/Yeast, MIPS/Arabidopsis and TIGR. Where
these databases are publicly and freely accessible and
provide suitable WWW access, the cross-references
presented on the PIR WWW site are hot-linked so that
searchers can consult the most current data.
The PIR is the only sequence database to provide context
cross-references between its own database entries. These
cross-references assist searchers in exploring relationships
such as subunit associations in molecular complexes,
enzyme-substrate interactions, activation and regulation
cascades, as well as in browsing entries with shared features
and annotations.
Interim updates are made publicly available on a weekly
basis, and full releases have been published quarterly since
1984.
In addition to the PSD, PIR-International distributes or
provides WWW access to other sequence and auxiliary databases
(Table 1), briefly described below, and maintains several
internal data collections used for sequence annotation and
integrity checks.
Table 1 (database contents: WWW address)
Annotated and classified protein sequences: http://pir.georgetown.edu/pirwww/dbinfo/textpsd.html
Sequences not yet in the PIR-International PSD: http://pir.georgetown.edu/pirwww/dbinfo/patchx.html
Sequences as originally reported in a publication or submission: http://pir.georgetown.edu/pirwww/dbinfo/archive.html
Sequences from three-dimensional structure database PDB: http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
Representative sequences from each protein family: http://pir.georgetown.edu/pirwww/dbinfo/fambase.html
Sequence alignments of superfamilies, families and homology domains: http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
Post-translational modifications with PSD feature information: http://pir.georgetown.edu/pirwww/dbinfo/resid.html
Non-redundant sequences organized according to superfamilies and motifs: http://pir.georgetown.edu/gfserver/proclass.html
Sequence alignments of superfamilies: http://www.mips.biochem.mpg.de/proj/protfam/protfam
PATCHX (3) is a non-redundant database assembled by
MIPS of publicly available protein sequences not yet in the
PIR-International PSD. PIR+PATCHX, a combination of
the PSD and PATCHX containing ~300 000 sequences
available for similarity searches, is the most complete
non-redundant collection of protein sequences available in the
public domain.
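The idea of a combined non-redundant collection can be illustrated with a minimal sketch. The procedure actually used to assemble PIR+PATCHX is not described here, so the code below simply assumes that two entries are redundant when their sequences are identical character for character (ignoring case); the function name `merge_nonredundant`, the identifiers and the sequences are all hypothetical.

```python
def merge_nonredundant(primary, supplement):
    """Combine two {identifier: sequence} collections, keeping each distinct sequence once.

    Entries from `primary` take precedence; a `supplement` entry is added only
    if its sequence has not been seen yet. Redundancy is decided by exact,
    case-insensitive string identity, which is an assumption of this sketch.
    Identifiers are assumed to be unique across both collections.
    """
    merged = {}
    seen = set()
    for source in (primary, supplement):
        for identifier, sequence in source.items():
            key = sequence.upper()
            if key in seen:
                continue  # an identical sequence is already in the merged set
            seen.add(key)
            merged[identifier] = sequence
    return merged


# Hypothetical toy data, not real database entries.
psd = {"P1": "MKTAYIAKQR", "P2": "MSLLTEVETY"}
patchx = {"X1": "MSLLTEVETY", "X2": "MGDVEKGKKI"}
combined = merge_nonredundant(psd, patchx)
print(sorted(combined))
```

In this toy case the supplementary entry X1 duplicates P2 and is dropped, while X2 is retained alongside the two primary entries.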
ARCHIVE is a database of protein sequences as originally
reported in a publication or submission, the only such collection
of as-published, unmerged sequences.
The NRL_3D (4) sequence-structure database is produced from
sequence and annotation in the Protein Data Bank (PDB) of
three-dimensional structures (5).
FAMBASE is a collection of representative sequences from
each protein family that can be used in a similarity search to
reduce search time and improve sensitivity for identifying
distant families.
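How a set of family representatives might be derived can be sketched as follows. The criterion used to choose FAMBASE representatives is not stated here; the sketch assumes, purely for illustration, that the longest sequence in each family is kept. The function name `pick_representatives` and the sample entries are hypothetical.

```python
def pick_representatives(entries):
    """Return {family: identifier} with one representative entry per family.

    `entries` is an iterable of (identifier, family, sequence) tuples.
    The longest sequence in each family is chosen; ties go to the entry
    seen first. This selection rule is an assumption of the sketch.
    """
    best = {}
    for identifier, family, sequence in entries:
        current = best.get(family)
        if current is None or len(sequence) > len(current[1]):
            best[family] = (identifier, sequence)
    return {family: identifier for family, (identifier, _) in best.items()}


# Hypothetical toy data, not real database entries.
entries = [
    ("A1", "globin", "MVLSPADKTN"),
    ("A2", "globin", "MVLSPADKTNVKAAW"),
    ("B1", "cytochrome c", "MGDVEKGKKI"),
]
representatives = pick_representatives(entries)
print(representatives)
```

Searching only the representatives means one comparison per family rather than one per entry, which is the source of the reduced search time mentioned above.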
PIR-ALN (6) is a curated database of sequence alignments
of superfamilies, families and homology domains, with
annotation information derived from PSD and consensus
patterns calculated from the alignments.
RESID (7) is a database of post-translational modifications
with descriptive, chemical, structural and bibliographic
inform (...truncated)