The Protein Information Resource
Cathy H. Wu [1], Lai-Su L. Yeh [0,1], Hongzhan Huang [1], Leslie Arminski [0,1], Jorge Castro-Alvear [0,1], Yongxing Chen [0,1], Zhangzhi Hu [0,1], Panagiotis Kourtesis [0,1], Robert S. Ledley [0,1], Baris E. Suzek [0,1], C.R. Vinayaka [0,1], Jian Zhang [0,1], Winona C. Barker [0,1]
[0] National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA
[1] Department of Biochemistry and Molecular Biology
The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotation, and detection of annotation errors. The superfamily curation defines signature domain architecture and categorizes memberships to improve automated classification. To increase the amount of experimental annotation, the PIR has developed a bibliography system for literature searching, mapping, and user submission, and has conducted retrospective attribution of citations for experimental features. PIR also maintains NREF, a non-redundant reference database, and iProClass, an integrated database of protein family, function, and structure information. PIR-NREF provides a timely and comprehensive collection of protein sequences, currently consisting of more than 1 000 000 entries from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB. The PIR web site (http://pir.georgetown.edu) connects data analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and text searches, and sorting and visual exploration of search results. The FTP site provides free download for PSD and NREF biweekly releases and auxiliary databases and files.
In order to provide integrated and value-added protein
information to the scientific community, the Protein
Information Resource (PIR) continues to enhance its three
major databases, the Protein Sequence Database (PSD), the
Non-redundant REFerence (NREF) sequence database, and
the integrated Protein Classification (iProClass) database (1).
The sections below describe key developments in the past year.
The PIR-PSD is a public domain protein sequence database,
which currently contains over 283 000 annotated and classified
entries, covering the entire taxonomic range. Recent
development and annotation efforts have focused on two areas:
superfamily classification and curation, and bibliography
mapping and attribution.
Superfamily classification and curation. A unique
characteristic of the PIR-PSD is the superfamily classification (2) that
provides comprehensive, non-overlapping, and hierarchical
clustering of sequences to reflect their evolutionary
relationships. To further improve the quality of automated
classification, we have conducted systematic superfamily curation
that: (i) defines the signature domain architecture (number,
order, and types of domains) characteristic of the superfamily,
(ii) categorizes regular and associate members to distinguish
sequence entries sharing the signature features from outliers
(such as fragments), and (iii) designates representative and
seed members amongst regular members. Several thousand
superfamilies have been manually curated. The seed members
provide a basis for automatic placement of new sequences into
existing superfamilies and for automatic generation of multiple
sequence alignments and phylogenetic trees. Currently, over
99% of PSD sequences are classified into families of closely
related sequences (at least 45% identical), and over two-thirds
of sequences are classified into >36 000 superfamilies.
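The distinction between regular and associate members in step (ii) can be illustrated with a minimal sketch. It assumes that a superfamily's signature is an ordered tuple of domain types and that an entry counts as regular only when it is not a fragment and its domain architecture matches the signature exactly. The `Entry` class, the function names, and the domain labels are hypothetical and serve only as illustration; they do not reproduce the curation procedure actually used for the PIR-PSD.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Entry:
    """Hypothetical sequence entry with its ordered domain architecture."""
    entry_id: str
    domains: Tuple[str, ...]      # number, order, and types of domains
    is_fragment: bool = False


def categorize(entry: Entry, signature: Tuple[str, ...]) -> str:
    """Return 'regular' if the entry shares the signature architecture,
    otherwise 'associate' (assumed rule: fragments are always associates)."""
    if not entry.is_fragment and entry.domains == signature:
        return "regular"
    return "associate"


def partition_members(entries: Iterable[Entry],
                      signature: Tuple[str, ...]) -> Dict[str, List[str]]:
    """Split superfamily members into regular and associate entry IDs."""
    result: Dict[str, List[str]] = {"regular": [], "associate": []}
    for entry in entries:
        result[categorize(entry, signature)].append(entry.entry_id)
    return result


# Illustrative usage with made-up entries and domain labels.
signature = ("domain_A", "domain_B")
members = [
    Entry("E1", ("domain_A", "domain_B")),
    Entry("E2", ("domain_A",)),
    Entry("E3", ("domain_A", "domain_B"), is_fragment=True),
]
partition = partition_members(members, signature)
```

Under this assumed rule, representative and seed members would then be chosen from the `regular` list only, in line with step (iii).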
Bibliography mapping and attribution. To improve the
quality of protein annotation by increasing the amount of
experimentally verified data with source attribution, the PIR has
developed a bibliography information system and conducted
retrospective attribution of literature data. The bibliography
system allows browsing and searching of extensive literature
collected for all protein entries from PubMed and other curated
molecular databases, together with an interface for scientists to
categorize and submit literature information for mapped
proteins. In PIR-PSD, experimentally determined protein features
such as binding sites, structural motifs, and post-translational
modifications are tagged with an experimental status to
distinguish them from computationally predicted features; until
now, however, they had not been associated with literature
citations. A systematic manual attribution of experimental
features is being carried out with computer-assisted mapping
to existing protein bibliographic information. So far, a few
thousand experimental features have been associated with
publications.
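The attribution task described above can be pictured with a small data-structure sketch. It assumes each feature carries a status of either "experimental" or "predicted" and a list of citation identifiers; the `Feature` class, the function names, and the identifier format are hypothetical and are not taken from the PIR-PSD schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical protein feature record."""
    kind: str                      # e.g. "binding site"
    position: int                  # residue position in the sequence
    status: str                    # "experimental" or "predicted"
    citations: List[str] = field(default_factory=list)


def unattributed_experimental(features: List[Feature]) -> List[Feature]:
    """Experimental features that still lack a literature citation."""
    return [f for f in features
            if f.status == "experimental" and not f.citations]


def attribute(feature: Feature, citation_id: str) -> None:
    """Associate a citation with an experimental feature (no duplicates)."""
    if feature.status != "experimental":
        raise ValueError("only experimental features are attributed")
    if citation_id not in feature.citations:
        feature.citations.append(citation_id)
```

A curation pass under these assumptions would repeatedly take items from `unattributed_experimental(...)`, look up candidate publications in the mapped bibliography, and call `attribute(...)` once a curator confirms the match.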
PIR-NREF DATABASE
The PIR-NREF provides a timely and comprehensive
collection of protein sequence data, keeping pace with the genome
sequencing projects and containing source attribution and
minimal redundancy. The database contains all sequences in
PIR-PSD, SWISS-PROT (3), TrEMBL (3), RefSeq (4),
GenPept, and PDB (5), totaling more than 1 000 000 entries
currently. Identical sequences from the same source organism
(species) reported in different databases are presented as a
single NREF entry with protein IDs, accession numbers, and
protein names from each underlying database, as well as amino
acid sequence, taxonomy, and composite bibliographic data.
Also listed are related sequences identified by all-against-all
FASTA search (6), including identical sequences from different
organisms, identical subsequences, and highly similar
sequences (≥95% identity). NREF can be used for sequence
searching and protein identification against the entire
sequence collection or a subset of one or more genomes.
The collective protein names, including synonyms, and the
bibliographic information can be used to develop a protein
name ontology. The different protein names assigned by
different databases may help detect annotation errors,
especially those resulting from large-scale genomic annotation.
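The merging rule described above, one NREF entry per identical sequence from the same source organism, can be sketched as follows. This is a minimal illustration under stated assumptions: the input records are plain dictionaries with the hypothetical keys `db`, `accession`, `name`, `taxon_id`, and `sequence`, and identity is taken to mean an exact, case-insensitive string match. It is not the production pipeline behind PIR-NREF.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_nref(records: Iterable[Dict]) -> List[Dict]:
    """Collapse records with identical sequence and source organism
    into a single composite entry that keeps every source ID and name."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["sequence"].upper(), rec["taxon_id"])
        groups[key].append(rec)

    entries = []
    for (sequence, taxon_id), recs in groups.items():
        entries.append({
            "sequence": sequence,
            "taxon_id": taxon_id,
            "sources": [(r["db"], r["accession"]) for r in recs],
            "names": sorted({r["name"] for r in recs}),
        })
    return entries


def entries_with_differing_names(entries: List[Dict]) -> List[Dict]:
    """Entries whose source databases disagree on the protein name;
    these are candidates for review, not confirmed annotation errors."""
    return [e for e in entries if len(e["names"]) > 1]


# Illustrative usage; accessions and names are made up.
records = [
    {"db": "PIR-PSD", "accession": "X1", "name": "kinase A",
     "taxon_id": 9606, "sequence": "MKT"},
    {"db": "SWISS-PROT", "accession": "X2", "name": "protein kinase A",
     "taxon_id": 9606, "sequence": "mkt"},
    {"db": "RefSeq", "accession": "X3", "name": "kinase A",
     "taxon_id": 10090, "sequence": "MKT"},
]
nref_entries = build_nref(records)
review_list = entries_with_differing_names(nref_entries)
```

Keying on both sequence and taxon keeps identical sequences from different organisms as separate entries, which matches the text's statement that such sequences are listed as related rather than merged.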
PIR web site. The PIR web site connects data mining and
sequence analysis tools to underlying databases for
information retrieval and knowledge discovery, with functionalities
for interactive queries, combinations of sequence and
annotation text searches, and sorting and visual exploration of search
results. The three major databases (PSD, NREF and iProClass)
represent primary entry points in the PIR web site, all of which
provide text search for entry and list retrieval as well as
BLAST search and peptide match. Direct entry report retrieval
is based on sequence unique identifiers of all underlying
databases, such as PIR, SWISS-PROT, or RefSeq. Basic and
advanced text searches return protein entries listed in summary
lines (...truncated)