https://nar.oxfordjournals.org/content/31/1/345.full.pdf

The Protein Information Resource

Cathy H. Wu 1, Lai-Su L. Yeh 0 1, Hongzhan Huang 1, Leslie Arminski 0 1, Jorge Castro-Alvear 0 1, Yongxing Chen 0 1, Zhangzhi Hu 0 1, Panagiotis Kourtesis 0 1, Robert S. Ledley 0 1, Baris E. Suzek 0 1, C.R. Vinayaka 0 1, Jian Zhang 0 1, Winona C. Barker 0 1

0 National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA
1 Department of Biochemistry and Molecular Biology

The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotation, and detection of annotation errors. The superfamily curation defines signature domain architecture and categorizes memberships to improve automated classification. To increase the amount of experimental annotation, the PIR has developed a bibliography system for literature searching, mapping, and user submission, and has conducted retrospective attribution of citations for experimental features. PIR also maintains NREF, a non-redundant reference database, and iProClass, an integrated database of protein family, function, and structure information. PIR-NREF provides a timely and comprehensive collection of protein sequences, currently consisting of more than 1 000 000 entries from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB. The PIR web site (http://pir.georgetown.edu) connects data analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and text searches, and sorting and visual exploration of search results. The FTP site provides free download for PSD and NREF biweekly releases and auxiliary databases and files.

In order to provide integrated and value-added protein information to the scientific community, the Protein Information Resource (PIR) continues to enhance its three major databases: the Protein Sequence Database (PSD), the Non-redundant REFerence (NREF) sequence database, and the integrated Protein Classification (iProClass) database (1). The sections below describe key developments in the past year.

The PIR-PSD is a public domain protein sequence database that currently contains over 283 000 annotated and classified entries, covering the entire taxonomic range. Recent development and annotation efforts have focused on superfamily classification and curation, and on bibliography mapping and attribution.

Superfamily classification and curation. A unique characteristic of the PIR-PSD is the superfamily classification (2), which provides comprehensive, non-overlapping, and hierarchical clustering of sequences to reflect their evolutionary relationships. To further improve the quality of automated classification, we have conducted systematic superfamily curation that: (i) defines the signature domain architecture (number, order, and types of domains) characteristic of the superfamily, (ii) categorizes regular and associate members to distinguish sequence entries sharing the signature features from outliers (such as fragments), and (iii) designates representative and seed members amongst regular members. Several thousand superfamilies have been manually curated. The seed members provide a basis for automatic placement of new sequences into existing superfamilies and for automatic generation of multiple sequence alignments and phylogenetic trees. Currently, over 99% of PSD sequences are classified into families of closely related sequences (at least 45% identical), and over two-thirds of sequences are classified into >36 000 superfamilies.

Bibliography mapping and attribution.
To improve the quality of protein annotation by increasing the amount of experimentally verified data with source attribution, the PIR has developed a bibliography information system and conducted retrospective attribution of literature data. The bibliography system allows browsing and searching of extensive literature collected for all protein entries from PubMed and other curated molecular databases, together with an interface for scientists to categorize and submit literature information for mapped proteins. In PIR-PSD, experimentally determined protein features, such as binding sites, structural motifs, and post-translational modifications, are tagged with an experimental status to distinguish them from computationally predicted features; however, they had not previously been associated with literature citations. A systematic manual attribution of experimental features is being carried out with computer-assisted mapping to existing protein bibliographic information. So far, a few thousand experimental features have been associated with publications.

PIR-NREF DATABASE

The PIR-NREF provides a timely and comprehensive collection of protein sequence data, keeping pace with the genome sequencing projects and containing source attribution and minimal redundancy. The database contains all sequences in PIR-PSD, SWISS-PROT (3), TrEMBL (3), RefSeq (4), GenPept, and PDB (5), currently totaling more than 1 000 000 entries. Identical sequences from the same source organism (species) reported in different databases are presented as a single NREF entry with protein IDs, accession numbers, and protein names from each underlying database, as well as amino acid sequence, taxonomy, and composite bibliographic data. Also listed are related sequences identified by all-against-all FASTA search (6), including identical sequences from different organisms, identical subsequences, and highly similar sequences (≥95% identity).
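
The merging rule described above (identical sequence, same source organism, one entry) can be illustrated with a minimal Python sketch. This is not the PIR production pipeline: the `SourceRecord` and `MergedEntry` types, the `merge_identical` function, and the accession numbers and sequences are all hypothetical and chosen only for illustration. The sketch covers exact-match merging only; the related-sequence lists produced by the FASTA search are out of scope.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRecord:
    """One protein record as reported by a single source database (hypothetical shape)."""
    database: str
    accession: str
    name: str
    taxon: str
    sequence: str


@dataclass
class MergedEntry:
    """One non-redundant entry: a unique (sequence, organism) pair plus all its sources."""
    taxon: str
    sequence: str
    sources: list = field(default_factory=list)

    def names(self):
        # Distinct protein names assigned by the contributing databases.
        return sorted({rec.name for rec in self.sources})


def merge_identical(records):
    """Collapse records that share both an identical sequence and a source organism."""
    grouped = {}
    for rec in records:
        # Sequences are compared case-insensitively; the organism must also match,
        # so identical sequences from different species stay in separate entries.
        key = (rec.sequence.upper(), rec.taxon)
        if key not in grouped:
            grouped[key] = MergedEntry(taxon=rec.taxon, sequence=key[0])
        grouped[key].sources.append(rec)
    return list(grouped.values())


# Made-up records: accessions, names, and sequences are placeholders.
records = [
    SourceRecord("PIR-PSD", "EX0001", "example kinase", "Homo sapiens", "MKTAYIAK"),
    SourceRecord("SWISS-PROT", "EX0002", "kinase, putative", "Homo sapiens", "MKTAYIAK"),
    SourceRecord("RefSeq", "EX0003", "example kinase", "Mus musculus", "MKTAYIAK"),
]

entries = merge_identical(records)
for entry in entries:
    accessions = [f"{rec.database}:{rec.accession}" for rec in entry.sources]
    print(entry.taxon, accessions, entry.names())
```

In this sketch the two human records collapse into one entry that retains both accessions and both protein names, while the mouse record with the same sequence remains a separate entry. Keeping every source name on the merged entry is what makes it possible to compare names across databases.
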
NREF can be used for sequence searching and protein identification against the entire sequence collection or a subset of one or more genomes. The collective protein names, including synonyms, and the bibliographic information can be used to develop a protein name ontology. The different protein names assigned by different databases may help detect annotation errors, especially those resulting from large-scale genomic annotation.

PIR web site. The PIR web site connects data mining and sequence analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and annotation text searches, and sorting and visual exploration of search results. The three major databases (PSD, NREF and iProClass) represent primary entry points in the PIR web site, all of which provide text search for entry and list retrieval as well as BLAST search and peptide match. Direct entry report retrieval is based on sequence unique identifiers of all underlying databases, such as PIR, SWISS-PROT, or RefSeq. Basic and advanced text searches return protein entries listed in summary lines (...truncated)