PROFESS: a PROtein Function, Evolution, Structure and Sequence database

Database, Vol. 2010, Article ID baq011, doi:10.1093/database/baq011

Original article

Thomas Triplet1,†, Matthew D. Shortridge2, Mark A. Griep2, Jaime L. Stark2, Robert Powers2,* and Peter Revesz1,*

1Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115 and 2Department of Chemistry, University of Nebraska-Lincoln, Lincoln, NE 68588-0304, USA

†Present address: Thomas Triplet, Department of Computer Science, Concordia University, Montreal, Qc H3G-1M8, Canada.

Submitted 2 December 2009; Revised 3 June 2010; Accepted 6 June 2010

The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on biological sciences and transforming the way research is conducted. There are 1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the large number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships.
A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks.

Database URL: http://cse.unl.edu/profess/

Introduction

There are 1100 molecular biology databases freely available to the public online (1, 2). These databases constitute the extent of our knowledge related to genomics, proteomics, metabolomics, and structural genomics. Most serve as data warehouses with simple interfaces for data retrieval (3). To address more complex questions, biologists are routinely required to develop new databases by filtering information from existing databases (4). Even though this is extremely inefficient, there are a growing number of specialized databases designed around single topics. Unfortunately, this simply propagates the underlying problem: an inability to utilize the data outside the constraints imposed by the database designers (5). Capitalizing on the potential of biological information requires the development of a next-generation database that enables biologists to explore biological data in new ways. The key to solving this problem is to move the design focus from the database structure (predefined relationships between fields) to a fluid association that can be adapted to a biologist’s questions (6) without re-designing the underlying data structure.
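
As a rough illustration of an association computed at query time rather than stored in the schema, the following Python sketch relates records through a caller-supplied similarity function. The `Protein` record, the `identity` score, the `related` helper, the accessions and the sequences are all hypothetical and are not taken from PROFESS; the position-by-position identity score is only a stand-in for whatever similarity functions the database actually provides.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Protein:
    # Hypothetical record layout for illustration; not the PROFESS schema.
    accession: str
    sequence: str


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, divided by the longer length.

    Compares residues position by position without any alignment,
    so it is only a toy similarity function.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def related(
    proteins: Iterable[Protein],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (accession, accession, score) for pairs scoring at or above threshold.

    The relationship is derived when the question is asked; nothing about
    it is stored in the records themselves.
    """
    for p, q in combinations(list(proteins), 2):
        score = similarity(p.sequence, q.sequence)
        if score >= threshold:
            yield p.accession, q.accession, score


# Made-up accessions and sequences.
proteins = [
    Protein("P1", "MKTAYIAKQR"),
    Protein("P2", "MKTAYIAKQK"),
    Protein("P3", "GSHMLEDPVA"),
]

hits = list(related(proteins, identity, threshold=0.8))
```

Passing a different function in place of `identity` (for example, one based on structure or annotation) would produce a different set of relationships from the same records, which is the sense in which the association stays fluid while the stored data is left unchanged.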
However, there are barriers to linking individual databases because of different data formats and structures (7, 8). Thus, it was essential to this effort to implement a new approach to integrate diverse biological databases (9). Most of the work on database integration has focused on business and spatio-temporal data (10, 11). Satisfactory, general and practical solutions have proven to be elusive for these complex data sources, which are actually simple compared to biological data. Nevertheless, the most versatile of the solutions is to place a separate adapter, or ‘wrapper’, program around each source database (Figure 1) (12). The ‘wrappers’ provide a simplified ‘view’ of the source

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

*Corresponding author: Tel: +1 402 472 3039; Fax: +1 402 472 9402
*Correspondence may also be addressed to Peter Revesz. Tel: +1 402 472 3488; Fax: +1 402 472 7767

Figure 1. Two solutions for the data integration problem. (A) The ETL software extracts, transforms and loads the data sources into the warehouse. (B) The more flexible local-as-view method defines a virtual database that interacts with data sources through wrappers, which provide simplified views of the original databases.

Database content

Fourteen sources of data were integrated to create PROFESS (Table 1) using a local-as-view (LAV) modular approach (Figure 1B) (see the ‘Method for data integration’ section for details). The modular functionality of PROFESS, coupled with user-friendly searching capabilities, makes PROFESS particularly useful for asking a range of questions about the sequence, structure, and functional relationships of evolutionarily and functionally related proteins. A user interacts with PROFESS through a web interface using a functional-style query language that is translated to the structured query language (SQL) for mining PROFESS (Figure 2A). The core of PROFESS established a relationship between the Protein Data Bank (PDB) (13) and the eggNOG databases (14, 15) (Figure 2B). The link between eggNOG and the PDB was established using the proteins UniProt a (...truncated)