E-MSD: the European Bioinformatics Institute Macromolecular Structure Database

Full text (PDF): https://nar.oxfordjournals.org/content/31/1/458.full.pdf

H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Golovin, K. Henrick, A. Hussain, J. Ionides, M. John, P. A. Keller, E. Krissinel, P. McNeil, A. Naim, R. Newman, T. Oldfield, J. Pineda, A. Rachedi, J. Copeland, A. Sitnov, S. Sobhany, A. Suarez-Uruena, J. Swaminathan, M. Tagari, J. Tate, S. Tromm, S. Velankar, W. Vranken

EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

The E-MSD macromolecular structure relational database (http://www.ebi.ac.uk/msd) is designed to be a single access point for protein and nucleic acid structures and related information. The database is derived from Protein Data Bank (PDB) entries. Relational database technologies are used in a comprehensive cleaning procedure to ensure data uniformity across the whole archive. The search database contains an extensive set of derived properties, goodness-of-fit indicators, and links to other EBI databases including InterPro, GO, and SWISS-PROT, together with links to SCOP, CATH, PFAM and PROSITE. A generic search interface is available, coupled with a fast secondary structure domain search tool.

The European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk) was established in 1995 as a centre for biological databases covering a broad range of topics from nucleotide sequence through to protein function. From its inception, the EBI has hosted the EMBL nucleotide sequence database (1), and the protein sequence database SWISS-PROT/TrEMBL (2,3). The E-MSD (http://www.ebi.ac.uk/msd) project was set up in 1996, initially as a pilot study, to create the infrastructure based on emerging relational database technologies to provide clean macromolecular structure data.
The challenge of presenting the available information in an intuitive way to users from various backgrounds and expertise demands that the data are archived in a meaningful and flexible way that represents the hierarchy and constraints within the data. Relational database technology offers both the flexibility and the framework to achieve this goal. The E-MSD has applied these database technologies to the extremely complex processes of importing legacy data from the Protein Data Bank (PDB, 4), creating a deposition system for new depositions to the PDB with automated annotation procedures, achieving data conformity and integrating relevant information from other biological databases. A generic query system has been developed to allow access to the database. The overall system has been designed from the outset to cope with the expected exponential growth in structure data through the structural genomics initiatives (5).

The PDB search database (E-MSD)

Database framework. The search database is implemented using relational database technology, in a generic form that can be used on a variety of database engines (e.g. MySQL (6), http://www.mysql.com; Oracle, http://www.oracle.com). The organization of the structural information is hierarchical, with the topmost level corresponding to potential biological assemblies [based on the PQS (7) service, http://pqs.ebi.ac.uk], followed by the constituent polymer chains (protein and nucleic acid) and associated bound molecules. The chains are decomposed into residues and finally the constituent atoms. Derived data (accessible surface area, torsion angles, etc.) are added at each level of the hierarchy; see Figure 1. Other data are also represented, for example, experimental and bibliographical information. Another level of organization divides the data into entry-specific data (e.g. coordinates, experimental details) and reference data (data that are not specific to any particular entry, such as the chemical description of ligands and amino acids).

The search database is designed to support efficient querying and data retrieval and therefore contains considerable data redundancy. Its contents are derived from another database (the deposition database), which has a much more complex structure and lower redundancy, making it unsuitable for performing complex queries in real time. The deposition database was designed using the Oracle Designer CASE tool, which has been invaluable for tracking the development of such a complex data model (around 400 tables linked by 1000 foreign key relationships). The maintenance of the integrity of relationships within the data is one of the guiding principles of its design. The deposition database performs two key functions. First, it provides a filter that forces the legacy PDB data into a consistent framework, thus forming the basis for development of the search services described below. Second, it is coupled to a deposition service for structural data to the PDB through AutoDep (8) (http://www.ebi.ac.uk/msd-srv/autodep), providing a versatile way of handling the depositions.

Biologically relevant organization. The quaternary structure of a protein molecule is the arrangement of its subunits in space and the ensemble of its intersubunit contacts and interactions, without regard to the internal geometry of the subunits. The quaternary state of a protein is important in understanding its biological function. For a protein structure determined using X-ray crystallography, the PDB entry describes the contents of the asymmetric unit (ASU) of the crystal. The PDB entry may, therefore, describe the quaternary state of the protein only partially. The complete description of the quaternary state requires crystallographic symmetry operations to be applied to the contents of the ASU.
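
To make the idea of applying symmetry operations to the ASU concrete, the following is a minimal Python sketch, not the PQS or E-MSD implementation. It assumes atoms are given in fractional coordinates and that each symmetry operator is a rotation matrix plus a translation vector; the names `apply_symop` and `expand_asu`, the example coordinates, and the choice of space group P2_1 (operators x,y,z and -x,y+1/2,-z) are illustrative assumptions only. It generates symmetry copies and does not attempt the contact analysis needed to pick the most likely assembly.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float, float]
Matrix = Tuple[Vector, Vector, Vector]
SymOp = Tuple[Matrix, Vector]  # (rotation, translation), fractional coordinates


def apply_symop(op: SymOp, frac: Vector) -> Vector:
    """Return rotation * frac + translation for one fractional coordinate."""
    rotation, translation = op
    return tuple(
        sum(rotation[i][j] * frac[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def expand_asu(asu_atoms: Sequence[Vector], symops: Sequence[SymOp]) -> List[List[Vector]]:
    """Generate one symmetry copy of the ASU atoms per operator."""
    return [[apply_symop(op, atom) for atom in asu_atoms] for op in symops]


# Illustrative operators for space group P2_1 (unique axis b); an assumption
# chosen for the example, not taken from the article.
IDENTITY: SymOp = (((1, 0, 0), (0, 1, 0), (0, 0, 1)), (0.0, 0.0, 0.0))
SCREW_B: SymOp = (((-1, 0, 0), (0, 1, 0), (0, 0, -1)), (0.0, 0.5, 0.0))

# Two hypothetical atom positions standing in for the ASU contents.
asu = [(0.10, 0.20, 0.30), (0.15, 0.25, 0.35)]
copies = expand_asu(asu, [IDENTITY, SCREW_B])

for index, copy in enumerate(copies):
    print(f"copy {index}: {copy}")
```

A real pipeline would also convert fractional to Cartesian coordinates using the unit cell, add lattice translations to reach neighbouring cells, and score the resulting interfaces before proposing an assembly.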
We have developed algorithms (7,9) that determine the most likely oligomeric state, taking into account the symmetry-related chains. These algorithms are used to determine the assemblies for each PDB entry, which are then loaded into the database.

Inter-database consistency. To maintain consistency between the structure (E-MSD) and sequence (SWISS-PROT) databases, it is important to determine the correct sequence database cross-reference. The derived data pertaining to protein families, domains, functional sites and sequences from other databases [InterPro (10), GO (11), SCOP (12), CATH (13), PFAM (14) and PROSITE (15)] depend on the correct SWISS-PROT cross-reference. These data are integrated into the E-MSD search database and are made available to users via various interfaces. For new depositions, steps are taken to ensure that the SEQRES record in the PDB entry represents the correct amino acid sequence of the sample. Since many of the legacy PDB entries contain only the coordinates of the observed atom positions, it is difficult to obtain the complete sequence of the protein(s) studied. Procedures developed in the group are implemented, in collaboration with SWISS-PROT, to ensure correct mapping of the SEQRES records in a PDB entry to the sequence database entry at the residue level. Exchange of information between the E-MSD and SWISS-PROT further helps to maintain consistent information between st (...truncated)