E-MSD: the European Bioinformatics Institute Macromolecular Structure Database
H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Golovin, K. Henrick, A. Hussain, J. Ionides, M. John, P. A. Keller, E. Krissinel, P. McNeil, A. Naim, R. Newman, T. Oldfield, J. Pineda, A. Rachedi, J. Copeland, A. Sitnov, S. Sobhany, A. Suarez-Uruena, J. Swaminathan, M. Tagari, J. Tate, S. Tromm, S. Velankar, W. Vranken
EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
The E-MSD macromolecular structure relational database (http://www.ebi.ac.uk/msd) is designed to be a single access point for protein and nucleic acid structures and related information. The database is derived from Protein Data Bank (PDB) entries. Relational database technologies are used in a comprehensive cleaning procedure to ensure data uniformity across the whole archive. The search database contains an extensive set of derived properties, goodness-of-fit indicators, and links to other EBI databases including InterPro, GO, and SWISS-PROT, together with links to SCOP, CATH, PFAM and PROSITE. A generic search interface is available, coupled with a fast secondary structure domain search tool.
The European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk)
was established in 1995 as a centre for biological
databases covering a broad range of topics from nucleotide
sequence through to protein function. From its inception, the
EBI has hosted the EMBL nucleotide sequence database (1),
and the protein sequence database SWISS-PROT/TrEMBL
(2,3). The E-MSD (http://www.ebi.ac.uk/msd) project was set
up in 1996, initially as a pilot study, to create the infrastructure
based on emerging relational database technologies to provide
clean macromolecular structure data. The challenge of
presenting the available information in an intuitive way to
users from various backgrounds and expertise demands that
the data are archived in a meaningful and flexible way that
represents the hierarchy and constraints within the data.
Relational database technology offers both the flexibility and
the framework to achieve this goal. The E-MSD has applied
these database technologies for the extremely complex
processes of importing legacy data from the Protein Data
Bank (PDB, 4), creating a deposition system for new
depositions to the PDB with automated annotation procedures,
achieving data conformity and integrating relevant
information from other biological databases. A generic query
system has been developed to allow access to the database. The
overall system has been designed from the outset to cope with
the expected exponential growth in structure data through the
structural genomics initiatives (5).
The PDB search database (E-MSD)
Database framework. The search database is implemented
using relational database technology, in a generic form that
can be used on a variety of database engines [e.g. MySQL
(6), http://www.mysql.com; Oracle, http://www.oracle.com].
The organization of the structural information is hierarchical,
with the topmost level corresponding to potential biological
assemblies [based on the PQS (7) service, http://pqs.ebi.ac.uk],
followed by the constituent polymer chains (protein and
nucleic acid) and associated bound molecules. The chains
are decomposed into residues and finally the constituent atoms.
Derived data (accessible surface area, torsion angles, etc.) are added
at each level of the hierarchy (see Figure 1). Other data are
also represented, for example, experimental and
bibliographical information. Another level of organization divides the data
into entry specific data (e.g. coordinates, experimental details)
and reference data (data that are not specific to any particular
entry, such as the chemical description of ligands and amino
acids).
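
The hierarchy described above can be illustrated with a minimal Python sketch. All class, field and dictionary names here are hypothetical and chosen for illustration only; they are not the E-MSD table or column names, and the reference data shown are a two-item placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Reference data: not specific to any entry (illustrative placeholder).
REFERENCE_COMPONENTS = {"ALA": "alanine", "GLY": "glycine"}


@dataclass
class Atom:
    name: str
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str                      # key into the reference data
    number: int
    atoms: List[Atom] = field(default_factory=list)
    phi: Optional[float] = None    # derived data attached at residue level


@dataclass
class Chain:
    chain_id: str
    kind: str                      # "protein" or "nucleic acid"
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Assembly:
    entry_code: str                # entry-specific data
    chains: List[Chain] = field(default_factory=list)
    bound_molecules: List[str] = field(default_factory=list)


def count_atoms(assembly: Assembly) -> int:
    """Walk assembly -> chain -> residue -> atom and count the atoms."""
    return sum(len(res.atoms) for ch in assembly.chains for res in ch.residues)


def demo_assembly() -> Assembly:
    """Build a one-chain, one-residue assembly with made-up coordinates."""
    ala = Residue("ALA", 1, [Atom("N", 0.0, 0.0, 0.0), Atom("CA", 1.5, 0.0, 0.0)])
    return Assembly("demo", [Chain("A", "protein", [ala])])
```

Derived quantities such as torsion angles sit on the level they describe (here `phi` on `Residue`), while the chemical description of a residue is looked up once in the shared reference data rather than stored per entry.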
The search database is designed to support efficient querying
and data retrieval and therefore contains considerable data
redundancy. Its contents are derived from another database (the
deposition database) which has a much more complex
structure and lower redundancy, making it unsuitable for
performing complex queries in real time. The deposition
database was designed using the Oracle Designer CASE tool,
which has been invaluable for tracking the development of
such a complex data model (around 400 tables linked by 1000
foreign key relationships). The maintenance of the integrity of
relationships within the data is one of the guiding principles of
its design.
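
The trade-off between the two databases can be sketched with a toy example using Python's built-in sqlite3 module. The three-table schema, the flattened `search_residue` table and the entry code `demo` are assumptions made for illustration; the real deposition schema (around 400 tables) and the E-MSD search tables are not reproduced here.

```python
import sqlite3

# Normalised, low-redundancy tables linked by foreign keys
# (a stand-in for the deposition database).
SCHEMA = """
CREATE TABLE entry (
    entry_id INTEGER PRIMARY KEY,
    code     TEXT NOT NULL
);
CREATE TABLE chain (
    chain_id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entry(entry_id),
    label    TEXT NOT NULL
);
CREATE TABLE residue (
    residue_id INTEGER PRIMARY KEY,
    chain_id   INTEGER NOT NULL REFERENCES chain(chain_id),
    name       TEXT NOT NULL,
    seq_number INTEGER NOT NULL
);
"""

# Redundant, pre-joined table (a stand-in for the search database).
FLATTEN = """
CREATE TABLE search_residue AS
SELECT e.code AS entry_code, c.label AS chain_label,
       r.name AS residue_name, r.seq_number AS seq_number
FROM residue r
JOIN chain c ON r.chain_id = c.chain_id
JOIN entry e ON c.entry_id = e.entry_id
"""


def build_demo() -> sqlite3.Connection:
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # enforce relationship integrity
    con.executescript(SCHEMA)
    con.execute("INSERT INTO entry VALUES (1, 'demo')")
    con.execute("INSERT INTO chain VALUES (1, 1, 'A')")
    con.executemany(
        "INSERT INTO residue VALUES (?, ?, ?, ?)",
        [(1, 1, "ALA", 1), (2, 1, "GLY", 2)],
    )
    con.execute(FLATTEN)  # derive the search table from the normalised data
    con.commit()
    return con


def find_residues(con: sqlite3.Connection, name: str):
    """Query the flattened table: no joins are needed at search time."""
    return con.execute(
        "SELECT entry_code, chain_label, seq_number FROM search_residue "
        "WHERE residue_name = ? ORDER BY seq_number",
        (name,),
    ).fetchall()
```

In this sketch the foreign keys guard consistency at load time, whereas the search table repeats the entry code and chain label on every residue row so that a query touches a single table.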
The deposition database performs two key functions. First, it
provides a filter that forces the legacy PDB data into a
consistent framework, thus forming the basis for development
of the search services described below. Second, it is coupled to a
deposition service for structural data to the PDB through
AutoDep (8) (http://www.ebi.ac.uk/msd-srv/autodep),
providing a versatile way of handling the depositions.
Biologically relevant organization. The quaternary structure
of a protein molecule is the arrangement of its subunits in
space and the ensemble of its intersubunit contacts and
interactions, without regard to the internal geometry of the subunits.
The quaternary state of a protein is important in understanding
its biological function. For a protein structure determined using
X-ray crystallography, the PDB entry describes the contents of
the asymmetric unit (ASU) of the crystal. The PDB entry may
therefore describe the quaternary state of the protein only partially.
The complete description of the quaternary state requires
crystallographic symmetry operations to be applied to the contents
of the ASU. We have developed algorithms (7,9) that take the
symmetry-related chains into account to determine the most
likely oligomeric state. These algorithms are used to derive
the assemblies for each PDB entry, which are then loaded into
the database.
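
The first step of this process, applying crystallographic symmetry operations to the contents of the ASU, can be sketched as follows. This is not the published algorithm (7,9): it only generates symmetry copies and makes no attempt to choose the most likely oligomeric state. It assumes fractional coordinates and represents each operation as a rotation matrix plus a translation; the function names and the copy-labelling scheme are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]
# A symmetry operation: (3x3 rotation matrix, translation vector),
# both acting on fractional coordinates.
SymOp = Tuple[Sequence[Sequence[float]], Sequence[float]]

IDENTITY: SymOp = (((1, 0, 0), (0, 1, 0), (0, 0, 1)), (0.0, 0.0, 0.0))
# Two-fold rotation about the b axis (x, y, z -> -x, y, -z), as in space group P2.
TWOFOLD_B: SymOp = (((-1, 0, 0), (0, 1, 0), (0, 0, -1)), (0.0, 0.0, 0.0))


def apply_symop(op: SymOp, frac: Vec) -> Vec:
    """Return rotation * frac + translation for one fractional coordinate."""
    rot, trans = op
    x, y, z = (
        sum(rot[i][j] * frac[j] for j in range(3)) + trans[i] for i in range(3)
    )
    return (x, y, z)


def expand_asu(
    asu_chains: Dict[str, List[Vec]], symops: Sequence[SymOp]
) -> Dict[str, List[Vec]]:
    """Generate one copy of every ASU chain per symmetry operation.

    Copies are labelled '<chain>_<operation index>', e.g. 'A_2'.
    """
    expanded: Dict[str, List[Vec]] = {}
    for k, op in enumerate(symops, start=1):
        for chain_id, coords in asu_chains.items():
            expanded[f"{chain_id}_{k}"] = [apply_symop(op, c) for c in coords]
    return expanded
```

For example, `expand_asu({"A": [(0.1, 0.2, 0.3)]}, [IDENTITY, TWOFOLD_B])` yields two copies of chain A. Deciding which of the generated copies form a biologically relevant assembly requires an analysis of the intersubunit contacts, which is beyond this sketch.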
Inter-database consistency. To maintain consistency between
the structure (E-MSD) and sequence (SWISS-PROT)
databases, it is important to determine the correct sequence
database cross-reference. The subsequent derived data pertaining
to protein families, domains, functional sites and sequences
from other databases [InterPro (10), GO (11), SCOP (12), CATH
(13), PFAM (14) and PROSITE (15)] are dependent on the correct
SWISS-PROT database cross-reference. These data are
integrated into the E-MSD search database and are made available
to users via various interfaces. For new depositions, steps are
taken to ensure that the SEQRES record in the PDB entry
represents the correct amino acid sequence of the sample.
Since many of the legacy PDB entries contain only the
coordinates of the observed atom positions, it is difficult to
obtain the complete sequence of the protein(s) studied.
Procedures developed in the group are implemented, in
collaboration with SWISS-PROT, to ensure correct mapping
of the SEQRES records in a PDB entry to the sequence
database entry at the residue level. Exchange of information
between the E-MSD and SWISS-PROT further helps to
maintain consistent information between st (...truncated)