NRG-CING: integrated validation reports of remediated experimental biomolecular NMR data and coordinates in wwPDB
Published online 1 December 2011
Nucleic Acids Research, 2012, Vol. 40, Database issue D519–D524
doi:10.1093/nar/gkr1134
Jurgen F. Doreleijers1,2,*, Wim F. Vranken3, Christopher Schulte4, John L. Markley4,
Eldon L. Ulrich4, Gert Vriend2 and Geerten W. Vuister1,5,*
1 IMM, Radboud University Nijmegen, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
2 CMBI, Radboud University Nijmegen Medical Centre, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
3 Department of Structural Biology, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Building E, 4th Floor, Pleinlaan 2, 1050 Brussels, Belgium
4 BioMagResBank, Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Dr., Madison, WI, 53706, USA
5 Department of Biochemistry, University of Leicester, Henry Wellcome Building, Lancaster Road, Leicester LE1 9HN, UK
ABSTRACT
For many macromolecular NMR ensembles from the
Protein Data Bank (PDB) the experiment-based restraint lists are available, while other experimental
data, mainly chemical shift values, are often available from the BioMagResBank. The accuracy and
precision of the coordinates in these macromolecular NMR ensembles can be improved by recalculation using the available experimental data and
present-day software. Such efforts, however, generally fail on half of all NMR ensembles due to the
syntactic and semantic heterogeneity of the
underlying data and the wide variety of formats
used for their deposition. We have combined the
remediated restraint information from our NMR
Restraints Grid (NRG) database with available
chemical shifts from the BioMagResBank and the
Common Interface for NMR structure Generation
(CING) structure validation reports into the weekly
updated NRG-CING database (http://nmr.cmbi.ru.nl/NRG-CING). Eleven programs have been included in
the NRG-CING production pipeline to arrive at validation reports that list for each entry the potential
inconsistencies between the coordinates and the
available experimental NMR data. The longitudinal
validation of these data in a publicly available relational database yields a set of indicators that can be
used to judge the quality of every macromolecular
structure solved with NMR. The remediated NMR
experimental data sets and validation reports are
freely available online.
INTRODUCTION
Experimentally determined biomacromolecular three-dimensional (3D) structures are typically deposited in
the Worldwide Protein Data Bank (wwPDB) (1–3), as required by most journals, including NAR. As of
September 2011, there were over 76 000 entries in the
PDB (cf. Table 1), of which more than 9000 had been
solved by NMR. The BioMagResBank (BMRB) (4)
serves as a global repository of experimental NMR data,
such as restraints, assigned chemical shifts and dynamic
order parameters. Together, these repositories present a
valuable resource for numerous research areas in the life
sciences.
A series of experiments has shown that many NMR
structures can be improved if they are recalculated from
the original experimental data using present-day software
and refinement protocols (5–7), including the work behind the STAP
database published in this ‘Database’ issue of Nucleic
Acids Research. These efforts have revealed that the deposited experimental data were highly heterogeneous in
format, completeness and quality. Recently, we performed
a large-scale optimization of X-ray derived PDB entries
(8), which showed that nearly three quarters of these could
be improved in terms of fit with the experimental data and
geometric quality (9). The massive scale of this effort also
allowed the analysis of even the smallest improvements in
a statistically meaningful way (10).
*To whom correspondence should be addressed. Tel: +44 116 229 7076; Fax: +44 116 229 7018; Email:
Correspondence may also be addressed to Jurgen F. Doreleijers. Tel: +31 24 36 19674; Fax: +31 24 36 19395; Email:
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received October 19, 2011; Revised November 3, 2011; Accepted November 8, 2011
Table 1. PDB entries
Set               Entries
PDB                76 003
Solution NMR         9042
NRG-CING             8915
Proteins             7967
Dimers                413
Complexes            1235
Ligands               384
Deposition
  Before 1990           9
  1990-2000          1920
  After 2000         7113
Overview of subsets of PDB entries (23 September 2011).
DATA PREPARATION
Data conversion
The creation of a coherent and validated database of both
structures and experimental data requires several steps.
For the NRG-CING production pipeline we employed
four stages, which we call C, R, S and F, denoting coordinate, restraint, chemical shift and filtering, respectively
(Figure 1).
Coordinate stage. The coordinate data flow in from the
wwPDB using an mmCIF formatted file that adheres to
the PDB eXchange dictionary (pdbx).
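The four-stage flow (C, R, S, F) can be pictured with a minimal Python sketch. Only the stage letters and their meanings come from the text; the `run_pipeline` function, the handler callables and the entry code `1abc` are hypothetical placeholders and do not reflect the actual NRG-CING code.

```python
from enum import Enum


class Stage(Enum):
    # Stage letters and meanings as named in the text.
    C = "coordinate"
    R = "restraint"
    S = "chemical shift"
    F = "filtering"


def run_pipeline(entry_id, handlers):
    """Run the handlers in C, R, S, F order and record each outcome.

    `handlers` maps every Stage to a callable taking the entry code and
    returning True on success (a hypothetical interface for illustration).
    """
    results = {}
    for stage in Stage:  # Enum iteration follows definition order
        results[stage] = bool(handlers[stage](entry_id))
    return results


# Placeholder handlers: here the shift stage reports that nothing was merged.
handlers = {
    Stage.C: lambda entry: True,
    Stage.R: lambda entry: True,
    Stage.S: lambda entry: False,
    Stage.F: lambda entry: True,
}
results = run_pipeline("1abc", handlers)
for stage, ok in results.items():
    print(stage.name, stage.value, "ok" if ok else "no data")
```

Recording a per-stage outcome, rather than stopping at the first failure, is a design choice of this sketch; the text does not say how the production pipeline treats entries for which a stage yields no data.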
Shift stage. We developed code in collaboration with
BMRB to run through a wide variety of data sources in
order to match older entries for which the match relation
between BMRB and PDB entries had not yet been
archived. The matching algorithms are documented for
the NRG part at: http://tinyurl.com/68dd9l9 and the
CING part at http://tinyurl.com/67vfuyl. The chemical shift (CS) data
from BMRB are then merged by the FormatConverter
(FC) (19) in a procedure similar to the one used for the
restraints (15).
Filter stage. The distance restraints (DRs) are checked for stereospecific assignments and in some cases corrected by FC and
CING using the same method as currently in use at the
BMRB (11). Distance restraints with violations over 2 Å
(up to a maximum of three per entry) are omitted from
the NRG-CING database and labelled as outliers.
Although such DRs are sometimes correct, removing a correct DR
is deemed less detrimental than retaining a potentially incorrect
one, because the latter could result in an entry being unjustifiably labelled as in discord with its
experimental data. From anecdotal interactions with depositors we know that these restraints often show errant
violations that were not observed at the time of structure
calculation, but arose later as a consequence of correcting
other problems, for example, typographical errors that led
to a restraint being accidentally uncommented or incorrect
mapping of one or two atom names. The referencing of
the CS is validated during this stage by VASCO, which
compares the CS values for the atoms in a protein to their
statistical distribution in relation to the coordinate-derived per-atom solvent exposure (16).
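The outlier rule for distance restraints can be illustrated with a minimal Python sketch. The 2 Å cutoff and the limit of three restraints per entry come from the text; the `DistanceRestraint` class, its fields and the choice to drop the largest violations first when more than three exceed the cutoff are assumptions made for illustration, not the FC or CING implementation.

```python
from dataclasses import dataclass

VIOLATION_CUTOFF = 2.0       # in Å, as stated in the text
MAX_OUTLIERS_PER_ENTRY = 3   # as stated in the text


@dataclass(frozen=True)
class DistanceRestraint:
    restraint_id: int   # hypothetical identifier
    violation: float    # hypothetical field: violation in Å


def split_outliers(restraints):
    """Return (kept, outliers) for one entry.

    Restraints violated by more than the cutoff are candidates; at most
    MAX_OUTLIERS_PER_ENTRY of them are removed. Taking the largest
    violations first is an assumption of this sketch.
    """
    candidates = sorted(
        (r for r in restraints if r.violation > VIOLATION_CUTOFF),
        key=lambda r: r.violation,
        reverse=True,
    )
    outliers = candidates[:MAX_OUTLIERS_PER_ENTRY]
    outlier_ids = {r.restraint_id for r in outliers}
    kept = [r for r in restraints if r.restraint_id not in outlier_ids]
    return kept, outliers


# Made-up violations for a single entry, for illustration only.
restraints = [
    DistanceRestraint(1, 0.1),
    DistanceRestraint(2, 2.5),
    DistanceRestraint(3, 3.1),
    DistanceRestraint(4, 2.2),
    DistanceRestraint(5, 4.0),
    DistanceRestraint(6, 1.9),
]
kept, outliers = split_outliers(restraints)
print("outliers:", [r.restraint_id for r in outliers])
print("kept:", [r.restraint_id for r in kept])
```

In this made-up example four restraints exceed the cutoff, so one of them (restraint 4) stays in the kept set because of the per-entry limit.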
Cloud computing
The CING calcula (...truncated)