NRG-CING: integrated validation reports of remediated experimental biomolecular NMR data and coordinates in wwPDB

Published online 1 December 2011
Nucleic Acids Research, 2012, Vol. 40, Database issue D519–D524
doi:10.1093/nar/gkr1134

Jurgen F. Doreleijers1,2,*, Wim F. Vranken3, Christopher Schulte4, John L. Markley4, Eldon L. Ulrich4, Gert Vriend2 and Geerten W. Vuister1,5,*

1 IMM, Radboud University Nijmegen, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
2 CMBI, Radboud University Nijmegen Medical Centre, Geert Grooteplein 26-28, 6525 GA Nijmegen, The Netherlands
3 Department of Structural Biology, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Building E, 4th Floor, Pleinlaan 2, 1050 Brussels, Belgium
4 BioMagResBank, Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Dr., Madison, WI, 53706, USA
5 Department of Biochemistry, University of Leicester, Henry Wellcome Building, Lancaster Road, Leicester LE1 9HN, UK

ABSTRACT

For many macromolecular NMR ensembles from the Protein Data Bank (PDB) the experiment-based restraint lists are available, while other experimental data, mainly chemical shift values, are often available from the BioMagResBank. The accuracy and precision of the coordinates in these macromolecular NMR ensembles can be improved by recalculation using the available experimental data and present-day software. Such efforts, however, generally fail on half of all NMR ensembles due to the syntactic and semantic heterogeneity of the underlying data and the wide variety of formats used for their deposition. We have combined the remediated restraint information from our NMR Restraints Grid (NRG) database with available chemical shifts from the BioMagResBank and the Common Interface for NMR structure Generation (CING) structure validation reports into the weekly updated NRG-CING database (http://nmr.cmbi.ru.nl/NRG-CING).
Eleven programs have been included in the NRG-CING production pipeline to arrive at validation reports that list, for each entry, the potential inconsistencies between the coordinates and the available experimental NMR data. The longitudinal validation of these data in a publicly available relational database yields a set of indicators that can be used to judge the quality of every macromolecular structure solved with NMR. The remediated NMR experimental data sets and validation reports are freely available online.

INTRODUCTION

Experimentally determined biomacromolecular three-dimensional (3D) structures are typically deposited in the Worldwide Protein Data Bank (wwPDB) (1–3), as required by most journals including NAR. As of September 2011, there were over 76 000 entries in the PDB (cf. Table 1), of which about 9000 had been solved by NMR. The BioMagResBank (BMRB) (4) serves as a global repository of experimental NMR data, such as restraints, assigned chemical shifts and dynamic order parameters. Together, these repositories present a valuable resource for numerous research areas in the life sciences.

A series of experiments has shown that many NMR structures can be improved if they are recalculated from the original experimental data using present-day software and refinement protocols (5–7), including the STAP database published in this 'Database' issue of Nucleic Acids Research. These efforts have revealed that the deposited experimental data were highly heterogeneous in format, completeness and quality. Recently, we performed a large-scale optimization of X-ray derived PDB entries (8), which showed that nearly three quarters of these could be improved in terms of fit with the experimental data and geometric quality (9). The massive scale of this effort also allowed even the smallest improvements to be analysed in a statistically meaningful way (10).

*To whom correspondence should be addressed.
Tel: +44 116 229 7076; Fax: +44 116 229 7018; Email:
Correspondence may also be addressed to Jurgen F. Doreleijers. Tel: +31 24 36 19674; Fax: +31 24 36 19395; Email:

© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received October 19, 2011; Revised November 3, 2011; Accepted November 8, 2011

Table 1. PDB entries

Set                  Entries
PDB                   76 003
Solution NMR            9042
NRG-CING                8915
Proteins                7967
Dimers                   413
Complexes               1235
Ligands                  384
Deposition
  Before 1990              9
  1990-2000             1920
  After 2000            7113

Overview of subsets of PDB entries (23 September 2011).

DATA PREPARATION

Data conversion

The creation of a coherent and validated database of both structures and experimental data requires several steps. For the NRG-CING production pipeline we employed four stages, which we call C, R, S and F, denoting the coordinate, restraint, chemical shift and filtering stages, respectively (Figure 1).

Coordinate stage. The coordinate data flow in from the wwPDB as an mmCIF formatted file that adheres to the PDB eXchange dictionary (pdbx).

Shift stage. We developed code in collaboration with BMRB to run through a wide variety of data sources in order to match older entries for which the match relation between BMRB and PDB entries had not yet been archived. The matching algorithms are documented for the NRG part at http://tinyurl.com/68dd9l9 and for the CING part at http://tinyurl.com/67vfuyl. The chemical shift (CS) data from BMRB are then merged by the FormatConverter (FC) (19) in a procedure similar to the one used for the restraints (15).

Filter stage.
The distance restraints (DRs) are stereospecifically checked and in some cases corrected by FC and CING using the same method as currently in use at the BMRB (11). Distance restraints with violations over 2 Å (up to a maximum of three per entry) were omitted from the NRG-CING database and are labelled as outliers. Although such DRs are sometimes correct, removing correct DRs is deemed less detrimental than retaining potentially incorrect ones. In particular, the latter could result in an entry being unjustifiably labelled as being in discord with its experimental data. From anecdotal interactions with depositors we know that these restraints are often errant violations that were not observed at the time of structure calculation, but arose later as a consequence of correcting other problems, for example, typographical errors that led to a restraint being accidentally uncommented, or incorrect mapping of one or two atom names.

The referencing of the CS is validated during this stage by VASCO, which compares the CS values for the atoms in a protein to their statistical distribution in relation to the coordinate-derived per-atom solvent exposure (16).

Cloud computing

The CING calcula (...truncated)
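
As an illustration of the outlier rule described in the filter stage (distance restraints violated by more than 2 Å are omitted, up to a maximum of three per entry), the following minimal Python sketch may help. It is not the FC or CING implementation. The names `DistanceRestraint`, `max_violation` and `split_outliers` are hypothetical, and the choice to drop the three largest violations when more than three restraints exceed the cutoff is an assumption, since the text does not say how the three are selected.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DistanceRestraint:
    """Hypothetical minimal record for one distance restraint."""
    restraint_id: int
    max_violation: float  # largest violation over the ensemble, in angstrom


def split_outliers(
    restraints: List[DistanceRestraint],
    cutoff: float = 2.0,    # violation threshold in angstrom, from the text
    max_outliers: int = 3,  # at most three restraints omitted per entry
) -> Tuple[List[DistanceRestraint], List[DistanceRestraint]]:
    """Return (kept, outliers) for the restraints of a single entry.

    Assumption: if more than `max_outliers` restraints exceed the cutoff,
    the ones with the largest violations are the ones labelled as outliers.
    """
    candidates = [r for r in restraints if r.max_violation > cutoff]
    # Largest violations first, so that slicing keeps the worst offenders.
    candidates.sort(key=lambda r: r.max_violation, reverse=True)
    outliers = candidates[:max_outliers]
    outlier_ids = {r.restraint_id for r in outliers}
    # Preserve the original order of the restraints that are retained.
    kept = [r for r in restraints if r.restraint_id not in outlier_ids]
    return kept, outliers


if __name__ == "__main__":
    entry = [DistanceRestraint(i + 1, v)
             for i, v in enumerate([0.1, 2.5, 3.0, 2.0, 4.0, 2.2])]
    kept, outliers = split_outliers(entry)
    print("omitted as outliers:", [r.restraint_id for r in outliers])
    print("retained:", [r.restraint_id for r in kept])
```

Because the number of omitted restraints is capped, an entry with many large violations still retains most of them after filtering, so a genuinely poor fit to the experimental data remains visible in the validation report.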