OPERA models for predicting physicochemical properties and environmental fate endpoints

Journal of Cheminformatics, Mar 2018

The collection of chemical structure information and associated experimental data for quantitative structure–activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models depends heavily on the quality of the data and the modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. It primarily uses data from the publicly available PHYSPROP database, covering a set of 13 common physicochemical and environmental fate properties. These datasets underwent extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software package. Genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2–15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q2 of the models varied from 0.72 to 0.95, with an average of 0.86, and the test-set R2 varied from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format (QMRF) and were validated by the European Commission’s Joint Research Center to be OECD compliant.
All models are freely available as an open-source, command-line application called OPEn structure–activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency’s CompTox Chemistry Dashboard.
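
The validation scheme described in the abstract (a random 75/25 split, fivefold CV on the training set, and an external test set scored on a distance-weighted k-nearest neighbor model) can be sketched in a few lines. The following is a minimal standard-library Python illustration on synthetic data, not the OPERA implementation: the function names, the inverse-distance weighting, k = 5, the interleaved fold assignment, and the two toy "descriptors" are all assumptions made for the example, and both Q2 and test-set R2 are computed here with the common 1 - PRESS/TSS form.

```python
import math
import random


def weighted_knn_predict(train_X, train_y, x, k=5):
    """Inverse-distance-weighted mean of the k nearest training responses."""
    nearest = sorted((math.dist(row, x), y) for row, y in zip(train_X, train_y))[:k]
    # An exact match would get infinite weight, so return its value directly.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def q2(y_true, y_pred):
    """1 - (sum of squared prediction errors) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - press / tss


def cross_validated_q2(X, y, k=5, folds=5):
    """Predict every sample from a model fitted on the other folds, then score."""
    n = len(X)
    preds = [0.0] * n
    for f in range(folds):
        held_out = set(range(f, n, folds))  # interleaved fold (assumption)
        fit_X = [X[i] for i in range(n) if i not in held_out]
        fit_y = [y[i] for i in range(n) if i not in held_out]
        for i in held_out:
            preds[i] = weighted_knn_predict(fit_X, fit_y, X[i], k)
    return q2(y, preds)


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for a curated dataset: two descriptors, one endpoint."""
    rng = random.Random(seed)
    X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(n)]
    y = [2.0 * a + b + rng.gauss(0, 0.05) for a, b in X]
    return X, y


def run_demo(seed=0):
    """Random 75/25 split, fivefold CV on the training part, external test score."""
    X, y = make_toy_data(seed=seed)
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.75 * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_X, train_y = [X[i] for i in train], [y[i] for i in train]
    test_X, test_y = [X[i] for i in test], [y[i] for i in test]
    cv_q2 = cross_validated_q2(train_X, train_y)
    test_pred = [weighted_knn_predict(train_X, train_y, x) for x in test_X]
    return cv_q2, q2(test_y, test_pred)


if __name__ == "__main__":
    cv_q2, test_r2 = run_demo()
    print(f"CV Q2 = {cv_q2:.2f}, test R2 = {test_r2:.2f}")
```

Keeping the test set out of both model fitting and cross-validation is what makes the test-set R2 an external check; descriptor selection by genetic algorithm, which the abstract also mentions, is deliberately left out of this sketch.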

Full-text PDF: https://link.springer.com/content/pdf/10.1186%2Fs13321-018-0263-1.pdf

Kamel Mansouri (0, 1, 2), Chris M. Grulke (2), Richard S. Judson (2), Antony J. Williams (2)

0 Present Address: ScitoVation LLC, 6 Davis Drive, Research Triangle Park, NC 27709, USA
1 Oak Ridge Institute for Science and Education, 1299 Bethel Valley Road, Oak Ridge, TN 37830, USA
2 National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711, USA

Keywords: OPERA; QSAR/QSPR; Physicochemical properties; Environmental fate; OECD principles; Open data; Open source; Model validation; QMRF

Background

The increase in the number and quantity of manufactured chemicals finding their way into the environment is proportionally increasing potential exposures of humans and wildlife to potentially harmful substances [1–7]. Due to constraints associated with time, costs, and animal welfare issues, most of these chemicals lack experimentally measured properties [8–11]. To quickly assess large numbers of chemicals for potential toxicity at reasonable cost, the U.S. Environmental Protection Agency (EPA) and other regulatory agencies need to develop new, more efficient testing and evaluation methods [2, 12–18]. Over the past decade, high-throughput screening (HTS) approaches developed by the pharmaceutical industry for drug discovery have been used as alternative approaches to traditional toxicity tests for environmental chemicals [19–22].

At the EPA, since 2007, the National Center for Computational Toxicology (NCCT) has been evaluating HTS approaches through its ToxCast program [9, 22–24]. However, because tens of thousands of chemicals require screening [3, 7, 15, 18, 25], faster and more cost-effective in silico methods such as quantitative structure–activity/property relationship (QSAR/QSPR) modeling approaches [13, 16, 18, 26–28] are needed to prioritize chemicals for testing. The growing use of QSAR modeling approaches for virtual screening and data gap filling by the scientific community is establishing QSAR models as internationally recognized alternatives to empirical testing by regulatory agencies and organizations such as REACH and the United Nations Globally Harmonized System of Classification and Labeling of Hazardous Chemicals [18, 28–33]. In addition to aiding in prioritization, QSAR models including other calculated descriptors and predicted chemical properties [23, 34] can help overcome difficulties that may arise during in vitro to in vivo extrapolation (IVIVE) or exposure assessment. Therefore, reliable predictions for both physicochemical propertie (...truncated)


Kamel Mansouri, Chris M. Grulke, Richard S. Judson, Antony J. Williams. OPERA models for predicting physicochemical properties and environmental fate endpoints, Journal of Cheminformatics, 2018, Volume 10, Issue 1, Article 10, DOI: 10.1186/s13321-018-0263-1