SOMAS: a platform for data-driven material discovery in redox flow battery development

Scientific Data, Dec 2022

Aqueous organic redox flow batteries offer an environmentally benign, tunable, and safe route to large-scale energy storage. Energy density, one of the key performance parameters of organic redox flow batteries, depends critically on the solubility of the redox-active molecule in water. Prediction of aqueous solubility remains a challenge in chemistry. Recently, machine learning models have been developed for molecular property prediction in chemistry and materials science. The fidelity of a machine learning model depends critically on the diversity, accuracy, and abundance of the training data. We build a comprehensive open-access organic molecular database, “Solubility of Organic Molecules in Aqueous Solution” (SOMAS), containing about 12,000 molecules that cover wider chemical and solubility regimes suitable for aqueous organic redox flow battery development. In addition to experimental solubility, we also provide eight distinctive quantum descriptors, including optimized geometry, derived from high-throughput density functional theory calculations, along with six molecular descriptors for each molecule. SOMAS builds a critical foundation for future efforts in artificial intelligence-based solubility prediction models.

Data Descriptor

Peiyuan Gao1 ✉, Amity Andersen2, Jonathan Sepulveda3, Gihan U. Panapitiya4, Aaron Hollas3, Emily G. Saldanha4, Vijayakumar Murugesan1 ✉ & Wei Wang3 ✉

Background & Summary

The aqueous solubility of organic molecules is a crucial property in multiple areas such as synthetic chemistry, catalysis science, drug design, and energy science1–4.
In energy science, aqueous organic redox flow batteries (RFBs) have been increasingly recognized as a promising candidate for large-scale energy storage that can facilitate the rapid deployment of renewable energy, owing to their inherent safety, potentially low cost, and structural tunability5,6. In organic RFBs, the physicochemical properties of the organic molecules significantly affect performance7. The solubility of the redox-active organic species is a critical parameter in aqueous electrolyte design because it determines the energy density of the RFB. The versatility of organic molecular editing, in terms of both structural variations and functional group attachments, offers a unique opportunity for artificial intelligence-based design of highly soluble redox molecules for RFB applications. However, a predictive understanding of the relationship between a functional property such as solubility and the chemical structure of organic molecules is lacking. Some structural and physicochemical parameters, such as solvent-accessible surface area (SASA) and acid dissociation constant (pKa), are known to influence the solvation process. Multiple physics-based models have been developed using these properties, but their accuracy remains unsatisfactory8–10. Linear regression-based models, such as quantitative structure−property relationships (QSPRs) using molecular parameters, also fail to produce reliable solubility predictions11–13. For example, state-of-the-art models predict solubilities with root-mean-square errors (RMSEs) of approximately 0.3−0.4 log units for simple organic molecules and 0.7−1.0 log units for drug molecules in small test sets14. With recent developments in both computer hardware and software, machine learning (ML) is increasingly recognized as a powerful technique for material design and property prediction15,16.
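To make the quoted error ranges concrete, the following minimal sketch computes an RMSE in log units. It assumes that "log units" means the base-10 logarithm of molar solubility, which the text does not state explicitly; the function name `rmse_log_units` and the sample values are hypothetical.

```python
import math


def rmse_log_units(predicted_log_s, measured_log_s):
    """Root-mean-square error between predicted and measured log solubilities.

    Both inputs are assumed to hold log10(S / (mol/L)) values (an assumption,
    not something the source specifies).
    """
    if len(predicted_log_s) != len(measured_log_s) or not predicted_log_s:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (p - m) ** 2 for p, m in zip(predicted_log_s, measured_log_s)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions and measurements for two molecules.
error = rmse_log_units([0.0, -1.0], [0.3, -1.4])
# On a linear scale, an error of `error` log units is a factor of 10**error.
fold_error = 10 ** error
```

Under this assumption, an RMSE of 1.0 log unit corresponds to a typical tenfold error in the predicted solubility itself, which is why the 0.7−1.0 range reported for drug molecules is considered unsatisfactory.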
To develop generalizable and accurate ML models, large datasets of structurally and chemically diverse molecules with relevant quantum and molecular descriptors are extremely important. However, previous open-source solubility databases designed primarily for drug design are based on a few hundred drug molecules, a set that is very small and does not represent the relevant chemical parameter space of redox flow battery electrolytes. For example, the desired solubility of organic molecules in RFBs is much larger (≥0.5 M) than that of drug molecules (<0.1 M). Also, strongly acidic or basic organic molecules can be effective electrolytes in RFBs, but most drug candidates are relatively weak acids and bases17–19. Therefore, organic RFB development efforts require a comprehensive database that covers the relevant chemical parameter space. In this work, we build a comprehensive open-access database, “Solubility of Organic Molecules in Aqueous Solution” (SOMAS), that can serve as an optimal platform for developing aqueous solubility prediction models using ML methods.

1Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354, USA. 2Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA, 99354, USA. 3Energy and Environment Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354, USA. 4National Security Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354, USA.

Scientific Data | (2022) 9:740 | https://doi.org/10.1038/s41597-022-01814-4

Fig. 1 Thermodynamic cycle scheme of intrinsic solubility. ∆G = ∆Gsublimation + ∆Gsolvation. R, ideal gas constant; T, absolute temperature; S0, intrinsic solubility; Vm, crystalline molar volume.

Fig. 2 Workflow for database curation and augmentation of specific quantum and molecular descriptors.
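The Fig. 1 caption names the quantities of the thermodynamic cycle but the extracted text does not show the equation linking them. The sketch below assumes the relation commonly used with this cycle, ∆G = −RT ln(S0·Vm), with free energies referred to a 1 mol/L standard state; that relation, the function name `intrinsic_solubility`, and the default temperature are assumptions rather than details taken from the source.

```python
import math

R = 8.314462618  # ideal gas constant, J/(mol*K)


def intrinsic_solubility(dg_sublimation, dg_solvation, molar_volume,
                         temperature=298.15):
    """Estimate intrinsic solubility S0 (mol/L) from the thermodynamic cycle.

    dg_sublimation, dg_solvation: free energies in J/mol.
    molar_volume: crystalline molar volume Vm in L/mol.
    Assumes dG = dG_sublimation + dG_solvation = -R*T*ln(S0*Vm).
    """
    dg_solution = dg_sublimation + dg_solvation
    return math.exp(-dg_solution / (R * temperature)) / molar_volume


# Hypothetical inputs: 60 kJ/mol to sublime, -50 kJ/mol to solvate, Vm = 0.1 L/mol.
s0 = intrinsic_solubility(60_000.0, -50_000.0, 0.1)
log_s0 = math.log10(s0)
```

In this picture, a molecule dissolves well when a favorable (negative) solvation free energy outweighs the cost of leaving the crystal, which is one way to read the statement that the descriptors were chosen to represent the cycle.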
Unlike previous solubility databases, the SOMAS database focuses only on neutral organic molecules and excludes organic salts and organometallic compounds to reduce dataset bias in predictive models. Our database has a total of 11,696 organic compounds, which is nearly twice the number of organic compounds in AqSolDB, an open-source database reported recently20. Of equal importance, the number of molecules in the high-solubility range (>0.5 M) is also about twice that in AqSolDB, providing a more comprehensive training dataset20. In addition to the experimental solubility, eight quantum descriptors derived from high-throughput density functional theory (DFT) calculations, along with traditional molecular descriptors, were added for each molecule in the database, making it an optimal platform for solubility prediction models relevant to RFB applications. The quantum and molecular descriptors were carefully selected to represent the thermodynamic cycle of aqueous solubility shown in Fig. 1. We curated molecular data on experimental aqueous solubility, with the measurement temperature and literature references, collected from a wide range of material/chemical engineering databases and published papers/handbooks. To reduce the rate of duplicate entries in the database, we implemented a new cross-validation method using independent molecular identifiers. F (...truncated)
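The description of the duplicate-checking method is cut off in this preview, so the authors' actual procedure is not known here. As a rough illustration of the general idea of cross-checking entries with independent identifiers, the sketch below groups records that share any identifier value. The field names (`cas`, `inchikey`, `smiles`), the function `find_duplicate_groups`, and the union-find approach are all hypothetical choices, not the SOMAS implementation.

```python
from collections import defaultdict


def find_duplicate_groups(records, id_fields=("cas", "inchikey", "smiles")):
    """Return lists of record indices that share at least one identifier.

    records: list of dicts; missing or empty identifier fields are ignored.
    Records are linked transitively, so A~B via CAS and B~C via SMILES
    puts A, B, and C in the same candidate-duplicate group.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of first record carrying it
    for index, record in enumerate(records):
        for field in id_fields:
            value = record.get(field)
            if not value:
                continue
            key = (field, value)
            if key in first_seen:
                union(index, first_seen[key])
            else:
                first_seen[key] = index

    groups = defaultdict(list)
    for index in range(len(records)):
        groups[find(index)].append(index)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical entries: three ethanol records written differently, one methanol.
entries = [
    {"cas": "64-17-5", "smiles": "CCO"},
    {"cas": "64-17-5", "smiles": "OCC"},
    {"cas": "67-56-1", "smiles": "CO"},
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "smiles": "OCC"},
]
duplicates = find_duplicate_groups(entries)
```

Flagged groups would still need review before merging, since records can share one identifier through a data-entry error; requiring agreement on two independent identifiers is a stricter variant of the same idea.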


This is a preview of a remote PDF: https://www.nature.com/articles/s41597-022-01814-4.pdf
Article home page: https://www.nature.com/articles/s41597-022-01814-4

Gao, Peiyuan, Andersen, Amity, Sepulveda, Jonathan, Panapitiya, Gihan U., Hollas, Aaron, Saldanha, Emily G., Murugesan, Vijayakumar, Wang, Wei. SOMAS: a platform for data-driven material discovery in redox flow battery development, Scientific Data, DOI: 10.1038/s41597-022-01814-4