SOMAS: a platform for data-driven material discovery in redox flow battery development
www.nature.com/scientificdata
Data Descriptor
OPEN
Peiyuan Gao1 ✉, Amity Andersen2, Jonathan Sepulveda3, Gihan U. Panapitiya4, Aaron Hollas3,
Emily G. Saldanha4, Vijayakumar Murugesan1 ✉ & Wei Wang3 ✉
Aqueous organic redox flow batteries offer an environmentally benign, tunable, and safe route to
large-scale energy storage. Energy density, one of the key performance parameters of organic
redox flow batteries, depends critically on the solubility of the redox-active molecule in water.
Prediction of aqueous solubility remains a challenge in chemistry. Recently, machine learning models
have been developed for molecular property prediction in chemistry and materials science. The
fidelity of a machine learning model depends critically on the diversity, accuracy, and abundance of
the training datasets. We build a comprehensive open-access organic molecular database, “Solubility of
Organic Molecules in Aqueous Solution” (SOMAS), containing about 12,000 molecules and covering wider
chemical and solubility regimes suited to aqueous organic redox flow battery development efforts.
In addition to experimental solubility, we provide eight distinctive quantum descriptors, including
optimized geometry, derived from high-throughput density functional theory calculations, along with six
molecular descriptors for each molecule. SOMAS builds a critical foundation for future efforts in artificial
intelligence-based solubility prediction models.
Background & Summary
The aqueous solubility of organic molecules is a crucial property in areas such as synthetic chemistry,
catalysis science, drug design, and energy science1–4. In energy science, to facilitate the rapid deployment of
renewable energy, aqueous organic redox flow batteries (RFBs) have been increasingly recognized as promising candidates for large-scale energy storage due to their inherent safety, potentially low cost, and structural
tunability5,6. In organic RFBs, the physicochemical properties of organic molecules significantly impact their
performance characteristics7. The solubility of redox-active organic species is a critical parameter in aqueous
electrolyte design, as it determines the energy density of RFBs.
The versatility of organic molecular editing, in terms of both structural variations and functional group attachments, offers a unique opportunity for artificial intelligence-based design of highly soluble redox molecules for
RFB applications. However, a predictive understanding of the relationship between a functional property such
as solubility and the chemical structure of organic molecules is lacking. Some structural and physicochemical
parameters, such as solvent accessible surface area (SASA) and acid dissociation constant (pKa), are known to
influence the solvation process. Multiple physics-based models were developed using these properties, but their
accuracy remains unsatisfactory8–10. Linear regression-based models, such as quantitative structure−property
relationships (QSPRs) using molecular parameters, also fail to produce reliable solubility predictions11–13. For
example, state-of-the-art models predict solubilities with root-mean-square errors (RMSEs) of
approximately 0.3−0.4 log units for simple organic molecules and 0.7−1.0 log units for drug molecules in
small test sets14.
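To make the RMSE figures above concrete, the following minimal sketch computes an RMSE in log units and converts it to a multiplicative error in solubility. It assumes solubility is expressed as a base-10 logarithm of molar concentration, which the text does not state explicitly; the function names and the sample values are hypothetical and are not taken from any dataset.

```python
import math


def rmse_log_units(predicted_log_s, measured_log_s):
    """Root-mean-square error between predicted and measured log-solubilities."""
    if len(predicted_log_s) != len(measured_log_s):
        raise ValueError("predicted and measured lists must have the same length")
    if not predicted_log_s:
        raise ValueError("at least one value is required")
    squared_errors = [
        (p - m) ** 2 for p, m in zip(predicted_log_s, measured_log_s)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fold_error(rmse):
    """Multiplicative solubility error implied by an RMSE in log10 units (assumed base 10)."""
    return 10.0 ** rmse


# Hypothetical log-solubility values, for illustration only.
predicted = [-1.0, -2.5, 0.3]
measured = [-1.2, -2.0, 0.0]
print(f"RMSE = {rmse_log_units(predicted, measured):.3f} log units")

# Under the base-10 assumption, 0.3 log units is roughly a factor of 2 in
# solubility, and 0.7 log units is roughly a factor of 5.
for value in (0.3, 0.7):
    print(f"{value} log units -> factor of {fold_error(value):.2f}")
```

Read this way, an error of 0.7−1.0 log units means a predicted solubility can be off by a factor of about 5 to 10, which is large compared with the difference between a usable and an unusable electrolyte candidate.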
With recent developments in both computer hardware and software, machine learning (ML) is increasingly
being recognized as a powerful technique for material design and property prediction15,16. To develop generalizable and accurate ML models, large datasets of structurally and chemically diverse molecules with
relevant quantum and molecular descriptors are extremely important. However, previous open-source solubility
databases, designed primarily for drug design, are based on a few hundred drug molecules; they are small
and do not represent the relevant chemical parameter space of redox flow battery electrolytes. For example,
the desired solubility of organic molecules in RFBs is much higher (≥0.5 M) than that of drug molecules
(<0.1 M). Also, strongly acidic or basic organic molecules can be effective electrolytes in RFBs, but most drug candidates are relatively weak acids and bases17–19. Therefore, organic RFB development efforts require a comprehensive database that covers the relevant chemical parameter space.
1Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354,
USA. 2Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA, 99354,
USA. 3Energy and Environment Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354, USA.
4National Security Directorate, Pacific Northwest National Laboratory, Richland, WA, 99354, USA. ✉e-mail: peiyuan.
; ;
Scientific Data | (2022) 9:740 | https://doi.org/10.1038/s41597-022-01814-4
Fig. 1 Thermodynamic cycle scheme of intrinsic solubility. ∆G = ∆Gsublimation + ∆Gsolvation. R, ideal gas
constant; T, absolute temperature; S0, intrinsic solubility; Vm, crystalline molar volume.
Fig. 2 Workflow for database curation and augmentation of specific quantum and molecular descriptors.
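The thermodynamic cycle of Fig. 1 can be written out explicitly. The first line below is the relation given in the figure caption. The second line is a sketch that assumes the commonly used standard-state convention linking the total free energy change to the intrinsic solubility through the crystalline molar volume; that convention is an assumption here and should be checked against the figure itself.

```latex
\begin{align}
\Delta G &= \Delta G_{\mathrm{sublimation}} + \Delta G_{\mathrm{solvation}}, \\
\Delta G &= -RT \ln\left(S_0 V_m\right)
\quad\Longrightarrow\quad
S_0 = \frac{1}{V_m}\exp\!\left(-\frac{\Delta G}{RT}\right),
\end{align}
```

where $R$ is the ideal gas constant, $T$ the absolute temperature, $S_0$ the intrinsic solubility, and $V_m$ the crystalline molar volume. Under this form, a more negative $\Delta G$ (weaker crystal packing or stronger solvation) corresponds to a higher intrinsic solubility.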
In this work, we build a comprehensive open-access database, “Solubility of Organic Molecules in Aqueous
Solution” (SOMAS), that can serve as an optimal platform for developing aqueous solubility prediction models
using ML methods. Unlike previous solubility databases, the SOMAS database focuses only on neutral organic
molecules and excludes organic salts and organometallic compounds to reduce dataset bias in predictive
models. Our database has a total of 11,696 organic compounds, which is nearly twice the number of organic
compounds in AqSolDB, an open-source database reported recently20. Equally important, the number
of molecules in the high-solubility range (>0.5 M) is also about twice that in AqSolDB,
providing a more comprehensive training dataset20. In addition to the experimental solubility, eight quantum
descriptors derived from high-throughput density functional theory (DFT) calculations, along with traditional
molecular descriptors, were added for each molecule in the database, making it an optimal platform for
solubility prediction models relevant to RFB applications. The quantum and molecular descriptors were
carefully selected to represent the thermodynamic cycle of aqueous solubility shown in Fig. 1. We curated
experimental aqueous solubility data, each with a specified temperature and literature reference, collected
from a wide range of material/chemical engineering databases and published papers/handbooks. To reduce the
rate of duplicate entries in the database, we implemented a new cross-validation method using independent
molecular identifiers. F (...truncated)