Bayesian molecular design with a chemical language model

https://link.springer.com/content/pdf/10.1007%2Fs10822-016-0008-z.pdf

Hisaki Ikebata, Kenta Hongo, Tetsu Isomura, Ryo Maezono, Ryo Yoshida

National Institute for Materials Science (NIMS), Tsukuba, Japan
Japan Advanced Institute of Science and Technology (JAIST), Nomi, Japan
The Graduate University for Advanced Studies (SOKENDAI), Tachikawa, Japan
The Institute of Statistical Mathematics (ISM), Research Organization of Information and Systems, Tachikawa, Japan
PRESTO, Japan Science and Technology Agency (JST), Kawaguchi, Japan

The aim of computational molecular design is to identify promising hypothetical molecules that have a predefined set of desired properties. We address the problem of accelerating materials discovery with state-of-the-art machine learning techniques. The method involves two types of prediction: forward and backward. The objective of the forward prediction is to create a set of machine learning models that predict various properties of a given molecule. Inverting the trained forward models through Bayes' law, we derive a posterior distribution for the backward prediction, which is conditioned on a desired property requirement. By exploring high-probability regions of the posterior with a sequential Monte Carlo technique, molecules that exhibit the desired properties can be created computationally. One major difficulty in the computational creation of molecules is excluding chemically unfavorable structures. To circumvent this issue, we derive a chemical language model that acquires commonly occurring patterns of chemical fragments through natural language processing of ASCII strings of existing compounds, which follow the SMILES chemical language notation.
In the backward prediction, the trained language model is used to refine chemical strings such that the properties of the resulting structures fall within the desired property region while chemically unfavorable structures are removed. The present method is demonstrated through the design of small organic molecules with property requirements on the HOMO-LUMO gap and internal energy. The R package iqspr is available at the CRAN repository.

Keywords: Inverse-QSPR; Molecular design; Bayesian analysis; Small organic molecules; Natural language processing; SMILES

Computational molecular design has great potential to save time and cost in the discovery and development of functional molecules and assemblies, including drugs, dyes, solvents, polymers, and catalysts. The objective is to computationally create promising molecules that simultaneously exhibit several desired properties. The search space is enormous: for instance, the chemical space of small organic molecules is known to consist of more than 10^60 candidates. The problem therefore entails a complicated multi-objective optimization in which it is impractical to fully explore the vast landscape of structure-property relationships. In general, the molecular design process involves two types of prediction: the forward prediction aims to predict the physical, chemical, and electric properties of a given molecular structure, and the backward prediction aims to inversely identify molecular structures that have the given desired properties. While the former is referred to as quantitative structure-property relationship (QSPR) analysis, the latter is known as inverse-QSPR analysis [1–9]. In this study, a Bayesian perspective is employed to unify the forward and backward prediction processes. We therefore call the present method Bayesian molecular design.
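The Bayesian link between the forward and backward predictions can be illustrated with a toy calculation. The sketch below assumes a finite candidate set, a Gaussian forward model for a single property, and a desired region U given as an interval, so that the posterior is p(S | Y in U) proportional to p(Y in U | S) p(S). The function names `region_likelihood` and `backward_posterior`, the prior weights, and the predicted means and standard deviations are all hypothetical and chosen only for illustration; they are not taken from the paper or from the iqspr package.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def region_likelihood(mean, sd, lower, upper):
    """p(Y in [lower, upper] | S) for a Gaussian forward prediction."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)

def backward_posterior(prior, forward, lower, upper):
    """Normalize prior(S) * p(Y in U | S) over a finite candidate set."""
    weights = {}
    for smiles, prior_prob in prior.items():
        mean, sd = forward[smiles]
        weights[smiles] = prior_prob * region_likelihood(mean, sd, lower, upper)
    total = sum(weights.values())
    return {smiles: w / total for smiles, w in weights.items()}

# Toy inputs: every number here is invented for illustration and is
# NOT a computed or measured property of the molecule.
PRIOR = {"CCO": 0.5, "C=CC=C": 0.3, "c1ccccc1": 0.2}   # stand-in for a language-model prior
FORWARD = {                                            # (predicted mean, predicted sd)
    "CCO": (9.0, 1.0),
    "C=CC=C": (6.0, 1.0),
    "c1ccccc1": (6.5, 1.0),
}

# Desired property region U = [5.5, 6.5] in the property's (arbitrary) units.
POSTERIOR = backward_posterior(PRIOR, FORWARD, lower=5.5, upper=6.5)

for smiles, prob in sorted(POSTERIOR.items(), key=lambda kv: -kv[1]):
    print(f"{smiles:10s} {prob:.3f}")
```

In this toy setting, the candidate with the largest prior weight ends up with a small posterior probability because its predicted property lies far from the desired region. Exhaustive normalization is only possible here because the candidate set is tiny; for a realistic chemical space the posterior cannot be enumerated, which is why the method explores it with a sequential Monte Carlo technique instead.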
In cheminformatics and in the emerging field of materials informatics, the forward prediction has been studied extensively, whereas considerably less progress has been made on the backward prediction. An obvious approach to the inverse problem is to use combinatorial optimization techniques, where the objective is to minimize the difference between the given desired properties and those attained by the designed molecules. Previous studies have tackled this problem with genetic algorithms (GAs) [2, 4–7, 10–13] and molecular graph enumeration [8, 9, 14]. Graph enumeration is generally less effective because of the combinatorial complexity of the design space. To narrow down the candidates, several ways of restricting the search to a limited class of molecular graphs have been investigated [9, 14]. GAs [15], which have been studied more intensively, search for optimal or suboptimal designs by successively modifying chemical structures with genetic operators consisting of mutation, crossover, and selection. The major difficulty in using a GA lies in mutating molecules in such a way that unfavorable structures are excluded, for instance, unfavorable and/or unrealistic chemical bonds such as F–N and C=O=C. This difficulty also arises in graph enumeration. To avoid the (...truncated)