A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

Zheng and Tomiura, Journal of Cheminformatics (2024) 16:71
https://doi.org/10.1186/s13321-024-00848-7

RESEARCH, Open Access

Xiaofan Zheng1 and Yoichi Tomiura1*

Abstract

Among the many molecular properties and their combinations, obtaining the desired ones through theory or experiment is costly. Using machine learning to analyze molecular structural features and to predict molecular properties is a potentially efficient alternative that could accelerate this process. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network that extracts molecular structural features and predicts molecular properties. A SMILES sequence comprises symbols representing a molecular structure. Because a SMILES sequence differs from the actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction.

Scientific contribution

The 2-encoder pretraining is proposed by focusing on two observations: symbols in a SMILES sequence depend less on their contextual environment than words in a natural language sentence do, and one compound corresponds to multiple SMILES sequences.
The model pretrained with 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is adept at natural language.

Keywords: SMILES, ADMET molecular properties prediction, Odor descriptors, Transformer model, BERT, Pretraining

*Correspondence: Yoichi Tomiura
1 Graduate School of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, Fukuoka, Japan

© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

Molecules as microscopic units constitute macroscopic matter, and their properties directly affect the application of substances in our daily lives. Depending on the application, we require different chemical and physical properties of molecules, including simple properties such as hydrophilicity and complex properties such as protein binding. The factors that affect these properties can be traced back to deeper physical principles, but
the computational cost is huge for multi-particle systems. Obtaining complex molecular properties takes a long time with either experimental or computational chemistry methods, so obtaining molecular properties through machine learning is being considered.

Machine learning methods have been widely applied in chemistry, biology, and materials informatics. Machine learning approaches have been proposed for the prediction of chemical properties [1, 2], the synthesis of compounds [3, 4], and the prediction of chemical reaction products [5]. Although molecular properties are diverse, they are often determined by a few common key factors, such as the hydrophilicity of the molecule or whether it contains certain functional groups. These key factors can be calculated quickly; consequently, models such as random forest (RF) that take molecular fingerprints and molecular descriptors as inputs often perform well in predicting molecular properties, even when the data set is small, as shown in [6].

In addition to using feature-based methods to infer unknown molecular properties, researchers also attempt to extract features directly from molecular structures with artificial neural networks and use them to infer molecular properties. We can roughly divide these feature-free methods, which extract features of molecular structures and predict molecular properties using artificial neural networks, into three categories according to the type of input to the model.

The first category uses a SMILES (simplified molecular-input line-entry system) sequence as the input to the model. A SMILES sequence comprises symbols representing the molecular structure.
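To make the idea of a SMILES string as a sequence of symbols concrete, the following is a minimal sketch of a regex-based tokenizer in Python. It is not the tokenizer used in this paper: the names `TOKEN_PATTERN` and `tokenize` are illustrative, and the pattern is an assumption that covers only common SMILES symbols (organic-subset atoms, bracket atoms, bond and stereo symbols, parentheses for branches, and ring-closure digits).

```python
import re
from typing import List

# Illustrative pattern (an assumption, not taken from the paper).
# Alternatives are ordered so that longer tokens win over shorter ones.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]"           # bracket atom, e.g. [C@@H] or [NH4+]
    r"|Br|Cl"               # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|[()]"                # branch open / close
    r"|[=#\-+\\/:.~@*$]"    # bond, stereo, and miscellaneous symbols
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into a list of symbol tokens.

    Raises ValueError if some character is not covered by TOKEN_PATTERN,
    so that unsupported input is not silently dropped.
    """
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported symbol in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid: the branch (=O) is enclosed in parentheses.
    print(tokenize("CC(=O)O"))   # ['C', 'C', '(', '=', 'O', ')', 'O']
    # Benzene: the digit 1 marks the start and end atom of the ring.
    print(tokenize("c1ccccc1"))  # ['c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```

A token list of this kind is what a sentence-style model would consume in place of words. A bracket atom such as `[C@@H]` is kept as a single token here; splitting it further would be an equally valid design choice.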
The atoms in a molecule are written as the symbols of their atom types, the substructure of a branch chain is enclosed in parentheses '()', and a ring structure is represented by adding the same number after the start atom and end atom of the ring. A SMILES sequence can also represent stereo structures, using '\' and '/' for isomers due to double bonds and '@' and '@@' for optical isomers. As SMILES sequences can be regarded as having the same data structure as a sentence, most works process SMILES sequences with a model developed for natural language. Ref. [7] proposed a model based on long short-term memory to predict molecular properties and interpreted the results with an attention mechanism. Refs. [8–11] applied the BERT model [12] to pretrain and predict molecular properties. Ref. [13] used fingerprints converted from SMILES as inputs and pretrained the model with BERT. In addition, models can be pretrained with SMILES sequences in a language translation fashion by leveraging the fact that different SMILES sequences can represent the same molecular structure. Refs. [5, 14] pretrained a model by translating a SMILES sequence to a different SMILES sequence (where the SMILES sequences represent the same molecule) with a transform (...truncated)