Improving VAE based molecular representations for compound property prediction

Tevosyan et al. Journal of Cheminformatics (2022) 14:69
https://doi.org/10.1186/s13321-022-00648-x

RESEARCH
Open Access

Ani Tevosyan1, Lusine Khondkaryan2, Hrant Khachatrian1,3, Gohar Tadevosyan2, Lilit Apresyan2, Nelly Babayan2,5, Helga Stopper4 and Zaven Navoyan5*

Abstract

Collecting labeled data for many important tasks in chemoinformatics is time consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large scale unlabeled molecular datasets and transfer the knowledge to solve the more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors in the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, correlation between the descriptors and the target properties, sizes of the datasets etc. Finally, we show the relation between the performance of property prediction models and the distance between property prediction dataset and the larger unlabeled dataset in the representation space.

Keywords: Variational autoencoders, Vector representation, Transfer learning, Property prediction

Introduction

One of the challenging tasks in chemoinformatics and molecular modeling is the prediction of physicochemical properties and/or biological activities of chemical compounds from their molecular/chemical features.
Although quantitative structure–activity/property relationship (QSAR/QSPR) modeling based on machine learning (ML) methods has achieved a certain amount of success [1–3], it still suffers from a number of issues. The predictive performance of ML methods is often limited by the availability of labeled datasets. The size of the training data is acknowledged as a key factor for reliable and accurate predictions [4, 5]. Another obstacle of classical machine learning is its reliance on the similarity of the chemical/biological space occupied by the training and test datasets [6, 7]. A difference between the training and test data distributions may lead to faulty extrapolation by the model and even to its degradation [8]. This scenario implies collecting a new training dataset, which in practice is hard to achieve due to expensive and time-consuming experimental procedures.

One of the proposed solutions is transfer learning, which consists of pre-training a model on a large-scale dataset with cheap or free labels and then fine-tuning it on a smaller target dataset [9, 10]. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. Recent developments in transfer learning techniques have shown competitive results in QSAR/QSPR modeling [11–13].

*Correspondence: 5 Toxometris.ai, Sarmen str. 7, 0009 Yerevan, Armenia. Full list of author information is available at the end of the article.

© The Author(s) 2022. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Additionally, the performance of an ML algorithm depends to a large extent on the way input data is represented. Consequently, various descriptors and fingerprints engineered by experts have been widely employed for property prediction [14, 15]. Achievements in deep neural networks put forward data-driven representation learning based on molecular graphs [16–18] or sequence-based strings, like SMILES (Simplified Molecular Input Line Entry System) [19, 20].

In recent years, unsupervised latent representations of molecules extracted from variational autoencoders (VAEs) with an encoder-decoder architecture have gained significant attention for QSAR/QSPR tasks. One of the earliest works in this area is the chemical VAE (CVAE) proposed by Gómez-Bombarelli et al. [21], which converts a SMILES sequence to and from a fixed-size continuous vector. Though the main purpose of this VAE was de novo molecule generation, the authors also used the latent representations for property prediction. Joint training of the VAE with a property predictor results in a latent space organized so that the molecules are ordered according to their property values. Since then, various adaptations and modifications of VAEs have been proposed, aiming to improve latent space organization [22–26]. Despite huge efforts, the performance of VAEs is still far from perfect. Recently, latent representations extracted from VAEs pre-trained on large datasets have been exploited as a basis for transfer learning [27, 28]. It was shown that although this approach resulted in reasonable performance, simpler ML models like gradient boosting or graph convolutional networks without transfer learning still outperform it.

In the present work, we aim to test the hypothesis that jointly training a VAE with an additional predictor of descriptors specifically correlated with the property of interest may impact the latent space quality and, consequently, improve the prediction accuracy of the downstream task. For this, we adopted the following approach (Fig. 1): first, we identified a subset of descriptors most correlated with a particular target in the downstream tasks; then we pre-trained a VAE along with a predictor of the chosen descriptors on the data-rich ZINC dataset. Afterwards, the extracted embeddings were used to build models for the target regression and classification tasks.

Fig. 1 The training workflow
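The first two steps of the workflow in Fig. 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the text above does not state which correlation measure or loss weighting is used, so the Pearson coefficient, the function names `pearson`, `select_descriptors` and `joint_loss`, the weights `beta` and `gamma`, and the toy descriptor names are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_descriptors(descriptors: Dict[str, Sequence[float]],
                       target: Sequence[float], k: int) -> List[str]:
    """Names of the k descriptors with the largest |r| against the target.

    Assumption: ranking by absolute Pearson correlation; ties are broken
    alphabetically so the result is deterministic.
    """
    scored = [(abs(pearson(values, target)), name)
              for name, values in descriptors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def joint_loss(recon_nll: float, mu: Sequence[float], log_var: Sequence[float],
               desc_pred: Sequence[float], desc_true: Sequence[float],
               beta: float = 1.0, gamma: float = 1.0) -> float:
    """Per-molecule loss: reconstruction + beta * KL + gamma * descriptor MSE.

    Assumption: a diagonal Gaussian posterior with a standard normal prior,
    and a mean squared error on the selected descriptors.
    """
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    mse = sum((p - t) ** 2 for p, t in zip(desc_pred, desc_true)) / len(desc_true)
    return recon_nll + beta * kl + gamma * mse


# Toy data with hypothetical descriptor names (not taken from the paper).
toy_descriptors = {
    "logp": [1.1, 2.0, 2.9, 4.2],
    "mol_weight": [4.0, 3.0, 2.0, 1.0],
    "ring_count": [1.0, 3.0, 1.0, 3.0],
}
toy_target = [1.0, 2.0, 3.0, 4.0]
print(select_descriptors(toy_descriptors, toy_target, k=2))
```

Ranking by absolute correlation keeps strongly anti-correlated descriptors as well, since they carry as much linear information about the target as positively correlated ones. In a real pipeline the descriptor values would come from a cheminformatics toolkit and the loss would be computed on batches inside a deep learning framework; the plain-Python form here only makes the arithmetic explicit.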