Improving VAE based molecular representations for compound property prediction
(2022) 14:69
Tevosyan et al. Journal of Cheminformatics
https://doi.org/10.1186/s13321-022-00648-x
Open Access
RESEARCH
Ani Tevosyan1, Lusine Khondkaryan2, Hrant Khachatrian1,3, Gohar Tadevosyan2, Lilit Apresyan2, Nelly Babayan2,5,
Helga Stopper4 and Zaven Navoyan5*
Abstract
Collecting labeled data for many important tasks in chemoinformatics is time-consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules from large-scale unlabeled molecular datasets and to transfer that knowledge to more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve the chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors in the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, the correlation between the descriptors and the target properties, and the sizes of the datasets, among other factors. Finally, we show the relation between the performance of property prediction models and the distance between the property prediction dataset and the larger unlabeled dataset in the representation space.
Keywords: Variational autoencoders, Vector representation, Transfer learning, Property prediction
*Correspondence: Zaven Navoyan (affiliation 5: Toxometris.ai, Sarmen str. 7, 0009 Yerevan, Armenia). Full list of author information is available at the end of the article.
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Introduction
One of the challenging tasks in chemoinformatics and molecular modeling is the prediction of physicochemical properties and/or biological activities of chemical compounds from their molecular/chemical features. Although quantitative structure–activity/property relationship (QSAR/QSPR) modeling based on machine learning (ML) methods has achieved a certain amount of success [1–3], it still suffers from a number of issues. The prediction efficacy and performance of ML methods are often limited by the availability of labeled datasets. The size of the training data is acknowledged as a key factor for reliable and accurate predictions [4, 5]. Another obstacle of classical machine learning is its reliance on the similarity of the chemical/biological space occupied by the training and test datasets [6, 7]. A difference between the training and test data distributions may lead to faulty extrapolation by the model and even to degraded performance [8]. This scenario requires collecting a new training dataset, which in practice is hard to achieve due to expensive and time-consuming experimental procedures.
One of the proposed solutions is transfer learning, which consists of pre-training a model on a large-scale dataset with cheap or free labels and then fine-tuning it on a smaller target dataset [9, 10]. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. Recent developments in transfer learning techniques have shown competitive results in QSAR/QSPR modeling [11–13]. Additionally, the performance of an ML algorithm depends to a large extent on the way input data is represented. Consequently, various descriptors and fingerprints engineered by experts have been widely employed for property prediction [14, 15].
Advances in deep neural networks have put forward data-driven representation learning based on molecular graphs [16–18] or sequence-based strings such as SMILES (Simplified Molecular Input Line Entry System) [19, 20]. In recent years, unsupervised latent representations of molecules extracted from variational autoencoders (VAEs) with an encoder-decoder architecture have gained significant attention for QSAR/QSPR tasks. One of the earliest works in this area is the chemical VAE (CVAE) proposed by Gómez-Bombarelli et al. [21], which converts a SMILES sequence to and from a fixed-size continuous vector. Though the main purpose of this VAE was de novo molecule generation, the authors also utilized the latent representations for property prediction. Joint training of the VAE with a property predictor results in a latent space organized so that the molecules are ordered according to their property values. Since then, various adaptations and modifications of VAEs have been proposed, aiming to improve latent space organization [22–26]. Despite huge efforts, the performance of VAEs is still far from perfect. Recently, latent representations extracted from VAEs pre-trained on large datasets have been exploited as a basis for transfer learning [27, 28]. It was shown that although this approach resulted in reasonable performance, simpler ML models such as gradient boosting or graph convolutional networks without transfer learning still outperform it.
Fig. 1 The training workflow
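The joint training of a VAE with a property predictor mentioned above can be written, in a generic form, as the usual VAE objective plus a prediction term. This is only a sketch under stated assumptions, not the exact loss of any cited work: the encoder q_phi, decoder p_theta, predictor f_psi, the weight lambda, and the squared-error penalty are illustrative choices.

```latex
\[
\mathcal{L}(\theta,\phi,\psi)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  + D_{\mathrm{KL}}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)
  + \lambda\,\mathbb{E}_{q_\phi(z \mid x)}\bigl[\lVert f_\psi(z) - y \rVert^{2}\bigr]
\]
```

Here x is the input SMILES sequence, z its latent vector, p(z) the prior, and y the property value(s) to be predicted. Minimizing the last term encourages molecules with similar y to occupy nearby regions of the latent space, which is the ordering effect described above.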
In the present work, we aim to test the hypothesis that jointly training a VAE with an additional predictor of descriptors specifically correlated with the property of interest may impact the latent space quality and, consequently, improve the prediction accuracy of the downstream task. To this end, we adopted the following approach (Fig. 1): first, we identified a subset of descriptors most correlated with a particular target in the downstream tasks; then we pre-trained a VAE along with a predictor of the chosen descriptors on the data-rich ZINC dataset. Afterwards, the extracted embeddings were used to build models for the target regression and classification tasks.
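The first step of this approach, choosing the descriptors most correlated with a target, could look roughly like the sketch below. It assumes that "most correlated" means the largest absolute Pearson correlation, which this section does not specify. The function name, the descriptor names, and the toy values are hypothetical and serve only as illustration.

```python
import numpy as np


def top_correlated_descriptors(descriptors, target, names, k=3):
    """Return the k (name, r) pairs with the largest |Pearson r| against target."""
    X = np.asarray(descriptors, dtype=float)  # shape (n_molecules, n_descriptors)
    y = np.asarray(target, dtype=float)       # shape (n_molecules,)
    Xc = X - X.mean(axis=0)                   # center each descriptor column
    yc = y - y.mean()                         # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Zero-variance columns have an undefined correlation; assign them r = 0.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))[:k]        # indices sorted by decreasing |r|
    return [(names[i], float(r[i])) for i in order]


# Toy data (hypothetical): 5 molecules, 3 descriptors.
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_descriptors = [
    [2.0, 1.0, 7.0],
    [4.0, 3.0, 7.0],
    [6.0, 2.0, 7.0],
    [8.0, 5.0, 7.0],
    [10.0, 4.0, 7.0],
]
toy_names = ["desc_a", "desc_b", "desc_c"]

selected = top_correlated_descriptors(toy_descriptors, toy_target, toy_names, k=2)
print(selected)
```

In a pipeline like the one described above, the selected descriptors would then serve as targets for the auxiliary predictor trained jointly with the VAE. Ranking by absolute value keeps strongly negatively correlated descriptors, which carry as much information about the target as positively correlated ones.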
Tevosyan et al. Journal of Cheminformatics
(2022) (...truncated)