Chemical reaction enhanced graph learning for molecule representation
Bioinformatics, 2024, 40(10), btae558
https://doi.org/10.1093/bioinformatics/btae558
Advance Access Publication Date: 13 September 2024
Original Paper
Data and text mining
Anchen Li 1,*, Elena Casiraghi 1,2,3,4, Juho Rousu 1
1 Department of Computer Science, Aalto University, Espoo, 02150, Finland
2 AnacletoLab, Dipartimento di Informatica "Giovanni degli Antoni", University of Milan, Milan, 20133, Italy
3 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, United States
4 ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit (University of Milan), Milan, 20133, Italy
*Corresponding author. Department of Computer Science, Aalto University, Espoo, 02150, Finland. E-mail: (A.L.)
Associate Editor: Jonathan Wren
Abstract
Motivation: Molecular representation learning (MRL) models molecules with low-dimensional vectors to support biological and chemical
applications. Current methods primarily rely on intrinsic molecular information to learn molecular representations, but they often fail to
integrate domain knowledge effectively.
Results: In this article, we develop a reaction-enhanced graph learning (RXGL) framework for MRL, utilizing chemical reactions as domain knowledge.
RXGL introduces dual graph learning modules to model molecule representation. One module employs graph convolutions on molecular graphs to
capture molecule structures. The other module constructs a reaction-aware graph from chemical reactions and designs a novel graph attention
network on this graph to integrate reaction-level relations into molecular modeling. To refine molecule representations, we design a reaction-based
relation learning task, which considers the relations between the reactant and product sides in reactions. In addition, we introduce a cross-view
contrastive task to strengthen the cooperative associations between molecular and reaction-aware graph learning. Experimental results show that
RXGL achieves strong performance in various downstream tasks, including product prediction, reaction classification, and molecular
property prediction.
Availability and implementation: The code is publicly available at https://github.com/coder-ACAC/RLM.
1 Introduction
Molecular representation learning (MRL) techniques are crucial
for combining machine learning with the biological and chemical
sciences (Yi et al. 2022). MRL encodes molecules as
low-dimensional vectors. These vectors retain molecular
information, which facilitates their use as features in downstream
applications (e.g. product prediction, reaction classification,
and molecular property prediction). A variety of MRL methods have
been proposed, which fall roughly into two categories. The first
comprises SMILES-based methods (Fabian et al. 2020), which take
SMILES strings as input and employ natural language models as
their base architectures. However, they struggle to capture
molecular structure. The second treats molecular topology as a
graph and models molecules with graph neural networks (GNNs)
(Xu et al. 2021). Although GNN-based methods generally outperform
SMILES-based ones, they typically focus on designing GNN
architectures and neglect the efficient integration of domain
knowledge.
Recent studies (Wang et al. 2022a) use chemical reactions as
domain knowledge for MRL. Reactions are typically represented by
equations, with reactants on the left side and products on the
right (cf. Definition 2 in Section Preliminaries). These methods
first learn molecule embeddings from molecular graphs and then
optimize the embeddings by equating the sum of the reactant
embeddings with the sum of the product embeddings for each
reaction. Despite their effectiveness, we argue that they face at
least one of the following issues.
Firstly, these reaction-based methods treat molecules as isolated
data instances and rely solely on molecular structure for
representation, which ignores the insights offered by the molecule
relations inherent in chemical reactions. For example, molecules
involved in the same reaction (as reactants or products) may
exhibit greater similarities and correlations with each other than
with molecules from different reactions. To illustrate this
reaction-related relation, we construct a reaction-aware graph (cf.
Definition 4 in Section Preliminaries) based on a reaction set, as
shown in Fig. 1a. In this graph, nodes are molecules and edges
denote molecule relations driven by reactions. For molecule A, its
first-order neighbors (molecules F, G, and H) represent products
that can be derived from A through reactions. Molecule A’s
second-order neighbors (molecules B, C, and D) suggest a property
or structure similarity with A, inferred from shared reaction
products. Moreover, molecule B is likely more similar to A than C
or D is, as evidenced by a greater overlap in reaction products.
These analyses inspire us to consider the potential benefits of
incorporating molecule relations from the reaction-aware graph
into MRL.
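The neighborhood reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy reaction set is hypothetical (it is not the reaction set of Fig. 1a), the edge rule "link each reactant to each product of the same reaction" is an assumed reading of the reaction-aware graph rather than Definition 4 itself, and Jaccard overlap is an illustrative similarity measure that the text does not prescribe. The helper names `build_reaction_graph` and `shared_product_overlap` are likewise made up for this example.

```python
from collections import defaultdict

# Hypothetical toy reactions as (reactants, products); not taken from Fig. 1a.
reactions = [
    ({"A"}, {"F"}),
    ({"A"}, {"G"}),
    ({"A"}, {"H"}),
    ({"B"}, {"F"}),
    ({"B"}, {"G"}),
    ({"C"}, {"H"}),
    ({"D"}, {"F"}),
]


def build_reaction_graph(reactions):
    """Map each reactant to the set of products it is linked to.

    Assumed edge rule: every reactant is connected to every product
    of the same reaction.
    """
    products_of = defaultdict(set)
    for reactants, products in reactions:
        for reactant in reactants:
            products_of[reactant] |= products
    return dict(products_of)


def shared_product_overlap(products_of, a, b):
    """Jaccard overlap between the product sets of two reactants."""
    pa = products_of.get(a, set())
    pb = products_of.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


products_of = build_reaction_graph(reactions)

# First-order neighbors of A: products reachable from A.
first_order = sorted(products_of["A"])

# Second-order neighbors of A: other reactants sharing at least one product.
second_order = sorted(
    m for m in products_of if m != "A" and products_of[m] & products_of["A"]
)

# Overlap score of each second-order neighbor with A.
overlaps = {m: shared_product_overlap(products_of, "A", m) for m in second_order}
```

In this toy set, B shares two products with A while C and D share one each, so B receives the highest overlap score, mirroring the argument that a greater overlap in reaction products suggests greater similarity.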
Received: 31 May 2024; Revised: 28 August 2024; Editorial Decision: 4 September 2024; Accepted: 11 September 2024
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Secondly, these methods ignore the learning of transformation
relations between reactants and products. Their assumption that
the summed embeddings of reactants and products should be equal
(as shown in Fig. 1b1) essentially reduces all reactions to an
identity transformation, which oversimplifies the complexity of
chemical processes. In reality, reactions involve various changes
before and after the reaction, such as changes in the number of
bonds (e.g. breaking old bonds and forming new ones) and energy
variations (e.g. endothermic and exothermic processes). The
assumption in current studies fails to model these changes.
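To make the equality assumption concrete, the following minimal sketch computes the gap between the summed product embeddings and the summed reactant embeddings for one reaction. The embedding values and molecule names are hypothetical, and the squared L2 penalty is an assumed choice of distance; the cited methods may use a different objective. The function names `reaction_residual` and `equality_loss` are introduced here purely for illustration.

```python
# Hypothetical 3-dimensional embeddings, chosen only for illustration.
embeddings = {
    "reactant_1": [0.2, 0.5, -0.1],
    "reactant_2": [0.4, -0.3, 0.6],
    "product_1": [0.7, 0.1, 0.3],
}


def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def reaction_residual(reactants, products, emb):
    """Return sum(product embeddings) minus sum(reactant embeddings)."""
    reactant_sum = sum_vectors([emb[m] for m in reactants])
    product_sum = sum_vectors([emb[m] for m in products])
    return [p - r for p, r in zip(product_sum, reactant_sum)]


def equality_loss(reactants, products, emb):
    """Squared L2 norm of the residual (assumed distance).

    The loss is zero exactly when the two summed embeddings coincide.
    """
    return sum(d * d for d in reaction_residual(reactants, products, emb))


residual = reaction_residual(["reactant_1", "reactant_2"], ["product_1"], embeddings)
loss = equality_loss(["reactant_1", "reactant_2"], ["product_1"], embeddings)
```

Minimizing such a loss pushes the residual toward zero for every reaction, which is why the text describes the assumption as an identity transformation: nothing reaction-specific, such as bond or energy changes, can be expressed by the objective.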
Motivated by these gaps, we introduce a reaction-enhanced graph
learning framework (RXGL) for MRL. In the molecule modeling stage,
we design dual graph learning modules. The first module applies
graph convolutions to molecular graphs to capture the structural
information of molecules. The second module first constructs a
reaction-aware graph and then designs a GNN on it to extract
reaction-level molecular relations for molecule feature learning.
In the optimization stage, we introduce a reaction-based relation
learning method that considers the relation between reactants and
products in chemical reactions. Specifically, we employ a memory
network (Miller et al. 2016) to learn a latent relation vector
that connects reactant and product embeddings (as
shown in Fig. 1b2). Through the delicate key and memory
component (...truncated)