Chemical reaction enhanced graph learning for molecule representation

Bioinformatics, Oct 2024

Molecular representation learning (MRL) models molecules with low-dimensional vectors to support biological and chemical applications. Current methods primarily rely on intrinsic molecular information to learn molecular representations, but they often overlook effectively integrating domain knowledge into MRL.

Bioinformatics, 2024, 40(10), btae558
https://doi.org/10.1093/bioinformatics/btae558
Advance Access Publication Date: 13 September 2024
Original Paper

Data and text mining

Anchen Li 1,*, Elena Casiraghi 1,2,3,4, Juho Rousu 1

1 Department of Computer Science, Aalto University, Espoo, 02150, Finland
2 AnacletoLab, Dipartimento di Informatica "Giovanni degli Antoni", University of Milan, Milan, 20133, Italy
3 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, United States
4 ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit (University of Milan), Milan, 20133, Italy

*Corresponding author. Department of Computer Science, Aalto University, Espoo, 02150, Finland. E-mail: (A.L.)

Associate Editor: Jonathan Wren

Abstract

Motivation: Molecular representation learning (MRL) models molecules with low-dimensional vectors to support biological and chemical applications. Current methods primarily rely on intrinsic molecular information to learn molecular representations, but they often overlook effectively integrating domain knowledge into MRL.

Results: In this article, we develop a reaction-enhanced graph learning (RXGL) framework for MRL, utilizing chemical reactions as domain knowledge. RXGL introduces dual graph learning modules to model molecule representation. One module employs graph convolutions on molecular graphs to capture molecule structures. The other module constructs a reaction-aware graph from chemical reactions and designs a novel graph attention network on this graph to integrate reaction-level relations into molecular modeling. To refine molecule representations, we design a reaction-based relation learning task, which considers the relations between the reactant and product sides in reactions.
In addition, we introduce a cross-view contrastive task to strengthen the cooperative associations between molecular and reaction-aware graph learning. Experimental results show that RXGL achieves strong performance in various downstream tasks, including product prediction, reaction classification, and molecular property prediction.

Availability and implementation: The code is publicly available at https://github.com/coder-ACAC/RLM.

1 Introduction

Molecule representation learning (MRL) techniques are crucial for combining machine learning with biological and chemical sciences (Yi et al. 2022). MRL encodes molecules as low-dimensional vectors. These vectors retain molecule information, facilitating their use as features in downstream applications (e.g. product prediction, reaction classification, and molecular property prediction).

A variety of MRL methods have been proposed, which roughly fall into the following two categories. One school is SMILES-based methods (Fabian et al. 2020), which utilize SMILES strings as input and employ natural language models as their base architectures. However, they struggle with capturing molecule structures. The other school treats molecule topology as a graph, and models molecules with graph neural networks (GNNs) (Xu et al. 2021). Although GNN-based methods generally outperform SMILES-based ones, they typically focus on designing GNN architectures, neglecting the efficient integration of domain knowledge.

Recent studies (Wang et al. 2022a) use chemical reactions as domain knowledge for MRL. Typically, reactions are represented by equations, with reactants on the left side and products on the right (cf. Definition 2 in Section Preliminaries). These methods first learn molecule embeddings from molecular graphs and then optimize the embeddings by equating the sum of reactant embeddings with the sum of product embeddings for each reaction. Despite their effectiveness, we argue that they face at least one of the following issues.
Firstly, these reaction-based methods treat molecules as isolated data instances and rely solely on molecule structures for representation, which ignores the insights from molecule relations inherent in chemical reactions. For example, molecules involved in the same reaction (as reactants/products) may exhibit greater similarities and correlations with each other than with molecules from different reactions. To illustrate this reaction-related relation, we construct a reaction-aware graph (cf. Definition 4 in Section Preliminaries) based on a reaction set, as shown in Fig. 1a. In this graph, nodes are molecules and edges denote molecule relations driven by reactions. For molecule A, its first-order neighbors (molecules F, G, and H) represent products that can be derived from A through reactions. Molecule A’s second-order neighbors (molecules B, C, and D) suggest a property/structure similarity with A, inferred from shared reaction products. Moreover, molecule B is likely more similar to A than C or D, as evidenced by a greater overlap in the reaction products. These analyses inspire us to consider the potential benefits of incorporating molecule relations from the reaction-aware graph into MRL.

Received: 31 May 2024; Revised: 28 August 2024; Editorial Decision: 4 September 2024; Accepted: 11 September 2024
© The Author(s) 2024. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Secondly, these methods ignore the transformation relation learning between reactants and products. Their assumption that the summed embeddings of reactants and products should be equal (as shown in Fig. 1b1) essentially reduces all reactions to an identity transformation, which oversimplifies the complexity of chemical processes. In reality, reactions involve various changes, such as the number of bonds (e.g. breaking old bonds and forming new ones) and energy variations (e.g. endotherms and exotherms) before and after the reaction. The assumption in current studies fails to model these changes.

Motivated by these gaps, we introduce a reaction-enhanced graph learning framework (RXGL) for MRL. In the molecule modeling stage, we design dual graph learning modules. The first module utilizes graph convolutions on molecular graphs to capture the structural information of molecules. The second module first involves a reaction-aware graph and then creates a GNN to extract reaction-level molecular relations for molecule feature learning. In the optimization stage, we introduce a reaction-based relation learning method that considers the relation between reactants and products in chemical reactions. Specifically, we employ a memory network (Miller et al. 2016) to learn a latent relation vector that connects reactant and product embeddings (as shown in Fig. 1b2). Through the delicate key and memory component (...truncated)
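
The reaction-aware graph discussed in the introduction (molecules as nodes, reaction-driven relations as edges) can be illustrated with a small sketch. This is not the authors' implementation: the paper's Definition 4 is not part of this excerpt, so the sketch assumes an undirected edge between every reactant and every product of the same reaction. The toy reaction set and the helper names build_reaction_graph and second_order_neighbors are hypothetical and chosen only to mirror the A, B, C, D / F, G, H example.

```python
from collections import defaultdict

# Hypothetical toy reaction set: each entry is (reactants, products).
# Letters stand in for molecules, mirroring the example in the text.
REACTIONS = [
    (["A"], ["F"]),
    (["A", "B"], ["G"]),
    (["A", "C"], ["H"]),
    (["B"], ["F"]),
    (["D"], ["H"]),
]


def build_reaction_graph(reactions):
    """Link every reactant to every product of the same reaction.

    Assumption: edges are undirected reactant-product links; the paper's
    own graph definition may differ.
    """
    adj = defaultdict(set)
    for reactants, products in reactions:
        for r in reactants:
            for p in products:
                adj[r].add(p)
                adj[p].add(r)
    return adj


def second_order_neighbors(adj, mol):
    """Return {molecule: count of first-order neighbors shared with mol}.

    Only two-hop neighbors are reported: mol itself and its direct
    neighbors are excluded.
    """
    overlap = defaultdict(int)
    for first in adj[mol]:
        for second in adj[first]:
            if second != mol and second not in adj[mol]:
                overlap[second] += 1
    return dict(overlap)


graph = build_reaction_graph(REACTIONS)
first_order = sorted(graph["A"])              # products reachable from A
shared = second_order_neighbors(graph, "A")   # reactants sharing products with A

print("first-order neighbors of A:", first_order)
print("shared-product counts:", sorted(shared.items()))
```

In this toy set, B shares two products with A while C and D share one each, which matches the intuition in the text that B is likely more similar to A. A learned model would use such neighborhoods as input to graph attention rather than raw overlap counts.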


This is a preview of a remote PDF: https://academic.oup.com/bioinformatics/article-pdf/40/10/btae558/59716457/btae558.pdf
Article home page: https://academic.oup.com/bioinformatics/article/40/10/btae558/7756735

Li, Anchen, Casiraghi, Elena, Rousu, Juho. Chemical reaction enhanced graph learning for molecule representation, Bioinformatics, 2024, Volume 40, Issue 10, DOI: 10.1093/bioinformatics/btae558