Multi-PLI: interpretable multi‐task deep learning model for unifying protein–ligand interaction datasets (pdf)

Hu et al. J Cheminform (2021) 13:30
https://doi.org/10.1186/s13321-021-00510-6
Journal of Cheminformatics. RESEARCH ARTICLE. Open Access.

Fan Hu†, Jiaxin Jiang†, Dongqi Wang, Muchun Zhu and Peng Yin*

*Corresponding author.
†Fan Hu and Jiaxin Jiang contributed equally to this work.
Guangdong‑Hong Kong‑Macao Joint Laboratory of Human‑Machine Intelligence‑Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Abstract

The assessment of protein–ligand interactions is critical at the early stage of drug discovery. Computational approaches for efficiently predicting such interactions facilitate drug development. Recently, methods based on deep learning, including structure- and sequence-based models, have achieved impressive performance on several different datasets. However, their application still suffers from a generalizability issue because of insufficient data, especially for structure-based models, as well as a heterogeneity problem because of different label measurements and varying proteins across datasets. Here, we present an interpretable multi-task model to evaluate protein–ligand interaction (Multi-PLI). The model can run classification (binding or not) and regression (binding affinity) tasks concurrently by unifying different datasets. The model outperforms traditional docking and machine learning methods on both binary classification and regression tasks and achieves competitive results compared with some structure-based deep learning methods, even with the same training set size. Furthermore, combined with the proposed occlusion algorithm, the model can predict the important amino acids of proteins that are crucial for binding, thus providing a biological interpretation.

Keywords: Interpretable, Deep learning, Multi‐task, Drug discovery

Introduction

Developing and gaining approval for a new drug takes more than 10 years and costs almost 2 billion dollars. Identification of the interactions between proteins and ligands is critical at the early stage of the drug discovery process. Computational methods for identifying possible ligands for target proteins at the initial phase of drug discovery reduce the cost and improve the success rates of new drug development [1, 2]. However, traditional methods have limitations; for example, their dependence on expert knowledge may lead to low screening efficiency and limited results. Specifically, these conventional structure-based methods need to first simulate the binding poses of proteins and ligands and then calculate their binding energies, which tends to restrict computational efficiency and accuracy. In recent years, researchers in this field have paid more attention to machine learning-based methods [3, 4]. However, the fundamental limitation of models such as support vector machines is that they still rely on manual feature engineering based on expert knowledge.

Recently, deep learning, which refers to algorithms built from numerous layers of nonlinear transformations, has achieved great success in many fields [5–7]. One main advantage is that a deep learning algorithm learns and extracts information from raw data without manual feature extraction. Inspired by this remarkable success, many researchers have applied deep learning to drug discovery [8–14]. Wallach et al. proposed a method based on a convolutional neural network (CNN), a deep learning algorithm, to separate active from inactive compounds for a given protein [9]. In their study, the model outperformed traditional methods on the Directory of Useful Decoys Enhanced (DUDE) benchmark. In another study, Ragoza et al. described a CNN-based scoring function using a comprehensive 3-dimensional (3D) representation of a protein–ligand complex as input. They showed better performance on virtual screening and pose prediction than the classical docking method AutoDock Vina [10]. Similarly, Stepniewska-Dziubinska et al. introduced a model that takes a 3D grid representation of the structure as input and processes it with a CNN. Rather than simply identifying whether the ligand can bind to the target, their model can accurately predict the binding affinity of the protein–ligand complex [11].

It should be noted that methods taking the 3D structure of a protein–compound complex as input, similarly to traditional docking, may also be disadvantaged by a lack of data, especially for targets without structural information. Therefore, several studies have introduced methods that use only 1D sequences as input. Wan et al.
applied the “word embedding” algorithm, which is widely used in natural language processing, to convert raw protein and compound data into two separate compressed vectors [15]. The two embedding vectors were then fed into a deep neural network to predict the likelihood of binding. Similarly, to predict binding affinity values, Öztürk et al. proposed a model known as DeepDTA that applies convolution operations to protein and drug sequences separately; their model obtained better results than other methods on kinase datasets [12]. To address model interpretability, Lee et al. performed convolution on amino acid subsequences of various lengths to capture local residue patterns [14]. They pooled the maximum convolution results from each filter to highlight regions important for prediction, and thus provided a partial explanation of their model.

However, the robustness and applicability of a model are limited if it is restricted to a single dataset or a single task, namely either classification or regression. Inspired by previous studies, here we present an interpretable multi-task model to evaluate protein–ligand interactions. Using sequence data, the model can run classifica (...truncated)