Multi-PLI: interpretable multi‐task deep learning model for unifying protein–ligand interaction datasets
(2021) 13:30
Hu et al. J Cheminform
https://doi.org/10.1186/s13321-021-00510-6
Journal of Cheminformatics
Open Access
RESEARCH ARTICLE
Fan Hu†, Jiaxin Jiang†, Dongqi Wang, Muchun Zhu and Peng Yin*
Abstract
The assessment of protein–ligand interactions is critical at the early stage of drug discovery. Computational approaches
for efficiently predicting such interactions facilitate drug development. Recently, methods based on deep learning, including structure- and sequence-based models, have achieved impressive performance on several different
datasets. However, their application still suffers from a generalizability issue because of insufficient data, especially
for structure-based models, as well as a heterogeneity problem because of different label measurements and varying
proteins across datasets. Here, we present an interpretable multi-task model to evaluate protein–ligand interaction
(Multi-PLI). The model can run classification (binding or not) and regression (binding affinity) tasks concurrently by
unifying different datasets. The model outperforms traditional docking and machine learning on both binary classification and regression tasks and achieves competitive results compared with some structure-based deep learning
methods, even with the same training set size. Furthermore, combined with the proposed occlusion algorithm, the
model can predict the important amino acids of proteins that are crucial for binding, thus providing a biological
interpretation.
Keywords: Interpretable, Deep learning, Multi‐task, Drug discovery
Introduction
The development and approval of a new drug takes more
than 10 years and costs almost 2 billion dollars. Identification of the interactions between proteins and ligands
is critical at the early stage of the drug discovery process.
Computational methods for identifying possible ligands
to target proteins at the initial phase of drug discovery
indeed reduce the cost and improve the success rates of
new drug development [1, 2]. However, traditional methods have limitations. For example, their dependence on
expert knowledge may lead to low screening efficiency and limited results. Specifically, conventional
structure-based methods must first simulate the
binding poses of proteins and ligands and then calculate their binding energies, which restricts both computational efficiency and accuracy. In recent
years, researchers in this field have paid more attention
to machine learning-based methods [3, 4]. However, the
fundamental limitation of models such as the support vector
machine is that they still rely on manual feature engineering
based on expert knowledge.
*Correspondence:
† Fan Hu and Jiaxin Jiang contributed equally to this work
Guangdong‑Hong Kong‑Macao Joint Laboratory of Human‑Machine
Intelligence‑Synergy Systems, Shenzhen Institutes of Advanced
Technology, Chinese Academy of Sciences, Shenzhen 518055, China
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Recently, deep learning, which refers to algorithms built from
many layers of nonlinear transformations, has
achieved great success in many fields [5–7]. One main
advantage is that a deep learning algorithm learns and
extracts information from raw data without manual
feature extraction. Inspired by this remarkable success,
many researchers have applied deep learning to the
field of drug discovery [8–14]. Wallach et al. proposed
a method based on a convolutional neural network
(CNN), a type of deep learning model, to distinguish active
from inactive compounds for a given protein [9]. In their
study, the model outperformed traditional methods on the Directory of Useful Decoys, Enhanced (DUD-E) benchmark. In another study, Ragoza et al. described
a CNN-based scoring function using a comprehensive
3-dimensional (3D) representation of a protein-ligand
complex as input. Their model performed better
on virtual screening and pose prediction than the classical docking method AutoDock Vina [10]. Similarly,
Stepniewska-Dziubinska et al. introduced a model that takes
a 3D grid representation of the complex structure as input and processes it with a CNN. Rather than simply identifying whether
the ligand can bind to the target, their model can accurately predict the binding affinity of the protein–ligand
complex [11]. It should be noted that methods taking the
3D structure of a protein–compound complex as input,
much like traditional docking, may also be disadvantaged by the lack of data, especially for targets
without structural information. Therefore, several studies have introduced methods that use only 1D sequences
as input. Wan et al. applied the “word embedding” algorithm, which is widely used in natural language processing, to compress raw protein and compound data into two
separate vectors [15]. The two embedding vectors were then fed into a deep neural network to
predict the likelihood of binding. Similarly, to predict the
binding affinity value, Öztürk et al. proposed a model known as
DeepDTA that applies convolution operations to protein
and drug sequences separately, and their model obtained
better results than other methods on kinase datasets
[12]. To address model interpretability, Lee et al.
performed convolution on various lengths of amino acid
subsequences to capture local residue patterns [14]. They
pooled the maximum convolution results from each filter
to highlight important regions for prediction, and thus
provided a partial explanation of their model. However,
the robustness and applicability of a model are limited
if the model is restricted to a single dataset or a
single task, that is, either classification or regression.
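The sequence-based convolution and max-pooling idea described above can be illustrated with a minimal sketch. It is not the implementation used in any of the cited studies: the function names, the one-hot encoding, and the hand-built motif filter are assumptions chosen for illustration, whereas real models learn their filters from data. The sketch slides one filter along a protein sequence, keeps the maximum response (global max pooling), and reports where that maximum occurred, which is the kind of signal that can point to a region important for prediction.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[aa]] = 1.0
        rows.append(row)
    return rows


def conv1d_global_max(encoded, kernel):
    """Slide one filter along the sequence and apply global max pooling.

    `kernel` is a list of `window` rows with 20 weights each.
    Returns (maximum activation, start position of the best window).
    """
    window = len(kernel)
    best_value, best_pos = float("-inf"), -1
    for start in range(len(encoded) - window + 1):
        value = sum(
            kernel[k][c] * encoded[start + k][c]
            for k in range(window)
            for c in range(len(AMINO_ACIDS))
        )
        if value > best_value:  # strict '>' keeps the first maximum
            best_value, best_pos = value, start
    return best_value, best_pos


def make_motif_kernel(motif):
    """Hypothetical hand-built filter that responds to an exact residue motif.

    A trained model would learn such weights; this stand-in only makes
    the example deterministic.
    """
    return one_hot(motif)


if __name__ == "__main__":
    protein = one_hot("MKTAYIAKQR")  # toy sequence, not a real protein
    # Filters of different window lengths capture subsequences of different sizes.
    for motif in ("YIA", "AKQR"):
        activation, position = conv1d_global_max(protein, make_motif_kernel(motif))
        print(motif, activation, position)
```

Because only the maximum response per filter survives pooling, the position of that maximum gives a partial explanation of which residues drove the output, at the cost of discarding all other positions.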
Inspired by previous studies, here we present an interpretable multi-task model to evaluate protein-ligand
interactions. Using sequence data, the model can run
classifica (...truncated)