Towards reproducible computational drug discovery
(2020) 12:9
Schaduangrat et al. J Cheminform
https://doi.org/10.1186/s13321-020-0408-x
Journal of Cheminformatics
Open Access
REVIEW
Nalini Schaduangrat1†, Samuel Lampa2†, Saw Simeon3†, Matthew Paul Gleeson4*, Ola Spjuth2*
and Chanin Nantasenamat1*
Abstract
The reproducibility of experiments has been a long-standing impediment to further scientific progress. Computational methods have been instrumental in drug discovery efforts owing to their multifaceted utilization for data collection, pre-processing, analysis and inference. This article provides in-depth coverage of the reproducibility of computational drug discovery. This review explores the following topics: (1) the current state of the art in reproducible research, (2) research documentation (e.g. electronic laboratory notebooks, Jupyter notebooks, etc.), (3) the science of reproducible research (i.e. comparison and contrast with related concepts such as replicability, reusability and reliability), (4) model development in computational drug discovery, (5) computational issues in model development and deployment, and (6) use case scenarios for streamlining the computational drug discovery protocol. In computational disciplines, it has become common practice to share the data and programming code used for numerical calculations, not only to facilitate reproducibility but also to foster collaboration (i.e. to drive the project further by introducing new ideas, growing the data, augmenting the code, etc.). It is therefore inevitable that the field of computational drug design will adopt an open approach towards the collection, curation and sharing of data and code.
Keywords: Reproducibility, Reproducible research, Drug discovery, Drug design, Open science, Open data, Data
sharing, Data science, Bioinformatics, Cheminformatics
Introduction
Traditional drug discovery and development is well known to be time-consuming and cost-intensive: a drug takes an average of 10 to 15 years to reach the market, with an estimated cost of 58.8 billion USD as of 2015 [1]. These figures represent a dramatic 10% increase over previous years for both biotechnology and pharmaceutical companies. Of a library of 10,000 screened chemical compounds, only 250 or so move on to further clinical testing, and typically no more than 10 of these are tested in humans [2]. Furthermore, a study conducted from 1995 to 2007 by the Tufts Center for the Study of Drug Development revealed that, of all the drugs that reach Phase I of clinical trials, only 11.83% were eventually approved for market [3]. In addition, from 2006 to 2015, the success rate of drugs undergoing clinical trials was only 9.6% [4]. The escalating cost and high failure rate of this traditional path of drug discovery and development have prompted the use of computer-aided drug discovery (CADD), which encompasses ligand-based, structure-based and systems-based drug design (Fig. 1). Moreover, the major side effects of drugs that result in severe toxicity motivate the screening of
*Correspondence: ; ; chanin.
† Nalini Schaduangrat, Samuel Lampa and Saw Simeon contributed equally to this work
1 Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700 Bangkok, Thailand
2 Department of Pharmaceutical Biosciences, Uppsala University, 751 24 Uppsala, Sweden
4 Department of Biomedical Engineering, Faculty of Engineering, King Mongkut’s Institute of Technology Ladkrabang, 10520 Bangkok, Thailand
Full list of author information is available at the end of the article
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Fig. 1 Schematic summary of the drug discovery process overlaid with corresponding computational approaches
[Figure. Pipeline stages: target discovery (identify disease-modulating target protein); hit discovery (screen for hit compounds to inhibit target protein); lead optimization (hit-to-lead conversion and lead optimization); pre-clinical trials (evaluate pharmacokinetic properties); clinical trials (evaluate safety, dosage, efficacy and adverse effects). Computational approaches shown: ligand-based, structure-based and systems-based, with the labels QSAR modeling, computational chemistry, chemical space, cheminformatics, molecular modeling, protein structure prediction, molecular docking, molecular dynamics, network pharmacology, proteochemometric modeling and pathway analysis.]
ADMET (absorption, distribution, metabolism, excretion and toxicity) properties at the early stage of drug development in order to increase the success rate as well as reduce the time spent screening candidates [5]. The process of CADD begins with the identification of a target or hit compound using wet-lab experiments and, subsequently, high-throughput screening (HTS). In particular, the typical role of CADD is to screen a library of compounds against the target of interest, thereby narrowing the candidates to a few smaller clusters [6]. However, the high resource requirements of CADD, coupled with its extensive costs, open the door for virtual screening methods such as molecular docking, in which the known target of interest is screened against a virtual library of compounds. Although this method is highly effective, a crystal structure of the target of interest remains the main requirement for generating an in silico binding model with this approach. However, in the absence of a crystal structure, homology models or de novo predicted models can still be obtained and screened against the large library of compounds to find compounds with good binding affinity to the target [7]; these are identified as hits and can be further developed into lead compounds [8]. A conceptual map of the experimental and computational methodologies as applied to the drug discovery process is summarized in Fig. 2.
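The narrowing step described above, in which a virtual library is screened against a target and only the best-binding compounds are kept as hits, can be illustrated with a minimal Python sketch. It is not the authors' protocol: the `Compound` class, the `select_hits` function, the compound names, the scores and the cutoff are all hypothetical, and the scores are assumed to be predicted binding energies (lower means stronger binding) that a docking program would normally supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str             # hypothetical compound identifier
    docking_score: float  # assumed predicted binding energy; lower = stronger binding


def select_hits(library, score_cutoff, max_hits):
    """Keep compounds scoring at or below the cutoff, strongest binders first."""
    passing = [c for c in library if c.docking_score <= score_cutoff]
    ranked = sorted(passing, key=lambda c: c.docking_score)
    return ranked[:max_hits]


# Made-up virtual library; real scores would come from a docking run.
library = [
    Compound("cmpd-001", -9.4),
    Compound("cmpd-002", -5.1),
    Compound("cmpd-003", -8.2),
    Compound("cmpd-004", -6.7),
    Compound("cmpd-005", -10.1),
]

# Narrow the library to at most two hits that pass the (arbitrary) cutoff.
hits = select_hits(library, score_cutoff=-8.0, max_hits=2)
hit_names = [c.name for c in hits]
print(hit_names)
```

Writing the cutoff and the number of retained hits as explicit parameters, rather than hard-coding them, makes the selection step easy to record and rerun, which is in keeping with the reproducibility theme of this review.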
In recent years, the expansion of data repositories
including those with chemical and phar (...truncated)