An automated framework for QSAR model building
Kausar and Falcao J Cheminform (2018) 10:1
https://doi.org/10.1186/s13321-017-0256-5
RESEARCH ARTICLE
Open Access
Samina Kausar1,2 and Andre O. Falcao1,2*
Abstract
Background: Tools based on in-silico quantitative structure–activity relationship (QSAR) models are widely used to
screen huge databases of compounds in order to determine the biological properties of chemical molecules based
on their chemical structure. The exponentially growing amount of data on synthesized and known chemicals
demands computationally efficient automated QSAR modeling tools that are available to researchers who may
lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform
can be an important addition to the QSAR community.
Results: In the presented workflow, the process from data preparation to model building and validation has been
completely automated. The focus was on the most critical modeling tasks (data curation, evaluation of data set
characteristics, variable selection and validation), which largely influence the performance of QSAR models. The
workflow also includes the ability to quickly evaluate the feasibility of modeling a given data set. The developed
framework was tested on data sets of thirty different problems. The best-optimized feature selection methodology in
the developed workflow is able to remove 62–99% of all redundant data. On average, feature selection reduced the
prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models
without feature selection. For the models with a modelability score above 0.6, the average PVE score was 0.71. A
strong correlation was verified between the modelability scores and the PVE of the models produced with variable
selection.
Conclusions: We developed an extendable and highly customizable fully automated QSAR modeling framework.
The workflow does not require any advanced parameterization and does not depend on the user's decisions or
expertise in machine learning or programming. Given only a target or problem, the workflow follows an unbiased
standard protocol to develop reliable QSAR models, either by directly accessing online manually curated databases
or by using private data sets. Other distinctive features of the workflow include prior estimation of data
modelability to avoid time-consuming modeling trials for non-modelable data sets, an efficient variable selection
procedure, and the availability of output at each modeling task, which supports diverse applications and the
reproduction of historical predictions.
The results obtained on a selection of thirty QSAR problems suggest that the approach is capable of building reliable
models even for challenging problems.
Keywords: Quantitative structure–activity relationship (QSAR), Machine learning, Feature selection, Variable
importance, Random forests, Support vector machines, KNIME, Data set modelability
*Correspondence:
1 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749‑016 Lisbon, Portugal
Full list of author information is available at the end of the article
© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Introduction
Background
Automating repetitive tasks in the laborious drug discovery
process has numerous advantages, including higher research
quality through fewer errors, significant time savings, and
increased productivity and capacity, to name a few. In this era, when large
amounts of data are produced every day and large computational resources are available, the introduction of
machine learning approaches has significantly automated
the drug discovery procedure and provided a faster alternative for ultrahigh-throughput screening of large databases of chemical molecules against a biological target
[1–3].
Machine learning approaches are being applied in
the drug discovery cycle to produce robust models
capable of empirically predicting the biological properties of candidate compounds for new therapeutic molecules. Many successful studies reported in
the literature attest to the importance of combining machine
learning approaches with traditional practices to address medicinal chemistry challenges [4].
Traditional laboratory methodologies often require many
expensive tests, which frequently include animal testing, to
provide information about the human safety of suggested
chemicals. Legislation does not support such frequent
experiments on laboratory animals, but rather promotes
the sharing of data and the use of integrated alternative
in-vitro and in-silico strategies of toxicokinetics [5–7].
Currently the Avicenna Research and Technological Roadmap, funded
by the European Commission, strongly suggests the
use of in-silico techniques coupled with clinical trials
[8]. This framework describes strategic priorities for
establishing the safety assessment of new medical interventions while minimizing ethically sensitive activities such as animal or human
experimentation.
Several available tools based on in-silico QSAR models
are widely used to screen very large databases of compounds in order to determine the toxicity or any desired
biological effect of chemical molecules based on their
chemical structure [9, 10]. Well-characterized, internationally accepted validation principles for creating validated models have been used by regulatory agencies in the
United States (US) and are gaining ground in the European
Union (EU) as well [8, 11–13]. In the EU, the standard recommendations for chemical risk assessment using regulatory QSAR models have been set by the Registration,
Evaluation, Authorization, and Restriction of Chemicals
(REACH) [14] and the Organization for Economic Cooperation and Development (OECD) [15]. The progress
of such projects highlights the increased importance of
productivity gains from fully accessible automation in the
drug discovery and QSAR modeling fields.
Pharmaceutical projects now aim to integrate complex
non-homogeneous data to build global models intended
to be applicable across wide
ranges of chemical space. However, with the passage of
time, there is an exponentially growing amount of synthesized and known chemical compounds data being
added to the many existing molecule data (...truncated)