Flame: an open source framework for model development, hosting, and usage in production environments
Journal of Cheminformatics
(2021) 13:31
Pastor et al. J Cheminform
https://doi.org/10.1186/s13321-021-00509-z
Open Access
SOFTWARE
Manuel Pastor* , José Carlos Gómez‑Tamayo and Ferran Sanz
Abstract
This article describes Flame, open source software for building predictive models and supporting their use in
production environments. Flame is a web application with a web-based graphical interface, which can be used as a
desktop application or installed on a server receiving requests from multiple users. Models can be built starting from
any collection of biologically annotated chemical structures since the software supports structural normalization,
molecular descriptor calculation, and machine learning model generation using predefined workflows. The model
building workflow can be customized from the graphical interface by selecting the type of normalization, molecular
descriptors, and machine learning algorithm from a panel of state-of-the-art methods implemented
natively. Moreover, Flame provides a mechanism for extending its source code, which allows unlimited model
customization. Models generated with Flame can be easily exported, facilitating collaborative model development.
All models are stored in a model repository supporting model versioning. Models are identified by unique model IDs
and include detailed documentation formatted using widely accepted standards. The current version is the result of
nearly 3 years of development in collaboration with users from the pharmaceutical industry within the IMI eTRANSAFE
project, which aims, among other objectives, to develop high-quality predictive models based on shared legacy data
for assessing the safety of drug candidates.
Keywords: Modeling framework, Modeling tools, Reproducibility, Model management, Workflow, QSAR, Model
integration, Web-interfaces, In-silico toxicology
Introduction
In recent years, biomedical data have become widely
available thanks to the creation of repositories like
PubChem [1] and ChEMBL [2], databases resulting from
public–private partnerships like eTOX [3, 4], and
data policies like FAIR [5], which facilitate access to
existing data by the scientific community.
*Correspondence:
Research Programme on Biomedical Informatics (GRIB), Department
of Experimental and Health Sciences, Hospital del Mar Medical Research
Institute (IMIM), Universitat Pompeu Fabra, Barcelona, Spain
An interesting way of exploiting this vast amount
of data is the development of mathematical models
connecting the chemical structure of the substances with
their biological properties. Such models are not new.
Quantitative Structure–Activity Relationships (QSAR)
were first described in the 1960s [6, 7]. QSAR models use
regression methods to identify the structural properties
linked to quantitative biological properties or to predict
these properties for new substances. For biological properties characterized using qualitative descriptions (e.g.,
positive or negative), conceptually similar approaches
can be applied using classifiers. The first QSAR models
were developed using small series of congeneric compounds, often synthesized and tested ad hoc for the
study. Nowadays, large series of structurally diverse compounds can be easily obtained from public repositories.
Pharmaceutical companies can also extract these series
from their own internal repositories and use them alone or combined with compounds from external
sources. This availability of data, combined with recent developments in
machine learning (ML) and deep learning (DL) methodologies [8] and with the implementation of many of
these methods in open source libraries [9], creates an ideal
scenario for the development of predictive models with
biomedical applications.
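The distinction made earlier in this section between regression-based QSAR models (quantitative endpoints) and classifiers (qualitative endpoints) can be illustrated with a minimal sketch. Everything in it is an assumption introduced for illustration: the single descriptor, the activity values, the labels, and the choice of ordinary least squares and a nearest-centroid rule. It is not the method used by Flame or by any cited work, and real QSAR models use many descriptors and more capable algorithms.

```python
# Toy, made-up data: one hypothetical descriptor value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]  # quantitative endpoint
labels = ["negative", "negative", "negative", "positive", "positive"]  # qualitative endpoint


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_centroids(x, y):
    """Nearest-centroid classifier: mean descriptor value per class."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def classify(centroids, x_new):
    """Assign the class whose centroid is closest to the new descriptor value."""
    return min(centroids, key=lambda label: abs(centroids[label] - x_new))


# Regression: predict a quantitative property for a new compound.
slope, intercept = fit_line(descriptor, activity)
predicted_activity = slope * 6.0 + intercept

# Classification: predict a qualitative property for the same compound.
centroids = fit_centroids(descriptor, labels)
predicted_label = classify(centroids, 6.0)

print(predicted_activity, predicted_label)
```

Both models start from the same descriptor values; only the type of annotation (a number or a category) and the learning algorithm differ.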
Indeed, the use of ML and DL is becoming very popular in biomedical research. A few remarkable models
developed recently are listed in Table 1 as examples of applications of this methodology, illustrating their
usefulness.
Increasingly, the models obtained by the application of ML are seen as valuable business assets. Accurate
and appropriately shared models can bring a number of
benefits if we are able to make effective use of existing
expertise [17]. However, the true capability of a model for
solving real-world problems critically depends on aspects
related to model implementation, such as the following.
Reproducibility
Models must produce the same results when used at different sites or times. This simple, basic requirement is
difficult to meet if (i) the training data are not available
and unambiguously identified, (ii) the algorithms used are
not documented in enough detail, or (iii) it is not possible to use exactly the same software (same version and
same platform). The fast evolution of computational
tools (both hardware and software) makes it difficult to
preserve a model over time. Various authors have discussed this topic and proposed solutions
for mitigating the problem, such as the use of appropriate
standards for QSAR data interchange [18] or a workflow
for implementing published QSAR models, accompanied by recommendations for modelers [19].
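One way to make the reproducibility requirements above concrete is to store, next to each model, a small manifest that identifies the training data and records the algorithm and software environment. The sketch below is only an illustration under stated assumptions: the function names `training_data_id` and `model_manifest`, the manifest fields, and the two-compound series are hypothetical and do not describe how Flame or the cited proposals [18, 19] handle this.

```python
import hashlib
import json
import platform
import sys


def training_data_id(records):
    """Return a SHA-256 digest that identifies a training series."""
    # Sorted keys make the digest independent of dictionary key order.
    # The order of the records themselves still matters.
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def model_manifest(records, algorithm, parameters):
    """Collect the information needed to rebuild the same model later."""
    return {
        "training_data_sha256": training_data_id(records),
        "algorithm": algorithm,
        "parameters": parameters,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Hypothetical training series: SMILES strings with made-up activity values.
series = [
    {"smiles": "CCO", "activity": 1.2},
    {"smiles": "c1ccccc1", "activity": 3.4},
]

manifest = model_manifest(series, "random forest", {"n_estimators": 100})
print(json.dumps(manifest, indent=2))
```

Comparing the stored digest with one recomputed from the data at hand shows whether the same training series is being used, and the recorded versions show whether the software environment has changed. A real manifest would also need to record the versions of the modeling libraries.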
Accessibility
Models are digital assets to which the FAIR accessibility principle can also be applied [5, 20]. Ideally, access
to existing models should be facilitated, particularly for
models developed in academic environments. In practice,
there are barriers related to the intellectual property of
the tools required to generate the predictions. This can
apply to commercial applications used to generate 3D
structures or molecular descriptors or even the modeling
software itself. For this reason, the use of open source
alternatives should be prioritized.
Not all accessibility barriers are related to intellectual
p (...truncated)