Flame: an open source framework for model development, hosting, and usage in production environments


Pastor et al. Journal of Cheminformatics (2021) 13:31
https://doi.org/10.1186/s13321-021-00509-z

SOFTWARE, Open Access

Manuel Pastor*, José Carlos Gómez‑Tamayo and Ferran Sanz

Abstract

This article describes Flame, open source software for building predictive models and supporting their use in production environments. Flame is a web application with a web-based graphical interface; it can be used as a desktop application or installed on a server that receives requests from multiple users. Models can be built from any collection of biologically annotated chemical structures, since the software supports structural normalization, molecular descriptor calculation, and machine learning model generation using predefined workflows. The model building workflow can be customized from the graphical interface by selecting the type of normalization, the molecular descriptors, and the machine learning algorithm from a panel of state-of-the-art methods implemented natively. Moreover, Flame implements a mechanism for extending its source code, which allows unlimited model customization. Models generated with Flame can be easily exported, facilitating collaborative model development. All models are stored in a model repository that supports model versioning. Models are identified by unique model IDs and include detailed documentation formatted using widely accepted standards. The current version is the result of nearly 3 years of development in collaboration with users from the pharmaceutical industry within the IMI eTRANSAFE project, which aims, among other objectives, to develop high-quality predictive models based on shared legacy data for assessing the safety of drug candidates.
Keywords: Modeling framework, Modeling tools, Reproducibility, Model management, Workflow, QSAR, Model integration, Web-interfaces, In-silico toxicology

*Correspondence: Research Programme on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Hospital del Mar Medical Research Institute (IMIM), Universitat Pompeu Fabra, Barcelona, Spain

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

In recent years, biomedical data have become widely available thanks to the creation of repositories like PubChem [1] and ChEMBL [2], databases resulting from public–private partnerships like eTOX [3, 4], and data policies like FAIR [5], which facilitate access to existing data for the scientific community. An interesting way of exploiting this vast amount of data is the development of mathematical models connecting the chemical structure of substances with their biological properties.

Such models are not new. Quantitative Structure–Activity Relationships (QSAR) were first described in the 1960s [6, 7]. QSAR models use regression methods to identify the structural properties linked to quantitative biological properties or to predict these properties for new substances. For biological properties characterized using qualitative descriptions (e.g., positive or negative), conceptually similar approaches can be applied using classifiers.

The first QSAR models were developed using small series of congeneric compounds, often synthesized and tested ad hoc for the study. Nowadays, large series of structurally diverse compounds can be easily obtained from public repositories. Pharmaceutical companies can also extract these series from their own internal repositories and use them alone or combined with compounds from external sources. This fact, combined with recent developments in machine learning (ML) and deep learning (DL) methodologies [8] and with the implementation of many of these methods in open source libraries [9], creates an ideal scenario for the development of predictive models with biomedical applications. Indeed, the use of ML and DL is becoming very popular in biomedical research. A few remarkable models developed recently are listed in Table 1 as examples of applications of this methodology, illustrating their usefulness.

Increasingly, the models obtained by applying ML are seen as valuable business assets. Accurate and appropriately shared models can bring a number of benefits if we are able to make effective use of existing expertise [17]. However, the true capability of a model for solving real-world problems critically depends on aspects related to model implementation, such as the following.

Reproducibility

Models must produce the same results when used at different sites or times.
This simple, basic requirement is difficult to meet if (i) the training data are not available and uniquely identified, (ii) the algorithms used are not documented in enough detail, or (iii) it is not possible to use exactly the same software (same version and same platform). The fast evolution of computational tools (both hardware and software) makes it difficult to preserve a model over time. This topic has been discussed by various authors, who have proposed diverse solutions for mitigating the problem, such as the use of appropriate standards for QSAR data interchange [18] or a workflow for implementing published QSAR models together with recommendations to modelers [19].

Accessibility

Models are digital assets to which the FAIR accessibility principle can also be applied [5, 20]. Ideally, access to existing models should be facilitated, particularly for models developed in academic environments. In practice, there are barriers related to the intellectual property of the tools required to generate the predictions. This can apply to commercial applications used to generate 3D structures or molecular descriptors, or even to the modeling software itself. For this reason, the use of open source alternatives should be prioritized. Not all accessibility barriers are related to intellectual p (...truncated)