A visual approach for analysis and inference of molecular activity spaces

Journal of Cheminformatics, Dec 2019

Molecular space visualization can help to explore the diversity of large heterogeneous chemical data, which may ultimately increase the understanding of structure-activity relationships (SAR) in drug discovery projects. Visual SAR analysis can therefore be useful for library design, for classifying chemicals ahead of biological evaluation, and for virtual screening to select compounds for synthesis or in vitro testing. As such, computational approaches for molecular space visualization have become an important topic in cheminformatics research. The proposed approach uses molecular similarity as the sole input for computing a probabilistic surface of molecular activity (PSMA). The similarity matrix is projected into 2D using different dimension reduction algorithms (Principal Coordinates Analysis (PCooA), Kruskal multidimensional scaling, Sammon mapping and t-SNE). A kernel density function is then applied to this projection to compute the probability of activity at each coordinate of the projected space. This methodology was tested on four different quantitative structure-activity relationship (QSAR) binary classification data sets, and PSMAs were computed for each. The generated maps showed internal consistency, with active molecules grouped together for all data sets and all dimensionality reduction algorithms. To validate the quality of the generated maps, the 2D coordinates of test molecules were computed in the new reference space using a data transformation matrix. In total, sixteen PSMAs were built, and their performance was assessed using the area under the curve (AUC) and the Matthews correlation coefficient (MCC). For the best projection for each data set, AUC test results ranged from 0.87 to 0.98 and MCC scores ranged from 0.33 to 0.77, suggesting that this methodology can validly capture the complexities of the molecular activity space. All four mapping functions provided generally good results, yet the overall performance of PCooA and t-SNE was slightly better than that of Sammon mapping and Kruskal multidimensional scaling. Our results showed that, with an appropriate combination of metric space representation and dimensionality reduction applied over metric spaces, it is possible to produce a visual PSMA whose consistency has been validated by using the map as a classification model. The produced maps can be used as prediction tools, as it is simple to project any molecule into this new reference space as long as its similarities to the molecules used to compute the initial similarity matrix can be computed.
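The pipeline summarized in the abstract (distances, 2D projection, kernel density, probability of activity, projection of new molecules) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes classical PCooA computed by eigendecomposition, an isotropic Gaussian kernel with a fixed bandwidth, a class-weighted density ratio as the probability of activity, and a Gower-style out-of-sample formula standing in for the paper's "data transformation matrix". The function names (`fit_pcooa`, `project`, `activity_probability`), the bandwidth value and the synthetic toy data are all hypothetical; with real data the distances would be derived from molecular similarities rather than random vectors.

```python
import numpy as np

def fit_pcooa(dist, n_components=2):
    """Classical PCooA of a symmetric distance matrix (assumed variant)."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering       # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    vals, vecs = eigvals[order], eigvecs[:, order]
    if np.any(vals <= 1e-12):
        raise ValueError("leading eigenvalues must be positive")
    coords = vecs * np.sqrt(vals)
    return coords, {"d2": d2, "vals": vals, "vecs": vecs}

def project(model, dist_to_train):
    """Place new items in the fitted space from their distances to the
    training items (Gower-style out-of-sample formula, an assumption)."""
    d2_new = np.atleast_2d(np.asarray(dist_to_train, dtype=float)) ** 2
    d2 = model["d2"]
    b = -0.5 * (d2_new - d2_new.mean(axis=1, keepdims=True)
                - d2.mean(axis=0) + d2.mean())
    return b @ model["vecs"] / np.sqrt(model["vals"])

def gaussian_density(points, query, bandwidth):
    """Average of isotropic 2D Gaussian kernels centred on `points`."""
    diff = query[:, None, :] - points[None, :, :]
    sq = np.sum(diff ** 2, axis=2)
    kernel = np.exp(-0.5 * sq / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1)

def activity_probability(coords, labels, query, bandwidth=0.5):
    """Class-weighted density ratio; returns 0.5 where both densities vanish."""
    labels = np.asarray(labels, dtype=bool)
    act = labels.sum() * gaussian_density(coords[labels], query, bandwidth)
    inact = (~labels).sum() * gaussian_density(coords[~labels], query, bandwidth)
    total = act + inact
    return np.where(total > 0, act / np.maximum(total, 1e-300), 0.5)

# Synthetic toy data: two well-separated groups standing in for actives/inactives.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 5)),
               rng.normal(3.0, 0.3, size=(10, 5))])
labels = np.array([True] * 10 + [False] * 10)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

coords, model = fit_pcooa(dist)
prob = activity_probability(coords, labels, coords)

# A hypothetical new item near the first group, described only by its distances.
new_dist = np.linalg.norm(X - np.zeros(5), axis=1)
new_xy = project(model, new_dist)
new_prob = activity_probability(coords, labels, new_xy)
```

Evaluating `activity_probability` over a regular grid of `query` points would give a surface that could be drawn as a map. Only the distances from a new item to the training items are needed to place it on that map, which mirrors the requirement stated in the abstract.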

Kausar and Falcao J Cheminform (2019) 11:63
https://doi.org/10.1186/s13321-019-0386-z

RESEARCH ARTICLE, Open Access

Samina Kausar1,2 and Andre O. Falcao1,2*

Keywords: Structure activity relationship (SAR), Molecular/chemical space, Two dimensional kernel density estimation, Noncontiguous atom matching structural similarity function (NAMS), t-SNE, PCooA, Non-metric MDS, Sammon mapping

*Correspondence: 1 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749‑016 Lisboa, Portugal. Full list of author information is available at the end of the article.

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Introduction

Chemical (or molecular) space is a high-dimensional conceptual space that describes the structural diversity of all potentially pharmacologically active molecules. The size of molecular space is not well defined, yet a fraction of it, ranging from thousands to millions of compounds, is stored in small-molecule databases. Consequently, many drug design problems focus on only a small part of this huge space, exploring the complexity of a relevant small set of chemical structures [1–3]. Nonetheless, interactive analysis and visualization of molecular space can serve as a powerful tool for exploring the diversity of the millions of compounds stored in public databases and can improve the performance of the drug discovery process. For example, nearest neighbour searches in various defined property regions of molecular space (an activity space map) can identify interesting similar molecules (potent analogues) with similar properties [1, 2, 4, 5].

Molecular space visualization methods require that molecules are projected into a reduced set of dimensions (most often two or three) in such a way that the relative distances between molecules are preserved as well as possible in the new projected space. As distances should be preserved, molecules with similar activity profiles should appear clustered together [1, 6]. Thus, molecular space visual analysis combines the concepts of molecular structure and activity similarity [6, 7]. Since molecular dis/similarity is defined through pairwise distances between projected molecules in the reference space, an appropriate choice of molecular metric space (spatial) representation is crucial for the reliable application of molecular spatial analysis. A molecule in metric space is defined as a set of distances computed from the similarity between that molecule and all the other molecules in a given chemical data set.
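The metric space representation described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the Tanimoto coefficient over sets of made-up structural features as the similarity function, not the NAMS function named in the article's keywords, and it takes distance as one minus similarity. The molecule names, feature labels and the helper `metric_space` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets (stand-in similarity)."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def metric_space(features):
    """Row i holds the distances from molecule i to every molecule in the
    set, itself included, so M molecules yield M rows of M values."""
    return [[1.0 - tanimoto(fi, fj) for fj in features] for fi in features]

# Hypothetical molecules described by made-up feature labels.
molecules = {
    "mol_a": {"aromatic_ring", "hydroxyl", "amine"},
    "mol_b": {"aromatic_ring", "hydroxyl"},
    "mol_c": {"carboxyl", "alkyl_chain"},
}
names = list(molecules)
rows = metric_space([molecules[name] for name in names])
```

Each row of `rows` is the representation of one molecule. The resulting square matrix is the kind of input that a dimension reduction method would then project into two or three dimensions.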
Many methods are available in the literature for computing dis/similarity. Many of them use either molecular descriptors or fingerprints, which represent different physico-chemical or structural characteristics [8–16]. These approaches entail that each molecule is first reduced to a vector by computing a set of attributes that can be used to infer distance. This is not always required, however, as independent approaches such as molecular graph matching can also be used for a direct assessment of structural similarity [17–20].

In a metric space representation, a set of M molecules is represented in M dimensions, as the distance to all the other elements of the set (including itself) must be present. As such, the visualization of this M-dimensional metric space in reduced spatial dimensionality is a challenge in data diversity analysis [7, 21, 22]. To address this issue, many linear and non-linear approaches have been developed to reduce the dimensionality and complexity of molecular space [1, 6, 21, 23]. In all dimension reduction (DR) methods, the most important characteris (...truncated)


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1186%2Fs13321-019-0386-z.pdf
Article home page: https://link.springer.com/article/10.1186/s13321-019-0386-z

Samina Kausar, Andre O. Falcao. A visual approach for analysis and inference of molecular activity spaces, Journal of Cheminformatics, 2019, DOI: 10.1186/s13321-019-0386-z