A visual approach for analysis and inference of molecular activity spaces
(2019) 11:63
Kausar and Falcao J Cheminform
https://doi.org/10.1186/s13321-019-0386-z
RESEARCH ARTICLE
Journal of Cheminformatics
Open Access
Samina Kausar1,2 and Andre O. Falcao1,2*
Abstract
Background: Molecular space visualization can help to explore the diversity of large heterogeneous chemical data,
which ultimately may increase the understanding of structure-activity relationships (SAR) in drug discovery projects.
Visual SAR analysis can therefore be useful for library design, classification of compounds for biological evaluation, and
virtual screening for the selection of compounds for synthesis or in vitro testing. As such, computational approaches
for molecular space visualization have become an important issue in cheminformatics research. The proposed
approach uses molecular similarity as the sole input for computing a probabilistic surface of molecular activity
(PSMA). This similarity matrix is transformed into 2D using different dimension reduction algorithms (Principal Coordinates Analysis (PCooA), Kruskal multidimensional scaling, Sammon mapping and t-SNE). From this projection, a kernel
density function is applied to compute the probability of activity for each coordinate in the new projected space.
Results: This methodology was tested over four different quantitative structure-activity relationship (QSAR) binary
classification data sets and the PSMAs were computed for each. The generated maps showed internal consistency
with active molecules grouped together for all data sets and all dimensionality reduction algorithms. To validate the
quality of the generated maps, the 2D coordinates of test molecules were computed into the new reference space
using a data transformation matrix. In total, sixteen PSMAs were built, and their performance was assessed using the
Area Under the Curve (AUC) and the Matthews Correlation Coefficient (MCC). For the best projections for each data set,
AUC testing results ranged from 0.87 to 0.98 and the MCC scores ranged from 0.33 to 0.77, suggesting that this methodology can validly capture the complexities of the molecular activity space. All four mapping functions provided generally good results, yet the overall performance of PCooA and t-SNE was slightly better than that of Sammon mapping and
Kruskal multidimensional scaling.
Conclusions: Our results showed that, by using an appropriate combination of metric space representation and
dimensionality reduction applied over metric spaces, it is possible to produce a visual PSMA whose consistency
has been validated by using the map as a classification model. The produced maps can be used as prediction tools, as
it is simple to project any molecule into this new reference space as long as the similarities to the molecules used to
compute the initial similarity matrix can be computed.
Keywords: Structure activity relationship (SAR), Molecular/chemical space, Two dimensional kernel density
estimation, Noncontiguous atom matching structural similarity function (NAMS), t-SNE, PCooA, Non-metric MDS,
Sammon mapping
*Correspondence:
1 LaSIGE, Departamento de Informática, Faculdade de Ciências,
Universidade de Lisboa, 1749‑016 Lisboa, Portugal
Full list of author information is available at the end of the article
© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/
publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Introduction
Chemical (or molecular) space refers to the high-dimensional
conceptual space that describes the structural diversity
of all possible pharmacologically active molecules. The size of molecular space is not well defined,
yet a fraction of it, ranging from thousands to millions of
compounds, is stored in small-molecule databases. Consequently, many drug design problems
focus on only a part of this huge space, exploring the complexity
of a small relevant set of chemical structures [1–3].
Nonetheless, interactive analysis and visualization of molecular space can serve as a strong tool
to explore the diversity of the millions of compounds stored
in public databases and can improve the performance of
the drug discovery process. For example, nearest neighbour
searches in various defined property regions in molecular
space (activity space map) can identify interesting similar
molecules (potent analogues) with similar properties [1,
2, 4, 5].
Molecular space visualization methods require that
molecules are projected into a reduced set of dimensions
(most often two or three) in such a way that the
relative distances between molecules are preserved as well as possible
in the new projected space. As distances should be preserved, molecules with similar activity profiles should
appear clustered together [1, 6]. Thus, molecular space
visual analysis combines the concepts of molecular structure and activity similarity [6, 7]. Since molecular (dis)similarity
is defined through pairwise distances between
projected molecules in the reference space, an appropriate
choice of a molecular metric space (spatial) representation is crucial for the reliable application of molecular spatial
analysis. A molecule in metric space is defined as a set
of distances computed from the similarity between that
molecule to all the other molecules in a given chemical
data set. Many methods for computing (dis)similarity are available in
the literature. A variety of them use either molecular descriptors or fingerprints,
which represent different physico-chemical or structural characteristics [8–16]. These approaches entail that
each molecule is first mapped into a vector space by
computing a set of attributes that can be used to infer
distance. This is not always required, however, as other independent approaches,
such as molecular graph matching, can be used to assess
structural similarity directly [17–20].
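As an illustration of the metric space representation described above, the following minimal sketch converts a pairwise similarity matrix into distances and projects the molecules into two dimensions with classical multidimensional scaling (the standard formulation of principal coordinates analysis). It is not the implementation used in this work: the conversion d = 1 - s, the function names `similarity_to_distance` and `classical_mds`, and the toy similarity values are all assumptions made for the example.

```python
import numpy as np


def similarity_to_distance(sim):
    """Turn similarities in [0, 1] into distances (assumed conversion: d = 1 - s)."""
    sim = np.asarray(sim, dtype=float)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)  # a molecule is at distance zero from itself
    return dist


def classical_mds(dist, n_components=2):
    """Project an M x M distance matrix into n_components dimensions."""
    dist = np.asarray(dist, dtype=float)
    m = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(m) - np.ones((m, m)) / m
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical similarities for four molecules: two similar pairs.
toy_similarity = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# Each row of toy_distance is one molecule's metric space representation.
toy_distance = similarity_to_distance(toy_similarity)
coords = classical_mds(toy_distance, n_components=2)
print(coords)
```

In this toy example, molecules 0 and 1 (similarity 0.9) should land close together in the projection, far from molecules 2 and 3, which is the distance-preserving behaviour that the visualization relies on.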
In metric space representation, a set of M molecules is
represented in M dimensions, as the distance from each molecule to all
elements of the set (including itself) must be present. As such, visualizing this M-dimensional
metric space in reduced spatial dimensionality is a challenge in data diversity analysis [7, 21, 22]. To address this
issue, many linear and non-linear approaches have been
developed to reduce the dimensionality and complexity
of molecular space [1, 6, 21, 23]. In all dimension reduction (DR) methods, the most important characteris (...truncated)