Comment on “The power metric: a new statistically robust enrichment-type metric for virtual screening applications with early recovery capability”
J Cheminform
M. Šícho 0
M. Voršilák 0
D. Svozil 0
0 CZ‐OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Prague, Czech Republic
Recently, a new metric for virtual screening applications was reported by Lopes et al. [1]. This metric is called the power metric (PM) as it is based on the principles of the statistical power of a hypothesis test. In this comment, we add to the original article and discuss the similarity of PM to precision (Pre) and draw new conclusions from their functional relationship. PM is defined as:
$$
\mathrm{PM} = \frac{\mathrm{TPR}}{\mathrm{TPR} + \mathrm{FPR}} \tag{1}
$$
and can be reformulated as follows:
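$$
\mathrm{PM} = \frac{\mathrm{TPR}}{\mathrm{TPR} + \mathrm{FPR}}
= \frac{\frac{TP}{TP+FN}}{\frac{TP}{TP+FN} + \frac{FP}{FP+TN}}
= \frac{\frac{TP}{P}}{\frac{TP}{P} + \frac{FP}{N}}
= \frac{N \cdot TP}{N \cdot TP + P \cdot FP}
= \frac{TP}{TP + \frac{P}{N}\,FP} \tag{2}
$$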
In this formula, P is the total number of positive examples (P = TP + FN) and N
the total number of negative examples (N = FP + TN) in a data set.
Similarly, Pre is defined as:
$$
\mathrm{Pre} = \frac{TP}{TP + FP} = \frac{\mathrm{TPR}}{\mathrm{TPR} + \frac{N}{P}\,\mathrm{FPR}} \tag{3}
$$
A comparison of Eqs. 2 and 3 shows that PM differs from Pre by the P/N term
that multiplies the number of false positives FP in PM. Thus, the influence
of FP in PM is decreased in imbalanced data sets with a high number of
negative examples, and the magnitude of this effect depends directly on the
P/N ratio. Due to this dependency, PM has the ability to cancel out the
influence of negative examples and is, in this regard, more robust than Pre.
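The effect of class imbalance on the two metrics can be illustrated with a minimal Python sketch. The function names and the confusion-matrix counts below are hypothetical and chosen only for illustration; both screens have the same rates (TPR = 0.5, FPR = 0.01) and differ only in the number of negatives.

```python
def power_metric(tp: float, fp: float, p: float, n: float) -> float:
    """PM = TPR / (TPR + FPR), with TPR = TP / P and FPR = FP / N (Eq. 1)."""
    tpr = tp / p
    fpr = fp / n
    return tpr / (tpr + fpr)


def precision(tp: float, fp: float) -> float:
    """Pre = TP / (TP + FP) (first form of Eq. 3)."""
    return tp / (tp + fp)


# Hypothetical counts: identical TPR and FPR, different P/N ratios.
screens = {
    "balanced": dict(tp=50, fp=1, p=100, n=100),      # P/N = 1
    "imbalanced": dict(tp=50, fp=99, p=100, n=9900),  # P/N = 100/9900
}

for name, c in screens.items():
    pm = power_metric(c["tp"], c["fp"], c["p"], c["n"])
    pre = precision(c["tp"], c["fp"])
    print(f"{name:>10}: PM = {pm:.3f}, Pre = {pre:.3f}")
```

With these counts PM is about 0.98 in both cases, whereas Pre falls from about 0.98 in the balanced case to about 0.34 in the imbalanced one.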
Pre and PM are, however, not mutually exclusive and
depend on each other. From Eqs. 1 and 3, the following
functional relationship can be derived:
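$$
\frac{\mathrm{PM}}{\mathrm{Pre}}
= \frac{\dfrac{\mathrm{TPR}}{\mathrm{TPR} + \mathrm{FPR}}}{\dfrac{\mathrm{TPR}}{\mathrm{TPR} + \frac{N}{P}\,\mathrm{FPR}}}
= \frac{\mathrm{TPR} + \frac{N}{P}\,\mathrm{FPR}}{\mathrm{TPR} + \mathrm{FPR}}
= \frac{\mathrm{TPR} + \mathrm{FPR} - \mathrm{FPR} + \frac{N}{P}\,\mathrm{FPR}}{\mathrm{TPR} + \mathrm{FPR}}
= 1 + \frac{\left(\frac{N}{P} - 1\right)\mathrm{FPR}}{\mathrm{TPR} + \mathrm{FPR}} \tag{4}
$$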
Because of this relationship, both PM and Pre capture
model performance trends in a very similar way, as we
demonstrate below.
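As a worked instance of Eq. 4, consider hypothetical values chosen only for illustration: TPR = 0.5, FPR = 0.01, P = 100 and N = 9900, so that N/P = 99.

$$
\frac{\mathrm{PM}}{\mathrm{Pre}} = 1 + \frac{(99 - 1)\cdot 0.01}{0.5 + 0.01} = 1 + \frac{0.98}{0.51} \approx 2.92
$$

The same ratio follows from evaluating the two metrics directly:

$$
\mathrm{PM} = \frac{0.5}{0.5 + 0.01} \approx 0.980, \qquad
\mathrm{Pre} = \frac{0.5}{0.5 + 99 \cdot 0.01} = \frac{0.5}{1.49} \approx 0.336, \qquad
\frac{0.980}{0.336} \approx 2.92
$$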
Using the same approach as described in [1], we generated three models with
P/N = 100/9900: one of poor quality (λ = 3), one of good quality (λ = 10) and
one of excellent quality (λ = 30) (Fig. 1). Each model yields an ordered
set of compounds from which a fraction of molecules,
defined by the cutoff threshold χ, is selected as hits (i.e.,
FP + TP). The influence of χ cutoff on both metrics in the
early recovery region with χ < 0.1 is shown in Fig. 2.
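The dependence of both metrics on the cutoff χ can be explored with a small deterministic sketch. It is not a reproduction of the models in Fig. 1. It assumes, as a labeled simplification, that the relative rank of an active follows a truncated exponential distribution on [0, 1] with parameter λ, so that the expected recall at cutoff χ is the cumulative distribution of that law. The function names are hypothetical.

```python
import math


def expected_counts(chi: float, lam: float, p: int = 100, n: int = 9900):
    """Expected TP and FP when the top fraction chi of the ranked list is selected.

    ASSUMPTION (not taken from the comment): the relative rank of an active
    follows a truncated exponential distribution on [0, 1] with parameter lam,
    so the expected TPR at cutoff chi is that distribution's CDF.
    """
    tpr = (1.0 - math.exp(-lam * chi)) / (1.0 - math.exp(-lam))
    tp = p * tpr
    fp = chi * (p + n) - tp  # selected compounds that are not actives
    return tp, fp


def pm_and_pre(chi: float, lam: float, p: int = 100, n: int = 9900):
    """Return (PM, Pre) at cutoff chi, following Eqs. 1 and 3."""
    tp, fp = expected_counts(chi, lam, p, n)
    tpr, fpr = tp / p, fp / n
    pm = tpr / (tpr + fpr)
    pre = tp / (tp + fp)
    return pm, pre


# Poor, good and excellent models in the early recovery region.
for lam in (3, 10, 30):
    for chi in (0.01, 0.05, 0.10):
        pm, pre = pm_and_pre(chi, lam)
        print(f"lambda = {lam:>2}, chi = {chi:.2f}: PM = {pm:.3f}, Pre = {pre:.3f}")
```

Because the counts are expectations rather than sampled rankings, the output is smooth in χ; a stochastic version would draw the ranks of the P actives from the assumed distribution instead.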
Figure 2 clearly shows that both PM and Pre capture
the same trends, albeit at different scales. For a poor
quality model, PM values vary considerably more than
Pre values, which is due to the P/N ratio. While PM is more
sensitive to the increase in accepted actives (P/N decreases
the influence of false positives for PM, see Eq. 2), the Pre
value shows less variance and quickly approaches zero
because the list of the top hits gets “flooded” with false
positives. On the other hand, for good and excellent
quality models we find more variance in Pre than in PM
(Fig. 2). In particular, for an excellent quality model, PM
varies very little, again due to the influence of P/N.
Therefore, using Pre, one can identify a range of χ values where
a small shift in χ results in the acceptance of a large
number of false positives (Fig. 2, black line segment). This
effect is, however, not captured so distinctively by PM.
Therefore, we may conclude that the main advantage of
PM over Pre is its robustness with respect to the
imbalance of positive and negative examples. However, PM
fails to capture, especially for well-performing models,
the influence of false positives. In addition, PM and Pre
are functionally related. Therefore, if
PM and Pre are used for the comparison of two
different models on the same data set, the conclusions are the
same irrespective of the metric. Lastly, it is also
important to note that when the P/N ratio equals 1 (i.e., in a
balanced data set), PM and Pre become equivalent.
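The claim that both metrics lead to the same conclusions on a fixed data set can be made explicit with a short derivation from Eqs. 2 and 3. It assumes TP > 0 and writes r = P/N, a symbol introduced here only for brevity.

$$
\frac{FP}{TP} = \frac{1}{\mathrm{Pre}} - 1
\quad\Longrightarrow\quad
\mathrm{PM} = \frac{1}{1 + r\,\frac{FP}{TP}} = \frac{\mathrm{Pre}}{\mathrm{Pre} + r\,(1 - \mathrm{Pre})},
\qquad
\frac{d\,\mathrm{PM}}{d\,\mathrm{Pre}} = \frac{r}{\left(\mathrm{Pre} + r\,(1 - \mathrm{Pre})\right)^{2}} > 0
$$

For a fixed r, PM is thus a strictly increasing function of Pre, so the model with the higher Pre also has the higher PM. Setting r = 1 gives PM = Pre.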
In the end, we would like to emphasize that PM is not
a suitable metric for the performance assessment of
classification models. Similarly to Pre, it does not take into
account the number of true or false negatives. Thus, it
should be accompanied by a metric taking negative
classifications into account, just as Pre is commonly reported
together with recall.
This comment refers to the article available at https://doi.org/10.1186/s13321-018-0262-2; https://doi.org/10.1186/s13321-016-0189-4.
Authors’ contributions
MS, MV and DS carried out the analyses. DS wrote the manuscript, MS and MV
edited the manuscript. All authors read and approved the final manuscript.
Author details
1 CZ‑OPENSCREEN:National Infrastructure for Chemical Biology, Department
of Informatics and Chemistry, Faculty of Chemical Technology, University
of Chemistry and Technology Prague, Prague, Czech Republic. 2 CZ‑OPEN‑
SCREEN: National Infrastructure for Chemica (...truncated)