Classifier uncertainty: evidence, potential impact, and probabilistic treatment
Niklas Tötsch and Daniel Hoffmann
Faculty of Biology, University of Duisburg-Essen, Essen, Germany
ABSTRACT
Classifiers are often tested on relatively small data sets, which should lead to
uncertain performance metrics. Nevertheless, these metrics are usually taken at
face value. We present an approach to quantify the uncertainty of classification
performance metrics, based on a probability model of the confusion matrix.
Application of our approach to classifiers from the scientific literature and a
classification competition shows that uncertainties can be surprisingly large and
limit performance evaluation. In fact, some published classifiers may be
misleading. The application of our approach is simple and requires only the
confusion matrix. It is agnostic of the underlying classifier. Our method can also
be used for the estimation of sample sizes that achieve a desired precision of a
performance metric.
Subjects Computational Biology, Data Mining and Machine Learning, Scientific Computing and
Simulation
Keywords Classification, Machine learning, Uncertainty, Bayesian modeling, Reproducibility,
Statistics
Submitted 2 September 2020
Accepted 27 January 2021
Published 4 March 2021
Corresponding author: Niklas Tötsch
Academic editor: Sebastian Ventura
INTRODUCTION
Classifiers are ubiquitous in science and every aspect of life. They can be based on
experiments, simulations, mathematical models or even expert judgement. The recent
rise of machine learning has further increased their importance. But machine learning
practitioners are by no means the only ones who should be concerned with the quality of
classifiers. Classifiers are often used to make decisions with far-reaching consequences.
In medicine, a therapy might be chosen based on a prediction of treatment outcome.
In court, a defendant might be considered guilty or not based on forensic tests. Therefore,
it is crucial to assess how well classifiers work.
In a binary classification task, results are presented in a 2 × 2 confusion matrix (CM),
comprising the numbers of true positive (TP), false negative (FN), true negative (TN) and
false positive (FP) predictions.
$$\mathrm{CM} = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} \tag{1}$$
DOI 10.7717/peerj-cs.398. Copyright 2021 Tötsch and Hoffmann. Distributed under Creative Commons CC-BY 4.0.
The confusion matrix contains all necessary information to determine metrics which
are used to evaluate the performance of a classifier. Popular examples are accuracy (ACC),
true positive rate (TPR), and true negative rate (TNR).
How to cite this article Tötsch N, Hoffmann D. 2021. Classifier uncertainty: evidence, potential impact, and probabilistic treatment. PeerJ
Comput. Sci. 7:e398 DOI 10.7717/peerj-cs.398
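$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \tag{2}$$
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3}$$
$$\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{4}$$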
These are given as precise numbers, irrespective of the sample sizes (Ns) used for their
calculation in performance tests. This is problematic especially in fields such as biology
or medicine, where data collection is often expensive, tedious, or limited by ethical
concerns, which often leads to small Ns. In this study we demonstrate that in those cases
the uncertainty of the CM entries cannot be neglected, which in turn makes all
performance metrics derived from the CM uncertain, too. In light of the ongoing
replication crisis (Baker, 2016), it is plausible that neglect of metric uncertainty
impedes reproducible classification experiments.
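The point made above can be illustrated with a minimal sketch of Eqs. (2)-(4). The class `ConfusionMatrix`, the function `point_metrics` and all counts are hypothetical names and values chosen for illustration, not part of the original study. The two confusion matrices have the same proportions but differ 100-fold in N, and the point estimates cannot tell them apart.

```python
from typing import NamedTuple


class ConfusionMatrix(NamedTuple):
    """Counts of a binary confusion matrix, laid out as in Eq. (1)."""
    tp: int
    fn: int
    fp: int
    tn: int


def point_metrics(cm: ConfusionMatrix) -> dict:
    """Return the point estimates ACC, TPR and TNR of Eqs. (2)-(4)."""
    n = cm.tp + cm.fn + cm.fp + cm.tn
    return {
        "ACC": (cm.tp + cm.tn) / n,
        "TPR": cm.tp / (cm.tp + cm.fn),
        "TNR": cm.tn / (cm.tn + cm.fp),
    }


# Hypothetical counts: same proportions, N = 20 versus N = 2000.
small = ConfusionMatrix(tp=8, fn=2, fp=3, tn=7)
large = ConfusionMatrix(tp=800, fn=200, fp=300, tn=700)

print(point_metrics(small))
print(point_metrics(large))
```

Both calls return the same three numbers, although the second test set is far more informative than the first.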
There is a lack of awareness of this problem, especially outside the machine learning
community. In the literature, one often encounters discussions of classifier performance
that lack any statistical analysis of validity. If there is a statistical analysis, it usually
relies on frequentist methods such as confidence intervals for the metrics or null
hypothesis significance testing (NHST) to determine if a classifier is truly better than
random guessing. NHST “must be viewed as approximate, heuristic tests, rather than as
rigorously correct statistical methods” (Dietterich, 1998).
Bayesian methods can be valuable alternatives (Benavoli et al., 2017). To properly
account for the uncertainty, we have to replace the point estimates in the CM and all
dependent performance metrics by probability distributions. Correct and incorrect
classifications are outcomes of a Binomial experiment (Brodersen et al., 2010a). Therefore,
Brodersen et al. model ACC with a beta-binomial distribution (BBD):
$$\mathrm{ACC} \sim \mathrm{Beta}(\mathrm{TP} + \mathrm{TN} + 1,\ \mathrm{FP} + \mathrm{FN} + 1) \tag{5}$$
Some of the more complex metrics, such as balanced accuracy, can be described by
combining two BBDs (Brodersen et al., 2010a).
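As a rough illustration of Eq. (5), the following sketch draws ACC values from the Beta distribution and summarises them by an empirical central 95% interval. It uses only Python's standard `random` module. The function names, the counts, the number of draws and the choice of a central interval are assumptions made for this example, not details taken from Brodersen et al.

```python
import random


def acc_posterior_samples(tp, fn, fp, tn, n_draws=20000, seed=0):
    """Draw ACC values from Beta(TP + TN + 1, FP + FN + 1), as in Eq. (5)."""
    rng = random.Random(seed)
    alpha = tp + tn + 1  # correct classifications + 1
    beta = fp + fn + 1   # incorrect classifications + 1
    return [rng.betavariate(alpha, beta) for _ in range(n_draws)]


def central_interval(samples, mass=0.95):
    """Empirical central interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lower, upper


# Hypothetical counts with identical point accuracy (0.75) but different N.
small_interval = central_interval(acc_posterior_samples(tp=8, fn=2, fp=3, tn=7))
large_interval = central_interval(acc_posterior_samples(tp=800, fn=200, fp=300, tn=700))

print("N = 20:   95%% interval for ACC = (%.3f, %.3f)" % small_interval)
print("N = 2000: 95%% interval for ACC = (%.3f, %.3f)" % large_interval)
```

The interval for the small test set should come out much wider than the one for the large test set, even though both share the same point estimate.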
Caelen presented a Bayesian interpretation of the CM (Caelen, 2017). This elegant
approach, based on a single Dirichlet-multinomial distribution, makes it possible to replace
the count data of the confusion matrix with distributions that account for the uncertainty:
$$\mathrm{CM} \sim \mathrm{Mult}(\theta, N) \tag{6}$$
$$\theta \sim \mathrm{Dirichlet}((1, 1, 1, 1)) \tag{7}$$
where θ = [θ_TP, θ_FN, θ_TN, θ_FP] is the confusion probability matrix which represents the
probabilities to draw each entry of the CM. The major advantage of Caelen’s approach over
the one presented by Brodersen lies in a complete description of the CM. From there, all
metrics can be computed directly, even those that cannot simply be described as BBD.
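The Dirichlet-multinomial structure of Eqs. (6) and (7) can be sketched as follows. This example assumes the standard conjugate update, under which a Dirichlet((1, 1, 1, 1)) prior combined with multinomial counts gives θ ~ Dirichlet(TP + 1, FN + 1, TN + 1, FP + 1). It is not meant to reproduce Caelen's sampling procedure or the model developed in this study. All function names and counts are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """One Dirichlet draw, obtained by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def theta_posterior_samples(tp, fn, tn, fp, n_draws=10000, seed=0):
    """Draw theta = [theta_TP, theta_FN, theta_TN, theta_FP] from
    Dirichlet(TP + 1, FN + 1, TN + 1, FP + 1) (assumed conjugate posterior)."""
    rng = random.Random(seed)
    alphas = [tp + 1, fn + 1, tn + 1, fp + 1]
    return [sample_dirichlet(alphas, rng) for _ in range(n_draws)]


# Hypothetical counts; any metric can be computed from each sampled theta.
draws = theta_posterior_samples(tp=8, fn=2, tn=7, fp=3)
tpr_draws = [t[0] / (t[0] + t[1]) for t in draws]
mean_tpr = sum(tpr_draws) / len(tpr_draws)
print(f"mean of sampled TPR: {mean_tpr:.3f}")
```

Because every draw is a complete θ, metrics other than TPR can be obtained from the same samples by changing only the expression in the list comprehension.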
Caelen calculates metric distributions from confusion matrices that are sampled
according to Eq. (6). Here, we demonstrate that this approach is flawed and derive a
correct model. Whereas previous studies focused on the statistical methods, we show
that classifier performance in many peer-reviewed publications is highly uncertain.
We studied a variety of classifiers from the chemical, biological and medicinal literature
and found cases where it is not clear if the classifier is better than random guessing.
Additionally, we investigate metric uncertainty in a Kaggle machine learning competition
where sample size is relatively large but a precise estimate of the metrics is required.
To help non-statisticians deal with these problems in the future, we derive a rule
for sample size determination and offer a free, simple-to-use web tool to determine metric
uncertainty.
METHODS
Model
The confusion probability matrix (θ), that is, the probabilities of generating the entries of a
confusion matrix, can be derived if prevalence (ϕ), TPR and TNR are known (Kruschke,
2015a).
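$$\theta_{\mathrm{TP}} = \mathrm{TPR} \cdot \phi \tag{8}$$
$$\theta_{\mathrm{FN}} = (1 - \mathrm{TPR}) \cdot \phi \tag{9}$$
$$\theta_{\mathrm{TN}} = \mathrm{TNR} \cdot (1 - \phi) \tag{10}$$
$$\theta_{\mathrm{FP}} = (1 - \mathrm{TNR}) \cdot (1 - \phi) \tag{11}$$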
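Equations (8)-(11) translate directly into code. The sketch below is a minimal illustration with a hypothetical function name and hypothetical input values. It takes fixed values of ϕ, TPR and TNR and returns the four entries of θ, which sum to one by construction.

```python
def confusion_probabilities(prevalence, tpr, tnr):
    """Map prevalence (phi), TPR and TNR to theta, following Eqs. (8)-(11)."""
    for name, value in (("prevalence", prevalence), ("tpr", tpr), ("tnr", tnr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return {
        "TP": tpr * prevalence,                  # Eq. (8)
        "FN": (1.0 - tpr) * prevalence,          # Eq. (9)
        "TN": tnr * (1.0 - prevalence),          # Eq. (10)
        "FP": (1.0 - tnr) * (1.0 - prevalence),  # Eq. (11)
    }


# Hypothetical values: balanced classes, TPR = 0.8, TNR = 0.7.
theta = confusion_probabilities(prevalence=0.5, tpr=0.8, tnr=0.7)
print(theta)
print("sum of entries:", sum(theta.values()))
```

The mapping is invertible: TPR is recovered as θ_TP / (θ_TP + θ_FN), TNR as θ_TN / (θ_TN + θ_FP), and ϕ as θ_TP + θ_FN.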
The idea that these metrics can also be inferred from data, propagating the uncertainty,
is the starting point of the present study. Using three BBDs, one for each of ϕ, TPR
and TNR, we can express all entries of the CM (Fig. 1). Since ϕ, TPR and TNR are
distributions, the entries of θ = [θ_TP, θ_FN, θ_TN, θ_FP] are too (...truncated)