Classifier uncertainty: evidence, potential impact, and probabilistic treatment


Niklas Tötsch and Daniel Hoffmann
Faculty of Biology, University of Duisburg-Essen, Essen, Germany

Submitted 2 September 2020
Accepted 27 January 2021
Published 4 March 2021
Corresponding author: Niklas Tötsch
Academic editor: Sebastian Ventura
Additional Information and Declarations can be found on page 14

ABSTRACT
Classifiers are often tested on relatively small data sets, which should lead to uncertain performance metrics. Nevertheless, these metrics are usually taken at face value. We present an approach to quantify the uncertainty of classification performance metrics, based on a probability model of the confusion matrix. Application of our approach to classifiers from the scientific literature and a classification competition shows that uncertainties can be surprisingly large and limit performance evaluation. In fact, some published classifiers may be misleading. The application of our approach is simple and requires only the confusion matrix. It is agnostic of the underlying classifier. Our method can also be used for the estimation of sample sizes that achieve a desired precision of a performance metric.

Subjects: Computational Biology, Data Mining and Machine Learning, Scientific Computing and Simulation
Keywords: Classification, Machine learning, Uncertainty, Bayesian modeling, Reproducibility, Statistics

INTRODUCTION
Classifiers are ubiquitous in science and every aspect of life. They can be based on experiments, simulations, mathematical models or even expert judgement. The recent rise of machine learning has further increased their importance. But machine learning practitioners are by no means the only ones who should be concerned with the quality of classifiers. Classifiers are often used to make decisions with far-reaching consequences. In medicine, a therapy might be chosen based on a prediction of treatment outcome.
In court, a defendant might be considered guilty or not based on forensic tests. Therefore, it is crucial to assess how well classifiers work.

In a binary classification task, results are presented in a 2 × 2 confusion matrix (CM), comprising the numbers of true positive (TP), false negative (FN), true negative (TN) and false positive (FP) predictions.

$$\mathrm{CM} = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} \tag{1}$$

The confusion matrix contains all necessary information to determine metrics which are used to evaluate the performance of a classifier. Popular examples are accuracy (ACC), true positive rate (TPR), and true negative rate (TNR).

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \tag{2}$$

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3}$$

$$\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{4}$$

These are given as precise numbers, irrespective of the sample sizes (Ns) used for their calculation in performance tests. This is problematic especially in fields such as biology or medicine, where data collection is often expensive, tedious, or limited by ethical concerns, often leading to small Ns. In this study we demonstrate that in those cases the uncertainty of the CM entries cannot be neglected, which in turn makes all performance metrics derived from the CM uncertain, too. In the light of the ongoing replication crisis (Baker, 2016), it is plausible that neglect of metric uncertainty impedes reproducible classification experiments.

DOI 10.7717/peerj-cs.398
Copyright 2021 Tötsch and Hoffmann
Distributed under Creative Commons CC-BY 4.0
How to cite this article: Tötsch N, Hoffmann D. 2021. Classifier uncertainty: evidence, potential impact, and probabilistic treatment. PeerJ Comput. Sci. 7:e398 DOI 10.7717/peerj-cs.398

There is a lack of awareness of this problem, especially outside the machine learning community. In the literature, one often encounters discussions of classifier performance that lack any statistical analysis of validity.
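The point made about Eqs. (2) to (4) can be illustrated with a minimal Python sketch. The two confusion matrices below are hypothetical counts chosen for illustration, not data from the study; the class and function names are likewise assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a binary confusion matrix, laid out as in Eq. (1)."""
    tp: int
    fn: int
    tn: int
    fp: int

    @property
    def n(self) -> int:
        # Total sample size N
        return self.tp + self.fn + self.tn + self.fp


def accuracy(cm: ConfusionMatrix) -> float:
    # Eq. (2)
    return (cm.tp + cm.tn) / cm.n


def true_positive_rate(cm: ConfusionMatrix) -> float:
    # Eq. (3)
    return cm.tp / (cm.tp + cm.fn)


def true_negative_rate(cm: ConfusionMatrix) -> float:
    # Eq. (4)
    return cm.tn / (cm.tn + cm.fp)


# Hypothetical counts: same proportions, sample sizes differing 100-fold
small = ConfusionMatrix(tp=8, fn=2, tn=7, fp=3)
large = ConfusionMatrix(tp=800, fn=200, tn=700, fp=300)

for name, cm in (("small", small), ("large", large)):
    print(name, cm.n, accuracy(cm), true_positive_rate(cm), true_negative_rate(cm))
```

Both matrices yield ACC = 0.75, TPR = 0.8 and TNR = 0.7, although one rests on 20 samples and the other on 2,000. The point estimates alone carry no trace of that difference.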
If there is a statistical analysis, it usually relies on frequentist methods such as confidence intervals for the metrics or null hypothesis significance testing (NHST) to determine if a classifier is truly better than random guessing. NHST “must be viewed as approximate, heuristic tests, rather than as rigorously correct statistical methods” (Dietterich, 1998). Bayesian methods can be valuable alternatives (Benavoli et al., 2017).

To properly account for the uncertainty, we have to replace the point estimates in the CM and all dependent performance metrics by probability distributions. Correct and incorrect classifications are outcomes of a binomial experiment (Brodersen et al., 2010a). Therefore, Brodersen et al. model ACC with a beta-binomial distribution (BBD)

$$\mathrm{ACC} \sim \mathrm{Beta}(\mathrm{TP} + \mathrm{TN} + 1,\ \mathrm{FP} + \mathrm{FN} + 1). \tag{5}$$

Some of the more complex metrics, such as balanced accuracy, can be described by combining two BBDs (Brodersen et al., 2010a).

Caelen presented a Bayesian interpretation of the CM (Caelen, 2017). This elegant approach, based on a single Dirichlet-multinomial distribution, allows the count data of the confusion matrix to be replaced with distributions that account for the uncertainty:

$$\mathrm{CM} \sim \mathrm{Mult}(\theta, N) \tag{6}$$

$$\theta \sim \mathrm{Dirichlet}((1, 1, 1, 1)) \tag{7}$$

where $\theta = [\theta_{\mathrm{TP}}, \theta_{\mathrm{FN}}, \theta_{\mathrm{TN}}, \theta_{\mathrm{FP}}]$ is the confusion probability matrix, which represents the probabilities of drawing each entry of the CM. The major advantage of Caelen’s approach over the one presented by Brodersen lies in a complete description of the CM. From there, all metrics can be computed directly, even those that cannot simply be described as a BBD.

Caelen calculates metric distributions from confusion matrices that are sampled according to Eq. (6). Here, we demonstrate that this approach is flawed and derive a correct model.
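To make Eq. (5) concrete, the following sketch draws samples from the Beta distribution of ACC and summarizes them with a 95% interval. It uses only the Python standard library. The counts are hypothetical, and the equal-tailed interval computed from sorted samples is an assumption of this sketch; the text above does not specify how intervals should be summarized.

```python
import random


def accuracy_posterior_samples(tp, fn, tn, fp, n_samples=20_000, seed=0):
    """Draw samples of ACC from Beta(TP + TN + 1, FP + FN + 1), as in Eq. (5)."""
    rng = random.Random(seed)
    alpha = tp + tn + 1  # correct classifications + 1
    beta = fp + fn + 1   # incorrect classifications + 1
    return [rng.betavariate(alpha, beta) for _ in range(n_samples)]


def credible_interval(samples, mass=0.95):
    """Equal-tailed interval from sorted samples (a simplifying assumption)."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical confusion matrices with the same point estimate ACC = 0.75
for label, counts in (("N = 20", (8, 2, 7, 3)), ("N = 2000", (800, 200, 700, 300))):
    lo, hi = credible_interval(accuracy_posterior_samples(*counts))
    print(f"{label}: 95% interval for ACC = [{lo:.3f}, {hi:.3f}]")
```

With identical point estimates, the interval for the smaller sample is much wider than the one for the larger sample, which is the uncertainty that a single number for ACC hides.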
Whereas previous studies focused on the statistical methods, we prove that classifier performance in many peer-reviewed publications is highly uncertain. We studied a variety of classifiers from the chemical, biological and medicinal literature and found cases where it is not clear if the classifier is better than random guessing. Additionally, we investigate metric uncertainty in a Kaggle machine learning competition where sample size is relatively large but a precise estimate of the metrics is required. To help non-statisticians deal with these problems in the future, we derive a rule for sample size determination and offer a free, easy-to-use web tool to determine metric uncertainty.

METHODS

Model
The confusion probability matrix (θ), that is, the probabilities of generating the entries of a confusion matrix, can be derived if prevalence (ϕ), TPR and TNR are known (Kruschke, 2015a).

$$\theta_{\mathrm{TP}} = \mathrm{TPR} \cdot \phi \tag{8}$$

$$\theta_{\mathrm{FN}} = (1 - \mathrm{TPR}) \cdot \phi \tag{9}$$

$$\theta_{\mathrm{TN}} = \mathrm{TNR} \cdot (1 - \phi) \tag{10}$$

$$\theta_{\mathrm{FP}} = (1 - \mathrm{TNR}) \cdot (1 - \phi) \tag{11}$$

The idea that these metrics can also be inferred from data, propagating the uncertainty, is the starting point of the present study. Using three BBDs, one for each of ϕ, TPR and TNR, we can express all entries of the CM (Fig. 1). Since ϕ, TPR and TNR are distributions, the entries of the confusion probability matrix $[\theta_{\mathrm{TP}}, \theta_{\mathrm{FN}}, \theta_{\mathrm{TN}}, \theta_{\mathrm{FP}}]$ are too (...truncated)
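Equations (8) to (11) and the propagation of uncertainty through them can be sketched as follows. The parameters of the three Beta distributions are an assumption of this sketch, chosen by analogy with Eq. (5) (observed counts plus one); the text available here is cut off before the model states them. The counts and function names are hypothetical.

```python
import random


def confusion_probabilities(prevalence, tpr, tnr):
    """Entries of the confusion probability matrix, Eqs. (8) to (11)."""
    return {
        "TP": tpr * prevalence,                  # Eq. (8)
        "FN": (1.0 - tpr) * prevalence,          # Eq. (9)
        "TN": tnr * (1.0 - prevalence),          # Eq. (10)
        "FP": (1.0 - tnr) * (1.0 - prevalence),  # Eq. (11)
    }


def sample_theta(tp, fn, tn, fp, rng):
    """One draw of theta from three Beta draws.

    ASSUMPTION: Beta parameters are "counts + 1", by analogy with Eq. (5).
    """
    prevalence = rng.betavariate(tp + fn + 1, tn + fp + 1)
    tpr = rng.betavariate(tp + 1, fn + 1)
    tnr = rng.betavariate(tn + 1, fp + 1)
    return confusion_probabilities(prevalence, tpr, tnr)


# Fixed inputs: the four probabilities sum to one
print(confusion_probabilities(prevalence=0.5, tpr=0.8, tnr=0.7))

# Uncertain inputs: each draw is a different, equally valid theta
rng = random.Random(0)
draws = [sample_theta(8, 2, 7, 3, rng) for _ in range(1000)]
print(min(d["TP"] for d in draws), max(d["TP"] for d in draws))
```

Each draw is a complete confusion probability matrix, so any metric that can be computed from θ inherits a distribution rather than a single value.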