Classifier uncertainty: evidence, potential impact, and probabilistic treatment

PeerJ Computer Science, Mar 2021

Classifiers are often tested on relatively small data sets, which should lead to uncertain performance metrics. Nevertheless, these metrics are usually taken at face value. We present an approach to quantify the uncertainty of classification performance metrics, based on a probability model of the confusion matrix. Application of our approach to classifiers from the scientific literature and a classification competition shows that uncertainties can be surprisingly large and limit performance evaluation. In fact, some published classifiers may be misleading. The application of our approach is simple and requires only the confusion matrix. It is agnostic of the underlying classifier. Our method can also be used for the estimation of sample sizes that achieve a desired precision of a performance metric.



Niklas Tötsch and Daniel Hoffmann
Faculty of Biology, University of Duisburg-Essen, Essen, Germany

Subjects: Computational Biology, Data Mining and Machine Learning, Scientific Computing and Simulation
Keywords: Classification, Machine learning, Uncertainty, Bayesian modeling, Reproducibility, Statistics

Submitted 2 September 2020. Accepted 27 January 2021. Published 4 March 2021.
Corresponding author: Niklas Tötsch
Academic editor: Sebastian Ventura

INTRODUCTION

Classifiers are ubiquitous in science and in every aspect of life. They can be based on experiments, simulations, mathematical models or even expert judgement. The recent rise of machine learning has further increased their importance. But machine learning practitioners are far from the only ones who should be concerned with the quality of classifiers. Classifiers are often used to make decisions with far-reaching consequences. In medicine, a therapy might be chosen based on a prediction of treatment outcome.
In court, a defendant might be considered guilty or not based on forensic tests. Therefore, it is crucial to assess how well classifiers work. In a binary classification task, results are presented in a 2 × 2 confusion matrix (CM), comprising the numbers of true positive (TP), false negative (FN), true negative (TN) and false positive (FP) predictions.

$$\mathrm{CM} = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} \tag{1}$$

DOI 10.7717/peerj-cs.398. Copyright 2021 Tötsch and Hoffmann. Distributed under Creative Commons CC-BY 4.0.

The confusion matrix contains all the information needed to determine the metrics used to evaluate the performance of a classifier. Popular examples are accuracy (ACC), true positive rate (TPR), and true negative rate (TNR).

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \tag{2}$$

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3}$$

$$\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{4}$$

These are given as precise numbers, irrespective of the sample sizes (Ns) used for their calculation in performance tests. This is problematic especially in fields such as biology or medicine, where data collection is often expensive, tedious, or limited by ethical concerns, often leading to small Ns. In this study we demonstrate that in those cases the uncertainty of the CM entries cannot be neglected, which in turn makes all performance metrics derived from the CM uncertain, too. In the light of the ongoing replication crisis (Baker, 2016), it is plausible that neglecting metric uncertainty impedes reproducible classification experiments. There is a lack of awareness of this problem, especially outside the machine learning community. In the literature, one often encounters discussions of classifier performance that lack any statistical analysis of validity.
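To make the point about sample size concrete, the following minimal sketch evaluates Eqs. (2)-(4) for two confusion matrices. The `ConfusionMatrix` type, the function names, and the counts are hypothetical illustrations and are not taken from the article. The two matrices differ in N by a factor of 100, yet the point estimates are identical, so the metrics alone carry no information about how much data supports them.

```python
from typing import NamedTuple


class ConfusionMatrix(NamedTuple):
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int
    fn: int
    fp: int
    tn: int


def accuracy(cm: ConfusionMatrix) -> float:
    # Eq. (2): fraction of all predictions that are correct
    return (cm.tp + cm.tn) / (cm.tp + cm.fn + cm.fp + cm.tn)


def true_positive_rate(cm: ConfusionMatrix) -> float:
    # Eq. (3): fraction of actual positives predicted as positive
    return cm.tp / (cm.tp + cm.fn)


def true_negative_rate(cm: ConfusionMatrix) -> float:
    # Eq. (4): fraction of actual negatives predicted as negative
    return cm.tn / (cm.tn + cm.fp)


# Illustrative counts only: same proportions, very different sample sizes.
small = ConfusionMatrix(tp=8, fn=2, fp=3, tn=7)          # N = 20
large = ConfusionMatrix(tp=800, fn=200, fp=300, tn=700)  # N = 2000

for cm in (small, large):
    print(
        f"N={sum(cm):5d}  ACC={accuracy(cm):.2f}  "
        f"TPR={true_positive_rate(cm):.2f}  TNR={true_negative_rate(cm):.2f}"
    )
```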
If there is a statistical analysis, it usually relies on frequentist methods such as confidence intervals for the metrics or null hypothesis significance testing (NHST) to determine whether a classifier is truly better than random guessing. NHST “must be viewed as approximate, heuristic tests, rather than as rigorously correct statistical methods” (Dietterich, 1998). Bayesian methods can be valuable alternatives (Benavoli et al., 2017).

To properly account for the uncertainty, we have to replace the point estimates in the CM and all dependent performance metrics by probability distributions. Correct and incorrect classifications are outcomes of a binomial experiment (Brodersen et al., 2010a). Therefore, Brodersen et al. model ACC with a beta-binomial distribution (BBD):

$$\mathrm{ACC} \sim \mathrm{Beta}(\mathrm{TP} + \mathrm{TN} + 1,\ \mathrm{FP} + \mathrm{FN} + 1) \tag{5}$$

Some of the more complex metrics, such as balanced accuracy, can be described by combining two BBDs (Brodersen et al., 2010a). Caelen presented a Bayesian interpretation of the CM (Caelen, 2017). This elegant approach, based on a single Dirichlet-multinomial distribution, allows the count data of the confusion matrix to be replaced with distributions that account for the uncertainty:

$$\mathrm{CM} \sim \mathrm{Mult}(\theta, N) \tag{6}$$

$$\theta \sim \mathrm{Dirichlet}((1, 1, 1, 1)) \tag{7}$$

where $\theta = [\theta_{\mathrm{TP}}, \theta_{\mathrm{FN}}, \theta_{\mathrm{TN}}, \theta_{\mathrm{FP}}]$ is the confusion probability matrix, which represents the probabilities of drawing each entry of the CM. The major advantage of Caelen’s approach over the one presented by Brodersen lies in its complete description of the CM. From there, all metrics can be computed directly, even those that cannot simply be described as a BBD.

Caelen calculates metric distributions from confusion matrices that are sampled according to Eq. (6). Here, we demonstrate that this approach is flawed and derive a correct model.
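As an illustration of Eq. (5) only, and not of the corrected model announced above, the sketch below draws samples from the Beta distribution for ACC and reports a central 95% interval. The function names, the counts, the sample count, and the seed are assumptions made for this example. It uses the standard-library `random.betavariate` and an empirical quantile rather than an exact inverse CDF.

```python
import random


def acc_posterior_samples(tp, fn, fp, tn, n_samples=20000, seed=0):
    """Draw samples of ACC ~ Beta(TP + TN + 1, FP + FN + 1), as in Eq. (5)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    a = tp + tn + 1            # correct predictions + 1
    b = fp + fn + 1            # incorrect predictions + 1
    return [rng.betavariate(a, b) for _ in range(n_samples)]


def central_interval(samples, mass=0.95):
    """Empirical central interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    lo_idx = int((1 - mass) / 2 * len(ordered))
    hi_idx = int((1 + mass) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Illustrative counts only: both matrices give a point estimate ACC = 0.75.
small_lo, small_hi = central_interval(acc_posterior_samples(8, 2, 3, 7))
large_lo, large_hi = central_interval(acc_posterior_samples(800, 200, 300, 700))

print(f"N=20:   95% interval width = {small_hi - small_lo:.3f}")
print(f"N=2000: 95% interval width = {large_hi - large_lo:.3f}")
```

With these hypothetical counts the interval for N = 20 is several times wider than the one for N = 2000, although both classifiers report the same accuracy.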
Whereas previous studies focused on the statistical methods, we prove that classifier performance in many peer-reviewed publications is highly uncertain. We studied a variety of classifiers from the chemical, biological and medicinal literature and found cases where it is not clear whether the classifier is better than random guessing. Additionally, we investigate metric uncertainty in a Kaggle machine learning competition, where the sample size is relatively large but a precise estimate of the metrics is required. To help non-statisticians deal with these problems in the future, we derive a rule for sample size determination and offer a free, simple-to-use web tool to determine metric uncertainty.

METHODS

Model

The confusion probability matrix (θ), that is, the probabilities of generating the entries of a confusion matrix, can be derived if prevalence (ϕ), TPR and TNR are known (Kruschke, 2015a).

$$\theta_{\mathrm{TP}} = \mathrm{TPR} \cdot \phi \tag{8}$$

$$\theta_{\mathrm{FN}} = (1 - \mathrm{TPR}) \cdot \phi \tag{9}$$

$$\theta_{\mathrm{TN}} = \mathrm{TNR} \cdot (1 - \phi) \tag{10}$$

$$\theta_{\mathrm{FP}} = (1 - \mathrm{TNR}) \cdot (1 - \phi) \tag{11}$$

The idea that these metrics can also be inferred from data, propagating the uncertainty, is the starting point of the present study. Using three BBDs, one for each of ϕ, TPR and TNR, we can express all entries of the CM (Fig. 1). Since ϕ, TPR and TNR are distributions, the entries of the confusion probability matrix $[\theta_{\mathrm{TP}}, \theta_{\mathrm{FN}}, \theta_{\mathrm{TN}}, \theta_{\mathrm{FP}}]$ are too (...truncated)
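Eqs. (8)-(11) can be sketched in a few lines. Because the text is truncated before the priors are specified, the sketch assumes a uniform Beta(1, 1) prior on each of ϕ, TPR and TNR. That choice, the function name `sample_theta`, and the counts are assumptions of this example, not the authors' stated implementation. Each draw of (ϕ, TPR, TNR) is mapped to one draw of θ, and the four entries sum to 1 by construction.

```python
import random


def sample_theta(tp, fn, fp, tn, rng):
    """One draw of the confusion probability matrix via Eqs. (8)-(11).

    Assumption: uniform Beta(1, 1) priors, so each quantity follows a Beta
    distribution whose parameters are the matching counts plus one.
    """
    phi = rng.betavariate(tp + fn + 1, tn + fp + 1)  # prevalence
    tpr = rng.betavariate(tp + 1, fn + 1)            # true positive rate
    tnr = rng.betavariate(tn + 1, fp + 1)            # true negative rate
    return {
        "TP": tpr * phi,              # Eq. (8)
        "FN": (1 - tpr) * phi,        # Eq. (9)
        "TN": tnr * (1 - phi),        # Eq. (10)
        "FP": (1 - tnr) * (1 - phi),  # Eq. (11)
    }


rng = random.Random(0)  # fixed seed for reproducibility
# Illustrative counts only (TP=8, FN=2, FP=3, TN=7).
draws = [sample_theta(8, 2, 3, 7, rng) for _ in range(5000)]

# Any metric can be evaluated per draw; here accuracy = theta_TP + theta_TN.
acc_draws = sorted(d["TP"] + d["TN"] for d in draws)
print(f"ACC central 95% interval: {acc_draws[125]:.2f} to {acc_draws[4874]:.2f}")
```

Evaluating a metric on every draw of θ gives a distribution for that metric rather than a single number, which is the kind of uncertainty propagation the paragraph above describes.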


This is a preview of a remote PDF: https://peerj.com/articles/cs-398.pdf
Article home page: https://doaj.org/article/ce0317b75cc6455789ec6681e6fb630b

Niklas Tötsch, Daniel Hoffmann. Classifier uncertainty: evidence, potential impact, and probabilistic treatment, PeerJ Computer Science, 2021, pp. e398, Issue 7, DOI: 10.7717/peerj-cs.398