The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification

Chicco and Jurman BioData Mining (2023) 16:4
https://doi.org/10.1186/s13040-023-00322-4

METHODOLOGY, Open Access

Davide Chicco1* and Giuseppe Jurman2

*Correspondence:
1 Institute of Health Policy Management and Evaluation, University of Toronto, 155 College Street, M5T 3M7 Toronto, Ontario, Canada
2 Data Science for Health Unit, Fondazione Bruno Kessler, Via Sommarive 18, 38123 Povo, Trento, Italy

Abstract

Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the common standard metric to evaluate binary classifications in most scientific fields. The ROC curve has true positive rate (also called sensitivity or recall) on the y axis and false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it does not say anything about positive predictive value (also known as precision) nor negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated overoptimistic results. Since it is common to include ROC AUC alone without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in the ROC space does not identify a single confusion matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubts on the reliability of ROC AUC as a performance measure.
In contrast, the Matthews correlation coefficient (MCC) generates a high score in its [−1; +1] interval only if the classifier scored a high value for all the four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC = 0.9), moreover, always corresponds to a high ROC AUC, but not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as standard statistic in all the scientific studies involving a binary classification, in all scientific fields.

Keywords: Matthews correlation coefficient, Receiver operating characteristic curve, ROC, Area under the curve, AUC, ROC AUC, Confusion matrix, Binary classification, Supervised machine learning, Data mining, Data science

© The Author(s) 2023, corrected publication 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

The advantages of MCC over ROC AUC

Binary classification. A binary classification is a task where data of two groups need to be classified or predicted to be part of one of those two groups. Typically, the elements of one of the two groups are called negatives or zeros and the elements of the other group are called positives or ones. To evaluate the binary classification, researchers have introduced the concept of the confusion matrix, a 2 × 2 contingency table where the positive elements correctly classified as positives are called true positives (TP), the negative elements wrongly classified as positive are called false positives (FP), the negative elements correctly classified as negatives are called true negatives (TN), and the positive elements wrongly classified as negatives are called false negatives (FN).

When the predictions are binary, the evaluation involves a single confusion matrix. Many times, however, the predictions are real values in the [0; 1] interval. In such cases, a heuristic cut-off threshold τ = 0.5 is used to map the real values into zeros or ones: predictions below τ are considered zeros, and predictions equal to or above τ are considered ones.

Caveat emptor: in this study, all the confusion matrix rates we refer to are computed with cut-off threshold τ = 0.5, except ROC AUC, which takes into account all the possible cut-off thresholds, as we explain later. This choice of the threshold follows a well-consolidated convention in the literature, and allows a fair comparison of the considerations presented hereafter with the outcome of most of the published references. When we write TPR = 0.724, for example, we refer to a sensitivity value calculated when the confusion matrix cut-off threshold is τ = 0.5. In the tables, we highlight this aspect by using the notation TPR_{τ=0.5} rather than just TPR.
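The thresholding convention described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the article: the function name `confusion_matrix_at_threshold` and the toy labels and scores are hypothetical, and the labels are assumed to be encoded as 0 (negative) and 1 (positive).

```python
from typing import Sequence, Tuple


def confusion_matrix_at_threshold(
    y_true: Sequence[int],
    y_score: Sequence[float],
    tau: float = 0.5,
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) after mapping real-valued scores to 0/1.

    Hypothetical helper: scores below tau become zeros, scores equal to
    or above tau become ones, as in the convention described in the text.
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        if truth not in (0, 1):
            raise ValueError("labels are assumed to be 0 or 1")
        predicted = 1 if score >= tau else 0
        if predicted == 1 and truth == 1:
            tp += 1  # positive correctly classified as positive
        elif predicted == 1 and truth == 0:
            fp += 1  # negative wrongly classified as positive
        elif predicted == 0 and truth == 0:
            tn += 1  # negative correctly classified as negative
        else:
            fn += 1  # positive wrongly classified as negative
    return tp, fp, tn, fn


# Toy data invented for this example (not taken from the article).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.5, 0.2, 0.1, 0.7, 0.4, 0.3, 0.8]

tp, fp, tn, fn = confusion_matrix_at_threshold(y_true, y_score)
print(tp, fp, tn, fn)  # 3 1 3 1
```

Note that the score exactly equal to 0.5 is counted as a positive prediction, and that changing `tau` would generally produce a different confusion matrix from the same scores.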
In the body of this manuscript, however, we decided to use the simple term TPR to make this study more readable. Additionally, although some results presented in this study are also valid for multi-class classification, we concentrated this study on binary classifications for space reasons. Analysis of multi-class classification rates [1–3] can be an interesting development for a future study.

Confusion matrix rates. The four categories of the confusion matrix, by themselves, do not say much about the quality of the classification. To summarize the outcome of the confusion matrix, researchers have introduced statistics that indicate ratios of the four confusion matrix tallies, such as accuracy and F1 score. In a previous study [4], we defined basic rates for confusion matrices as the following four rates: sensitivity (Eq. 1), specificity (Eq. 2), precision (Eq. 3), and negative predictive value (Eq. 4).

\[ \text{true positive rate, recall, sensitivity, TPR} = \frac{TP}{TP + FN} \tag{1} \]

(worst and minimum value 0; best and maximum value 1)

\[ \text{true negative rate, specificity, TNR} = \frac{TN}{TN + FP} \tag{2} \]

(worst and minimum value 0; best and maximum value 1)

\[ \text{precision} = \frac{TP}{TP + FP} \tag{3} \]

(worst and minimum value 0; best and maximum value 1)

positive predictive value, prec (...truncated)