The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification
Chicco and Jurman BioData Mining (2023) 16:4
https://doi.org/10.1186/s13040-023-00322-4
BioData Mining
Open Access
METHODOLOGY
Davide Chicco1* and Giuseppe Jurman2
*Correspondence:
1 Institute of Health Policy Management and Evaluation, University of Toronto, 155 College Street, M5T 3M7 Toronto, Ontario, Canada
2 Data Science for Health Unit, Fondazione Bruno Kessler, Via Sommarive 18, 38123 Povo, Trento, Italy
Abstract
Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic
curve (ROC AUC) has become the common standard metric to evaluate binary
classifications in most scientific fields. The ROC curve has true positive rate (also called
sensitivity or recall) on the y axis and false positive rate on the x axis, and the ROC AUC
can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several
flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity; moreover, it says nothing about the positive
predictive value (also known as precision) or the negative predictive value (NPV) obtained
by the classifier, and can therefore generate inflated, overoptimistic results. Since
it is common to include ROC AUC alone without precision and negative predictive
value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in the ROC space identifies neither a single confusion
matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubt on the reliability
of ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its [−1; +1] interval only if the classifier scored
a high value for all the four basic rates of the confusion matrix: sensitivity, specificity,
precision, and negative predictive value. Moreover, a high MCC (for example, MCC = 0.9) always corresponds to a high ROC AUC, but not vice versa. In this short study,
we explain why the Matthews correlation coefficient should replace the ROC AUC
as the standard statistic in all scientific studies involving a binary classification, in all
scientific fields.
Keywords: Matthews correlation coefficient, Receiver operating characteristic curve,
ROC, Area under the curve, AUC, ROC AUC, Confusion matrix, Binary classification,
Supervised machine learning, Data mining, Data science
© The Author(s) 2023, corrected publication 2023. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
The advantages of MCC over ROC AUC
Binary classification. Binary classification is a task in which each data element must be classified, or predicted, as belonging to one of two groups. Typically, the elements of one of the two groups are called negatives or zeros and the elements of the other group are called positives or ones. To evaluate a binary classification, researchers have introduced the concept of the confusion matrix, a 2 × 2 contingency table where the positive elements correctly classified as positives are called true positives (TP), the negative elements wrongly classified as positives are called false positives (FP), the negative elements correctly classified as negatives are called true negatives (TN), and the positive elements wrongly classified as negatives are called false negatives (FN). When the predictions are binary, the evaluation involves a single confusion matrix. Often, however, the predictions are real values in the [0; 1] interval. In such cases, a heuristic cut-off threshold τ = 0.5 is used to map the real values into zeros or ones: predictions below τ are considered zeros, and predictions equal to or above τ are considered ones.
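The thresholding rule and the four confusion matrix tallies described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `confusion_matrix_at_threshold` and the toy labels and scores are hypothetical and do not come from the study.

```python
from typing import Sequence, Tuple


def confusion_matrix_at_threshold(
    y_true: Sequence[int], y_score: Sequence[float], tau: float = 0.5
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) after binarizing real-valued scores at tau."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        # Scores equal to or above tau become ones, scores below tau become zeros.
        predicted = 1 if score >= tau else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical toy data: ground-truth labels and predicted scores in [0; 1].
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 0.8]

print(confusion_matrix_at_threshold(y_true, y_score))  # (3, 1, 3, 1)
```

Note that the score 0.5 is mapped to a one, because predictions equal to τ fall on the positive side of the cut-off.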
Caveat emptor: in this study, all the confusion matrix rates are computed with cut-off threshold τ = 0.5, except ROC AUC, which refers to all the
possible cut-off thresholds, as we explain later. This choice of threshold follows a well-consolidated
convention in the literature, and allows a fair comparison of the considerations presented hereafter with the outcomes of most published references. When we
write TPR = 0.724, for example, we refer to a sensitivity value calculated when the confusion matrix cut-off threshold is τ = 0.5. In the tables, we highlight this aspect by using
the notation TPR_{τ=0.5} rather than just TPR. In the body of this manuscript, however, we
use the simple term TPR for readability.
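The difference between a rate tied to one cut-off and a score that spans all cut-offs can be illustrated with a short sketch. It relies on a standard equivalence that is not stated in this text: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive element receives the higher score, with ties counted as one half. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def tpr_at_threshold(
    y_true: Sequence[int], y_score: Sequence[float], tau: float = 0.5
) -> float:
    """Sensitivity (TPR) computed at a single cut-off threshold tau."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    return sum(s >= tau for s in positives) / len(positives)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC via pairwise ranking; it depends on no single threshold."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: ground-truth labels and predicted scores in [0; 1].
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 0.8]

print(tpr_at_threshold(y_true, y_score, tau=0.5))   # 0.75
print(tpr_at_threshold(y_true, y_score, tau=0.25))  # 1.0
print(roc_auc(y_true, y_score))                     # 0.8125
```

In this toy case, moving τ changes the TPR, while the ROC AUC stays the same because it summarizes the ranking of the scores rather than any one confusion matrix.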
Additionally, although some findings presented in this study also hold for
multi-class classification, we focus on binary classification for reasons of space. Analysis of multi-class classification rates [1–3] can be an interesting development for
a future study.
Confusion matrix rates. The four categories of the confusion matrix, by themselves,
do not say much about the quality of the classification. To summarize the outcome of
the confusion matrix, researchers have introduced statistics that indicate ratios of the four
confusion matrix tallies, such as accuracy and F1 score.
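As a minimal sketch of how such statistics are built from the four tallies, the snippet below computes accuracy and F1 score. The formulas are the standard definitions, which this section does not spell out, and the function names and counts are hypothetical.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all elements that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; true negatives are not used."""
    denominator = 2 * tp + fp + fn
    # With no true positives, false positives or false negatives, F1 is undefined.
    return 2 * tp / denominator if denominator else float("nan")


# Hypothetical confusion matrix tallies.
tp, fp, tn, fn = 6, 2, 10, 2

print(accuracy(tp, fp, tn, fn))  # 0.8
print(f1_score(tp, fp, fn))      # 0.75
```

The two statistics weigh the tallies differently: accuracy uses all four, whereas F1 score ignores the true negatives.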
In a previous study [4], we defined basic rates for confusion matrices as the following
four rates: sensitivity (Eq. 1), specificity (Eq. 2), precision (Eq. 3), and negative predictive
value (Eq. 4).
$$\text{true positive rate, recall, sensitivity, TPR} = \frac{TP}{TP + FN} \tag{1}$$
(worst and minimum value 0; best and maximum value 1)

$$\text{true negative rate, specificity, TNR} = \frac{TN}{TN + FP} \tag{2}$$
(worst and minimum value 0; best and maximum value 1)

$$\text{positive predictive value, precision} = \frac{TP}{TP + FP} \tag{3}$$
(worst and minimum value 0; best and maximum value 1)
(...truncated)