Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations
PLOS ONE
RESEARCH ARTICLE
Alexander A. Huang1,2☯, Samuel Y. Huang1,3☯*
1 Department of Statistics and Data Science, Cornell University, Ithaca, New York, United States of America,
2 Department of MD Education, Northwestern University Feinberg School of Medicine, Chicago, Illinois,
United States of America, 3 Department of Internal Medicine, Virginia Commonwealth University School of
Medicine, Richmond, Virginia, United States of America
OPEN ACCESS
Citation: Huang AA, Huang SY (2023) Increasing
transparency in machine learning through
bootstrap simulation and shapely additive
explanations. PLoS ONE 18(2): e0281922. https://
doi.org/10.1371/journal.pone.0281922
Editor: Loredana Bellantuono, Università degli
Studi di Bari Aldo Moro, ITALY
Received: November 23, 2022
Accepted: February 5, 2023
Published: February 23, 2023
Peer Review History: PLOS recognizes the
benefits of transparency in the peer review
process; therefore, we enable the publication of
all of the content of peer review and author
responses alongside final, published articles. The
editorial history of this article is available here:
https://doi.org/10.1371/journal.pone.0281922
Copyright: © 2023 Huang, Huang. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
☯ These authors contributed equally to this work.
Abstract
Machine learning methods are widely used within the medical field. However, the reliability
and efficacy of these models are difficult to assess, which makes it hard for researchers to
identify which machine-learning model to apply to their dataset. We assessed whether
variance calculations of model metrics (e.g., AUROC, Sensitivity, Specificity) through
bootstrap simulation and SHapley Additive exPlanations (SHAP) could increase model
transparency and improve model selection. Data from the England National Health Services
Heart Disease Prediction Cohort was used. After comparison of model metrics for
XGBoost, Random Forest, Artificial Neural Network, and Adaptive Boosting, XGBoost
was used as the machine-learning model of choice in this study. Bootstrap simulation
(N = 10,000) was used to empirically derive the distribution of model metrics and covariate
Gain statistics. SHapley Additive exPlanations (SHAP) were used to provide explanations for
machine-learning output, and simulation was used to evaluate the variance of model accuracy
metrics. For the XGBoost modeling method, we observed (through 10,000 completed simulations)
that the AUROC ranged from 0.771 to 0.947, a difference of 0.176, the balanced
accuracy ranged from 0.688 to 0.894, a 0.205 difference, the sensitivity ranged from
0.632 to 0.939, a 0.307 difference, and the specificity ranged from 0.595 to 0.944, a 0.394
difference. Among 10,000 simulations completed, we observed that the gain for Angina
ranged from 0.225 to 0.456, a difference of 0.231, for Cholesterol ranged from 0.148 to
0.326, a difference of 0.178, for maximum heart rate (MaxHR) ranged from 0.081 to
0.200, a difference of 0.119, and for Age ranged from 0.059 to 0.157, a difference of 0.098.
Using simulations to empirically evaluate the variability of model metrics, and explanatory
algorithms to check whether covariates match the literature, is necessary for increased
transparency, reliability, and utility of machine learning methods. These variance statistics,
combined with model accuracy statistics, can help researchers identify the best model for
a given dataset.
Data Availability Statement: All relevant data are
within the manuscript and its Supporting
information files.
Funding: The authors received no specific funding
for this work.
Competing interests: The authors have declared
that no competing interests exist.
Introduction
Machine learning (ML) algorithms generate predictions from sample data without explicit
directions from the user [1–4]. Common ML algorithms (e.g., XGBoost, Random Forest, Neural Networks) have been found to be more accurate than traditional parametric methods (linear regression, logistic regression) [5–8]. It has been hypothesized that this increase in
accuracy can be attributed to potential non-linear relationships between the independent and
dependent variables and interactions between multiple covariates [9, 10]. However, the
increase in accuracy of ML algorithms compared to traditional parametric methods comes at a
significant cost: interpretability [11–15]. Linear regression and logistic regression have clear,
interpretable output that has been widely studied [16–18]. Machine-learning algorithms are often
non-interpretable, leading to their reputation as “black box” algorithms [10, 19–21]. As a result,
the interpretability, reliability, and efficacy of machine-learning models are often difficult to
assess [14, 20, 22–24].
Without methods that explain how machine learning algorithms reach their predictions,
clinicians cannot determine whether models are reliable and generalizable or are merely
replicating the biases within the training datasets [11, 13, 25]. Explaining how
model predictions are reached and reporting accurate summary statistics for model accuracy
metrics (e.g., AUROC, Sensitivity, Specificity, F1, Balanced Accuracy) can increase the
transparency of machine learning methods and increase confidence when using their
predictions [8, 9, 26, 27]. Potential solutions to these weaknesses in machine learning that have
been applied within the field of computer science are SHapley Additive exPlanations (SHAP)
for model interpretability and bootstrap simulation for quantifying the statistical
distribution of model accuracy metrics [28–30]. However, little is known about the efficacy of SHAP
and bootstrap simulation in evaluating machine-learning methods for medical outcomes such as heart
disease. Given these limitations in the literature, with data from the England National Health
Services Heart Disease Prediction Cohort, we leveraged SHAP to provide explanations for
machine-learning output and bootstrap simulation to evaluate the variance of model accuracy metrics.
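To make the bootstrap idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a set of held-out labels and predicted probabilities is already available, resamples them with replacement, and reports the observed range of AUROC, sensitivity, specificity, and balanced accuracy. The function names, the 0.5 decision threshold, and the toy data are hypothetical and chosen for illustration only. The study also tracks covariate Gain, which would presumably require refitting the model on each resample; that step is omitted here.

```python
import random


def auroc(y_true, y_score):
    """AUROC from the rank-sum (Mann-Whitney U) identity, using average ranks for ties."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity and specificity at an assumed (hypothetical) decision threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_metric_ranges(y_true, y_score, n_boot=1000, seed=0):
    """Resample cases with replacement and return (min, max) of each metric."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(y_true)
    draws = {"auroc": [], "sensitivity": [], "specificity": [], "balanced_accuracy": []}
    while len(draws["auroc"]) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        ys = [y_score[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # metrics are undefined when a resample contains one class
        sens, spec = sensitivity_specificity(yt, ys)
        draws["auroc"].append(auroc(yt, ys))
        draws["sensitivity"].append(sens)
        draws["specificity"].append(spec)
        draws["balanced_accuracy"].append((sens + spec) / 2)
    return {name: (min(vals), max(vals)) for name, vals in draws.items()}


# Toy data for illustration only; not taken from the study cohort.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
toy_scores = [0.1, 0.3, 0.45, 0.6, 0.4, 0.7, 0.8, 0.9, 0.2, 0.55]
toy_ranges = bootstrap_metric_ranges(toy_labels, toy_scores, n_boot=500, seed=42)
for name, (low, high) in toy_ranges.items():
    print(f"{name}: {low:.3f} to {high:.3f} (difference {high - low:.3f})")
```

Reporting the minimum and maximum mirrors the ranges quoted in the abstract. Percentile intervals (for example the 2.5th and 97.5th percentiles of each list) are a common alternative that is less sensitive to a single extreme resample.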
Methods
A retrospective cohort study using the publicly available Heart Disease Prediction cohort
(from the England National Health Services database) was conducted. All methods in this
research were carried out in accordance with ethical guidelines detailed by the Data Alliance
Partnership Board (DAPB) approved national information standards and data collections for
use in health and adult (...truncated)