Machine learning to predict mortality after rehabilitation among patients with severe stroke
www.nature.com/scientificreports
Domenico Scrutinio1, Carlo Ricciardi1,2*, Leandro Donisi1,2, Ernesto Losavio1,
Petronilla Battista1, Pietro Guida1, Mario Cesarelli1,3, Gaetano Pagano1 & Giovanni D’Addio1
Stroke is among the leading causes of death and disability worldwide. Approximately 20–25%
of stroke survivors present severe disability, which is associated with increased mortality risk.
Prognostication is inherent in the process of clinical decision-making. Machine learning (ML) methods
have gained increasing popularity in the setting of biomedical research. The aim of this study was
twofold: to assess the performance of tree-based ML algorithms for predicting three-year mortality
in 1207 stroke patients with severe disability who completed rehabilitation, and to compare the
performance of ML algorithms with that of a standard logistic regression. The logistic regression model
achieved an area under the Receiver Operating Characteristics curve (AUC) of 0.745 and was well
calibrated. At the optimal risk threshold, the model had an accuracy of 75.7%, a positive predictive
value (PPV) of 33.9%, and a negative predictive value (NPV) of 91.0%. Random Forests combined with
the synthetic minority oversampling technique outperformed the logistic regression model,
achieving an AUC of 0.928 and an accuracy of 86.3%. The PPV was
84.6% and the NPV 87.5%. This study introduced a step forward in the creation of standardisable tools
for predicting health outcomes in individuals affected by stroke.
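As a reminder of how the threshold-dependent measures reported above are defined, the following minimal sketch computes accuracy, PPV, and NPV from the four cells of a confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, PPV and NPV from confusion-matrix counts.

    tp/fp: patients predicted to die who did / did not die.
    tn/fn: patients predicted to survive who did / did not survive.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that were correct
        "ppv": tp / (tp + fp),          # share of predicted deaths that were deaths
        "npv": tn / (tn + fn),          # share of predicted survivors who survived
    }


# Hypothetical counts for illustration only (not taken from the study).
metrics = classification_metrics(tp=30, fp=20, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because PPV and NPV depend on how common the outcome is, a model evaluated on a cohort with few deaths can combine a high NPV with a modest PPV.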
Stroke is among the leading causes of death and disability worldwide1–4. Approximately 20–25% of stroke
survivors present severe disability5. Severe disability after stroke is associated with increased risk of mortality and
readmission, wider inter-individual variation in responsiveness to rehabilitation, and higher healthcare and
social costs compared with less severe strokes6,7. Moreover, there is evidence that patients with severe post-stroke
disability are less likely to be admitted to specialized inpatient rehabilitation facilities (IRF) and to receive
appropriate secondary prevention than those with mild-to-moderate disability8–12, with a possible negative impact
on prognosis.
Prognostication is inherent in the process of clinical decision-making13. The assessment of risk in stroke
patients with severe disability might improve clinical decision-making, prompt clinicians to consider closer
surveillance and more aggressive treatment to achieve goals in secondary prevention, and influence patient
management. While not routinely used in clinical practice, multivariable models are well-accepted tools to
predict prognosis. Three well-known prognostic models were developed to predict 90-day or 1-year mortality in
patients with acute stroke14–16. These models had good discriminatory properties (C statistic ranging from 0.706 to
0.840). However, the application of models developed from patients with heterogeneous neurological deficits
using variables recorded at acute care admission to the subset of patients with severe stroke after discharge from
the acute care setting can result in miscalibrated estimates of life expectancy and decreased discriminatory value.
In addition, the beneficial effect of inpatient rehabilitation on mortality might confound the association between
predictors recorded at admission to acute care and mortality17–19.
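For readers less familiar with the C statistic quoted above, a minimal sketch follows. For a binary outcome it is the proportion of (event, non-event) patient pairs in which the patient who had the event was assigned the higher predicted risk, and it equals the area under the ROC curve. The function name and the toy risks and outcomes are hypothetical, not data from any of the cited models.

```python
from itertools import product


def c_statistic(risks, outcomes):
    """Concordance (C) statistic for a binary outcome.

    Counts, over all (event, non-event) pairs, how often the event
    received the higher predicted risk; ties count as one half.
    """
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for risk_event, risk_non_event in product(events, non_events):
        if risk_event > risk_non_event:
            score += 1.0
        elif risk_event == risk_non_event:
            score += 0.5
    return score / (len(events) * len(non_events))


# Hypothetical predicted risks and observed outcomes (1 = died, 0 = alive).
risks = [0.10, 0.20, 0.35, 0.40, 0.80]
outcomes = [0, 0, 1, 0, 1]
c = c_statistic(risks, outcomes)
print(f"C statistic: {c:.3f}")
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect ranking. The C statistic says nothing about calibration, which is why miscalibration is raised as a separate concern above.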
The standard approach to develop prognostic models involves the use of statistical regression models.
Correlation between covariates, nonlinearity of the association between continuous covariates and risk for the
outcome of interest, and potential complex interactions among covariates represent common analytic challenges
in regression modelling20,21. In comparison with statistical models, machine-learning (ML) methods have the
advantages of using a larger number of predictors, requiring fewer assumptions, using an agnostic approach
instead of a priori hypotheses, incorporating “multi-dimensional correlations that contain prognostic
information”, and producing a “more flexible relationship among the predictor variables (alone or in combination) and
the outcome”20,22–24. As observed by Deo24, “there may be features that are useful in combinations but not on their
1Istituti Clinici Scientifici Maugeri IRCCS, Pavia, Italy. 2Department of Advanced Biomedical Sciences, University
Hospital of Naples “Federico II”, Naples, Italy. 3Department of Electrical Engineering and Information Technology,
University of Naples “Federico II”, Naples, Italy. *email:
Scientific Reports | (2020) 10:20127 | https://doi.org/10.1038/s41598-020-77243-3
Figure 1. Workflow of the study. Data from 1207 patients treated at three facilities of the Maugeri
Institute in the South and North of Italy were collected and used to build models, through multivariate
logistic regression and tree-based ML algorithms, to predict three-year mortality in stroke patients after
rehabilitation.
own”. Theoretically, these properties might improve model performance in the prognostication
of the outcome of interest.
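Deo's point about features that are informative only in combination can be illustrated with a toy example. The sketch below uses two hypothetical binary features, A and B, and an outcome that occurs only when exactly one of them is present; the data are invented for illustration and have no clinical meaning.

```python
from itertools import product

# Hypothetical toy data: one row per (A, B) combination; the outcome is 1
# only when exactly one of the two features is present (exclusive or).
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]


def event_rate(rows, keep):
    """Proportion of outcome == 1 among rows selected by keep(a, b)."""
    selected = [y for a, b, y in rows if keep(a, b)]
    return sum(selected) / len(selected)


# Each feature on its own carries no information: the event rate is the
# same whether the feature is present or absent.
marginal = {
    "A=0": event_rate(rows, lambda a, b: a == 0),
    "A=1": event_rate(rows, lambda a, b: a == 1),
    "B=0": event_rate(rows, lambda a, b: b == 0),
    "B=1": event_rate(rows, lambda a, b: b == 1),
}

# Jointly, the two features determine the outcome exactly.
joint = {
    (a, b): event_rate(rows, lambda x, y, a=a, b=b: x == a and y == b)
    for a, b in product([0, 1], repeat=2)
}

print("marginal event rates:", marginal)
print("joint event rates:", joint)
```

A tree-based learner can represent such a pattern by splitting on one feature and then on the other, whereas a logistic regression containing only main effects would need an explicit interaction term to capture it.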
The workflow of the study is shown in Fig. 1. The aim of the study was twofold:
(1) Assessing the performance of ML-based algorithms for predicting long-term mortality in stroke patients
with severe disability;
(2) Comparing the performance of ML algorithms to that of a standard regression model.
To address these issues, we studied 1207 patients admitted to inpatient rehabilitation and classified as
Case-Mix Groups (CMGs) 0108, 0109, and 0110 of the Medicare case-mix classification system25, which was
specifically developed to account for “the level of severity of a given case”26. Case-mix groups 0108, 0109, and 0110
encompass the most severe strokes. Since our primary outcome was dichotomous (dead/alive) rather than
time-to-event and nearly all survivors had complete follow-up to three years, we chose to focus on a logistic
regression analysis instead of a Cox regression analysis. We found that ML algorithms outperformed a standard
regression model.
Results
Table 1 shows the patients’ baseline characteristics. Of the 1241 patients who fulfilled the selection criteria, 34 were
lost to follow-up after discharge, leaving 1207 patients available for analysis. A total of 3,267 person-years of
follow-up were examined during which 189 deaths (5.8 deaths/100 person-years) occurred. The mean follow-up
was 988 ± 273 days. The actual mortality rates were 8.3% at 1 year, 13.0% at 2 years, and 15.7% at 3 years.
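The follow-up figures above can be cross-checked with simple arithmetic. The sketch below recomputes the death rate per 100 person-years and the crude proportion of deaths from the reported totals. Two assumptions are ours rather than the authors': a year is taken as 365.25 days, and the crude proportion is compared with the 3-year mortality on the assumption that all recorded deaths fell within the three-year window.

```python
# Totals reported in the text.
deaths = 189
person_years = 3267
patients = 1207

# Incidence rate: deaths per 100 person-years of follow-up.
rate_per_100_py = deaths / person_years * 100

# Crude proportion of analysed patients who died during follow-up.
crude_proportion = deaths / patients

# Mean follow-up implied by the person-year total (assumes 365.25 days/year).
mean_followup_days = person_years * 365.25 / patients

print(f"{rate_per_100_py:.1f} deaths/100 person-years")
print(f"{crude_proportion:.1%} of patients died")
print(f"{mean_followup_days:.0f} days mean follow-up")
```

Under these assumptions the recomputed values agree with the reported 5.8 deaths/100 person-years, the 15.7% mortality at 3 years, and, to within a day, the mean follow-up of 988 days.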
Logistic regression. On multivariate analysis, age, diabetes, CAD, AF, anemia, renal dysfunction, neglect,
and cognitive FIM score were significantly associated with 3-year mortality (Table 2). Age was the most important variable (Table 3).
The logistic mo (...truncated)