The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data

PLOS ONE, Jan 2023

Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. To this end, we used routinely collected sickness fund claims and cost data from the years 2016, 2017 and 2018. Various specifications of each algorithm were trained and cross-validated on training data (n = 20,984) with claims and cost data from 2016 and outcomes from 2017. The best performing specification of each algorithm was selected based on validation dataset performance. For performance comparison, the selected models were applied to unseen data with features from 2017 and outcomes from 2018 (n = 21,146). The RF was the best performing algorithm as measured by the area under the receiver operating characteristic curve (AUC), with a value of 0.883 (95% confidence interval (CI): 0.872–0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867–0.889). The ANN (AUC = 0.846; 95% CI: 0.834–0.857) and LR (AUC = 0.839; 95% CI: 0.826–0.852) were significantly outperformed by the GBM and the RF. All ML algorithms and the LR achieved ‘good’ performance (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning models that predict high-cost patients with ‘good’ performance using routinely collected sickness fund claims and cost data. We found that tree-based models performed best and outperformed the ANN and LR.
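
The abstract reports each model's AUC together with a 95% CI. This preview does not say how those intervals were computed, so the following is only a minimal sketch of one common option, a percentile bootstrap, written in plain Python. The function names `auc` and `bootstrap_auc_ci`, the synthetic labels and scores, and the choice of bootstrap are assumptions made for illustration, not the authors' method.

```python
import random


def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. The loop is quadratic in the
    sample size, which is fine for a demo but slow for large data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples containing both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic stand-in for model output: 5% positives (high-cost patients)
# whose predicted scores are shifted upwards. Not real claims data.
rng = random.Random(42)
labels = [1] * 20 + [0] * 380
scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]

point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores, n_boot=200)
print(f"AUC = {point:.3f} (95% CI: {low:.3f}-{high:.3f})")
```

Under this approach, two models whose intervals do not overlap on the same test set, as with the RF and the LR above, differ clearly; overlapping intervals would call for a paired comparison rather than a reading of the intervals alone.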

Research article (open access)

Benedikt Langenberger 1*, Timo Schulte 2,3, Oliver Groene 2,3

1 Department of Health Care Management, Technische Universität Berlin, Berlin, Germany
2 OptiMedis, Hamburg, Germany
3 Department of Management & Innovation in Healthcare, Faculty of Health, University of Witten/Herdecke, Witten, Germany
* Corresponding author

Citation: Langenberger B, Schulte T, Groene O (2023) The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data. PLoS ONE 18(1): e0279540. https://doi.org/10.1371/journal.pone.0279540

Editor: Maciej Huk, Wroclaw University of Science and Technology, POLAND

Received: May 21, 2021. Accepted: December 10, 2022. Published: January 18, 2023.

Copyright: © 2023 Langenberger et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The data associated with the paper are not publicly available due to legal restrictions imposed by the health insurance companies providing the data (data contains potentially identifying patient information). The data set supporting the conclusions of this article is owned by German statutory health insurance and is subject to strict data protection rules according to the German social security code. Therefore, the data cannot be made publicly accessible. The data we accessed is collected by health insurances when health providers bill their services towards the health insurance. The authors did not have special access privileges to the health insurances. To request the data please contact (. de). To fulfill the legal requirements to obtain the data, data users must conclude a contract with the statutory health insurer regarding data access. The licensee is permitted to use the data for the purpose as set out in the contract. Licensees are not allowed to pass the data to a third party. For assistance in obtaining access to the data, please contact the corresponding author (BL).

Funding: OptiMedis AG sponsored the data and publication fees. OG is employed by OptiMedis AG. The involvement of OptiMedis AG did not influence our analysis or the interpretation of our results. The funder had no role in study design, data analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The patterns of health service utilization as well as the resulting costs vary substantially across individuals within a given population. The fact that five percent of the population account for about half of the total population's healthcare costs is known to hold for various countries such as the US [1, 2], Germany [3, 4], Canada [5], Denmark [6], Japan [7], the Netherlands [8] or Australia [9]. The top five percent of patients in the cost distribution are referred to as high-cost patients (HCPs). Compared to non-HCPs, HCPs are more likely to have a low income, suffer from several chronic conditions [5], depend on multiple drugs [10], be of white skin color (at least in the US), have consulted a physician within the last year, be physically inactive and overweight, and have smoked in the past. The odds of becoming a HCP within the next five years were found to be 37 times as high for individuals aged 80 or older as for individuals younger than 30 [11].

The knowledge that only a small fraction of individuals accounts for a dramatic share of healthcare expenses leads to a question: How can the intensive resource consumption of HCPs be prevented to the benefit of health care systems, payers, and patients? An often-practiced approach to reduce cost is to focus on individuals who are already HCPs and try to reduce the future costs among those patients.
However, a better approach could be to predict accurately which individuals will become HCPs in the future (e.g. next year) and to intervene preventively with appropriate measures [12]. Variables such as age [7, 11, 13, 14], current high healthcare utilization/costs [7, 13, 15], hospitalization [13, 16], (number of) chronic conditions [13, 15], social deprivation [11, 13], patients' general health status [11, 14], mental disorders [13, 14], obesity-related factors [7, 11] and diabetes or cardiovascular disease indicators [7] were found to be important predictors of becoming a HCP. Tamang et al. (2017) [6] found that about one third of HCPs remain HCPs in the following year, while Wodchis et al. (2016) [1] found that about 30 percent of current HCPs remain HCPs for the following two years. Nevertheless, about two thirds of HCPs were new to this group and were labelled ‘cost bloomers’ [6]. Among those, becoming a HCP is sometimes triggered by relatively rare or unforeseeable but expensive conditions [1], such as accidents. Consequently, a certain percentage of HCPs will most likely always remain unpredictable, even with sophisticated prediction (...truncated)
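
The definitions used in the introduction (HCPs as the top five percent of the cost distribution, the share of HCPs who stay HCPs, and ‘cost bloomers’ who newly enter the group) can be made concrete with a small sketch. Everything below is hypothetical: the helper names `top_share_ids`, `cost_share` and `persistence`, and the toy cost figures for 60 invented persons, are chosen for illustration only and are not taken from the study data.

```python
def top_share_ids(costs, share=0.05):
    """Return the IDs in the top `share` of the cost distribution.

    `costs` maps person ID to annual cost. Ties at the cut-off are
    broken by the order in which IDs appear in the mapping.
    """
    k = max(1, round(len(costs) * share))
    ranked = sorted(costs, key=costs.get, reverse=True)
    return set(ranked[:k])


def cost_share(costs, ids):
    """Fraction of total cost attributable to the persons in `ids`."""
    return sum(costs[i] for i in ids) / sum(costs.values())


def persistence(hcp_now, hcp_next):
    """Return (share of current HCPs who stay HCPs,
    share of next year's HCPs who are new, i.e. cost bloomers)."""
    stayed = hcp_now & hcp_next
    bloomers = hcp_next - hcp_now
    return len(stayed) / len(hcp_now), len(bloomers) / len(hcp_next)


# Toy data: 60 invented persons with a baseline cost of 100 per year,
# and three expensive persons in each year (only p0 in both years).
base = {f"p{i}": 100.0 for i in range(60)}
costs_y1 = dict(base, p0=9000.0, p1=7000.0, p2=6000.0)
costs_y2 = dict(base, p0=8000.0, p7=7500.0, p9=6500.0)

hcp_y1 = top_share_ids(costs_y1)  # top 5% of 60 persons = 3 persons
hcp_y2 = top_share_ids(costs_y2)

stay_rate, bloomer_rate = persistence(hcp_y1, hcp_y2)
print(f"HCP cost share in year 1: {cost_share(costs_y1, hcp_y1):.1%}")
print(f"remain HCP: {stay_rate:.1%}, cost bloomers: {bloomer_rate:.1%}")
```

In a prediction setting of the kind described in the abstract, the year-2 flag would serve as the outcome and year-1 claims and costs as features; the cost bloomers are the cases that a rule based only on current high cost cannot find.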


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0279540&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0279540

Benedikt Langenberger, Timo Schulte, Oliver Groene. The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data, PLOS ONE, 2023, Volume 18, Issue 1, DOI: 10.1371/journal.pone.0279540