The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data
PLOS ONE
RESEARCH ARTICLE
Benedikt Langenberger1*, Timo Schulte2,3, Oliver Groene2,3
1 Department of Health Care Management, Technische Universität Berlin, Berlin, Germany, 2 OptiMedis,
Hamburg, Germany, 3 Department of Management & Innovation in Healthcare, Faculty of Health, University
of Witten/Herdecke, Witten, Germany
*
OPEN ACCESS
Citation: Langenberger B, Schulte T, Groene O (2023) The application of machine learning to
predict high-cost patients: A performance-comparison of different models using healthcare
claims data. PLoS ONE 18(1): e0279540. https://doi.org/10.1371/journal.pone.0279540
Editor: Maciej Huk, Wroclaw University of Science and Technology, POLAND
Received: May 21, 2021
Accepted: December 10, 2022
Published: January 18, 2023
Copyright: © 2023 Langenberger et al. This is an open access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original author and source are
credited.
Data Availability Statement: The data associated with the paper are not publicly available due
to legal restrictions imposed by the health insurance companies providing the data (data
contains potentially identifying patient information). The data set supporting the conclusions
of this article is owned by German statutory health insurance and is subject to strict data
protection rules according to the German social security code. Therefore, the data cannot be
made publicly accessible. The data we accessed is collected by health insurances when health
providers bill their services towards
Abstract
Our aim was to predict future high-cost patients with machine learning using healthcare
claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the
following year. To this end, we used routinely collected sickness fund claims and cost
data from the years 2016, 2017 and 2018. Various specifications of each algorithm were
trained and cross-validated on training data (n = 20,984) with claims and cost data from
2016 and outcomes from 2017. The best performing specifications of each algorithm were
selected based on validation dataset performance. For performance comparison, the selected
models were applied to unseen data with features from the year 2017 and outcomes from the
year 2018 (n = 21,146). The RF was the best performing algorithm as measured by the area
under the receiver operating characteristic curve (AUC), with a value of 0.883 (95% confidence
interval (CI): 0.872–0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867–0.889).
The ANN (AUC = 0.846; 95% CI: 0.834–0.857) and LR (AUC = 0.839; 95% CI: 0.826–0.852)
were significantly outperformed by the GBM and the RF. All ML algorithms and the
LR performed ‘good’ (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning
models that predict high-cost patients with ‘good’ performance using routinely collected
sickness fund claims and cost data. We found that tree-based models performed best and
outperformed the ANN and LR.
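The comparison above rests on the AUC, its 95% confidence interval, and the ‘good’ band 0.9 > AUC ≥ 0.8. The following is a minimal sketch of how such figures could be computed, not the authors' code. It assumes a percentile bootstrap for the interval, because the abstract does not state how the intervals were obtained. The function names (`auc`, `bootstrap_ci`, `is_good`) and the synthetic data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC via the Mann-Whitney rank-sum statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumption, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes for the AUC to exist
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_good(value):
    """True if the AUC falls in the 'good' band used in the text: 0.9 > AUC >= 0.8."""
    return 0.8 <= value < 0.9


if __name__ == "__main__":
    # Synthetic data: roughly 5% positives, with higher scores for positives on average.
    rng = random.Random(42)
    labels = [1 if rng.random() < 0.05 else 0 for _ in range(2000)]
    scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
    point = auc(labels, scores)
    low, high = bootstrap_ci(labels, scores)
    print(f"AUC = {point:.3f} (95% CI: {low:.3f}-{high:.3f}), good: {is_good(point)}")
```

In the study design described above, `labels` would be the HCP status in the outcome year (2018 for the test set) and `scores` the predicted probabilities from a model fitted on the previous year's data.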
Introduction
The patterns of health service utilization as well as the resulting costs vary substantially across
individuals within a given population. That five percent of the population accounts for
about half of the total population's healthcare costs has been documented for various countries,
such as the US [1, 2], Germany [3, 4], Canada [5], Denmark [6], Japan [7], the Netherlands [8] and
Australia [9]. The top five percent of patients in the cost distribution are referred to as
high-cost patients (HCPs). Compared to non-HCPs, HCPs are more likely to have a low
the health insurance. The authors did not have
special access privileges to the health insurances.
To request the data please contact (.
de). To fulfill the legal requirements to obtain the
data, data users must conclude a contract with the
statutory health insurer regarding data access. The
licensee is permitted to use the data for the
purpose as set out in the contract. Licensees are
not allowed to pass the data to a third party. For
assistance in obtaining access to the data, please
contact the corresponding author (BL).
Funding: OptiMedis AG sponsored the data and
publication fees. OG is employed by OptiMedis AG.
The involvement of OptiMedis AG did not influence
our analysis or the interpretation of our results. The
funder had no role in study design, data analysis,
decision to publish, or preparation of the
manuscript.
Competing interests: The authors have declared
that no competing interests exist.
income, suffer from several chronic conditions [5], depend on multiple drugs [10], be white
(at least in the US), have consulted a physician within the last year, be physically
inactive and overweight, and have smoked in the past. The odds of becoming a HCP within the
next five years were found to be 37 times as high for individuals aged 80 or older as for
individuals younger than 30 [11].
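The HCP definition above (the top five percent of the cost distribution) translates directly into a labelling rule. The sketch below is illustrative only: the function names (`label_high_cost`, `cost_share`), the tie-breaking by position, and the synthetic log-normal costs are assumptions and are not taken from the study.

```python
import random


def label_high_cost(costs, top_share=0.05):
    """Return one 0/1 flag per patient: 1 if the patient is in the top `top_share` by cost."""
    n_top = max(1, round(len(costs) * top_share))
    # Rank patient indices by cost, highest first; ties keep their original order.
    ranked = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(costs))]


def cost_share(costs, flags):
    """Share of total costs attributable to the flagged patients."""
    return sum(c for c, f in zip(costs, flags) if f) / sum(costs)


if __name__ == "__main__":
    # Synthetic, right-skewed annual costs (hypothetical parameters).
    rng = random.Random(7)
    costs = [rng.lognormvariate(7.0, 1.5) for _ in range(10_000)]
    flags = label_high_cost(costs)
    print(f"HCPs: {sum(flags)} of {len(costs)}")
    print(f"Share of total costs: {cost_share(costs, flags):.1%}")
```

With a strongly right-skewed cost distribution, the flagged five percent account for a large share of total costs, which is the pattern the cited studies report for real populations.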
The knowledge that only a small fraction of individuals accounts for a disproportionate share of
healthcare expenses raises a question: How can the intensive resource consumption of HCPs
be prevented to the benefit of health care systems, payers, and patients? A common
approach to reducing costs is to focus on individuals who are already HCPs and to try to reduce
future costs among those patients. However, a better approach could be to accurately
predict which individuals will become HCPs in the future (e.g. next year) and to intervene
preventively with appropriate measures [12].
Variables such as age [7, 11, 13, 14], current high healthcare utilization/costs [7, 13, 15],
hospitalization [13, 16], (number of) chronic conditions [13, 15], social deprivation [11, 13],
patients' general health status [11, 14], mental disorders [13, 14], obesity-related factors [7, 11]
and diabetes or cardiovascular disease indicators [7] were found to be important predictors of
becoming a HCP. Tamang et al. (2017) [6] found that about one third of HCPs remain HCPs
in the following year, while Wodchis et al. (2016) [1] found that about 30 percent of current
HCPs remain HCPs for the following two years. Nevertheless, about two thirds
of HCPs were new to this group and were labelled ‘cost bloomers’ [6]. Among those, becoming a
HCP is sometimes triggered by relatively rare or unforeseeable but expensive conditions [1],
such as accidents. Consequently, a certain percentage of HCPs will most likely always remain
unpredictable, even with sophisticated prediction (...truncated)