EHR foundation models improve robustness in the presence of temporal distribution shift
www.nature.com/scientificreports
Lin Lawrence Guo 1, Ethan Steinberg 2, Scott Lanyon Fleming 2, Jose Posada 3,
Joshua Lemmon 1, Stephen R. Pfohl 2, Nigam Shah 2, Jason Fries 2,5 & Lillian Sung 1,4,5*
Temporal distribution shift negatively impacts the performance of clinical prediction models over
time. Pretraining foundation models using self-supervised learning on electronic health records
(EHR) may be effective in acquiring informative global patterns that can improve the robustness
of task-specific models. The objective was to evaluate the utility of EHR foundation models in
improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction
models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR
of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g.,
2009–2012) and were subsequently used to construct patient representations for patients admitted
to inpatient units. These representations were used to train logistic regression models to predict
hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR
foundation models with baseline logistic regression models learned on count-based representations
(count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute
calibration error. Both transformer and recurrent-based foundation models generally showed better
ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is
observable degradation of discrimination performance (average AUROC decay of 3% for transformer-based foundation model vs. 7% for count-LR after 5–9 years). In addition, the performance and
robustness of transformer-based foundation models continued to improve as pretraining set size
increased. These results suggest that pretraining EHR foundation models at scale is a useful approach
for developing clinical prediction models that perform well in the presence of temporal distribution
shift.
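As a rough illustration of the ID versus OOD comparison summarized above, the sketch below computes AUROC with the rank-based (Mann-Whitney) formulation and reports decay as the difference between ID and OOD AUROC. The function names, the toy labels and scores, and the choice of an absolute difference as the decay measure are assumptions made for this example only; this section does not specify how the reported decay was calculated.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_decay(
    id_labels: Sequence[int],
    id_scores: Sequence[float],
    ood_labels: Sequence[int],
    ood_scores: Sequence[float],
) -> float:
    """Hypothetical decay measure: ID AUROC minus OOD AUROC.
    (Assumption: an absolute difference, not a relative change.)"""
    return auroc(id_labels, id_scores) - auroc(ood_labels, ood_scores)


# Toy data standing in for model outputs on an ID year group
# (same years as training) and an OOD year group (later years).
id_labels, id_scores = [0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]
ood_labels, ood_scores = [0, 0, 1, 1], [0.10, 0.40, 0.35, 0.30]

print("ID AUROC: ", auroc(id_labels, id_scores))
print("OOD AUROC:", auroc(ood_labels, ood_scores))
print("Decay:    ", auroc_decay(id_labels, id_scores, ood_labels, ood_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; a sort-based implementation or an established library routine would be the more practical choice for real cohorts.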
The large increase in the adoption of electronic health records (EHR) has enabled the use of machine learning
to develop highly performant clinical prediction models that have the potential to improve the care of patients1.
However, the non-stationary healthcare environment can bring about changes in the data distribution between
model development and deployment2, which can degrade the model’s performance over time3 and consequently
its clinical utility4. In this study, we explored temporal distribution shift alongside the suitability of foundation models5 (deep neural networks trained on large-scale unlabeled data using self-supervised learning) and
whether they can be adapted via transfer learning to improve the robustness of clinical prediction models in the
presence of temporal distribution shift.
The cause of temporal distribution shift in clinical medicine is often subtle6 and the extent of its impact on
model performance is heterogeneous across tasks3,7–9. Nonetheless, the consequences for patient
care and physicians’ trust can be severe. An example is the widely implemented Epic sepsis model developed on
data collected between 2013 and 2015 that performed below expectation when evaluated at Michigan Medicine
on data collected between 2018 and 2019 and resulted in a large number of spurious alerts4.
Recent approaches that mitigate the impact of temporal distribution shift on model performance in clinical medicine largely rely on model monitoring and updating policies that do not leverage the entire patient
population available10. In addition, proactive approaches using domain generalization and adaptation have shown
little to no success3.
1Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada. 2Stanford
Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA. 3Universidad del Norte,
Barranquilla, Colombia. 4Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue,
Toronto, ON M5G1X8, Canada. 5These authors jointly supervised this work: Jason Fries and Lillian Sung. *email:
Scientific Reports | (2023) 13:3767 | https://doi.org/10.1038/s41598-023-30820-8
While recent work on medical foundation models has focused on improving sample complexity when fine-tuning, little-to-no work has measured a pretrained, medical foundation model’s impact on temporal robustness
in clinical prediction tasks. Findings from domains outside of clinical medicine suggest significant performance11
and robustness12,13 benefits to pretraining foundation models, and these benefits tend to increase with scale14,15.
Another major benefit of foundation models is their ability to generalize to tasks not seen during training16.
In this study, we adopt EHR foundation models—deep neural networks pretrained on EHR-derived patient
timelines using self-supervised learning. Patient timelines consist of structured medical codes ordered by time,
where each code (e.g., M32.9 for “lupus erythematosus”) functions as a word drawn from a finite vocabulary
defined by medical ontologies such as ICD10. This formulation enables using autoregressive sequence modeling,
a self-supervised learning objective used in natural language processing, to train an EHR foundation model by
predicting the next day’s codes. The resulting pretrained model is then used to generate feature representations for
downstream tasks. The foundation modeling approach in this study is referred to as clinical language model based
representations (CLMBR)17. Transfer of the structure learned by CLMBR from the entire patient population to
downstream clinical prediction models has demonstrated performance benefits compared to standard baselines
including count-based models, especially when the number of patient records was small17. CLMBR’s architecture aligns with other EHR foundation models, such as Med-BERT18 and BEHRT19, but uses an autoregressive
instead of a masked language modeling objective for pretraining to match the next-day prediction task. We refer
to CLMBR as an EHR foundation model because of its potential to shift practice in the development of machine
learning models for clinical medicine. We focus specifically on the implications for temporal robustness when
adapting task-specific models from a shared, self-supervised model trained on a patient population. However,
we recognize that scale (both in parameter count and training data size) is a key aspect of modern foundation
models and that structured EHR models are currently much smaller than their counterparts in language and
vision. For example, GPT-316 has 175 billion parameters compared to 42 milli (...truncated)