EHR foundation models improve robustness in the presence of temporal distribution shift

https://www.nature.com/articles/s41598-023-30820-8.pdf

www.nature.com/scientificreports

OPEN

Lin Lawrence Guo 1, Ethan Steinberg 2, Scott Lanyon Fleming 2, Jose Posada 3, Joshua Lemmon 1, Stephen R. Pfohl 2, Nigam Shah 2, Jason Fries 2,5 & Lillian Sung 1,4,5*

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g., 2009–2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for transformer-based foundation model vs. 7% for count-LR after 5–9 years).
In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.

1 Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada. 2 Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA. 3 Universidad del Norte, Barranquilla, Colombia. 4 Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON M5G1X8, Canada. 5 These authors jointly supervised this work: Jason Fries and Lillian Sung. *email:

Scientific Reports | (2023) 13:3767 | https://doi.org/10.1038/s41598-023-30820-8

The large increase in the adoption of electronic health records (EHR) has enabled the use of machine learning to develop highly performant clinical prediction models that have the potential to improve the care of patients1. However, the non-stationary healthcare environment can bring about changes in the data distribution between model development and deployment2, which can degrade the model’s performance over time3 and consequently its clinical utility4. In this study, we explored temporal distribution shift alongside the suitability of foundation models5 (deep neural networks trained on large-scale unlabeled data using self-supervised learning) and whether they can be adapted via transfer learning to improve the robustness of clinical prediction models in the presence of temporal distribution shift.

The cause of temporal distribution shift in clinical medicine is often subtle6 and the extent of its impact on model performance is heterogeneous across tasks3,7–9. Nonetheless, the consequence of the impact on patient care and physician’s trust can be severe. An example is the widely implemented Epic sepsis model developed on data collected between 2013 and 2015 that performed below expectation when evaluated at Michigan Medicine on data collected between 2018 and 2019 and resulted in a large number of spurious alerts4.

Recent approaches that mitigate the impact of temporal distribution shift on model performance in clinical medicine largely rely on model monitoring and updating policies that do not leverage the entire patient population available10. In addition, proactive approaches using domain generalization and adaptation have shown little to no success3. While recent work on medical foundation models has focused on improving sample complexity when finetuning, little-to-no work has measured a pretrained, medical foundation model’s impact on temporal robustness in clinical prediction tasks. Findings from domains outside of clinical medicine suggest significant performance11 and robustness12,13 benefits to pretraining foundation models, and these benefits tend to increase with scale14,15. Another major benefit of foundation models is their ability to generalize to tasks not seen during training16.

In this study, we adopt EHR foundation models: deep neural networks pretrained on EHR-derived patient timelines using self-supervised learning. Patient timelines consist of structured medical codes ordered by time, where each code (e.g., M32.9 for “lupus erythematosus”) functions as a word drawn from a finite vocabulary defined by medical ontologies such as ICD10. This formulation enables using autoregressive sequence modeling, a self-supervised learning objective used in natural language processing, to train an EHR foundation model by predicting the next day’s codes. The resulting pretrained model is then used to generate feature representations for downstream tasks. The foundation modeling approach in this study is referred to as clinical language model based representations (CLMBR)17.
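
The next-day prediction setup described above can be sketched on a toy timeline. This is an illustration only, not the CLMBR implementation: the function names, the toy patient, the extra ICD-10 codes beyond M32.9, and the choice to treat "next day" as the next day with any recorded code are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def group_codes_by_day(events):
    """Group (day, code) events into chronologically ordered (day, codes) pairs."""
    by_day = defaultdict(set)
    for day, code in events:
        by_day[day].add(code)
    # Sort days chronologically and codes alphabetically for determinism.
    return [(day, sorted(by_day[day])) for day in sorted(by_day)]


def next_day_examples(events):
    """Build (history, target) pairs for an autoregressive objective.

    history: every code recorded up to and including the current day.
    target:  the codes on the next recorded day (assumption: days with
             no recorded codes are skipped rather than given empty targets).
    """
    days = group_codes_by_day(events)
    examples = []
    history = []
    for (_, codes), (_, next_codes) in zip(days, days[1:]):
        history.extend(codes)
        examples.append((list(history), list(next_codes)))
    return examples


# Hypothetical patient timeline; codes are used here only as vocabulary tokens.
timeline = [
    (date(2010, 3, 1), "M32.9"),
    (date(2010, 3, 1), "E11.9"),
    (date(2010, 3, 4), "N18.3"),
    (date(2011, 1, 15), "I10"),
]

examples = next_day_examples(timeline)
for history, target in examples:
    print(history, "->", target)
```

A trained sequence model would replace the explicit history lists with learned representations; the pairs here only show which codes are predicted from which preceding codes.
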
Transfer of the structure learned by CLMBR from the entire patient population to downstream clinical prediction models has demonstrated performance benefits compared to standard baselines including count-based models, especially when the number of patient records was small17. CLMBR’s architecture aligns with other EHR foundation models, such as Med-BERT18 and BEHRT19, but uses an autoregressive instead of masked language modeling objective for pretraining to match the next-day prediction task. We refer to CLMBR as an EHR foundation model because of its potential to shift practice in the development of machine learning models for clinical medicine. We focus specifically on the implications for temporal robustness when adapting task-specific models from a shared, self-supervised model trained on a patient population. However, we recognize that scale (both in parameter count and training data size) is a key aspect of modern foundation models and that structured EHR models are currently much smaller than their counterparts in language and vision. For example, GPT-3 (ref. 16) has 175 billion parameters compared to 42 milli (...truncated)