A multi-center study on the adaptability of a shared foundation model for electronic health records
npj | digital medicine
Article
Published in partnership with Seoul National University Bundang Hospital
https://doi.org/10.1038/s41746-024-01166-w
Lin Lawrence Guo 1,7, Jason Fries 2,7, Ethan Steinberg 2, Scott Lanyon Fleming 2, Keith Morse 3, Catherine Aftandilian 4, Jose Posada 5, Nigam Shah 2,8 & Lillian Sung 1,6,8
Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular
components adaptable for various downstream tasks, making AI development more scalable and
cost-effective. Foundation models for structured electronic health records (EHR), trained on coded
medical records from millions of patients, demonstrated benefits including increased performance
with fewer training labels, and improved robustness to distribution shifts. However, questions remain
on the feasibility of sharing these models across hospitals and their performance in local tasks. This
multi-center study examined the adaptability of a publicly accessible structured EHR foundation
model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data
from The Hospital for Sick Children (SickKids) and Medical Information Mart for Intensive Care (MIMIC-IV). We assessed both adaptability via continued pretraining on local data, and task adaptability
compared to baselines of locally training models from scratch, including a local foundation model.
Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSM matched the
performance of gradient boosting machines (GBM) locally trained on all data while providing a 13%
improvement in settings with few task-specific training labels. Continued pretraining on local data
showed FMSM required fewer than 1% of training examples to match the fully trained GBM’s
performance, and was 60 to 90% more sample-efficient than training local foundation models from
scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides
improved prediction performance at less cost, underscoring the utility of base foundation models as
modular components to streamline the development of healthcare AI.
Foundation models1, large-scale artificial intelligence (AI) models trained on
massive amounts of unlabeled data using self-supervised learning, mark a
paradigm shift for healthcare AI by moving away from bespoke, single-purpose models to generalist and more easily adaptable medical AI2.
Foundation models open new opportunities to improve diagnostic and
predictive capabilities, enable proactive interventions and improve patient
care using a range of modalities including natural language3,4, imaging5,
genomics6,7, and structured data from electronic health records (EHRs)8–11.
Structured EHR foundation models, trained on tabular, timestamped event
data for procedures, diagnoses, medications, and lab values as examples,
offer distinct representational abilities over other modalities by focusing on
encoding patients’ longitudinal medical history. This enables generating
feature representations that summarize a patient’s entire medical history up
to a specific time point, facilitating downstream tasks such as risk stratification and time-to-event modeling.
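As a rough illustration of what "summarizing a patient's history up to a specific time point" means, the sketch below turns a toy timeline of timestamped, coded events into a fixed-length vector. It is only a count-based stand-in under stated assumptions: the event codes, the `vocabulary` list, and the `featurize` helper are hypothetical, and a foundation model would replace the counting step with a learned encoder.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy timeline for one patient: (timestamp, code) pairs.
events = [
    (datetime(2020, 1, 5), "diagnosis/J45"),
    (datetime(2020, 1, 5), "medication/salbutamol"),
    (datetime(2020, 3, 2), "lab/glucose"),
    (datetime(2020, 6, 9), "diagnosis/J45"),
    (datetime(2021, 2, 1), "procedure/xray"),
]

# Hypothetical fixed code vocabulary that defines the vector layout.
vocabulary = [
    "diagnosis/J45",
    "medication/salbutamol",
    "lab/glucose",
    "procedure/xray",
]


def featurize(events, prediction_time, vocabulary):
    """Count each code observed at or before prediction_time.

    Events after prediction_time are ignored so that the representation
    never uses information from the future.
    """
    counts = Counter(code for ts, code in events if ts <= prediction_time)
    return [counts.get(code, 0) for code in vocabulary]


features = featurize(events, datetime(2020, 7, 1), vocabulary)
print(features)  # [2, 1, 1, 0]
```

The same vector could then feed any downstream classifier. The key design point is the time cutoff, which keeps the representation tied to what was known at prediction time.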
Recent EHR foundation models report state-of-the-art accuracy,
require fewer labeled examples for task adaptation, and have demonstrated
improved robustness to distribution shifts across time and patient
subpopulations12,13. With model hubs (centralized repositories for pretrained model weights) playing a key role in modern AI development,
1Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada. 2Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA. 3Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, USA. 4Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA, USA. 5Universidad del Norte, Barranquilla, Colombia. 6Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, ON, Canada. 7These authors contributed equally: Lin Lawrence Guo, Jason Fries. 8These authors jointly supervised this work: Nigam Shah, Lillian Sung.
e-mail:
npj Digital Medicine | (2024)7:171
sharing EHR foundation models across sites offers many practical advantages by providing a less expensive route for local hospitals to adapt a
foundation model for their specific needs. More importantly, key properties
of foundation models, such as their skills, domain knowledge, and biases, are
highly dependent on the specific data used for pretraining14,15. Since large-scale EHR datasets (>1 million patients) are challenging to obtain for most
researchers, sharing EHR foundation model weights becomes critical to
advancing research into mitigating biases, improving robustness, and other
properties intrinsic to a specific set of pretrained model weights. Finally,
given recent arguments for regulatory oversight and quality assurance of
healthcare AI models by public-private entities16, access to foundation
model weights that have undergone some certification process may become
a prerequisite for model deployment.
Adapting and improving existing foundation models (rather than pretraining from scratch) is the predominant workflow in domains such as NLP
and computer vision. However, the absence of public structured EHR foundation models has hampered similar progress in EHR settings17. This creates
challenges in advancing label/sample efficiency, few-shot learning, and general
methods for improving EHR foundation models without access to the original
pretraining data18. For example, work in other modalities has found that pretraining on large-scale, heterogeneous data generally improves robustness19
and that continued pretraining of existing models using in-domain data further improves performance in a target domain20. This offers a promising route
to improving existing EHR foundation models at local hospitals but introduces
potential challenges around catastrophic forgetting and other issues that have
been underexplored due to the lack of large-scale, shared EHR models.
Although there is a growing body of work evaluating pretrained models
across different hospital systems (GenHPF21, TransformEHR22) and transfer
from EHR data to insurance claims (Med-BERT9), prior studies have
focused on private foundation models, pretrained from scratch, and the role
architectural choices play in transfer learning performance in downstream
task adaptation. There has been limited exploration of label efficiency in
EHR settings, where encoder-only/BERT-style models perform poorly on
few-shot tasks. For example, Med-BERT required an average of 200–1000
training examples per adapted task (...truncated)