TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records
Article
https://doi.org/10.1038/s41467-023-43715-z
Received: 3 May 2023
Accepted: 17 November 2023
Zhichao Yang1, Avijit Mitra1, Weisong Liu2,3, Dan Berlowitz3,4 &
Hong Yu1,2,3,5
Deep learning transformer-based models using longitudinal electronic health
records (EHRs) have shown great success in predicting clinical diseases or
outcomes. Pretraining on a large dataset can help such models map the input
space better and boost their performance on relevant tasks through finetuning
with limited data. In this study, we present TransformEHR, a generative
encoder-decoder transformer model that is pretrained using a new pretraining
objective: predicting all diseases and outcomes of a patient at a future
visit from previous visits. TransformEHR’s encoder-decoder framework,
paired with the novel pretraining objective, helps it achieve new
state-of-the-art performance on multiple clinical prediction tasks. Compared with the
previous model, TransformEHR improves area under the precision–recall
curve by 2% (p < 0.001) for pancreatic cancer onset and by 24% (p = 0.007) for
intentional self-harm in patients with post-traumatic stress disorder. The high
performance in predicting intentional self-harm shows the potential of
TransformEHR in building effective clinical intervention systems.
TransformEHR is also generalizable and can be easily finetuned for clinical
prediction tasks with limited data.
The widespread adoption of electronic health records (EHRs) among
US hospitals has led to the development and adoption of numerous data
mining and statistical techniques for EHRs. Longitudinal EHRs
have been successfully used to predict clinical diseases or outcomes1–4.
Early work applied regression and traditional machine learning (ML)
models (e.g., support vector machines, random forests, and
gradient boosting) to predict a single disease or outcome. Examples
include congestive heart failure5, sepsis mortality6, mechanical
ventilation6, septic shock7, type 2 diabetes8, and development of
post-traumatic stress disorder (PTSD)9, among others.
With the availability of large cohorts and computational resources,
deep learning-based models can outperform traditional ML
models10–16. State-of-the-art (SOTA) models in EHR-based predictive
modeling achieved this through the pretrain-finetune paradigm, a
two-step process in which the model is first trained on large-scale
longitudinal EHRs to learn representations of clinical features such as
International Classification of Diseases (ICD) codes (pretrain) and is then
further trained to adapt to specific tasks, e.g., outcome prediction
(finetune). Models such as Med-BERT13, BEHRT14, and BRLTM15 fall in
this category. However, their pretraining objectives were limited to
1College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, USA. 2School of Computer & Information Sciences,
University of Massachusetts Lowell, Lowell, MA, USA. 3Center for Healthcare Organization and Implementation Research, VA Bedford Health Care System,
Bedford, MA, USA. 4Department of Public Health, University of Massachusetts Lowell, Lowell, MA, USA. 5Center for Biomedical and Health Research in Data
Sciences, University of Massachusetts Lowell, Lowell, MA, USA.
e-mail:
Nature Communications | (2023)14:7857
predicting a fraction of ICD codes within each visit. In reality, most
patients have multiple diseases or outcomes at once17, many of which
are highly correlated (such as obesity, diabetes, and hypertension18–20)
and thus collectively contribute to the disease or outcome trajectories.
Therefore, a novel pretraining strategy, which predicts the complete
set of diseases and outcomes within a visit, might improve clinical
predictive modeling.
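The difference between the two pretraining targets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy patient record, the function names `masked_code_example` and `next_visit_example`, the `[MASK]` token, and the 50% masking fraction are all hypothetical and are not taken from the cited models.

```python
import random

# Toy longitudinal record: one list of ICD-10 codes per visit (hypothetical data).
visits = [
    ["E66.9", "I10"],           # obesity, hypertension
    ["E11.9", "I10", "E78.5"],  # type 2 diabetes, hypertension, hyperlipidemia
    ["E11.9", "N18.9", "I10"],  # type 2 diabetes, chronic kidney disease, hypertension
]

def masked_code_example(visits, mask_fraction=0.5, seed=0):
    """Masked-code target: hide a fraction of the codes in each visit
    and ask the model to recover only the hidden ones."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for visit in visits:
        n_mask = max(1, int(len(visit) * mask_fraction))
        masked = set(rng.sample(range(len(visit)), n_mask))
        inputs.append(["[MASK]" if i in masked else code
                       for i, code in enumerate(visit)])
        targets.append([code for i, code in enumerate(visit) if i in masked])
    return inputs, targets

def next_visit_example(visits):
    """Full-visit target: all earlier visits are the context and
    every code of the final visit is the prediction target."""
    context, target = visits[:-1], visits[-1]
    return context, list(target)

inputs, targets = masked_code_example(visits)
context, target = next_visit_example(visits)
```

In the masked variant the target covers only some codes of each visit, whereas the full-visit variant asks for the whole co-occurring set at once, which is the property the paragraph above argues for.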
In this study, we propose TransformEHR, an innovative denoising
sequence-to-sequence transformer21 model that was pretrained on the EHRs of 6.5
million patients to predict the complete set of ICD codes of a visit.
TransformEHR can be further finetuned for single disease or outcome
predictions. Unlike previous EHR-based models13–16, which rely on the
bidirectional (left-to-right and right-to-left) encoder representations
from transformers (BERT) framework22, TransformEHR uses a
transformer-based encoder-decoder generative framework to predict
future ICD codes during pretraining. Compared with the bidirectional
encoder-only framework, the unidirectional (left-to-right)
decoder in such an encoder-decoder framework is closer to the
use case of predicting future diseases or outcomes from the history of
past diseases or outcomes (past-to-future).
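The distinction between bidirectional and unidirectional attention comes down to which positions each token is allowed to see. The sketch below shows the standard masking pattern in plain Python; the function names are illustrative, and it is not a description of TransformEHR's actual implementation.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i,
    so no information flows from later tokens to earlier ones."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Print the causal mask for a sequence of four tokens, one row per position.
for row in causal_mask(4):
    print(row)
```

With the causal mask, row i contains ones only up to column i, which mirrors the past-to-future direction of clinical prediction.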
Although the encoder-decoder framework was originally
designed to generate the next sentence given previous sentences as
context23,24, we repurposed the framework for TransformEHR to generate
the ICD codes of the next visit given previous EHRs (context).
TransformEHR can utilize cross-attention21 to identify relevant ICD
codes from previous visits when predicting future ICD codes. The decoder
then predicts ICD codes one after another, using the diagnostic ICD codes
it has already predicted to predict the next ones. Furthermore,
TransformEHR includes the date of each visit to integrate temporal information,
whereas previous transformer-based predictive models only included
the sequential order of visits13–16. The specific date of each visit is an important
feature in predictive modeling because the importance of a predictor in a visit can
vary over time1,25–27.
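The code-by-code generation described above can be illustrated with a toy greedy decoding loop. Everything here is an assumption made for illustration: the patient history, the vocabulary, the `<eos>` stop token, and especially `toy_score`, a hand-written recency-weighted count that stands in for a trained decoder with cross-attention. In a trained model the codes generated so far would also change the scores of the remaining codes; in this sketch they are used only to avoid repeats.

```python
from datetime import date

# Hypothetical history: (visit date, ICD-10 codes recorded at that visit).
history = [
    (date(2020, 1, 15), ["E66.9", "I10"]),
    (date(2021, 6, 2), ["E11.9", "I10"]),
]

# Hypothetical output vocabulary, including a stop token.
VOCAB = ["E11.9", "E66.9", "E78.5", "I10", "<eos>"]

def recency_weight(visit_date, predict_date):
    """Give recent visits more weight than old ones (uses the actual dates,
    not just the order of the visits)."""
    years = (predict_date - visit_date).days / 365.25
    return 1.0 / (1.0 + years)

def toy_score(history, candidate, predict_date, eos_score=0.5):
    """Stand-in for a trained decoder: recency-weighted count of how often
    the candidate code appears in past visits. `<eos>` gets a fixed score."""
    if candidate == "<eos>":
        return eos_score
    return sum(recency_weight(d, predict_date)
               for d, codes in history if candidate in codes)

def greedy_decode(history, predict_date, max_codes=10):
    """Emit one code at a time until `<eos>` wins or max_codes is reached."""
    generated = []
    while len(generated) < max_codes:
        candidates = [c for c in VOCAB if c not in generated]
        best = max(candidates, key=lambda c: toy_score(history, c, predict_date))
        if best == "<eos>":
            break
        generated.append(best)
    return generated

predicted = greedy_decode(history, date(2022, 1, 1))
```

Because the weights depend on the elapsed time between each visit and the prediction date, moving the same visits further into the past lowers their scores, which is the kind of temporal effect that sequential order alone cannot express.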
We evaluated TransformEHR for a broad range of disease and
outcome predictions. In addition to predictions of ICD codes, we
evaluated TransformEHR on two challenging and clinically important
disease and outcome prediction tasks: pancreatic cancer prediction
and intentional self-harm prediction among PTSD patients. In summary, our key contributions are as follows:
First, we propose a new pretraining objective that predicts all
diseases or outcomes of a future visit using longitudinal information
from the previous visits. Such a pretraining objective helps TransformEHR uncover the complex interrelations among different diseases
and outcomes.
Second, this is the first study to explore a generative encoder-decoder framework to predict patients’ ICD codes using their longitudinal EHRs. Our encoder-decoder framework outperformed the
encoder-based models in part due to the decoder self-attention and
cross-attention mechanisms. TransformEHR outperformed SOTA
BERT models on both common and uncommon ICD code predictions.
In particular, the improvements for uncommon ICD code predictions
were substantial.
Third, TransformEHR achieved a positive predictive value (PPV) of
8.8% for prediction of intentional self-harm among the 10% of PTSD
patients with the highest predicted risk. A recent study has shown that a practical suicide prevention tool must achieve above 1.7% (...truncated)