TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records

Article
https://doi.org/10.1038/s41467-023-43715-z
Nature Communications | (2023) 14:7857

Received: 3 May 2023
Accepted: 17 November 2023

Zhichao Yang (1), Avijit Mitra (1), Weisong Liu (2,3), Dan Berlowitz (3,4) & Hong Yu (1,2,3,5)

(1) College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, USA. (2) School of Computer & Information Sciences, University of Massachusetts Lowell, Lowell, MA, USA. (3) Center for Healthcare Organization and Implementation Research, VA Bedford Health Care System, Bedford, MA, USA. (4) Department of Public Health, University of Massachusetts Lowell, Lowell, MA, USA. (5) Center for Biomedical and Health Research in Data Sciences, University of Massachusetts Lowell, Lowell, MA, USA.

Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown great success in predicting clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder transformer model that is pretrained using a new pretraining objective: predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR's encoder-decoder framework, paired with the novel pretraining objective, helps it achieve new state-of-the-art performance on multiple clinical prediction tasks. Compared with the previous model, TransformEHR improves area under the precision–recall curve by 2% (p < 0.001) for pancreatic cancer onset and by 24% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data.

The widespread adoption of electronic health records (EHRs) among US hospitals has led to the development and adoption of numerous data mining and statistical techniques for EHRs. Longitudinal EHRs have been successfully used to predict clinical diseases or outcomes [1–4]. Early work applied regression and traditional machine learning (ML) based models (e.g., support vector machines, random forest, and gradient boosting) to predict a single disease or outcome. Examples include congestive heart failure [5], sepsis mortality [6], mechanical ventilation [6], septic shock [7], type 2 diabetes [8], and development of posttraumatic stress disorder (PTSD) [9], among others.

With the availability of large cohorts and computational resources, deep learning based models can outperform traditional ML models [10–16]. State-of-the-art (SOTA) models in EHR-based predictive modeling achieved this through the pretrain-finetune paradigm, a two-step process in which the model is first trained on large-scale longitudinal EHRs to learn the representations of clinical features such as International Classification of Diseases (ICD) codes (pretrain) and then further trained to adapt to specific tasks, e.g., outcome prediction (finetune). Models such as Med-BERT [13], BEHRT [14], and BRLTM [15] fall in this category. However, their pretraining objectives were limited to predicting a fraction of ICD codes within each visit. In reality, most patients have multiple diseases or outcomes at once [17], many of which are highly correlated (such as obesity, diabetes, and hypertension [18–20]) and thus collectively contribute to the disease or outcome trajectories.
Therefore, a novel pretraining strategy, which predicts the complete set of diseases and outcomes within a visit, might improve clinical predictive modeling. In this study, we propose TransformEHR, an innovative denoising sequence-to-sequence transformer [21] model that was pretrained on 6.5 million patients' EHRs to predict the complete set of ICD codes of a visit. TransformEHR can be further finetuned for single disease or outcome predictions.

Unlike previous EHR-based models [13–16], which rely on the bidirectional (left-to-right and right-to-left) encoder representation from transformers (BERT) framework [22], TransformEHR uses a transformer-based encoder-decoder generative framework to predict future ICD codes during pretraining. The unidirectional (left-to-right) decoder in such an encoder-decoder framework matches the use case of predicting future diseases or outcomes from a history of past diseases or outcomes (past-to-future) more closely than the bidirectional encoder-only framework does. Although the encoder-decoder framework was originally designed to generate the next sentence given previous sentences as context [23,24], we repurposed the framework for TransformEHR to generate the ICD codes of the next visit given previous EHRs (context). TransformEHR can use cross-attention [21] to identify relevant ICD codes from previous visits when predicting future ICD codes. The decoder then predicts ICD codes one after another, using the diagnostic ICD codes it has already predicted to predict the next ones.

Furthermore, TransformEHR includes the date of each visit to integrate temporal information, whereas previous transformer-based predictive models only included the sequential order of visits [13–16]. The specific date of each visit is an important feature in predictive modeling, as the importance of a predictor in a visit can vary over time [1,25–27].

We evaluated TransformEHR for a broad range of disease and outcome predictions.
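To make the pretraining setup described above concrete, the following is a minimal sketch of how one patient's history could be turned into a pretraining example: the encoder sees the ICD codes of previous visits together with their dates, and the decoder is asked to produce the complete code set of the next visit, one code at a time. The names `Visit`, `build_pretraining_example`, and `decoder_steps` are hypothetical, the ICD-10 codes are illustrative only, and the way dates are turned into model features is not specified here, so dates are simply carried alongside each code. This is an illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Visit:
    """One visit: its calendar date and the ICD codes recorded at it (hypothetical type)."""
    visit_date: date
    icd_codes: List[str]


def build_pretraining_example(
    visits: List[Visit],
) -> Tuple[List[Tuple[str, date]], List[str]]:
    """Split a patient's visits into (encoder input, decoder target).

    Encoder input: every ICD code from the earlier visits, paired with its visit date.
    Decoder target: the complete set of ICD codes of the most recent visit.
    """
    if len(visits) < 2:
        raise ValueError("need at least one past visit and one future visit")
    # Order visits chronologically; the last one plays the role of the future visit.
    *history, future = sorted(visits, key=lambda v: v.visit_date)
    encoder_input = [(code, v.visit_date) for v in history for code in v.icd_codes]
    decoder_target = list(future.icd_codes)
    return encoder_input, decoder_target


def decoder_steps(target: List[str]) -> List[Tuple[List[str], str]]:
    """List the (codes generated so far, next code) pairs of left-to-right decoding."""
    return [(target[:i], target[i]) for i in range(len(target))]


if __name__ == "__main__":
    patient = [
        Visit(date(2020, 1, 1), ["E11.9", "E66.9"]),  # illustrative codes only
        Visit(date(2020, 5, 1), ["I10"]),
        Visit(date(2021, 1, 1), ["E11.9", "I10"]),
    ]
    encoder_input, decoder_target = build_pretraining_example(patient)
    print(encoder_input)
    print(decoder_steps(decoder_target))
```

In contrast to an objective that hides only a fraction of the codes within a visit, every code of the held-out visit appears in the decoder target here, and each decoding step is conditioned on the codes already generated.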
In addition to predictions of ICD codes, we evaluated TransformEHR on two challenging and clinically important disease and outcome prediction tasks: pancreatic cancer prediction and intentional self-harm prediction among PTSD patients.

In summary, our key contributions are as follows:

First, we propose a new pretraining objective that predicts all diseases or outcomes of a future visit using longitudinal information from the previous visits. Such a pretraining objective helps TransformEHR uncover the complex interrelations among different diseases and outcomes.

Second, this is the first study to explore a generative encoder-decoder framework to predict patients' ICD codes using their longitudinal EHRs. Our encoder-decoder framework outperformed the encoder-based models in part due to the decoder self-attention and cross-attention mechanisms. TransformEHR outperformed SOTA BERT models on both common and uncommon ICD code predictions. In particular, the improvements for uncommon ICD code predictions were substantial.

Third, TransformEHR achieved a positive predictive value (PPV) of 8.8% for prediction of intentional self-harm among the top 10% of PTSD patients at high predicted risk. A recent study has shown that a practical suicide prevention tool must achieve above 1.7% (...truncated)