Multimodal masked siamese network improves chest X-ray representation learning

https://www.nature.com/articles/s41598-024-74043-x.pdf

www.nature.com/scientificreports OPEN

Saeed Shurrab1,2, Alejandro Guerra-Manzanares1,2 & Farah E. Shamout1

Self-supervised learning methods for medical images primarily rely on the imaging modality during pretraining. Although such approaches deliver promising results, they do not take advantage of the associated patient or scan information collected within Electronic Health Records (EHR). This study aims to develop a multimodal pretraining approach for chest radiographs that incorporates EHR data as an additional modality during training. We propose to incorporate EHR data during self-supervised pretraining with a Masked Siamese Network (MSN) to enhance the quality of chest radiograph representations. We investigate three types of EHR data: demographics, scan metadata, and inpatient stay information. We evaluate the multimodal MSN on three publicly available chest X-ray datasets, MIMIC-CXR, CheXpert, and NIH-14, using two vision transformer (ViT) backbones, specifically ViT-Tiny and ViT-Small. In assessing the quality of the representations through linear evaluation, our proposed method demonstrates significant improvement compared to vanilla MSN and state-of-the-art self-supervised learning baselines. In particular, our proposed method achieves an improvement of 2% in the Area Under the Receiver Operating Characteristic Curve (AUROC) compared to vanilla MSN and 5% to 8% compared to other baselines, including unimodal ones. Furthermore, our findings reveal that demographic features provide the most significant performance improvement. Our work highlights the potential of EHR-enhanced self-supervised pretraining for medical imaging and opens opportunities for future research to address limitations in existing representation learning methods for other medical imaging modalities, such as neuro-, ophthalmic, and sonar imaging.
Supervised training of deep neural networks requires large amounts of quality annotated data1. This is not always straightforward in applications involving clinical tasks, due to the time, cost, effort, and expertise required to collect labeled data2. Self-supervised learning has recently demonstrated great success in leveraging unlabeled data, such as in natural language processing3 and computer vision4. Such frameworks aim to learn useful underlying representations during pretraining, without any labels, which are then used in downstream prediction tasks via supervised linear evaluation.

Considering the state-of-the-art performance of self-supervised pretraining with large unlabeled data compared to end-to-end supervised learning, a plethora of recent applications in healthcare have sought to harness the power of self-supervised learning by focusing on a specific type of data, usually a single modality5. For example, Xie et al.6 applied spatial augmentations for 3D image segmentation, while Azizi et al.7 applied transformations to Chest X-Ray (CXR) images and dermatology images to predict radiology labels and skin conditions, respectively. Zhang et al.8 preserved time-frequency consistency of time-series data for several tasks, such as detection of epilepsy, while Kiyasseh et al.9 leveraged electrocardiogram signals to learn patient-specific representations for the classification of cardiac arrhythmia.

Self-supervised learning methods learn task-agnostic feature representations using hand-crafted pretext tasks or joint embedding architectures10,11. Hand-crafted pretext tasks rely on the use of pseudo-labels generated from unlabeled data. Examples of such tasks include rotation prediction12, jigsaw puzzle solving13, colorization14, and in-painting15.
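To make the notion of pseudo-labels concrete, the following is a minimal sketch of how a rotation-prediction pretext task can derive labels from unlabeled images. It is an illustration only and not part of the method proposed in this work: the function name `make_rotation_batch`, the random stand-in images, and the restriction to square grayscale inputs are all assumptions made for the example.

```python
import numpy as np


def make_rotation_batch(images, rng):
    """Build a rotation-prediction batch from unlabeled images (hypothetical helper).

    images: array of shape (N, H, W) with H == W, so rotations preserve the shape.
    Returns (rotated, labels), where label k means a rotation of k * 90 degrees.
    """
    # Draw one pseudo-label per image from {0, 1, 2, 3}; no human annotation is used.
    labels = rng.integers(0, 4, size=len(images))
    # Apply the rotation that corresponds to each pseudo-label.
    rotated = np.stack(
        [np.rot90(img, k=int(k)) for img, k in zip(images, labels)]
    )
    return rotated, labels


rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))  # random stand-in for unlabeled grayscale images
rotated, labels = make_rotation_batch(images, rng)
```

A classifier trained to predict `labels` from `rotated` would receive its supervision signal entirely from the data itself, which is the defining property of a hand-crafted pretext task.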
Joint embedding methods utilize siamese networks16 to learn useful representations by discriminating between different views of samples based on a specific objective function11, without the need for human annotation or pseudo-labels. Joint embedding methods can be further categorized into contrastive and non-contrastive methods, where the latter encompasses clustering, distillation, and information maximization methods11. Contrastive methods learn representations by maximizing the agreement between positive pairs and minimizing the agreement between negative pairs17. Some prominent examples include SimCLR18, contrastive predictive coding17, and MoCo19. Non-contrastive methods focus on optimizing different forms of similarity metrics across the learned embeddings. Examples include BYOL20, SimSiam21, and VICReg11. Although most existing work considers convolutional networks as the backbone of input encoders, recent approaches explore the role of vision transformers (ViT)22 for self-supervision, such as DINO23 and MSN24. MSN is a state-of-the-art self-supervised learning architecture that operates on the principle of mask-denoising, without reconstruction, as well as transformation invariance with transformers. MSN has seen limited application in healthcare-related tasks, yet it is promising given its computational scalability.

1New York University Abu Dhabi, Computer Engineering, Abu Dhabi 129188, UAE. 2These authors contributed equally: Saeed Shurrab and Alejandro Guerra Manzanares.
Scientific Reports | (2024) 14:22516 | https://doi.org/10.1038/s41598-024-74043-x

Self-supervised learning methods have shown great promise in learning representations of different types of medical images, such as computed tomography and magnetic resonance imaging2,25,26, optical coherence tomography and fundus photography27–29, and endoscopy images30. Several studies investigated self-supervised learning for applications involving CXR images.
For example, Sowrirajan et al.31, Chen et al.32, and Sriram et al.33 utilized MoCo19 as a pretraining strategy for chest disease diagnosis and prognosis tasks. Azizi et al.7 showed that initializing SimCLR18 with ImageNet weights during pretraining improves downstream performance in CXR classification. Van et al.34 explored the impact of various image augmentation techniques on siamese representation learning for CXR.

Medical data is inherently multimodal, encompassing various modalities, such as medical images, Electronic Health Records (EHR), clinical notes, and omics35. In clinical practice, health professionals rely on several sources of information, including patient history, laboratory results, vital-sign measurements, and imaging exams, to make a diagnosis or treatment decisions and to enhance their understanding of various diseases36,37. Mimicking this approach in model training allows the model to learn from the diverse types of information that clinicians use, potentially leading to better representation learning38–40. Additionally, EHR data can provide contextual information that enhances the features extracted from imaging data. For example, demographic information can help the model learn age- or gender-speci (...truncated)