Multimodal masked siamese network improves chest X-ray representation learning
www.nature.com/scientificreports
Saeed Shurrab1,2, Alejandro Guerra-Manzanares1,2 & Farah E. Shamout1
Self-supervised learning methods for medical images primarily rely on the imaging modality during
pretraining. Although such approaches deliver promising results, they do not take advantage of the
associated patient or scan information collected within Electronic Health Records (EHR). This study
aims to develop a multimodal pretraining approach for chest radiographs that incorporates EHR data
as an additional modality during training. We propose to incorporate EHR data
during self-supervised pretraining with a Masked Siamese Network (MSN) to enhance the quality of
chest radiograph representations. We investigate three types of EHR data, including demographic,
scan metadata, and inpatient stay information. We evaluate the multimodal MSN on three publicly
available chest X-ray datasets, MIMIC-CXR, CheXpert, and NIH-14, using two vision transformer
(ViT) backbones, specifically ViT-Tiny and ViT-Small. In assessing the quality of the representations
through linear evaluation, our proposed method demonstrates significant improvement compared
to vanilla MSN and state-of-the-art self-supervised learning baselines. In particular, our proposed
method achieves an improvement of 2% in the Area Under the Receiver Operating Characteristic
Curve (AUROC) compared to vanilla MSN and 5% to 8% compared to other baselines, including unimodal ones. Furthermore, our findings reveal that demographic features provide the most significant
performance improvement. Our work highlights the potential of EHR-enhanced self-supervised
pretraining for medical imaging and opens opportunities for future research to address limitations
in existing representation learning methods for other medical imaging modalities, such as neuro-,
ophthalmic, and sonar imaging.
Supervised training of deep neural networks requires large amounts of quality annotated data1. This is not
always straightforward in applications involving clinical tasks, due to the time, cost, effort, and expertise
required to collect labeled data2. Self-supervised learning has recently demonstrated great success in leveraging
unlabeled data, such as in natural language processing3 and computer vision4. Such frameworks aim to learn
useful underlying representations during pretraining, without any labels, which are then used in downstream
prediction tasks via supervised linear evaluation.
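The linear evaluation protocol mentioned above can be illustrated with a minimal sketch: the pretrained encoder is kept frozen and only a linear classifier is fitted on its output features. Everything in the block is an assumption made for illustration. `frozen_encoder` is a hypothetical stand-in for a pretrained backbone, the data are synthetic, and the probe is a plain logistic regression trained by gradient descent rather than the setup used in this study.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder; it is never updated."""
    # Fixed feature map: two hand-picked projections of the 2-D input.
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen features by gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y                     # gradient of the log-loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * f[j] / n
            grad_b += err / n
        # Only the probe parameters (w, b) change; the encoder stays fixed.
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, f):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Synthetic "downstream task": the label depends on the sign of x0 + x1.
random.seed(0)
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

features = [frozen_encoder(x) for x in inputs]  # computed once, encoder frozen
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the frozen features are, which is why this protocol is commonly used to compare pretraining strategies.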
Considering the state-of-the-art performance of self-supervised pretraining with large unlabeled data
compared to end-to-end supervised learning, a plethora of recent applications in healthcare sought to harness
the power of self-supervised learning by focusing on a specific type of data, usually a single modality5. For
example, Xie et al.6 applied spatial augmentations for 3D image segmentation, while Azizi et al.7 applied
transformations to Chest X-Ray (CXR) images and dermatology images to predict radiology labels and skin
conditions, respectively. Zhang et al.8 preserved time-frequency consistency of time-series data for several tasks,
such as detection of epilepsy, while Kiyasseh et al.9 leveraged electrocardiogram signals to learn patient-specific
representations for the classification of cardiac arrhythmia.
Self-supervised learning methods learn task-agnostic feature representations using hand-crafted pretext tasks
or joint embedding architectures10,11. Hand-crafted pretext tasks rely on the use of pseudo-labels generated from
unlabeled data. Examples of such tasks include rotation prediction12, jigsaw puzzle solving13, colorization14,
and in-painting15. Joint embedding methods utilize siamese networks16 to learn useful representations by
discriminating between different views of samples based on a specific objective function11, without the need for
human annotation or pseudo-labels. Joint embedding methods can be further categorized into contrastive and
non-contrastive methods, where the latter encompasses clustering, distillation, and information maximization
methods11. Contrastive methods learn representations by maximizing the agreement between positive pairs and
minimizing the agreement between negative pairs17. Some prominent examples include SimCLR18, contrastive
predictive coding17, and MoCo19. Non-contrastive methods focus on optimizing different forms of similarity
metrics across the learned embeddings. Examples include BYOL20, SimSiam21, and VICReg11. Although most
existing work considers convolutional networks as the backbone of input encoders, recent approaches explore
the role of vision transformers (ViT)22 for self-supervision, such as DINO23 and MSN24. MSN is a state-of-the-art
self-supervised learning architecture that operates on the principle of mask-denoising, without reconstruction,
as well as transformation invariance with transformers. MSN has so far seen limited application in healthcare-related
tasks, yet it is promising considering its computational scalability.
1New York University Abu Dhabi, Computer Engineering, Abu Dhabi 129188, UAE. 2These authors contributed
equally: Saeed Shurrab and Alejandro Guerra Manzanares. email:
Scientific Reports | (2024) 14:22516 | https://doi.org/10.1038/s41598-024-74043-x
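To make the contrastive objective described above concrete, the following is a minimal sketch of an InfoNCE-style loss, the form introduced with contrastive predictive coding. It is an illustration under stated assumptions, not the loss used by MSN or by this study: the function names, the toy 2-D embeddings, and the temperature value of 0.1 are all hypothetical. The loss is small when the anchor is more similar to its positive than to the negatives, and large otherwise.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor.

    Computes -log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) ),
    where s_* are cosine similarities and t is the temperature.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - pos_logit


# Toy embeddings (hypothetical): two views of the same sample should be close.
anchor = [1.0, 0.0]
aligned_loss = info_nce_loss(anchor, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce_loss(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

Minimizing this quantity pulls positive pairs together and pushes negative pairs apart in the embedding space. Non-contrastive methods, by contrast, dispense with explicit negatives.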
Self-supervised learning methods have shown great promise in learning representations of different types
of medical images, such as computed tomography and magnetic resonance imaging2,25,26, optical coherence
tomography and fundus photography27–29, and endoscopy images30. Several studies investigated self-supervised
learning for applications involving CXR images. For example, Sowrirajan et al.31, Chen et al.32, and Sriram
et al.33 utilized MoCo19 as a pretraining strategy for chest disease diagnosis and prognosis tasks. Azizi et al.7
showed that the initialization of SimCLR18 during pretraining with ImageNet weights improves downstream
performance in CXR classification. Van et al.34 explored the impact of various image augmentation techniques
on siamese representation learning for CXR.
Medical data is inherently multimodal, encompassing various
types of modalities, such as medical images, Electronic Health Records (EHR), clinical notes, and omics35. In
clinical practice, health professionals rely on several sources of information including patient history, laboratory
results, vital-sign measurements, and imaging exams to make a diagnosis or treatment decisions and to enhance
their understanding of various diseases36,37. Mimicking this approach in model training allows the model to
learn from the diverse types of information that clinicians use, potentially leading to better representation
learning38–40. Additionally, EHR data can provide contextual information that enhances the features extracted
from imaging data. For example, demographic information can help the model learn age- or gender-speci (...truncated)