CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images
www.nature.com/scientificdata
Data Descriptor
Nicolás Gaggion1, Candelaria Mosquera2,3, Lucas Mansilla1, Julia Mariel Saidman4,
Martina Aineseder4, Diego H. Milone1 & Enzo Ferrante1 ✉
The development of successful artificial intelligence models for chest X-ray analysis relies on large,
diverse datasets with high-quality annotations. While several databases of chest X-ray images
have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical
segmentation labels. To address this gap, we introduce an extensive chest X-ray multi-center
segmentation dataset with uniform and fine-grained anatomical annotations for images coming from
five well-known publicly available databases: ChestX-ray8, CheXpert, MIMIC-CXR-JPG, Padchest,
and VinDr-CXR, resulting in 657,566 segmentation masks. Our methodology utilizes the HybridGNet
model to ensure consistent and high-quality segmentations across all datasets. Rigorous validation,
including expert physician evaluation and automatic quality control, was conducted to validate the
resulting masks. Additionally, we provide individualized quality indices per mask and an overall quality
estimation per dataset. This dataset serves as a valuable resource for the broader scientific community,
streamlining the development and assessment of innovative methodologies in chest X-ray analysis.
Background & Summary
Chest radiography is a pivotal imaging technique used to diagnose a variety of lung diseases, including pneumonia, tuberculosis, and lung cancer. The significant role of chest X-rays (CXR) in clinical practice is ascribed
to their non-invasive nature, relatively low cost, and rapid diagnostic potential. However, the interpretation of
these images poses a considerable challenge due to the intricate and overlapping structures within the thoracic
cavity, and the subtle manifestations of certain pathological conditions. The high demand for chest radiography
and the global shortage of radiologists accentuate the need for efficient and reliable automated analysis systems.
In recent years, methods based on deep learning (DL) have demonstrated exceptional prowess in interpreting medical images, rivaling and occasionally surpassing expert human performance1,2. Convolutional neural
networks (CNN) have been particularly instrumental in facilitating such computer-aided diagnosis (CADx) systems3,4. Nonetheless, the success of these algorithms is closely tethered to the availability of accurately annotated
data, with sufficient quantity and diversity, to train the models.
An essential task within this framework is segmentation - the delineation of specific anatomical structures
or pathological lesions within an image. In the context of CXR, this might involve the demarcation of anatomical structures such as lungs or heart, or the location of disease abnormalities5. Accurate and robust segmentation can serve as a precursor to other downstream tasks, for example providing significant information
about the location and size of specific organs or detected abnormalities. However, manual segmentation is a
time-consuming process that demands substantial expertise and thus does not scale to the large
databases required for DL model training6.
HybridGNet, a deep learning model for realistic organ contouring, offers a solution for the generation of
anatomically plausible CXR segmentations7,8. Utilizing a hybrid approach, it combines conventional convolution
operations for image encoding with graph generative models for the anatomically-guided delineation of organ
contours. The HybridGNet model was initially introduced with a small CXR landmark dataset to demonstrate
its efficacy. In this work, we leverage this model to accomplish our main objective: introducing a large-scale
1Institute for Signals, Systems and Computational Intelligence, sinc(i) CONICET-UNL, Santa Fe, CP 3002, Argentina.
2Health Informatics Department at Hospital Italiano de Buenos Aires, Buenos Aires, CP 1199, Argentina. 3Universidad
Tecnológica Nacional, Buenos Aires, CP 1179, Argentina. 4Radiology Department, Hospital Italiano de Buenos Aires,
Buenos Aires, CP 1199, Argentina. ✉e-mail:
Scientific Data | (2024) 11:511 | https://doi.org/10.1038/s41597-024-03358-1
Fig. 1 Data processing flowchart depicting the main steps involved in the building of the CheXmask dataset.
[Flowchart stages. Data Preparation: input dataset, inclusion/exclusion criteria, study dataset, image preprocessing, preprocessed dataset. Data Processing: landmark-based segmentation via HybridGNet, landmark and mask annotations, quality assessment of masks via RCA framework, annotated dataset with automatic quality assessment. Technical Validation: stratified histogram sampling of masks, manual image segmentation using LabelStudio, statistical validation of the RCA results, annotated dataset with physician-validated quality assessment.]
segmentation dataset, named CheXmask, which provides anatomical masks with their corresponding quality index for five extensive chest X-ray databases: ChestX-ray8⁹, CheXpert², MIMIC-CXR-JPG¹⁰, Padchest¹¹ and
VinDr-CXR¹². These databases collectively represent a wide variety of geographical locations, patient demographics, and disease spectra, enabling the development of a broad, diverse segmentation dataset.
As the original databases lack manually curated ground-truth segmentations, we perform quality control by
implementing our own Reverse Classification Accuracy (RCA) framework13. RCA makes it possible to estimate the accuracy of a segmentation method on an individual image without ground-truth (GT) masks, which is particularly
valuable for large-scale image analysis studies like ours. The fundamental concept behind RCA involves training
an auxiliary model (known as the reverse classifier) solely on the individual image, using its predicted segmentation as pseudo-GT. This model is then evaluated on a reference database that contains GT data to obtain a performance metric, which is expected to correlate with the performance that would be measured for the individual
image if its GT were available. We validated this method by comparing it to traditional performance evaluation on
a subset of test images with masks manually segmented by an expert physician. Additionally, since large public
CXR databases built from automatic analysis of electronic health records (EHR) are subject to errors both in
image selection and image annotation, we found that RCA is a useful tool to detect out-of-distribution samples
(e.g. poor-quality images). Thus, the RCA metrics for HybridGNet segmentations stand out as a powerful quality
metric for handling large databases in downstream tasks, detecting not only low-quality segmentation masks,
but also images that should be filtered out.
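The RCA idea described above can be illustrated with a minimal sketch. It is not the implementation used to build CheXmask: the reverse classifier here is a deliberately simple stand-in (a nearest-class-mean rule on pixel intensity), the Dice coefficient is assumed as the performance metric, and taking the best score over the reference set is assumed as the aggregation rule. All function names are hypothetical.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice overlap between two binary masks (defined as 1.0 if both are empty)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def fit_reverse_classifier(image, pseudo_gt):
    """Fit a toy reverse classifier on ONE image, using its predicted mask as pseudo-GT.

    Assumption: the 'model' is just the mean intensity inside and outside the mask.
    A real RCA setup would use a far stronger model.
    """
    image = np.asarray(image, dtype=float)
    fg = np.asarray(pseudo_gt, dtype=bool)
    if fg.all() or not fg.any():
        return None  # degenerate mask: no foreground/background contrast to learn
    return float(image[fg].mean()), float(image[~fg].mean())


def apply_reverse_classifier(model, image):
    """Label each pixel with the class whose mean intensity is closest."""
    fg_mean, bg_mean = model
    image = np.asarray(image, dtype=float)
    return np.abs(image - fg_mean) < np.abs(image - bg_mean)


def rca_score(image, predicted_mask, reference_images, reference_masks):
    """Estimate segmentation quality for an image that has no ground truth.

    The reverse classifier is trained on (image, predicted_mask) and then scored
    against reference images that do have GT masks. The best Dice over the
    reference set is returned (an assumed aggregation rule).
    """
    model = fit_reverse_classifier(image, predicted_mask)
    if model is None:
        return 0.0
    scores = [
        dice(apply_reverse_classifier(model, ref_img), ref_gt)
        for ref_img, ref_gt in zip(reference_images, reference_masks)
    ]
    return max(scores, default=0.0)
```

In this sketch, a predicted mask that matches the true structure yields a reverse classifier that transfers well to the reference images and therefore a high score, whereas a wrong mask yields a classifier that fails on the reference set and a low score. Thresholding such a score is one way the per-mask quality index could be used to filter masks or images before a downstream task.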
Our comprehensive analysis underscores the capa (...truncated)