CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

https://www.nature.com/articles/s41597-024-03358-1.pdf

Data Descriptor, Scientific Data (www.nature.com/scientificdata), OPEN

Nicolás Gaggion¹, Candelaria Mosquera²,³, Lucas Mansilla¹, Julia Mariel Saidman⁴, Martina Aineseder⁴, Diego H. Milone¹ & Enzo Ferrante¹ ✉

The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive chest X-ray multi-center segmentation dataset with uniform and fine-grain anatomical annotations for images coming from five well-known publicly available databases: ChestX-ray8, CheXpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 657,566 segmentation masks. Our methodology utilizes the HybridGNet model to ensure consistent and high-quality segmentations across all datasets. Rigorous validation, including expert physician evaluation and automatic quality control, was conducted to validate the resulting masks. Additionally, we provide individualized quality indices per mask and an overall quality estimation per dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis.

Background & Summary

Chest radiography is a pivotal imaging technique used to diagnose a variety of lung diseases, including pneumonia, tuberculosis, and lung cancer. The significant role of chest X-rays (CXR) in clinical practice is ascribed to their non-invasive nature, relatively low cost, and rapid diagnostic potential.
However, the interpretation of these images poses a considerable challenge due to the intricate and overlapping structures within the thoracic cavity, and the subtle manifestations of certain pathological conditions. The high demand for chest radiography and the global shortage of radiologists accentuate the need for efficient and reliable automated analysis systems.

In recent years, methods based on deep learning (DL) have demonstrated exceptional prowess in interpreting medical images, rivaling and occasionally surpassing expert human performance¹,². Convolutional neural networks (CNN) have been particularly instrumental in facilitating such computer-aided diagnosis (CADx) systems³,⁴. Nonetheless, the success of these algorithms is closely tethered to the availability of accurately annotated data, with sufficient quantity and diversity, to train the models.

An essential task within this framework is segmentation: the delineation of specific anatomical structures or pathological lesions within an image. In the context of CXR, this might involve the demarcation of anatomical structures such as lungs or heart, or the location of disease abnormalities⁵. Accurate and robust segmentation can serve as a precursor to other downstream tasks, for example providing significant information about the location and size of specific organs or detected abnormalities. However, manual segmentation is a time-consuming process, demanding substantial expertise, and thus, does not scale well to the size of large databases required for DL model training⁶.

HybridGNet, a deep learning model for realistic organ contouring, offers a solution for the generation of anatomically plausible CXR segmentations⁷,⁸. Utilizing a hybrid approach, it combines conventional convolution operations for image encoding with graph generative models for the anatomically-guided delineation of organ contours.
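Because a contour-based model outputs organ outlines rather than pixel labels, a dense mask has to be derived from the contour. The following is a minimal sketch of that conversion, not the authors' implementation: it assumes the contour is an ordered, closed polygon of (x, y) landmarks in pixel coordinates, and it fills the polygon with the even-odd (ray casting) rule. The function name `contour_to_mask` and its arguments are hypothetical.

```python
import numpy as np


def contour_to_mask(contour, height, width):
    """Rasterize a closed polygon into a binary mask.

    contour: sequence of (x, y) landmarks, x = column, y = row (assumed order).
    A pixel is set when its center lies inside the polygon (even-odd rule).
    """
    contour = np.asarray(contour, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel center
    py = ys + 0.5  # y coordinate of each pixel center

    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = contour[:, 0], contour[:, 1]
    x1, y1 = np.roll(x0, -1), np.roll(y0, -1)  # next vertex, wrapping around

    for ax, ay, bx, by in zip(x0, y0, x1, y1):
        if ay == by:
            continue  # a horizontal edge never crosses a horizontal ray
        # Rows whose pixel centers lie between the two edge endpoints.
        crosses = (ay > py) != (by > py)
        # x position where the edge intersects each row of pixel centers.
        x_at_y = ax + (py - ay) * (bx - ax) / (by - ay)
        # Toggle pixels located to the left of the intersection point.
        inside ^= crosses & (px < x_at_y)

    return inside.astype(np.uint8)
```

For example, `contour_to_mask([(2, 2), (6, 2), (6, 6), (2, 6)], 8, 8)` fills the square spanned by those four landmarks. Evaluating at pixel centers is one design choice among several; library rasterizers may treat boundary pixels differently.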
The HybridGNet model was initially introduced with a small CXR landmark dataset to demonstrate its efficacy. In this work, we leverage this model to accomplish our main objective: introducing a large-scale segmentation dataset, named CheXmask, which provides anatomical masks with their corresponding quality index, for 5 extensive chest X-ray databases: ChestX-ray8⁹, CheXpert², MIMIC-CXR-JPG¹⁰, Padchest¹¹ and VinDr-CXR¹². These databases collectively represent a wide variety of geographical locations, patient demographics, and disease spectra, enabling the development of a broad, diverse segmentation dataset.

¹Institute for Signals, Systems and Computational Intelligence, sinc(i) CONICET-UNL, Santa Fe, CP 3002, Argentina. ²Health Informatics Department at Hospital Italiano de Buenos Aires, Buenos Aires, CP 1199, Argentina. ³Universidad Tecnológica Nacional, Buenos Aires, CP 1179, Argentina. ⁴Radiology Department, Hospital Italiano de Buenos Aires, Buenos Aires, CP 1199, Argentina. ✉e-mail:

Scientific Data | (2024) 11:511 | https://doi.org/10.1038/s41597-024-03358-1

Fig. 1 Data processing flowchart depicting the main steps involved in the building of the CheXmask dataset. Stages shown: Data Preparation (input dataset, inclusion/exclusion criteria, study dataset, image preprocessing, preprocessed dataset); Data Processing (landmark-based segmentation via HybridGNet, landmark and mask annotations, quality assessment of masks via RCA framework, annotated dataset with automatic quality assessment); Technical Validation (stratified histogram sampling of masks, manual image segmentation using LabelStudio, statistical validation of the RCA results, annotated dataset with physician-validated quality assessment).

As the original databases lack manually curated ground-truth segmentations, we perform quality control by implementing our own Reverse Classification Accuracy (RCA) framework¹³.
RCA makes it possible to estimate the accuracy of a segmentation method on an individual image for which no ground-truth (GT) mask is available, which is particularly valuable for large-scale image analysis studies like ours. The fundamental concept behind RCA is to train an auxiliary model (known as the reverse classifier) solely on the individual image, using its predicted segmentation as pseudo-GT. This model is then evaluated on a reference database that contains GT data to obtain a performance metric, which is expected to correlate with the performance that would be measured for the individual image if its GT were available. We validated this method by comparing it to traditional performance evaluation on a subset of test images with masks manually segmented by an expert physician. Additionally, since large public CXR databases built from automatic analysis of electronic health records (EHR) are subject to errors in both image selection and image annotation, we found that RCA is a useful tool for detecting out-of-distribution samples (e.g. poor-quality images). Thus, the RCA metrics for HybridGNet segmentations serve as a powerful quality measure for handling large databases in downstream tasks, detecting not only low-quality segmentation masks but also images that should be filtered out. Our comprehensive analysis underscores the capa (...truncated)
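The RCA procedure described above can be illustrated with a toy sketch. It is not the authors' framework, and it rests on several assumptions: the reverse classifier is replaced by a simple nearest-mean intensity classifier, the performance metric is the Dice coefficient, and the final score is the best Dice over the reference set. All function names (`dice`, `fit_reverse_classifier`, `rca_score`, `toy_xray`) are hypothetical.

```python
import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def fit_reverse_classifier(image, pseudo_gt):
    """Fit a toy classifier on ONE image, using its predicted mask as pseudo-GT.

    Stand-in for a real reverse classifier: it stores the mean intensity of
    foreground and background and labels pixels by the nearer mean.
    """
    fg = image[pseudo_gt.astype(bool)]
    bg = image[~pseudo_gt.astype(bool)]
    if fg.size == 0 or bg.size == 0:
        return None  # degenerate mask: nothing to learn from
    mu_fg, mu_bg = fg.mean(), bg.mean()

    def predict(ref_image):
        return np.abs(ref_image - mu_fg) < np.abs(ref_image - mu_bg)

    return predict


def rca_score(image, predicted_mask, reference_images, reference_masks):
    """Estimate segmentation quality for one image without its ground truth."""
    predict = fit_reverse_classifier(image, predicted_mask)
    if predict is None:
        return 0.0
    # Evaluate on reference images that DO have GT; keep the best score
    # (taking the maximum is an assumption of this sketch).
    return max(
        dice(predict(ref_img), ref_gt)
        for ref_img, ref_gt in zip(reference_images, reference_masks)
    )


def toy_xray(top, left, size=16, side=8):
    """Synthetic image: a bright square 'organ' on a dark background."""
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[top:top + side, left:left + side] = 1
    image = np.where(mask == 1, 0.8, 0.2)
    return image, mask
```

With `image, true_mask = toy_xray(4, 4)` and a reference pair from `toy_xray(2, 6)`, passing `true_mask` as the prediction should score high, while a mask that misses the bright square entirely should score low. A real reverse classifier needs far more capacity than this intensity rule, which cannot detect subtler errors such as a slightly shifted mask.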