Categorized contrast enhanced mammography dataset for diagnostic and artificial intelligence research
www.nature.com/scientificdata
Data Descriptor
OPEN
Rana Khaled1 ✉, Maha Helal1, Omar Alfarghaly2 ✉, Omnia Mokhtar1, Abeer Elkorany2 ✉,
Hebatalla El Kassas1 & Aly Fahmy2 ✉
Contrast-enhanced spectral mammography (CESM) is a relatively recent imaging modality with
increased diagnostic accuracy compared to digital mammography (DM). Recently developed deep learning (DL)
models achieve accuracies comparable to that of an average radiologist. However, most
studies trained these DL models on DM images because no public datasets exist for CESM images. We aim to resolve
this limitation by releasing a Categorized Digital Database for Low energy and Subtracted Contrast
Enhanced Spectral Mammography images (CDD-CESM) to evaluate decision support systems. The
dataset includes 2006 images, with an average resolution of 2355 × 1315, consisting of 310 mass
images, 48 architectural distortion images, 222 asymmetry images, 238 calcification images, 334
mass enhancement images, 184 non-mass enhancement images, 159 postoperative images, 8 post
neoadjuvant chemotherapy images, and 751 normal images, with 248 images having more than one
finding. This is the first dataset to incorporate data selection, segmentation annotation, medical
reports, and pathological diagnosis for all cases. Moreover, we propose and evaluate a DL-based
technique to automatically segment abnormal findings in images.
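The per-finding counts above add up to more than 2006 images. A minimal sketch of the arithmetic follows, assuming (this is our reading, not stated in the text) that an image with more than one finding is counted once under each of its findings. The variable names are illustrative.

```python
# Per-finding image counts as stated in the abstract.
finding_counts = {
    "mass": 310,
    "architectural distortion": 48,
    "asymmetry": 222,
    "calcification": 238,
    "mass enhancement": 334,
    "non-mass enhancement": 184,
    "postoperative": 159,
    "post neoadjuvant chemotherapy": 8,
    "normal": 751,
}
total_images = 2006          # stated dataset size
multi_finding_images = 248   # stated number of images with more than one finding

# Total number of (image, finding) labels across all categories.
label_total = sum(finding_counts.values())

# Labels in excess of one per image. Under the assumption that multi-finding
# images are counted once per finding, this is the number of "extra" findings.
extra_labels = label_total - total_images

print(label_total, extra_labels, extra_labels == multi_finding_images)
```

Under that reading, the surplus of labels over images equals the stated number of multi-finding images, which would be consistent with each of those images carrying exactly two findings.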
Background & Summary
Digital mammography (DM) is the gold standard imaging modality for early detection of breast cancer.
However, its overall sensitivity decreases in patients with dense breasts1. Contrast-enhanced
spectral mammography (CESM) is a contrast-based digital mammogram that was approved by the Food
and Drug Administration (FDA) in 2011 as an adjunct to DM and ultrasound examinations for
localization and characterization of occult or inconclusive lesions. Dual-energy image acquisition is performed,
in which low and high-energy images are obtained. Several studies have shown that the low-energy images resemble
standard DM images and are non-inferior to them2. High-energy images are non-interpretable; to overcome this, the low and high-energy images are recombined and subtracted after acquisition through appropriate image processing
to suppress the background breast parenchyma. Figure 1 shows the resulting subtracted
images obtained for interpretation, revealing areas of contrast enhancement against a suppressed breast tissue background. Findings can be identified according to their density, morphologic, and enhancement characteristics3.
However, estimating whether a lesion is benign or malignant without review by a radiologist is challenging
due to the significant variation in the lesions’ visual characteristics4.
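The text does not specify the image processing used to recombine the low and high-energy images. As a rough illustration only, the sketch below uses weighted logarithmic subtraction, a commonly described simplification of dual-energy recombination, on simulated pixel values. The attenuation coefficients, the exponential attenuation model, and the function names `simulate` and `recombine` are all assumptions made for this example, not the vendor algorithm.

```python
import math

# Hypothetical attenuation coefficients (arbitrary units), for illustration only.
MU_TISSUE_LOW, MU_TISSUE_HIGH = 0.8, 0.4
MU_IODINE_LOW, MU_IODINE_HIGH = 2.0, 5.0


def simulate(thickness, iodine, mu_tissue, mu_iodine, i0=1000.0):
    """Simulate detector intensities with a simple exponential attenuation model."""
    return [
        [i0 * math.exp(-(mu_tissue * t + mu_iodine * c)) for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(thickness, iodine)
    ]


def recombine(low, high, weight):
    """Weighted log subtraction, pixel by pixel: ln(high) - weight * ln(low)."""
    return [
        [math.log(h) - weight * math.log(l) for l, h in zip(l_row, h_row)]
        for l_row, h_row in zip(low, high)
    ]


# One image row: thin tissue, thick tissue, thick tissue with iodine uptake.
thickness = [[2.0, 4.0, 4.0]]
iodine = [[0.0, 0.0, 0.1]]

low = simulate(thickness, iodine, MU_TISSUE_LOW, MU_IODINE_LOW)
high = simulate(thickness, iodine, MU_TISSUE_HIGH, MU_IODINE_HIGH)

# With weight = mu_tissue_high / mu_tissue_low the tissue term cancels,
# so the result depends only on the iodine content of each pixel.
weight = MU_TISSUE_HIGH / MU_TISSUE_LOW
subtracted = recombine(low, high, weight)
print(subtracted[0])
```

In this toy model the first two pixels, which differ only in tissue thickness, come out with the same value, while the pixel containing iodine stands apart. That is the background suppression effect the paragraph describes, under the stated simplifications.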
Computer-aided detection (CAD) systems were introduced in the early 2000s to help radiologists interpret
mammography images. However, they proved challenging to use in clinical practice because of the high rate of
false positives they marked, which can distract radiologists5. The use of artificial
intelligence (AI) in radiology is still in its early stages. Nonetheless, algorithms that analyze pixel data can distinguish
patterns in images that might not have been previously identified even by expert radiologists6. Deep learning
(DL) shows promise in tasks such as automatically detecting lesions and helping
radiologists provide a more accurate diagnosis. Moreover, new multimodal DL models like the Perceiver7 make
it feasible to train on large datasets and extract good unsupervised image representations that can be used on
a wide range of tasks. However, large, fully annotated datasets are required and will be crucial for
training new DL networks, or fine-tuning existing pre-trained ones, and evaluating them. It is therefore
important for radiologists to understand the impact of these machine-learning (ML) based analytical tools and
recognize how they might soon influence and change radiological practice8.
Fig. 1 (a) Low-energy, (b) High-energy, and (c) Subtracted image.
1Cairo University, National Institute of Cancer, Radiology Department, Cairo, 11796, Egypt. 2Cairo University,
Computers and Artificial Intelligence, Computer Science Department, Cairo, 12613, Egypt. ✉e-mail: r_hkhaled@hotmail.com; ; ;
Scientific Data | (2022) 9:122 | https://doi.org/10.1038/s41597-022-01238-0
To date, only a small number of public mammography datasets have been released, including the
Digital Database for Screening Mammography (DDSM)9, the Image Retrieval in Medical Applications (IRMA)
project10, the Mammographic Imaging Analysis Society (MIAS) database11, and the Curated Breast Imaging
Subset of DDSM (CBIS-DDSM)12. These datasets contain DM images only; none includes CESM images.
In this paper, we present a categorized CESM dataset that provides easily accessible low-energy images with
their corresponding subtracted CESM images, abnormality segmentation annotations, verified medical reports, and
pathological diagnoses for all cases. It is intended to support ongoing advances in DL-based mammography
systems. We also propose a new DL-based technique to automatically segment the abnormal findings in the
images without intervention from radiologists, as segmentation annotation is a time-consuming task.
Methods
We collected and reformatted the data into an easily accessible format. Figure 2 shows the flow diagram of
the dataset preparation process: image preprocessing, manual annotation, and automatic segmentation.
Technique of contrast enhanced mammography examination. CESM is performed using standard
DM equipment with additional software that enables dual-energy image acquisition. Two minutes after
intravenous injection of non-ionic low-osmolar iodinated contrast material (dose: 1.5 mL/kg),
craniocaudal (CC) and mediolateral oblique (MLO) views are obtained. Each view comprises two exposures, one
with low energy (peak kilovoltage values ranging from 26 to 31 kVp) and one with high energy (45 to 49 kVp). A
complete examination is carried out in about 5–6 minutes.
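The acquisition parameters above can be captured in a small configuration object. The sketch below is purely illustrative: the class name `CesmProtocol` and its methods are hypothetical, and only the numeric values (dose per kilogram, delay, kVp ranges, views) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CesmProtocol:
    """Acquisition parameters as stated in the text (names are hypothetical)."""
    dose_ml_per_kg: float = 1.5        # contrast dose
    delay_min: float = 2.0             # wait after injection before imaging
    low_kvp: tuple = (26.0, 31.0)      # low-energy exposure range
    high_kvp: tuple = (45.0, 49.0)     # high-energy exposure range
    views: tuple = ("CC", "MLO")

    def contrast_volume_ml(self, weight_kg: float) -> float:
        """Contrast volume implied by the per-kilogram dose."""
        if weight_kg <= 0:
            raise ValueError("weight_kg must be positive")
        return self.dose_ml_per_kg * weight_kg

    def classify_exposure(self, kvp: float) -> str:
        """Label an exposure by which stated kVp range it falls in."""
        if self.low_kvp[0] <= kvp <= self.low_kvp[1]:
            return "low"
        if self.high_kvp[0] <= kvp <= self.high_kvp[1]:
            return "high"
        return "out of range"

    def exposures_per_breast(self) -> int:
        """Each view comprises one low-energy and one high-energy exposure."""
        return len(self.views) * 2


protocol = CesmProtocol()
print(protocol.contrast_volume_ml(70.0))   # volume for a hypothetical 70 kg patient
print(protocol.classify_exposure(28.0), protocol.classify_exposure(47.0))
print(protocol.exposures_per_breast())
```

The 70 kg weight is an arbitrary example value. The code only restates the arithmetic of the described protocol and is not a substitute for clinical dosing procedures.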
Description of dataset. The dataset is a collection of low-energy images with their corresponding subtracted CESM images gathered from the Radiology Department of the National Cancer Institute, Cairo
University, Egypt, between January 2019 and February 2021. All images are high resolution, with
an average size of 2355 × 1315 pixels. Institutional review board approval was obtained, and informed consent to carry out
the study and publish the data was obtained from the 326 female patients, aged 18 to 90 years. The dataset contains 2006
images with CC and MLO views (1003 low energy images and 1003 subtracted CESM images), samples of low (...truncated)