Benchmarking the Robustness of Semantic Segmentation Models with Respect to Common Corruptions
International Journal of Computer Vision
https://doi.org/10.1007/s11263-020-01383-2
Christoph Kamann¹ · Carsten Rother¹
Received: 14 April 2020 / Accepted: 7 September 2020
© The Author(s) 2020
Abstract
When designing a semantic segmentation model for a real-world application, such as autonomous driving, it is crucial to
understand the robustness of the network with respect to a wide range of image corruptions. While there are recent robustness
studies for full-image classification, we are the first to present an exhaustive study for semantic segmentation, based on
many established neural network architectures. We utilize almost 400,000 images generated from the Cityscapes dataset,
PASCAL VOC 2012, and ADE20K. Based on the benchmark study, we gain several new insights. Firstly, many networks
perform well with respect to real-world image corruptions, such as a realistic PSF blur. Secondly, some architecture properties
significantly affect robustness, such as a Dense Prediction Cell, designed to maximize performance on clean data only. Thirdly,
the generalization capability of semantic segmentation models depends strongly on the type of image corruption. Models
generalize well for image noise and image blur, but not with respect to digitally corrupted data or weather corruptions.
Keywords Semantic segmentation · Corruption robustness · Common image corruptions · Realistic image corruptions
1 Introduction
In recent years, deep convolutional neural networks
(DCNNs) have set the state-of-the-art on a broad range of
computer vision tasks (Krizhevsky et al. 2012; He et al.
2016; Simonyan and Zisserman 2015; Szegedy et al. 2015;
LeCun et al. 1998; Redmon et al. 2016; Chen et al. 2015;
Goodfellow et al. 2016). The performance of CNN models
is generally measured using benchmarks of publicly available datasets, which often consist of clean and post-processed
images (Cordts et al. 2016; Everingham et al. 2010). However, it has been shown that model performance is prone to
image corruptions (Zhou et al. 2017; Vasiljevic et al. 2016;
Hendrycks and Dietterich 2019; Geirhos et al. 2018; Dodge
and Karam 2016; Gilmer et al. 2019; Azulay and Weiss 2019;
Kamann and Rother 2020). Image noise, in particular, decreases
performance significantly.
Communicated by Daniel Scharstein.
✉ Christoph Kamann
Carsten Rother
¹ Visual Learning Lab, HCI/IWR, Heidelberg University, Heidelberg, Germany
Image quality depends on environmental factors such as
illumination and weather conditions, ambient temperature,
and camera motion since they directly affect the optical
and electrical properties of a camera. Image quality is also
affected by optical aberrations of the camera lenses, causing,
e.g., image blur. Thus, in safety-critical applications, such
as autonomous driving, models must be robust towards such
inherently present image corruptions (Hasirlioglu et al. 2016;
Kamann et al. 2017; Janai et al. 2020).
In this work, we present an extensive evaluation of the
robustness of semantic segmentation models towards a broad
range of real-world image corruptions. Here, the term robustness refers to training a model on clean data and then
validating it on corrupted data. We choose the task of
semantic image segmentation for two reasons. Firstly, image
segmentation is often applied in safety-critical applications,
where robustness is essential. Secondly, a rigorous evaluation
for real-world image corruptions has, in recent years, only
been conducted for full-image classification and object detection, e.g., most recently Geirhos et al. (2018), Hendrycks and
Dietterich (2019), and Michaelis et al. (2019).
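The notion of robustness used here (a model trained on clean data, then evaluated on corrupted data) can be illustrated with a minimal sketch. It is not the evaluation code of this study: the `shot_noise` function with its `scale` severity parameter, the helper `evaluate_robustness`, and the thresholding `toy_predict` stand-in for a trained network are all hypothetical names introduced for illustration, and Poisson sampling is assumed as a simple model of shot noise.

```python
import numpy as np


def shot_noise(image, scale=5.0, rng=None):
    """Apply Poisson (shot) noise to a float image with values in [0, 1].

    `scale` is a hypothetical severity knob: smaller values give stronger noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = rng.poisson(image * scale) / scale
    return np.clip(noisy, 0.0, 1.0)


def mean_iou(preds, labels, num_classes):
    """Mean intersection-over-union, accumulated over all images."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for pred, label in zip(preds, labels):
        for c in range(num_classes):
            p, l = pred == c, label == c
            inter[c] += np.logical_and(p, l).sum()
            union[c] += np.logical_or(p, l).sum()
    valid = union > 0  # ignore classes that never occur
    if not valid.any():
        return float("nan")
    return float((inter[valid] / union[valid]).mean())


def evaluate_robustness(predict, images, labels, corrupt, num_classes):
    """Return (clean mIoU, corrupted mIoU) for a fixed, already trained model."""
    clean_preds = [predict(x) for x in images]
    corrupted_preds = [predict(corrupt(x)) for x in images]
    return (mean_iou(clean_preds, labels, num_classes),
            mean_iou(corrupted_preds, labels, num_classes))


# Toy data: the left half of the image is class 0 (dark), the right half class 1 (bright).
label = np.zeros((32, 32), dtype=int)
label[:, 16:] = 1
image = np.where(label[..., None] == 1, 0.8, 0.2) * np.ones((32, 32, 3))


def toy_predict(img):
    """Hypothetical stand-in for a segmentation network: threshold mean intensity."""
    return (img.mean(axis=-1) > 0.5).astype(int)


rng = np.random.default_rng(0)
clean_miou, noisy_miou = evaluate_robustness(
    toy_predict, [image], [label], lambda x: shot_noise(x, rng=rng), num_classes=2)
print(f"clean mIoU: {clean_miou:.3f}, corrupted mIoU: {noisy_miou:.3f}")
```

The model is never retrained or adapted in this protocol; only the input distribution changes, so the gap between the two scores is attributable to the corruption.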
(a) Corrupted validation image (left: noise, right: blur)
(b) Prediction of best-performing architecture on clean image
(c) Prediction of best-performing architecture on corrupted image
(d) Prediction of ablated architecture on the corrupted image
Fig. 1 Results of our ablation study. Here we train the state-of-the-art
semantic segmentation model DeepLabv3+ on clean Cityscapes data
and test it on corrupted data. a A validation image from Cityscapes,
where the left-hand side is corrupted by shot noise and the right-hand
side by defocus blur. b Prediction of the best-performing model-variant
on the corresponding clean image. c Prediction of the same architecture
on the corrupted image (a). d Prediction of an ablated architecture on
the corrupted image (a). We clearly see that prediction (d) is superior
to (c), hence the corresponding model is more robust with respect to
this image corruption, although it performs worse on the clean image.
We present a study of various architectural choices and various image
corruptions for three datasets: Cityscapes, PASCAL VOC 2012, and
ADE20K.
When benchmarking semantic segmentation models, there
are, in general, different choices such as: (i) comparing different architectures, or (ii) conducting a detailed ablation study
of a state-of-the-art architecture. In contrast to Geirhos et al.
(2018) and Hendrycks and Dietterich (2019), which focused
on aspect (i), we perform both options. We believe that an
ablation study (option ii) is important since knowledge about
architectural choices is likely helpful when designing a
practical system, where types of image corruptions are known
beforehand. For example, Geirhos et al. (2018) showed that
ResNet-152 (He et al. 2016) is more robust to image noise
than GoogLeNet (Szegedy et al. 2015). Is the latter architecture more prone to noise due to missing skip-connections,
shallower architecture, or other architectural design choices?
When the overarching goal is to develop robust convolutional
neural networks, we believe that it is important to learn about
the robustness capabilities of architectural properties.
We use the state-of-the-art DeepLabv3+ architecture
(Chen et al. 2018b) with multiple network backbones as
reference and consider many ablations of it. Based on our
evaluation, we draw three main conclusions:
(1) Many networks perform well with respect to real-world
image corruptions, such as a realistic PSF blur. (2) Architectural properties can affect the robustness of a model
significantly. Our results show that atrous (i.e., dilated) convolutions and long-range links naturally aid the robustness
against many types of image corruptions. However, an architecture with a Dense Prediction Cell (Chen et al. 2018a),
which was designed to maximize performance on clean
data, hampers the performance for corrupted images significantly (see Fig. 1). (3) The generalization capability of
the DeepLabv3+ model, using a ResNet backbone, depends
strongly on the type of image corruption.
In summary, we make the following contributions:
– We benchmark the robustness of many architectural properties of the state-of-the-art semantic segmentation model
DeepLabv3+ for a wi (...truncated)