Benchmarking the Robustness of Semantic Segmentation Models with Respect to Common Corruptions

International Journal of Computer Vision
https://doi.org/10.1007/s11263-020-01383-2

Christoph Kamann · Carsten Rother

Received: 14 April 2020 / Accepted: 7 September 2020
© The Author(s) 2020

Abstract

When designing a semantic segmentation model for a real-world application, such as autonomous driving, it is crucial to understand the robustness of the network with respect to a wide range of image corruptions. While there are recent robustness studies for full-image classification, we are the first to present an exhaustive study for semantic segmentation, based on many established neural network architectures. We utilize almost 400,000 images generated from the Cityscapes dataset, PASCAL VOC 2012, and ADE20K. Based on the benchmark study, we gain several new insights. Firstly, many networks perform well with respect to real-world image corruptions, such as a realistic PSF blur. Secondly, some architecture properties significantly affect robustness, such as a Dense Prediction Cell, designed to maximize performance on clean data only. Thirdly, the generalization capability of semantic segmentation models depends strongly on the type of image corruption. Models generalize well for image noise and image blur but not for digitally corrupted data or weather corruptions.

Keywords Semantic segmentation · Corruption robustness · Common image corruptions · Realistic image corruptions

1 Introduction

In recent years, deep convolutional neural networks (DCNNs) have set the state-of-the-art on a broad range of computer vision tasks (Krizhevsky et al. 2012; He et al. 2016; Simonyan and Zisserman 2015; Szegedy et al. 2015; LeCun et al. 1998; Redmon et al. 2016; Chen et al. 2015; Goodfellow et al. 2016).
The performance of CNN models is generally measured using benchmarks of publicly available datasets, which often consist of clean and post-processed images (Cordts et al. 2016; Everingham et al. 2010). However, it has been shown that model performance is sensitive to image corruptions (Zhou et al. 2017; Vasiljevic et al. 2016; Hendrycks and Dietterich 2019; Geirhos et al. 2018; Dodge and Karam 2016; Gilmer et al. 2019; Azulay and Weiss 2019; Kamann and Rother 2020); image noise in particular decreases performance significantly.

Communicated by Daniel Scharstein.
Christoph Kamann · Carsten Rother: Visual Learning Lab, HCI/IWR, Heidelberg University, Heidelberg, Germany

Image quality depends on environmental factors such as illumination and weather conditions, ambient temperature, and camera motion, since they directly affect the optical and electrical properties of a camera. Image quality is also affected by optical aberrations of the camera lenses, causing, e.g., image blur. Thus, in safety-critical applications, such as autonomous driving, models must be robust towards such inherently present image corruptions (Hasirlioglu et al. 2016; Kamann et al. 2017; Janai et al. 2020).

In this work, we present an extensive evaluation of the robustness of semantic segmentation models towards a broad range of real-world image corruptions. Here, the term robustness refers to training a model on clean data and then validating it on corrupted data. We choose the task of semantic image segmentation for two reasons. Firstly, image segmentation is often applied in safety-critical applications, where robustness is essential. Secondly, a rigorous evaluation for real-world image corruptions has, in recent years, only been conducted for full-image classification and object detection, e.g., most recently by Geirhos et al. (2018), Hendrycks and Dietterich (2019), and Michaelis et al. (2019).
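The notion of robustness used here (train on clean data, validate on corrupted data) can be illustrated with a small evaluation loop. This is a minimal sketch and not the authors' evaluation code: it assumes mean intersection-over-union as the score, which this section does not specify, and the names `predict`, `corrupt`, and `ignore_index` are hypothetical placeholders for a trained model, a corruption function, and a label value to skip.

```python
import numpy as np


def confusion_matrix(pred, label, num_classes, ignore_index=255):
    """Build a num_classes x num_classes confusion matrix (rows: ground truth)."""
    # ignore_index is an assumed convention for unlabeled pixels.
    valid = label != ignore_index
    idx = num_classes * label[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean intersection-over-union over classes that occur in labels or predictions."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())


def evaluate(predict, images, labels, num_classes, corrupt=None):
    """Score `predict` on a validation set, optionally corrupting each input first."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for image, label in zip(images, labels):
        if corrupt is not None:
            image = corrupt(image)  # the model itself is never retrained
        conf += confusion_matrix(predict(image), label, num_classes)
    return mean_iou(conf)
```

Calling `evaluate` once with `corrupt=None` and once with a corruption function gives a clean score and a corrupted score for the same trained model; the gap between the two is what a robustness study of this kind compares across architectures.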
When benchmarking semantic segmentation models, there are, in general, different choices, such as (i) comparing different architectures or (ii) conducting a detailed ablation study of a state-of-the-art architecture. In contrast to Geirhos et al. (2018) and Hendrycks and Dietterich (2019), who focused on aspect (i), we perform both options. We believe that an ablation study (option ii) is important since knowledge about architectural choices is likely helpful when designing a practical system where the types of image corruptions are known beforehand. For example, Geirhos et al. (2018) showed that ResNet-152 (He et al. 2016) is more robust to image noise than GoogLeNet (Szegedy et al. 2015). Is the latter architecture more prone to noise due to missing skip-connections, shallower architecture, or other architectural design choices?

Fig. 1 Results of our ablation study. Here we train the state-of-the-art semantic segmentation model DeepLabv3+ on clean Cityscapes data and test it on corrupted data. (a) A validation image from Cityscapes, where the left-hand side is corrupted by shot noise and the right-hand side by defocus blur. (b) Prediction of the best-performing model variant on the corresponding clean image. (c) Prediction of the same architecture on the corrupted image (a). (d) Prediction of an ablated architecture on the corrupted image (a). We clearly see that prediction (d) is superior to (c); hence the corresponding model is more robust with respect to this image corruption, although it performs worse on the clean image. We present a study of various architectural choices and various image corruptions for three datasets: Cityscapes, PASCAL VOC 2012, and ADE20K
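A composite input like the one described for Fig. 1a (shot noise on the left half, blur on the right half) could be produced roughly as follows. This is a sketch under stated assumptions: shot noise is modelled as Poisson noise with a hypothetical `photons` scale, and a plain disk average stands in for defocus blur. The corruption models and severity settings actually used in the paper are not given in this section, so the parameter values here are illustrative only.

```python
import numpy as np


def shot_noise(image, photons=60.0, rng=None):
    """Poisson (shot) noise for a float image in [0, 1]; fewer photons means more noise."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(rng.poisson(image * photons) / photons, 0.0, 1.0)


def disk_blur(image, radius=3):
    """Blur an H x W x C float image by averaging over a disk-shaped neighbourhood."""
    # Assumed stand-in for defocus blur: uniform weights inside a disk.
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if dy * dy + dx * dx <= radius * radius]
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=np.float64)
    for dy, dx in offsets:
        out += padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
    return out / len(offsets)


def split_corruption(image, rng=None):
    """Apply shot noise to the left half and disk blur to the right half."""
    mid = image.shape[1] // 2
    out = np.empty_like(image, dtype=np.float64)
    out[:, :mid] = shot_noise(image[:, :mid], rng=rng)
    # Blur the full image first so the seam is not affected by edge padding.
    out[:, mid:] = disk_blur(image)[:, mid:]
    return out
```

Feeding the output of `split_corruption` to a model trained on clean data makes it possible to compare, within a single prediction, how the two corruption types affect the segmentation.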
When the overarching goal is to develop robust convolutional neural networks, we believe that it is important to learn about the robustness capabilities of architectural properties. We use the state-of-the-art DeepLabv3+ architecture (Chen et al. 2018b) with multiple network backbones as reference and consider many ablations of it. Based on our evaluation, we draw three main findings: (1) Many networks perform well with respect to real-world image corruptions, such as a realistic PSF blur. (2) Architectural properties can affect the robustness of a model significantly. Our results show that atrous (i.e., dilated) convolutions and long-range links naturally aid the robustness against many types of image corruptions. However, an architecture with a Dense Prediction Cell (Chen et al. 2018a), which was designed to maximize performance on clean data, hampers the performance for corrupted images significantly (see Fig. 1). (3) The generalization capability of the DeepLabv3+ model, using a ResNet backbone, depends strongly on the type of image corruption.

In summary, we make the following contributions:
– We benchmark the robustness of many architectural properties of the state-of-the-art semantic segmentation model DeepLabv3+ for a wi (...truncated)