DEEP SEMANTIC SEGMENTATION FOR THE OFF-ROAD AUTONOMOUS DRIVING

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020
XXIV ISPRS Congress (2020 edition)

I. Sgibnev *, A. Sorokin, B. Vishnyakov, Y. Vizilter

FGUP «State Research Institute of Aviation Systems», Russia, 125319, Moscow, Viktorenko street, 7 - (sgibnev, ans, vishnyakov, viz)@gosniias.ru

KEY WORDS: Semantic segmentation, DCNN, off-road, autonomous driving, lightweight architectures

ABSTRACT:

This paper addresses image semantic segmentation for the machine vision system of an off-road autonomous robotic vehicle. Most modern convolutional neural networks require computing resources beyond the capabilities of many robotic platforms. The main drawback of such models is therefore their extremely high complexity, whereas real applications must run in real time on devices with limited resources. This paper focuses on the practical application of modern lightweight architectures to semantic segmentation on mobile robotic systems. The article discusses backbones based on ResNet18, ResNet34, MobileNetV2, ShuffleNetV2 and EfficientNet-B0, decoders based on U-Net and DeepLabV3, and additional components that can increase segmentation accuracy and reduce inference time. We propose a model using a ResNet34 backbone and a DeepLabV3 decoder with Squeeze & Excitation blocks, which was optimal in terms of inference time and accuracy. We also present our off-road dataset and a simulated dataset for semantic segmentation. Furthermore, we show that using weights pre-trained on the simulated dataset yields a 2.7% mIoU increase on our off-road dataset compared with weights pre-trained on Cityscapes.
Moreover, we achieve 75.6% mIoU on the Cityscapes validation set and 85.2% mIoU on our off-road validation set at a speed of 37 FPS for a 1,024×1,024 input on one NVIDIA GeForce RTX 2080 card using NVIDIA TensorRT.

1. INTRODUCTION

Building a reliable and stable semantic model of the surrounding scene, and detecting objects and all kinds of obstacles that may appear in the path of an autonomous car, is a difficult task for any machine vision system. Object detection is a two-step approach: first the instances of interest are localized in the image, then they are classified. Using deep convolutional neural networks, we can build a bounding box for each object in the image. However, because bounding boxes are rectangular, this approach does not convey the exact shape of the object and does not consider the entire context of the image. Therefore, object detection does not provide a complete understanding of the surrounding scene.

Semantic segmentation is essentially pixel-by-pixel classification, so it gives a more detailed view of the shape of objects in an image and provides a much more complete understanding of the surrounding scene than detection methods. Semantic segmentation is finding a growing number of applications, such as autonomous vehicles, robotic systems and virtual reality, for which an understanding of the scene is necessary.

Image semantic segmentation is crucially important for the automatic control system of modern autonomous vehicles. An accurate understanding of the surrounding scene is important for navigation and decision-making by the control system of a robotic platform. A vision system based on semantic segmentation algorithms is one of the key elements of an off-road autonomous robotic vehicle.
Its characteristics largely determine the efficiency of the robotic complex, as they directly affect such problems as recognition of the underlying surface type, calculation of the patency (traversability) map, and the accuracy of detection, recognition and tracking of objects and obstacles. Projecting the semantic segmentation onto a three-dimensional model or point cloud gives the class of each point and allows the patency map of the robotic vehicle to be adjusted.

Currently, semantic segmentation is generally solved with convolutional neural networks, which can take an image of arbitrary size as input and output a corresponding prediction. New methods based on deep convolutional neural networks significantly outperform older methods based on clustering, histograms and color, compression, edge detection, etc.

2. RESEARCH OVERVIEW

2.1 Lightweight backbones

ResNet, presented in (Kaiming He et al., 2015), addressed the vanishing gradient problem in training deep neural networks by adding shortcut connections. This gave researchers a way to train deeper neural networks than was previously possible. In numerous experiments the authors demonstrated that deep neural networks can be trained effectively. The results obtained at various competitions made ResNet one of the most popular architectures for solving computer vision problems.

MobileNetV2 (Mark Sandler et al., 2018) was designed specifically for mobile devices. The authors sought to create a model that provides high accuracy with a minimum number of parameters and FLOPs, so that it could be applied to various computer vision tasks on devices with limited resources. The MobileNetV2 bottleneck block with an expansion layer is based on depthwise and pointwise convolutions, which allowed the authors to significantly reduce the number of parameters and calculations.
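The parameter savings from depthwise and pointwise convolutions can be illustrated with a simple weight count. The Python sketch below is not taken from the paper: the function names, the omission of biases and batch normalization, the illustrative expansion factor of 6 and the example channel sizes are all assumptions made for the sake of the example.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def inverted_residual_params(c_in, c_out, expansion=6, k=3):
    """Weights of an expand -> depthwise -> project bottleneck.

    Assumption: biases and batch normalization are ignored, and the
    expansion factor is an illustrative choice, not a value from the paper.
    """
    hidden = c_in * expansion
    expand = c_in * hidden      # 1 x 1 expansion layer
    depthwise = k * k * hidden  # k x k depthwise convolution
    project = hidden * c_out    # 1 x 1 linear projection
    return expand + depthwise + project


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    print(standard_conv_params(64, 128, 3))
    print(depthwise_separable_params(64, 128, 3))
    print(inverted_residual_params(24, 24))
```

Under these assumptions, a 3×3 layer with 64 input and 128 output channels needs 73,728 weights as a standard convolution and 8,768 as a depthwise plus pointwise pair, roughly 8.4 times fewer.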
ShuffleNetV2 (Ningning Ma et al., 2018) added pointwise group convolution and channel shuffling, which is used to exchange information between the channels of feature maps. This neural network focuses on maintaining maximum accuracy under significant computational limitations (<200 MFLOPS), thereby targeting applications for mobile phones, robots, drones, etc.

In (Mingxing Tan et al., 2019) the authors created a baseline neural network using Neural Architecture Search and then scaled it along different dimensions using the proposed scaling method. This yielded several models that balance network width, depth and resolution to fit given resource constraints while maintaining model efficiency; one of these CNNs is EfficientNet-B0.

* Corresponding author

This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-617-2020 | © Authors 2020. CC BY 4.0 License.

Figure 1. Sample images from our off-road datasets

convolutional neural network. This module recalibrates the feature maps, amplifying the components of strong features and attenuating those of weak ones. Moreover, a slight increase in the complexity of the model is accompanied by a significant increase in the accuracy of segmentation. Models that are ba (...truncated)