DEEP SEMANTIC SEGMENTATION FOR THE OFF-ROAD AUTONOMOUS DRIVING
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020
XXIV ISPRS Congress (2020 edition)
I. Sgibnev *, A. Sorokin, B. Vishnyakov, Y. Vizilter
FGUP «State Research Institute of Aviation Systems», Russia, 125319, Moscow, Viktorenko street, 7 - (sgibnev, ans, vishnyakov, viz)@gosniias.ru
KEY WORDS: Semantic segmentation, DCNN, off-road, autonomous driving, lightweight architectures
ABSTRACT:
This paper addresses image semantic segmentation for the machine vision system of an off-road autonomous robotic
vehicle. Most modern convolutional neural networks require computing resources that exceed the capabilities of many robotic
platforms. The main drawback of such models is therefore their extremely high complexity,
whereas real applications must run in real time on devices with limited resources. This paper focuses on the practical
application of modern lightweight architectures to semantic segmentation on mobile robotic systems. The article
discusses backbones based on ResNet18, ResNet34, MobileNetV2, ShuffleNetV2 and EfficientNet-B0, and decoders based on U-Net and
DeepLabV3, as well as additional components that can increase segmentation accuracy and reduce inference time. We
propose a model using a ResNet34 backbone and a DeepLabV3 decoder with Squeeze & Excitation blocks, which was optimal in terms of
inference time and accuracy. We also present our off-road dataset and a simulated dataset for semantic segmentation. Furthermore,
we show that pre-training on the simulated dataset increases mIoU on our off-road dataset by 2.7% compared with pre-training on Cityscapes. Moreover, we achieve 75.6% mIoU on the Cityscapes validation set and 85.2% mIoU on our off-road validation set at 37 FPS for a 1,024×1,024 input on one NVIDIA GeForce RTX 2080 card using NVIDIA TensorRT.
1. INTRODUCTION
Building a reliable and stable semantic model of the surrounding
scene and detecting objects and all kinds of obstacles that may
appear in the path of an autonomous car is a difficult task for
any machine vision system.
Object detection is a two-step approach. First, the instances of
interest are localized in the image; then they are classified.
Using deep convolutional neural networks, we can build a
bounding box for each object in the image. However, because
bounding boxes are rectangular, this approach does not convey
the exact shape of the object and does not consider the entire
context of the image. Therefore, object detection does not
provide a complete understanding of the surrounding scene.
Semantic segmentation is essentially a pixel-by-pixel
classification, so it gives a more detailed view of the shape of
objects in an image and provides a much more complete
understanding of the surrounding scene compared to the
detection methods. Semantic segmentation is used in a growing
number of applications that require scene understanding, such as
autonomous vehicles, robotic systems and virtual reality. It is
crucially important for the automatic control system of modern
autonomous vehicles: an accurate understanding of the
surrounding scene is needed for navigation and decision-making
by the control system of a robotic platform.
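The pixel-by-pixel view described above can be illustrated with a minimal sketch. It assumes the network outputs one score map per class (a layout chosen here for illustration, not taken from the paper) and computes the mean intersection-over-union (mIoU) metric quoted in the abstract. The function names and the tiny 2×2 example are hypothetical.

```python
from typing import List

ScoreMaps = List[List[List[float]]]  # scores[class][row][col] (assumed layout)
LabelMap = List[List[int]]           # labels[row][col]


def pixelwise_argmax(scores: ScoreMaps) -> LabelMap:
    """Assign to every pixel the class with the highest score."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


def mean_iou(pred: LabelMap, target: LabelMap, num_classes: int) -> float:
    """Average per-class IoU; classes absent from both maps are skipped."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for row_p, row_t in zip(pred, target):
            for p, t in zip(row_p, row_t):
                if p == c and t == c:
                    inter += 1
                if p == c or t == c:
                    union += 1
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical two-class example on a 2x2 image.
example_scores = [
    [[0.9, 0.2], [0.6, 0.1]],  # class 0 scores
    [[0.1, 0.8], [0.4, 0.9]],  # class 1 scores
]
example_pred = pixelwise_argmax(example_scores)
example_miou = mean_iou(example_pred, [[0, 1], [1, 1]], num_classes=2)
```

Unlike a bounding box, the label map keeps the class of every pixel, which is why the metric is computed per pixel rather than per box.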
A vision system based on semantic segmentation algorithms is
one of the key elements of an off-road autonomous robotic
vehicle. Its characteristics largely determine the efficiency of the
robotic complex, as they directly affect tasks such as
recognition of the underlying surface type, calculation of the
traversability (patency) map, and the accuracy of detection,
recognition and tracking of objects and obstacles. Projecting the
semantic segmentation onto a three-dimensional model or point
cloud gives the class of each point and makes it possible to
adjust the traversability map of the robotic vehicle.
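One way to attach a class to each 3D point is sketched below. It is not the authors' implementation: it assumes a pinhole camera model, points already expressed in the camera frame with Z pointing forward, and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. Each point is projected into the segmented image and takes the label of the pixel it lands on.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def label_points(points: Sequence[Point3D],
                 labels: List[List[int]],
                 fx: float, fy: float,
                 cx: float, cy: float) -> List[Optional[int]]:
    """Return the class of each point, or None if it is not visible.

    points: (X, Y, Z) in the camera frame, Z forward (assumption).
    labels: per-pixel class map, indexed as labels[row][col].
    """
    height, width = len(labels), len(labels[0])
    result: List[Optional[int]] = []
    for x, y, z in points:
        if z <= 0:  # behind the camera: cannot be projected
            result.append(None)
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            result.append(labels[v][u])
        else:  # projects outside the image
            result.append(None)
    return result


# Hypothetical 4x4 label map: left half class 0, right half class 1.
example_labels = [[0, 0, 1, 1] for _ in range(4)]
example_points = [(0.0, 0.0, 1.0), (-1.0, 0.0, 1.0),
                  (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
example_classes = label_points(example_points, example_labels,
                               fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Points that receive `None` would need another source of labels (for example, a second camera); occlusion handling is deliberately left out of this sketch.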
Currently, semantic segmentation is generally solved using
convolutional neural networks, which can take an image of
arbitrary size as input and output a corresponding prediction.
New methods based on deep convolutional neural networks
significantly outperform older methods based on clustering,
histograms and color, compression, edge detection, etc.
2. RESEARCH OVERVIEW
2.1 Lightweight backbones
ResNet (Kaiming He et al., 2015) addressed the vanishing
gradient problem in the training of deep neural networks by
adding shortcut connections. This gave researchers a way to
train deeper neural networks than was previously possible, and
the authors demonstrated in numerous experiments that such
networks can be trained effectively. The results obtained at
various competitions made ResNet one of the most popular
architectures for computer vision problems. MobileNetV2
(Mark Sandler et al., 2018) was designed specifically for mobile
devices. The authors sought to create a model that would provide
high accuracy with a minimum number of parameters and
FLOPs, so that it could be applied to various computer vision
tasks on devices with limited resources. The MobileNetV2
bottleneck block with an expansion layer is based on depthwise
and pointwise convolutions, which allowed the authors to
significantly reduce the number of parameters and computations.
ShuffleNetV2 (Ningning Ma et al., 2018) added pointwise group
convolution and channel shuffling, which is used to exchange
information between channels of feature maps. This neural
network focuses on maintaining maximum accuracy under tight
computational limits (<200 MFLOPs), thereby targeting
applications for mobile phones, robots, drones, etc.
In (Mingxing Tan et al., 2019) the authors created a baseline neural
* Corresponding author
This contribution has been peer-reviewed.
https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-617-2020 | © Authors 2020. CC BY 4.0 License.
Figure 1. Sample images from our off-road datasets
network using Neural Architecture Search and then scaled it
along different dimensions using the proposed compound scaling
method. This produced several models that balance network
width, depth and resolution under given resource constraints
while maintaining efficiency; one of these CNNs is
EfficientNet-B0.
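Two of the ideas mentioned above can be made concrete with a short sketch: the parameter saving obtained by replacing a standard convolution with a depthwise plus pointwise pair, and the channel shuffle operation. The layer sizes (3×3 kernel, 64 input and 128 output channels) are hypothetical, biases are ignored, and the functions only count weights or reorder channel indices; they are not taken from the cited implementations.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k plus pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


def channel_shuffle(channels: Sequence[T], groups: int) -> List[T]:
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, n // groups), transposing,
    and flattening, so each group then holds channels from all groups.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels.
standard = standard_conv_params(64, 128, 3)     # 73728 weights
separable = separable_conv_params(64, 128, 3)   # 8768 weights
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

For this hypothetical layer the separable variant needs roughly 8.4 times fewer weights, which illustrates why such blocks suit devices with limited resources.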
convolutional neural network. This module recalibrates the
feature maps, amplifying the components of strong features and
suppressing those of weak ones. Moreover, it adds only slightly
to the complexity of the model while significantly increasing
segmentation accuracy.
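The recalibration step can be sketched as follows, assuming the module in question is the Squeeze & Excitation block named in the abstract. The sketch uses plain Python lists, omits biases, and takes hypothetical weight matrices `w1` and `w2` that would normally be learned; it is an illustration of the mechanism, not the authors' code.

```python
import math
from typing import List, Tuple

FeatureMaps = List[List[List[float]]]  # maps[channel][row][col]
Matrix = List[List[float]]


def squeeze_excite(maps: FeatureMaps, w1: Matrix,
                   w2: Matrix) -> Tuple[FeatureMaps, List[float]]:
    """Rescale each channel by a gate in (0, 1) derived from global context.

    w1: hidden x channels weights, w2: channels x hidden weights
    (both hypothetical, biases omitted).
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in maps]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value of a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(maps, gates)]
    return scaled, gates


# Hypothetical input: a strongly and a weakly activated 2x2 channel.
example_maps = [[[4.0, 4.0], [4.0, 4.0]],
                [[1.0, 1.0], [1.0, 1.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
example_scaled, example_gates = squeeze_excite(example_maps,
                                               identity, identity)
```

With these illustrative weights the strongly activated channel receives the larger gate, so it is attenuated less than the weak one. The extra cost is two small fully connected layers per block, consistent with the slight increase in complexity noted above.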
Models that are ba (...truncated)