Bilateral network with rich semantic extractor for real-time semantic segmentation
Complex & Intelligent Systems
https://doi.org/10.1007/s40747-023-01242-w
ORIGINAL ARTICLE
Shan Zhao1 · Xuan Wu1 · Kaiwen Tian1 · Yang Yuan1
Received: 19 April 2023 / Accepted: 9 September 2023
© The Author(s) 2023
Abstract
Because of inference-speed requirements, most real-time semantic segmentation networks are shallow, which limits the
receptive field of the model. The limited receptive field restricts the acquisition of semantic information, causes
intraclass inconsistency, and ultimately reduces segmentation accuracy. A shallow network also restricts feature
extraction capability, reducing robustness and the ability to adapt to complex scenes. To address these issues, a bilateral
network with a rich semantic extractor (RSE), called BRSeNet, is presented for real-time semantic segmentation. First, to
address the insufficient extraction of semantic feature information, an RSE is proposed that comprises a multiscale global
semantic extraction module (MGSEM) and a semantic fusion module (SFM). The MGSEM extracts rich global semantics and
expands the effective receptive field, while the SFM efficiently integrates multiscale local semantics with multiscale
global semantics, giving the network more comprehensive semantic information. Second, based on the characteristics of the
detail and semantic branches, a bilateral reconstruction aggregation module is designed to reconstruct the contextual
information of detail features, model the interdependencies among semantic feature channels, and enhance feature
representation. Comprehensive experiments are conducted on the challenging Cityscapes and ADE20K datasets. The results
show that BRSeNet achieves a mean intersection over union of 74.9% on Cityscapes and 35.7% on ADE20K at inference speeds
of 74 and 65 frames per second, respectively, striking a favorable balance between segmentation accuracy and inference speed.
Keywords Semantic segmentation · Real time · Vision transformer · Multiscale feature
Shan Zhao, Xuan Wu, Kaiwen Tian, and Yang Yuan have contributed equally to this work.
✉ Xuan Wu
Shan Zhao
Kaiwen Tian
Yang Yuan
1 School of Software, Henan Polytechnic University, 2001 Century Avenue, Jiaozuo 454000, Henan, China
Introduction
Semantic segmentation is a classical problem in computer vision that aims to assign pixel-level labels to images. A
fully convolutional network (FCN) [1] first accomplished the semantic segmentation task in a fully convolutional manner
with VGG [2] as the backbone network, and most subsequent studies have built on it. In recent years, owing to the
excellent performance of the deep convolutional neural network (DCNN), many semantic segmentation methods [3–5] have
been proposed. To achieve a significant improvement in segmentation accuracy, complex backbone networks (e.g.,
Xception [6] and ResNet [7]) are adopted to capture high-level contextual semantics. However, these networks are usually
computationally intensive and slow in inference. In fields such as autonomous driving, video surveillance, and
human–computer interaction, their inference speed cannot meet application requirements.
Fig. 1 Accuracy (mIoU) and inference speed (FPS) obtained by several state-of-the-art semantic segmentation methods on the Cityscapes validation set
To meet the requirements of inference speed, many real-time semantic segmentation networks are designed with
lightweight classification networks (e.g., MobileNet [8] and ShuffleNet [9]) to achieve low latency and good segmentation
accuracy. In addition, some methods are used to construct
special network architectures. For example, the Image Cascade Network (ICNet) [10] proposed a three-level cascade
architecture that balances segmentation accuracy and efficiency by leveraging multiresolution processing. BiSeNet
[11] divides the network into two branches, which extract deep semantics while preserving detailed information, and
introduces an attention mechanism to further enhance network robustness. BiSeNetv2 [12] also employs a two-branch
structure and simultaneously utilizes attention and self-attention mechanisms to improve the perception of important
features and spatial structures, thereby enhancing the quality of segmentation results and the preservation of details.
Figure 1 shows the accuracy [mean intersection over union (mIoU)] and inference speed [frames per second (FPS)]
achieved by several state-of-the-art semantic segmentation methods on the Cityscapes validation set. These networks
have proven to be excellent real-time segmentation networks. However, because the parameter budget is limited,
real-time semantic segmentation networks are shallow and have small receptive fields, resulting in insufficient feature
extraction ability. This deficiency has two main aspects: (1) Because the network has few layers, the receptive field
for feature extraction is small. When large objects are segmented, the features of pixels with the same label differ,
and these differences introduce intraclass inconsistency, which decreases accuracy. (2) Features are extracted at a
single scale, whereas the segmentation targets in an image appear at multiple scales. When a single scale is used for
pixel-level classification, the robustness of the network is reduced.
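As a reminder of how the accuracy metric quoted above is commonly defined, the following minimal sketch computes mIoU from flat label lists via a confusion matrix. It is an illustration only, not the evaluation code used in this work; the function name `mean_iou` and the toy labels are assumptions, and details such as ignore labels are omitted.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two flat sequences of class labels."""
    # confusion[t][p] counts pixels with ground-truth class t predicted as class p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(num_classes)) - tp
        fn = sum(confusion[c]) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical labels): one pixel of class 0 is mislabeled as class 1.
target = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))  # IoU is 3/4 for class 0 and 4/5 for class 1
```

In this toy case a single mislabeled pixel lowers the IoU of both classes, which is why intraclass inconsistency on large objects is penalized by the metric.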
To effectively expand the receptive field of the network,
DeepLabv3 [13] and PSPNet [3] utilize dilated convolutions
without increasing computational costs, leading to improved
accuracy of the segmentation network. However, selecting suitable dilation rates is difficult, and dilated convolutions
can introduce gridding artifacts. Additionally, accurately capturing multiscale object information while maintaining
fast inference is a significant challenge. Previous approaches have attempted to address this problem in various ways.
Chen et al. [14] resized the input image to multiple ratios to extract semantic information at various scales.
DeepLabv3 [13] introduced a pyramid
pooling module to complete the fusion of multiscale features. DeepLabv3+ [15] aggregated contextual information
at multiple scales based on the pyramid structure. EncNet
[16] proposed a context encoding module to capture global
context information and segment multiscale objects. These
methods can extract certain multiscale information, but they
typically incur significant computational costs.
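The effect of dilation on the receptive field can be illustrated with the standard receptive-field recurrence for stacked convolutions. The sketch below is a generic calculation under assumed layer settings (three 3×3 layers, stride 1); the function name `receptive_field` and the layer lists are hypothetical and do not describe the architecture of any network cited here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, ordered input to output.
    """
    rf, jump = 1, 1  # jump = spacing in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1  # span covered by a dilated kernel
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Both stacks use three 3x3 layers, so they hold the same number of weights.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]
print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 15
```

With the same number of weights, the dilated stack roughly doubles the receptive field. The wider spacing between sampled positions is also what can give rise to the gridding artifacts mentioned above.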
To solve the problems of a real-time semantic segmentation network with a small receptive field that leads to
intraclass inconsistency, a single semantic (...truncated)