Bilateral network with rich semantic extractor for real-time semantic segmentation

Complex & Intelligent Systems
https://doi.org/10.1007/s40747-023-01242-w

ORIGINAL ARTICLE

Shan Zhao · Xuan Wu · Kaiwen Tian · Yang Yuan

School of Software, Henan Polytechnic University, 2001 Century Avenue, Jiaozuo 454000, Henan, China

Shan Zhao, Xuan Wu, Kaiwen Tian, and Yang Yuan have contributed equally to this work. Corresponding author: Xuan Wu.

Received: 19 April 2023 / Accepted: 9 September 2023
© The Author(s) 2023

Abstract

Owing to the requirements of inference speed, most real-time semantic segmentation networks have shallow network depth, which limits the receptive field size of the model. This limits the acquisition of semantic information, causes intraclass inconsistency, and ultimately decreases segmentation accuracy. The shallow network depth also restricts the feature extraction capability of the network, reducing its robustness and its ability to adapt to complex scenes. To address these issues, a bilateral network with a rich semantic extractor (RSE) for real-time semantic segmentation (BRSeNet) is presented. First, to solve the problem of insufficient semantic feature extraction, an RSE is proposed, which includes a multiscale global semantic extraction module (MGSEM) and a semantic fusion module (SFM). The MGSEM extracts rich global semantics and expands the effective receptive field. Simultaneously, the SFM efficiently integrates multiscale local semantics with multiscale global semantics, resulting in more comprehensive semantic information for the network. Finally, based on the characteristics of the detail and semantic branches, a bilateral reconstruction aggregation module is designed to reconstruct the contextual information of detail features, model the interdependencies among semantic feature channels, and enhance feature representation. Comprehensive experiments are conducted on the challenging Cityscapes and ADE20K datasets. The experimental results show that the proposed BRSeNet achieves a mean intersection over union (mIoU) of 74.9% on Cityscapes and 35.7% on ADE20K at inference speeds of 74 and 65 frames per second (FPS), respectively, and ensures a favorable balance between segmentation accuracy and inference speed.

Keywords

Semantic segmentation · Real time · Vision transformer · Multiscale feature

Introduction

Semantic segmentation is a classical problem in computer vision that aims to assign pixel-level labels to images. A fully convolutional network (FCN) [1] first accomplished the semantic segmentation task in a fully convolutional manner with VGG [2] as the backbone network, and most subsequent studies have been based on its improvements. In the past few decades, owing to the excellent performance of the deep convolutional neural network (DCNN), many semantic segmentation methods [3–5] have been proposed. To achieve a significant improvement in segmentation accuracy, complex backbone networks (e.g., Xception [6] and ResNet [7]) are adopted to capture high-level contextual semantics. However, these networks are usually computationally intensive and slow in inference. In some fields, such as autonomous driving, video surveillance, and human–computer interaction, the inference speed of these networks cannot meet application requirements.

To meet the requirements of inference speed, many real-time semantic segmentation networks are designed with lightweight classification networks (e.g., MobileNet [8] and ShuffleNet [9]) to achieve low latency and good segmentation accuracy.

Fig. 1 Accuracy (mIoU) and inference speed (FPS) obtained by several state-of-the-art semantic segmentation methods on the Cityscapes validation set

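For reference, the mIoU metric reported in Fig. 1 and in the abstract is conventionally computed from a class confusion matrix: the IoU of each class is TP / (TP + FP + FN), and mIoU is the average over classes. The following is a minimal NumPy sketch of that standard definition, not the authors' evaluation code. The function names `confusion_matrix` and `mean_iou` are hypothetical, and the `ignore_index=255` default is an assumption based on the common Cityscapes labeling convention.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Accumulate a (num_classes x num_classes) matrix; rows are ground truth, columns are predictions.

    ignore_index=255 is an assumed convention (common for Cityscapes), not taken from the paper.
    Predictions are expected to lie in [0, num_classes).
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)
    keep = target != ignore_index  # drop unlabeled pixels
    flat = target[keep] * num_classes + pred[keep]
    counts = np.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean of per-class IoU = TP / (TP + FP + FN), skipping classes absent from both pred and target."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())


if __name__ == "__main__":
    # Toy 2x3 label maps with 3 classes; one pixel is unlabeled (255).
    target = np.array([[0, 0, 1], [1, 2, 255]])
    pred = np.array([[0, 1, 1], [1, 2, 2]])
    conf = confusion_matrix(pred, target, num_classes=3)
    print(f"mIoU = {mean_iou(conf):.4f}")
```

In practice the confusion matrix is summed over every image in the validation set before the per-class IoU is taken, so that small images do not dominate the average.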
In addition, some methods construct special network architectures. For example, the Image Cascade Network (ICNet) [10] proposed a three-level cascade architecture that balances segmentation accuracy and efficiency by leveraging multiresolution processing. BiSeNet [11] divides the network into two branches, which extract deep semantics while preserving detailed information, and introduces an attention mechanism to further enhance network robustness. BiSeNetv2 [12] also employs a two-branch structure and uses both attention and self-attention mechanisms to improve the perception of important features and spatial structures, thereby enhancing the quality of segmentation results and the preservation of details. Figure 1 shows the accuracy [mean intersection over union (mIoU)] and inference speed [frames per second (FPS)] achieved by several state-of-the-art semantic segmentation methods on the Cityscapes validation set.

These networks have proven to be excellent real-time segmentation networks. However, because of their limited parameter budgets, real-time semantic segmentation networks are shallow and have small receptive fields, resulting in insufficient feature extraction ability. This deficiency has two main aspects: (1) The network has few layers, which leads to a small receptive field for feature extraction. When large objects are segmented, the features corresponding to pixels of the same label differ to some extent, and these differences introduce intraclass inconsistency, resulting in a decrease in accuracy. (2) Features are extracted at a single scale, whereas the objects to be segmented in an image appear at multiple scales. When a single scale is used for pixel-level classification, the robustness of the network is reduced.
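The dependence of receptive field size on network depth noted in point (1) can be illustrated with the standard recurrence for the theoretical receptive field of a stack of convolutions: each layer adds (kernel − 1) × dilation × (product of the preceding strides) pixels. The sketch below is a minimal illustration under stated assumptions. `Conv` and `receptive_field` are hypothetical helper names, the layer configurations are made up for the example and are not taken from BRSeNet or any cited network, and the result is the theoretical receptive field, which is an upper bound on the effective one.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Conv:
    """One convolution (or pooling) layer along a single spatial axis. Hypothetical helper type."""
    kernel: int
    stride: int = 1
    dilation: int = 1


def receptive_field(layers: Iterable[Conv]) -> int:
    """Theoretical receptive field (in input pixels) of the last layer's output unit."""
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent units of the current feature map
    for layer in layers:
        rf += (layer.kernel - 1) * layer.dilation * jump
        jump *= layer.stride
    return rf


if __name__ == "__main__":
    shallow = [Conv(3), Conv(3), Conv(3)]
    strided = [Conv(3, stride=2), Conv(3, stride=2), Conv(3, stride=2)]
    dilated = [Conv(3, dilation=1), Conv(3, dilation=2), Conv(3, dilation=4)]
    print(receptive_field(shallow))  # 7
    print(receptive_field(strided))  # 15
    print(receptive_field(dilated))  # 15
```

Three stacked 3×3 convolutions cover only a 7×7 window of the input. Adding stride or dilation widens this to 15×15 with the same number of weights, which is why shallow networks need such mechanisms, or explicit global context, to cover large objects.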
To effectively expand the receptive field of the network, DeepLabv3 [13] and PSPNet [3] utilize dilated convolutions without increasing computational costs, leading to improved accuracy of the segmentation network. However, selecting the dilation rates is difficult, and dilated convolutions can also introduce grid artifacts. Additionally, accurately capturing multiscale object information while maintaining the fast inference speed of the network is a significant challenge. Previous approaches have attempted to address this problem in various ways. Chen et al. [14] resized the input image to multiple ratios to extract semantic information at various scales. DeepLabv3 [13] introduced a pyramid pooling module to complete the fusion of multiscale features. DeepLabv3+ [15] aggregated contextual information at multiple scales based on the pyramid structure. EncNet [16] proposed a context encoding module to capture global context information and segment multiscale objects. These methods can extract certain multiscale information, but they typically incur significant computational costs.

To solve the problems of a real-time semantic segmentation network with a small receptive field that leads to intraclass inconsistency, a single semantic (...truncated)