ResLMFFNet: a real-time semantic segmentation network for precision agriculture

Journal of Real-Time Image Processing (2024) 21:101
https://doi.org/10.1007/s11554-024-01474-0

RESEARCH

Irem Ulku

Received: 7 January 2024 / Accepted: 7 May 2024
© The Author(s) 2024

Abstract

The lightweight multiscale-feature-fusion network (LMFFNet) is a real-time CNN architecture that balances inference time and accuracy. Capturing the intricate details of precision agriculture target objects in remote sensing images requires deep SEM-B blocks in the LMFFNet model design. However, employing numerous SEM-B units leads to instability during backward gradient flow. This work proposes the novel residual-LMFFNet (ResLMFFNet) model to ensure smooth gradient flow within SEM-B blocks. By incorporating residual connections, ResLMFFNet achieves improved accuracy without affecting the inference speed or the number of trainable parameters. The experimental results demonstrate that this architecture outperforms other real-time architectures across diverse precision agriculture applications involving UAV and satellite images. Compared to LMFFNet, the ResLMFFNet architecture improves the Jaccard Index by 2.1% for tree detection, 1.4% for crop detection, and 11.2% for wheat yellow rust detection. These accuracy gains come with almost the same inference time and computational complexity as the LMFFNet model. The source code is available on GitHub: https://github.com/iremulku/Semantic-Segmentation-in-Precision-Agriculture.

Keywords Real-time semantic segmentation · Remote sensing · Precision agriculture

1 Introduction

Precision agriculture is a technique that aims to increase crop productivity while reducing costs and environmental impact [1]. Sensing technology is a tool for achieving this goal by monitoring vast areas of land.
With the advancement of convolutional neural networks (CNNs), this technology has become even more powerful [2]. CNN models are used in early disease detection, which reduces yield losses by allowing fungicides to be applied at the right time [1]. Additionally, CNN architectures can identify trees and crops to maximize agricultural efficiency [3]. However, CNN architectures [4] have low inference speed, measured in frames per second (fps), which makes them impractical for real-time applications. Precision agriculture faces the challenge of balancing high accuracy with fast inference speed. Existing research on using CNN models for real-time precision agriculture focuses on a single specialized application [5–7] and does not provide sufficient accuracy [8–10]. Therefore, it is essential to adapt recent real-time models to provide high accuracy in various precision agriculture applications [1].

* Irem Ulku, Department of Computer Engineering, Ankara University, 06830 Ankara, Turkey

Real-time CNN architectures generally adhere to an encoder–decoder framework. Architectures like SegNet [11] employ encoders based on established backbone networks. In contrast, ENet [12], LEDNet [13], and FSFNet [14] use lightweight modules to build efficient encoders, resulting in fewer parameters. These models, however, are less accurate than other architectures. The decoder parts of real-time semantic segmentation models may also have different designs. SegNet and ESNet [15] have symmetrically designed decoders. In contrast, the DFANet [16] and FASSD-Net [17] architectures have adopted asymmetric decoder structures to enhance inference speed. Recent transformer-based models such as UNetFormer [18] achieve good performance without sacrificing real-time speed. Remarkably, by introducing a split-extract-merge bottleneck (SEM-B) in its backbone network, the real-time LMFFNet [19] architecture achieves high accuracy with fewer model parameters.
A lightweight asymmetric decoder is used in the LMFFNet model to process multi-scale features, which reduces inference time. However, given the demanding low-latency and high-accuracy requirements of various precision agriculture tasks, LMFFNet still needs improvement.

Realizing precision agriculture practices with high accuracy in real time is challenging. In real-world remote-sensing images with high spatial resolution, capturing the intricate details of precision agriculture target objects poses considerable difficulties. This paper proposes the ResLMFFNet architecture to increase prediction accuracy and achieve a decent trade-off between high accuracy and fast inference speed. ResLMFFNet introduces the following novelties:

• LMFFNet is the base model since it already achieves an adequate trade-off between accuracy and efficiency. However, residual connections are added to the SEM-B blocks in this study to further increase accuracy without affecting inference speed (Fig. 1). By preserving low-level features that would otherwise be lost through deep SEM-B blocks, these connections further enhance the performance of LMFFNet. Residual connections are preferred to dense or attention connections since the element-wise summation operation does not introduce trainable weights.
• Before upsampling, a dropout layer is used in the decoder, which allows the model to show higher generalization ability, making it better suited to a wide range of precision agriculture practices.

In the remainder of this paper, the details of the proposed architecture ResLMFFNet are described in Sect. 2. Section 3 presents the experimental results. Conclusions are given in Sect. 4.

Fig. 1 Complexity-accuracy trade-off comparison on the DSTL image set in terms of Jaccard Index (JI), giga floating point operations (GFLOPs), and model parameters. The circle size indicates the number of model parameters

Fig. 2 ResLMFFNet architecture

2 Methods

The ResLMFFNet model, an improved version of the LMFFNet architecture, is designed to increase accuracy while preserving real-time capabilities, as depicted in Fig. 2. Similar to the LMFFNet design, the ResLMFFNet model is composed of three core components: the SEM-B block, the feature fusion module (FFM), and the multiscale attention decoder (MAD). The novel contribution of the ResLMFFNet architecture is the use of residual connections within the SEM-B blocks, as illustrated in Fig. 2. Accuracy is further boosted by the inclusion of a dropout layer in the decoder design. This section provides a detailed explanation of the essential components within the ResLMFFNet architecture.

2.1 SEM-B block

The SEM-B block is built upon the split-extract-merge bottleneck shown in Fig. 3. SEM-B applies a 3×3 convolution, then splits the feature map into two branches, each with 1/4 of the input channels. One branch undergoes depthwise convolution, while the other employs depthwise dilated convolution so that SEM-B effectively captures fine spatial det (...truncated)
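
The claim that element-wise summation introduces no trainable weights can be checked with a rough parameter count. The Python sketch below counts convolution weights in a SEM-B-like block and compares a residual skip with a concatenation-based one. The function names (conv_params, sem_b_params, residual), the 64-channel width, the reading that the initial 3×3 convolution halves the channel count, the 1×1 convolution after merging, and the concatenation-based alternative are all illustrative assumptions and are not taken from the paper's code. Normalization and activation parameters are ignored.

```python
# Parameter-count sketch. The layout after the split (merge by concatenation,
# then a 1x1 convolution back to the input width) is an ASSUMPTION, and
# normalization/activation parameters are ignored.

def conv_params(c_in, c_out, kernel, groups=1):
    """Trainable weights of a bias-free kernel x kernel convolution."""
    return (c_in // groups) * c_out * kernel * kernel


def sem_b_params(channels):
    """Weights of a SEM-B-like block with `channels` input channels."""
    half, quarter = channels // 2, channels // 4
    squeeze = conv_params(channels, half, 3)  # 3x3 conv; its output is split in two
    depthwise = conv_params(quarter, quarter, 3, groups=quarter)
    dilated = conv_params(quarter, quarter, 3, groups=quarter)  # dilation adds no weights
    restore = conv_params(half, channels, 1)  # assumed 1x1 conv after merging
    return squeeze + depthwise + dilated + restore


def residual(x, block):
    """y = x + block(x): an element-wise sum with no weights of its own."""
    return [a + b for a, b in zip(x, block(x))]


plain = sem_b_params(64)
with_residual = plain + 0  # the summation contributes zero weights
# Hypothetical alternative: concatenate input and output (2 * 64 channels),
# then project back to 64 channels with a 1x1 convolution.
concat_skip = plain + conv_params(2 * 64, 64, 1)

print(plain, with_residual, concat_skip)
print(residual([1.0, 2.0], lambda v: [0.5 * t for t in v]))
```

Under these assumptions the residual variant has exactly the same weight count as the plain block, while the concatenation-based skip needs an extra projection layer. This is consistent with the stated preference for residual over dense connections.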