ResLMFFNet: a real-time semantic segmentation network for precision agriculture
Journal of Real-Time Image Processing
(2024) 21:101
https://doi.org/10.1007/s11554-024-01474-0
RESEARCH
Irem Ulku1
Received: 7 January 2024 / Accepted: 7 May 2024
© The Author(s) 2024
Abstract
Lightweight multiscale-feature-fusion network (LMFFNet), a proficient real-time CNN architecture, adeptly achieves a balance between inference time and accuracy. Capturing the intricate details of precision agriculture target objects in remote
sensing images requires deep SEM-B blocks in the LMFFNet model design. However, employing numerous SEM-B units
leads to instability during backward gradient flow. This work proposes the novel residual-LMFFNet (ResLMFFNet) model
for ensuring smooth gradient flow within SEM-B blocks. By incorporating residual connections, ResLMFFNet achieves
improved accuracy without affecting the inference speed and the number of trainable parameters. The results of the experiments demonstrate that this architecture has achieved superior performance compared to other real-time architectures across
diverse precision agriculture applications involving UAV and satellite images. Compared to LMFFNet, the ResLMFFNet
architecture enhances the Jaccard Index values by 2.1% for tree detection, 1.4% for crop detection, and 11.2% for wheat yellow rust detection. These accuracy gains are achieved while maintaining almost the same inference time and
computational complexity as the LMFFNet model. The source code is available on GitHub: https://github.com/iremulku/Semantic-Segmentation-in-Precision-Agriculture.
Keywords Real-time semantic segmentation · Remote sensing · Precision agriculture
1 Introduction
Precision agriculture is a technique that aims to increase
crop productivity while reducing costs and environmental
impact [1]. Sensing technology is a tool for achieving this
goal by monitoring vast lands. With the advancement of
convolutional neural networks (CNNs), this technology has
become even more powerful [2]. CNN models are used in
early disease detection, leading to reduced yield losses by
applying fungicides at the right time [1]. Additionally, CNN
architectures can identify trees and crops to maximize agricultural efficiency [3]. However, CNN architectures [4] have
low inference speed, measured in frames per second (fps),
which makes them impractical for real-time applications.
Precision agriculture faces the challenge of balancing
high accuracy with fast inference speed. Existing research
on using CNN models for real-time precision agriculture
either focuses on a single specialized application [5–7] or does
not provide sufficient accuracy [8–10]. Therefore, it is essential to adapt recent real-time models to provide high accuracy in various precision agriculture applications [1].
* Irem Ulku
1 Department of Computer Engineering, Ankara University, 06830 Ankara, Turkey
Real-time CNN architectures generally adhere to an
encoder–decoder framework. Architectures like SegNet [11]
employ encoders based on established backbone networks.
In contrast, ENet [12], LEDNet [13], and FSFNet [14] use
lightweight modules to build efficient encoders, resulting
in fewer parameters. These models, however, lack accuracy
compared to others.
The decoder parts of real-time semantic segmentation
models may also have different designs. SegNet and ESNet
[15] have symmetrically designed decoders. In contrast,
DFANet [16] and FASSD-Net [17] adopt asymmetric decoder structures to enhance inference speed. Recent transformer-based models such as UNetFormer [18] achieve good performance without sacrificing
real-time speed.
Remarkably, by introducing a split-extract-merge bottleneck (SEM-B) in its backbone network, the real-time
LMFFNet [19] architecture achieves high accuracy with
fewer model parameters. A lightweight asymmetric decoder
is used in the LMFFNet model to process multi-scale features, which improves inference time. However, with the
challenging low latency and high accuracy requirements for
various precision agriculture tasks, LMFFNet still needs
improvement.
Realizing precision agriculture practices with high accuracy in real time is challenging. In real-world remote-sensing
images with high spatial resolution, capturing the intricate
details of precision agriculture target objects poses considerable difficulties. This paper proposes the ResLMFFNet
architecture to increase prediction accuracy and achieve a
decent trade-off between high accuracy and fast inference
speed. ResLMFFNet introduces the following novelties:
• LMFFNet is the base model since it already achieves
an adequate trade-off between accuracy and efficiency.
However, residual connections are added to the SEM-B
blocks in this study to further increase accuracy without affecting inference speed (Fig. 1). By preserving
low-level features lost through deep SEM-B blocks,
these connections further enhance the performance of
LMFFNet. Residual connections are preferred to dense or
attention connections since the element-wise summation
operation does not introduce trainable weights.
• A dropout layer is applied in the decoder before upsampling,
which improves the generalization ability of the model and makes it better suited to a wide range
of precision agriculture practices.
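The first point above can be illustrated with a minimal sketch in plain Python. It is not the ResLMFFNet implementation: `scale_block` is a hypothetical stand-in for a SEM-B block (a simple per-channel scaling), and `with_residual` is an illustrative helper name. The sketch only shows that wrapping a block as x + F(x) uses element-wise summation, adds no trainable values, and lets an input feature survive even when the block itself suppresses it.

```python
from typing import Callable, List

Vector = List[float]


def scale_block(weights: Vector) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a SEM-B block: per-channel scaling F(x)."""
    def forward(x: Vector) -> Vector:
        return [w * v for w, v in zip(weights, x)]
    return forward


def with_residual(block: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Wrap a block so it computes x + F(x) by element-wise summation."""
    def forward(x: Vector) -> Vector:
        fx = block(x)
        if len(fx) != len(x):
            raise ValueError("residual summation needs matching shapes")
        # The skip path has no weights of its own: it is a plain addition.
        return [a + b for a, b in zip(x, fx)]
    return forward


# The only trainable values in this toy example belong to the block itself.
weights = [0.5, -1.0, 0.0, 2.0]
plain = scale_block(weights)
residual = with_residual(plain)

x = [1.0, 2.0, 3.0, 4.0]
print(plain(x))     # F(x): the third feature is zeroed by the block
print(residual(x))  # x + F(x): the third feature is carried through
```

Because the wrapper only adds the input back, the parameter count of the wrapped block equals that of the original block, which is consistent with the motivation given above for preferring residual connections over dense or attention connections.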
In the remainder of this paper, the details of the proposed
architecture ResLMFFNet are described in Sect. 2. Section 3
presents the experimental results. Conclusions are given in
Sect. 4.
2 Methods
Fig. 1 Complexity-accuracy trade-off comparison on the DSTL
image set in terms of Jaccard Index (JI), giga floating-point operations
(GFLOPs), and model parameters. The circle size indicates the number of model parameters
Fig. 2 ResLMFFNet architecture
The ResLMFFNet model, an improved version of the
LMFFNet architecture, is designed to increase accuracy while
preserving real-time capabilities, as depicted in Fig. 2. Similar to the LMFFNet design, the ResLMFFNet model is
composed of three core components: SEM-B block, feature
fusion module (FFM), and multiscale attention decoder
(MAD).
The novel contribution of the ResLMFFNet architecture is the use of residual connections within the
SEM-B blocks, as illustrated in Fig. 2. In addition, the
accuracy is further boosted by the inclusion of a dropout
layer in the decoder design. This section provides a detailed
explanation of the essential components within the ResLMFFNet architecture.
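The role of the dropout layer can be sketched in a few lines of plain Python. The text does not state the dropout rate or variant at this point, so the sketch assumes the common "inverted dropout" formulation with a rate of 0.5; the function name `dropout` and the flat feature list are illustrative assumptions, not the author's code.

```python
import random
from typing import List


def dropout(features: List[float], p: float, training: bool,
            rng: random.Random) -> List[float]:
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(features)  # identity at inference time
    keep = 1.0 - p
    # Kept values are divided by `keep` so the expected activation is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in features]


rng = random.Random(0)  # fixed seed, used here only for reproducibility
features = [1.0, 2.0, 3.0, 4.0]  # assumed decoder features before upsampling

train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

In this formulation the layer acts as the identity at inference time, so it regularizes training without adding trainable parameters or inference cost.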
2.1 SEM‑B block
The SEM-B block is built upon the split-extract-merge bottleneck shown in Fig. 3. SEM-B applies a 3×3 convolution,
then splits the feature map into two branches, each with 1/4
channels of the input. One branch undergoes depthwise convolution, while the other employs depthwise dilated convolution so that SEM-B effectively captures fine spatial det (...truncated)