A generic deep learning architecture optimization method for edge device based on start-up latency reduction
Journal of Real-Time Image Processing
(2024) 21:116
https://doi.org/10.1007/s11554-024-01496-8
RESEARCH
Qi Li1 · Hengyi Li2 · Lin Meng3
Received: 30 January 2024 / Accepted: 10 June 2024
© The Author(s) 2024
Abstract
In the promising Artificial Intelligence of Things technology, deep learning algorithms are implemented on edge devices
to process data locally. However, high-performance deep learning algorithms are accompanied by increased computation
and parameter storage costs, making it difficult to deploy large deep learning algorithms on memory- and power-constrained
edge devices, such as smartphones and drones. Various compression methods, such as channel pruning, have therefore been
proposed. However, our analysis of low-level operations on edge devices shows that existing channel pruning methods have a limited
effect on latency optimization. Due to data processing operations, the pruned residual blocks still result in significant latency,
which hinders real-time processing of CNNs on edge devices. Hence, we propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The network is optimized in two stages, Global Constraint and
Start-up Latency Reduction, and pruning of both channels and residual blocks is achieved. Optimized networks are evaluated
on desktop CPU, FPGA, ARM CPU, and PULP platforms. The experimental results show that latency is reduced by up to
70.40%, which is 13.63% more than with channel pruning alone, and that real-time processing is achieved on the edge device.
Keywords Neural network compression · Edge device · PULP platform · FPGA
* Lin Meng
Qi Li
Hengyi Li
1 College of Science and Engineering, Ritsumeikan University, Noji‑higashi, Kusatsu, Shiga 525‑8577, Japan
2 Research Organization of Science and Technology, Ritsumeikan University, Noji‑higashi, Kusatsu, Shiga 525‑8577, Japan
3 Department of Electronic and Computer Engineering, Ritsumeikan University, Noji‑higashi, Kusatsu, Shiga 525‑8577, Japan
1 Introduction
The Artificial Intelligence of Things (AIoT), a promising
integrated technology that combines artificial intelligence
and the Internet of Things, is drawing significant interest
[1]. However, feedback from AIoT systems usually has unacceptable latency due to limited network bandwidth
and unstable communication [2, 3]. The current trend is
to implement deep learning algorithms on edge devices that
process the raw data close to the data source [4].
Convolutional neural networks (CNNs) are highly
regarded among deep learning technologies due to their
impressive performance in a variety of applications such as
object recognition [5–7], healthcare [8, 9], image generation [10] and anomaly detection [11, 12]. The development
of CNNs has been accompanied by increasing memory
usage and computational complexity, whereas edge devices
are heavily constrained in computation power, memory
bandwidth, and power consumption [13]. As a result, the
latency of a full CNN on an edge device is normally unacceptable.
Compression and quantization are commonly required
before deploying CNN-based applications onto edge devices
[14]. CNN compression methods include channel pruning [15], knowledge distillation [16, 17], and matrix decomposition, among others. Channel pruning methods aim to identify
less important channels (i.e., filters) and remove them.
Although channel pruning works well at reducing
floating point operations (FLOPs), it has a limited effect on latency optimization. We divide the low-level
operations of the convolutional layer into matrix–vector
multiplication (MVM) operations and data processing operations, as shown in Fig. 1. The MVM operations are mainly
from the convolution between the filters and the inputs. They
account for most of the FLOPs [18] and are therefore the main compression target of channel pruning. Data processing operations are performed before and after the MVM operations,
including padding of feature maps, rearrangement of the
input feature map (Im2col), re-quantization of the output,
and storage of results [19]. The latency due to data processing operations is defined as start-up latency.
Figure 2 shows how latency and FLOPs vary as the output
channels of a convolutional layer are pruned. The
reduction in latency is not as significant
as the reduction in FLOPs. Even when only one output channel is left,
80% of the latency remains, which is the limit of channel pruning. Pruning the output channels effectively reduces MVM
operations but does not optimize the data processing operations
at the inputs. Because of this limitation, significant
start-up latency remains.
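The split between input-side data processing and MVM work can be illustrated with a minimal sketch. It assumes a stride-1, unpadded convolution on nested Python lists, and the names `im2col` and `conv_costs` are hypothetical helpers introduced here, not code from the paper. It only counts the input rearrangement; padding, re-quantization, and result storage are left out.

```python
def im2col(x, k):
    """Rearrange a C_in x H x W input (nested lists) into patch rows.

    Illustrative assumption: stride 1, no padding. Each returned row is one
    flattened k*k*C_in patch, one row per output pixel.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    cols = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            cols.append([x[c][i + di][j + dj]
                         for c in range(c_in)
                         for di in range(k)
                         for dj in range(k)])
    return cols


def conv_costs(c_in, c_out, h, w, k):
    """Count im2col element copies and MVM multiply-accumulates (MACs)."""
    n_patches = (h - k + 1) * (w - k + 1)
    patch_len = c_in * k * k
    copies = n_patches * patch_len        # does not depend on c_out
    macs = n_patches * patch_len * c_out  # shrinks linearly with c_out
    return copies, macs


# Pruning output channels from 8 down to 1 (layer sizes are arbitrary examples).
for c_out in (8, 4, 2, 1):
    copies, macs = conv_costs(c_in=16, c_out=c_out, h=32, w=32, k=3)
    print(f"c_out={c_out}: im2col copies={copies}, MACs={macs}")
```

In this simplified count the MACs fall in proportion to the remaining output channels, while the im2col copies stay constant, which is consistent with the observation that output-channel pruning leaves input-side data processing untouched.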
To effectively optimize start-up latency, the data processing operations on both the input and output sides should
be reduced. However, mainstream pruning strategies have
difficulty in achieving this goal. In recent years, a significant number of networks have adopted the design of residual
blocks [20]. A residual block consists of two parts: the main
path and the residual connection. The main path is composed
of multiple weight layers, including convolution, batch normalization (BN), and activation layers. The residual connection adds the input directly to the output of the main path,
which requires that the input and output tensor shapes are the
same. Consequently, the pruning strategy for residual blocks
adopted by most studies [21, 22] is to keep all input
channels of the first layer and all output channels of the last
layer. Improving this pruning strategy is therefore worthwhile
to achieve a further reduction
in latency.
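The shape constraint behind this conventional strategy can be sketched as follows. The helpers `main_path_shapes` and `prune_inner` are hypothetical names used only for illustration, and the uniform `keep_ratio` is an assumption rather than the selection criterion of any cited method.

```python
def main_path_shapes(c_in, inner_channels, c_out):
    """Return (in_channels, out_channels) for each conv layer on the main path."""
    widths = [c_in] + list(inner_channels) + [c_out]
    return list(zip(widths[:-1], widths[1:]))


def prune_inner(c_block, inner_channels, keep_ratio):
    """Conventional strategy: prune only the inner widths of a residual block.

    The block input and output both keep c_block channels, because the
    element-wise addition x + F(x) needs matching tensor shapes.
    keep_ratio is an illustrative assumption, not a published criterion.
    """
    pruned = [max(1, int(c * keep_ratio)) for c in inner_channels]
    layers = main_path_shapes(c_block, pruned, c_block)
    # First-layer inputs and last-layer outputs are left untouched.
    assert layers[0][0] == layers[-1][1] == c_block
    return layers


# A two-convolution block with 64 channels, keeping 25% of the inner channels.
print(prune_inner(64, [64], 0.25))
```

However aggressively the inner channels are pruned, the first layer still reads all 64 input channels and the last layer still writes all 64 output channels, so the data processing on both sides of the block is unchanged.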
Fig. 2 Variation in latency and FLOPs when pruning the convolutional layer with 8 output channels
These observations motivate us to propose a generic deep learning architecture optimization method to achieve further
acceleration on edge devices. The CNNs are optimized in
two stages: Global Constraint (GC) and Start-up Latency
Reduction (SLR). The GC stage aims to achieve lossless
channel pruning. The main paths are constrained by the
adjunct layers, and the expression of redundant channels
is blocked. Then, the adjunct layers are equivalently converted into the BN layers to achieve channel pruning. Next,
the SLR stage aims to optimize start-up latency. Residual
blocks that do not function efficiently due to the constraints
are identified and pruned in the SLR stage. Finally, the optimized network is implemented on multiple platforms, and
the reduction in latency and FLOPs is evaluated.
The main contributions of the paper are as follows:
• We improve the mainstream pruning strategy to further
reduce latency. Experimental results show that this
approach reduces latency more than channel pruning
alone.
• (...truncated)