A generic deep learning architecture optimization method for edge device based on start-up latency reduction


Journal of Real-Time Image Processing (2024) 21:116
https://doi.org/10.1007/s11554-024-01496-8

RESEARCH

Qi Li¹ · Hengyi Li² · Lin Meng³

Received: 30 January 2024 / Accepted: 10 June 2024
© The Author(s) 2024

Abstract

In the promising Artificial Intelligence of Things technology, deep learning algorithms are implemented on edge devices to process data locally. However, high-performance deep learning algorithms come with increased computation and parameter storage costs, which makes it difficult to implement large deep learning algorithms on memory- and power-constrained edge devices such as smartphones and drones. Various compression methods, such as channel pruning, have therefore been proposed. An analysis of low-level operations on edge devices shows that existing channel pruning methods have a limited effect on latency optimization. Due to data processing operations, the pruned residual blocks still result in significant latency, which hinders real-time processing of CNNs on edge devices. Hence, we propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The network is optimized in two stages, Global Constraint and Start-up Latency Reduction, and both channels and residual blocks are pruned. Optimized networks are evaluated on desktop CPU, FPGA, ARM CPU, and PULP platforms. The experimental results show that latency is reduced by up to 70.40%, which is 13.63% higher than with channel pruning alone, achieving real-time processing on the edge device.

Keywords: Neural network compression · Edge device · PULP platform · FPGA

1 Introduction

The Artificial Intelligence of Things (AIoT), a promising integrated technology that combines artificial intelligence and the Internet of Things, is drawing significant interest [1].
However, feedback from AIoT systems usually has unacceptable latency due to the limited bandwidth of the network and the instability of communication [2, 3]. The current trend is to implement deep learning algorithms on edge devices that process the raw data close to the data source [4].

* Lin Meng
Qi Li
Hengyi Li
1 College of Science and Engineering, Ritsumeikan University, Noji-higashi, Kusatsu, Shiga 525-8577, Japan
2 Research Organization of Science and Technology, Ritsumeikan University, Noji-higashi, Kusatsu, Shiga 525-8577, Japan
3 Department of Electronic and Computer Engineering, Ritsumeikan University, Noji-higashi, Kusatsu, Shiga 525-8577, Japan

Convolutional neural networks (CNNs) are highly regarded among deep learning technologies due to their impressive performance in a variety of applications such as object recognition [5–7], healthcare [8, 9], image generation [10] and anomaly detection [11, 12]. The development of CNNs is accompanied by increasing memory usage and computational complexity, whereas edge devices are heavily constrained in computation power, memory bandwidth, and power consumption [13]. As a result, the latency of a full CNN algorithm on an edge device is normally unacceptable.

Compression and quantization are commonly required before deploying CNN-based applications onto edge devices [14]. CNN compression methods include channel pruning [15], knowledge distillation [16, 17], and matrix decomposition. Channel pruning methods aim to identify less important channels (i.e., filters) and remove them. Although channel pruning works well at reducing floating-point operations (FLOPs), it has a limited effect on latency optimization. We divide the low-level operations of the convolutional layer into matrix–vector multiplication (MVM) operations and data processing operations, as shown in Fig. 1.
The MVM operations come mainly from the convolution between the filters and the inputs. They account for most of the FLOPs [18] and are therefore the main compression target of channel pruning. Data processing operations are performed before and after the MVM operations, and include padding of feature maps, rearrangement of the input feature map (Im2col), re-quantization of the output, and storage of results [19]. The latency due to data processing operations is defined as start-up latency. Figure 2 shows the variation in latency and FLOPs when pruning the output channels of a convolutional layer. It can be seen that the reduction in latency is not as significant as the reduction in FLOPs. Even when only one output channel is left, 80% of the latency remains, which is the limit of channel pruning. Pruning the output channels effectively reduces MVM operations, but does not optimize the data processing operations at the inputs. This limitation of pruning leaves significant start-up latency in place.

To effectively optimize start-up latency, the data processing operations on both the input and output sides should be reduced. However, mainstream pruning strategies have difficulty achieving this goal. In recent years, a significant number of networks have adopted the design of residual blocks [20]. A residual block consists of two parts: the main path and the residual connection. The main path is composed of multiple weight layers, including convolution, batch normalization (BN), and activation layers. The residual connection adds the input directly to the output of the main path, which requires the input and output tensor shapes to be the same. Consequently, the pruning strategy for residual blocks adopted by most studies [21, 22] is to keep all input channels of the first layer and all output channels of the last layer. It is therefore worthwhile to improve this pruning strategy to achieve a further reduction in latency.
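The split between data processing and MVM operations described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `pad2d`, `im2col`, and `conv_im2col` are hypothetical, stride is assumed to be 1, and the returned element-copy and multiply-accumulate counts are only a rough proxy for work done, not a latency measurement. Output-side steps (re-quantization and storage) are left out for brevity.

```python
def pad2d(x, pad):
    """Zero-pad a [C][H][W] nested-list feature map on all four sides."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)] for _ in range(c)]
    for ci in range(c):
        for i in range(h):
            for j in range(w):
                out[ci][i + pad][j + pad] = x[ci][i][j]
    return out


def im2col(x, k):
    """Rearrange a padded [C][H][W] map into one column per output pixel (stride 1)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    cols = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            cols.append([x[ci][i + di][j + dj]
                         for ci in range(c)
                         for di in range(k)
                         for dj in range(k)])
    return cols


def conv_im2col(x, filters, k, pad):
    """Convolution as input-side data processing (pad + im2col) followed by MVM.

    filters is [C_out][C_in * k * k]. Returns (output, copies, macs), where
    output is [C_out][H_out * W_out], copies counts the elements written by
    padding and im2col, and macs counts multiply-accumulates in the MVM step.
    """
    padded = pad2d(x, pad)
    cols = im2col(padded, k)
    # Input-side data movement: depends only on the input shape, not on C_out.
    copies = len(padded) * len(padded[0]) * len(padded[0][0]) + len(cols) * len(cols[0])
    # MVM: every filter row is multiplied with every im2col column.
    out = [[sum(wt * v for wt, v in zip(f, col)) for col in cols] for f in filters]
    macs = len(filters) * len(cols) * len(cols[0])
    return out, copies, macs


# Prune output channels of a toy layer: 2 input channels, 4x4 map, 3x3 kernel.
x = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
for c_out in (8, 4, 1):
    filters = [[1.0] * (2 * 3 * 3) for _ in range(c_out)]
    _, copies, macs = conv_im2col(x, filters, k=3, pad=1)
    print(f"C_out={c_out}: input-side copies={copies}, MACs={macs}")
```

In this sketch the MAC count shrinks in proportion to the number of remaining output channels, while the input-side copy count stays fixed. The same holds for the first layer of a residual block whose input channels are all kept, which is consistent with the observation that output-channel pruning leaves the input-side data processing untouched.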
This observation motivates us to propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The CNNs are optimized in two stages: Global Constraint (GC) and Start-up Latency Reduction (SLR). The GC stage aims to achieve lossless channel pruning. The main paths are constrained by the adjunct layers, and the expression of redundant channels is blocked. Then, the adjunct layers are equivalently converted into BN layers to achieve channel pruning. Next, the SLR stage aims to optimize start-up latency. Residual blocks that do not function efficiently due to the constraints are identified and pruned in the SLR stage. Finally, the optimized network is implemented on multiple platforms, and the reduction in latency and FLOPs is evaluated.

Fig. 2 Variation in latency and FLOPs when pruning the convolutional layer with 8 output channels

The main contributions of the paper are as follows:

• We improve the mainstream pruning strategy to further reduce latency. Experimental results show that this approach reduces latency more than channel pruning alone.
• (...truncated)