FPGA architecture-based front-end processing for SLAM applications

Journal of Real-Time Image Processing, Mar 2025

Simultaneous Localization and Mapping (SLAM) is intended for robotic and autonomous vehicle applications. These targets require an optimal embedded implementation that respects real-time constraints, limited hardware resources, and energy consumption. SLAM algorithms are computationally intensive to run on embedded targets, and they are often deployed on CPUs or CPU–GPGPU architectures. With the growth of embedded heterogeneous computing systems, research is increasingly focused on the algorithm–architecture mapping of existing SLAM algorithms. The latest trend is to push processing closer to the sensor. FPGAs constitute the perfect architecture for designing smart sensors: they provide the low latency required by real-time applications such as video streaming, because sensor data can be supplied directly to the FPGA without needing a CPU. In this work, we propose the implementation of the HOOFR-SLAM front end on a CPU–FPGA architecture, including both the feature extraction and matching processing blocks. A high-level synthesis (HLS) approach based on the OpenCL paradigm has been used to design a new system architecture. The performance of the FPGA-based architecture was compared with that of a high-performance CPU. This innovative architecture delivers superior performance compared to existing state-of-the-art systems.


Journal of Real-Time Image Processing (2025) 22:73
https://doi.org/10.1007/s11554-025-01650-w

RESEARCH

Imad El Bouazzaoui¹ · Sergio Rodríguez Flórez¹ · Abdelhafid El Ouardi¹

Received: 16 February 2024 / Accepted: 15 February 2025
© The Author(s) 2025

Keywords Image processing · Features’ extraction and matching · FPGA implementation · Embedded systems

1 Introduction

Feature-based SLAM systems are becoming increasingly popular due to their performance and robustness. Several feature extractors are used in various SLAM systems, such as ORB [1], SIFT [2], and SURF [3].
Although these extractors yield good matching results, the computational complexity of feature extraction and matching is a significant hurdle when embedding such SLAM algorithms on low-power architectures. Recently, Nguyen et al. [4] proposed an FPGA implementation of the HOOFR extractor that maintains the same accuracy. However, matching remains the most time-consuming task in the processing flow. An accelerated architecture for this functional block is therefore required to achieve on-the-fly processing on a system-on-chip.

Our challenge is to boost the algorithm’s performance on low-power architectures to ensure on-the-fly processing. FPGAs are considered the best choice for stream processing. Compared to GPUs, which only provide parallelism for data processing and acceleration, FPGAs can also provide pipeline parallelism and on-the-fly data processing, which makes them more suitable for stream processing [5, 6], especially in embedded systems. We achieve our objective through an algorithm–architecture mapping applied to CPU–FPGA architectures. In practice, an algorithm is broken down into functional blocks, and each block is assigned to the appropriate processing unit, ensuring optimal performance. In this paper, we evaluate the performance of the matching block on different architectures, since it is the performance bottleneck, and conclude by proposing an optimal CPU–FPGA mapping for the RGB-D HOOFR-SLAM front-end. The source code is available at https://github.com/Imel23/FPGA-Based-HOOFR-Front-End.

* Sergio Rodríguez Flórez
Imad El Bouazzaoui
Abdelhafid El Ouardi
¹ Université Paris-Saclay, ENS Paris-Saclay, CNRS, SATIE, Gif-sur-Yvette 91190, France

2 Related works

Visual SLAM comprises two main blocks: a front-end block that handles all data processing and pose calculation, and a back-end block that optimizes the map and the trajectory.
The literature contains two approaches to embedding complex algorithms on dedicated architectures. The first focuses on embedding the front-end processing [4, 7, 8], and the second is concerned with the back-end [9, 10]. To bring the processing as near as possible to the sensor, we focus on works performed on the SLAM front-end.

2.1 CPU–GPU-based SLAM

CPU–GPU architectures are widely used in robotics, especially in computer vision, since a GPU offers many cores for parallel Single Instruction, Multiple Data (SIMD) processing. Based on DTAM [11], Ondrúška et al. [12] exploited the GPU of various mobile phones to implement a pipeline that creates a connected 3D surface model directly on the device in real time. They assigned sequential tasks, including keyframe selection and dense camera alignment, to the CPU, as the camera alignment requires an accumulation of errors across the entire input image. They also used SIMD instructions to process 4 pixels at a time. The GPU, running in parallel, carried out stereo depth computation, model update, and raycasting. Even though this architecture allows volumetric surface reconstruction and dense 6DoF camera tracking in real time, the GPU hardware constraints limit the voxel resolution.

In the category of indirect approaches, Aldegheri et al. [13] modified ORB-SLAM2 to operate on the Nvidia Jetson TX2 in real time. In addition to the existing parallelization of the algorithm on the CPU (parallel PThreads on shared-memory multi-core CPUs and automatic parallelization, through OpenMP directives, of the bundle adjustment sub-block), the authors added two layers of parallelism. The first consists of a parallel implementation of the tracking sub-blocks on the GPU. The second is the implementation of an 8-stage pipeline of these sub-blocks. The acceleration targets the feature extraction block, as it is the bottleneck of the processing flow.
For this purpose, they modeled the extraction block with a directed acyclic graph (DAG), adopting the OpenVX standard.

Ma et al. [14] presented a front-end processing parallelization of the ORB-SLAM2 algorithm on the Jetson TX2. Because front-end processing consumes more than half of the computing resources and operates on images, it is well suited to parallelization. The parallelization involves the feature extractor and the local point selection. The latter is used to reduce the input data of the matching block. The feature extraction parallelization involves constructing the Gaussian pyramid of the image on the GPU; feature detection, orientation calculation, and description are then also performed on the GPU. Because the CPU and GPU operate asynchronously, task and thread allocation were adjusted to decrease the idle time of the GPU and to increase the usage of the streaming multiprocessor (SM).

Nguyen et al. [7] found that the features’ matching block has a high computational cost and low data dependence, so th (...truncated)
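
The passage above breaks off while noting that feature matching combines a high computational cost with low data dependence. A minimal Python sketch of brute-force matching can illustrate both properties. It assumes binary descriptors compared by Hamming distance, which the preview text does not state. The function names (`hamming`, `match_brute_force`), the `max_distance` threshold, and the toy 8-bit descriptors are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Optional, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def match_brute_force(
    queries: List[int], candidates: List[int], max_distance: int
) -> List[Optional[Tuple[int, int]]]:
    """For each query, return (candidate index, distance) of its nearest
    candidate, or None if there is no candidate within max_distance."""
    matches: List[Optional[Tuple[int, int]]] = []
    # Each outer iteration only reads the shared inputs and writes its own
    # result, so the iterations are independent of one another.
    for q in queries:
        best: Optional[Tuple[int, int]] = None
        for j, c in enumerate(candidates):
            d = hamming(q, c)
            if best is None or d < best[1]:
                best = (j, d)
        # Reject the nearest candidate when it is still too far away.
        if best is not None and best[1] > max_distance:
            best = None
        matches.append(best)
    return matches


if __name__ == "__main__":
    # Toy 8-bit descriptors (hypothetical values for illustration only).
    previous_frame = [0b10110010, 0b00001111, 0b11110000]
    current_frame = [0b10110011, 0b11110001, 0b01010101]
    print(match_brute_force(current_frame, previous_frame, max_distance=2))
    # prints [(0, 1), (2, 1), None]
```

The sketch performs `len(queries) * len(candidates)` distance computations, which is where the cost comes from, while no query depends on the result of another. Under these assumptions, that independence is one plausible reading of why such a block suits parallel or pipelined hardware.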


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007/s11554-025-01650-w.pdf
Article home page: https://link.springer.com/article/10.1007/s11554-025-01650-w

El Bouazzaoui, Imad, Rodríguez Flórez, Sergio, El Ouardi, Abdelhafid. FPGA architecture-based front-end processing for SLAM applications, Journal of Real-Time Image Processing, 2025, pp. 1-11, Volume 22, Issue 2, DOI: 10.1007/s11554-025-01650-w