FPGA architecture-based front-end processing for SLAM applications
Journal of Real-Time Image Processing
(2025) 22:73
https://doi.org/10.1007/s11554-025-01650-w
RESEARCH
Imad El Bouazzaoui¹ · Sergio Rodríguez Flórez¹ · Abdelhafid El Ouardi¹
Received: 16 February 2024 / Accepted: 15 February 2025
© The Author(s) 2025
Abstract
Simultaneous Localization and Mapping (SLAM) is intended for robotic and autonomous vehicle applications. These targets require an
optimal embedded implementation that respects real-time constraints, limited hardware resources, and energy consumption.
SLAM algorithms are computationally intensive to run on embedded targets, and they are often deployed on CPUs
or CPU–GPGPU architectures. With the growth of embedded heterogeneous computing systems, research is increasingly
focused on the algorithm–architecture mapping of existing SLAM algorithms. The latest trend is to push processing
closer to the sensor. FPGAs constitute an ideal architecture for designing smart sensors: they provide the low latency
required by real-time applications such as video streaming, since the sensor supplies data directly to the FPGA without
needing a CPU. In this work, we propose an implementation of the HOOFR-SLAM front end on a CPU–FPGA architecture,
including both the feature extraction and matching processing blocks. A high-level synthesis (HLS) approach based on the
OpenCL paradigm has been used to design a new system architecture. The performance of the FPGA-based architecture was
compared to that of a high-performance CPU. This innovative architecture delivers superior performance compared to
existing state-of-the-art systems.
Keywords Image processing · Features’ extraction and matching · FPGA implementation · Embedded systems
1 Introduction
Feature-based SLAM systems are becoming increasingly
popular due to their performance and robustness. Several
feature extractors are used in various SLAM systems, such
as ORB [1], SIFT [2], and SURF [3]. Although these extractors yield good matching results, the computational complexity of feature extraction and matching is a significant
hurdle when embedding such SLAM algorithms on low-power architectures. Recently, Nguyen et al. [4] proposed
an FPGA implementation of the HOOFR extractor while
maintaining the same accuracy. However, the matching task
remains the most time-consuming task in the processing
flow. The design of an accelerated architecture for this functional block is mandatory to achieve on-the-fly processing on
a system-on-chip. Our challenge is to boost the algorithm’s
performance on low-power architectures to ensure on-the-fly
processing. FPGAs are considered the best choice for
stream processing. Compared to GPUs, which only provide
parallelism for data processing and acceleration, FPGAs can
provide pipeline parallelism and on-the-fly data processing,
which makes them more suitable for stream processing [5,
6], especially for embedded systems.
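To make the benefit of pipeline parallelism for stream processing concrete, the following minimal sketch compares the total time needed to push a stream of items through a chain of stages, with and without pipelining. It is an idealized textbook model, not a measurement from this work: the function names, the stage count, and the assumption that every stage takes the same time are all illustrative.

```python
def sequential_time(n_items: int, n_stages: int, stage_time: float) -> float:
    """Total time when each item finishes every stage before the next item starts."""
    return n_items * n_stages * stage_time


def pipelined_time(n_items: int, n_stages: int, stage_time: float) -> float:
    """Total time for an ideal pipeline with equal stage times.

    The first item needs n_stages time slots to fill the pipeline;
    every following item then completes one slot later.
    """
    if n_items == 0:
        return 0.0
    return (n_stages + n_items - 1) * stage_time


def speedup(n_items: int, n_stages: int, stage_time: float = 1.0) -> float:
    """Ratio of sequential to pipelined time (approaches n_stages for long streams)."""
    return sequential_time(n_items, n_stages, stage_time) / pipelined_time(
        n_items, n_stages, stage_time
    )


if __name__ == "__main__":
    # Hypothetical example: 100 items through 4 stages of 1 time unit each.
    print(sequential_time(100, 4, 1.0))  # 400.0
    print(pipelined_time(100, 4, 1.0))   # 103.0
```

Under these assumptions the speedup tends toward the number of stages as the stream gets longer, which is why a continuous pixel or feature stream is a favorable workload for a pipelined design.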
We achieve our objective through an algorithm–architecture mapping applied to CPU–FPGA architectures. In practice, an algorithm is broken down into functional blocks, and
each block is assigned to the appropriate processing unit,
ensuring optimal performance. In this paper, we evaluate
the performance of the matching block on different architectures, since it is the performance bottleneck, and conclude
by proposing an optimal CPU–FPGA mapping for
the RGB-D HOOFR-SLAM front-end.¹
¹ Université Paris-Saclay, ENS Paris-Saclay, CNRS, SATIE, Gif-sur-Yvette 91190, France
¹ The source code is available at https://github.com/Imel23/FPGA-Based-HOOFR-Front-End.
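The algorithm–architecture mapping described in the introduction (splitting an algorithm into functional blocks and assigning each block to a processing unit) can be illustrated with a small sketch. This is not the mapping procedure used in this work: it is one possible formulation, assuming a linear chain of blocks, a known run-time estimate per block and per unit, and a fixed penalty whenever consecutive blocks run on different units. The function name `map_blocks`, the unit labels, and all numeric values are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def map_blocks(
    costs: List[Dict[str, float]],
    transfer_cost: float,
) -> Tuple[float, List[str]]:
    """Assign each block of a linear chain to one unit, minimizing total time.

    costs[i][u] is the estimated run time of block i on unit u.
    transfer_cost is paid whenever two consecutive blocks run on different units.
    Returns (total time, list of chosen units, one per block).
    """
    if not costs:
        return 0.0, []
    # best[u] = (total time, assignment) for the chain so far, last block on unit u
    best = {u: (t, [u]) for u, t in costs[0].items()}
    for block in costs[1:]:
        new_best = {}
        for u, t in block.items():
            candidates = []
            for prev_u, (prev_total, prev_path) in best.items():
                penalty = 0.0 if prev_u == u else transfer_cost
                candidates.append((prev_total + penalty + t, prev_path + [u]))
            new_best[u] = min(candidates, key=lambda c: c[0])
        best = new_best
    return min(best.values(), key=lambda c: c[0])


if __name__ == "__main__":
    # Placeholder run-time estimates (arbitrary units), NOT measured values.
    blocks = [
        {"cpu": 2.0, "fpga": 5.0},
        {"cpu": 8.0, "fpga": 1.0},
        {"cpu": 9.0, "fpga": 2.0},
        {"cpu": 1.0, "fpga": 4.0},
    ]
    print(map_blocks(blocks, transfer_cost=0.5))
    # (7.0, ['cpu', 'fpga', 'fpga', 'cpu'])
```

With a small transfer penalty the sketch moves the expensive middle blocks to the second unit; raising the penalty (for example to 10.0 with the same placeholder values) makes it keep the whole chain on a single unit, which shows why data-transfer cost has to be part of any mapping decision.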
2 Related works
Visual SLAM comprises two main blocks: a front-end
block that handles all data processing and pose
calculation, and a back-end block that optimizes the map
and the trajectory. The literature follows two approaches to
embedding complex algorithms on dedicated architectures.
The first focuses on embedding the front-end processing [4, 7, 8], and the second is concerned with
the back-end [9, 10]. To bring the processing as close as
possible to the sensor, we focus on works that address
the SLAM front-end.
2.1 CPU–GPU‑based SLAM
CPU–GPU architectures are widely used in robotics, especially in computer vision, since a GPU offers
many cores for parallel Single Instruction, Multiple Data
(SIMD) processing. Based on DTAM [11], Ondrúška
et al. [12] exploited the GPU of various mobile phones to
implement a pipeline that creates a connected 3D surface
model directly on the device in real time. They assigned
sequential tasks, including keyframe selection and dense
camera alignment, to the CPU, as the camera alignment
requires an accumulation of errors across the entire input
image. They also used SIMD instructions to process
4 pixels at a time. Meanwhile, the GPU
ran in parallel, carrying out stereo depth computation,
model update, and raycasting. Even though this architecture allows volumetric surface reconstruction and dense
6DoF camera tracking in real time, the GPU hardware
constraints limit the voxel resolution. In the category of
indirect approaches, Aldegheri et al. [13] modified ORB-SLAM2 to operate on the Nvidia Jetson TX2 in real time.
In addition to the existing parallelization of the algorithm on
the CPU (parallel PThreads on shared-memory multi-core
CPUs and an automatic parallel implementation, through
OpenMP directives, of the bundle adjustment sub-block),
the authors added two layers of parallelism. The first
consists of a parallel implementation of the tracking
sub-blocks on the GPU. The second is the implementation of
an 8-stage pipeline of these sub-blocks. The acceleration
targets the feature extraction block, as it is the bottleneck
of the processing flow. For this purpose, they modeled
the extraction block as a directed acyclic graph (DAG),
adopting the OpenVX standard. Ma et al. [14] presented a
front-end processing parallelization of the ORB-SLAM2
algorithm on the Jetson TX2. As front-end processing consumes more than half of the computing resources and operates
on images, this part is well suited to parallelization.
The parallelization involves the feature extractor and the
local point selection. The latter is used to reduce the input
data of the matching block. The feature extraction parallelization involves constructing the Gaussian pyramid
of the image on the GPU; feature detection, orientation calculation, and description are then also performed on the
GPU. Because the CPU and GPU operate asynchronously,
task and thread allocation were adjusted to decrease
the idle time of the GPU and to increase the usage of the
streaming multiprocessor (SM). Nguyen et al. [7] found
that the features’ matching block has a high computational
cost and low data dependence, so th (...truncated)