FPGA implementation of double-head SalsaNext: a CNN-based model for LiDAR point cloud segmentation
Journal of Real-Time Image Processing
(2025) 22:78
https://doi.org/10.1007/s11554-025-01643-9
RESEARCH
Muhammed Yasin Adiyaman¹ · Faik Baskaya¹
Received: 22 December 2024 / Accepted: 6 February 2025
© The Author(s) 2025
Abstract
This study details the adaptation and deployment of a customized SalsaNext model for semantic segmentation of LiDAR
point clouds on edge devices, benchmarked on the SemanticKITTI and Waymo Open datasets. We introduce a
multi-dataset training framework designed specifically for range-image-based segmentation models. Central to this approach
is our double-head SalsaNext model, which features two output heads to enable simultaneous training and inference on
the Waymo and SemanticKITTI datasets. After training, the head dedicated to Waymo is removed, yielding a compact,
single-head version optimized for SemanticKITTI. This simplified model is then quantized to fixed-point arithmetic,
which substantially improves computational efficiency and enables real-time operation on the Xilinx Kria KV260 board.
Quantization markedly reduces resource consumption while preserving competitive accuracy. Our deployment on this
low-power, FPGA-based platform underscores the potential of energy-efficient systems for advanced 3D semantic
segmentation, with promising applications in autonomous systems and robotics. Experimental results validate the
effectiveness of our training schema and the successful deployment of the optimized model, derived from the double-head
network, on resource-constrained hardware.
Keywords LiDAR · FPGA · Real-time · Semantic segmentation
Point cloud segmentation plays a vital role in vision-based
applications such as autonomous driving and robotics. Over
time, this field has shifted from traditional hand-crafted techniques to deep neural networks (DNNs), which deliver significantly better performance. This progress has been driven
by advances in computational power, enabling the design of
more sophisticated models.
However, deploying complex models on edge devices
presents significant challenges due to limited computational
resources and strict constraints on power consumption and
latency. To address these limitations, DNN models must be
optimized for hardware efficiency. One effective approach is
quantization, which reduces the numerical precision of model
weights (and typically activations) to lower bit widths. This
substantially decreases computational and memory requirements,
improves power efficiency, and increases inference speed,
making quantized models well-suited for edge deployment.

* Muhammed Yasin Adiyaman
Faik Baskaya
¹ Electrical and Electronics Engineering Department, Bogazici
University, Bebek, 34342 Istanbul, Turkey
In this study, we adapted, customized, and trained the
lightweight CNN-based SalsaNext model using our proposed multi-head training mechanism. Subsequently, we
optimized and quantized the model to 8-bit fixed-point precision before deploying it on a Xilinx Kria KV260 FPGA,
leveraging the deep learning processing unit (DPU) architecture within its programmable logic (PL).
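To make "8-bit fixed-point precision" concrete, the sketch below quantizes a tensor to signed 8-bit integers with a power-of-two scale, so each value is represented as q · 2^(−f) for an integer code q and a number of fractional bits f. The power-of-two scale is an assumption on our part (it is how the Vitis AI DPU flow is commonly described, since rescaling reduces to bit shifts); the text above only states 8-bit fixed point. The functions `quantize_pow2` and `dequantize_pow2` are hypothetical helpers, not part of any toolchain API.

```python
import numpy as np


def quantize_pow2(x, bits=8):
    """Symmetric fixed-point quantization with a power-of-two scale.

    Returns integer codes q and fractional bits f such that x ~= q * 2**-f.
    """
    qmax = 2 ** (bits - 1) - 1   # 127 for 8 bits
    qmin = -(2 ** (bits - 1))    # -128 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return np.zeros(x.shape, dtype=np.int32), 0
    # Largest f such that max_abs * 2**f still fits within qmax
    f = int(np.floor(np.log2(qmax / max_abs)))
    q = np.clip(np.round(x * 2.0 ** f), qmin, qmax).astype(np.int32)
    return q, f


def dequantize_pow2(q, f):
    """Map integer codes back to real values: q * 2**-f."""
    return q.astype(np.float64) * 2.0 ** (-f)


w = np.array([0.5, -1.0, 0.25, 0.9])
q, f = quantize_pow2(w)
w_hat = dequantize_pow2(q, f)
print(f, q.tolist(), w_hat.tolist())  # 6 [32, -64, 16, 58] [0.5, -1.0, 0.25, 0.90625]
```

With f = 6 the step size is 2^(−6) = 0.015625, so the worst-case rounding error for in-range values is half a step (0.0078125); here only 0.9 is not exactly representable and becomes 0.90625.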
The main contributions of this work are outlined below:
• Real-time model for outdoor robotics: Developed a
robust LiDAR segmentation model optimized for real-time performance in outdoor robotics, ensuring stable
and reliable operation across diverse environments.
• Merging two of the largest LiDAR segmentation datasets: Proposed a domain adjustment schema to merge two
of the most widely used and largest autonomous driving datasets, SemanticKITTI
and the Waymo Open Dataset.
• Multiple-head CNN model based on SalsaNext: Proposed a multi-head SalsaNext model that trains on multiple
LiDAR segmentation datasets with different characteristics simultaneously, without degrading the main head’s
performance.
• State-of-the-art results on FPGA: Achieved higher
mean IoU and accuracy for point cloud segmentation
on the SemanticKITTI dataset than other FPGA-based solutions.
• Cost-effective FPGA deployment: Demonstrated that
advanced point cloud segmentation models can be effectively deployed on an affordable FPGA platform.
• Open-source implementation: The implementation will be released as an open-source Python-based
GitHub repository, allowing the community to adapt the
approach for other CNN-based models compatible with
FPGA systems.
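The multi-head idea in the contributions above can be illustrated with a framework-agnostic toy: shared weights feed one classification head per dataset, each batch is scored only by its own dataset's head, and the Waymo head is deleted after training. This is a hypothetical sketch, not the authors' implementation. The class `DoubleHeadToy`, the single per-pixel linear layer standing in for the SalsaNext encoder/decoder, the 5 input channels, and the 20/23 class counts are all illustrative assumptions, and gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over pixels (numerically stable)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.shape[0]), labels].mean())


class DoubleHeadToy:
    """Hypothetical stand-in: shared per-pixel weights plus one head per dataset."""

    def __init__(self, in_ch, feat_ch, n_classes):
        self.shared = rng.standard_normal((in_ch, feat_ch)) * 0.1
        self.heads = {name: rng.standard_normal((feat_ch, c)) * 0.1
                      for name, c in n_classes.items()}

    def forward(self, x, dataset):
        feats = np.maximum(x @ self.shared, 0.0)  # shared features (ReLU)
        return feats @ self.heads[dataset]         # dataset-specific logits

    def drop_head(self, name):
        """Remove a head after training, keeping the shared weights intact."""
        del self.heads[name]


# Hypothetical sizes: 5 input channels per pixel, 20 / 23 output classes.
model = DoubleHeadToy(in_ch=5, feat_ch=32, n_classes={"kitti": 20, "waymo": 23})

x_kitti = rng.standard_normal((1000, 5))
y_kitti = rng.integers(0, 20, size=1000)
x_waymo = rng.standard_normal((1000, 5))
y_waymo = rng.integers(0, 23, size=1000)

# Each batch is scored only by its own head; the shared weights see both datasets.
loss_kitti = cross_entropy(model.forward(x_kitti, "kitti"), y_kitti)
loss_waymo = cross_entropy(model.forward(x_waymo, "waymo"), y_waymo)
total_loss = loss_kitti + loss_waymo  # would be backpropagated in a real framework

model.drop_head("waymo")  # single-head model for SemanticKITTI deployment
```

Because the label spaces never have to be unified, each dataset keeps its own taxonomy, and removing the unused head leaves the SemanticKITTI inference path unchanged.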
The paper is organized as follows: Section 1 reviews related
work on point cloud segmentation for edge devices, providing context and background. Section 2 describes the model
architecture, the dataset domain adjustments, and the training process, and details the quantization and deployment
flow on the Kria KV260 board. Finally, Sect. 3 presents the experimental results of deploying the proposed model on
the Xilinx FPGA board, demonstrating its performance and effectiveness.
1 Related work
Point cloud segmentation for edge devices faces several critical challenges: representing unstructured input
data; balancing complexity and stability in model architectures; optimizing models for hardware efficiency on
edge devices; and addressing domain shift during
multi-dataset training.
1.1 Input data representation
Existing segmentation methods for point clouds typically fall
into four categories based on how they structure the input
data:
1. Point-set-based methods [1, 2],
2. Voxel-based methods [3],
3. Projection-based methods, such as those using range
images or bird’s-eye-view (BEV) representations [4–7],
and
4. Hybrid approaches [8, 9].
Among these, projection-based techniques, especially those
employing range images [5, 6], are particularly attractive
for edge deployment because they offer a compact, computationally efficient representation and can leverage mature
2D CNN architectures. Although BEV methods are a viable
option, the wide area that must be covered in outdoor robotics
often leads to a large BEV grid and a significant increase in
input size, making them less suitable. In contrast, range images
from single-sweep point clouds provide a more compact alternative,
which is why our work focuses on range-image-based methods.
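A range image is obtained by spherically projecting each point onto a grid indexed by elevation (rows) and azimuth (columns). The sketch below follows the spherical projection commonly used in range-image pipelines such as SalsaNext; the 64 × 2048 resolution and the +3°/−25° vertical field of view are assumptions (typical values for SemanticKITTI's 64-beam sensor), not parameters stated in this section. The function name `spherical_projection` is hypothetical.

```python
import numpy as np


def spherical_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) array of x, y, z points onto an H x W range image.

    Assumes every point has nonzero range. Returns the range image (0 marks
    empty pixels) and the per-point row/column indices.
    """
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = abs(fov_up) + abs(fov_down)             # total vertical field of view

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)            # range of each point

    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                      # elevation angle

    u = 0.5 * (1.0 - yaw / np.pi) * W             # horizontal pixel coordinate
    v = (1.0 - (pitch + abs(fov_down)) / fov) * H  # vertical pixel coordinate

    cols = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Keep the closest point when several points fall into the same pixel.
    image = np.full((H, W), np.inf, dtype=np.float32)
    np.minimum.at(image, (rows, cols), r.astype(np.float32))
    image[np.isinf(image)] = 0.0
    return image, rows, cols


pts = np.array([[10.0, 0.0, 0.0],
                [5.0, 0.0, 0.0]])
img, rows, cols = spherical_projection(pts)
print(rows.tolist(), cols.tolist(), img[6, 1024])  # [6, 6] [1024, 1024] 5.0
```

Both points lie straight ahead of the sensor, so they share a pixel and the nearer one (range 5 m) is kept. For scale, a 64 × 2048 range image has 131,072 pixels, whereas a hypothetical BEV grid covering ±50 m at 0.1 m resolution has 1,000 × 1,000 = 1,000,000 cells.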
1.2 Model architectures
The literature identifies four primary model architectures for
point cloud segmentation:
1. Graph-based models [10–12],
2. Transformer-based models [1, 2, 13],
3. 3D CNNs [14–16], and
4. 2D CNNs [4–7].
Transformer-based models, such as Point Transformer [1]
and RangeFormer [13], use attention mechanisms to achieve
high performance; however, their complexity and reliance
on heterogeneous operations can hinder their deployment on
edge devices. Similarly, while graph-based approaches like
PointNet [17, 18] and KPConv [10] are adept at handling
unstructured data, their high computational cost or limited
accuracy often makes them unsuitable for edge applicatio (...truncated)