FPGA implementation of double-head SalsaNext: a CNN-based model for LiDAR point cloud segmentation

Journal of Real-Time Image Processing (2025) 22:78
https://doi.org/10.1007/s11554-025-01643-9

RESEARCH

Muhammed Yasin Adiyaman¹ · Faik Baskaya¹

Received: 22 December 2024 / Accepted: 6 February 2025
© The Author(s) 2025

Abstract

This study details the adaptation and deployment of a customized SalsaNext model for semantic segmentation of LiDAR point clouds on edge devices, benchmarked using the SemanticKITTI and Waymo Open datasets. We introduce an innovative multi-dataset training framework designed specifically for range image-based segmentation models. Central to this approach is our double-head SalsaNext model, which features two output heads to facilitate simultaneous training and inference on the Waymo and SemanticKITTI datasets. Following training, the model is streamlined by removing the head dedicated to Waymo, resulting in a compact, single-headed version optimized for SemanticKITTI. This simplified model is then quantized to employ fixed-point arithmetic, significantly enhancing computational efficiency and enabling real-time operation on the Xilinx Kria KV260 board. The quantization process markedly reduces resource consumption while preserving competitive accuracy. Our deployment on this low-power, FPGA-based platform underscores the potential of energy-efficient systems for advanced 3D semantic segmentation, with promising applications in autonomous systems and robotics. Experimental results validate the effectiveness of our training schema and the success of the optimized implementation of the double-head model on resource-constrained hardware.

Keywords LIDAR · FPGA · Realtime · Semantic segmentation

Point cloud segmentation plays a vital role in vision-based applications such as autonomous driving and robotics.
Over time, this field has shifted from traditional hand-crafted techniques to deep neural networks (DNNs), which deliver significantly better performance. This progress has been driven by advances in computational power, enabling the design of more sophisticated models. However, deploying complex models on edge devices presents significant challenges due to limited computational resources and strict constraints on power consumption and latency. To address these limitations, DNN models must be optimized for hardware efficiency. One effective approach is quantization, which reduces the precision of model weights to lower bit resolutions. This significantly decreases computational requirements, improves power efficiency, and enhances inference speed, making it well-suited for edge deployment.

* Muhammed Yasin Adiyaman
  Faik Baskaya
¹ Electrical and Electronics Engineering Department, Bogazici University, Bebek, 34342 Istanbul, Turkey

In this study, we adapted, customized, and trained the lightweight CNN-based SalsaNext model using our proposed multi-head training mechanism. Subsequently, we optimized and quantized the model to 8-bit fixed-point precision before deploying it on a Xilinx Kria KV260 FPGA, leveraging the deep learning processing unit (DPU) architecture within its programmable logic (PL).

The main contributions of this work are outlined below:

• Real-time model for outdoor robotics: Developed a robust LiDAR segmentation model optimized for real-time performance in outdoor robotics, ensuring stable and reliable operation across diverse environments.
• Merging two of the largest LiDAR segmentation datasets: Proposed a domain adjustment schema to merge two of the largest and most widely used autonomous driving datasets, SemanticKITTI and the Waymo Open Dataset.
• Multiple-head CNN model based on SalsaNext: Proposed a multi-head SalsaNext model to train on multiple LiDAR segmentation datasets with different characteristics simultaneously without degrading the main head’s performance.
• State-of-the-art results on FPGA: Achieved higher mean IoU and accuracy for point cloud segmentation on the SemanticKITTI dataset than other FPGA-based solutions.
• Cost-effective FPGA deployment: Demonstrated that advanced point cloud segmentation models can be effectively deployed on an affordable FPGA platform.
• Open-source implementation: The implementation will be released as an open-source Python-based GitHub repository, allowing the community to adapt the approach for other CNN-based models compatible with FPGA systems.

The paper is organized as follows: Section 1 reviews related work on point cloud segmentation for edge devices, providing context and background. Section 2 provides an overview of the model architecture, details the dataset domain adjustments, explains the training process, and elaborates on the quantization and deployment process on the Kria KV260 board. Finally, Section 3 presents the experimental results from deploying the proposed model on a Xilinx FPGA card, showcasing its performance and effectiveness.

1 Related work

Point cloud segmentation for edge devices faces several critical challenges: managing unstructured data as part of input data representation; balancing complexity and stability in model architectures; achieving hardware efficiency for edge device optimization; and addressing domain shift during multi-dataset training.

1.1 Input data representation

Existing segmentation methods for point clouds typically fall into four categories based on how they structure the input data:

1. Point-set-based methods [1, 2],
2. Voxel-based methods [3],
3. Projection-based methods, such as those using range images or bird’s-eye-view (BEV) representations [4–7], and
4. Hybrid approaches [8, 9].

Among these, projection-based techniques, especially those employing range images [5, 6], are particularly attractive for edge deployment because they offer a compact, computationally efficient representation and can leverage mature 2D CNN architectures. Although BEV methods are a viable option, their extensive field of view in outdoor robotics often leads to a significant increase in input size, making them less suitable. In contrast, range images from single-sweep point clouds provide a more concise alternative, which is why our work focuses on range-image-based methods.

1.2 Model architectures

The literature identifies four primary model architectures for point cloud segmentation:

1. Graph-based models [10–12],
2. Transformer-based models [1, 2, 13],
3. 3D CNNs [14–16], and
4. 2D CNNs [4–7].

Transformer-based models, such as Point Transformer [1] and RangeFormer [13], use attention mechanisms to achieve high performance; however, their complexity and reliance on heterogeneous operations can hinder their deployment on edge devices. Similarly, while graph-based approaches like PointNet [17, 18] and KPConv [10] are adept at handling unstructured data, their high computational cost or limited accuracy often makes them unsuitable for edge applicatio (...truncated)
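
The range-image representation discussed in Sect. 1.1 can be illustrated with a minimal sketch of the spherical projection commonly used by range-image-based methods. The function name, the 64 × 2048 image size, and the vertical field of view of +3° to -25° are assumptions chosen for illustration (values typical of a 64-beam sensor); they are not taken from this paper, and the code is not the authors' implementation.

```python
import math


def project_to_range_image(points, height=64, width=2048,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points onto a height x width range image.

    Cells that receive no point stay at -1.0. When several points fall
    into the same cell, the closest one is kept.
    All default parameters are illustrative assumptions.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down  # total vertical field of view in radians

    image = [[-1.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip degenerate points at the sensor origin
        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation angle

        # Normalize both angles to [0, 1], then scale to pixel coordinates.
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height

        # Clamp to valid indices so out-of-FOV points land on the border.
        col = min(width - 1, max(0, int(math.floor(u))))
        row = min(height - 1, max(0, int(math.floor(v))))

        # Keep the nearest return per cell.
        if image[row][col] < 0.0 or r < image[row][col]:
            image[row][col] = r
    return image
```

Applied to a single sweep, the function returns a dense 2D grid of ranges that a standard 2D CNN can consume. Range-image pipelines often store further channels, such as x, y, z, and intensity, alongside the range in the same grid.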