Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation


International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02320-3

Antonin Vobecky1,2,3 · Josef Sivic1 · David Hurych2 · Oriane Siméoni2 · Spyros Gidaris2 · Andrei Bursuc2 · Patrick Pérez4

Received: 29 September 2023 / Accepted: 29 November 2024
© The Author(s) 2025

Abstract
Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and image data. A crucial element of our method is the integration of an object proposal module that examines the LiDAR point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals can be aligned with corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we introduce a cross-modal distillation technique that utilizes image data partially annotated with the learnt pseudo-classes to train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements of our approach by extending the proposed model using a teacher-student distillation with an exponential moving average and incorporating soft targets from the teacher. We show the generalization capabilities of our method by testing on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning.
We present an in-depth experimental analysis of the proposed model including results when using another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and LiDAR’s density, supervised finetuning as well as additional qualitative results and their analysis.

Keywords Autonomous driving · Unsupervised semantic segmentation · Multimodal learning

Communicated by Dengxin Dai.

Corresponding author: Antonin Vobecky
1 Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic
2 valeo.ai, Paris, France
3 Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic
4 Kyutai, Paris, France

1 Introduction

In this work, we investigate whether it is possible to learn pixel-wise semantic image segmentation of urban scenes without the need for any manual annotation, just from the raw non-curated data collected by cars equipped with cameras and LiDAR sensors while driving in town. This topic is essential, as current methods require large amounts of pixel-wise annotations over various driving conditions and situations. Such manual segmentation of images at large scale is very expensive, time-consuming, and prone to biases. Currently, the best methods for unsupervised learning of semantic segmentation either assume that images contain centered objects (Van et al., 2021) rather than whole scenes, or use spatial self-supervision available in the image domain (Cho et al., 2021). They do not leverage additional modalities, such as the LiDAR data, available for urban scenes in autonomous driving setups. In this work, we develop an approach for unsupervised semantic segmentation that learns to segment complex scenes containing many objects, including thin structures such as pedestrians or traffic lights, without the need for any manual annotation. Instead, it leverages cross-modal information available in (aligned) LiDAR point clouds and images; see Fig. 1. Exploiting point clouds as a form of supervision is, however, not straightforward: data from LiDAR and camera are rarely perfectly synchronized; moreover, point clouds are unstructured and of much lower resolution compared to images; finally, extracting useful semantic information from LiDAR is still a very hard problem. In this work, we overcome these issues and show that extracting useful pixel-wise semantic supervision from LiDAR data is possible.

Fig. 1 Proposed fully-unsupervised approach. From uncurated images and LiDAR data (left), our Drive&Segment approach learns a semantic image segmentation model with no manual annotations. The resulting model performs unsupervised semantic segmentation of new unseen datasets (right) without any human labeling. It can segment complex scenes with many objects, including thin structures such as people, bicycles, poles or traffic lights. The black color denotes the ignored areas (Color figure online)

The contributions of our work are threefold. First, we propose a novel method for unsupervised cross-modal learning of semantic image segmentation by leveraging synchronized LiDAR and image data. The critical ingredient is a module that analyzes the LiDAR point cloud to obtain proposals for spatially consistent objects that can be clearly separated from each other and the ground plane in the 3D scene. Second, we show that these 3D object proposals can be aligned with input images and reliably clustered into semantically meaningful pseudo-classes by using image features from a network trained without supervision. We demonstrate that this approach is robust to noise in point clouds and delivers, without the need for any manual annotation, pseudo-classes with pixel-wise segmentation for various objects present in driving scenes.
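As a rough illustration of the alignment and clustering step described in the second contribution, the following minimal sketch projects the 3D points of a LiDAR object proposal into the image and groups per-proposal feature vectors into pseudo-classes. It is not the authors' implementation: the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, the toy feature vectors, the choice of k-means as the clustering algorithm, and all function names are assumptions made purely for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def project_points(points: Sequence[Point3D], fx: float, fy: float,
                   cx: float, cy: float, width: int, height: int
                   ) -> List[Tuple[int, int]]:
    """Project camera-frame 3D points (x right, y down, z forward) to pixels.

    Assumes an ideal pinhole camera; points behind the camera or outside
    the image bounds are dropped.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:  # point lies behind the camera plane
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((int(u), int(v)))
    return pixels


def kmeans(features: Sequence[Sequence[float]], k: int,
           iters: int = 20, seed: int = 0) -> List[int]:
    """Plain k-means on per-proposal feature vectors (hypothetical stand-in
    for the clustering of proposals into pseudo-classes)."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(list(features), k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Toy proposal: two points in front of the camera, one behind it.
    proposal = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -1.0)]
    print(project_points(proposal, fx=100, fy=100, cx=50, cy=50,
                         width=100, height=100))

    # Toy features for four proposals forming two well-separated groups.
    feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    print(kmeans(feats, k=2))
```

In this sketch, the pixels covered by a projected proposal would inherit the cluster index of that proposal as a pseudo-label, which yields only partial annotation of the image, since pixels without a projected proposal remain unlabeled.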
These classes include objects such as pedestrians or traffic lights that are notoriously hard to segment automatically in the image domain. Third, we develop a novel cross-modal distillation approach that trains a teacher network with the available partial pseudo labels and then exploits its predictions to train the student with pixel-wise pseudo annotations covering the whole image. Additionally, our approach exploits geometric constraints extracted from the LiDAR point cloud during the teacher-student learning process to refine teacher predictions distilled into the student network. Implemented with transformer-based networks, this cross-modal distillation approach results in a trained student model that performs well in various challenging conditions, such as day, night, fog, or rain, outside the domain of the original training dataset, as shown in Fig. 1.

We train the proposed vanilla unsupervised semantic segmentation method (Vobecky et al., 2022) on two datasets, Waymo Open (Sun et al., 2020) and nuScenes (Caesar et al., 2020), and test it on four different datasets in the autonomous driving domain, Cityscapes (Cordts et al., 2016), Dark Zurich (Sakaridis (...truncated)