Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation
International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02320-3
Antonin Vobecky1,2,3 · David Hurych2 · Oriane Siméoni2 · Spyros Gidaris2 · Andrei Bursuc2 · Patrick Pérez4 · Josef Sivic1
Received: 29 September 2023 / Accepted: 29 November 2024
© The Author(s) 2025
Abstract
Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone
to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose
a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras
and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a
novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and
image data. A crucial element of our method is the integration of an object proposal module that examines the LiDAR
point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals
can be aligned with corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we
introduce a cross-modal distillation technique that utilizes image data partially annotated with the learnt pseudo-classes to
train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements
of our approach by extending the proposed model using a teacher-student distillation with an exponential moving average and
incorporating soft targets from the teacher. We show the generalization capabilities of our method by evaluating it on four different
datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. We present an in-depth
experimental analysis of the proposed model, including results when using another pre-training dataset, per-class and pixel
accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and LiDAR
density, supervised fine-tuning, as well as additional qualitative results and their analysis.
Keywords Autonomous driving · Unsupervised semantic segmentation · Multimodal learning
Communicated by Dengxin Dai.
Corresponding author: Antonin Vobecky
1 Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic
2 valeo.ai, Paris, France
3 Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic
4 Kyutai, Paris, France
1 Introduction
In this work, we investigate whether it is possible to learn
pixel-wise semantic image segmentation of urban scenes
without the need for any manual annotation, just from the
raw non-curated data collected by cars equipped with cameras and LiDAR sensors while driving in town. This topic
is essential, as current methods require large amounts of
pixel-wise annotations over various driving conditions and
situations. Such manual segmentation of images at large
scale is very expensive, time-consuming, and prone to biases.
Currently, the best methods for unsupervised learning
of semantic segmentation assume that images contain centered objects (Van et al., 2021) rather than whole scenes or
use spatial self-supervision available in the image domain
(Cho et al., 2021). They do not leverage additional modalities, such as the LiDAR data, available for urban scenes
in autonomous driving setups. In this work, we develop
an approach for unsupervised semantic segmentation that
learns to segment complex scenes containing many objects,
including thin structures such as pedestrians or traffic lights,
without the need for any manual annotation. Instead, it leverages cross-modal information available in (aligned) LiDAR
point clouds and images; see Fig. 1. Exploiting point clouds
as a form of supervision is, however, not straightforward:
data from LiDAR and camera are rarely perfectly synchronized; moreover, point clouds are unstructured and of much
lower resolution compared to images; finally, extracting useful semantic information from LiDAR is still a very hard
problem. In this work, we overcome these issues and show
that extracting useful pixel-wise semantic supervision from
LiDAR data is possible.
Fig. 1 Proposed fully unsupervised approach. From uncurated images
and LiDAR data (left), our Drive&Segment approach learns a semantic
image segmentation model with no manual annotations. The resulting
model performs unsupervised semantic segmentation of new unseen
datasets (right) without any human labeling. It can segment complex
scenes with many objects, including thin structures such as people, bicycles, poles or traffic lights. The black color denotes the ignored areas
(Color figure online)
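To make the notion of aligned LiDAR point clouds and images more concrete, the following is a minimal sketch of how 3D points could be projected into an image. It assumes a standard pinhole camera model with a known intrinsic matrix and a known LiDAR-to-camera rigid transform; the function name `project_points` and its parameters are hypothetical and are not taken from the paper, which does not detail its projection procedure in this section.

```python
import numpy as np

def project_points(points_lidar, T_cam_from_lidar, K, image_size):
    """Project (N, 3) LiDAR points into the image (illustrative pinhole model).

    points_lidar:      (N, 3) points in the LiDAR frame.
    T_cam_from_lidar:  (4, 4) rigid transform, assumed known from calibration.
    K:                 (3, 3) camera intrinsic matrix, assumed known.
    image_size:        (width, height) in pixels.
    Returns pixel coordinates (N, 2), depths (N,), and a boolean keep mask.
    """
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous coords, (N, 4)
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]           # points in camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6                              # discard points behind the camera
    uvw = (K @ cam.T).T                                  # unnormalized pixel coords
    safe_depth = np.where(in_front, depth, 1.0)          # avoid division by zero
    uv = uvw[:, :2] / safe_depth[:, None]
    width, height = image_size
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    keep = in_front & inside
    return uv, depth, keep
```

Only the points flagged by `keep` land inside the image. Since the point cloud is much sparser than the pixel grid, such a projection labels only a subset of pixels, which is consistent with the partial supervision discussed in the text.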
The contributions of our work are threefold. First, we propose a novel method for unsupervised cross-modal learning
of semantic image segmentation by leveraging synchronized
LiDAR and image data. The critical ingredient is a module that analyzes the LiDAR point cloud to obtain proposals
for spatially consistent objects that can be clearly separated
from each other and the ground plane in the 3D scene. Second, we show that these 3D object proposals can be aligned
with input images and reliably clustered into semantically
meaningful pseudo-classes by using image features from a
network trained without supervision. We demonstrate that
this approach is robust to noise in point clouds and delivers,
without the need for any manual annotation, pseudo-classes
with pixel-wise segmentation for various objects present in
driving scenes. These classes include objects such as pedestrians or traffic lights that are notoriously hard to segment
automatically in the image domain. Third, we develop a
novel cross-modal distillation approach that trains a teacher
network with the available partial pseudo labels and then
exploits its predictions to train the student with pixel-wise
pseudo annotations covering the whole image. Additionally,
our approach exploits geometric constraints extracted from
the LiDAR point cloud during the teacher-student learning
process to refine teacher predictions distilled into the student
network. Implemented with transformer-based networks, this
cross-modal distillation approach results in a trained student
model that performs well in various challenging conditions,
such as day, night, fog, or rain, outside the domain of the
original training dataset, as shown in Fig. 1.
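The teacher-student idea described above can be illustrated with a small NumPy sketch. It shows three ingredients under stated assumptions: a cross-entropy loss restricted to pixels that carry a pseudo-label (for the teacher), a soft-target loss over all pixels (for the student), and an exponential-moving-average weight update as mentioned in the abstract. The `IGNORE` marker, the function names, the momentum value, and the exact loss forms are illustrative choices, not the authors' implementation, and the LiDAR-based refinement of teacher predictions is omitted.

```python
import numpy as np

IGNORE = -1  # hypothetical marker for pixels without a pseudo-label

def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partial_cross_entropy(logits, labels):
    """Mean cross-entropy over labeled pixels only.

    logits: (H, W, K) class scores; labels: (H, W) ints, IGNORE where unlabeled.
    """
    mask = labels != IGNORE
    if not mask.any():
        return 0.0
    probs = softmax(logits)[mask]                        # (N, K) labeled pixels
    picked = probs[np.arange(mask.sum()), labels[mask]]  # prob. of the pseudo-class
    return float(-np.log(picked + 1e-12).mean())

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy between teacher soft targets and student predictions,
    averaged over every pixel of the image."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.99):
    """Exponential moving average of student weights (momentum is an assumed value)."""
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

In this sketch the teacher sees supervision only where projected proposals provide pseudo-labels, while the student receives a target at every pixel via `softmax(teacher_logits)`, which mirrors the move from partial to dense pseudo annotations described in the paragraph.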
We train the proposed vanilla unsupervised semantic segmentation method (Vobecky et al., 2022) on two datasets,
Waymo Open (Sun et al., 2020) and nuScenes (Caesar et al.,
2020), and test it on four different datasets in the autonomous
driving domain, Cityscapes (Cordts et al., 2016), Dark Zurich
(Sakaridis (...truncated)