Temporal Consistency as Pretext Task in Unsupervised Domain Adaptation for Semantic Segmentation
Journal of Intelligent & Robotic Systems (2025) 111:37
https://doi.org/10.1007/s10846-025-02220-9
REGULAR PAPER
Felipe Barbosa¹ · Fernando Osório¹
Received: 28 April 2024 / Accepted: 30 December 2024 / Published online: 19 March 2025
© The Author(s) 2025
Abstract
Intelligent and autonomous robots and vehicles widely rely on computer vision systems for localization, navigation,
and obstacle avoidance. By integrating deep learning techniques such as Object Detection and Semantic
Segmentation, these systems achieve high accuracy in the domain they were trained on. Nonetheless, operating robustly in
different domains remains a major challenge for vision-based perception. Unsupervised Domain Adaptation
(UDA) has therefore gained momentum due to its importance to real-world applications. Specifically, it leverages the ready
availability of unlabeled data to design auxiliary sources of supervision that guide knowledge transfer between domains. The
advantages of such an approach are two-fold: it avoids exhaustive labeling processes and it enhances adaptation
performance. In this scenario, exploiting temporal correlations in unlabeled video data is an interesting alternative that
has not yet been explored to its full potential. In this work, we propose a self-supervised learning framework that employs
Temporal Consistency from unlabeled video sequences as a pretext task for improving UDA for Semantic Segmentation
(UDASS). A simple yet effective strategy, it has shown promising results in a real-to-real adaptation setting. Our results and
discussions are expected to benefit both new and experienced researchers on the subject.
Keywords Semantic segmentation · Unsupervised domain adaptation · Temporal consistency · Self-supervised learning ·
Review
1 Introduction
Intelligent and autonomous robots and vehicles should be able
to navigate in safe zones while avoiding obstacles and dangerous
zones. It is therefore essential for these systems to
recognize the road (the navigable zone) and the other semantic
elements present in the scene (e.g., cars, pedestrians, trees,
constructions, buildings, sidewalks, grass, and animals).
This makes Semantic Segmentation a task of
utmost importance for visual perception in urban environments. It provides a summarized representation of a given
scene, in which elements are classified pixel-wise according to
the set of categories under consideration.
Corresponding author: Felipe Barbosa. Co-author: Fernando Osório.
¹ Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil
The field has historically evolved towards increasingly
precise models, reaching values of up to 90% in Intersection
over Union (IoU), the standard metric. Nonetheless,
these highly specialized models often struggle to
adapt to real-world scenarios, where the target data usually presents the so-called domain shift. This phenomenon
is often caused by differences in appearance (illumination,
textures, and so on) between the source domain the model
was trained on and the target/application domain.
In this context, transfer-learning and fine-tuning techniques, which usually rely on some sort
of labels in the target domain, could be useful. However,
the labeling process demands high human effort. This is even
more critical for Semantic Segmentation tasks, which require
dense labels: the so-called “curse of data labeling” [1]. Ultimately,
it is impractical to obtain labeled data for all possible target
domains.
For this reason, Unsupervised Domain Adaptation for
Semantic Segmentation (UDASS) methods emerge as a
promising research direction, seeking to leverage
readily available unlabeled data for domain adaptation.
Its practical relevance explains the increasing number of publications devoted to the subject.
Video streams, in turn, are a rich source of large
amounts of unlabeled data. Even so, temporal correlations among frames have rarely been explored in UDASS,
which leaves much room for improvement.
In light of that, we propose to explore Temporal Consistency in videos as a source of additional supervision to guide
UDASS. On the one hand, it is simple to implement, since it
does not require modifications to the base model’s structure.
On the other hand, precision and temporal stability can be
simultaneously encouraged in the target domain. Specifically,
we target a cross-city real-to-real adaptation scenario, where
such an approach has not yet been explored.
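As a rough illustration of what temporal stability means at the pixel level, the sketch below measures the fraction of pixels whose predicted label changes between two consecutive frames. It is only a toy measure under strong assumptions: the two label maps are taken to be already spatially aligned (e.g., a static camera or prior motion compensation), and the function name `temporal_inconsistency` and the example maps are hypothetical. It is not the formulation used in this work, which is presented in Section 4.

```python
def temporal_inconsistency(pred_prev, pred_curr):
    """Fraction of pixels whose predicted class differs between two frames.

    Both arguments are 2-D lists of integer class labels with the same
    shape. ASSUMPTION: the frames are already aligned pixel-to-pixel.
    """
    total = 0
    changed = 0
    for row_prev, row_curr in zip(pred_prev, pred_curr):
        for label_prev, label_curr in zip(row_prev, row_curr):
            total += 1
            if label_prev != label_curr:
                changed += 1
    # An empty input is treated as perfectly consistent.
    return changed / total if total else 0.0


# Hypothetical 2x2 predictions for frames t-1 and t (0 = road, 1 = car).
frame_prev = [[0, 0],
              [1, 1]]
frame_curr = [[0, 1],
              [1, 1]]
score = temporal_inconsistency(frame_prev, frame_curr)
```

A lower score indicates more stable predictions over time. A quantity of this kind can be computed on unlabeled target video, which is what makes temporal information attractive as an auxiliary signal.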
First, in Section 2, we conceptualize Domain Shift and
(Unsupervised) Domain Adaptation. Section 3 compiles
recent State-of-the-Art (SOTA) UDASS approaches that take
into account temporal information from unlabeled video data.
In Section 4, we present the proposed method. In Section 5,
we share our findings from a real-to-real adaptation experiment, validating the use of temporal data in UDASS.
Finally, we draw our main conclusions in Section 6.
2 Domain Shift and Domain Adaptation
The field of Deep Learning has seen major advances
in the last decade, mainly fueled by the introduction of
large annotated datasets [2–4]. In particular, Semantic Segmentation is a well-developed research field, with recent
contributions reaching up to 90% mean Intersection over
Union (mIoU) on datasets such as Cityscapes [5].
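For reference, the mIoU metric mentioned above can be sketched in a few lines. For each class, IoU is the number of pixels where prediction and ground truth both carry that class, divided by the number of pixels where either one does; mIoU averages this over the classes that occur. The function name `mean_iou` and the toy label maps are illustrative assumptions, and the sketch ignores details of real benchmarks such as ignore labels and accumulation over a whole dataset.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two 2-D integer label maps.

    ASSUMPTION: every label lies in [0, num_classes); no ignore label.
    Classes absent from both maps are skipped in the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            if p == t:
                # Pixel counts once for both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel belongs to the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 2x2 example with two classes.
prediction = [[0, 0],
              [1, 1]]
ground_truth = [[0, 1],
                [1, 1]]
miou = mean_iou(prediction, ground_truth, num_classes=2)
```

In this toy case, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.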
However, labeling such real-scene datasets
is labor-intensive: for example, the Cityscapes annotation
took around 90 minutes per image. As an alternative,
a recent trend is to leverage synthetic data for
model training. The main advantages of this approach are
the possibility of simulating diverse scenarios, weather and
illumination conditions, as well as sensor readings, all of that
together with the associated labels.
Nonetheless, when these models (trained
on either real or synthetic data) are deployed in real-world applications,
their performance is likely to degrade to some extent
(Fig. 1). This can be caused by the so-called Domain Shift:
differences between the source and target domains, such as
illumination, textures, types of elements in the scenes, and
so on. To deal with that, Domain Adaptation techniques try
to transfer the knowledge from a given source domain to the
target domain at hand.
To make matters worse, the adaptation process is
not straightforward, since real-world target datasets usually
lack annotations.
As a workaround, Unsupervised Domain Adaptation
(UDA) was proposed to leverage the large availability of
unlabeled data to boost the adaptation process without the
need for labels.
According to the nature of source and target datasets, we
can broadly define two categories of Domain Adaptation:
synthetic-t (...truncated)