Temporal Consistency as Pretext Task in Unsupervised Domain Adaptation for Semantic Segmentation

Journal of Intelligent & Robotic Systems, Mar 2025

Intelligent and autonomous robots (and vehicles) largely adopt computer vision systems to help in localization, navigation and obstacle avoidance tasks. By integrating different deep learning techniques, such as Object Detection and Image Semantic Segmentation, these systems achieve high accuracy in the domain they were trained on. Nonetheless, robustly operating in different domains still poses a major challenge to vision-based perception. In this sense, Unsupervised Domain Adaptation (UDA) has recently gained momentum due to its importance to real-world applications. Specifically, it leverages the prompt availability of unlabeled data to design auxiliary sources of supervision to guide the knowledge transfer between domains. The advantages of such an approach are two-fold: avoiding going through exhaustive labeling processes, and enhancing adaptation performance. In this scenario, exploring temporal correlations in unlabeled video data stands as an interesting alternative, which has not yet been explored to its full potential. In this work, we propose a Self-supervised learning framework that employs Temporal Consistency from unlabeled video sequences as a pretext task for improving UDA for Semantic Segmentation (UDASS). A simple yet effective strategy, it has shown promising results in a real-to-real adaptation setting. Our results and discussions are expected to benefit both new and experienced researchers on the subject.

Article PDF cannot be displayed. You can download it here:

https://link.springer.com/content/pdf/10.1007/s10846-025-02220-9.pdf

Temporal Consistency as Pretext Task in Unsupervised Domain Adaptation for Semantic Segmentation

Journal of Intelligent & Robotic Systems (2025) 111:37 https://doi.org/10.1007/s10846-025-02220-9 REGULAR PAPER Temporal Consistency as Pretext Task in Unsupervised Domain Adaptation for Semantic Segmentation Felipe Barbosa1 · Fernando Osório1 Received: 28 April 2024 / Accepted: 30 December 2024 / Published online: 19 March 2025 © The Author(s) 2025 Abstract Intelligent and autonomous robots (and vehicles) largely adopt computer vision systems to help in localization, navigation and obstacle avoidance tasks. By integrating different deep learning techniques, such as Object Detection and Image Semantic Segmentation, these systems achieve high accuracy in the domain they were trained on. Nonetheless, robustly operating in different domains still poses a major challenge to vision-based perception. In this sense, Unsupervised Domain Adaptation (UDA) has recently gained momentum due to its importance to real-world applications. Specifically, it leverages the prompt availability of unlabeled data to design auxiliary sources of supervision to guide the knowledge transfer between domains. The advantages of such an approach are two-fold: avoiding going through exhaustive labeling processes, and enhancing adaptation performance. In this scenario, exploring temporal correlations in unlabeled video data stands as an interesting alternative, which has not yet been explored to its full potential. In this work, we propose a Self-supervised learning framework that employs Temporal Consistency from unlabeled video sequences as a pretext task for improving UDA for Semantic Segmentation (UDASS). A simple yet effective strategy, it has shown promising results in a real-to-real adaptation setting. Our results and discussions are expected to benefit both new and experienced researchers on the subject. Keywords Semantic segmentation · Unsupervised domain adaptation · Temporal consistency · Self-supervised learning · Review 1 Introduction Intelligent and Autonomous Robots/Vehicles should be able to navigate in safe zones and avoid obstacles and dangerous zones. Therefore, it is very important for these systems to recognize the road (navigable zone), and the other elements present in the scene—“semantic elements” (e.g.: road, cars, pedestrians, trees, constructions, buildings, sidewalk, grass, animals, etc). Therefore, Semantic Segmentation is a task of utmost importance for visual perception in urban environments. It provides a summarized representation of a given scene, where elements are classified pixel-wise according to the set of categories under consideration. B Felipe Barbosa Fernando Osório 1 Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil The field has historically evolved towards increasingly precise models, reaching Intersection over Union (IoU) values—the standard metric—of up to 90%. Nonetheless, these highly specialized models are prone to suffer with adapting to real-world scenarios, where the target data usually presents the so-called domain shift. This phenomenon is often caused by differences in appearance—illumination, textures, and so on—between the source domain the model was trained on and the target/application domain. In this context, transfer-learning and fine-tuning techniques, usually associated with the presence of some sort of labels in the target domain, could be useful. However, the labeling process involves high human effort. This is even more critical for Semantic Segmentation tasks, which require dense labels—the “the curse of data labeling” [1]. Ultimately, it is impractical to obtain labeled data for all possible target domains. In this sense, Unsupervised Domain Adaptation for Semantic Segmentation (UDASS) methods emerge as a promising new research direction, in the search for leveraging the promptly-available unlabeled data in domain adaptation. 0123456789().: V,-vol 123 37 Page 2 of 15 Its practical relevance explains the increasing number of publications devoted to the subject. Aligned with that, video streams are a great source of large amounts of unlabeled data. Despite that, temporal correlations among frames have rarely been explored in UDASS, thus leaving much room for improvements. In light of that, we propose to explore Temporal Consistency in videos as a source of additional supervision to guide UDASS. On the one hand, it is simple to implement, since it does not require modifications to the base model’s structure. On the other hand, precision and temporal stability can be simultaneously motivated in the target domain. Specifically, we aim at a cross-city real-to-real adaptation scenario, where such an approach has not yet been explored. First, in Section 2, we conceptualize Domain Shift and (Unsupervised) Domain Adaptation. Section 3 compiles recent State-of-the-Art (SOTA) UDASS approaches that take into account temporal information from unlabeled video data. In Section 4, we present the proposed method. In Section 5 we share our findings from a real-to-real adaptation experiment, validating the employment of temporal data in UDASS. Finally, we draw our main conclusions in Section 6. 2 Domain Shift and Domain Adaptation The field of Deep Learning has experienced large advances in the last decade, mainly fueled by the proposition of large annotated datasets [2–4]. Particularly, Semantic Segmentation is a well-developed research field, with recent contributions reaching up to 90% mean Intersection over Union (mIoU) in datasets such as Cityscapes [5]. However, the labeling process of such real-scenes datasets is labor-intensive: for example, the Cityscapes annotation took around 90 minutes per image. As an alternative to this scenario, a recent trend is to leverage synthetic data for model training. The main advantages of this approach are Journal of Intelligent & Robotic Systems (2025) 111:37 the possibility of simulating diverse scenarios, weather and illumination conditions, as well as sensor readings, all of that together with the associated labels. Nonetheless, when trying to employ these models (trained on either real or synthetic data) in real-world applications, we will likely face a certain amount of performance degradation (Fig. 1). This can be caused by the so-called Domain Shift: differences between the source and target domains, such as illumination, textures, types of elements in the scenes, and so on. To deal with that, Domain Adaptation techniques try to transfer the knowledge from a given source domain to the target domain at hand. To make the problem even worse, the adaptation process is not straightforward, since real-world target datasets usually lack annotations. As a workaround, Unsupervised Domain Adaptation (UDA) was proposed to leverage the large availability of unlabeled data to boost the adaptation process without the need for labels. According to the nature of source and target datasets, we can broadly define two categories of Domain Adaptation: synthetic-t (...truncated)


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007/s10846-025-02220-9.pdf
Article home page: https://link.springer.com/article/10.1007/s10846-025-02220-9

Barbosa, Felipe, Osório, Fernando. Temporal Consistency as Pretext Task in Unsupervised Domain Adaptation for Semantic Segmentation, Journal of Intelligent & Robotic Systems, 2025, pp. 1-15, Volume 111, Issue 1, DOI: 10.1007/s10846-025-02220-9