JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

SN Computer Science (2024) 5:187
https://doi.org/10.1007/s42979-023-02499-1

ORIGINAL RESEARCH

Shishir Muralidhara¹ · Sravan Kumar Jagadeesh¹ · René Schuster¹,² · Didier Stricker¹,²

Received: 3 June 2023 / Accepted: 17 November 2023
© The Author(s) 2024

Abstract

Part-aware panoptic segmentation is a computer vision problem that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our joint panoptic part fusion (JPPF), which combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this. First, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning. Second, the combination must be balanced so that all individual results receive equal importance during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes panoptic parts (CPP) and Pascal panoptic parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, show that its impact is most significant for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design on five additional datasets without fine-tuning.

Keywords: Semantic · Panoptic-part · Segmentation · Fusion

Introduction

Humans are able to perceive various levels of detail and abstraction of a scene. We can not only understand different semantic categories such as bus, car, and sky, but we can also distinguish between individual entities (instances) and their components (parts), such as windows or wheels.
In computer vision, the estimation of these parallel layers of abstraction has recently been introduced as panoptic-part segmentation [10]. Yet, there exists no completely unified and joint approach for this problem.

This article is part of the topical collection “Recent Trends on Pattern Recognition Applications and Methods” guest edited by Ana Fred, Maria De Marsico and Gabriella Sanniti di Baja.
* René Schuster
1 Augmented Vision, German Research Center for Artificial Intelligence-DFKI, Trippstadter Straße 122, 67663 Kaiserslautern, Germany
2 Augmented Vision, University of Kaiserslautern-Landau, Gottlieb-Daimler-Straße 47, 67663 Kaiserslautern, Germany

According to [6], the two pieces that make up a scene are stuff and things. Things are countable objects such as persons, cars, or buses, whereas stuff, like the sky or road, is usually amorphous and uncountable. Those two categories are identified in the well-studied tasks of semantic segmentation and instance segmentation. However, neither task alone can describe the entire scene. To fill this gap, panoptic segmentation [21] was presented, which recognizes and segments both stuff and things. Since then, several approaches for panoptic segmentation have been proposed [5, 20, 27, 42, 46, 57].

Part segmentation, or part parsing, on the other hand, seeks to semantically analyze the image at the part level. There has been some effort in this area, where part segmentation is often treated as a semantic segmentation problem [12, 18, 19, 25, 35, 38]. A few methods are instance-aware [11, 25, 63] and even fewer handle multi-class part objects [41, 64]. With the release of datasets for panoptic-part segmentation [10, 40], the first methods for this problem have been proposed [17, 28, 29].
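To make the three levels of granularity concrete, the following toy sketch packs a semantic map, an instance map, and a part map into a single label map. It only illustrates what a panoptic-part label contains and is not the fusion method of this paper. The class ids, the part ids, the function name `fuse_labels`, and the integer encoding are all assumptions made for this example.

```python
import numpy as np

# Hypothetical ids for this toy example; they are not taken from any dataset.
SKY, ROAD, CAR = 0, 1, 2
NO_PART, WINDOW, WHEEL = 0, 1, 2
THING_CLASSES = [CAR]  # countable classes; all other classes are stuff


def fuse_labels(semantic, instance, part):
    """Pack three per-pixel maps into one panoptic-part label map.

    Assumed encoding (illustrative only):
        label = class_id * 100_000 + instance_id * 100 + part_id
    Instance and part ids are zeroed on stuff pixels, because stuff is
    neither countable nor divided into parts in this toy setup.
    """
    is_thing = np.isin(semantic, THING_CLASSES)
    instance_id = np.where(is_thing, instance, 0)
    part_id = np.where(is_thing, part, NO_PART)
    return semantic * 100_000 + instance_id * 100 + part_id


# A 3x4 image: a row of sky, a row with two cars, a row of road.
semantic = np.array([[SKY, SKY, SKY, SKY],
                     [CAR, CAR, CAR, CAR],
                     [ROAD, ROAD, ROAD, ROAD]])
# The stray instance id 3 on a road pixel is discarded by fuse_labels.
instance = np.array([[0, 0, 0, 0],
                     [1, 1, 2, 2],
                     [0, 0, 0, 3]])
part = np.array([[NO_PART, NO_PART, NO_PART, NO_PART],
                 [WINDOW, WHEEL, WINDOW, WHEEL],
                 [NO_PART, NO_PART, NO_PART, NO_PART]])

labels = fuse_labels(semantic, instance, part)
print(labels)
```

Each car pixel now carries its class, its instance, and its part in one integer, while sky and road pixels carry only a class. Integer division by 100_000 recovers the semantic map from the fused labels.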
In [10], a baseline approach is presented in which two networks for panoptic and part segmentation are used. These two networks are trained independently, and their results are combined using a uni-directional (top-down) merging strategy. This technique of independent training has significant drawbacks. Using two different networks incurs a computational overhead. Because the networks are separate, their predictions lack consistency, which makes the merging process ineffective. Independent training also leads to redundant learning, since the segmentation heads could otherwise share semantic information.

Afterwards, Panoptic-PartFormer (PPF) [28] was proposed, in which the authors present a unified, combined transformer for things, stuff, and parts that iteratively refines the individual segmentations to achieve consistency. In this design, redundancies are avoided and similarities between tasks are exploited, but we argue that an explicit modeling of multi-task fusion can produce more accurate results.

To this end, and to overcome the limitations of the top-down merging, we have presented a joint panoptic-part fusion (JPPF) for panoptic-part segmentation in [17], in which each sub-task is treated equally to allow for mutual benefits and maximal consistency (cf. Fig. 1). By sharing a backbone for all three tasks, the joint fusion outperforms the top-down baseline while also being more efficient. In this work, we re-present our JPPF [17] and extend the experiments, validation, and discussion. In short,

• we present a single neural network that uses a shared encoder to perform semantic, instance, and part segmentation and fuses them efficiently to produce panoptic-part segmentation.
• we propose a parameter-free joint panoptic-part fusion (JPPF) module that dynamically considers the logits from the semantic, instance, and part head and consistently integrates the three predictions.
• we conduct a thorough analysis of our approach and demonstrate the efficacy, accuracy, and consistency of the joint fusion strategy.
• we obtain state-of-the-art results for panoptic-part segmentation on various datasets and metrics, surpassing our previous work [17], the top-down baseline [10], and the transformer-based competitor PPF [28].
• we demonstrate that our approach generalizes to many other datasets without fine-tuning.

Fig. 1 Our joint panoptic-part fusion (JPPF) combines individual predictions into a consistent panoptic-part segmentation. (Diagram labels: Semantic Segmentation, Instance Segmentation, Part Segmentation, Joint Panoptic-Part Fusion, Panoptic-Part Segmentation.)

Related Work

Towards Panoptic-Part Segmentation

Part-aware panoptic segmentation [10] is a recently introduced problem that brings semantic, instance, and part segmentation together. Several methods have been proposed for these individual tasks, including panoptic segmentation, which is a blend of semantic and instance segmentation.

Semantic segmentation: PSPnet [62] introduced the pyramid pooling module, which focuses on the importance of mult (...truncated)