JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation
SN Computer Science
(2024) 5:187
https://doi.org/10.1007/s42979-023-02499-1
ORIGINAL RESEARCH
Shishir Muralidhara1 · Sravan Kumar Jagadeesh1 · René Schuster1,2
· Didier Stricker1,2
Received: 3 June 2023 / Accepted: 17 November 2023
© The Author(s) 2024
Abstract
Part-aware panoptic segmentation is a problem of computer vision that aims to provide a semantic understanding of the scene
at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our joint panoptic part fusion (JPPF) that combines the three individual segmentations
effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this. First, a unified model for
the three problems is desired that allows for mutually improved and consistent representation learning. Second, the combination
must be balanced so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes panoptic parts (CPP) and
Pascal panoptic parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify
the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and
demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets.
Keywords Semantic · Panoptic-part · Segmentation · Fusion
This article is part of the topical collection “Recent Trends on
Pattern Recognition Applications and Methods” guest edited by Ana
Fred, Maria De Marsico and Gabriella Sanniti di Baja.
* René Schuster
Shishir Muralidhara
Sravan Kumar Jagadeesh
Didier Stricker
1 Augmented Vision, German Research Center
for Artificial Intelligence-DFKI, Trippstadter Straße 122,
67663 Kaiserslautern, Germany
2 Augmented Vision, University of Kaiserslautern-Landau,
Gottlieb-Daimler-Straße 47, 67663 Kaiserslautern, Germany
Introduction
Humans are able to perceive various levels of detail and
abstraction of a scene. We can not only understand different
semantic categories such as bus, car, and sky, but we can
also distinguish between individual entities (instances) and
their components (parts), such as windows or wheels. In
computer vision, the estimation of these parallel layers of
abstraction has recently been introduced as panoptic-part
segmentation [10]. Yet, there exists no completely unified
and joint approach for this problem.
According to [6], the two pieces that make up a scene
are stuff and things. Things are countable objects such as
persons, cars, or buses, whereas stuff, like the sky or road,
is usually amorphous and uncountable. These two categories are addressed by the well-studied tasks of semantic segmentation and instance segmentation. However, neither task alone
can describe the entire scene. To fill
this gap, panoptic segmentation [21] was introduced, which
recognizes and segments both stuff and things. Since then,
several approaches for panoptic segmentation have been proposed [5, 20, 27, 42, 46, 57].
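To make the distinction between stuff and things concrete, the following minimal sketch combines a per-pixel class map and a per-pixel instance map into one panoptic map. The class ids, the `THING_CLASSES` set, and the `class_id * 1000 + instance_id` encoding are assumptions chosen for illustration; they are not taken from this paper.

```python
import numpy as np

# Hypothetical label convention for this sketch (not taken from the paper):
# class ids 0 and 1 are stuff (road, sky), ids 2 and 3 are things (car, person).
THING_CLASSES = {2, 3}
LABEL_DIVISOR = 1000  # assumed encoding: panoptic_id = class_id * 1000 + instance_id


def to_panoptic(semantic: np.ndarray, instance: np.ndarray) -> np.ndarray:
    """Combine a per-pixel class map and a per-pixel instance-id map.

    Stuff pixels are forced to instance id 0, so all pixels of one stuff
    class form a single segment; thing pixels get one segment per instance.
    """
    is_thing = np.isin(semantic, list(THING_CLASSES))
    instance_id = np.where(is_thing, instance, 0)
    return semantic.astype(np.int64) * LABEL_DIVISOR + instance_id


# Toy 3x4 image: a row of sky, a row with two cars, a row of road.
semantic = np.array([[1, 1, 1, 1],
                     [2, 2, 2, 2],
                     [0, 0, 0, 0]])
instance = np.array([[0, 0, 0, 0],
                     [1, 1, 2, 2],
                     [0, 0, 0, 0]])
panoptic = to_panoptic(semantic, instance)
segment_ids = np.unique(panoptic)  # one id per stuff class and per thing instance
```

In this toy image the two cars share a semantic class but receive different panoptic ids, while sky and road each remain a single segment.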
Part segmentation, or part parsing, on the other hand,
seeks to semantically analyze the image at the part level.
There has been some effort in this area, where part segmentation is often treated as a semantic segmentation problem
[12, 18, 19, 25, 35, 38]. A few methods are instance-aware
[11, 25, 63] and even fewer handle multi-class part objects
[41, 64].
With the release of datasets for panoptic-part segmentation [10, 40], the first methods for this problem have
been proposed [17, 28, 29]. In [10], a baseline approach
is presented in which two networks for panoptic and part
segmentation are used. These two networks are trained
independently and the results of both are combined using
a uni-directional (top-down) merging strategy. Independent training has significant drawbacks.
First, running two separate networks adds computational overhead. Second, because the networks are trained separately, their predictions are not guaranteed to be consistent,
which makes the merging less effective. Third, independent training leads to redundant learning,
since the two networks could otherwise share semantic information
between segmentation heads.
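The idea of a uni-directional merge can be sketched as follows. The exact rules of the baseline in [10] are not given in the text above, so this is only an illustration under stated assumptions: the label spaces, the `VALID_PARTS` table, and the `VOID_PART` marker are all hypothetical. The point is that the panoptic result is taken as fixed, and part predictions are only filled in where they agree with it.

```python
import numpy as np

# Hypothetical label spaces for this sketch.
# Scene classes: 0 = road (stuff, no parts), 1 = car (thing with parts).
# Part classes: 0 = no part, 1 = window, 2 = wheel, 3 = head (a person part).
VALID_PARTS = {1: {1, 2}}  # scene class -> part classes it may contain
VOID_PART = 255            # assumed marker for "no consistent part label"


def top_down_merge(scene_class, instance_id, part_class):
    """Uni-directional merge: the panoptic labels are never changed, and a
    part label is kept only where it is compatible with the scene class."""
    merged_part = np.zeros_like(part_class)
    for cls, allowed in VALID_PARTS.items():
        inside = scene_class == cls
        compatible = np.isin(part_class, list(allowed))
        merged_part[inside & compatible] = part_class[inside & compatible]
        merged_part[inside & ~compatible] = VOID_PART
    # Last axis holds (scene class, instance id, part class) per pixel.
    return np.stack([scene_class, instance_id, merged_part], axis=-1)


scene_class = np.array([[0, 0, 0],
                        [1, 1, 1]])
instance_id = np.array([[0, 0, 0],
                        [1, 1, 1]])
# The independently trained part network predicts a wheel on the road
# and a head inside the car: both conflict with the panoptic result.
part_class = np.array([[2, 0, 0],
                       [1, 3, 2]])
merged = top_down_merge(scene_class, instance_id, part_class)
```

In this sketch, conflicting part predictions can only be discarded or marked void; they never correct the panoptic prediction. This one-way flow of information is the limitation that a joint fusion aims to avoid.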
Afterwards, Panoptic-PartFormer (PPF) [28] was
proposed, in which the authors present a unified, combined transformer for things, stuff, and parts that iteratively
refines the individual segmentations to achieve consistency. In this design, redundancies are avoided and similarities between tasks are exploited, but we argue that an
explicit modeling of multi-task fusion can produce more
accurate results.
To this end, and to overcome the limitations of the top-down merging, we presented a joint panoptic-part
fusion (JPPF) for panoptic-part segmentation in [17], in
which each sub-task is treated equally to allow for mutual
benefits and maximal consistency (cf. Fig. 1). By sharing
a backbone for all three tasks, the joint fusion outperforms the top-down baseline while being more efficient
at the same time.
In this work, we re-present our JPPF [17] and extend the
experiments, validation, and discussion. In short,
• we present a single neural network that uses a shared
encoder to perform semantic, instance, and part segmentation and fuses them efficiently to produce panoptic-part
segmentation.
• we propose a parameter-free joint panoptic-part fusion
(JPPF) module that dynamically considers the logits from
the semantic, instance, and part head and consistently
integrates the three predictions.
• we conduct a thorough analysis of our approach and demonstrate the efficacy, accuracy, and consistency of the
joint fusion strategy.
• we obtain state-of-the-art results for panoptic-part segmentation on various datasets and metrics, surpassing
our previous work [17], the top-down baseline [10], and
the transformer-based competitor PPF [28].
• we demonstrate that our approach generalizes to many
other datasets without fine-tuning.
Fig. 1 Our joint panoptic-part fusion (JPPF) combines individual predictions into a consistent panoptic-part segmentation. Diagram labels: Semantic Segmentation, Instance Segmentation, and Part Segmentation feed the Joint Panoptic-Part Fusion, which outputs the Panoptic-Part Segmentation
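What "parameter-free" and "dynamically balanced" can mean for a fusion of three sets of logits is illustrated by the sketch below. The fusion rule used here, (sum of sigmoids) times (sum of logits), is an assumption for illustration only: it extends the two-input rule known from the panoptic fusion of EfficientPS to three inputs, and the text above does not state the formula actually used by JPPF. The function name `fuse_logits` and the assumption that all three heads are already aligned to the same channels are likewise hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_logits(semantic, instance, part):
    """Assumed fusion rule, elementwise:
        (sigmoid(s) + sigmoid(i) + sigmoid(p)) * (s + i + p)
    All inputs are logits of shape (channels, height, width) that are
    assumed to be aligned to the same channels already. The rule has no
    learnable weights and is symmetric in its three inputs.
    """
    confidence = sigmoid(semantic) + sigmoid(instance) + sigmoid(part)
    return confidence * (semantic + instance + part)


# Toy logits: 2 channels, 1x2 pixels.
sem = np.array([[[2.0, -1.0]], [[-2.0, 1.0]]])
ins = np.array([[[1.5, 0.5]], [[-1.0, -0.5]]])
prt = np.array([[[1.0, -0.5]], [[-1.5, 2.0]]])

fused = fuse_logits(sem, ins, prt)
labels = fused.argmax(axis=0)  # per-pixel winning channel after fusion
```

Because the rule is symmetric, swapping the inputs does not change the result, so no head is favored by construction. The sum of sigmoids acts as a per-pixel confidence weight, so heads that agree reinforce each other, while disagreeing heads partially cancel in the sum of logits.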
Related Work
Towards Panoptic‑Part Segmentation
Part-aware panoptic segmentation [10] is a recently introduced problem that brings semantic, instance, and part
segmentation together. Several methods have been
proposed for these individual tasks, as well as for panoptic
segmentation, which combines semantic and instance
segmentation.
Semantic segmentation. PSPNet [62] introduced the pyramid pooling module, which focuses on the importance of
mult (...truncated)