Human-like scene graph generation and evaluation
Multimedia Tools and Applications
(2026) 85:552
https://doi.org/10.1007/s11042-026-21166-0
Victor Milewski¹ · Marie-Francine Moens¹ · Maria Mihaela Trusca²
Received: 18 January 2025 / Revised: 15 September 2025 / Accepted: 15 December 2025
© The Author(s) 2026
Abstract
Current methods that generate a scene graph of a given image create an over-dense graph
regarding the number of objects and relationships found in the image, making the graph
less effective in describing what is relevant in the image and, consequently, in downstream
tasks such as cross-modal retrieval and mining. In this work, we propose a novel method
that generates a scene graph that reflects what humans find important when describing an
image. During training, our scene graph generation method is guided by human-drafted captions
describing the images, which we assume will focus on essential scene elements. This
guidance is realized by properly designed loss functions. During inference, scene graphs
are generated solely from images. We evaluate the resulting scene graphs by comparing
them with the ground-truth scene graphs of Visual Genome that are created by humans.
Evaluation is done with recall- and precision-oriented metrics and graph edit distances. In
the first set of experiments, we benchmark existing scene graph generation models, then
we add the newly proposed loss functions leading to improved performance, especially in
terms of the graph edit distance. Extra experiments show that the correct recognition of
unimportant background objects and their relationships is crucial when generating human-like scene graphs. The codebase is released on GitHub: https://github.com/VSJMilewski/relevance_graphs
Keywords Scene graphs · Weak supervision · Visual object and relationship detection
¹ Department of Computer Science, KU Leuven, Celestijnenlaan 200A, Leuven 3001, Belgium
² Faculty of Arts, KU Leuven, Blijde Inkomststraat 21, Leuven 3000, Belgium
1 Introduction
Scene graphs are a popular way to describe images by explicitly annotating the present
objects and their semantic relationships. A scene graph is a structured representation that
encodes the object labels as vertices and their semantic relationships as edges.¹ Scene
graphs have value for various tasks (see the survey by [3] for an overview). For example,
in Visual Question Answering (VQA), a model generates an answer solely based on the
scene graph of an image (i.e., without the image itself) [46]; in cross-modal retrieval tasks,
the graph can aid in querying for precise information [14, 42]; and in image captioning, the
caption can better describe relationships and interactions between objects when relying on
a high-quality scene graph of the image [25, 28]. Moreover, scene graphs can guide video
captioning [13, 26] as well as image and video generation [5, 6].
Current scene graph generation methods emphasize obtaining a high recall of objects and
relationships, which results in over-dense graphs. Such graphs are inadequate for human
inspection and less likely to focus on information humans deem relevant in the image. When
using such over-dense graphs in downstream tasks, the added noise from the many irrelevant relationships harms the performance by making it harder to make correct predictions,
for instance, in image captioning [28]. The necessity of differentiating between relevant
and irrelevant relationships is illustrated in Fig. 1. While the numerous objects detected in
Fig. 1a can generate cluttered scene graphs, guiding scene graph generation to select only
the relevant objects and relationships of an image results in non-dense graphs that resemble
human perception of an image (Fig. 1b.). We define dense graphs as graphs that include all
object-to-object relationships in an image, without filtering out redundant, weak, or noisy
edges that do not add meaningful information. In contrast, non-dense graphs align with
how humans perceive an image [43], containing only the relevant connections between its
objects.
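The distinction between dense and non-dense graphs can be made concrete with a minimal Python sketch. It is an illustration only: the object labels, the relevance scores, and the `threshold` parameter are hypothetical, and the method proposed in this paper learns relevance through caption-guided loss functions rather than by thresholding fixed scores.

```python
from itertools import permutations

# Toy detections loosely based on the moped image of Fig. 1 (illustrative labels).
objects = ["moped", "tree", "sidewalk", "car"]

# Hypothetical relevance scores for a few (subject, predicate, object) triples.
# These numbers are invented for the example, not produced by any model.
scored_triples = {
    ("moped", "parked on", "sidewalk"): 0.9,
    ("moped", "near", "tree"): 0.8,
    ("moped", "in front of", "car"): 0.7,
    ("tree", "near", "car"): 0.2,
}

def dense_pairs(labels):
    """All ordered subject-object pairs: the candidate edges of a dense graph."""
    return list(permutations(labels, 2))

def non_dense_graph(triples, threshold=0.5):
    """Keep only the triples whose relevance score reaches the threshold."""
    return {triple for triple, score in triples.items() if score >= threshold}

dense = dense_pairs(objects)
sparse = non_dense_graph(scored_triples)
print(len(dense), "candidate edges in the dense graph")
print(len(sparse), "edges kept in the non-dense graph")
```

With four objects, the dense graph already has twelve ordered pairs to label, whereas the filtered graph keeps only the three relationships that a caption such as the one in Fig. 1 would mention.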
In this work, we propose a novel method that generates a scene graph that reflects what
humans find important when describing an image. The scene graph can be considered a
surrogate summary of the image’s content. This comparison to summarization allows us to
draw inspiration from definitions made by [30]. Peyrard defines summarization using the
concepts of avoidance of redundancy, relevance, and informativeness, which are essential
in human-generated and machine-generated summaries. We hypothesize that when training
the scene graph generation (SGG) method guided by human-drafted captions of the images,
it will result in scene graphs that are relevant and informative. More specifically, we propose
several novel loss functions that are guided by natural language captions. The losses take
advantage of the alignment of objects and relationships in the image and their counterparts
in the caption. Moreover, we use the grammatical subjects of the captions to guide
the selection of central objects in the scene graph. Finally, we investigate the influence of
correctly identifying unimportant background objects in the image on the SGG process.
Fig. 1 Comparison between dense (a) and non-dense (b) scene graphs. The non-dense scene graph aligns
with the human image caption "A moped parked near a tree on a sidewalk near a street in front of some
cars and a building"
¹ Optionally, the attributes of objects are also encoded in the graph.
We intrinsically evaluate the obtained scene graphs by comparing them with the handcrafted graphs of Visual Genome (VG) [16], and specifically with the curated VG graphs
that contain the 150 most frequently occurring object classes and the 50 most frequently occurring relationship predicates. This standard VG subset, often referred to as VG150 in the literature,
was introduced by [48] and is commonly used for evaluating scene graph generation [11].
In this paper, we use the standard VG150 split for evaluation. While our experiments
focus on VG150, our method can be easily generalized to other datasets that provide, in
addition to images and scene graphs, high-quality image captions.
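The idea behind a frequency-restricted subset such as VG150 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the dictionary layout of a graph and the helper names `most_frequent_vocab` and `filter_graph` are hypothetical, and the exact preprocessing used to build the split of [48] is not reproduced here.

```python
from collections import Counter

def most_frequent_vocab(graphs, num_objects=150, num_predicates=50):
    """Return the most frequent object labels and predicates over all graphs."""
    object_counts = Counter()
    predicate_counts = Counter()
    for graph in graphs:
        object_counts.update(graph["objects"])
        predicate_counts.update(pred for _, pred, _ in graph["triples"])
    keep_objects = {label for label, _ in object_counts.most_common(num_objects)}
    keep_predicates = {pred for pred, _ in predicate_counts.most_common(num_predicates)}
    return keep_objects, keep_predicates

def filter_graph(graph, keep_objects, keep_predicates):
    """Drop objects and triples that fall outside the retained vocabulary."""
    objects = [o for o in graph["objects"] if o in keep_objects]
    triples = [
        (s, p, o)
        for s, p, o in graph["triples"]
        if s in keep_objects and o in keep_objects and p in keep_predicates
    ]
    return {"objects": objects, "triples": triples}

# Toy annotations (invented); the small limits below stand in for 150 and 50.
toy_graphs = [
    {"objects": ["man", "horse", "hat"],
     "triples": [("man", "riding", "horse"), ("man", "wearing", "hat")]},
    {"objects": ["man", "horse"], "triples": [("man", "riding", "horse")]},
    {"objects": ["man", "kite"], "triples": [("man", "holding", "kite")]},
]
keep_objects, keep_predicates = most_frequent_vocab(
    toy_graphs, num_objects=2, num_predicates=1
)
filtered = [filter_graph(g, keep_objects, keep_predicates) for g in toy_graphs]
```

A triple survives only if its subject, object, and predicate are all in the retained vocabulary, so rare classes remove both vertices and their incident edges.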
Regarding evaluation metrics, in addition to the commonly used recall-oriented measures, we also propose precision-oriented metrics and metrics that measure structural correspondences between the ground-truth and generated graphs utilizing graph-edit distances
[1]. Furthermore, we look into the model’s capabilities to distinguish between foreground
and background objects and relationships. The goal is to assess how well the generated
graphs correspond with human-created scene graphs to an extent not done before in the
literature.
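The following Python sketch illustrates why recall alone is insufficient for judging over-dense graphs. It is a simplification under stated assumptions: precision and recall are computed over unranked sets of exactly matching triples (not the ranked Recall@K variants common in the SGG literature), and `simple_edit_distance` is a crude stand-in that assumes unique node labels and counts only insertions and deletions; it is not the graph-edit-distance formulation of [1] used in this paper.

```python
def triple_precision_recall(predicted, ground_truth):
    """Precision and recall over exactly matching (subject, predicate, object) triples."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def simple_edit_distance(predicted, ground_truth):
    """Count node and edge insertions/deletions, assuming node labels are unique."""
    def nodes(triples):
        return {s for s, _, _ in triples} | {o for _, _, o in triples}

    predicted, ground_truth = set(predicted), set(ground_truth)
    node_edits = len(nodes(predicted) ^ nodes(ground_truth))
    edge_edits = len(predicted ^ ground_truth)
    return node_edits + edge_edits

# Toy graphs (invented): the prediction contains the ground truth plus two extra edges.
ground_truth = {
    ("moped", "parked on", "sidewalk"),
    ("moped", "near", "tree"),
}
predicted = ground_truth | {
    ("tree", "near", "car"),
    ("car", "on", "street"),
}

precision, recall = triple_precision_recall(predicted, ground_truth)
distance = simple_edit_distance(predicted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} edit distance={distance}")
```

In this toy case the over-dense prediction reaches perfect recall, while its precision drops to 0.5 and four edits (two nodes and two edges) are needed to recover the ground-truth graph.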
Although using image capti (...truncated)