Human-like scene graph generation and evaluation

Multimedia Tools and Applications (2026) 85:552
https://doi.org/10.1007/s11042-026-21166-0

Victor Milewski¹ · Marie-Francine Moens¹ · Maria Mihaela Trusca²

Received: 18 January 2025 / Revised: 15 September 2025 / Accepted: 15 December 2025
© The Author(s) 2026

Abstract

Current methods that generate a scene graph of a given image create an over-dense graph with respect to the number of objects and relationships found in the image. This makes the graph less effective at describing what is relevant in the image and, consequently, less useful in downstream tasks such as cross-modal retrieval and mining. In this work, we propose a novel method that generates a scene graph that reflects what humans find important when describing an image. During training, our scene graph generation method is guided by human-drafted captions describing the images, which we assume focus on essential scene elements. This guidance is realized by properly designed loss functions. During inference, scene graphs are generated solely from images. We evaluate the resulting scene graphs by comparing them with the human-created ground-truth scene graphs of Visual Genome. Evaluation is done with recall- and precision-oriented metrics and graph edit distances. In the first set of experiments, we benchmark existing scene graph generation models; we then add the newly proposed loss functions, which leads to improved performance, especially in terms of the graph edit distance. Additional experiments show that correctly recognizing unimportant background objects and their relationships is crucial when generating human-like scene graphs.
The codebase is released on GitHub: https://github.com/VSJMilewski/relevance_graphs

Keywords Scene graphs · Weak supervision · Visual object and relationship detection

¹ Department of Computer Science, KU Leuven, Celestijnenlaan 200A, Leuven 3001, Belgium
² Faculty of Arts, KU Leuven, Blijde Inkomststraat 21, Leuven 3000, Belgium

1 Introduction

Scene graphs are a popular way to describe images by explicitly annotating the objects present and their semantic relationships. A scene graph is a structured representation that encodes the object labels as vertices and their semantic relationships as edges (optionally, the attributes of objects are also encoded in the graph). Scene graphs have value for various tasks (see the survey by [3] for an overview). For example, in Visual Question Answering (VQA), a model generates an answer solely based on the scene graph of an image, i.e., without the image itself [46]; in cross-modal retrieval tasks, the graph can aid in querying for precise information [14, 42]; and in image captioning, the caption can better describe relationships and interactions between objects when relying on a high-quality scene graph of the image [25, 28]. Moreover, scene graphs can guide video captioning [13, 26] and image and video generation [5, 6].

Current scene graph generation emphasizes obtaining a high recall of objects and relationships, which results in over-dense graphs. Such graphs are inadequate for human inspection and less likely to focus on information humans deem relevant in the image. When such over-dense graphs are used in downstream tasks, the noise added by the many irrelevant relationships harms performance by making correct predictions harder, for instance, in image captioning [28]. The necessity of differentiating between relevant and irrelevant relationships is illustrated in Fig. 1. While the numerous objects detected in Fig. 1a can generate cluttered scene graphs, guiding scene graph generation to select only the relevant objects and relationships of an image results in non-dense graphs that resemble human perception of an image (Fig. 1b). We define dense graphs as graphs that include all object-to-object relationships in an image, without filtering out redundant, weak, or noisy edges that do not add meaningful information. In contrast, non-dense graphs align with how humans perceive an image [43], containing only the relevant connections between its objects.

Fig. 1 Comparison between dense (a) and non-dense (b) scene graphs. The non-dense scene graph aligns with the human image caption "A moped parked near a tree on a sidewalk near a street in front of some cars and a building"

In this work, we propose a novel method that generates a scene graph that reflects what humans find important when describing an image. The scene graph can be considered a surrogate summary of the image’s content. This comparison to summarization allows us to draw inspiration from the definitions made by Peyrard [30], who defines summarization using the concepts of avoidance of redundancy, relevance, and informativeness, which are essential in human-generated and machine-generated summaries. We hypothesize that training the scene graph generation (SGG) method under the guidance of human-drafted image captions will result in scene graphs that are relevant and informative. More specifically, we propose several novel loss functions that are guided by natural language captions. The losses take advantage of the alignment between objects and relationships in the image and their counterparts in the caption. Moreover, we use the grammatical subjects of the captions to guide the selection of central objects in the scene graph.
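As a rough illustration of why optimizing for recall alone favors over-dense graphs, the following Python sketch treats a scene graph as a set of (subject, predicate, object) triples and scores two hypothetical predictions against a hypothetical reference graph. The `Triple` type, the `precision_recall` helper, and all relationship labels are assumptions made for this example, loosely based on the caption of Fig. 1; exact-match set overlap is a simplification and not necessarily the metric definition used in this paper.

```python
from itertools import permutations
from typing import NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """One edge of a scene graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


def precision_recall(generated: Set[Triple],
                     reference: Set[Triple]) -> Tuple[float, float]:
    """Exact-match precision and recall of generated triples."""
    matched = len(generated & reference)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical objects and reference graph (labels are illustrative only).
objects = ["moped", "tree", "sidewalk", "street", "cars", "building"]
reference = {
    Triple("moped", "parked on", "sidewalk"),
    Triple("moped", "near", "tree"),
    Triple("sidewalk", "near", "street"),
    Triple("moped", "in front of", "building"),
}

# Non-dense prediction: a few edges, all of them in the reference.
sparse = {
    Triple("moped", "parked on", "sidewalk"),
    Triple("moped", "near", "tree"),
    Triple("moped", "in front of", "building"),
}

# Over-dense prediction: every reference edge plus a generic "near" edge
# between every ordered pair of objects.
dense = reference | {Triple(a, "near", b) for a, b in permutations(objects, 2)}

if __name__ == "__main__":
    print("sparse:", precision_recall(sparse, reference))
    print("dense: ", precision_recall(dense, reference))
```

Under these assumptions, the dense prediction reaches perfect recall with a precision of only 0.125, whereas the sparse prediction gives up some recall (0.75) for perfect precision. This is the imbalance that a recall-only evaluation cannot reveal.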
Finally, we investigate the influence of correctly identifying unimportant background objects in the image on the SGG process.

We intrinsically evaluate the obtained scene graphs by comparing them with the handcrafted graphs of Visual Genome (VG) [16], and specifically with the curated VG graphs that contain the 150 most frequently occurring object classes and the 50 most frequently occurring relationship predicates. This standard VG subset, often referred to as VG150 in the literature, was introduced by [48] and is commonly used for evaluating scene graph generation [11]. In this paper, we use the standard VG150 split for evaluation. While our experiments focus on VG150, our method can be easily generalized to other datasets that provide high-quality image captions in addition to images and scene graphs. Regarding evaluation metrics, in addition to the commonly used recall-oriented measures, we also propose precision-oriented metrics and metrics that measure structural correspondences between the ground-truth and generated graphs using graph edit distances [1]. Furthermore, we look into the model’s capabilities to distinguish between foreground and background objects and relationships. The goal is to assess how well the generated graphs correspond with human-created scene graphs to an extent not done before in the literature. Although using image capti (...truncated)