International Journal of Computer Vision

International Journal of Computer Vision (IJCV) details the science and engineering of this rapidly growing field. Regular articles present major technical ...

List of Papers (Total 490)

HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving

Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which works towards interpretable risk object detection and suggestions for ego-car motions. Accurate ROLISP implementation requires extensive...

A Closer Look at Benchmarking Self-supervised Pre-training with Image Classification

Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data’s inherent structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels...
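A pretext task derives its training labels from the data itself. The sketch below is a hypothetical illustration only. It uses rotation prediction, a classic pretext task that is not necessarily one benchmarked in this paper. Each unlabeled square image is rotated by k × 90 degrees, and k becomes the label. The function name `make_rotation_pretext_batch` is our own.

```python
# Hypothetical illustration of a pretext task (rotation prediction).
# The label is the rotation we applied, so no external annotation is needed.
import numpy as np

def make_rotation_pretext_batch(images):
    """Turn unlabeled square images of shape (N, H, H) into a labeled batch:
    each image appears four times, rotated by k * 90 degrees, with label k."""
    inputs, labels = [], []
    for img in images:
        for k in range(4):
            inputs.append(np.rot90(img, k))  # square images keep their shape
            labels.append(k)
    return np.stack(inputs), np.array(labels)

rng = np.random.default_rng(0)
unlabeled = rng.random((2, 8, 8))  # two toy 8x8 "images", no labels
x, y = make_rotation_pretext_batch(unlabeled)
print(x.shape, y.tolist())  # (8, 8, 8) [0, 1, 2, 3, 0, 1, 2, 3]
```

A network trained to predict `y` from `x` must pick up on orientation cues in the image content. That is the sense in which the data "provides supervision".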

NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles

This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features over 70 min of event footage acquired with a 640 × 480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and...

Advances in 3D Neural Stylization: A Survey

Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the...

FlowSDF: Flow Matching for Medical Image Segmentation Using Distance Transforms

Medical image segmentation plays an important role in accurately identifying and isolating regions of interest within medical images. Generative approaches are particularly effective in modeling the statistical properties of segmentation masks that are closely related to the respective structures. In this work, we introduce FlowSDF, an image-guided conditional flow matching...
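A distance transform can represent a binary segmentation mask as a signed distance function (SDF). The brute-force sketch below shows one way to compute such a function. It is not FlowSDF's own preprocessing. The sign convention (negative inside, positive outside) and the name `signed_distance` are assumptions chosen for illustration. For realistic image sizes, `scipy.ndimage.distance_transform_edt` is the practical tool.

```python
import numpy as np

def signed_distance(mask):
    """Brute-force signed distance transform of a 2D boolean mask.

    Assumed convention: each pixel gets the Euclidean distance (in pixels)
    to the nearest pixel of the opposite class, negated inside the object.
    Memory grows as O(H*W * K), so this is only suitable for small masks.
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("mask must contain both foreground and background")
    coords = np.argwhere(np.ones_like(mask))  # every pixel, row-major order

    def nearest_dist(points):
        # Distance from every pixel to its closest point in `points`.
        d = np.linalg.norm(coords[:, None, :] - points[None, :, :], axis=-1)
        return d.min(axis=1).reshape(mask.shape)

    dist_to_inside = nearest_dist(np.argwhere(mask))
    dist_to_outside = nearest_dist(np.argwhere(~mask))
    return np.where(mask, -dist_to_outside, dist_to_inside)

mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True                   # a 3x3 square object
sdf = signed_distance(mask)
print(sdf[2, 2], sdf[1, 1], sdf[0, 2])  # -2.0 -1.0 1.0
```

Unlike a binary mask, the SDF varies smoothly across the image. This is one reason distance-based mask representations are attractive targets for generative models.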

LMD: Light-Weight Prediction Quality Estimation for Object Detection in Lidar Point Clouds

Object detection on Lidar point cloud data is a promising technology for autonomous driving and robotics that has seen significant gains in performance and accuracy in recent years. Uncertainty estimation in particular is a crucial component for downstream tasks, as deep neural networks remain error-prone even for high-confidence predictions. Previously proposed...

Realistic Evaluation of Deep Active Learning for Image Classification and Semantic Segmentation

Active learning aims to reduce the high labeling cost involved in training machine learning models on large datasets by efficiently labeling only the most informative samples. Recently, deep active learning has shown success on various tasks. However, the conventional evaluation schemes are either incomplete or below par. This study critically assesses various active learning...
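Active learning strategies differ mainly in how they decide which unlabeled samples are "most informative". A common baseline is entropy-based uncertainty sampling, sketched below. It assumes a model has already produced class probabilities for the unlabeled pool. The function name and toy numbers are hypothetical, and this is not one of the specific strategies assessed in the study.

```python
import numpy as np

def entropy_acquisition(probs, budget):
    """Return indices of the `budget` pool samples whose predicted class
    distribution has the highest entropy (most uncertain first)."""
    probs = np.clip(probs, 1e-12, 1.0)          # avoid log(0)
    entropy = -(probs * np.log(probs)).sum(axis=1)
    return np.argsort(-entropy)[:budget]

# Toy softmax outputs for four unlabeled samples and three classes.
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform: very uncertain
    [0.60, 0.30, 0.10],  # moderately uncertain
    [0.90, 0.05, 0.05],  # fairly confident
])
print(entropy_acquisition(pool_probs, budget=2))  # [1 2]
```

The selected indices would be sent to annotators, added to the labeled set, and the model retrained before the next round.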

A Survey on Deep Stereo Matching in the Twenties

Stereo matching is close to hitting a half-century of history, yet it has witnessed a rapid evolution in the last decade thanks to deep learning. While previous surveys in the late 2010s covered the first stage of this revolution, the last five years of research have brought further ground-breaking advancements to the field. This paper aims to fill this gap in a two-fold manner: first, we...

DustNet++: Deep Learning-Based Visual Regression for Dust Density Estimation

Detecting airborne dust in standard RGB images presents significant challenges. Nevertheless, the monitoring of airborne dust holds substantial potential benefits for climate protection, environmentally sustainable construction, scientific research, and various other fields. To develop an efficient and robust algorithm for airborne dust monitoring, several hurdles have to be...

Diagnosing Human-Object Interaction Detectors

We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on mAP (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (e.g., why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis...

Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning

Self-supervised facial representation learning (SFRL) methods, especially contrastive learning (CL) methods, have become increasingly popular due to their ability to perform face understanding without relying heavily on large-scale, well-annotated datasets. However, on closer analysis, current CL-based SFRL methods still perform unsatisfactorily in learning facial representations due to...

GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection

Zero-shot out-of-distribution (OOD) detection is the task of detecting OOD images at inference time given only in-distribution (ID) class names. Existing methods assume that ID images contain a single, centered object and do not consider the more realistic multi-object scenarios in which both ID and OOD objects are present. To meet the needs of many users, the detection method must have the flexibility to adapt...
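Zero-shot OOD detection of this kind matches an image against the ID class names, treated as "concepts". The sketch below is hedged and rests on several assumptions:

- Image features and class-name embeddings live in one shared CLIP-style space.
- A concept-matching score is the maximum softmax over cosine similarities, with temperature `tau`.
- A global-local variant adds the best such score over local (e.g. patch-level) features.

The embeddings are toy vectors, the function names are our own, and the exact scoring used by GL-MCM may differ.

```python
import numpy as np

def _softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mcm_score(image_feats, text_feats, tau=1.0):
    """Max over ID classes of softmax(cosine similarity / tau).
    image_feats: (..., D); text_feats: (K, D), one row per ID class name."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sims = img @ txt.T                      # (..., K) cosine similarities
    return _softmax(sims / tau).max(axis=-1)

def global_local_score(global_feat, local_feats, text_feats, tau=1.0):
    """Hypothetical combination: global score plus the best local score.
    Higher means the image looks more in-distribution."""
    local = mcm_score(local_feats, text_feats, tau).max()
    return mcm_score(global_feat, text_feats, tau) + local

text = np.eye(3)  # toy embeddings for three ID class names
id_like = global_local_score(np.array([1., 0., 0.]),
                             np.array([[1., 1., 1.], [0., 1., 0.]]), text, tau=0.1)
ood_like = global_local_score(np.array([1., 1., 1.]),
                              np.array([[1., 1., 1.]]), text, tau=0.1)
print(id_like > ood_like)  # True
```

Thresholding such a score yields an ID/OOD decision. The local term lets an image that contains one clearly ID object score high even when the global feature is diluted by other content, which is the multi-object case described above.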

Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors...

Relation-Guided Versatile Regularization for Federated Semi-Supervised Learning

Federated semi-supervised learning (FSSL) aims to address increasing privacy concerns in practical scenarios where data holders have limited labeling capability. The latest FSSL approaches leverage the prediction consistency between the local and global models to exploit knowledge from partially labeled or completely unlabeled clients. However, they merely utilize...

AniClipart: Clipart Animation with Text-to-Video Priors

Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving...

Structured Generative Models for Scene Understanding

This position paper argues for the use of structured generative models (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along...