International Journal of Computer Vision

http://link.springer.com/journal/11263

List of Papers (Total 132)

Classification of Multi-class Daily Human Motion using Discriminative Body Parts and Sentence Descriptions

In this paper, we propose a motion model that focuses on the discriminative parts of the human body related to target motions to classify human motions into specific categories, and apply this model to multi-class daily motion classifications. We extend this model to a motion recognition system which generates multiple sentences associated with human motions. The motion model is ...

Do Semantic Parts Emerge in Convolutional Neural Networks?

Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic ...

Baseline and Triangulation Geometry in a Standard Plenoptic Camera

In this paper, we demonstrate light field triangulation to determine depth distances and baselines in a plenoptic camera. Advances in micro lenses and image sensors have enabled plenoptic cameras to capture a scene from different viewpoints with sufficient spatial resolution. While object distances can be inferred from disparities in a stereo viewpoint pair using triangulation, ...

Editorial Note

How Good Is My Test Data? Introducing Safety Analysis for Computer Vision

Good test data is crucial for driving new developments in computer vision (CV), but two questions remain unanswered: which situations should be covered by the test data, and how much testing is enough to reach a conclusion? In this paper we propose a new answer to these questions using a standard procedure devised by the safety community to validate complex systems: the hazard and ...

Sentence Directed Video Object Codiscovery

Video object codiscovery can leverage the weak semantic constraint implied by sentences that describe the video content. Our codiscovery method, like other object codetection techniques, does not employ any pretrained object models or detectors. Unlike most prior work that focuses on codetecting large objects which are usually salient both in size and appearance, our method can ...

Tubelets: Unsupervised Action Proposals from Spatiotemporal Super-Voxels

This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action ...

Automatic Registration of Images to Untextured Geometry Using Average Shading Gradients

Many existing approaches for image-to-geometry registration assume that either a textured 3D model or a good initial guess of the 3D pose is available to bootstrap the registration process. In this paper we consider the registration of photographs to 3D models even when no texture information is available. This is very challenging as we cannot rely on texture gradients, and even ...

Auto-Calibrated Gaze Estimation Using Human Gaze Patterns

We present a novel method to auto-calibrate gaze estimators based on gaze patterns obtained from other viewers. Our method is based on the observation that the gaze patterns of humans are indicative of where a new viewer will look at. When a new viewer is looking at a stimulus, we first estimate a topology of gaze points (initial gaze points). Next, these points are transformed so ...

Large Scale 3D Morphable Models

We present large scale facial model (LSFM)—a 3D Morphable Model (3DMM) automatically constructed from 9663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust ...

DeepProposals: Hunting Objects and Actions by Cascading Deep Convolutional Layers

In this paper, a new method for generating object and action proposals in images and videos is proposed. It builds on activations of different convolutional layers of a pretrained CNN, combining the localization accuracy of the early layers with the high informativeness (and hence recall) of the later layers. To this end, we build an inverse cascade that, going backward from the ...

A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild”

Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, ...

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same ...

Movie Description

Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally ...

Multi-view Performance Capture of Surface Details

This paper presents a novel approach to recover true fine surface detail of deforming meshes reconstructed from multi-view video. Template-based methods for performance capture usually produce a coarse-to-medium scale detail 4D surface reconstruction which does not contain the real high-frequency geometric detail present in the original video footage. Fine scale deformation is ...

Dynamic Behavior Analysis via Structured Rank Minimization

Human behavior and affect is inherently a dynamic phenomenon involving temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear ...

Real-Time Tracking of Single and Multiple Objects from Depth-Colour Imagery Using 3D Signed Distance Functions

We describe a novel probabilistic framework for real-time tracking of multiple objects from combined depth-colour imagery. Object shape is represented implicitly using 3D signed distance functions. Probabilistic generative models based on these functions are developed to account for the observed RGB-D imagery, and tracking is posed as a maximum a posteriori problem. We present ...

A Comprehensive and Versatile Camera Model for Cameras with Tilt Lenses

We propose camera models for cameras that are equipped with lenses that can be tilted in an arbitrary direction (often called Scheimpflug optics). The proposed models are comprehensive: they can handle all tilt lens types that are in common use for machine vision and consumer cameras and correctly describe the imaging geometry of lenses for which the ray angles in object and image ...

Free-Hand Sketch Synthesis with Deformable Stroke Models

We present a generative model which can automatically summarize the stroke composition of free-hand sketches of a given category. When our model is fit to a collection of sketches with similar poses, it discovers and learns the structure and appearance of a set of coherent parts, with each part represented by a group of strokes. It represents both consistent (topology) as well as ...

Automated Visual Fin Identification of Individual Great White Sharks

This paper discusses the automated visual identification of individual great white sharks from dorsal fin imagery. We propose a computer vision photo ID system and report recognition results over a database of thousands of unconstrained fin images. To the best of our knowledge this line of work establishes the first fully automated contour-based visual ID system in the field of ...

Fast Algorithms for Fitting Active Appearance Models to Unconstrained Images

Fitting algorithms for Active Appearance Models (AAMs) are usually considered to be robust but slow or fast but less able to generalize well to unseen variations. In this paper, we look into AAM fitting algorithms and make the following orthogonal contributions: We present a simple “project-out” optimization framework that unifies and revises the most well-known optimization ...