Biased Competition in Visual Processing Hierarchies: A Learning Approach Using Multiple Cues
Alexander R. T. Gepperth, Sven Rebhan, Stephan Hasler, Jannik Fritsch
A. R. T. Gepperth (corresponding author), S. Rebhan, S. Hasler, J. Fritsch: Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach, Germany
In this contribution, we present a large-scale hierarchical system for object detection fusing bottom-up (signal-driven) processing results with top-down (model- or task-driven) attentional modulation. Specifically, we focus on the question of how the autonomous learning of invariant models can be embedded into a performing system and how such models can be used to define object-specific attentional modulation signals. Our system implements bi-directional data flow in a processing hierarchy. The bottom-up data flow proceeds from a preprocessing level to the hypothesis level where object hypotheses created by exhaustive object detection algorithms are represented in a roughly retinotopic way. A competitive selection mechanism is used to determine the most confident hypotheses, which are used on the system level to train multimodal models that link object identity to invariant hypothesis properties. The top-down data flow originates at the system level, where the trained multimodal models are used to obtain space- and feature-based attentional modulation signals, providing biases for the competitive selection process at the hypothesis level. This results in object-specific hypothesis facilitation/suppression in certain image regions which we show to be applicable to different object detection mechanisms. In order to demonstrate the benefits of this approach, we apply the system to the detection of cars in a variety of challenging traffic videos. Evaluating our approach on a publicly available dataset containing approximately 3,500 annotated video images from more than 1 h of driving, we can show strong increases in performance and generalization when compared to object detection in isolation. Furthermore, we compare our results to a late hypothesis rejection approach, showing that early coupling of top-down and bottom-up information is a favorable approach especially when processing resources are constrained.
Visual processing in the human neocortex is organized in a
hierarchical fashion: neurons in lower levels such as LGN
and V1 have small receptive fields and are sensitive
to a very specific set of stimuli, whereas neurons in higher
areas tend to have larger receptive fields and are
increasingly broad in their selectivity [31]. As a consequence,
neural activity in lower hierarchy levels is tightly coupled
to sensory input, whereas higher-level neurons may well
respond to rather abstract categories and concepts [31]. It
has long been known that information processing in such
hierarchies is bi-directional, consisting of a bottom-up
(away from sensory input) and a top-down (toward sensory
input) component [12, 17], and this has been linked to
accounts of attentional modulation, i.e., the selective and
large-scale enhancing or suppressing of neuronal responses
in accordance with task demands [14, 22, 32]. For visual
processing, there seem to exist at least two concurrently
active mechanisms of attentional modulation: space-based
attention that enhances certain locations in the visual field
and feature-based attention that is not localized but affects
all populations of neurons representing a particular visual
property [11].
Since cortical neurons, especially at high hierarchy
levels, compete strongly with each other for representing
the current stimulus, it has been proposed that local
facilitation or inhibition of neural responses by top-down
signals can explain the pronounced effects of attentional
modulation simply because small local biases may result in
very different stable states of the competition process [4,
18, 28]. This biased competition [4] account of attentional
modulation has influenced many models of visual
attention; we incorporated it into our research because we found
that competition between object hypotheses is an
unavoidable step for agents with constrained resources; the
biasing of the existing competition mechanism is then a
straightforward extension.
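The claim that small local biases can lead to very different stable states can be illustrated with a toy competition. The following is a minimal sketch, not the selection mechanism of the presented system: it assumes a simple winner-take-all dynamic (squaring and renormalizing activities), and the names `compete`, `winner`, and the confidence values are hypothetical.

```python
def compete(confidences, bias=None, steps=30):
    """Toy winner-take-all competition (illustrative assumption only)."""
    if bias is None:
        bias = [0.0] * len(confidences)
    # The top-down bias is added to the bottom-up confidence before competing.
    act = [max(0.0, c + b) for c, b in zip(confidences, bias)]
    for _ in range(steps):
        squared = [a * a for a in act]
        total = sum(squared)
        if total == 0.0:  # no active hypothesis left to compete
            break
        # Squaring and renormalizing amplifies small leads at every step.
        act = [s / total for s in squared]
    return act


def winner(confidences, bias=None):
    """Index of the hypothesis that survives the competition."""
    act = compete(confidences, bias)
    return max(range(len(act)), key=act.__getitem__)


bottom_up = [0.50, 0.52, 0.30]                    # hypothetical detector confidences
print(winner(bottom_up))                          # unbiased competition
print(winner(bottom_up, bias=[0.05, 0.0, 0.0]))   # small top-down bias on hypothesis 0
```

In this sketch, hypothesis 1 wins the unbiased competition, whereas a bias of only 0.05 on hypothesis 0 is enough to make hypothesis 0 the sole survivor.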
Since attentional modulation is observed to enhance
performance w.r.t. a wide variety of tasks, the question
immediately arises how models for task-specific attentional
modulation are obtained. An influential concept, the
so-called reverse hierarchy theory [12], states that such
models are first acquired in high levels of the processing
hierarchy and subsequently used to train task-specific
responses in lower levels. We present the method of
system-level learning that implements an important aspect of
reverse hierarchy theory by introducing dependency
models between highly invariant quantities available on the
highest level of a processing system. This is motivated by
our finding that such system-level models usually show
high generalization ability.
Motivation for the Presented Work
Our experience with cluttered and uncontrolled traffic
environments suggests that purely appearance-based (i.e.,
based on local pixel patterns) object detection suffers
from significant ambiguities: the more complex a scene is,
the higher the probability that some local pixel pattern
will resemble an object of the class of interest. In order to
overcome this difficulty, we claim that object-specific
models relating appearance-based visual information to
non-local and non-visual information must be taken into
account to achieve the required disambiguation. For
convergent, hierarchically organized systems, this implies
that such models can only be formed at high hierarchy
levels where the required information is available. The
idea of system-level learning (see also [8]) is to represent
all quantities available at the highest hierarchy level in a
common way in order to use a single, scalable learning
algorithm for detecting correlations. The focus of this
article is on using system-level models to generate
expectations from which attentional modulation is derived:
given a search cue, e.g., a certain object identity,
system-level models are queried for features correlated with this
identity, and the resulting expectation is used to define
attentional modulation.
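The query step described above can be sketched as follows. This is a hedged illustration only: it assumes that a system-level model can be approximated by a co-occurrence table between object identities and discretized hypothesis properties, which is not necessarily the learning algorithm used in the presented system. The class `CooccurrenceModel`, the function `modulation_from_cue`, the bin layout, and the `gain` parameter are all hypothetical.

```python
from collections import defaultdict


class CooccurrenceModel:
    """Counts how often an identity co-occurs with a feature bin (assumed model)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, identity, feature_bin):
        # One training example: a confident hypothesis with a known identity.
        self.counts[identity][feature_bin] += 1

    def expectation(self, identity, n_bins):
        # Relative frequency of each feature bin given the search cue.
        row = self.counts.get(identity, {})
        total = sum(row.values())
        if total == 0:
            return [0.0] * n_bins  # unknown cue: no expectation
        return [row.get(b, 0) / total for b in range(n_bins)]


def modulation_from_cue(model, identity, n_bins, gain=0.1):
    """Scale the expectation into a bias for the competition between hypotheses."""
    return [gain * e for e in model.expectation(identity, n_bins)]


model = CooccurrenceModel()
for feature_bin in (2, 2, 2, 1):   # hypothetical image-region bins of observed cars
    model.train("car", feature_bin)
model.train("sign", 0)

print(model.expectation("car", 4))
print(modulation_from_cue(model, "car", 4))
```

Here the search cue "car" yields the largest expectation, and therefore the strongest facilitation, for bin 2, where most training examples fell. A cue that was never observed yields no modulation at all.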
Fig. 1 Illustration of the basic structure and the inherent novel points
of the presented system. a Learning of multi-modal system-level
models for generating attentional modulation during system operation.
b Application of system-level models for attentional modulation.
What kinds of models are learned effectively depends only on the
processing results that are supplied to the system-level learning
mechanism.
Research Questions, Claims and Messages
Based on our experience with object detection in complex
traffic scenes, we formulated a number of hypotheses,
which this article will investigate based on a hierarchical
car detection system as shown in Fig. 1. We
evaluate the system in challenging real-world situations using
extended annota (...truncated)