Biased Competition in Visual Processing Hierarchies: A Learning Approach Using Multiple Cues

https://link.springer.com/content/pdf/10.1007%2Fs12559-010-9092-x.pdf

Alexander R. T. Gepperth, Sven Rebhan, Stephan Hasler, Jannik Fritsch

Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach, Germany

In this contribution, we present a large-scale hierarchical system for object detection fusing bottom-up (signal-driven) processing results with top-down (model or task-driven) attentional modulation. Specifically, we focus on the question of how the autonomous learning of invariant models can be embedded into a performing system and how such models can be used to define object-specific attentional modulation signals. Our system implements bi-directional data flow in a processing hierarchy. The bottom-up data flow proceeds from a preprocessing level to the hypothesis level, where object hypotheses created by exhaustive object detection algorithms are represented in a roughly retinotopic way. A competitive selection mechanism is used to determine the most confident hypotheses, which are used on the system level to train multimodal models that link object identity to invariant hypothesis properties. The top-down data flow originates at the system level, where the trained multimodal models are used to obtain space- and feature-based attentional modulation signals, providing biases for the competitive selection process at the hypothesis level. This results in object-specific hypothesis facilitation/suppression in certain image regions, which we show to be applicable to different object detection mechanisms. In order to demonstrate the benefits of this approach, we apply the system to the detection of cars in a variety of challenging traffic videos. Evaluating our approach on a publicly available dataset containing approximately 3,500 annotated video images from more than 1 h of driving, we can show strong increases in performance and generalization when compared to object detection in isolation. Furthermore, we compare our results to a late hypothesis rejection approach, showing that early coupling of top-down and bottom-up information is a favorable approach, especially when processing resources are constrained.

Visual processing in the human neocortex is organized in a hierarchical fashion: neurons in lower levels such as LGN, V1 and A1 have small receptive fields and are sensitive to a very specific set of stimuli, whereas neurons in higher areas tend to have larger receptive fields and are increasingly broad in their selectivity [31]. As a consequence, neural activity in lower hierarchy levels is tightly coupled to sensory input, whereas higher-level neurons may well respond to rather abstract categories and concepts [31]. It has long been known that information processing in such hierarchies is bi-directional, consisting of a bottom-up (away from sensory input) and a top-down (toward sensory input) component [12, 17], and this has been linked to accounts of attentional modulation, i.e., the selective and large-scale enhancing or suppressing of neuronal responses in accordance with task demands [14, 22, 32]. For visual processing, there seem to exist at least two concurrently active mechanisms of attentional modulation: space-based attention, which enhances certain locations in the visual field, and feature-based attention, which is not localized but affects all populations of neurons representing a particular visual property [11]. Since cortical neurons, especially at high hierarchy levels, compete strongly with each other for representing the current stimulus, it has been proposed that local facilitation or inhibition of neural responses by top-down signals can explain the pronounced effects of attentional modulation simply because small local biases may result in very different stable states of the competition process [4, 18, 28].
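The claim that a small bias can change the stable state of a competition can be illustrated with a toy model. The following is a minimal sketch, not the selection mechanism used in this article: it assumes a handful of units with self-excitation and mutual inhibition, and the function name `compete` and all parameter values (`w_exc`, `w_inh`, `rate`, `steps`) are hypothetical choices made only for illustration.

```python
def compete(inputs, biases=None, w_exc=0.5, w_inh=1.0, rate=0.1, steps=500):
    """Iterate a small mutual-inhibition network until it settles.

    inputs : bottom-up evidence per unit (illustrative values)
    biases : optional top-down bias per unit (illustrative values)
    All weights and the update rule are assumptions for this sketch.
    """
    n = len(inputs)
    if biases is None:
        biases = [0.0] * n
    activity = [0.0] * n
    for _ in range(steps):
        total = sum(activity)
        updated = []
        for i in range(n):
            # Evidence plus bias, self-excitation, inhibition from all other units.
            drive = (inputs[i] + biases[i]
                     + w_exc * activity[i]
                     - w_inh * (total - activity[i]))
            value = activity[i] + rate * (drive - activity[i])
            # Keep activities in the range [0, 1].
            updated.append(min(1.0, max(0.0, value)))
        activity = updated
    return activity


if __name__ == "__main__":
    evidence = [0.50, 0.52]
    # Without bias, the unit with slightly stronger evidence wins.
    print(compete(evidence))                      # [0.0, 1.0]
    # A small top-down bias on unit 0 flips the outcome.
    print(compete(evidence, biases=[0.05, 0.0]))  # [1.0, 0.0]
```

In this toy setting the bias is an order of magnitude smaller than the evidence itself, yet it decides which unit ends up fully active and which is fully suppressed.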
This biased competition [4] account of attentional modulation has influenced many models of visual attention; we incorporated it into our research because we found that competition between object hypotheses is an unavoidable step for agents with constrained resources; the biasing of the existing competition mechanism is then a straightforward extension.

Since attentional modulation is observed to enhance performance with respect to a wide variety of tasks, the question immediately arises how models for task-specific attentional modulation are obtained. An influential concept, the so-called reverse hierarchy theory [12], states that such models are first acquired in high levels of the processing hierarchy and subsequently used to train task-specific responses in lower levels. We present the method of system-level learning, which implements an important aspect of reverse hierarchy theory by introducing dependency models between highly invariant quantities available on the highest level of a processing system. This is motivated by our finding that such system-level models usually show high generalization ability.

Motivation for the Presented Work

Our experience with cluttered and uncontrolled traffic environments suggests that purely appearance-based (i.e., based on local pixel patterns) object detection suffers from significant ambiguities: the more complex a scene is, the higher the probability that some local pixel pattern will be similar to the object class of interest. In order to overcome this difficulty, we claim that object-specific models relating appearance-based visual information to non-local and non-visual information must be taken into account to achieve the required disambiguation. For convergent, hierarchically organized systems, this implies that such models can only be formed at high hierarchy levels where the required information is available.
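As a rough illustration of what a dependency model between an object identity and an invariant, non-appearance property could look like, the sketch below counts co-occurrences and turns them into an expectation. This is an assumption-laden toy, not the learning algorithm of this article: the class name `CooccurrenceModel`, the counting scheme, and the example property values (`"lower_image"`, `"upper_image"`) are all hypothetical.

```python
from collections import defaultdict


class CooccurrenceModel:
    """Toy dependency model: counts how often an identity co-occurs
    with a discrete property value (hypothetical illustration only)."""

    def __init__(self):
        # counts[identity][prop] = number of joint observations
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, identity, prop):
        """Record one confident hypothesis with its identity and property."""
        self.counts[identity][prop] += 1

    def expectation(self, identity):
        """Return relative frequencies of property values for an identity."""
        row = self.counts.get(identity, {})
        total = sum(row.values())
        if total == 0:
            return {}
        return {prop: n / total for prop, n in row.items()}


if __name__ == "__main__":
    model = CooccurrenceModel()
    # Hypothetical training data: cars mostly appear in the lower image half.
    for prop in ["lower_image", "lower_image", "lower_image", "upper_image"]:
        model.observe("car", prop)
    print(model.expectation("car"))  # {'lower_image': 0.75, 'upper_image': 0.25}
```

An expectation of this kind could, in principle, be used to favor hypotheses whose properties match what has been observed for the searched identity, which is the role that object-specific models play in the argument above.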
The idea of system-level learning (see also [8]) is to represent all quantities available at the highest hierarchy level in a common way in order to use a single, scalable learning algorithm for detecting correlations. The focus of this article is on using system-level models to generate expectations and to derive attentional modulation from them: given a search cue, e.g., a certain object identity, system-level models are queried for features correlated with this identity, and the resulting expectation is used to define attentional modulation.

Fig. 1 Illustration of the basic structure and the inherent novel points of the presented system. (a) Learning of multi-modal system-level models for generating attentional modulation during system operation. (b) Application of system-level models for attentional modulation. What kinds of models are learned effectively depends only on the processing results that are supplied to the system-level learning mechanism.

Research Questions, Claims and Messages

Based on our experience with object detection in complex traffic scenes, we formulated a number of hypotheses, which this article will investigate based on a hierarchical car detection system as shown in Fig. 1. We evaluate the system in challenging real-world situations using extended annota (...truncated)