Dynamic, Task-Related and Demand-Driven Scene Representation

http://link.springer.com/content/pdf/10.1007%2Fs12559-010-9077-9.pdf

Sven Rebhan (✉) · Julian Eggert
Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany

Humans selectively process and store details about the vicinity based on their knowledge about the scene, the world and their current task. In doing so, only those pieces of information are extracted from the visual scene that are required for solving a given task. In this paper, we present a flexible system architecture along with a control mechanism that allows for a task-dependent representation of a visual scene. Contrary to existing approaches, our system is able to acquire information selectively according to the demands of the given task and based on the system's knowledge. The proposed control mechanism decides which properties need to be extracted and how the independent processing modules should be combined, based on the knowledge stored in the system's long-term memory. Additionally, it ensures that algorithmic dependencies between processing modules are resolved automatically, utilizing procedural knowledge which is also stored in the long-term memory. By evaluating a proof-of-concept implementation on a real-world table scene, we show that, while solving the given task, the amount of data processed and stored by the system is considerably lower than in the processing regimes used in state-of-the-art systems. Furthermore, our system only acquires and stores the minimal set of information that is relevant for solving the given task.

The visual environment of humans is full of details. To account for limited computational power and memory capacity, humans selectively process and store those details present in the visual scene. Different experiments show that the selection process is based on the given task and the subject's knowledge about the vicinity and the world. The strong influence of a given task on the way we scan a scene was shown in Yarbus [40].
There, the observed subjects' scan patterns for a visual scene varied depending on the given task. Subjects fixated locations containing task-relevant information more frequently, whereas other locations were not visited at all. The results of this experiment were also confirmed by others, e.g. [18, 21, 29, 31]. Besides the task, the knowledge about the current scene and the world in general plays an important role in the way we process a visual scene, as experiments conducted by e.g. [9, 19, 21, 34] show. All those experiments suggest that locations relevant for solving a certain task are preferred over locations containing no such information. However, the attention of humans is not limited to spatial selectivity but also applies to the details stored about objects in the scene, such as color, size, form, etc. Experiments presented in Ballard et al. [18, 36] investigate the relation between a given task and the details stored about objects in the scene. They suggest that subjects store only those properties of objects which are relevant for solving a given task. As Triesch put it: "What we see is what we need" [36]. To summarize, short- and long-term memory as well as the current task bias the attention on objects in the current scene. The experiments have also shown that our attention is guided not only spatially but also in the feature domain.

On the modeling side of visual attention, quite a few architectures for vision systems exist. As an unbounded and data-driven visual search is NP-complete [37], we here review only models that are based on a top-down guided attention schema. One of the first models describing top-down guided visual processing was proposed in Bajcsy [4]. The so-called Active Vision models of Aloimonos et al. [2] interpret vision as an active process, where sensor parameters like zoom, focus, gain or gaze are actively modulated to disambiguate the visual input in a task-specific manner.
While earlier work focused on the control of sensory parameters, Ballard [6] and Aloimonos [1] emphasize the ability to control the gaze and thus the spatially selective processing and representation of the scene. Another group of system architectures, based on the work of Rensink [28], concentrates on modeling the top-down influence on attention processes. In Navalpakkam and Itti [25], information about the spatial layout of a scene and the knowledge about the world is used to guide attention to locations in the image containing objects relevant for a given task. Here, the knowledge about relevant objects stored in a long-term memory is used to modulate input feature channels to render those objects more salient. In doing so, a coupling of the system's memory and its sensory apparatus is achieved, allowing a system to integrate information about its world over time and reuse it later. Another attention system incorporating top-down knowledge was published in Hamker [16]. Here, the author focuses on a biologically plausible modeling of the top-down influences by incorporating an expectation about the features that should be seen at the target location. Similar approaches using a top-down modulation in the feature space can be found in e.g. Frintrop et al. [13].

All models of vision systems presented so far incorporate the task and the knowledge about the scene and objects in the world. However, the attention in current state-of-the-art systems (see Frintrop et al. [14] for a comprehensive review) is only guided spatially. In contrast to what psychophysical experiments suggest, those models lack attention to which properties of an object should be extracted from the scene. That is, once a certain object is attended, state-of-the-art models extract all properties of this object, not only those relevant for the task.
By not selectively processing the features of objects, these systems neglect potential savings in both the amount of processing and the memory capacity used, which are relevant for resource-constrained vision systems. For example, if the task only requires determining the color of an object, state-of-the-art models will nevertheless run a classifier and store a full-fledged representation of the object, as they are built on static processing pathways. In such a processing paradigm, higher-level information is computed in pipelines from the image pixels up to e.g. an object ID, while the different stages are modulated in a top-down manner. However, due to the static processing pipelines, a selective extraction of information is not possible, as this would require running and dynamically concatenating subparts of the processing pipelines. In the example of extracting the color of an object, subparts like saliency computation, segmentation, and color extraction could form a color-extraction process, while e.g. the classifier is not run. This example shows that a more flexible and dynamic system architecture is required, allowing for an easy combination of different processing modules. Very early ideas on such a flexible a (...truncated)
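
The color-extraction example above can be illustrated with a minimal sketch of demand-driven module selection. This is not the authors' control mechanism: the module table, the dependency edges, and the function name `resolve` are assumptions made purely for illustration. The sketch only shows the general idea that requesting one property (here, color) pulls in the modules it depends on and leaves unrelated modules, such as the classifier, untouched.

```python
from typing import Dict, List, Set

# Hypothetical module table: each module lists the modules it depends on.
# The module names follow the color-extraction example in the text; the
# dependency edges are an assumption made for illustration only.
MODULES: Dict[str, List[str]] = {
    "saliency": [],
    "segmentation": ["saliency"],
    "color": ["segmentation"],
    "classifier": ["segmentation"],
}


def resolve(target: str, modules: Dict[str, List[str]]) -> List[str]:
    """Return the modules required for `target`, dependencies first."""
    order: List[str] = []
    visiting: Set[str] = set()

    def visit(name: str) -> None:
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"cyclic dependency at {name!r}")
        visiting.add(name)
        for dep in modules[name]:
            visit(dep)  # schedule prerequisites before the module itself
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


plan = resolve("color", MODULES)
print(plan)  # ['saliency', 'segmentation', 'color']
```

In this sketch, asking for the color yields a three-module chain and never schedules the classifier, which mirrors the saving in processing that the text argues a static pipeline cannot provide.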