Dynamic, Task-Related and Demand-Driven Scene Representation
Sven Rebhan, Julian Eggert
S. Rebhan (✉), J. Eggert: Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
Humans selectively process and store details about their vicinity based on their knowledge about the scene, the world and their current task. In doing so, only those pieces of information that are required for solving a given task are extracted from the visual scene. In this paper, we present a flexible system architecture along with a control mechanism that allows for a task-dependent representation of a visual scene. Contrary to existing approaches, our system is able to acquire information selectively according to the demands of the given task and based on the system's knowledge. The proposed control mechanism decides which properties need to be extracted and how the independent processing modules should be combined, based on the knowledge stored in the system's long-term memory. Additionally, it ensures that algorithmic dependencies between processing modules are resolved automatically, utilizing procedural knowledge which is also stored in the long-term memory. By evaluating a proof-of-concept implementation on a real-world table scene, we show that, while solving the given task, the amount of data processed and stored by the system is considerably lower compared to processing regimes used in state-of-the-art systems. Furthermore, our system only acquires and stores the minimal set of information that is relevant for solving the given task.
The visual environment of humans is full of details. To
account for a limited computational power and memory
capacity, humans selectively process and store those details
present in the visual scene. Different experiments show that
the selection process is based on the given task and the
subject's knowledge about the vicinity and the world. The
strong influence of a given task on the way we scan a scene
was shown by Yarbus [40]. There, the observed subject's
scan pattern for a visual scene varied depending on the
given task. Subjects fixated locations containing
task-relevant information more frequently, whereas other locations
were not visited at all. The results of this experiment were
also confirmed by others, e.g. [18, 21, 29, 31]. Besides the
task, knowledge about the current scene and the world in
general plays an important role in the way we process a
visual scene, as experiments conducted by e.g. [9, 19, 21, 34]
show. All those experiments suggest that locations that are
relevant for solving a certain task are preferred in contrast to
locations containing no such information. However, the
attention of humans is not limited to spatial selectivity but
also applies to the details stored about objects in the scene
such as color, size, form, etc. Experiments presented in
Ballard et al. [18, 36] investigate the relation between a
given task and the details stored about objects in the scene.
They suggest that subjects store only those properties of
objects which are relevant to solve a given task. As Triesch
put it: "What we see is what we need" [36]. To summarize,
short- and long-term memory as well as the current task bias
the attention on objects in the current scene. The
experiments have also shown that our attention is not only guided
spatially but also in the feature domain.
On the modeling side of visual attention, there exist
quite a few architectures for vision systems. As an unbounded
and data-driven visual search is NP-complete [37], we here
review only models that are based on a top-down guided
attention schema. One of the first models describing
top-down guided visual processing was proposed in Bajcsy [4].
The so-called Active Vision models (Aloimonos et al. [2])
interpret vision as an active process, where sensor
parameters like zoom, focus, gain or gaze are actively modulated
to disambiguate the visual input in a task-specific manner.
While earlier work focused on the control of sensory
parameters, Ballard [6] and Aloimonos [1] emphasize the
ability to control the gaze and thus the spatially selective
processing and representation of the scene. Another group
of system architectures based on the work of Rensink [28]
concentrates on modeling the top-down influence on
attention processes. In Navalpakkam and Itti [25],
information about the spatial layout of a scene and the
knowledge about the world is used to guide attention to locations
in the image containing objects relevant for a given task.
Here, the knowledge about relevant objects stored in a
long-term memory is used to modulate input feature
channels to render those objects more salient. In doing so, a
coupling of the system's memory and its sensory apparatus
is achieved, allowing a system to integrate information
about its world over time and reuse it later. Another
attention system incorporating top-down knowledge was
published in Hamker [16]. Here, the author focuses on a
biologically plausible modeling of the top-down influences
by incorporating an expectation about the features that
should be seen at the target location. Similar approaches
using a top-down modulation in the feature space can be
found in e.g. Frintrop et al. [13].
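
To make the idea of top-down modulated feature channels concrete, the following minimal sketch weights each feature map with a task-dependent gain and sums the results into a saliency map. It is an illustration under simplifying assumptions, not the implementation of any of the cited models: the function names, the channel names ("red", "vertical"), the toy feature maps and the gain values are all hypothetical.

```python
# Hypothetical sketch of top-down modulation of feature channels.
# All names, maps and gains below are illustrative assumptions.

def top_down_saliency(feature_maps, weights):
    """Sum the feature maps, each scaled by its top-down gain.

    feature_maps: dict, channel name -> 2D list of activations (same shape)
    weights: dict, channel name -> gain (channels without a gain get 0.0)
    """
    names = list(feature_maps)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for name in names:
        gain = weights.get(name, 0.0)
        fmap = feature_maps[name]
        for r in range(rows):
            for c in range(cols):
                saliency[r][c] += gain * fmap[r][c]
    return saliency


def most_salient_location(saliency):
    """Return the (row, col) position of the saliency maximum."""
    best = max(
        ((value, (r, c))
         for r, row in enumerate(saliency)
         for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1]


# Toy 2x3 scene: a red object at (0, 0), a vertically oriented one at (1, 2).
feature_maps = {
    "red": [[0.9, 0.1, 0.0], [0.0, 0.1, 0.2]],
    "vertical": [[0.1, 0.0, 0.1], [0.0, 0.2, 0.8]],
}

# Gains assumed to come from long-term memory for two different tasks.
red_task = {"red": 1.0, "vertical": 0.1}
vertical_task = {"red": 0.1, "vertical": 1.0}

focus_red = most_salient_location(top_down_saliency(feature_maps, red_task))
focus_vertical = most_salient_location(
    top_down_saliency(feature_maps, vertical_task)
)
```

With the same input, the two sets of gains move the focus of attention to different locations, which is the spatial effect the models above aim for. Note that the modulation only changes where the system looks, not which object properties are extracted afterwards.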
All models of vision systems presented so far
incorporate the task and the knowledge about the scene and objects
in the world. However, the attention in current
state-of-the-art systems (see Frintrop et al. [14] for a comprehensive
review) is only guided spatially. It becomes clear that, in
contrast to what psychophysical experiments suggest, those
models lack attention to which properties of an object
should be extracted from the scene. That is, once a certain
object is attended, in state-of-the-art models all properties
of this object are extracted, not only those relevant for the
task. By not selectively processing the features of objects,
these systems neglect potential savings in both the amount
of processing and the used memory capacity, which are
relevant for resource-constrained vision systems. For
example, if the task only requires determining the color of an
object, state-of-the-art models nevertheless will run a
classifier and store a full-fledged representation of the
object as they are built on static processing pathways. In
such a processing paradigm, higher-level information is
computed in pipes from the image pixel up to e.g. an object
ID while modulating the different stages in a top-down
manner. However, due to the static processing pipelines, a
selective extraction of information is not possible as this
would require running and dynamically concatenating subparts
of the processing pipelines. In the example of extracting
the color of an object, the subparts like saliency
computation, segmentation, and color extraction could form a
color-extraction process, while not running e.g. the
classifier. This example shows that a more flexible and
dynamic system architecture is required, allowing for an
easy combination of different processing modules. Very
early ideas on such a flexible a (...truncated)