The initial phase of auditory and visual scene analysis
Jean-Michel Hupé (2) and Daniel Pressnitzer (0, 1)
(0) Département d'Etudes Cognitives, Ecole Nationale Supérieure, Paris, France
(1) Laboratoire de Psychologie de la Perception, Université Paris Descartes and Centre National de la Recherche Scientifique, 75006 Paris, France
(2) Centre de Recherche Cerveau et Cognition, Université de Toulouse and Centre National de la Recherche Scientifique, 31300 Toulouse, France
Auditory streaming and visual plaids have been used extensively to study perceptual organization in each modality. Both stimuli can produce bistable alternations between grouped (one object) and split (two objects) interpretations. They also share two peculiar features: (i) at the onset of stimulus presentation, organization starts with a systematic bias towards the grouped interpretation; (ii) this first percept has 'inertia'; it lasts longer than the subsequent ones. As a result, the probability of forming different objects builds up over time, a landmark of both behavioural and neurophysiological data on auditory streaming. Here we show that first percept bias and inertia are independent. In plaid perception, inertia is due to a depth ordering ambiguity in the transparent (split) interpretation that makes plaid perception tristable rather than bistable: experimental manipulations removing the depth ambiguity suppressed inertia. However, the first percept bias persisted. We attempted a similar manipulation for auditory streaming by introducing level differences between streams, to bias which stream would appear in the perceptual foreground. Here both inertia and first percept bias persisted. We thus argue that the critical common feature of the onset of perceptual organization is the grouping bias, which may be related to the transition from temporally/spatially local to temporally/spatially global computation.
1. INTRODUCTION
Auditory scene analysis leads to the formation of
perceptual objects, or streams, from the flow of acoustic
information reaching the ears [1]. It is what allows us
to follow a conversation in a crowded restaurant, in
the midst of other conversations, with music in the
background and the sound of tinkling glasses. An
essential feature of streaming is that it takes time:
initially, subjects tend to group all of the acoustic
information into one global stream [2,3]. When we first
walk into the restaurant, the first impression may be
of a loud and undifferentiated noise [4]. Only after
some time do streams begin to differentiate, allowing
switching of attention to the different sound sources.
This is termed the build up of streaming. The aim
of this paper is to re-examine the initial build up of
streaming in a bistable paradigm and to compare it
with a similar paradigm in vision.
Streaming has been studied extensively with
sequences of tones akin to simple musical melodies
[5-7]. For instance, the subject may hear L and H
tones, where L and H represent low and high tone
frequencies, repeated in an LHL LHL . . . sequence
[8]. For such a stimulus, the first report is usually of
one stream (grouped percept) experienced as a
single melody LHL LHL . . . . After a few seconds
or tens of seconds, however, perception changes to
that of two streams (split percept), L L L and
H H , which are heard as two concurrent melodies
that can be attended selectively, but not
simultaneously. Because the first switch to two streams is
probabilistic, when averaging over subjects and/or
stimulus presentations, one observes a gradual
increase in the probability of a two-stream percept
over time [3,9]. If any sudden change is introduced
in the sequence, such as a change in location,
loudness, in the silent pause between tones or even in the
attentional focus of the subject, streaming is reset
and build up starts again [3,7,10,11].
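
The averaging argument above can be illustrated with a minimal simulation sketch. It is not the authors' analysis: the exponential distribution of first-switch times, the 8 s mean, the number of trials and the function names are all assumptions chosen for illustration. Each simulated trial switches abruptly from one stream to two streams at a random time, yet the proportion of trials in the two-stream state rises gradually. Later alternations back to one stream are deliberately ignored.

```python
import random


def simulate_first_switch_times(n_trials, mean_switch_s=8.0, seed=0):
    """Draw one hypothetical first-switch time (in seconds) per trial.

    Assumption: switch times follow an exponential distribution.
    This is chosen only for simplicity, not taken from the paper.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_switch_s) for _ in range(n_trials)]


def build_up_curve(switch_times, time_points):
    """Proportion of trials that have already switched to two streams.

    Each trial is a step function (one stream before its switch time,
    two streams after); only the average across trials is gradual.
    """
    n = len(switch_times)
    return [sum(1 for s in switch_times if s <= t) / n for t in time_points]


time_points = [0, 2, 4, 8, 16, 32]  # seconds since sequence onset
switch_times = simulate_first_switch_times(n_trials=2000)
curve = build_up_curve(switch_times, time_points)

for t, p in zip(time_points, curve):
    print(f"t = {t:2d} s  P(two streams) = {p:.2f}")
```

Because this sketch models only the first switch, the simulated curve approaches 1. It therefore does not capture any return to the one-stream percept.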
The build up of streaming has been used as an
essential landmark of streaming. In studies measuring
objective correlates of streaming, the onset versus offset
of the streaming sequence is usually contrasted, so
performance changes can be attributed to build up and
not to acoustic manipulations [12]. Build up is also
used to investigate the effect of attention on streaming,
with subjective [9] and objective [13] methods. In
animal electrophysiology, build up provides a useful
tool for accessing the temporal dynamics of streaming,
as it is a measure that can be averaged across trials and
that does not require the co-registration of the
perceptual state. Correlates of build up have been found both
in the auditory cortex [14] and in the cochlear nucleus
[15] of the mammalian auditory system.
Interestingly, in all of the published data, after the
initial build up, the probability of hearing two streams
stabilizes below 100%. As pointed out by Pressnitzer &
Hupé [16], this indicates that there are subsequent
perceptual alternations back and forth to a one-stream
percept after the initial build up. Perceptual reports for
long-lasting sequences confirmed that streaming was
indeed a bistable phenomenon [16-18]. The build up
of streaming was described [16] as a combination of a
systematic bias towards the one-stream interpretation
at stimulus onset (even when the two-stream
interpretation was later experienced most of the time) and a
longer duration of this first percept compared with
subsequent one-stream percepts (we shall call this duration
effect the inertia of the first percept); see figs 1a
and S3 of Pressnitzer & Hupé [16]. Such dynamics are
different from those observed in classic examples of
visual bistability like binocular rivalry, ambiguous figures
or apparent motion. In such instances, when both
interpretations are equally likely, which percept is first
is random (unless the stimulus was presented a short
time before; in that case, perceptual stabilization can
occur [19,20]). When the stimulus is biased in favour
of one interpretation (for example, higher contrast of
the stimulus presented to one eye in binocular rivalry),
the first percept typically corresponds to the biased
interpretation [21]. Also, percept durations are
stochastic but, on average, the duration is rather constant over
time for each interpretation (average durations being
longer for the preferred interpretation), with no inertia
of the first percept. Exceptions to this rule occur when
observers are not familiar with different interpretations
of the stimulus or are not informed that their perception
may change; in that case, the first percept may last much
longer than subsequent percepts [22-24], as observed
especially for children [25]. Changes in the switching
rate over long presentation times have also been reported
for some [26] but not all [27] ambiguous visual stimuli.
In any case, a constant switching rate should only be
observed when subjects are luminance adapted (since
luminance adaptation over time correspon (...truncated)