The initial phase of auditory and visual scene analysis

Philosophical Transactions of the Royal Society B: Biological Sciences, Apr 2012

Auditory streaming and visual plaids have been used extensively to study perceptual organization in each modality. Both stimuli can produce bistable alternations between grouped (one object) and split (two objects) interpretations. They also share two peculiar features: (i) at the onset of stimulus presentation, organization starts with a systematic bias towards the grouped interpretation; (ii) this first percept has ‘inertia’; it lasts longer than the subsequent ones. As a result, the probability of forming different objects builds up over time, a landmark of both behavioural and neurophysiological data on auditory streaming. Here we show that first percept bias and inertia are independent. In plaid perception, inertia is due to a depth ordering ambiguity in the transparent (split) interpretation that makes plaid perception tristable rather than bistable: experimental manipulations removing the depth ambiguity suppressed inertia. However, the first percept bias persisted. We attempted a similar manipulation for auditory streaming by introducing level differences between streams, to bias which stream would appear in the perceptual foreground. Here both inertia and first percept bias persisted. We thus argue that the critical common feature of the onset of perceptual organization is the grouping bias, which may be related to the transition from temporally/spatially local to temporally/spatially global computation.

Jean-Michel Hupé (2), Daniel Pressnitzer (0, 1)

0 Département d'Etudes Cognitives, Ecole Nationale Supérieure, Paris, France
1 Laboratoire de Psychologie de la Perception, Université Paris Descartes and Centre National de la Recherche Scientifique, 75006 Paris, France
2 Centre de Recherche Cerveau et Cognition, Université de Toulouse and Centre National de la Recherche Scientifique, 31300 Toulouse, France

1. INTRODUCTION

Auditory scene analysis leads to the formation of perceptual objects, or streams, from the flow of acoustic information reaching the ears [1]. It is what allows us to follow a conversation in a crowded restaurant, in the midst of other conversations, with music in the background and the sound of tinkling glasses. An essential feature of streaming is that it takes time: initially, subjects tend to group all of the acoustic information into one global stream [2,3]. When we first walk into the restaurant, the first impression may be of a loud and undifferentiated noise [4]. Only after some time do streams begin to differentiate, allowing switching of attention to the different sound sources. This is termed the build-up of streaming. The aim of this paper is to re-examine the initial build-up of streaming in a bistable paradigm and to compare it with a similar paradigm in vision.

Streaming has extensively been studied with sequences of tones akin to simple musical melodies [5-7]. For instance, the subject may hear L and H tones, where L and H represent low and high tone frequencies, repeated in an LHL LHL ... sequence [8]. For such a stimulus, the first report is usually of one stream (grouped percept) experienced as a single melody (LHL LHL ...). After a few seconds or tens of seconds, however, perception changes to that of two streams (split percept), L L L and H H, which are heard as two concurrent melodies that can be attended selectively, but not simultaneously. Because the first switch to two streams is probabilistic, when averaging over subjects and/or stimulus presentations, one observes a gradual increase in the probability of a two-stream percept over time [3,9]. If any sudden change is introduced in the sequence, such as a change in location, loudness, in the silent pause between tones or even in the attentional focus of the subject, streaming is reset and build-up starts again [3,7,10,11].
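
The averaging argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: each trial is treated as an all-or-none step from one stream to two streams, the time of that first switch is drawn from an exponential distribution, and the mean of 8 s is an arbitrary illustrative value. None of these choices comes from the paper, and later switches back to one stream are ignored.

```python
import random

def buildup_curve(n_trials=2000, duration_s=30.0, step_s=1.0,
                  mean_first_switch_s=8.0, seed=0):
    """Average many all-or-none trials into a build-up curve.

    Each simulated trial starts grouped (one stream) and switches to
    split (two streams) at a random time. The exponential distribution
    and its mean are illustrative assumptions, not values from the paper.
    Switches back to one stream are deliberately ignored here.
    """
    rng = random.Random(seed)
    first_switch = [rng.expovariate(1.0 / mean_first_switch_s)
                    for _ in range(n_trials)]
    n_steps = int(duration_s / step_s) + 1
    times = [i * step_s for i in range(n_steps)]
    # Proportion of trials that have already switched at each time point.
    p_split = [sum(t_sw <= t for t_sw in first_switch) / n_trials
               for t in times]
    return times, p_split

times, p_split = buildup_curve()
for t, p in zip(times[::5], p_split[::5]):
    print(f"t = {t:4.1f} s  P(two streams) = {p:.2f}")
```

Although every simulated trial contains a single abrupt switch, the averaged proportion rises smoothly, which is the sense in which a gradual build-up curve can arise from probabilistic first switches.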
The build-up of streaming has been used as an essential landmark of streaming. In studies measuring objective correlates of streaming, the onset versus offset of the streaming sequence is usually contrasted, so performance changes can be attributed to build-up and not to acoustic manipulations [12]. Build-up is also used to investigate the effect of attention on streaming, with subjective [9] and objective [13] methods. In animal electrophysiology, build-up provides a useful tool for accessing the temporal dynamics of streaming, as it is a measure that can be averaged across trials and that does not require the co-registration of the perceptual state. Correlates of build-up have been found both in the auditory cortex [14] and in the cochlear nucleus [15] of the mammalian auditory system.

Interestingly, in all of the published data, after the initial build-up, the probability of hearing two streams stabilizes below 100%. As pointed out by Pressnitzer & Hupé [16], this indicates that there are subsequent perceptual alternations back and forth to a one-stream percept after the initial build-up. Perceptual reports for long-lasting sequences confirmed that streaming was indeed a bistable phenomenon [16-18]. The build-up of streaming was described [16] as a combination of a systematic bias towards the one-stream interpretation at stimulus onset (even when the two-stream interpretation was later experienced most of the time) and a longer duration of this first percept compared with subsequent one-stream percepts (we shall call this duration effect the inertia of the first percept); see figs 1a and S3 of Pressnitzer & Hupé [16]. Such dynamics are different from those observed in classic examples of visual bistability like binocular rivalry, ambiguous figures or apparent motion.
In such instances, when both interpretations are equally likely, which percept is first is random (unless the stimulus was presented a short time before; in that case, perceptual stabilization can occur [19,20]). When the stimulus is biased in favour of one interpretation (for example, higher contrast of the stimulus presented to one eye in binocular rivalry), the first percept typically corresponds to the biased interpretation [21]. Also, percept durations are stochastic but, on average, the duration is rather constant over time for each interpretation (average durations being longer for the preferred interpretation), with no inertia of the first percept. Exceptions to this rule occur when observers are not familiar with different interpretations of the stimulus or are not informed that their perception may change; in that case, the first percept may last much longer than subsequent percepts [22-24], as observed especially for children [25]. Changes in the switching rate over long presentation times have also been reported for some [26] but not all [27] ambiguous visual stimuli. In any case, a constant switching rate should only be observed when subjects are luminance adapted (since luminance adaptation over time correspon (...truncated)
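
The contrast drawn in the Introduction between streaming (first percept biased towards grouping and lasting longer) and classic bistability (first percept following the long-run preference, stationary durations) can be sketched with a toy simulation. Everything numerical here is an assumption made for illustration: exponentially distributed percept durations, mean durations of 4 s (grouped) and 8 s (split), and a first percept three times longer in the streaming-like case. The function names are hypothetical and this is not the authors' analysis.

```python
import random

def simulate_trial(rng, duration_s, first_grouped_prob, first_scale,
                   mean_grouped_s=4.0, mean_split_s=8.0):
    """Return a list of (percept, start_s, end_s) phases for one trial.

    first_grouped_prob: probability that the first percept is 'grouped'.
    first_scale: factor lengthening the first percept only (inertia).
    Exponential durations and all means are illustrative assumptions.
    """
    percept = "grouped" if rng.random() < first_grouped_prob else "split"
    t, phases, first = 0.0, [], True
    while t < duration_s:
        mean = mean_grouped_s if percept == "grouped" else mean_split_s
        dur = rng.expovariate(1.0 / mean) * (first_scale if first else 1.0)
        phases.append((percept, t, min(t + dur, duration_s)))
        t += dur
        percept = "split" if percept == "grouped" else "grouped"
        first = False
    return phases

def p_split_at(trials, t):
    """Proportion of trials whose percept at time t is 'split'."""
    hits = 0
    for phases in trials:
        for percept, start, end in phases:
            if start <= t < end:
                hits += percept == "split"
                break
    return hits / len(trials)

rng = random.Random(1)
# Streaming-like: always starts grouped, first percept lasts 3x longer.
streaming_like = [simulate_trial(rng, 60.0, 1.0, 3.0) for _ in range(2000)]
# Classic-like: first percept drawn at the long-run proportion, no inertia.
classic_like = [simulate_trial(rng, 60.0, 1.0 / 3.0, 1.0) for _ in range(2000)]

for t in (0.0, 5.0, 20.0, 50.0):
    print(f"t = {t:4.1f} s  streaming-like = {p_split_at(streaming_like, t):.2f}"
          f"  classic-like = {p_split_at(classic_like, t):.2f}")
```

Under these assumptions the streaming-like curve starts at zero and rises towards a plateau below 100%, because alternations back to the grouped percept continue, whereas the classic-like curve stays roughly flat from stimulus onset.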


This is a preview of a remote PDF: https://rstb.royalsocietypublishing.org/content/367/1591/942.full.pdf
Article home page: http://rstb.royalsocietypublishing.org/content/367/1591/942.abstract

Jean-Michel Hupé, Daniel Pressnitzer. The initial phase of auditory and visual scene analysis, Philosophical Transactions of the Royal Society B: Biological Sciences, 2012, pp. 942-953, 367/1591, DOI: 10.1098/rstb.2011.0368