Temporal mechanisms of multimodal binding
David Burr (1, 2, 3), Ottavia Silva (0, 2), Guido Marco Cicchini (0, 2), Martin S. Banks (2, 6), Maria Concetta Morrone (2, 4, 5)
(0) Faculty of Psychology, Università Vita-Salute San Raffaele, via Olgettina 58, Milan 20132, Italy
(1) School of Psychology, University of Western Australia, Nedlands, Western Australia 6009, Australia
(2) Università Degli Studi di Firenze, via S. Nicolò 89, Florence 50125, Italy
(3) Department of Psychology, Università Degli Studi di Firenze, via S. Nicolò 89, Florence 50125, Italy
(4) Scientific Institute Stella Maris, Calambrone, Pisa 56018, Italy
(5) Department of Physiological Sciences, University of Pisa, Via S. Zeno 36, Pisa 56100, Italy
(6) Department of Psychology, School of Optometry, Vision Science Program, University of California, Berkeley, CA 94720, USA
The simultaneity of signals from different senses, such as vision and audition, is a useful cue for determining whether those signals arose from one environmental source or from more than one. To understand better the sensory mechanisms for assessing simultaneity, we measured the discrimination thresholds for time intervals marked by auditory, visual or auditory-visual stimuli, as a function of the base interval. For all conditions, both unimodal and cross-modal, the thresholds followed a characteristic 'dipper function' in which the lowest thresholds occurred when discriminating against a non-zero interval. The base interval yielding the lowest threshold was roughly equal to the threshold for discriminating asynchronous from synchronous presentations. Those lowest thresholds occurred at approximately 5, 15 and 75 ms for auditory, visual and auditory-visual stimuli, respectively. Thus, the mechanisms mediating performance with cross-modal stimuli are considerably slower than the mechanisms mediating performance within a particular sense. We developed a simple model with temporal filters of different time constants and showed that the model produces discrimination functions similar to the ones we observed in humans. Both for processing within a single sense, and for processing across senses, temporal perception is affected by the properties of temporal filters, the outputs of which are used to estimate time offsets, correlations between signals, and more.
1. INTRODUCTION
One of the most complex tasks for the brain is to combine
the information from the five senses into a single perceptual
experience. Several studies have shown that the integration
of information between senses increases perceptual
precision and accuracy (Ernst & Banks 2002; Gepshtein &
Banks 2003; Alais & Burr 2004). However, it is crucial that
only appropriate information can be integrated because the
integration of information from different environmental
sources would be generally detrimental.
One cue for when to integrate across modalities could
be temporal coincidence: if sensory events (such as a flash
and a sound) occur at the same time, there is a good
probability that they originated from the same source. But
determining simultaneity of external sources is not an easy
matter for the brain because the arrival time of neural
signals depends on many factors, including variable
latencies in sensory transduction and neural
transmission, and, for sound, significant physical delays in
transmission. Any coincidence detector has to be flexible
and adaptable. A good deal of evidence suggests that
humans perceive brief auditory and visual events as
simultaneous over a moderately wide range of
asynchronies. In particular, the system regards auditory-visual
events falling within 50-60 ms of one another as
simultaneous (Hirsh & Sherrick 1961; Stein & Meredith
1993; Zampini et al. 2003; Arrighi et al. 2006). That
window is flexible, which is evidenced by the fact that the
nervous system takes into account the time the sound
takes to travel from its source (Kopinska & Harris 2004;
Alais & Carlile 2005). The simultaneity window is
also adaptable. Systematic training with asynchronous
audio-visual stimuli shifts the time delay at which sounds
and flashes are perceived to be simultaneous (Fujisaki
et al. 2004; see also Vroomen et al. 2004). Indeed,
artificially delayed visual feedback during a tapping task
can distort perceived simultaneity to the extent that when
the delay is removed, subjects believe that their actions are
anticipating their intentions (Stetson et al. 2006).
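
As a rough illustration of the physical delays in sound transmission mentioned above, the following sketch computes how long sound takes to reach an observer from a source at a given distance. The speed of sound (343 m/s, roughly that of dry air at 20 °C) and the function name are assumptions introduced here for illustration; light travel time is treated as negligible.

```python
# Assumption: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0


def acoustic_delay_ms(distance_m):
    """Time in milliseconds for sound to travel distance_m metres."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_PER_S


if __name__ == "__main__":
    # Print the acoustic delay for a few illustrative source distances.
    for distance_m in (1, 5, 10, 20, 40):
        delay = acoustic_delay_ms(distance_m)
        print(f"{distance_m:>3} m -> {delay:.1f} ms")
```

Under this assumption, a source about 20 m away already produces an acoustic delay of about 58 ms, comparable to the width of the simultaneity window cited above.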
While the extent of the window of simultaneity for
vision and audition (and other senses) has been examined
extensively, little is known about the nature of the
mechanisms responsible for these tasks. Fujisaki & Nishida
(2005) examined synchrony/asynchrony discriminations
with periodic stimuli, finding that the discrimination is not
possible for frequencies higher than 4 Hz (confirmed by
Arrighi et al. 2006). They suggested that this limit may
reflect a cross-correlation mechanism, computing
similarities between auditory and visual streams. Their work
further suggested that this cross-correlator does not
operate on raw inputs, but correlates salient features
extracted by the auditory and visual systems.
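
To make the idea of a cross-correlation mechanism concrete, the sketch below correlates two feature streams at a range of lags and reports the lag with the highest correlation. It is only an illustration of cross-correlation in general, not the mechanism proposed by Fujisaki & Nishida (2005): the 1 ms sampling, the impulse-train representation of salient features, the event times and all function names are assumptions made here.

```python
def cross_correlation(audio, visual, max_lag):
    """Return {lag: sum over t of audio[t] * visual[t + lag]}.

    A positive lag means the visual stream trails the auditory stream.
    """
    n = len(audio)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                total += audio[t] * visual[u]
        result[lag] = total
    return result


def best_lag(audio, visual, max_lag):
    """Lag (in samples) at which the two streams are most correlated."""
    cc = cross_correlation(audio, visual, max_lag)
    return max(cc, key=cc.get)


# Hypothetical feature streams sampled at 1 ms: each auditory event is
# followed 40 ms later by a visual event. Event times are aperiodic so
# that the correlation peak is unambiguous.
DURATION_MS = 1000
audio_events = [100, 350, 600]
audio_stream = [0.0] * DURATION_MS
visual_stream = [0.0] * DURATION_MS
for t in audio_events:
    audio_stream[t] = 1.0
    visual_stream[t + 40] = 1.0

estimated_lag_ms = best_lag(audio_stream, visual_stream, max_lag=200)
```

With periodic streams the correlation would peak at several lags one period apart, which is one intuition for why such a scheme could fail at high repetition rates.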
Considering early sensory processing as a cascade of
spatial and temporal filters has led to many useful insights,
particularly into vision and audition. Here, we use this
approach to investigate the filtering properties of
auditory-visual synchrony mechanisms.
We measured interval discrimination thresholds where
the intervals were marked by visual, auditory and
auditory-visual stimuli. Duration thresholds usually
follow Weber's law: the required increment in duration is
proportional to the base duration (Fraisse 1984; Mauk &
Buonomano 2004). While Weber's law is frequently
observed in sensory discrimination, there are in fact
many important deviations from that behaviour. For
example, luminance discrimination departs from Weber's
law at low luminances, where the thresholds become
independent of luminance (Barlow 1957). More
interestingly, many discrimination functions exhibit a dipper
function, including the functions for discrimination of
contrast (Nachmias & Kocher 1970; Pelli 1985), blur
(Watt & Morgan 1983; Burr & Morgan 1997) and motion
(Simpson & Finsten 1995; Gori et al. 2008). Starting with
small base values, increment threshold initially decreases
with increasing base value reaching the lowest value in the
dipper, and then increases monotonically thereafter. For
large base values, threshold rises following Weber's law
(Nachmias & Kocher 1970; Nachmias & Sansbury 1974;
Legge & Foley 1980; Pelli 1985; Foley 1994). Dipper
functions have also been observed in visuo-tactile
discriminations, where pedestal effects occur between
modalities (Arabzadeh et al. 2008; Burr et al. in press).
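
The dipper shape just described can be reproduced numerically. The following is a minimal sketch, assuming a transducer of the general form used by Legge & Foley (1980), with an accelerating response at low inputs and a compressive response at high inputs. The exponents, the semi-saturation constant, the criterion and the function names are illustrative assumptions, not values fitted to any data, and the sketch is not the temporal-filter model developed in this paper.

```python
def transducer(x, p=2.4, q=2.0, z=1.0):
    """Illustrative response function: accelerating for x << z,
    compressive for x >> z. All parameter values are assumptions."""
    return x ** p / (x ** q + z ** q)


def increment_threshold(base, criterion=0.1):
    """Smallest increment whose response exceeds that of the base by
    `criterion`, found by bisection (the transducer is monotonic)."""
    target = transducer(base) + criterion
    lo, hi = 0.0, 1.0
    # Expand the bracket until the target response is exceeded.
    while transducer(base + hi) < target:
        hi *= 2.0
    # Halve the bracket repeatedly to converge on the threshold.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if transducer(base + mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Thresholds fall from base 0 to a minimum, then rise again.
    for base in (0.0, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0):
        print(f"base {base:5.2f} -> threshold {increment_threshold(base):.3f}")
```

With these assumed parameters the threshold at a base of 1 is lower than at a base of 0, and it rises again at large base values, although more slowly than strict Weber proportionality would require.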
The generally accepted explanation for the dipper is that it
results from a transducer function with an early,
threshold-like accelerating nonlinearity (Legge & Foley
1980). Spatio-temporal uncertainty has also been
implicated (Pelli 1985), but not strongly supported by the
evid (...truncated)