Exp Brain Res (2007) 181:173–181 DOI 10.1007/s00221-007-0918-z
RESEARCH ARTICLE
Temporal recalibration during asynchronous audiovisual speech perception
Argiro Vatakis · Jordi Navarra · Salvador Soto-Faraco · Charles Spence
Received: 9 May 2006 / Accepted: 20 February 2007 / Published online: 13 March 2007 © Springer-Verlag 2007
Abstract We investigated the consequences of monitoring an asynchronous audiovisual speech stream on the temporal perception of simultaneously presented vowel-consonant-vowel (VCV) audiovisual speech video clips. Participants made temporal order judgments (TOJs) regarding whether the speech-sound or the visual-speech gesture occurred first, for video clips presented at various different stimulus onset asynchronies. Throughout the experiment, half of the participants also monitored a continuous stream of words presented audiovisually, superimposed over the VCV video clips. The continuous (adapting) speech stream could either be presented in synchrony, or else with the auditory stream lagging by 300 ms. A significant shift (13 ms in the direction of the adapting stimulus in the point of subjective simultaneity) was observed in the TOJ task when participants monitored the asynchronous speech stream. This result suggests that the consequences of adapting to asynchronous speech extend beyond the case of simple audiovisual stimuli (as has recently been demonstrated by Navarra et al. in Cogn Brain Res 25:499–507, 2005) and can even affect the perception of more complex speech stimuli.

Keywords Speech · Asynchrony · Temporal order judgment · Adaptation · Temporal recalibration · Audition · Vision

A. Vatakis (&) · J. Navarra · C. Spence
Crossmodal Research Laboratory, Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford, OX1 3UD, UK
e-mail: [email protected]
e-mail: [email protected]

J. Navarra
Grup de Recerca Neurociencia Cognitiva (GRNC), Parc Científic de Barcelona, Universitat de Barcelona, Barcelona, Spain

S. Soto-Faraco
ICREA and Parc Científic de Barcelona, Universitat de Barcelona, Barcelona, Spain
Introduction

People are commonly exposed to stimuli whose arrival time differs (see Spence and Squire 2003), such as, for example, live televised satellite broadcasts that often contain a temporal mismatch between what is seen and heard (ITU-R BT 1359-1 1998; Vatakis and Spence 2006a). Despite the fact that this kind of temporal misalignment between modalities can potentially arise from a variety of different causes, including the physical properties of the event (as in the example provided above) as well as from biophysical constraints on sensory information processing (i.e., attributable to differences in neural transduction latencies; Spence and Squire 2003), our everyday perception of environmental events is generally experienced as both temporally coherent and perceptually unified. That is, our perceptual experience is typically of a synchronous external world (filled with temporally matched auditory and visual signals; contrast this with the experience of certain neurological patients, for whom particular aspects of their multisensory perceptual experience are perceived as asynchronous; see Hamilton et al. 2006). This suggests that the human perceptual system has developed neural mechanisms that are capable of compensating for the typical temporal discrepancies associated with the processing of incoming signals from different sensory modalities (and also for the asynchronies that can arise in the processing of different perceptual attributes
within the same sensory modality; e.g., Moutoussis and Zeki 1997; Nishida and Johnston 2002).

Research on the multisensory perception of synchrony has grown steadily over the last few years (see King 2005; Spence et al. 2001; Spence and Squire 2003). However, as yet, little is known about the specific characteristics and mechanisms (either behavioural or neuronal) involved in the perception of temporal synchrony between inputs in different sensory modalities (though see Bergmann et al. 2006; Bushara et al. 2001; Macaluso et al. 2004; Miller and D'Esposito 2005; Soto-Faraco and Alsius 2007). The ability of the human perceptual system to compensate for the asynchronous arrival of inputs from different sensory modalities has typically been demonstrated in studies using simple stimuli such as light flashes and sound bursts (e.g., Kopinska and Harris 2004; Morein-Zamir et al. 2003; Sugita and Suzuki 2003), and in a few studies using more ecologically-valid complex stimuli such as speech (e.g., Dixon and Spitz 1980; Grant et al. 2004; Navarra et al. 2005; Rihs 1995; Soto-Faraco and Alsius 2007; Vatakis and Spence 2006b, c, 2007). Many such studies have now shown that even though multisensory integration is frequently enhanced when there is approximate temporal synchrony between sensory signals (e.g., see Calvert et al. 2004; de Gelder and Bertelson 2003), precise temporal coincidence is by no means a mandatory requirement for the human perceptual system in order to create a unified perceptual representation of a multisensory event. For example, it has been shown that the intelligibility of audiovisually presented speech remains high even when asynchronies of as much as 250 ms are introduced between visual- and auditory-speech signals (e.g., Dixon and Spitz 1980; Munhall et al. 1996).

Interestingly, this ability of the human perceptual system to deal with signal asynchronies appears to vary as a function of the nature of the stimuli with which the system is confronted. So, for example, when people are presented with simple transitory stimuli (such as brief light flashes and sound bursts), only relatively small asynchronies can be tolerated before the perception of simultaneity breaks down (e.g., Hirsh and Sherrick 1961; Zampini et al. 2003), whereas when they are presented with more complex events (such as speech, object action, or musical stimuli), synchrony can be perceived over a far greater range of temporal asynchronies (e.g., Dixon and Spitz 1980; Grant et al. 2004; Vatakis and Spence 2006b, c). For instance, studies using simple audiovisual stimuli have shown that auditory and visual signals typically need to be separated by approximately 60–70 ms in order for participants to be able to accurately judge which modality has been presented first (e.g., Zampini et al. 2003). In the case of more complex stimuli, such as audiovisual speech, asynchrony can be tolerated for auditory leads of up to 100 ms, or for auditory
lags of up to 200 ms (e.g., Dixon and Spitz 1980; Grant et al. 2004; Soto-Faraco and Alsius 2007; Vatakis and Spence 2006b, c).

The psychophysical research that has been conducted to date has shown that this compensatory ability of the human perceptual system may be accounted for by the existence of a temporal 'window' of multisensory integration that somehow adapts (by shifting in time and/or widening) depending on the background conditions under which the sensory (i.e., auditory or visual) signals are presented (Arnold et al. 2005; Engel and Dougherty 1971; King 2005; King and Palmer 1985; Kopinska and Harris 2004; Sugita and Suzuki 2003; but see also Lewald and Guski 2004). The existence of a 'temporal ventriloquism effect' has also been demonstrated, whereby temporally misaligned auditory and visual events are 'pulled' into approximate temporal register (Fendrich and Corballis 2001; Morein-Zamir et al. 2003; Scheier et al. 1999; Vroomen and de Gelder 2004; Vroomen and Keetels 2006; but see also Kopinska and Harris 2004). Support for this temporal realignment effect has been provided by the results of a recent study by Navarra et al. (2005) in which a temporal order judgment (TOJ) task was used (i.e., participants had to make unspeeded perceptual judgments regarding which of two sensory signals, a sound burst or a light flash, had been presented first). Navarra et al. (2005) demonstrated that adapting people to complex asynchronous audiovisual stimuli (such as speech or music) influenced the temporal perception of simple transitory stimuli (consisting of tones and light flashes). In particular, adaptation to asynchronous speech or music increased the time interval required between the two stimuli in order for participants to correctly perceive the order of the target stimuli. The results of Navarra et al.'s study therefore provide support for the notion that the temporal window of multisensory integration can be 'widened' as a consequence of adaptation to asynchronous inputs.

Further support for the ability of the human perceptual system to compensate for temporal discrepancies has also been provided by studies that have used simple audiovisual stimuli as the adapting stimuli. For example, Vroomen and colleagues have shown temporal realignment (a shift in the point of subjective simultaneity, PSS; i.e., the interval between the auditory and visual components of a stimulus at which participants make 'sound-first' and 'light-first' responses equally often) for pairs of simple auditory and visual stimuli after exposure to trains of desynchronized simple auditory and visual stimuli (with lags of ±200 or ±100 ms), using both a TOJ and a simultaneity judgment (SJ) task (i.e., participants had to decide whether or not the two stimuli were presented simultaneously; Vroomen et al. 2004). Crucially, this PSS shift was in the direction of the adapting stimulus: That is, pre-exposure to visual lags (of 100 or 200 ms) resulted in the target light
having to be presented later in time (relative to the target sound) in order for the two stimuli to be perceived as simultaneous. Similar results have also been reported in another study by Fujisaki et al. (2004), in which exposure aftereffects were measured while presenting simple audiovisual stimuli at slightly larger auditory/visual lags (±235 ms) than those used in Vroomen et al.'s study.

It is important to note that speech represents a stimulus of critical ecological importance for humans, and it has even been argued by some researchers that it may represent a special class of perceptual event for the human brain (e.g., Bernstein et al. 2004; Massaro 2004; Munhall and Vatikiotis-Bateson 2004; Tuomainen et al. 2005). We therefore thought it important to examine whether temporal adaptation would also take place when two speech signals are presented at the same time, since in real-life interactions we are often exposed to more than one speech stream at any given time (think, for example, of the cocktail party situation). To date, research has shown that in the case of exposure to asynchronous speech stimuli, a wider window of audiovisual integration is observed than for nonspeech stimuli (e.g., Dixon and Spitz 1980; Vatakis and Spence 2006b, c). In addition, a number of researchers have also argued that the presentation of speech may induce a stronger assumption of unity (whereby an observer assumes that two different sensory signals refer to the same multisensory event; e.g., Bertelson and de Gelder 2004; Jack and Thurlow 1973; Jackson 1953; Vatakis and Spence 2007) than is found when people are confronted with arbitrary pairings of simple stimuli composed of beeps and flashes. It is therefore possible that simultaneous exposure to two different audiovisual speech streams might result in temporal adaptation over a larger range of intervals, or of a larger magnitude (cf. Ernst and Banks 2002), than the adaptation effects obtained in previous studies (Fujisaki et al. 2004; Navarra et al. 2005; Vroomen et al. 2004).

In the present study, therefore, we investigated for the first time whether or not the modulation of audiovisual temporal perception observed in previous research using simple stimuli could also be demonstrated when participants' temporal perception was tested using speech stimuli. Participants were presented with brief video clips containing audiovisual syllables with a vowel-consonant-vowel (VCV) structure in the context of a continuous stream of speech presented concurrently throughout the experiment. The continuous speech stream serving as the background stimulus was made up of a list of words. In half of the experiment, the background speech video was presented asynchronously (with the auditory signal lagging behind the visual signal by 300 ms), whereas in the remainder of the experiment, the background speech stream was presented synchronously. Half of the participants had to both monitor the background speech stream for targets (i.e.,
male first names interspersed amongst the word list; this was done in order to ensure that the participants attended to the continuous speech video) and to perform the TOJ task on the VCV video clips. The other half of the participants had to perform only the TOJ task (in order to explore whether, by itself, the presence of a background speech stream would have a detrimental effect on TOJ performance).
Methods

Participants

Twenty-four participants (11 males and 13 females) aged between 20 and 33 years (mean age of 25 years) took part in the study, which lasted for approximately 50 min. The participants were given a 5 pound (UK Sterling) gift voucher, or course credit, in return for taking part in the experiment. All of the participants reported normal hearing and normal or corrected-to-normal visual acuity. The participants were naïve as to the purpose of the study and varied in their prior experience with psychophysical testing procedures.

Apparatus and materials

The experiment was conducted in a completely dark, sound-attenuated testing booth with the participants seated facing the experimental display. A series of brief audiovisual VCV video clips were superimposed on a continuous audiovisual (background) speech video using a semi-silvered mirror (see Fig. 1 for a bird's-eye view of the experimental set-up). The semi-silvered mirror was situated at eye-level approximately 40 cm in front of the participants and its placement was adjusted on a participant-by-participant basis so that the video image of the briefly presented VCV video clips was projected directly upon the continuous video image. The monitor showing the continuous speech was hidden from the participants' direct view by means of a curtain and its image was projected onto the surface of the semi-silvered mirror. The source of the continuous audiovisual speech video was a 15-inch (38.1 cm) CRT monitor (60-Hz refresh rate; placed approximately 90 cm from the semi-silvered mirror). The auditory speech signal was presented via two loudspeaker cones (Creative SBS 35, Cambridge Soundworks; 8.2 cm in diameter), one placed 34.3 cm to either side of the centre of the VCV monitor (see below). The 18-min continuous speech video (720 × 576 pixels, Cinepak codec video compression, 16-bit audio sample size, 24-bit video sample size, 30 frames/s) was presented on a black background using Windows Media Player (Version 10; Microsoft Corporation).
Fig. 1 a Schematic bird's-eye view of the experimental set-up used in this study (showing the participant, the semi-silvered mirror, the target stimulus [VCV speech], the adaptor stimulus [continuous background speech], and the loudspeakers presenting the target speech signal, the background speech signal, and the white noise). b Still image of the speech video clip superimposed over the VCV video clip
The video was processed using the Adobe Premiere 6.0 software package and consisted of a close-up view of the mouth of a British male (the area visible was from the upper part of the nose to approximately 3 cm below the chin). The speaker read from a list of 1,000 words (at approximately 60 words/min), of which 100 were target words. Each block contained approximately 20–25 target words, consisting of male first names (e.g., Charles, Peter, Nathan...), which were presented in the list in a pseudorandomized order.

The TOJ stimuli consisted of a set of brief audiovisual VCV video clips. The visual component of these clips was presented from a 15-inch (38.1 cm) CRT monitor (60-Hz refresh rate) placed approximately 10 cm behind the semi-silvered mirror. The auditory component of the VCV video clips was presented by means of two loudspeaker cones (Dan Inc., 11.1 cm in diameter), one placed 27.6 cm to either side of the centre of the monitor. The black-and-white VCV clips were presented on a black background, using the Presentation programming software (Version 9.90, Neurobehavioral Systems Inc., CA). The video clips (680 × 700 pixels, Cinepak codec video compression, 16-bit audio sample size, 24-bit video sample size,
30 frames/s) were also processed using Adobe Premiere 6.0 and consisted of a female British-English speaker (frontal view, including head and shoulders), looking directly at the camera, and uttering the VCVs /aba/ and /aga/. The continuous speech video and the VCV clips were chosen to be very different in composition (e.g., male vs. female face/voice, male lip-area focused vs. female whole-face view) to ensure that participants could easily distinguish between the two clips (thus making this challenging task somewhat easier). At the beginning and end of all of the video clips (i.e., both the continuous speech and VCV video clips), background acoustic noise and a still image (extracted from the first and last 33.33 ms of the video clips) were presented for a variable duration, such that the difference between them was equivalent to the SOAs used (values reported below). This was implemented in order to avoid cuing the participants as to the nature of the auditory delay with which they were being presented. In order to achieve a smooth transition at the start and end of each video clip, a 33.33 ms cross-fade was added between the still image and the video. White noise was also presented continuously at 74 dB (A; as measured from the participants' head position) throughout the experimental blocks from two loudspeakers (Audax, VE100AO; 8.9 cm in diameter), one positioned above and the other below the monitor (approximately 27 cm from the centre of the monitor). The white noise was presented in order to reduce the intelligibility of the auditory speech signals and, therefore, to enhance the participants' reliance on the visual speech signal. The participants responded using a standard computer mouse, indicating 'visual-speech first' responses with their right thumb and 'speech-sound first' responses with their left thumb (or vice versa; the response buttons were counterbalanced across participants).

Design

Nine possible SOAs between the speech-sound and visual-speech VCV signals were used: ±300, ±200, ±133, ±66, and 0 ms. Negative SOAs indicate that the speech-sound was presented first, whereas positive values indicate that the visual-speech was presented first. The various SOAs were presented randomly within each block of trials using the method of constant stimuli (Spence et al. 2001). The continuous background speech stream was either presented in synchrony, or else desynchronized by 300 ms, with the speech-sound lagging behind the visual-speech stream. The 300 ms auditory delay was chosen on the basis of previous research showing that participants are clearly sensitive to delays above 250 ms between the signals (thus, the 300 ms lag should fall outside the temporal window of integration for speech perception; e.g., Dixon and Spitz 1980; Munhall
et al. 1996; Navarra et al. 2005).¹ The synchronous and asynchronous continuous speech conditions were alternated on a block-by-block basis, with the starting block (i.e., synchronous vs. asynchronous) counterbalanced across participants. In order to familiarise the participants with the experimental set-up, they initially completed two 18-trial practice blocks. These consisted of practice on the TOJ task and the male-name monitoring (12 of the participants completed both the name-monitoring and TOJ tasks, while the remaining 12 completed only the TOJ task). The practice blocks were followed by 8 blocks of 72 experimental trials.

Procedure

The participants were informed that they would be shown two audiovisual speech videos (one composed of VCVs and the other composed of a continuous stream of words) with white noise presented simultaneously in the background. They were instructed to fixate the semi-silvered mirror at all times and were informed that they would have to complete a TOJ task on the VCV video clips and (for half of the participants) to monitor the continuous speech video for male names. The 12 participants who completed the name-monitoring task performed at well over 85% correct, indicating that they were indeed able to monitor the continuous speech video as instructed. The participants were informed that on each trial they would have to decide whether the speech-sound or the visual-speech signal had been presented first and that at the end of the block they would have to report the number of male names that they had detected (if completing the dual task; no feedback was given to the participants as to the correct number of names contained in the block). The participants were also informed that the task was not speeded and that they should respond only when confident of their response.

¹ In order to ensure that the participants did indeed experience the 300 ms asynchronous background speech as auditory-lagging (rather than as just audiovisually mismatching), we conducted a control study (N = 9 participants, of whom six were female; mean age 23 years) in which the participants were presented with the background continuous speech video under various visual-auditory delays. Specifically, we presented 1-min video-clip portions extracted from the original video recording and introduced seven different SOAs between the speech-sound and visual-speech signals (±300, ±200, ±100, and 0 ms), following the task (i.e., TOJ) and procedure of the original experiment. The results showed that participants were able to correctly identify visual lags greater than 200 ms and auditory lags above 100 ms [F(6, 48) = 19.19, P < 0.01]. In particular, participants were 97% accurate in detecting a visual lag of 300 ms, as compared to the synchronous condition (48% correct) and to an auditory lag of 100 ms. A visual lag of 200 ms was also detected more easily (90% correct) than the synchronous condition. Finally, and most importantly, participants were very accurate in detecting an auditory lag of 100 ms (70% correct), 200 ms (92%), and 300 ms (93%), as compared to the synchronous condition.
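The following is a minimal illustrative sketch (in Python), not the authors' experiment code, of how one block of TOJ trials could be constructed under the method of constant stimuli described in the Design section above. The assumption that each SOA × VCV combination occurred equally often (four times) within a 72-trial block is ours and is not stated in the text.

```python
import random

# Design parameters taken from the Design section above.
SOAS_MS = [-300, -200, -133, -66, 0, 66, 133, 200, 300]  # negative = speech-sound first
VCVS = ["aba", "aga"]
REPS_PER_BLOCK = 4  # assumed: 9 SOAs x 2 VCVs x 4 repetitions = 72 trials per block


def build_block(seed=None):
    """Return one randomized 72-trial block of (SOA, VCV) tuples."""
    rng = random.Random(seed)
    trials = [(soa, vcv) for soa in SOAS_MS for vcv in VCVS] * REPS_PER_BLOCK
    rng.shuffle(trials)
    return trials


# Eight experimental blocks, alternating synchronous and asynchronous
# background speech; the starting condition was counterbalanced across
# participants in the actual experiment.
backgrounds = ["synchronous", "asynchronous"] * 4
blocks = [(bg, build_block(seed=i)) for i, bg in enumerate(backgrounds)]
print(len(blocks), "blocks of", len(blocks[0][1]), "trials")
```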
Results

The proportions of 'visual-speech first' responses were converted to their equivalent z-scores under the assumption of a cumulative normal distribution (Finney 1964). The data from all 9 SOAs tested were used to calculate the best-fitting straight lines for each participant and condition (single task: synchronous background exposure: /aba/, r² = 0.91, P < 0.01; /aga/, r² = 0.92, P < 0.01; asynchronous background exposure: /aba/, r² = 0.91, P < 0.01; /aga/, r² = 0.93, P < 0.01; dual task: synchronous background exposure: /aba/, r² = 0.91, P < 0.01; /aga/, r² = 0.87, P < 0.01; asynchronous background exposure: /aba/, r² = 0.88, P < 0.01; /aga/, r² = 0.93, P < 0.01; the r² values reflect the correlation between SOA and the proportion of 'visual-speech first' responses, and hence provide an estimate of the goodness of the data fits; see Fig. 2). The slope and intercept of the fitted lines were used to calculate the just noticeable difference (JND = 0.675/slope, since ±0.675 represents the 75 and 25% points on the cumulative normal distribution) and the PSS (−intercept/slope) values (see Coren et al. 2004, for further details). The JND provides a standardized measure of the participants' sensitivity to the temporal order of the speech-sound and visual-speech signals. For all of the analyses reported here, the data derived from the /aba/ and /aga/ video clips were collapsed (since no differences were found between the two VCVs tested; main effect of VCV: [F(1, 22) = 2.73, P = 0.13]; background exposure and VCV interaction: [F < 1, n.s.]; background exposure, VCV, and task interaction: [F(1, 22) = 2.69, P = 0.12]), and Bonferroni-corrected t-tests (where P < 0.05 prior to correction) were used for all post-hoc comparisons.

A mixed analysis of variance (ANOVA) was performed on the JND data with the between-participants factor of task (single or dual task) and the within-participants factor of background exposure (synchronous vs. asynchronous continuous speech). This analysis revealed a significant main effect of task [F(1, 22) = 7.29, P < 0.01], with participants performing the TOJ task significantly more accurately in the single-task condition (M = 73 ms) than when performing the name-counting task at the same time (M = 119 ms; see Fig. 3a). This result was expected given the results of Navarra et al.'s (2005) recent study. In fact, the JNDs observed in the TOJ-only condition were very similar to those obtained recently in a number of other studies (Vatakis and Spence 2006b, c, 2007) in which brief audiovisual speech videos were presented in the absence of any other background speech stimulus. These results might therefore suggest that participants devoted less of their attentional resources (Lavie 2005) to the processing of the continuous speech stream in the single-task (TOJ) condition (cf. Navarra et al. 2005).
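As a concrete illustration of the fitting procedure just described, the following minimal sketch (in Python; not the authors' analysis code) converts a set of response proportions to z-scores, fits a straight line against SOA, and recovers the PSS and JND from the intercept and slope. The response proportions used here are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# SOAs in ms (negative = speech-sound first, positive = visual-speech first).
soas = np.array([-300, -200, -133, -66, 0, 66, 133, 200, 300])

# Illustrative proportions of 'visual-speech first' responses at each SOA
# (proportions of exactly 0 or 1 would need correction before the z-transform).
p_vision_first = np.array([0.05, 0.12, 0.22, 0.38, 0.48, 0.62, 0.78, 0.90, 0.96])

# Convert proportions to z-scores under the cumulative normal assumption.
z = norm.ppf(p_vision_first)

# Best-fitting straight line: z = slope * SOA + intercept.
slope, intercept = np.polyfit(soas, z, 1)

# PSS: the SOA at which 'vision-first' and 'sound-first' responses are equally likely (z = 0).
pss = -intercept / slope

# JND: 0.675/slope, since +/-0.675 corresponds to the 25 and 75% points of the distribution.
jnd = 0.675 / slope

print(f"PSS = {pss:.1f} ms, JND = {jnd:.1f} ms")
```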
Fig. 2 Mean percentage of 'vision-first' responses plotted as a function of the stimulus onset asynchrony (SOA, in ms; from 'sound first' at −300 ms to 'vision first' at +300 ms) for the synchronous and asynchronous background speech conditions tested in the (a) single and (b) dual task conditions
Most importantly, however, there was no main effect of background exposure [F(1, 22) < 1, n.s.], nor any interaction between task and background exposure [F(1, 22) = 1.91, P = 0.20]. These results therefore suggest that monitoring an asynchronous continuous audiovisual speech video did not affect the accuracy of participants' TOJ performance for the VCV video clips.

A similar analysis of the PSS data (see Fig. 3b) revealed no main effect of task or background exposure (both F < 1). However, there was a significant interaction between task and background exposure [F(1, 22) = 4.80, P < 0.05]: In particular, a larger auditory lead was required for the PSS to be reached under the dual-task conditions when the participants were exposed to the synchronous continuous background speech (M = −24 ms) as compared to the asynchronous background speech (M = −11 ms) [P < 0.05]. By contrast, there was no significant difference in the PSS between the synchronous and asynchronous background exposure conditions when the participants did not actively monitor the speech stream for targets and performed the TOJ task in isolation (M = −1 and −8 ms, respectively, P = 0.31). It is important to note that the large individual differences in the PSS (see the SE bars in Fig. 3b) reflect a trend that has also been observed in a number of previous TOJ studies (e.g., Stone et al. 2001). Such differences are not surprising given previous reports regarding the variability of the PSS, both between individuals (e.g., Stone et al. 2001) and between different studies (e.g., see Spence et al. 2001, for a review). The factors that contribute to this variability in the PSS, both between individuals and studies, remain an important issue for future research (cf. Spence and Squire 2003).

Fig. 3 Mean JND (a) and PSS (b) values for the synchronous and asynchronous background speech conditions tested as a function of the task (TOJ and name-counting vs. TOJ only). The error bars represent the standard errors of the means
Discussion

The results of the experiment reported here provide the first empirical evidence that the continuous monitoring of an asynchronous audiovisual speech stream can lead to a significant modulation of the temporal perception of concomitantly presented speech stimuli. Specifically, the PSS derived from the participants' TOJ responses to the briefly presented VCV video clips changed significantly when the (background) continuous audiovisual speech stream was asynchronous (as compared to the PSS observed in the synchronous condition). This revealed itself as a shift in the PSS for the VCV speech video clips toward a decreased auditory lead. A similar modulation of the PSS in the direction of the adapting asynchronous stimulus has also been demonstrated recently in studies that have used simple pairs of audiovisual stimuli to test temporal perception (Fujisaki et al. 2004; Vroomen et al. 2004; note that Fujisaki et al. also demonstrated this using the 'bounce illusion'; see Sekuler et al. 1997).

Although speech is considered by a number of researchers to represent a special type of stimulus (e.g., Bernstein et al. 2004; Massaro 2004; Munhall and Vatikiotis-Bateson 2004; Tuomainen et al. 2005), our results, together with those of several previous studies (Fujisaki et al. 2004; Navarra et al. 2005; Vroomen et al. 2004), support the view that the crossmodal temporal adaptation that takes place when people monitor desynchronized audiovisual stimuli can occur for the perception of both simple and complex events (including speech). Contrary to our initial expectations, however, this temporal adaptation effect appeared to be no larger for the case of speech than for simple stimuli (as tested previously by Navarra et al. 2005). This occurred despite the similarity between the adaptor and target stimuli in the case of speech, and the fact that the accuracy of temporal discrimination
performance for speech stimuli is usually worse than for simple stimuli (e.g., Dixon and Spitz 1980; Vatakis and Spence 2006b, c, 2007). Note, however, that recent studies using more discrete (as opposed to continuous) speech stimuli (as used in the TOJ task reported here; Vatakis and Spence 2006b, c, 2007) have revealed temporal discrimination performance (i.e., JNDs of 70–120 ms) that is actually not that much worse than that found for simple tone and light stimuli (e.g., Zampini et al. 2003, 2005). One interesting question for future research will therefore be to determine how variations in the magnitude of the asynchrony of the adapting stimulus may affect performance (since only asynchronies in the range of 100–300 ms have been used successfully in previous studies; cf. Navarra et al. 2005, on this point).

The adaptation effect on the perception of the VCVs reported in the present study (a 13 ms shift in the PSS in the direction of the adapting stimulus) was demonstrated during the dual-task (male-name monitoring and TOJ) condition. In Navarra et al.'s (2005) recent study, by contrast, the modulation of the temporal perception of simple audiovisual stimuli that resulted from monitoring the continuous stream took the form of a shift in the JND (i.e., higher JNDs being reported in the asynchronous than in the synchronous
condition), but no change in the PSS was observed. It is, at present, unclear why such differences in the pattern of results should have been observed between these two methodologically similar experiments. One possibility relates to the fact that speech represents a special kind of stimulus (e.g., Bernstein et al. 2004; Tuomainen et al. 2005) that may be governed by different processing mechanisms compared to the processing of simple stimulus presentations. However, another factor that may help to account for the differences observed in the two studies might be that different types of stimuli were utilized (i.e., the adapting and test stimuli being similar vs. different): In the present study, the participants were asked to attend to targets consisting of speech stimuli while simultaneously adapting to a concomitantly presented speech stream (similarly, in the studies of Fujisaki et al. 2004 and Vroomen et al. 2004, the test and adapting stimuli were also very similar in kind and resulted in a PSS shift in the same direction as that observed in the present study). In Navarra et al.'s study, by contrast, the stimulus requiring a TOJ response was composed of a pair of simple audiovisual stimuli while the adapting event was a speech stream.

In the present study, we tested two groups of participants who completed a single- (TOJ only) or a dual-task (TOJ
and name monitoring), respectively. In the single-task condition, the participants were able to allocate their attention completely to the target VCVs, treating the background speech as an irrelevant stimulus, and so could devote more processing resources to the TOJ task and thus improve their performance (e.g., Lavie 2005; Lavie and Tsal 1994). The dual-task condition was therefore used in order to ensure that the participants shared their attentional resources between both the TOJ task and the monitoring of the continuous background speech stream (cf. Navarra et al. 2005, for a similar approach). Attentional manipulations have been shown to affect temporal perception, as demonstrated by research on the prior-entry effect (whereby events in the attended modality seem to occur earlier in time than events in the unattended modality; Spence et al. 2001; Vibell et al. 2007). In the present study, however, the participants were asked to attend to both the auditory and visual signals, thus differences due to attending to one sensory modality rather than the other should not have been present. Additionally, if for some reason the participants were biased to attend to one modality more than the other, that should not have affected the temporal adaptation effect observed, given that attending to a particular sensory modality has been shown to leave the temporal adaptation effect unchanged (e.g., Fujisaki et al. 2004).

During the last few years, neuroimaging and electrophysiological studies using both simple audiovisual stimuli and more complex speech stimuli have started to investigate the neural circuitry underpinning temporal perception (e.g., Bergmann et al. 2006; McDonald et al. 2005; Noesselt et al. 2005; Vibell et al. 2007). The findings from these studies demonstrate that temporal processing occurs at early levels of cortical processing that are mediated via subcortical pathways (such as the insula, regions of the prefrontal and parietal cortex, ventral occipital cortex, and posterior thalamus; e.g., Bushara et al. 2001; Calvert et al. 1997; Macaluso et al. 2004). It will be interesting in future research to determine whether these particular pathways are also involved during the temporal adaptation of a signal (simple or complex) to asynchronous stimuli. Additionally, comparisons of synchronous versus asynchronous speech signals have shown greater levels of activation in the superior temporal sulcus (e.g., Macaluso et al. 2004; Calvert 2001; Calvert et al. 1997), thus it will also be interesting to see whether a similar activation is produced in the case of the temporal adaptation of an event. At this point, it seems that future neuroimaging studies on the topic are necessary in order to better understand the neural mechanisms underlying the temporal perception and adaptation of audiovisual events.

Acknowledgments We would like to thank R. Campbell and G.A. Calvert for providing the VCV stimuli used in this study. A.V. was supported by a Newton Abraham Studentship from the Medical
Sciences Division, University of Oxford. J.N. was supported by a Beatriu de Pinós scholarship from Generalitat de Catalunya. J.N., S.S.F., and C.S. were supported by a grant from the McDowell-Pew Foundation, Oxford. Correspondence regarding this article should be addressed to A.V. at the Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford, OX1 3UD, UK. E-mail:
[email protected] or
[email protected].
References

Arnold DH, Johnston A, Nishida S (2005) Timing sight and sound. Vision Res 45:1275–1284
Bergmann D, Spence C, Noesselt T (2006) Neural correlates of synchrony perception using audiovisual speech stimuli. Poster presented at the seventh annual meeting of the International Multisensory Research Forum (IMRF), Dublin, Ireland
Bernstein LE, Auer ET, Moore JK (2004) Audiovisual speech binding: convergence or association? In: Calvert GA, Spence C, Stein BE (eds) The handbook of multisensory processes. MIT Press, Cambridge, pp 203–223
Bertelson P, de Gelder B (2004) The psychology of multimodal perception. In: Spence C, Driver J (eds) Crossmodal space and crossmodal attention. Oxford University Press, Oxford, pp 141–177
Bushara KO, Grafman J, Hallett M (2001) Neural correlates of auditory-visual stimulus onset asynchrony detection. J Neurosci 21:300–304
Calvert GA (2001) Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb Cortex 11:1110–1123
Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SCR, McGuire PK, Woodruff PWR, Iversen SD, David AS (1997) Activation of auditory cortex during silent lipreading. Science 276:593–596
Calvert GA, Spence C, Stein BE (eds) (2004) The handbook of multisensory processes. MIT Press, Cambridge
Coren S, Ward LM, Enns JT (2004) Sensation and perception, 6th edn. Harcourt Brace, Fort Worth
de Gelder B, Bertelson P (2003) Multisensory integration, perception and ecological validity. Trends Cogn Sci 7:460–467
Dixon NF, Spitz L (1980) The detection of auditory visual desynchrony. Perception 9:719–721
Engel GR, Dougherty WG (1971) Visual-auditory distance constancy. Nature 234:308
Ernst MO, Banks MS (2002) Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415:429–433
Fendrich R, Corballis PM (2001) The temporal cross-capture of audition and vision. Percept Psychophys 63:719–725
Finney DJ (1964) Probit analysis: statistical treatment of the sigmoid response curve. Cambridge University Press, London
Fujisaki W, Shimojo S, Kashino M, Nishida S (2004) Recalibration of audiovisual simultaneity. Nat Neurosci 7:773–778
Grant KW, van Wassenhove V, Poeppel D (2004) Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Commun 44:43–53
Hamilton RH, Shenton JT, Branch-Coslett H (2006) An acquired deficit of audiovisual speech processing. Brain Lang 98:66–73
Hirsh IJ, Sherrick CE Jr (1961) Perceived order in different sense modalities. J Exp Psychol 62:423–432
ITU-R BT 1359-1 (1998) Relative timing of sound and vision for broadcasting (Question ITU-R 35/11)
Jack CE, Thurlow WR (1973) Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills 37:967–979
Jackson CV (1953) Visual factors in auditory localization. Q J Exp Psychol 5:52–65
King AJ (2005) Multisensory integration: strategies for synchronization. Curr Biol 15:R339–R341
King AJ, Palmer AR (1985) Integration of visual and auditory information in bimodal neurones in the guinea-pig superior colliculus. Exp Brain Res 60:492–500
Kopinska A, Harris LR (2004) Simultaneity constancy. Perception 33:1049–1060
Lavie N (2005) Distracted and confused? Selective attention under load. Trends Cogn Sci 9:75–82
Lavie N, Tsal Y (1994) Perceptual load as a major determinant of the locus of selection in visual attention. Percept Psychophys 56:183–197
Lewald J, Guski R (2004) Auditory-visual temporal integration as a function of distance: no compensation of sound-transmission time in human perception. Neurosci Lett 357:119–122
Macaluso E, George N, Dolan R, Spence C, Driver J (2004) Spatial and temporal factors during processing of audiovisual speech perception: a PET study. Neuroimage 21:725–732
Massaro DW (2004) From multisensory integration to talking heads and language learning. In: Calvert GA, Spence C, Stein BE (eds) The handbook of multisensory processes. MIT Press, Cambridge, pp 153–176
McDonald JJ, Teder-Sälejärvi WA, Di Russo F, Hillyard SA (2005) Neural basis of auditory-induced shifts in visual time-order perception. Nat Neurosci 8:1197–1202
Miller LM, D'Esposito M (2005) Perceptual fusion and stimulus coincidence in the crossmodal integration of speech. J Neurosci 25:5884–5893
Morein-Zamir S, Soto-Faraco S, Kingstone A (2003) Auditory capture of vision: examining temporal ventriloquism. Cogn Brain Res 17:154–163
Moutoussis K, Zeki S (1997) A direct demonstration of perceptual asynchrony in vision. Proc R Soc Lond B 264:393–399
Munhall K, Vatikiotis-Bateson E (2004) Spatial and temporal constraints on audiovisual speech perception. In: Calvert GA, Spence C, Stein BE (eds) The handbook of multisensory processes. MIT Press, Cambridge, pp 177–188
Munhall KG, Gribble P, Sacco L, Ward M (1996) Temporal constraints on the McGurk effect. Percept Psychophys 58:351–362
Navarra J, Vatakis A, Zampini M, Humphreys W, Soto-Faraco S, Spence C (2005) Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cogn Brain Res 25:499–507
Nishida S, Johnston A (2002) Marker correspondence, not processing latency, determines temporal binding of visual attributes. Curr Biol 12:359–368
Noesselt T, Fendrich R, Bonath B, Tyll S, Heinze HJ (2005) Closer in time when farther in space: spatial factors in audiovisual temporal integration. Cogn Brain Res 25:443–458
Rihs S (1995) The influence of audio on perceived picture quality and subjective audio-visual delay tolerance. In: Hamberg R, de Ridder H (eds) Proceedings of the MOSAIC workshop: advanced methods for the evaluation of television picture quality, 18 and 19 September 1995, Eindhoven, pp 133–137
Scheier C, Nijhawan R, Shimojo S (1999) Sound alters visual temporal resolution. Invest Ophthalmol Vis Sci 40:4169
Sekuler R, Sekuler AB, Lau R (1997) Sound alters visual motion perception. Nature 385:308
Soto-Faraco S, Alsius A (2007) Access to the uni-sensory components in a cross-modal illusion. Neuroreport (in press)
Spence C, Squire SB (2003) Multisensory integration: maintaining the perception of synchrony. Curr Biol 13:R519–R521
Spence C, Shore DI, Klein RM (2001) Multisensory prior entry. J Exp Psychol Gen 130:799–832
Stone JV, Hunkin NM, Porrill J, Wood R, Keeler V, Beanland M, Port M, Porter NR (2001) When is now? Perception of simultaneity. Proc R Soc Lond B Biol Sci 268:31–38
Sugita Y, Suzuki Y (2003) Implicit estimation of sound-arrival time. Nature 421:911
Tuomainen J, Andersen TS, Tiippana K, Sams M (2005) Audio-visual speech perception is special. Cognition 96:B13–B22
Vatakis A, Spence C (2006a) Evaluating the influence of frame rate on the temporal aspects of audiovisual speech perception. Neurosci Lett 405:132–136
Vatakis A, Spence C (2006b) Audiovisual synchrony perception for speech and music using a temporal order judgment task. Neurosci Lett 393:40–44
Vatakis A, Spence C (2006c) Audiovisual synchrony perception for music, speech, and object actions. Brain Res 1111:134–142
Vatakis A, Spence C (2007) Crossmodal binding: evaluating the 'unity assumption' using audiovisual speech stimuli. Percept Psychophys (in press)
Vibell J, Klinge C, Zampini M, Spence C, Nobre AC (2007) Temporal order is coded temporally in the brain: early ERP latency shifts underlying prior entry in a crossmodal temporal order judgment task. J Cogn Neurosci 19:109–120
Vroomen J, de Gelder B (2004) Temporal ventriloquism: sound modulates the flash-lag effect. J Exp Psychol Hum Percept Perform 30:513–518
Vroomen J, Keetels M (2006) The spatial constraint in intersensory pairing: no role in temporal ventriloquism. J Exp Psychol Hum Percept Perform 32:1063–1071
Vroomen J, Keetels M, de Gelder B, Bertelson P (2004) Recalibration of temporal order perception by exposure to audio-visual asynchrony. Cogn Brain Res 22:32–35
Zampini M, Shore DI, Spence C (2003) Audiovisual temporal order judgments. Exp Brain Res 152:198–210
Zampini M, Guest S, Shore DI, Spence C (2005) Audio-visual simultaneity judgments. Percept Psychophys 67:531–544