Exp Brain Res (2005) 167: 66–75 DOI 10.1007/s00221-005-0008-z
RESEARCH ARTICLE
Maurizio Gentilucci · Luigi Cattaneo
Automatic audiovisual integration in speech perception
Received: 13 September 2004 / Accepted: 30 March 2005 / Published online: 21 July 2005
© Springer-Verlag 2005
Abstract Two experiments aimed to determine whether features of both the visual and acoustical inputs are always merged into the perceived representation of speech, and whether this audiovisual integration is based on cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of a phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment participants read the same printed strings of letters; this condition aimed to analyze the voice pattern and the lip kinematics while controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the articulatory mouth gestures were congruent with the emission of the string of phonemes, the voice spectrum and the lip kinematics varied according to the pronounced strings of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between the visual and acoustical stimuli. The acoustical analysis of the participants' spoken responses showed three distinct patterns: fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, repetition of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, analysis of the latter two response types showed that formant 2 of the participants' voice spectra always differed from the value recorded in the congruent audiovisual presentation: it approached the value of formant 2 of the string of phonemes presented in the other, apparently ignored, modality. The lip kinematics of participants repeating the acoustically presented string of phonemes were influenced by observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant. The data are discussed in favor of the hypothesis that features of both the visual and acoustical inputs always contribute to the representation of a string of phonemes, and that cross-modal integration occurs by extracting the mouth articulation features peculiar to the pronunciation of that string of phonemes.

Keywords McGurk effect · Audiovisual integration · Voice spectrum analysis · Lip kinematics · Imitation

M. Gentilucci (✉) · L. Cattaneo
Dipartimento di Neuroscienze, Università di Parma, Via Volturno 39, 43100 Parma, Italy
E-mail: [email protected]
Tel.: +39-0521-903899
Fax: +39-0521-903900
Introduction

Most linguistic interactions occur within a face-to-face context, in which both acoustic (speech) and visual information (mouth movements) are involved in message comprehension. Although humans are able to understand words without any visual input, audiovisual perception has been shown to improve language comprehension (Sumby and Pollack 1954), even when the acoustic information is perfectly clear (Reisberg et al. 1987). In support of this behavioral observation, brain-imaging studies have shown that, when the speaker is also seen by an interlocutor, the activation of the acoustical A1/A2 and visual V5/MT cortical areas is greater than when the information is presented in either the acoustical or the visual modality alone (Calvert et al. 2000). In addition, speech-reading activates acoustical areas even in the absence of any acoustical input (Calvert et al. 1997). Two hypotheses, though not mutually exclusive, can explain the integration of information on verbal messages provided by the two sensory (acoustical and visual) modalities. The first hypothesis is based on specific cross-modal binding functions, and it postulates supra-modal integration (Calvert et al. 1999, 2000; Calvert and Campbell 2003). This integration could be based on similar patterns of time-varying features common to both the acoustical and the visual input. More specifically, the timing of changes in vocalization is visible as well as audible in terms of their time-varying
patterns (Munhall and Vatikiotis-Bateson 1998). For example, variations in speech sound amplitude can be accompanied by visible indicators of changes in the movement pattern of the mouth articulators. Another cross-modal function is based on features of still (configurational) as well as moving face images (Calvert and Campbell 2003). Anatomically, cortical regions along the superior temporal sulcus (STS) may be involved in specific cross-modal functions. STS is activated by observation of biological motion, including mouth movements during speech (Bonda et al. 1996; Buccino et al. 2004; Calvert et al. 2000; Campbell et al. 2001), and also shows consistent and extensive activation when hearing speech (Calvert et al. 1999, 2000). Calvert et al. (2000) observed that for appropriately synchronized audiovisual speech, the profile of STS activation correlated with enhanced neuronal activity in sensory-specific visual (V5/MT) and auditory (A1/A2) cortices. This cross-modal gain may be mediated by back projections from STS to sensory cortices (Calvert et al. 1999). The second hypothesis is based on the possibility that presentation of either a human voice pronouncing a string of phones or a face mimicking pronunciation of a string of phonemes activates automatic imitation of the two stimuli. It is possible that the information provided by the two different modalities is integrated by superimposing an imitation mouth program automatically elicited by the visual stimulus on another automatically elicited by the acoustical stimulus, in accordance with the motor theory of speech perception (Liberman and Mattingly 1985). In this respect, cortical regions within Broca's area may be involved in audiovisual integration by imitation, since this area is activated by observation/imitation of moving and speaking faces (Buccino et al. 2004; Calvert and Campbell 2003; Campbell et al. 2001; Carr et al. 2003; Leslie et al. 2004; for a review see Bookheimer 2002).
The activity of Broca's area is significantly correlated with the increased excitability of the motor system underlying speech production when perceiving auditory speech (Watkins and Paus 2004). This area is also involved in observation/imitation of hand movements (Iacoboni et al. 1999; Buccino et al. 2001, 2004; Heiser et al. 2003), in accordance with the hypothesis that it represents one of the putative sites of the human ''mirror system'', which is thought to have evolved from the monkey premotor cortex and to have acquired new cognitive functions such as speech processing (Rizzolatti and Arbib 1998). The McGurk effect (McGurk and MacDonald 1976) represents a particular kind of audiovisual integration in which the acoustical information on a string of phonemes contrasts with the visually presented mouth articulation gesture. When people process two different syllables, one presented in the visual modality and the other in the acoustical modality, they tend either to fuse or to combine the two elements. For example, when the voice of the talker pronounces the syllable /ba/ and his/her lips mimic the syllable /ga/, the observer tends to fuse the two syllables and to perceive the syllable /da/.
Conversely, when the talker's voice pronounces /ga/ and his/her lips mimic /ba/, the observer tends to combine the two elements and to perceive either /bga/ or /gba/. The finding that combination rather than fusion between the two strings of phonemes occurs when the visual information on the syllable is unambiguous (/ba/ versus /ga/) suggests that merging the visual information with the acoustical information, as observed in the fusion effect, occurs only in particular circumstances, i.e. when the visual stimulus allows multiple interpretations of the string of phonemes (note that the external mouth pattern of /ga/ is not very different from that of /da/). The tendency to fuse auditory and visual speech seems to be somewhat specific to the language used. Indeed, although it has been well documented for English speakers (for a review see Chen and Massaro 2004; Summerfield 1992; Massaro 1998), speakers of some Asian languages, such as Japanese and Chinese, are less susceptible to the McGurk effect (Chen and Massaro 2004; Sekiyama and Tohkura 1993). These data pose the following problem: does the process of audiovisual matching code representations lacking features of either the visual or the acoustical stimulus or, in contrast, does it code representations always containing features of both sources of information? In the present study we tested the two hypotheses by taking into account, in the McGurk paradigm, the responses in which the participants repeated either the visually or the acoustically presented string of phonemes. Using kinematic and voice-spectrum analysis techniques, we verified whether the two presentations always influenced the responses. In particular, we verified whether the voice spectra of the repeated string of phonemes changed as compared to the voice spectra of the same string of phonemes repeated in the condition of congruent visual and acoustical stimuli.
Moreover, we verified whether they approached the voice spectra of the string of phonemes presented in the other sensory modality. A second problem is whether audiovisual integration is based on superimposition of two automatic imitation motor programs or on cross-modal elaboration. The imitation hypothesis postulates that speech perception occurs by automatically integrating the mouth articulation pattern elicited by the acoustical stimulus with that elicited by the visual stimulus (Liberman and Mattingly 1985). The cross-modal hypothesis postulates that perception occurs by supra-modal integration of time-varying characteristics of speech extracted from both the visual and the acoustical stimulus (Calvert et al. 1999, 2000; Calvert and Campbell 2003). To test the two hypotheses we analyzed the responses in which the acoustically presented string of phonemes was repeated and verified whether its external mouth pattern was influenced by the visual stimulus, i.e. by the external mouth pattern mimicked by the actor. If two automatic imitation motor programs are superimposed, an effect of the visual stimulus on the observer's external mouth pattern should always be seen. This should occur even when the string of phonemes mimicked by the actor requires peculiar modification of the internal mouth, and the external mouth movements are consequent and only indirectly related to pronunciation of the string of phonemes (in the present study, /aga/). On the other hand, if time-varying features specific to the string of phonemes are extracted from the visual stimulus (cross-modal integration hypothesis), we should observe an effect only of the visually presented string of phonemes with labial consonants, i.e. with external mouth modifications peculiar to the pronunciation of that string of phonemes (in the present study, /aba/).
Methods

Sixty-five right-handed (according to the Edinburgh inventory, Oldfield 1971) Italian speakers (51 females and 14 males, aged 22–27 years) participated in the present study. The study, to which the participants gave written informed consent, was approved by the Ethics Committee of the Medical Faculty of the University of Parma. All participants were naïve as to the McGurk paradigm and, consequently, to the purpose of the study. They were divided into three groups of eight, 31 and 26 individuals. Each group took part in one of three experiments (see below). Participants sat in front of a table in a soundproof room, placing their forearms on the table plane. They were required not to move their head and trunk throughout the experimental session. A PC screen placed on the table plane was 40 cm from the participant's chest. Two loudspeakers were at the two sides of the display. The stimuli presented on the PC screen were the following three strings of letters or phonemes: ABA (/aba/), ADA (/ada/) and AGA (/aga/). Note that in Italian the vowel A is always pronounced /a/. In experiment 1 (string-of-letters reading) they were printed in white on the centre of the black PC display. Each letter was 3.9 cm high and 2.5 cm wide. It was presented 1,360 ms from the beginning of the trial and lasted 1,040 ms. In experiments 2 and 3 (audiovisual presentation of strings of phonemes) an actor (face: 6.9 × 10.4 cm) pronounced the three strings of phonemes. His half-body was presented 2,360 ms after the beginning of the trial and the presentation lasted 2,000 ms. In all the experiments a ready signal, i.e. a red circle and a beep (duration 360 ms), was presented at the beginning of the trial. The following three experiments were carried out: Experiment 1. Eight subjects participated in the experiment. The participants were presented with the printed strings of letters.
The task was to read silently and then to repeat aloud the string of letters (string-of-letters reading paradigm). Experiment 2. Thirty-one subjects participated in the experiment. The actor pronounced one of the three strings of phonemes. In the congruent audiovisual presentation, his visible mouth (visual stimulus) mimicked and his voice (acoustic stimulus) pronounced the same string of phonemes. In the incongruent audiovisual
presentation, the visible actor's mouth mimicked pronunciation of AGA, whereas his voice concurrently pronounced ABA (McGurk paradigm). Experiment 3. Twenty-six subjects participated in the experiment. The experiment differed from experiment 2 only in the incongruent audiovisual presentation, in which the visible actor's mouth mimicked pronunciation of ABA, whereas his voice simultaneously pronounced AGA (inverse McGurk paradigm). In all the experiments the participants were required to repeat aloud, at the end of the audio and/or visual stimulus presentation, the perceived string, using a neutral intonation and a voice volume as during normal conversation. They were not informed that in some trials the visual and acoustical stimuli were incongruent. No constraint on response time was given. At the end of the experimental session, all participants filled in a questionnaire in which they indicated (1) whether during the experimental session the sound of each string of phonemes (i.e. ABA, ADA, and AGA) varied and (2) whether they noticed that in some trials there was incongruence between the acoustical and the visual stimulus. Each string of letters or phonemes was randomly presented five times. Consequently, experiment 1 consisted of 15 trials. On the other hand, since experiments 2 and 3 included both congruent and incongruent conditions, they consisted of 20 trials each. Participants' lip movements were recorded using the 3D optoelectronic ELITE system (B.T.S. Milan, Italy). It consists of two TV cameras detecting infrared reflecting markers at a sampling rate of 50 Hz. Movement reconstruction in 3D coordinates and computation of the kinematic parameters are described in a previous study (Gentilucci et al. 1992). Two markers were placed on the centre of the participant's upper and lower lip.
The participant's two aperture–closure movements of the lips during pronunciation of the string of phonemes were measured by analyzing the time course of the distance between the upper and lower lip. The participant's maximal lip aperture and final lip closure (i.e. minimal distance between upper and lower lip) at the end of the first lip closing, and the peak velocity of lip opening and maximal lip aperture during the second lip opening, were measured. These parameters characterize the kinematics of the lips during consonant pronunciation. The procedures used to calculate the beginning and the end of lip movements were identical to those previously described (Gentilucci et al. 2004). The time course of the actor's lip movements was recorded in 2D space at a sampling rate of 30 Hz. Lip displacements were measured using the PREMIERE 6.0 software (ADOBE, http://www.adobe.com). We did not use the ELITE system to record the actor's lip movements in order to avoid the possibility that, during the visual presentation, the markers on the lips prevented the participants from recognizing the string of phonemes. Figure 1 shows the time course of the distance between the actor's upper and lower lip (squares) and of the distance between the right and left
corner of the actor's lips (diamonds). Note that the final lip closure decreased moving from AGA to ABA (squares in Fig. 1), whereas little variation in the distance between the left and right corners of the lips was observed among the three strings of letters (diamonds in Fig. 1). The voice emitted by the participants and the actor was recorded by means of a microphone (Studio Electret Microphone, 20–20,000 Hz, 500 Ω, 5 mV/Pa at 1 kHz) placed on a table support. The centre of the support was 8.0 cm from the participant's chest, to the right of the participant and 8.0 cm from the participant's sagittal axis. The microphone was connected to a PC by a sound card (16 PCI Sound
Fig. 1 Time course of distance between upper and lower lip (squares) and left and right lip corners (diamonds) of the actor pronouncing ABA, ADA, and AGA strings of phonemes
Blaster, CREATIVE Technology Ltd., Singapore). The spectrogram of each string of phonemes was computed using the PRAAT software (University of Amsterdam, the Netherlands). The time courses of formants (F) 1 and 2 of the participants and the actor were analyzed. The time course of the string-of-phonemes pronunciation was divided into three parts. The first part (T1-phase) included pronunciation of the first /a/ vowel and the formant transition before mouth occlusion. The latter approximately corresponded to the first mouth closing movement. The second part (T0-phase) included mouth occlusion. Only the mouth occlusion of the ABA pronunciation corresponded to the final lip closure; the mouth occlusion of the other strings corresponded to the final closure of internal mouth parts not recorded by kinematic techniques. The third part (T2-phase) included the formant transition during release of mouth occlusion, approximately corresponding to the second mouth opening movement, and pronunciation of the second /a/ vowel. The durations of the participants' T1-, T0- and T2-phases were measured. Mean values of F1 and F2 of the participants and of the actor during the T1-phase and the T2-phase were calculated. Finally, the participants' and actor's mean voice intensity during pronunciation of the string of phonemes was measured. Mean F1 of the actor's voice was 820, 721, and 746 Hz and mean F2 was 1,330, 1,393, and 1,429 Hz when pronouncing ABA, ADA and AGA, respectively. Intensity was on average 54.9 dB. In experiment 1, statistical analyses on the lip kinematics and the voice spectra of the pronunciation of ABA, ADA, and AGA were carried out in order to discover differences in lip kinematics and voice spectra among the three strings of phonemes. In experiments 2 and 3, the statistical analyses compared lip kinematics and voice spectra of the strings of phonemes pronounced in the congruent audiovisual presentation with those in the incongruent audiovisual presentation.
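The three-phase segmentation described above can be sketched in code. The following minimal Python sketch is illustrative only (it is not the authors' analysis code): it splits a frame-by-frame track into T1, T0 and T2, identifying the occlusion here by a low-intensity stretch as a stand-in for the full acoustic criteria, and averages F2 per phase. The function names, threshold, and synthetic frame values are all hypothetical.

```python
def segment_phases(intensity, threshold):
    """Return index ranges (t1, t0, t2): the contiguous low-intensity
    stretch is taken as the occlusion (T0) phase; the frames before it
    form T1 and the frames after it form T2."""
    low = [i for i, v in enumerate(intensity) if v < threshold]
    start, end = low[0], low[-1] + 1
    return range(0, start), range(start, end), range(end, len(intensity))

def mean_formant(track, phase):
    """Mean formant value (Hz) over the frames of one phase."""
    vals = [track[i] for i in phase]
    return sum(vals) / len(vals)

# Synthetic /aCa/ utterance: 10 frames, occlusion in frames 4-6.
intensity = [60, 61, 59, 58, 20, 18, 19, 57, 60, 61]            # dB
f2 = [1300, 1310, 1350, 1400, 0, 0, 0, 1420, 1360, 1330]        # Hz

t1, t0, t2 = segment_phases(intensity, threshold=40)
print(len(t1), len(t0), len(t2))                    # → 4 3 3 (frames)
print(mean_formant(f2, t1), mean_formant(f2, t2))   # → 1340.0 1370.0
```

Phase durations in milliseconds would follow by dividing the frame counts by the analysis frame rate.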
The aim was to verify whether the string of phonemes in the incongruent condition differed from the corresponding string of phonemes in the congruent condition and, if so, the direction of the change. The experimental design included string of letters or phonemes (ABA, ADA, AGA, and, in experiments 2 and 3, the string of phonemes pronounced in the incongruent audiovisual presentation) as a within-subjects factor for maximal lip aperture, lip closure, peak velocity of lip opening, and voice intensity. For F1 and F2 it included string of letters or phonemes and phase (T1 and T2) as factors. Finally, for the formant time course it included string of letters or phonemes and phase (T1, T0, and T2). The latter analysis aimed to detect differences in the duration of vowel (including formant transition) and consonant pronunciation between strings of phonemes pronounced in the congruent and
Fig. 2 Examples of spectrograms during pronunciation of the ABA, ADA, and AGA strings of phonemes in experiments 1, 2, and 3. T1-phase: pronunciation of the first /a/ vowel and the formant transition before mouth occlusion. T0-phase: mouth occlusion. T2-phase: formant transition during release of mouth occlusion, and pronunciation of the second /a/ vowel. F1: formant 1; F2: formant 2
incongruent audiovisual presentations. Separate ANOVAs were carried out on mean values of the participants’ parameters. The Newman-Keuls post-hoc test was used (significance level set at P<0.05).
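As an illustration of the lip parameters listed above, the sketch below (an assumed reconstruction, not the authors' code) derives the maximal aperture of the first opening, the final lip closure, and the peak lip-opening velocity from a synthetic inter-lip distance trace sampled at the ELITE system's 50 Hz. The trace values and function name are hypothetical, and the filtering applied to real recordings is omitted.

```python
FS = 50.0  # ELITE sampling rate, Hz

def lip_parameters(dist):
    """dist: inter-lip distance (mm) over one /aCa/ utterance with an
    open-close-open-close profile.  Returns (maximal aperture of the
    first opening, final lip closure of the first closing, peak
    lip-opening velocity of the second opening, mm/s)."""
    # Occlusion = global minimum of the distance trace (first closing).
    i_min = dist.index(min(dist))
    max_aperture_1 = max(dist[:i_min])
    final_closure = dist[i_min]
    # Peak opening velocity after occlusion: largest forward
    # difference, scaled by the sampling rate.
    vel = [(dist[i + 1] - dist[i]) * FS for i in range(i_min, len(dist) - 1)]
    return max_aperture_1, final_closure, max(vel)

trace = [10, 18, 25, 18, 8, 2, 6, 16, 24, 18, 11]   # mm, 50 Hz samples
print(lip_parameters(trace))                         # → (25, 2, 500.0)
```

A real pipeline would first low-pass filter the marker distance before differentiating, since raw optoelectronic traces are noisy.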
Results

Experiment 1: string-of-letters reading paradigm

Figure 2 shows examples of voice spectrograms during pronunciation of the three strings of phonemes. The transition of F1 showed a decreasing phase for the three strings of letters during T1 and an increasing phase during T2, more evident for AGA. In contrast, during T1 the transition of F2 showed a decreasing phase for ABA and an increasing phase for ADA, more evident for AGA. The reverse occurred during T2 (Leoni and Maturi 2002). ANOVA showed a gradual increase in mean F2 moving from ABA to AGA (F(2,14)=46.7, P<0.0001, Fig. 3). The duration of T1 (158.7 ms) was longer than the durations of T0 (123.5 ms) and T2 (96.0 ms; F(2,14)=10.9, P<0.001). Summing up, the voice spectra of the vowels including formant
Fig. 3 Mean values of F2 in experiments 1, 2, and 3. Circles and squares refer to T1 and T2, respectively. Bar markers: SE
transition varied during pronunciation of ABA, ADA and AGA. Lip kinematics during consonant pronunciation were also affected by the pronounced string of phonemes (Fig. 4). At the end of T1, final lip closure decreased moving from AGA to ABA (F(2,12)=13.3, P<0.001) and at the beginning of T2 the peak velocity of lip opening increased when pronouncing ABA as compared to ADA and AGA (F(2,12)=9.8, P<0.005).

Experiment 2: McGurk paradigm

At the end of the experiment all participants reported that they never noticed any incongruence between the visual and acoustical stimuli. In addition, they reported: ''In some trials the same string of phonemes was differently pronounced''. An acoustical analysis of the participants' spoken responses in the incongruent audiovisual presentation showed that, in most of the trials, 21 out of 31 participants repeated ABA, eight participants repeated ADA (the McGurk fusion effect), whereas two participants repeated AGA. Figure 2 shows examples of spectrograms in the condition of incongruence between the visual (AGA) and the acoustical (ABA) presentation.
The participants repeating either ABA or ADA showed a formant pattern similar to those in the congruent audiovisual presentation and in experiment 1 (Fig. 2). We performed statistical analyses on voice spectra and lip kinematics of the 21 participants who repeated ABA and of the eight participants who repeated ADA. We compared the voice spectra recorded in the incongruent audiovisual presentation with those recorded in the congruent audiovisual presentation. The analyses showed that F2 significantly increased moving from ABA to AGA (F(3,60)=110.4, P<0.000001, F(3,21)=26.1, P<0.0001, Fig. 3). F2 of the two ABA pronunciations significantly differed from each other, whereas F2 of the two ADA pronunciations did not (Fig. 3). F2 of ABA in the incongruent presentation (‘ABA’ in Fig. 3) increased approaching the F2 value of AGA. In other words, F2 of ABA repetition in the incongruent audiovisual presentation was influenced by the visually presented AGA. F1 decreased moving from ABA to AGA (ABA repetition: F(3,60)=408.6, P<0.00001, 801.1 vs. 794.7 vs. 776.4 Hz; ADA repetition: F(3,21)=4.7, P<0.01, 833.0 vs. 820.1 vs. 815.5 Hz). F1 of the two ABA (801.1 vs. 798.7 Hz) and ADA (820.1 vs. 818.2 Hz) pronunciations did not differ from each other.
Fig. 4 Parameters of the lip kinematics in experiments 1, 2, and 3. 'ABA', 'AGA' and 'ADA' refer to the strings of phonemes repeated in the incongruent audiovisual presentations. Bar markers: SE
The duration of T1 (251.5, 236.8 ms) was longer than the durations of T0 (116.6, 108.3 ms) and T2 (151.3, 117.0 ms) (F(2,40)=122.1, P<0.000001; F(2,14)=44.7, P<0.00001). No significant difference was found between the two ABA or the two ADA durations. These results indicate that the difference observed between F2 of the two ABA pronunciations did not depend on variation in the durations of T1 and T2. Indeed, a decrease/increase in the duration of T1 and T2 due to shortening/lengthening of pure vowel pronunciation could induce a decrease/increase in F2, even if the single F2 values of the formant transition and the pure vowel did not vary. Lip closure significantly increased (F(3,60)=50.2, P<0.00001; F(3,21)=36.0, P<0.001, Fig. 4) and peak velocity of lip opening decreased (F(3,60)=157.1, P<0.00001; F(3,21)=69.2, P<0.00001, Fig. 4) moving from ABA to AGA. No significant difference was found between the lip kinematics of the two ABA pronunciations or between the lip kinematics of the two ADA pronunciations.

Experiment 3: inverse McGurk paradigm

The report of the participants at the end of the experiment was similar to that in experiment 2. An acoustical
analysis of the participants' spoken responses in the incongruent audiovisual presentation showed that, in most of the trials, 14 out of 26 participants repeated AGA, eight participants repeated ABA, three participants repeated ACA (/aka/), and, finally, one participant repeated ABGA (/abga/). We performed statistical analyses on the 14 participants who repeated AGA and on the eight participants who repeated ABA. In both the AGA and ABA repetitions, F2 significantly increased moving from ABA to AGA (F(3,39)=34.0, P<0.00001; F(3,21)=17.3, P<0.0001, Fig. 3). Most importantly, F2 of the two AGA and of the two ABA pronunciations significantly differed from each other (Fig. 3). F2 of AGA pronounced in the incongruent audiovisual presentation ('AGA' in Fig. 3) significantly decreased, whereas F2 of ABA pronounced in the incongruent audiovisual presentation significantly increased ('ABA' in Fig. 3), as compared to F2 of the same strings of phonemes pronounced in the congruent audiovisual presentation. Summing up, in the inverse McGurk paradigm, the acoustically presented AGA and the visually presented ABA influenced the voice spectra of the ABA and AGA repetitions, respectively. In the case of ABA repetition, F1 of ABA (837.1 Hz) was higher than F1 of both ADA (819.3 Hz) and AGA (814.8 Hz; F(3,21)=3.7, P<0.05). However, no effect of the
incongruent audiovisual presentation was observed on F1 of ABA (837.1 vs. 834.1 Hz). The duration of T1 (202.4, 207.6 ms) was longer than the durations of T0 (108.0, 107.3 ms) and T2 (112.6, 120.7 ms) (F(2,26)=32.5, P<0.00001; F(2,14)=15.8, P<0.0005). No significant difference was found between the durations of the two AGA or the two ABA pronunciations. Lip closure significantly increased (F(3,39)=31.9, P<0.00001; F(3,21)=21.6, P<0.00001, Fig. 4) and peak velocity of lip opening decreased (F(3,39)=40.0, P<0.00001; F(3,21)=23.9, P<0.00001, Fig. 4) moving from ABA to AGA. Post-hoc comparisons showed that final lip closure significantly decreased and peak velocity of lip opening significantly increased when AGA was pronounced in the incongruent audiovisual presentation as compared to AGA in the congruent presentation (Fig. 4). In contrast, no significant difference was found between the two ABA repetitions. Summing up, only observation of the lip kinematics of labials (the ABA visual presentation) influenced the lip kinematics of the AGA repetition.
Discussion

The participants in the present study relied more on the acoustical than on the visual information (approximately 70% of responses) when repeating aloud a string of phonemes presented acoustically by an actor whose mouth, in contrast, mimicked pronunciation of another string of phonemes. This acoustical dominance was also more frequent than the McGurk fusion effect. The McGurk paradigm had never been systematically tested on Italian speakers. It is well known that the Italian phonemic repertoire and the phonetic realization of syllables are simpler than those of other languages such as, for example, English. Consequently, phonemic acoustical identification is simple enough not to require strong reliance on additional visual cues (speech-reading). This hypothesis is in accordance with the results of previous studies in which the McGurk effect was compared among different languages (for a review see Chen and Massaro 2004). Chen and Massaro (2004; see also Massaro 1998; Sekiyama et al. 2003) showed that, when integrating the acoustical with the visual source of information on speech, each source is more influential if less complex. This behavioral principle was formalized by Massaro (1998) as the fuzzy logical model of perception (FLMP). Using this model, this author proposed that the type of processing of the acoustical and visual sources of information is universal across languages, even if the effects of the process can differ. Although listening to the responses showed that the participants frequently relied on either the acoustic or the visual stimulus alone, more sophisticated analyses, such as the voice spectrum and kinematic analyses, showed that in these responses they were also influenced by the stimulus presented in the other modality. F2 in
the voice spectrum of ABA pronounced in the incongruent audiovisual presentation significantly increased as compared to ABA pronounced in the congruent presentation. The control experiment 1 and the condition of congruent audiovisual presentation in experiments 2 and 3 showed that F2 of AGA is higher than F2 of ABA. Consequently, in the incongruent audiovisual presentation the participants repeating ABA were likely also affected by the AGA presentation. This was found whether AGA was visually or acoustically presented. Conversely, F2 of AGA pronounced in the incongruent audiovisual presentation decreased as compared to AGA pronounced in the congruent audiovisual presentation, approaching the F2 value of ABA. No mutual influence between the two modalities of presentation was observed for F1. This finding probably depends on the similar pattern of the AGA and ABA formant transitions and, consequently, on the smaller variation in F1 between the two strings of phonemes. In contrast, the pattern of the formant transition differed between the F2 of AGA and ABA, and greater variation in F2 was observed between the two strings of phonemes. Thus, it is plausible to suppose that the mutual influence between AGA and ABA consequent to the contrasting audiovisual presentations was more detectable for F2 than for F1. The finding that, for all the strings of phonemes, the variations in F2 were in the direction of the F2 of the string of phonemes presented in the other sensory modality suggests a different perception of the string of phonemes. This was further supported by the participants' final report. However, the mutual influence between the two strings of phonemes modified the values of F2 but did not reach the threshold needed to change the pattern of the formant transition, as occurred in the McGurk fusion effect. In other words, the participants perceived a different sound of the same string of phonemes, rather than perceiving a different string of phonemes.
Note that at the end of the experimental session, all the participants reported that they were unaware that the actor's mouth had mimicked pronunciation of a string of phonemes different from that acoustically pronounced. Taken together, these data support the hypothesis that the representation resulting from automatically matching the acoustical stimulus with the visual stimulus always contains features of both sources of information. However, we have no explanation of why, when the two inputs were integrated, the strength of the acoustical or the visual information varied. We may hypothesize that integration was tuned by probabilistic changes in perception of the stimuli from one instance to the next. Random shifts of attention to either the visual or the acoustical stimulus could contribute to a different stimulus perception, even though the participants were not required to pay greater attention to either of the two stimuli. The lip kinematics of the actor and the participants significantly differed when pronouncing ABA and AGA. Consequently, we could detect an influence of the actor's lip movements on the lip kinematics of the participants
repeating the acoustical stimulus (ABA and AGA in experiments 2 and 3, respectively). However, the visually presented ABA influenced the lip kinematics of AGA repetition, whereas no influence of the visually presented AGA was observed on the lip kinematics of ABA repetition. This result argues against the hypothesis that the visual and the acoustical inputs were integrated by imitating any visually detected time-varying motor pattern of the external mouth. In contrast, only the perceived time-varying lip motor pattern of a labial consonant, i.e. a consonant requiring characteristic lip movements in order to be correctly pronounced, was effective in inducing changes in the lip kinematics of an observer pronouncing another string of phonemes. These modifications affected F2, which also depends on the volume of the anterior mouth cavity (Ferrero et al. 1979). Lip movements not directly related to consonant pronunciation, such as those of the AGA string of phonemes, did not influence pronunciation of another string of phonemes. The consonant /g/ requires characteristic modification of the internal mouth. Consequently, the observation of the motor pattern of the visible internal mouth during AGA pronunciation could influence the kinematics of the participants' internal mouth and, consequently, F2 of ABA pronunciation. In addition, it could induce the fusion effect (ADA pronunciation). The fusion effect could be elicited by imitation of the observed lip movements of AGA pronunciation, since the lip kinematics of ADA differed from those of ABA and approached those of AGA. However, it is not parsimonious to suppose that, when pronouncing ABA in experiment 3, the observation of inner mouth movements affected the voice spectra, whereas, when pronouncing ADA in experiment 2, the observation (and probably the imitation) of outer mouth movements more strongly influenced the voice spectra, and in particular F2 (see Fig.
2), which is mainly related to configurations of the inner rather than the outer mouth. Summing up, only the kinematics peculiar to the consonant pronunciation of the presented string of phonemes was extracted from the visual stimulus and integrated with the acoustical stimulus, as shown by the variation in the lip kinematics and, consequently, in the voice spectra of the repeated string of phonemes. Extraction of this specific information was necessarily related to a different perception of the string of phonemes. If this were not the case, other visual information poorly related to speech should have been integrated with the sound. These data support the hypothesis of cross-modal integration between the two inputs, rather than the superimposition of automatic motor programs imitating the acoustically detected motor patterns on those imitating the visually detected ones. Our data further suggest that cross-modal integration provides graded and continuous information about the speech category. This favors Massaro's (1998) hypothesis, according to which speech is not categorically produced, but reflects the perceptual processing that led to categorization. However, the mouth motor pattern characteristic of a string of phonemes can be extracted, and those not strictly
necessary for the pronunciation can be discarded, only by means of the execution of mouth motor programs and the detection of their execution effects. This is in accordance with the motor theory of speech perception (Liberman and Mattingly 1985). Consequently, we suggest that imitation may be used at a filter stage before cross-modal integration, as supported by the finding that infants use imitation in order to learn speech (Meltzoff 2002). Broca's area, which is known to be involved in encoding phonological representations in terms of mouth articulation gestures (Demonet et al. 1992; Paulesu et al. 1993; Zatorre et al. 1992), is also activated by imitation of face movements (Carr et al. 2003; Grèzes et al. 2003; Leslie et al. 2004). In addition, it is activated by lip reading and by repetition of perceived auditory speech (Buccino et al. 2004; Watkins and Paus 2004; for a review see Bookheimer 2002). Conversely, the STS seems to be mainly involved in the integration between the two modalities of speech presentation (Calvert and Campbell 2003; Calvert et al. 1999, 2000).
Acknowledgements We wish to thank Paola Santunione and Andrea Candiani for their help in carrying out the experiments and Dr. Cinzia Di Dio for her comments on the manuscript. This work was supported by a grant from MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca) to M.G.
References
Bookheimer S (2002) Functional MRI of language: new approaches to understanding the cortical organization of semantic processing. Annu Rev Neurosci 25:151–188
Buccino G, Binkofski F, Fink GR, Fadiga L, Fogassi L, Gallese V, Seitz RJ, Rizzolatti G (2001) Action observation activates premotor and parietal areas in a somatotopic manner: an fMRI study. Eur J Neurosci 13:400–404
Buccino G, Lui F, Canessa N, Patteri I, Lagravinese G, Benuzzi F, Porro CA, Rizzolatti G (2004) Neural circuits involved in the recognition of actions performed by nonconspecifics: an fMRI study. J Cogn Neurosci 16:114–126
Calvert GA, Campbell R (2003) Reading speech from still and moving faces: the neural substrates of visible speech. J Cogn Neurosci 15:57–70
Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SC, McGuire PK, Woodruff PW, Iversen SD, David AS (1997) Activation of auditory cortex during silent lipreading. Science 276:593–596
Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS (1999) Response amplification in sensory-specific cortices during cross-modal binding. Neuroreport 10:2619–2623
Calvert GA, Campbell R, Brammer MJ (2000) Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol 10:649–657
Campbell R, MacSweeney M, Surguladze S, Calvert GA, McGuire PK, Brammer MJ, David AS, Suckling J (2001) Cortical substrates for the perception of face actions: an fMRI study of the specificity of activation for seen speech and for meaningless lower face acts (gurning). Cogn Brain Res 12:233–243
Carr L, Iacoboni M, Dubeau MC, Mazziotta JC (2003) Neural mechanisms of empathy in humans: a relay from neural systems for imitation to limbic areas. PNAS 100:5497–5502
Chen TH, Massaro DW (2004) Mandarin speech perception by ear and eye follows a universal principle. Percept Psychophys 66:820–836
Demonet JF, Chollet F, Ramsay S, Cardebat D, Nespoulous JC, Wise R, Frackowiak RSJ (1992) The anatomy of phonological and semantic processing in normal subjects. Brain 115:1753–1768
Ferrero F, Genre A, Boë LJ, Contini M (1979) Nozioni di fonetica acustica. Edizioni Omega, Torino
Gentilucci M, Chieffi S, Scarpa M, Castiello U (1992) Temporal coupling between transport and grasp components during prehension movements: effects of visual perturbation. Behav Brain Res 47:71–82
Gentilucci M, Santunione P, Roy AC, Stefanini S (2004) Execution and observation of bringing a fruit to the mouth affect syllable pronunciation. Eur J Neurosci 19:190–202
Grèzes J, Armony JL, Rowe J, Passingham RE (2003) Activations related to "mirror" and "canonical" neurones in the human brain: an fMRI study. Neuroimage 18:928–937
Heiser M, Iacoboni M, Maeda F, Marcus J, Mazziotta JC (2003) The essential role of Broca's area in imitation. Eur J Neurosci 17:1123–1128
Iacoboni M, Woods RP, Brass M, Bekkering H, Mazziotta JC, Rizzolatti G (1999) Cortical mechanisms of human imitation. Science 286:2526–2528
Leoni FA, Maturi P (2002) Manuale di Fonetica. Carocci, Roma
Leslie KR, Johnson-Frey SH, Grafton S (2004) Functional imaging of face and hand imitation: towards a motor theory of empathy. Neuroimage 21:601–607
Liberman AM, Mattingly IG (1985) The motor theory of speech perception revised. Cognition 21:1–36
Massaro DW (1998) Perceiving talking faces: from speech perception to a behavioral principle. MIT Press, Cambridge, MA
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Meltzoff AN (2002) Elements of a developmental theory of imitation. In: Meltzoff AN, Prinz W (eds) The imitative mind: development, evolution, and brain bases. Cambridge University Press, New York, pp 74–84
Munhall KG, Vatikiotis-Bateson E (1998) The moving face during speech communication. In: Campbell R, Dodd B, Burnham D (eds) Hearing by eye II: advances in the psychology of speechreading and auditory-visual speech. Psychology Press, Hove, UK, pp 123–139
Oldfield RC (1971) The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9:97–113
Paulesu E, Frith CD, Frackowiak RSJ (1993) The neural correlates of the verbal component of working memory. Nature 362:342–345
Reisberg D, McLean J, Goldfield A (1987) Easy to hear but hard to understand: a lip-reading advantage with intact auditory stimuli. In: Dodd B, Campbell R (eds) Hearing by eye: the psychology of lip-reading. Erlbaum, Hillsdale, NJ, pp 97–113
Rizzolatti G, Arbib MA (1998) Language within our grasp. Trends Neurosci 21:188–194
Sekiyama K, Tohkura Y (1993) Inter-language differences in the influence of visual cues in speech perception. J Phonetics 21:427–444
Sekiyama K, Kanno I, Miura S, Sugita Y (2003) Audio-visual speech perception examined by fMRI and PET. Neurosci Res 47:277–287
Sumby WH, Pollack I (1954) Visual contributions to speech intelligibility in noise. J Acoust Soc Am 26:212–215
Summerfield Q (1992) Lipreading and audio-visual speech perception. Philos Trans R Soc Lond B Biol Sci 335:71–78
Watkins K, Paus T (2004) Modulation of motor excitability during speech perception: the role of Broca's area. J Cogn Neurosci 16:978–987
Zatorre RJ, Evans AC, Meyer E, Gjedde A (1992) Lateralization of phonetic and pitch discrimination in speech processing. Science 256:846–849