Exp Brain Res (2011) 215:141–161 DOI 10.1007/s00221-011-2883-9
RESEARCH ARTICLE
The perception of visible speech: estimation of speech rate and detection of time reversals

Paolo Viviani · Francesca Figliozzi · Francesco Lacquaniti
Received: 27 March 2011 / Accepted: 16 September 2011 / Published online: 11 October 2011 © Springer-Verlag 2011
Abstract Four experiments investigated the perception of visible speech. Experiment 1 addressed the perception of speech rate. Observers were shown video-clips of the lower face of actors speaking at their spontaneous rate. Then, they were shown muted versions of the video-clips, which were either accelerated or decelerated. The task (scaling) was to compare visually the speech rate of the stimulus to the spontaneous rate of the actor being shown. Rate estimates were accurate when the video-clips were shown in the normal direction (forward mode). In contrast, speech rate was underestimated when the video-clips were shown in reverse (backward mode). Experiments 2–4 (2AFC) investigated how accurately one discriminates forward and backward speech movements. Unlike in Experiment 1, observers were never exposed to the sound track of the video-clips. Performance was well above chance when playback mode was crossed with rate modulation, and the number of repetitions of the stimuli allowed some amount of speechreading to take place in forward mode (Experiment 2). In Experiment 3, speechreading was made much
Electronic supplementary material The online version of this article (doi:10.1007/s00221-011-2883-9) contains supplementary material, which is available to authorized users.
P. Viviani (✉) · F. Figliozzi · F. Lacquaniti: Laboratory of Neuromotor Physiology, Santa Lucia Foundation, via Ardeatina, 306, 00179 Rome, Italy. e-mail: [email protected]
F. Lacquaniti: Centre of Space Biomedicine, University of Rome Tor Vergata, 00173 Rome, Italy
F. Lacquaniti: Department of Neuroscience, University of Rome Tor Vergata, 00133 Rome, Italy
more difficult by using a different and larger set of muted video-clips. Yet, accuracy decreased only slightly with respect to Experiment 2. Thus, kinematic rather than speechreading cues are most important for discriminating movement direction. Performance worsened, but remained above chance level when the same stimuli of Experiment 3 were rotated upside down (Experiment 4). We argue that the results are in keeping with the hypothesis that visual perception taps into implicit motor competence. Thus, lawful instances of biological movements (forward stimuli) are processed differently from backward stimuli representing movements that the observer cannot perform.

Keywords Visible speech · Speechreading · Motor–perceptual interactions · Speech rate · Inverted faces
Introduction

There is considerable evidence that the perception of visual movement is influenced by implicit knowledge about the principles that movements comply with (cf Blakemore and Frith 2005; Galantucci et al. 2006; Viviani 2002). An influential view that can be traced back to Hertz emphasizes the laws of mechanics as the main source of top-down influence (e.g., Freyd 1987; Freyd and Finke 1984; Johansson 1973; Shepard 1984, 1994). Another view, first articulated by Mach more than a century ago (Mach 1885), emphasizes instead the role of biological principles. In particular, the so-called motor theory of perception holds that movement perception is influenced primarily by the properties and working principles of the motor control system (for a review, Viviani 2002). Opinions differ on how the influence is mediated (cf Blake and Shiffrar 2007).
It is also debated how specific the imprint of motor factors on perception is. At one end of the range of opinions (e.g., Blakemore and Frith 2005), there is the contention that motor factors afford mainly the ability to discriminate visual movements that have a biological origin from purely mechanical movements. At the other end, there are stronger claims such as, for instance, that the type and size of certain visual illusions depend on the intrinsic properties of human movements (de'Sperati and Viviani 1997; Viviani et al. 1997; Viviani and Stucchi 1989, 1992). Finally, the role of expertise in determining the strength of motor influence on perception remains to be elucidated. The fact that naïve newborns show a preference for biological over non-biological motion (Simion et al. 2008) suggests that some motor competence is innate. At the same time, expertise sharpens the perceptual effects of this competence, as shown for instance in the case of dance (Calvo-Merino et al. 2010). The earliest and best known articulation of the motor theory was proposed by Liberman and coworkers (Liberman et al. 1967; Liberman and Mattingly 1985; Liberman and Whalen 2000) to account for speech perception. The theory holds that the remarkable consistency with which the continuous sound signal is segmented into linguistically relevant units (phonemes) is mediated by the implicit motor competence that speaker and listener share concerning the articulatory maneuvers required to attain a given phonetic target. Although Liberman's theory was designed to account for phenomena in the acoustic domain, articulatory movements exhibit at least two features that make them ideally suited for exploring motor–perceptual interactions also in the visual domain. First, the motor plant responsible for speech articulation acts essentially as a displacement generator because the mechanical impedance of the relevant body segments is weak, and gravitational forces are largely irrelevant. Thus, articulatory kinematics is a more direct expression of the underlying motor commands than the kinematics of most other movements. Moreover, perceived kinematics is not likely to be influenced by possible preconceptions about body dynamics (Freyd 1987; Freyd and Finke 1984; Shepard 1984, 1994). Second, the possibility of selecting several solutions to achieve a specific acoustic target implies the existence of strong relational constraints among articulatory components, which may be perceptually conspicuous. In this study, we test the hypothesis that visual perception taps into motor competence by measuring how accurately observers detect manipulations of movement kinematics in displays of visible speech.

Visible speech

Language articulation involves some of the most complex coordinated motor sequences within the human repertoire.
Even the overt component of articulation, which is just a fraction of the overall action, has remarkable properties. Indeed, visible speech is so highly structured that being able to see the mouth region strongly enhances the ability to understand the message when the sound signal is corrupted or noisy (Dekle et al. 1992; Middleweerd and Plomp 1987; Sumby and Pollack 1954). More generally, visible speech influences language perception both by integrating under-specified acoustic information and by making language perception more robust through redundancy (Campbell 2008). Actually, to an extent that varies among individuals and depends on the context, it is possible through speechreading to make out what is being uttered also when the message is not audible. Because many articulatory maneuvers are not visible, a segment of visual speech may correspond to distinct phonological outputs. Visually confusable phonemes fall into phonemic classes of equivalence ("visemes", Fisher 1968; Massaro 1998) the number of which varies from person to person and, ultimately, sets an upper limit to the ability to speechread. In most languages, the maximum number of visemes is considerably smaller than the number of phonemes. Nevertheless, at least in the case of English, whose lexical space is sparsely occupied and uniformly distributed, it has been estimated that more than 90% of the frequent words can be uniquely identified with just 12 visemes (Auer and Bernstein 1997). In fact, perhaps due to overreliance on acoustic processing, the actual speechreading accuracy of hearing people has been reported to range between 11% (Bernstein et al. 2000) and 30% (Rönnberg 1995; Rönnberg et al. 1998) for isolated words and between 21 and 25% for words within sentences (Bernstein et al. 2000; Demorest and Bernstein 1992), with individual scores ranging between 0 and 45%. Remarkably, proficiency at speechreading does not seem to improve with training and is poorly correlated with overall intelligence (Elphick 1996), verbal reasoning abilities (Summerfield 1991), lexical richness (Lyxell and Rönnberg 1992), and other cognitive abilities (Summerfield 1992). Recently, however, it has been reported that a significant amount of variability in proficiency can be accounted for by individual differences in spatial working memory and information processing speed (Feld and Sommers 2009). Silent displays of speech acts elicit cortical activity in auditory association areas, auditory–visual integration areas, as well as Broca's area (Calvert and Campbell 2003; Campbell et al. 2001; Fridriksson et al. 2008; Nishitani and Hari 2002; Paulesu et al. 2003; Ruytjens et al. 2006; Santi et al. 2003; Turner et al. 2009). Cortical responses to speechreading can be differentiated from responses to comparable face movements that cannot be construed as speech (Campbell et al. 2001). Moreover, cortical circuits for speechreading are activated by action representations that differentiate lexical from non-lexical processing
(Paulesu et al. 2003). Finally, visual observation of speech-related lip movements has an entry in the primary auditory cortex (Ruytjens et al. 2006; Pekkola et al. 2005; Sams et al. 1991) and enhances excitability of cortical areas involved in speech production (Fadiga et al. 2002; Nishitani and Hari 2002; Wilson et al. 2004), particularly in the left hemisphere (Campbell et al. 1996; Watkins et al. 2003).

Speech rate and the direction of the time arrow

Intentional modulation of speech rate plays a significant role in social communication. On the one hand, it allows the speaker to compensate for the semantic and/or syntactic complexity of the message. On the other hand, rate along with pitch can be modulated for rhetorical purposes. Clearly, in order for this communicative strategy to be effective, the imposed modulations must exceed the detection threshold of the listener. Earlier estimates of the threshold for rate (Eefting and Rietveld 1989, just noticeable difference (JND) = 4.4%; Nooteboom and Eefting 1994, JND < 20%) have recently been refined by Quené (2007), who reported values between 3 and 4%. Rate modulation does not affect uniformly all components of the utterance, vowels being much more "elastic" than consonants (cf Lehiste 1970). Thus, the relative timing of the consonant in VCV syllables is invariant across changes in speech rate, whereas the timing of the vowel in CVC is not (Tuller et al. 1982, 1983). Moreover, the distribution of contractions and expansions across phonetic segments is accounted for by a stable relationship between articulator stiffness and speech rate (Munhall et al. 1985; Ostry and Munhall 1985). If indeed such distinctive constraints are motor relational features that the perceptual system is tuned to, they might provide a basis for judging the rate at which a sentence is uttered from purely visual cues.

Certain sequences of movement can be performed in both directions. In some cases (e.g., alternating finger pointing between two targets), the two phases of the movement are nearly symmetric and it is virtually impossible to tell whether a video-clip of the movement is shown as originally recorded or in reverse. In other cases, biomechanical and gravitational factors introduce asymmetries in the kinematics of the movement. Walking is an interesting example (Jacobs et al. 2004; Sumi 1984; Viviani et al. 2011). Although people can walk backward without effort, accurate instrumental measurements show significant kinematic differences between forward and backward walk (Grasso et al. 1998; Thorstensson 1986). Remarkably, observers can discriminate these differences if the body appears to move forward. When instead the body appears to move backward, it is virtually impossible to tell apart a reversed recording of forward walk from a normal recording of backward walk (Viviani et al. 2011). Apparently, one of the two directions of the body
displacement is perceptually privileged because it is far more often associated with the walking movement than the opposite one. On several counts, speech movements are peculiar. Qualitative differences between normal and reversed displays of visible speech are not very conspicuous, and a reversal does not produce a visual event that is obviously implausible as would be, for instance, the reversal of a feeding movement. Moreover, there is no salient feature that, like the displacement of the body in walk, is usually associated with one temporal direction. More importantly, articulatory sequences cannot be performed in reverse. With some training, it is indeed possible to reverse certain aspects of the action, such as the direction of the airflow in the so-called ingressive speech (Eklund 2008). However, there is no way of articulating a sound that, when played in reverse, would sound like /apple/. To do so, one should reverse each articulatory maneuver required by the speech act, as well as the specific sequence with which these maneuvers are chained. As far as we know, no amount of training will make this motor task possible. Unlike walking, there are specific constraints in the underlying motor programs that impose an intrinsic time arrow upon the output.

Outline of the experiments

All four experiments reported here involved a reversal of the direction of time in recordings of speech acts. The rationale for this manipulation is that if perceiving biological movements entails the activation of the motor mechanisms that would also be involved in generating that movement, visual displays of movements that can be performed should be processed differently from visual displays of movements that cannot be performed. In Experiment 1, we focussed on movement rate by measuring the precision and accuracy with which one detects departures from the normal speech rate of different speakers in normal and time-reversed displays. In Experiment 2, we measured how accurately one can discriminate the two types of displays. Experiment 3 was designed to demonstrate that discrimination is fundamentally based on the analysis of kinematic features, with only marginal contribution from speechreading. Finally, in Experiment 4, we asked whether kinematic features can still be used to discriminate normal and time-reversed presentations when the stimuli are shown upside down.
Experiment 1

The aim of the experiment was to test the hypothesis that the accuracy and precision with which one can match the
speech rate of muted displays to a reference rate are affected by the direction of movement in time. First, in a familiarization phase, observers were repeatedly shown video-clips in which an actor recites a memorized text at his/her own natural pace. Both the image of the lower part of the face and the associated sound track were provided in this preliminary phase. Then, a portion of the video-clips was shown either in the normal (forward) mode or in the reverse (backward) mode, without the sound track. In both modes, playback speed could be either increased or reduced with respect to the original one. We asked observers to adjust the speed so as to match the speech rate they had experienced during the familiarization phase. The rationale of the experiment is that if the perceptual system is tuned to visual displays of natural speech movements because these movements belong to the motor competence of the observer, rate matching should be accurate in forward mode. In contrast, either accuracy or precision, or both should be negatively affected when judging time-reversed displays showing movements that the observer is not able to produce. In order to provide the results with a sufficient degree of generality, and to minimize possible learning effects, the stimuli tested in all four experiments derived from the speech behavior of several actors.

Methods

Participants

Ten young students (6 men, 4 women; age range: 21–27) from the University of Rome participated in the experiment. Participants in this and all subsequent experiments had normal or corrected to normal vision and were paid a fixed sum for their services. Informed consent to procedures approved by the Institutional Review Board of Fondazione Santa Lucia was obtained from the participants, but they were not aware of the ultimate goal of the experiments. Experimental protocols complied with the Declaration of Helsinki on the use of human subjects in research.

Stimuli

The stimuli for the testing phase were video-clips showing the lower part of the face of 5 actors (two women, three men), each reciting a different memorized text. The 5 texts were excerpts from Italian poems that include some very low-frequency and obsolete words. The average speaking rate of the actors was 4.50, 5.00, 4.50, 5.00, and 5.75 syllables per second, respectively. The sound track of the recording was suppressed. The preparation of the stimuli involved three steps. First, we recorded each actor for 10 s with a high-speed (500 frames/s) digital camera (Nac HotShot). The result was stored as a sequence of single frames (1,024 × 1,280 pixel, RGB TIFF format). Second, a sample of 2,700 frames (5.4 s) was selected and processed with Photoshop CS4 to equalize for luminance and chromatic spectrum. The beginning and the end of the samples did not coincide with pauses between words so that the texts within the stimuli were not well-formed sentences. Finally, we translated with Adobe Première various subsets of the sample into 60 Hz AVI video-clips. By varying the size of the subset, and keeping constant its duration at 4 s, the content of the video-clip was either compressed or expanded with respect to the original recording. For each actor, we produced 29 different stimuli corresponding to a number of frames from 1,300 to 2,700 in steps of 50. Thus, the actual lip movements were reproduced faithfully by stimuli that included 2,000 frames, were slowed down to 65% of the original speed by stimuli that included 1,300 frames (speed ratio¹ SR = 1,300/2,000 = 0.650), and were accelerated to 135% of the original speed by stimuli that included 2,700 frames (speed ratio: SR = 2,700/2,000 = 1.350). SR was varied by eliminating progressively the initial and final portions of the selected 2,700 frames, so that the central portion of 1,300 frames (2.6 s) was shared by all stimuli. Each of the 29 stimuli corresponding to one actor was generated both in forward and backward mode. The total number of available stimuli was 29 [speed] × 2 [directions] × 5 [actors] = 290. Four examples of the stimuli are provided as supplementary material.

¹ By convention, speed ratio (SR) denotes a property of the stimulus and speech rate what observers were asked to judge.

Task

After viewing one stimulus, the observer indicated (2AFC) whether he/she perceived the speech rate as faster or slower than the spontaneous speech rate for that actor (see later). Responses were entered by pressing with the left or right index finger two buttons on an Empirisoft response box labelled "Slower" and "Faster," respectively. The next stimulus appeared 1 s after each response. No feedback was provided after a response. Participants were not told that half of the stimuli were time-reversed.

Procedure

The experiments were run in a dimly illuminated, quiet room. Participants sat in front of a computer display at a distance of about 60 cm. Stimuli were shown on an EIZO FlexScan S2411W calibrated monitor (resolution: 1,920 × 1,200 pixels, refresh rate: 60 Hz). The dimension of the stimuli on the screen was 21 (W) × 16 (H) cm (at viewing distance, 1 cm ≈ 1° of visual angle). To estimate the speech rate that was closest to the spontaneous one, we adopted the double psychophysical staircase method. The first stimulus in ascending and descending staircases was SR = 0.775 and SR = 1.225, respectively (selection probability: 0.5). Thereafter, the sequence of presentation of the stimuli was determined by the responses, SR approaching 1 after a correct response and moving away from 1 after a wrong response in equal steps of 0.025. Stimuli with SR > 1.225 and SR < 0.775 were available in case even the initial stimuli were misjudged. The selection between ascending and descending sequences was randomized. To prevent long runs, selecting one of the two sequences decreased slightly its selection probability on the next trial. The experimental session was divided into 10 blocks, one for each actor/direction combination. The order of presentation of blocks was randomized. Thus, on each trial, the stimulus could be any of the 29 available video-clips for the selected actor/direction combination. The block ended when there had been 15 inversions in both ascending and descending staircases. Each session began by asking the participant to read a written version of the instructions. This was followed by a warm-up and familiarization phase in which we showed a random sample of 20 stimuli from all actors and in both directions. In this phase, responses were entered but not recorded. Each block began with a presentation of the original 10-s recording of the corresponding actor with the associated sound. Participants watched these recordings as many times as they wished until they were satisfied that they had a good perception of the spontaneous speech rate of the actor. The 10 blocks were then administered in sequence with a brief rest period between blocks. A session lasted between one and one and a half hours.
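A minimal sketch of the staircase logic described above, assuming a simple one-step adjustment rule and a simulated observer (the `respond` callback and all names are hypothetical; the authors' selection-probability adjustment and termination details may differ):

```python
import random

# Illustrative sketch (not the authors' code) of one double-staircase block.
SR_MIN, SR_MAX, STEP = 0.650, 1.350, 0.025   # 29 available speed-ratio levels

def run_block(respond, n_inversions=15):
    """respond(sr) returns 'faster' or 'slower'; returns one PSE per staircase."""
    current = {"ascending": 0.775, "descending": 1.225}   # starting stimuli
    inversions = {name: [] for name in current}
    last_step = {name: 0.0 for name in current}
    while any(len(v) < n_inversions for v in inversions.values()):
        name = random.choice(list(current))               # simplified 50/50 selection
        sr = current[name]
        # A 'faster' judgement drives the next stimulus toward slower playback and
        # vice versa, so each staircase converges on the subjectively natural rate.
        step = -STEP if respond(sr) == "faster" else STEP
        if last_step[name] and step != last_step[name]:
            inversions[name].append(sr)                    # record the reversal point
        last_step[name] = step
        current[name] = min(SR_MAX, max(SR_MIN, sr + step))
    # PSE = mean speed ratio over the last 15 inversions of each staircase
    return {name: sum(v[-n_inversions:]) / n_inversions
            for name, v in inversions.items()}

# Example: a noisy observer whose subjective natural rate corresponds to SR = 1.05
pse = run_block(lambda sr: "faster" if sr > 1.05 + random.gauss(0, 0.03) else "slower")
```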
Fig. 1 Experiment 1: estimation by one representative participant of the playback speed reproducing the natural speech rate for one actor. Speed was modulated by varying the number of frames included in a constant 4-s interval. Ordinate: ratio between the number of frames of a test stimulus and the 2,000 frames of the original recording. Abscissa: rank order of the trials (each rank applies separately to ascending and descending sequences). PSE: Point of subjective equality estimated separately for ascending and descending sequences by averaging the speed ratio at the last 15 inversions in each sequence. PSEs greater (smaller) than 1 indicate that the speed of the movement is under- (over)estimated. Estimation with forward (upper panel) and backward (lower panel) playback mode is shown separately
Fig. 2 Experiment 1: mean speed ratio (SR) over all participants for each actor and both playback modes. Bars: 99% confidence intervals
Results

Figure 1 illustrates with one representative example the difference between estimates with backward and forward stimuli. For each participant, each actor and both directions the speech rate perceived as natural was estimated separately for ascending and descending sequences by the mean SR over the last 15 inversions. Statistical analysis (General Linear Model [GLM], repeated measures, 2 [Backward/Forward] × 2 [Ascending/Descending] × 5 [Actor], Greenhouse–Geisser correction) demonstrated significant effects of the direction of the movement (F(1,9) = 6.998, P = .027), of the actor (F(4,9) = 8.385, P < .001), and of the staircase sequence (ascending/descending; F(1,10) = 7.821, P = .021), with no significant interaction. As for the effect of the sequence, further analysis showed no consistent evidence of either perseveration or anticipation effects. Only 18 among the 10 [participant] × 5 [actor] × 2 [direction] = 100 differences between ascending and descending means were significant at the P = .01 level (t test, independent groups). Moreover, significant differences were both negative and positive and were not concentrated on any one factor. Therefore, in all subsequent analyses, results for ascending and descending sequences were pooled.

The memory trace of the natural speech rate experienced at the beginning of each block (see "Methods") might have been blurred over time by the presentation of stimuli with other rates. To check this possibility, we performed a further ANOVA in which the mean SRs over the first and the last 5 inversions were analyzed separately. The results showed no significant difference in the responses (F(1,9) = 1.203, P = .301).

The direction of movement in time biased the perception of speech rate. The mean SR corresponding to the rate judged closest to the natural one varied across actors (Fig. 2), but the mean for backward stimuli was always higher than the mean for forward stimuli, even when they were both smaller than 1 (Actor 5). The underestimation was present in individual performances for all participants and reached significance (One-tailed t test at P = .05) in 8 participants (Table 1). The effect was investigated further (Fig. 3) by contrasting for each actor the cumulative distribution functions (cdf) of SR at inversions for forward and backward stimuli (pooling individual results). In all cases, the cdf's were significantly different (Smirnov–Kolmogorov, P < .002). The difference was mainly in the median for Actors 1, 3, 4, and 5, and in the slope at the median for Actor 2 (values inset).

The last analysis focussed on the variability of the estimated rates. Judging from the slopes of the cdf's (Fig. 3), differential thresholds for all but one actor were lower with forward than backward stimuli. However, these thresholds may be affected by both individual and actor-dependent differences among means. To factor out this potential confound, we ipsitized the SR values at the last 15 inversions by subtracting their mean for each participant and each actor. After pooling the data for all participants and all actors, the cdf's of the ipsitized SR for backward and forward stimuli were z-transformed (Fig. 4). As shown by the excellent fit by a linear regression, Gaussian cdf's closely approximated the data points. More importantly, the just noticeable difference (JND: inverse of the slope of the linear regression) was virtually identical for backward and forward stimuli. In summary, the direction of movement in time affected selectively the mean perceived speech rate, with no effect on the differential threshold.

Table 1 Experiment 1: results for each participant

Participant   SRB      SRF      D = SRB − SRF   P
S01           1.0328   0.9718   0.0610          .016
S02           1.0893   1.0305   0.0508          .037
S03           1.0005   0.9865   0.0140          .293
S04           1.0484   0.9914   0.0570          .010
S05           1.1325   1.0385   0.0940          <.001
S06           1.1023   1.0735   0.0288          .175
S07           1.0897   0.9997   0.0900          .004
S08           1.0239   0.9351   0.0888          .022
S09           1.0292   0.9542   0.0750          <.001
S10           0.9683   0.9071   0.0612          .050

SRB, SRF: speed ratio for backward and forward stimuli averaged over actors and both descending and ascending sequences. SR > 1 (< 1) indicates that speech rate is under- (over-)estimated with respect to the real one. D: difference between perceived speech ratios. P: significance of the test D > 0 (One-tailed t test for paired samples)

Discussion

The results demonstrated a significant difference between the accuracy with which observers matched the reference speech rate depending on movement direction. With forward stimuli, the match was fairly accurate (0.937 < median SR < 1.023, Fig. 3). The speech rate of backward stimuli was instead underestimated (Fig. 2). How is speech rate perceived during the familiarization phase and matched during the experiment? The video-clips used to demonstrate the reference for each actor had a constant duration of 10 s. The duration of the stimuli was also kept constant (4 s) for all tested values of SR. Therefore, speech rates of both references and stimuli were necessarily inferred from the average interval between successive salient events, such as the opening and closing of the lips. Over the range of spontaneous speech rate of our actors (see "Methods"), accuracy with forward stimuli was comparable to that reported in judging the rate of sequences of flashes or clicks (Mowbray and Gebhard 1955; Stevens and Shickman 1959; Welch et al. 1986). Also, precision compared favorably with that observed with flash sequences. In the rate range of articulatory gestures, Mowbray and Gebhard (1955, Table 1) reported an average absolute error Δf = 0.106, which, if deviations have a Gaussian distribution, corresponds to a JND = Δf √(π/2) = 0.133, close to the value estimated for both backward and forward stimuli (≈0.1, see Fig. 4). Moreover, the interquartile ranges around 5 Hz reported by Stevens and Shickman (1959, Fig. 2) are also of the same order of magnitude.
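For reference, the conversion from mean absolute error to JND used above follows from the Gaussian relation E|X| = σ√(2/π) (a worked restatement, not part of the original text):

```latex
% For X ~ N(0, \sigma^2): E|X| = \sigma\sqrt{2/\pi}, hence \sigma = E|X|\,\sqrt{\pi/2}
\mathrm{JND} \simeq \sigma = \overline{|\Delta f|}\,\sqrt{\pi/2} = 0.106 \times 1.2533 \approx 0.133
```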
Fig. 3 Experiment 1: cumulative frequency distributions of the speed ratio (pooled over participants). Rows: results for each actor. Columns: results for Forward (left) and Backward (right) stimuli. Distributions include all speed ratios at the last 15 inversions in the psychophysical staircase (see Fig. 1). Medians and slopes at the median are calculated by interpolating (continuous lines) the distributions with generalized logistic functions. Slopes are inversely related to differential thresholds (JND). P(KS): probability of Type 1 error in testing the difference between corresponding distributions (Kolmogorov–Smirnov)
Fig. 4 Experiment 1: analysis of the differential thresholds. Ordinate: Z-transformed cumulative frequency distributions of speed ratios at the last 15 inversions. Abscissa: Speed ratio SR. Individual and actor-dependent variability were filtered out by normalizing speed ratios to their mean for each participant/actor combination. Data points with P > .95 and P < .05 were censored. Continuous lines: linear regressions and 95% confidence hyperbolae. JND: Inverse of the slope of the linear regression
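A schematic re-implementation of the JND estimate sketched in the caption above, using synthetic data (the authors' exact pooling and censoring details may differ):

```python
import numpy as np
from scipy.stats import norm

def jnd_from_inversions(sr_values):
    """Estimate the JND as the inverse slope of a linear fit to the
    z-transformed empirical cdf of (ipsatized) speed ratios, as in Fig. 4."""
    x = np.sort(np.asarray(sr_values, dtype=float))
    p = (np.arange(1, len(x) + 1) - 0.5) / len(x)     # empirical cdf
    keep = (p > 0.05) & (p < 0.95)                    # censor extreme points
    z = norm.ppf(p[keep])                             # probit (z) transform
    slope, _ = np.polyfit(x[keep], z, 1)              # z ~ (SR - PSE) / sigma
    return 1.0 / slope                                # JND = inverse slope

# Synthetic data roughly matching the reported JND of about 0.1
rng = np.random.default_rng(0)
print(jnd_from_inversions(rng.normal(0.0, 0.1, size=300)))
```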
By contrast, the underestimation observed with backward stimuli is a new finding that could not have emerged with discrete sequences. Because the duration of articulatory gestures was the same irrespective of the direction of movement, this bias implies a difference in the way durations are perceived depending on movement direction. In particular, the difference should have to do with some aspect of movement kinematics that is not invariant with respect to time reversals. A hypothesis on the origin of the bias will be provided in the general discussion after taking into account the results of the other experiments. Here, we simply stress two points. First, several observers reported that they had been able to guess a few words. Not infrequently, however, the word reported did not actually exist in the spoken text. In any case, no participant realized that speechreading was possible only in one half of the stimuli and that the other half were time-reversed versions of normal recordings. Of course, at least for forward stimuli, this does not rule out the possibility that some automatic unconscious processing has taken place. It does rule out, however, that observers may have consciously used the possibility of speechreading to make inferences about perceived speech rate. Second, because we wanted to provide the reference rate in the most natural way, with a correct association of image and sound, we used only the forward mode. Although the muted stimuli were just a segment of the video-clips with the sound, forward stimuli were presented a few more
times than backward ones. However, the only stimulus attribute that observers had to attend to was independent of movement direction. Moreover, both backward and forward stimuli alternated randomly within one session (see "Methods"). Therefore, familiarity per se is also not likely to have biased perceived rate.
Experiment 2

Experiment 1 investigated how accurately one can scale a continuous parameter of speech movements, speech rate, which is spontaneously modulated for expressive purposes. Accuracy was quite good for natural (forward) movements. Instead, rate was underestimated when movements were shown in reverse, suggesting that natural and unnatural (backward) movements are dealt with differently by the perceptual system and can actually be discriminated when the task explicitly requires doing so. Demonstrating that time reversals applied to movements that cannot be executed backward are perceptually salient would provide further support to the general notion that perception is penetrable by motor competence. Thus, Experiment 2 was designed to estimate how accurately one identifies movement direction from purely visual cues. In Experiment 1, some words might have been identified through an association between voice and movements established in the preliminary phase. This was irrelevant insofar as rate estimation is concerned. In contrast, it was important to
prevent these associations from playing a role in this experiment. Therefore, we tested a new population of participants who never heard the spoken texts.

Method

Participants

Ten young students (5 men, 5 women; age range: 20–28) from the University of Rome participated in the experiment.

Stimuli

For each actor, we used a subset of the stimuli presented in Experiment 1, namely those with the 5 values SR = 0.750, 0.875, 1.000, 1.125, and 1.250 (see above). As in Experiment 1, each actor recited a different text. Stimuli were presented either as recorded (Forward) or in reverse (Backward). The number of different stimuli was 2 [Backward/Forward] × 5 [Actors] × 5 [SR] = 50.

Task

Immediately after the end of a stimulus, participants indicated the perceived direction of the movement (2AFC). Responses were entered by pressing with the left and right index finger two keys of the response box labelled "Forward" and "Backward," respectively.

Procedure

The experiment was run in the same general conditions and with the same equipment as Experiment 1. Participants were told that the video-clip could be shown either in the normal (forward) mode or in the reverse (backward) mode. However, to prevent possible compensatory response strategies, we made them believe that playback mode at each trial was determined by a complex algorithm and that there was no reason to expect the two modes to occur with equal probability. Although Experiment 1 had shown that speed modulation in the stimuli was well above perceptual threshold, we stressed that attention should be focussed exclusively on the direction of the movement. Each of the 50 different stimuli was repeated 10 times, so that there were NT = 10 [Repetitions] × 50 = 500 trials in a session divided into 5 equal blocks. Trials were presented in a different pseudorandom order for each participant. Stimuli with the same actor or the same SR were prevented from occurring in successive trials. The presentation was self-paced, with each stimulus appearing 1 s after the previous response. A session lasted about 1 h. Short rest periods were allowed after each block.
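One way to generate such a constrained pseudorandom order is a draw-and-restart scheme; the sketch below is illustrative and not the authors' code (all names are hypothetical):

```python
import random

def constrained_order(trials, conflicts, max_restarts=1000):
    """Draw each next trial at random among those that do not conflict with the
    previous one; restart from a fresh shuffle if a dead end is reached."""
    for _ in range(max_restarts):
        pool, order = list(trials), []
        random.shuffle(pool)
        while pool:
            candidates = [t for t in pool if not order or not conflicts(order[-1], t)]
            if not candidates:
                break                      # dead end: restart with a new shuffle
            pick = random.choice(candidates)
            order.append(pick)
            pool.remove(pick)
        if len(order) == len(trials):
            return order
    raise RuntimeError("could not satisfy the ordering constraint")

# Experiment 2: 2 playback modes x 5 actors x 5 speed ratios, 10 repetitions each
trials = [(mode, actor, sr)
          for mode in ("forward", "backward")
          for actor in range(1, 6)
          for sr in (0.750, 0.875, 1.000, 1.125, 1.250)] * 10
no_repeat = lambda a, b: a[1] == b[1] or a[2] == b[2]   # same actor or same SR
order = constrained_order(trials, no_repeat)
```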
Results

After the session, several participants mentioned that speech rate was variable. However, none of them reported an impact of rate variability on the difficulty of the task, which was considered quite severe. The 500 responses were collected as 2 × 2 contingency tables: [NF|F, NF|B, NB|F, NB|B]. For each table, we computed the χ², the probability of a correct answer: P{C} = (NF|F + NB|B)/500 and the probability of a "Forward" response: P{F} = (NF|F + NF|B)/500. We also estimated discriminability through a d′ index based on the convention P{Hit} = P{F|F} and P{False Alarm} = P{F|B}.² The main result emerged by pooling responses over actors, SR, and participants (sample size = 500 × 10 [participants] = 5,000): NF|F = 1,791, NF|B = 1,050, NB|F = 709, NB|B = 1,450; χ² = 447.59 (P < .001), P{C} = 0.653, P{F} = 0.568, d′ = 0.774. Observers detected the direction of the stimuli with a probability significantly higher than chance level (P{C} > 0.5). However, there was also a bias in favor of the response "Forward." Tables 2 and 3 report the means over participants of P{C} and P{F} for all combinations of Actor and SR. Because these two mean indexes of performance were linearly independent (Pearson's r = -0.139, P = .508), they were analyzed separately (GLM with repeated measures, 5 [Actor] × 5 [SR], arcsine transformation, Greenhouse–Geisser correction). Neither Actor nor SR modulated the performance (P{C}, Actor: F(4,144) = 0.219, P = .825, SR: F(4,144) = 1.377, P = .267, Interaction: F(16,144) = 0.673, P = .659; P{F}, Actor: F(4,144) = 0.573, P = .599, SR: F(4,144) = 2.236, P = .131, Interaction: F(16,144) = 2.000, P = .111). Moreover, the results for the population sample were confirmed by most individual performances (Table 4). In particular, the d′ index computed for each observer was significantly greater than 0 (One-tailed t test, t(9) = 3.393, P = .008). To test for learning effects, we computed the individual estimates of P{C} and P{F} over 5 successive blocks of trials. Statistical analysis (GLM with repeated measures, 5 [block], arcsine transformation, Greenhouse–Geisser correction) showed that the bias in favor of the response "Forward" remained constant across blocks (F(4,36) = 1.166, P = .338). Instead, there was a block effect on P{C} (F(4,36) = 5.516, P = .018) with a linearly increasing trend (F(1,9) = 7.880, P = .020).

² Because only one stimulus was presented on each trial, we did not divide d′ values by √2, as suggested for discrimination tasks in which two stimuli are presented (cf MacMillan and Creelman 2005).
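The indices defined above can be computed directly from the pooled counts; the sketch below follows the stated conventions (Hit = P{F|F}, False Alarm = P{F|B}, no division by √2, negative d′ clipped to 0) but is illustrative rather than the authors' script:

```python
from statistics import NormalDist

def indices(n_ff, n_fb, n_bf, n_bb):
    """P{C}, P{F} and d' from a 2 x 2 stimulus/response contingency table."""
    n = n_ff + n_fb + n_bf + n_bb
    p_correct = (n_ff + n_bb) / n
    p_forward = (n_ff + n_fb) / n
    hit = n_ff / (n_ff + n_bf)                 # P{F | forward stimulus}
    false_alarm = n_fb / (n_fb + n_bb)         # P{F | backward stimulus}
    z = NormalDist().inv_cdf
    d_prime = max(z(hit) - z(false_alarm), 0.0)
    return p_correct, p_forward, d_prime

# Pooled counts reported above for Experiment 2
print(indices(1791, 1050, 709, 1450))          # d' close to the reported 0.774
```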
Table 2 Experiment 2: probability of a correct answer P{C} for each actor and each speed ratio (SR)

SR       Actor 1   Actor 2   Actor 3   Actor 4   Actor 5   Mean
0.750    0.705     0.630     0.660     0.605     0.665     0.653
0.875    0.620     0.605     0.600     0.640     0.595     0.612
1.000    0.690     0.660     0.610     0.665     0.640     0.653
1.125    0.650     0.615     0.645     0.705     0.645     0.652
1.250    0.690     0.700     0.665     0.660     0.640     0.671
Mean     0.671     0.642     0.636     0.655     0.637

Table 3 Experiment 2: probability of an answer "Forward" P{F} for each actor and each speed ratio (SR)

SR       Actor 1   Actor 2   Actor 3   Actor 4   Actor 5   Mean
0.750    0.545     0.620     0.670     0.545     0.505     0.577
0.875    0.520     0.595     0.610     0.530     0.575     0.566
1.000    0.590     0.510     0.520     0.545     0.520     0.537
1.125    0.540     0.635     0.635     0.605     0.565     0.596
1.250    0.470     0.540     0.615     0.620     0.580     0.565
Mean     0.533     0.580     0.610     0.569     0.549

Table 4 Experiment 2: results for each participant

Participant   χ²        P       P{C}    P{F}    d′
S01           173.710   <.001   0.780   0.656   1.838
S02           111.421   <.001   0.736   0.492   1.262
S03           144.816   <.001   0.762   0.614   1.535
S04           150.850   <.001   0.774   0.534   1.514
S05           0.308     .578    0.488   0.628   0.000*
S06           5.068     .024    0.550   0.558   0.254
S07           229.108   <.001   0.838   0.474   1.984
S08           3.035     .081    0.538   0.610   0.198
S09           0.828     .362    0.480   0.592   0.000*
S10           2.598     .107    0.536   0.524   0.181

P: test values (exact χ² test) for individual 2 × 2 contingency tables (pooling over texts and actors). P{C}: probability of correct answer. P{F}: probability of answer "Forward." d′: discrimination index (Hit: P{F|F}, False Alarm: P{F|B}). Slightly negative values of d′ (*) have been set to 0

Discussion

The probability of discriminating correctly movement direction exceeded significantly chance level. Several cues may have contributed to the performance. First, participants may have recognized some of the words in the text. Unlike those of Experiment 1, participants were not given the possibility to memorize the association between words and movements because they never heard the spoken text. Thus, whenever recognition occurred, it must have resulted from true speechreading. Questioned after the experiments, some participants did in fact report having responded "Forward" after guessing a word. The contribution of this "positive cue" must be qualified. Because P{C} scores (Table 4) are likely to reflect also occasional speechreading, it can be suspected that the ability to do so differed considerably among individuals. At least two participants (S05 and S09) did no better than chance, meaning that neither speechreading nor any other cue was available to them. Even in the five participants with a rate of correct responses exceeding 0.70 (S01, S02, S03, S04, S07), the probability of giving a correct response based only on the identification of a single word was P{F|F} = P{C} + P{F} - 0.5 and averaged 0.832. Therefore, if attention was focussed, for instance, on only three words in a text, the probability PW of identifying any of them in a trial is given by (1 - PW)³ = 1 - 0.832 and was no higher than about 0.45. The significant increase in the rate of correct responses in the course of the session may partly be credited to a repetition effect that made the identification of some words progressively easier.
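The single-word bound used above can be restated explicitly (a worked restatement; P_W denotes the per-word identification probability assumed in the argument):

```latex
P\{F\mid F\} = P\{C\} + P\{F\} - 0.5 \approx 0.832, \qquad
(1 - P_W)^3 = 1 - 0.832 \;\Rightarrow\; P_W = 1 - \sqrt[3]{0.168} \approx 0.45
```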
Another cue for discriminating backward and forward stimuli might come from the components of the articulatory movements per se, i.e., independently of the ability to integrate them in a meaningful percept. If the perceptual system is tuned to features that are not invariant with respect to time reversals, these features can be used to identify movement direction. In the section ‘‘Constraints and temporal asymmetries’’ of the general discussion, we will review some of the kinematic features that may play such a role. Finally, participants reported that sometimes responses were suggested by the entire configuration of the lower face, which in some stimuli seemed unnatural. Clearly, the three potential cues for discrimination evoked above are not mutually exclusive. In order to estimate their relative weight, it is necessary to begin by testing discriminability in a condition that reduces drastically the potential contribution of speechreading.
Experiment 3

Experiment 2 showed that participants who had not heard the voice track of the video-clips discriminated movement direction with an accuracy that significantly exceeded chance level. Better-than-chance performance was confirmed by deriving a d′ measure of discriminability from the stimulus–response contingency matrices. In discussing these results, we argued that the identification of some words through speechreading may have contributed to achieving such performance. The aim of this experiment was to estimate the relative weight of this contribution. In Experiment 2, word identification through speechreading may have been facilitated by the fact that, not taking into account rate
modulation, each text was repeated 50 times in forward mode, allowing observers to refine progressively a preliminary guess. Because the chances to make out a word increase with the frequency with which the words occur in a text (Breeuwer and Plomp 1986), we reduced the number of repetitions from 50 to 10. At the same time, we broadened considerably the spectrum of words by increasing from 5 to 25 the number of different texts used as stimuli. Together, these changes are likely to reduce drastically the probability for any one word to be identified. The limited speed modulation introduced as a factor in Experiment 2 did not interact with the ability to discriminate backward and forward stimuli (see above). Therefore, this factor was no longer introduced in the experimental design.

Method

Participants

The same 10 individuals who participated in Experiment 1.

Task, stimuli, and procedure

The task was again to indicate whether a video-clip was shown as recorded or in reverse. The procedure was identical to that of Experiment 2. The stimuli were generated by the same 5 actors of the previous experiments. In this case, however, each actor recited a different 25-s-long memorized excerpt from Italian poems. Then, we extracted 5 non-overlapping samples of 2,000 frames (4 s) from each recorded excerpt and processed them as in Experiment 2. Thus, altogether, there were 25 different video-clips. The beginning and the end of the clips did not coincide with pauses between words so that the texts within the stimuli were not well-formed sentences. The video-clip for each text was shown only 10 times in the forward and 10 times in the backward direction. The average speech rate of the 5 actors was very close to that measured previously. Unlike Experiments 1 and 2, the playback speed was always the one at which the scene had been recorded (SR = 1). There were 2 [Backward/Forward] × 5 [Actors] × 5 [Text] = 50 different stimuli, and a session comprised NT = 10 [Repetition] × 50 [Stimuli] = 500 trials. Trials were presented in a different pseudorandom order for each participant. Stimuli with the same actor or the same text were prevented from occurring in successive trials.

Results

Questioned after the experiment, participants reported that no word could be identified. As in Experiment 2, for each participant and each text/actor combination, we computed the χ² of the 2 × 2 contingency tables [NF|F, NF|B, NB|F,
NB|B], the probability of a correct answer P{C}, the probability of a "Forward" response P{F}, and a d′ index based on the convention P{Hit} = P{F|F} and P{False Alarm} = P{F|B}. The main result emerged again by pooling responses over actors, texts, and participants (sample size = 5,000): NF|F = 1,695, NF|B = 1,080, NB|F = 805, NB|B = 1,420; χ² = 306.29 (P < .001), P{C} = 0.623, P{F} = 0.555, d′ = 0.633. Observers detected the direction of the stimuli with a probability that was significantly higher than chance level (Two-tailed binomial test, P < .001). The d′ index was significantly greater than 0 (One-tailed t test, t(9) = 4.033, P = .003), but not significantly smaller than in Experiment 2 (Two-tailed t test for independent samples, t(9) = -0.669, P = .512). Moreover, also in this condition, there was a bias in favor of the response "Forward" (Two-tailed binomial test, P < .001). Tables 5 and 6 report the means over participants of P{C} and P{F} for all combinations of actors and texts (recall that each actor recited a different set of texts). Because these two mean indexes of performance were linearly independent (Pearson's r = 0.096, P = .649), they were analyzed separately (GLM with repeated measures, 5 [Actor] × 5 [Text] with Text nested within Actor, arcsine transformation, Greenhouse–Geisser correction). The rate of correct responses P{C} depended on both the actor and the text (Actor: F(4,36) = 3.999, P = .009; Text: F(20,180) = 1.694, P = .038). By contrast, neither factor had a significant effect on the probability P{F} (Actor: F(4,36) = 1.895, P = .132; Text: F(20,180) = 1.439, P = .109). The results for the population sample were confirmed by most individual performances (Table 7). As in Experiment 2, we tested for learning effects by comparing individual performances across 5 successive blocks of trials. Statistical analysis (GLM with repeated measures, 5 [block], arcsine transformation, Greenhouse–Geisser correction) showed no block effect either on the bias P{F} in favor of the response "Forward" (F(4,36) = 1.509, P = .238) or on response accuracy P{C} (F(4,36) = 0.699, P = .598).

Table 5 Experiment 3: probability of a correct answer P{C} for each actor and each text

Text   Actor 1   Actor 2   Actor 3   Actor 4   Actor 5
T1     0.535     0.665     0.725     0.665     0.745
T2     0.600     0.595     0.655     0.590     0.625
T3     0.530     0.680     0.610     0.635     0.685
T4     0.560     0.605     0.545     0.555     0.540
T5     0.640     0.645     0.595     0.635     0.715
Mean   0.573     0.638     0.626     0.616     0.662

Texts differ across actors

Table 6 Experiment 3: probability of an answer "Forward" P{F} for each actor and each text

Text   Actor 1   Actor 2   Actor 3   Actor 4   Actor 5
T1     0.655     0.585     0.575     0.615     0.515
T2     0.660     0.605     0.555     0.570     0.525
T3     0.550     0.550     0.500     0.445     0.585
T4     0.560     0.525     0.475     0.465     0.560
T5     0.510     0.655     0.515     0.515     0.605
Mean   0.587     0.584     0.524     0.522     0.558

Texts differ across actors

Table 7 Experiment 3: results for each participant

Participant   χ²        P       P{C}    P{F}    d′
S01           17.328    <.001   0.592   0.576   0.474
S02           88.519    <.001   0.710   0.530   1.111
S03           0.645     .419    0.518   0.454   0.091
S04           34.958    <.001   0.632   0.528   0.676
S05           8.715     .003    0.566   0.510   0.332
S06           12.747    <.001   0.572   0.716   0.430
S07           22.975    <.001   0.606   0.574   0.548
S08           41.700    <.001   0.636   0.668   0.776
S09           221.355   <.001   0.832   0.468   1.940
S10           8.736     .003    0.623   0.526   0.333

P: test values (exact χ² test) for individual 2 × 2 contingency tables (pooling over texts and actors). P{C}: probability of correct answer. P{F}: probability of answer "Forward." d′: discrimination index (Hit: P{F|F}, False Alarm: P{F|B})

Discussion

The much smaller number of repetitions of each text and the wider sample of articulatory movements with respect to Experiment 2 were designed to reduce the chances for speechreading to occur. We cannot rule out that some unconscious lexical processing might have taken place. Thus, although observers could not be confident about identification, they might have guessed correctly, if they had been forced to. Nevertheless, the fact that the same group of participants who had confidently identified words in Experiment 1 claimed to be unable to do so in the new condition suggests that speechreading was indeed much more difficult. In addition, the fact that, unlike in Experiment 2, there was no improvement in response accuracy in the course of the session confirms that increasing the sample of texts effectively reduced the chances of guessing single words. Yet, discriminability was only marginally lower than in Experiment 2 (P{C}: 0.623 vs. 0.653 and d′: 0.633 vs. 0.774, respectively; t test for independent samples, P{C}: t(18) = 0.425, P = .677; d′: t(16) = 1.338, P = .206). Therefore, the occasional ability to make out
some words with forward stimuli cannot have played a major role in achieving the observed level of discriminability. Taken together, the results of Experiments 2 and 3 suggest that the persistent significant excess of accuracy above chance level must be partly or wholly credited to an intrinsic perceptual difference between backward and forward stimuli. Because many visible articulatory gestures comprise two phases, from rest to apical position and back, one relevant difference might be an asymmetry between movement phases. Although in connected speech many such gestures are chained and modified by co-articulation, these asymmetries may still be perceived as local cues. In addition to local cues, one cannot rule out, however, a role of more global relationships among concurrent articulatory gestures. For instance, Rosenblum and Saldaña (1996) showed that as few as 20 markers provide sufficient configural information about dynamic deformations of the face to induce McGurk effects (McGurk and MacDonald 1976), even though local features are not visible. The last experiment introduced a modification of the stimuli designed to interfere with the pick-up of discriminal information from the relation among local features.
Experiment 4

The suggestion that detecting configural invariants in the moving features of the lower face contributes to the discrimination between normal and reversed stimuli would be supported if accuracy was decreased by a global modification of the stimuli that affects detection. One such modification is to turn the stimuli upside down. It is well known that inverting the image of a human face disrupts profoundly the integration of the facial features into a coherent configuration, making it almost impossible to identify even a very familiar face (Yin 1969). The reason why faces are disproportionately more affected by a rotation than any other object is still debated (Valentine 1988). However, it is generally agreed that inversions make it difficult to detect gross violations in the relationship between facial features and global configuration. For instance, in the so-called Thatcher illusion (Thompson 1980), rotating by 180° only the eyes and the mouth within a normally upright face clearly turns a smiling expression into a grotesque one. In contrast, if the face is rotated while mouth and eyes are not, it is difficult to realize that the person is not smiling. In the Thatcher illusion, the affected relationship is spatial in nature. However, there is evidence from studies on the perception of walking movements that inversion also affects temporal relationships. Troje and Westhoff (2006) showed that facing direction could still be identified reliably from spatially scrambled point-light displays containing no global spatiotemporal information. Instead, performance dropped to
chance level when the display was inverted, suggesting that inversion affected local cues for motion direction. In upright displays, local cues can be effective even when presented for a fraction (200 ms) of the complete gait cycle (Chang and Troje 2008). More importantly, a recent study (Chang and Troje 2009) showed that inversion affects identification accuracy only when the kinematics of the points is the natural one. This suggests that discriminal kinematic information present in upright displays is no longer available when the display is inverted. By analogy, one may conjecture that showing upside down the stimuli of Experiment 3 obscures the fact that the temporal relationships within and among articulatory gestures in backward stimuli are unnatural. If so, it should be more difficult to pick up the discriminating cue upon which the response strategy relies, and this should lead to a drop in accuracy. Experiment 4 was conducted to test this prediction. To afford a direct comparison with the results of Experiment 3, we used again the same experimental population.

Method

Participants

The same 10 individuals who served in Experiments 1 and 3.
Table 8 Experiment 4: probability of a correct answer P{C} for each actor and each text

Text   Actor 1   Actor 2   Actor 3   Actor 4   Actor 5
T1     0.465     0.495     0.560     0.550     0.490
T2     0.435     0.545     0.610     0.565     0.555
T3     0.495     0.615     0.550     0.520     0.465
T4     0.490     0.605     0.550     0.505     0.560
T5     0.555     0.535     0.625     0.630     0.645
Mean   0.488     0.559     0.579     0.554     0.543

Texts differ across actors

Table 9 Experiment 4: probability of an answer "Forward" P{F} for each actor and each text

Text   Actor 1   Actor 2   Actor 3   Actor 4   Actor 5
T1     0.495     0.635     0.480     0.600     0.380
T2     0.695     0.685     0.660     0.515     0.625
T3     0.565     0.535     0.440     0.520     0.635
T4     0.580     0.535     0.480     0.495     0.500
T5     0.645     0.605     0.535     0.330     0.655
Mean   0.596     0.579     0.519     0.492     0.559

Texts differ across actors
Task, stimuli, and procedure

The task was again to indicate whether the video-clip was played in backward or forward mode. The procedure was identical to that of Experiments 2 and 3. The stimuli were the same as in Experiment 3, the only difference being that they were rotated by 180° around the z-axis.

Results

Not surprisingly, participants confirmed that no word could be identified. The results (Tables 8 and 9) are presented with the same format as Experiments 2 and 3. Pooling responses over actors, texts, and participants (sample size = 5,000), the stimulus/response contingency matrix was: NF|F = 1,484, NF|B = 1,261, NB|F = 1,016, NB|B = 1,239; χ² = 40.17 (P < .001), P{C} = 0.545, P{F} = 0.549, d′ = 0.226. At the sample level, the direction of the stimuli was discriminated with an accuracy significantly higher than 0.5 (Two-tailed binomial test, P < .001) and the d′ index was significantly greater than 0 (One-tailed t test, t(9) = 2.486, P = .035). Moreover, there was again a significant bias in favor of the response "Forward" (Two-tailed binomial test, P < .001). Once again, the means of P{C} and P{F} for all combinations of Actor and Text were linearly independent (Pearson's r = -0.150, P = .474).
Separate statistical analyses (GLM with repeated measures, 5 [Actor] 9 5 [Text] with Text nested within Actor, arcsine transformation, Greenhouse–Geisser correction) showed that neither the factor Actor nor the factor Text modulated significantly P{C} (Actor: F(4,36) = 2.467, P = .062; Text: F(20,180) = 1.618, P = .053). The bias in favor of the response ‘‘Forward’’ (P{F} [ 0.5) did not depend on the actor (F(4,36) = 2.344, P = .073), but varied across texts (F(20,180) = 2.861, P \ .001). Table 10, which summarizes the response probabilities with the same format as Table 7 shows that the trend emerging from the population sample was also present in most individual performances. In all cases but one (S05), the probability of a correct answer P{C} dropped with respect to Experiment 3. As in Experiment 3, there was no evidence of learning effects across successive blocks of trials either on P{F} (F(4,36) = 0.117, P = .976) or P{C} (F(4,36) = 0.663, P = .556). The results of Experiments 3 and 4 were also analyzed jointly (GLM, repeated measures, 2 [Experiment] 9 5 [Actor] 9 5 [Text], with Text nested within Actor, arcsine transformation, Greenhouse–Geisser correction). Rotating the stimuli by 180° reduced the probability of a correct answer P{C} (Experiment: F(1,9) = 28.390, P \ .001), as well as the d0 index (Two-tailed t test for paired samples,
Table 10 Experiment 4: results for each participant

Participant    χ²         P        P{C}     P{F}     d′
S01            4.282      .038     0.454    0.554    0.000*
S02            14.832     <.001    0.586    0.526    0.435
S03            0.008      .928     0.502    0.454    0.010
S04            2.905      .088     0.538    0.538    0.192
S05            1.571      .210     0.528    0.520    0.141
S06            8.430      <.004    0.564    0.584    0.330
S07            0.207      .649     0.510    0.590    0.051
S08            3.286      .070     0.538    0.674    0.211
S09            105.802    <.001    0.730    0.502    1.226
S10            0.032      .857     0.544    0.560    0.224

χ², P: test values (exact χ² test) for individual 2 × 2 contingency tables (pooling over texts and actors). P{C}: probability of correct answer. P{F}: probability of answer "Forward." d′: discrimination index (Hit: P{F|F}, False Alarm: P{F|B}). Slightly negative values of d′ (*) have been set to 0
P{C} was also affected by Text (Text: F(20,180) = 1.885, P = .016). However, the two factors acted independently (Experiment × Text: F(20,180) = 1.333, P = .163). The analysis confirmed that Actor did not affect the ability to discriminate movement direction (Actor: F(4,36) = 2.318, P = .076), but interacted with Experiment (Experiment × Actor: F(4,36) = 5.802, P = .001). P{F} was not significantly different in the two experiments (Experiment: F(1,9) = 0.153, P = .704) and did not depend on the actor (Actor: F(4,36) = 2.344, P = .073).

Discussion

Rotating the stimuli upside down reduced the probability of a correct discrimination of movement direction. The drop was also present in individual performances (the linear regression P{C}4 = 0.716 × P{C}3 + 0.094 accounted for more than 70% of the variance). Moreover, the d′ index calculated on pooled responses decreased from 0.633 to 0.226 with respect to Experiment 3. In both Experiments 3 and 4, participants reported that they were not consciously able to identify any word. Even assuming that some unreported identifications did actually occur with inverted stimuli, it seems safe to credit most of the drop in performance to an increased difficulty in extracting the cues for discriminating backward and forward stimuli from the moving features of the lower face. This is in keeping with the assumption (see above) that a 180° rotation not only disrupts the spatial relationships between facial features and reference frame, but also blurs the temporal relationships within and among articulatory gestures. In the General Discussion, we will take up again the role of temporal asymmetries within articulatory phases in producing the observed behavior.
It should be clear, however, that the results are not specific enough to assess whether local or configural features are more affected. Perhaps this issue might be clarified by a future experiment contrasting the effect of rotating either the entire lower half of the face or just the mouth. It should also be stressed that, albeit strongly reduced, discriminability was not altogether suppressed. In particular, at least three participants (S02, S06, and S09), who discriminated best in Experiment 3 (Table 7), remained well above chance level even with inverted stimuli. Because the same individuals served in Experiment 4 after serving in Experiment 3, one could suspect learning effects due to repeated exposure to the stimuli. Transfer is, however, unlikely, because the stimuli in this experiment were rotated by 180°. Moreover, if learning had been instrumental for preserving some discriminability even with inverted stimuli, one would expect the worst performers to profit more from repetition than the best performers. Instead, the linear regression between the d′ values in Experiments 3 and 4, d′(4) = 0.624 × d′(3) - 0.136, accounted for almost 85% of the variance, meaning that individual differences in performance remained practically unchanged across conditions.
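For completeness, the regression between individual d′ values quoted above can be obtained as follows. The Experiment 4 values are taken from Table 10; the Experiment 3 values belong to Table 7, which is not reproduced in this section, so the second array below is only a placeholder illustrating the computation, not the actual data.

```python
import numpy as np
from scipy import stats

# Individual d' values for Experiment 4 (Table 10), participants S01-S10.
d4 = np.array([0.000, 0.435, 0.010, 0.192, 0.141, 0.330, 0.051, 0.211, 1.226, 0.224])
# Placeholder for the corresponding Experiment 3 values (see Table 7, not shown here).
d3 = np.array([0.30, 0.70, 0.35, 0.55, 0.45, 0.80, 0.40, 0.60, 1.60, 0.55])

fit = stats.linregress(d3, d4)
print(f"d'(4) = {fit.slope:.3f} * d'(3) + {fit.intercept:.3f}, R^2 = {fit.rvalue ** 2:.2f}")
```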
General discussion

Is visual perception tuned to biological movements? We addressed the question in the special case of speech movements by focusing on two measures of performance that may carry the imprint of such tuning: (1) accuracy and precision in detecting departures from the normal speech rate, and (2) the ability to discriminate normal and time-reversed speech movements. As for the first point, we reasoned that if articulatory gestures indeed enjoy a special perceptual status, disrupting their natural sequence by an inversion of movement direction should affect rate estimation. Support for this hypothesis was provided by Experiment 1, which demonstrated that the speech rate of time-reversed displays is underestimated by about 7% with respect to the speech rate of normal displays (Fig. 2). Differential thresholds were instead comparable (about 10%) for both backward and forward stimuli (Fig. 4). As for the second point, the perceptual tuning hypothesis predicts that time inversion is perceptually salient even though it does not entail a violation of any physical law. Together, Experiments 2 and 3 showed that this was the case. Both experiments revealed that forward and backward stimuli are discriminated well above chance. Moreover, by ruling out that speechreading alone can account for the performance, Experiment 3 suggested that the key discriminal cue is provided by kinematic features that are not time symmetric. Finally, Experiment 4 showed that the effectiveness of these cues for discriminating forward and backward displays is significantly reduced by a 180° rotation of the stimuli.
Estimating speech rate

In Experiment 1, the natural speech rate of each actor was demonstrated before the experimental session by showing the video-clips with the sound track. Moreover, each stimulus was presented many times in one experimental session. As a consequence, most participants reported that they had been able to identify a few words, either by memorizing the association between sound and movements or through speechreading. This, however, could not have biased the results, because the task instructions emphasized that attention had to be focused only on speech rate. In fact, no participant ever realized that speechreading was impossible for one half of the stimuli. The task required comparing the speech rate of a manipulated recording for each actor with some internal reference representing the normal rate of the same actor. Insofar as all stimuli had a constant duration of 4 s, perceived rate had to be inferred from the number and/or duration of the units of articulation (visemes) that the participant was able to parse (e.g., from the onset of the opening phase of the lips to the end of the closing phase). Assuming that syllables are physiologically based units of articulation (Krakow 1999), the durations to be judged would have fallen in the 180–250 ms range (see "Methods" of Experiment 1), where precision is fairly good (the Weber ratio is about 0.04, cf Gibbon et al. 1997). The subjective duration of intervals filled with task-irrelevant events is longer than that of empty intervals (the so-called filled-duration illusion, Buffardi 1971; Thomas and Brown 1974), the increase depending on the complexity of the perceptual processing required by the event (Burnside 1971; Cantor and Thomas 1976; Michon 1965; Ornstein 1969). By analogy, the underestimation of speech rate with reversed stimuli may be explained by assuming that the subjective duration of the visible units of articulation depends on the amount of processing required to decode the unit content. If so, one can speculate further that articulatory sequences compatible with our motor competence are processed in a more direct, automatic way than incompatible sequences. This hypothesis is in keeping with the observation that, when subjects view and listen to speaking faces, incongruent stimuli elicit stronger cerebral activity in Broca's area than congruent stimuli (Ojanen et al. 2005). By contrast, the notion that rate underestimation with reverse stimuli reflects the high computational load required to perceive lip gestures incompatible with motor competence seems at odds with the common observation that speech in an unfamiliar language seems faster than in the native one. However, the contradiction may be only apparent. A recent study (Pfitzinger and Tamashima 2006), in which German and Japanese listeners judged the speech rates of both languages, offers two alternative accounts for the overestimation of speech rate in an unfamiliar language.
One line of argument is that, when listening to an unfamiliar language, people unconsciously increase the number of phonetic units by inserting additional items to fit the input within the phonotactic structure of their native language. Alternatively, one may argue that the difference between known and unknown languages is that only in the former case can the number of speech items to be processed be reduced by selectively pruning redundant information. Both the fill-in mechanism advocated by the first account and the pruning mechanism advocated by the second make the sensible assumption that some phonetic parsing occurs even when one is confronted with a foreign language. In either case, the perceived high speech rate in an unfamiliar language is supposed to depend on the number of speech items to be processed rather than on the complexity of the processing. Neither of the two mechanisms suggested by Pfitzinger and Tamashima (2006) to explain the illusory rate acceleration, however, is likely to be triggered by our reverse visual stimuli. On the one hand, the processes that make language understandable when only acoustic stimuli are available cannot be directly activated, because the stimuli are silent. Moreover, even if we assume that a phonetic analysis is nevertheless attempted on the basis of purely visual information with the help of implicit motor competence, the observed underestimation of the rate cannot depend on the number of units to be processed because, unlike speech sounds in a foreign language, reverse stimuli can never be parsed into phonetically meaningful units. In summary, we argue that the rate overestimation experienced with unfamiliar languages and the rate underestimation demonstrated by Experiment 1 have different causes and are not mutually incompatible.

Discriminating backward and forward stimuli

In Experiment 2, being able to speechread correctly at least one word in forward stimuli would immediately lead to a correct response. We have no direct estimate of how frequently this happened. Even the high hit rates P{F|F} of the best performers (Table 4) are compatible with a probability of speechreading one word that falls within the range often cited in the literature (e.g., Bernstein et al. 2000). Actually, because stimuli were repeated several times in a session, the range might be revised upward (Breeuwer and Plomp 1986). Nevertheless, comparing the results of Experiments 2 and 3 suggests that the contribution of speechreading to the performance was modest. Indeed, when the number of repetitions of each text was reduced from 50 (Experiment 2) to 10 (Experiment 3), and the text sample was increased fivefold, observers reported that speechreading had become next to impossible.
Table 11 Experiments 2–4: summary of discrimination performance

          P{C}            P{F|F}          P{B|B}          P{F}            d′              P{F|B}/P{B|F}
Exp. 2    0.648 (0.044)   0.716 (0.040)   0.580 (0.051)   0.568 (0.019)   0.876 (0.258)   1.886 (2.062)
Exp. 3    0.623 (0.027)   0.678 (0.033)   0.568 (0.041)   0.555 (0.026)   0.671 (0.166)   1.475 (0.549)
Exp. 4    0.545 (0.024)   0.594 (0.026)   0.496 (0.031)   0.549 (0.019)   0.282 (0.113)   1.281 (1.278)

P{C}: probability of correct answer. P{F|F}: probability of correct answer for forward stimuli. P{B|B}: probability of correct answer for backward stimuli. P{F}: probability of answer "Forward." d′: discrimination index (Hit: P{F|F}, False Alarm: P{F|B}). P{F|B}/P{B|F}: ratio of error rates. Means and standard errors (in parentheses) over all participants. Because of the nonlinearity of the computation, the means of individual d′s are slightly different from the d′ values for the population
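The footnote's remark about the nonlinearity of d′ can be checked with the Experiment 4 data: averaging the individual d′ values of Table 10 gives 0.282, whereas the d′ computed from the pooled hit and false-alarm rates is 0.226. A minimal sketch of the comparison, assuming scipy:

```python
import numpy as np
from scipy.stats import norm

# Individual d' values from Table 10 (Experiment 4).
d_individual = np.array([0.000, 0.435, 0.010, 0.192, 0.141, 0.330, 0.051, 0.211, 1.226, 0.224])
print(f"mean of individual d': {d_individual.mean():.3f}")          # ~0.282, as in Table 11

# d' recomputed from the pooled rates of Experiment 4 (Hit = 1484/2500, FA = 1261/2500).
hit, fa = 1484 / 2500, 1261 / 2500
print(f"d' from pooled rates : {norm.ppf(hit) - norm.ppf(fa):.3f}")  # ~0.226
# The two differ because the z-transform underlying d' is nonlinear.
```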
More importantly, only in Experiment 2 did we find a significant increase in the rate of correct responses in the course of the session, which is likely to reflect an effect of repetition on the ability to lip-read isolated words. Yet, discriminability decreased only slightly (Table 11). Word recognition may have occurred occasionally without being reported. This, however, would be true for both experiments and would not explain why word recognition in Experiment 2 did not result in a significantly better performance. Apparently, although occasional word recognition did occur as reported, it was not consistently used as a positive cue to infer movement direction. Note also that speechreading cues are potentially misleading, because words shown in backward mode are sometimes identified with nonwords (Paulesu et al. 2003; pilot experiment). Therefore, the false alarm rate P{F|B} in Experiment 2 may have been inflated by these wrong identifications. In short, comparing the results of Experiments 2 and 3 leads one to emphasize, in both conditions, the discriminating role of the kinematics of the articulatory sequences. Specifically, participants may be able to detect whether or not articulatory sequences are compatible with their motor competence. Correct detection can be achieved in many ways that are not mutually exclusive. In line with the results of Campbell et al. (2001), one may assume that only forward stimuli are able to activate the processes that in normal speech are conducive to decoding verbal messages. More specifically, correct "Forward" responses would be triggered by a "resonance" mechanism similar to the one that has been claimed to be subserved by the mirror-neuron system (Rizzolatti and Craighero 2004), and correct "Backward" responses would be triggered when the expected resonance fails to materialize. Alternatively, one may suppose that kinematic profiles at variance with the forward ones to which the perceptual system is tuned become salient and are used as "negative cues," leading to a correct "Backward" response. If so, correct "Forward" responses would be triggered when the observer fails to detect any anomaly in the display. Clearly, the question cannot be adjudicated unless independent evidence on the underlying pattern of cortical activation becomes available. Note, however, that the first scenario is favored by the observation that the hit rate was higher with forward than with backward stimuli (P{F|F} > P{B|B}, see Table 11).
In the following section, we shall try to be more specific about the possible nature of the relevant kinematic cues. The performance in the scaling task (Experiment 1) did not reliably predict the performance in the 2AFC task (Experiment 3). This is not incompatible with the hypothesis that in both cases perceptual and motor factors interacted in shaping the pattern of results. As argued above, in the 2AFC task, local cues are involved in response selection, either because they are detected or because they fail to be detected. In contrast, matching the reference speech rate in the scaling task involves some global estimate of the pace at which the salient phases of the movement are chained. Thus, because the tasks probed complementary but different perceptual mechanisms, it may be expected that individual performances are only loosely correlated.

Constraints and temporal asymmetries

At least two types of kinematic features may play a role in rate estimation (Experiment 1) and direction discrimination (Experiments 2, 3, and 4). A ubiquitous strategy adopted by the nervous system to simplify the control of systems with many degrees of freedom is to introduce constraints among the system components (see Bernstein 1967; Turvey 1977). Several studies have provided evidence of such constraints in speech. It has been shown that the temporal relationship between consonant- and vowel-related activities remains stable over speaking rate and syllable stress (Kelso et al. 1986; Tuller et al. 1982, 1983), possibly in a speaker-specific manner (Shaiman et al. 1995). It has also been shown that lip and jaw movements occur with fixed timing relationships with respect to laryngeal (Gracco and Löfqvist 1994) and velar (Kollia et al. 1995) movements, and that the coupling gets progressively tighter during development (Smith and Goffman 1998; see, however, Shaiman et al. 1995; Tasko and McClean 2004). Although temporal coupling, per se, is invariant with respect to time reversals, the degree of coupling may vary in a time-asymmetric way, depending on the functional characteristics of the whole action.
This is indeed the case for oral movements in the production of bilabial stop consonants, where the peak velocities of the upper lip, lower lip, and jaw during the closing phase occur in a rigid temporal sequence (Caruso et al. 1988; Gracco and Abbs 1986), whereas their temporal relation is much more flexible during the successive phases (e.g., Fig. 3 in Gracco 1988). By extension, a similar asymmetry may also be expected whenever, in order to attain a specific acoustic target, the timing for reaching an apical position of the articulators is more crucial than that of the return phase. If so, a first discriminal cue could be the order in which a phase of loose coupling follows a phase of tight coupling (Paulesu et al. 2003). A second source of discriminal information may arise from asymmetries in the time course of the kinematic variables. Evidence of such asymmetries, already reported in the movements of the tongue blade (Tasko and McClean 2004) and lips (Gracco 1988; Smith and Goffman 1998), was present also in our stimuli. We analyzed lip movements for syllables that include the bilabial stop consonant /p/ by considering both displacement and velocity traces. In at least four traces, there was a marked asymmetry between the opening and closing phases of the mouth. Of course, there may exist other asymmetries in addition to those mentioned above. However, these two cues seem conspicuous enough to be picked up by the perceptual system and to provide the basis for detecting time reversals (Fig. 5). Why was discriminability virtually suppressed by a 180° rotation (Experiment 4)? Rotations dramatically affect face recognition (e.g., Yin 1969), either because inversion disrupts the spatial relationship between critical facial features (Diamond and Carey 1986; Leder and Bruce 2000; see, however, Valentine 1988) or because a standard frame of reference is no longer available (Farah et al. 1995). Moreover, inversion has a dramatic effect both upon the perception of expressions (Kohler 1940) and upon the identification of the role of facial features within the context of the face (Thompson 1980; Valentine and Bruce 1985). More importantly, inversion has a detrimental effect on the detection of biologicalness. Point-like displays à la Johansson (1973) of facial movements suppress local features (Bruce and Valentine 1988), but still convey enough dynamical information to discriminate expressions (Bassili 1978, 1979). At the same time, turning upside down point-like displays of a moving human body has a detrimental effect on the ability to extract information that is normally available from upright displays (Mitkin and Pavlova 1990; Pavlova and Sokolov 2000; Sumi 1984), suggesting that inversion affects the integration of temporal cues into a dynamical configuration. One may therefore surmise that the lack of a canonical frame of reference almost suppressed the effectiveness with which temporal asymmetries are exploited for discriminating normal and reversed stimuli.
Temporal asymmetries cannot explain the bias in favor of the response "Forward" present in Experiments 2, 3, and 4 (Table 11; averaged across conditions, P{F} = 0.554). Because individual biases were almost insensitive to the experimental manipulations affecting discrimination, and highly correlated (compare Tables 7, 10), one may suspect an entrenched cognitive preconception. Apparently, in a number of trials, the very fact that the stimuli always represented a talking face overrode the perceptual evaluation of the evidence (positive or negative) concerning the direction of the time arrow. Of course, the bias may have inflated the hit rate P{F|F} and the imbalance between the error rates P{F|B} and P{B|F} (Table 11). In the final section, we discuss the mechanisms that may permit the observer to use kinematic cues for discriminating normal and reversed stimuli.

Familiarity vs motor competence

We are continuously exposed to visual displays of speech movements, whereas time-reversed displays of these movements are hardly ever seen. Therefore, before arguing for a possible role of motor competence, one needs to consider the more parsimonious hypothesis that familiarity was the major determinant of the performance. Because familiar stimuli are expected to be easier to discriminate than unfamiliar ones, one expects a higher hit rate for forward than for backward displays, i.e., P{F|F} > P{B|B}. The results of Experiments 2 and 3 are in keeping with this expectation inasmuch as the ratio P{F|F}/P{B|B} is equal to 1.234 and 1.194, respectively (Table 11). In contrast, familiarity cannot account for the results of Experiment 4. Indeed, an inverted display of the mouth region can hardly be construed as a familiar stimulus. Yet, the ratio P{F|F}/P{B|B} is virtually identical (1.198), suggesting that the bias in favor of forward displays has a common origin in all conditions that is not likely to be familiarity. Familiarity is also unlikely to be responsible for the rate underestimation of reverse stimuli in Experiment 1. Although familiar stimuli are better remembered than unfamiliar ones, this cannot have had a selective impact on reverse stimuli, because the rate of all stimuli had to be compared with the memory trace of the reference, which in all cases was a normal, forward display. In conclusion, the hypothesis that scaling and discrimination performance reflected processing differences between familiar (forward) and unfamiliar (backward) stimuli seems insufficient to account for the pattern of results. Below, we outline a more specific articulation of the notion of action–perception coupling that may better accommodate the available evidence.
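The ratios quoted in the preceding paragraph, and one conventional way of quantifying the "Forward" bias itself, can be derived from the mean rates of Table 11. The sketch below also computes the signal-detection criterion c, which is not reported in the paper and is added here only as a standard way of separating bias from sensitivity; negative values of c indicate a liberal tendency to answer "Forward".

```python
from scipy.stats import norm

# Mean hit and false-alarm rates from Table 11 (Hit = P{F|F}, FA = 1 - P{B|B}).
rates = {"Exp. 2": (0.716, 1 - 0.580),
         "Exp. 3": (0.678, 1 - 0.568),
         "Exp. 4": (0.594, 1 - 0.496)}

for exp, (hit, fa) in rates.items():
    ratio = hit / (1 - fa)                        # P{F|F} / P{B|B}: 1.234, 1.194, 1.198
    c = -0.5 * (norm.ppf(hit) + norm.ppf(fa))     # SDT criterion; c < 0 = "Forward" bias
    print(f"{exp}: P(F|F)/P(B|B) = {ratio:.3f}, c = {c:.2f}")
```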
Fig. 5 Experiment 2: kinematics of lip movements. a Two frames from the video-clips used for Experiments 1 and 2. Initial and final phase of lip closure for articulating the bilabial stop consonant /p/. Dots mark the positions selected for measuring lip aperture. Sampling rate: 500 Hz. To compute velocity, raw measures were smoothed (cutoff frequency: 75 Hz) and differentiated with a minimum-shift algorithm. b Lip aperture during the articulation of the indicated syllables for each actor. Minima correspond to the full closure of the lips. c Velocity of lip aperture. Note the large temporal asymmetry.
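To make the processing described in the caption concrete, the sketch below smooths a lip-aperture trace and differentiates it numerically. The 500 Hz sampling rate and the 75 Hz cutoff are taken from the caption; the Butterworth filter and the central-difference derivative are generic stand-ins for the smoothing and minimum-shift differentiation actually used, so the snippet is only illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500.0        # sampling rate of the digitized lip positions (Hz, from the caption)
CUTOFF = 75.0     # low-pass cutoff quoted in the caption (Hz)

def lip_velocity(aperture):
    """Low-pass filter a raw lip-aperture trace and differentiate it (units/s)."""
    b, a = butter(2, CUTOFF / (FS / 2.0))    # 2nd-order Butterworth low-pass
    smooth = filtfilt(b, a, aperture)        # zero-phase filtering (no time shift)
    return np.gradient(smooth, 1.0 / FS)     # central-difference derivative

# Synthetic opening-closing cycle, only to exercise the function.
t = np.arange(0.0, 0.4, 1.0 / FS)
velocity = lip_velocity(20.0 + 5.0 * np.sin(2.0 * np.pi * 3.0 * t))
```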
Liberman's original theory of speech perception has been criticized, mainly on the grounds that certain features of speech cited as supporting evidence are shared also by non-human communication systems (e.g., Sussman 1989). Nevertheless, a recent reassessment of the theory (Galantucci et al. 2006) has argued that its core assumption, namely that perceiving speech is perceiving vocal tract gestures, still provides the single most coherent account of several pieces of behavioral evidence. Support for this core assumption has recently been provided by showing that the motor circuits controlling speech production also contribute to the categorical perception of phonemes (Möttönen and Watkins 2009). Renewed interest in the motor theory of language perception has also been sparked by the discovery of the so-called mirror-neuron system (cf Rizzolatti and Craighero 2004). The original finding that a class of neurons in area F5 of the monkey pre-motor cortex discharges both when the animal performs a specific action and when it observes the same action being performed by another agent (Di Pellegrino et al. 1992) linked motor competence to visual perception.
However, recent developments (cf Nishitani et al. 2005) suggest the possibility of subsuming within a common framework both Liberman's intuition that articulatory competence is instrumental for language perception, and the notion that certain motor structures subserve a representational function for the understanding of visually perceived motor actions. On the one hand, the concept of action–perception coupling has been generalized beyond the acoustic domain by demonstrating connections between cortical processes involved in the visual perception of speech acts and cortical processes involved in the perception and production of language (Ojanen et al. 2005; Pulvermüller et al. 2006).
Moreover, the original suggestion (Summerfield 1987) that the perception-by-synthesis mechanism at the core of Liberman's proposal is at work not only with acoustic language-related inputs, but with visual language-related inputs as well was later supported by Kerzel and Bekkering (2000), who interpreted the observed interference effects between visible speech and vocal utterances as evidence that even purely visual stimuli activate the implicit motor competence underlying speech production. On the other hand, the functional role of the mirror-neuron system has been broadened by demonstrating that acoustic inputs characteristically related to a specific action, such as the cracking noise of a peanut's shell, also activate the same neurons that discharge when performing that action (Kohler et al. 2002; Keysers et al. 2003). This has invited the inference that the mechanisms for action recognition played a role in the evolution of speech (Rizzolatti and Arbib 1998). More importantly, it has been explicitly suggested that the mirror-neuron system affords the much sought-after neurophysiological underpinning for Liberman's theory (Galantucci et al. 2006). The rather vague notions of resonance and perceptual tuning evoked previously may therefore correspond to the selectivity with which action-related networks respond to specific features of the inputs. This interpretation evokes an issue that is still debated, namely whether the motor competence tapped in processing speech-specific visual stimuli is itself specific to speech [the so-called speech-is-special view (Liberman 1996; Santi et al. 2003)] or is instead a particular instance of a general-purpose competence available for processing any gesture (Galantucci et al. 2009; Kerzel and Bekkering 2000). We assumed that the misperception of speech rate with backward stimuli and the ability to discriminate backward and forward stimuli are both consequences of the different way in which temporal asymmetries in articulatory gestures are processed perceptually. Specifically, we assumed that asymmetries that are coherent with production rules are less computationally demanding than asymmetries that are at variance with them (Experiment 1). Moreover, we also assumed that the difference between the two computational loads is conspicuous and affords a cue to identify the direction of the arrow of time (Experiment 3). Both assumptions are in keeping with the speech-is-special view. It should be stressed, however, that our results, based only on speech movements, are insufficient to address this issue directly. To do so, future research might extend the experimental conditions in at least two directions, namely by comparing performances with speech and non-speech mouth movements, and by showing that 180° rotations also affect the detection of discriminal cues in tasks engaging both visual and acoustic speech perception.
Acknowledgments This research was supported by grants from the Italian Space Agency (Disturbi Controllo Motorio e Cardiorespiratorio), the Italian Ministry of University and Research (Programmi di ricerca di Rilevante Interesse Nazionale), and the Italian Ministry of Health (Ricerca Corrente).
References

Auer ET, Bernstein LE (1997) Speechreading and the structure of the lexicon: computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness. J Acoust Soc Am 102:3704–3710
Bassili JN (1978) Facial motion in the perception of faces and emotional expressions. J Exp Psychol Hum Percept Perform 4:373–379
Bassili JN (1979) Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face. J Pers Soc Psychol 37:2049–2058
Bernstein N (1967) The coordination and regulation of movements. Pergamon Press, London
Bernstein LE, Demorest ME, Tucker PE (2000) Speech perception without hearing. Percept Psychophys 62:233–252
Blake R, Shiffrar M (2007) Perception of human motion. Annu Rev Psychol 58:47–73
Blakemore S-J, Frith C (2005) The role of contagion in the prediction of action. Neuropsychologia 43:260–267
Breeuwer M, Plomp R (1986) Speechreading supplemented with auditorily presented speech parameters. J Acoust Soc Am 79:481–499
Bruce V, Valentine T (1988) When a nod's as good as a wink: the role of dynamic information in facial recognition. In: Gruneberg MM, Morris PE, Skyes RN (eds) Practical aspects of memory: current research and issues, vol 1. Wiley, Chichester, pp 169–174
Buffardi L (1971) Factors affecting the filled-duration illusion in the auditory, tactual, and visual modalities. Percept Psychophys 10:292–294
Burnside W (1971) Judgment of short time intervals while performing mathematical tasks. Percept Psychophys 9:404–406
Calvert GA, Campbell R (2003) Reading speech from still and moving faces: the neural substrates of visible speech. J Cogn Neurosci 15:57–70
Calvo-Merino B, Ehrengerg S, Leung D, Haggard P (2010) Experts see it all: configural effects in action observation. Psychol Res 74:400–406
Campbell R (2008) The processing of audio-visual speech: empirical and neural bases. Philos Trans R Soc B 363:1001–1010
Campbell R, De Gelder B, De Haan E (1996) The lateralization of lipreading: a second look. Neuropsychologia 34:1235–1240
Campbell R, MacSweeney M, Surguladze S, Calvert G, McGuire P, Suckling J, Brammer MJ, David AS (2001) Cortical substrates for the perception of face actions: an fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cogn Brain Res 12:233–243
Cantor NE, Thomas EC (1976) Visual masking effects on duration, size, and form discrimination. Percept Psychophys 19:321–327
Caruso AJ, Abbs JH, Gracco VL (1988) Kinematic analysis of multiple movement coordination during speech in stutterers. Brain 111:439–455
Chang DHF, Troje NF (2008) Perception of animacy and direction from local biological motion signals. J Vis 8:1–10
Chang DHF, Troje NF (2009) Acceleration carries the local inversion effect in biological motion perception. J Vis 9:1–17
de'Sperati C, Viviani P (1997) The relationship between curvature and velocity in two-dimensional smooth pursuit eye movements. J Neurosci 15:3932–3945
Dekle DJ, Fowler CA, Funnell MG (1992) Audiovisual integration in perception of real words. Percept Psychophys 51:355–362
Demorest ME, Bernstein LE (1992) Sources of variability in speechreading sentences: a generalizability analysis. J Speech Hear Res 35:876–891
Di Pellegrino G, Fadiga L, Fogassi L, Gallese V, Rizzolatti G (1992) Understanding motor events: a neurophysiological study. Exp Brain Res 91:176–180
Diamond R, Carey S (1986) Why faces are and are not special: an effect of expertise. J Exp Psychol Gen 115:107–117
Eefting W, Rietveld A (1989) Just noticeable differences of articulation rate at the sentence level. Speech Commun 8:355–361
Eklund R (2008) Pulmonic ingressive phonation: diachronic and synchronic characteristics, distribution and function in animal and human sound production and in human speech. J Int Phonetic Ass 38:235–324
Elphick R (1996) Issues in comparing the speechreading abilities of hearing-impaired and hearing 15 to 16 year-old pupils. Br J Educ Psychol 66:357–365
Fadiga L, Craighero L, Buccino G, Rizzolatti G (2002) Speech listening specifically modulates the excitability of tongue muscles: a TMS study. Eur J Neurosci 15:399–402
Farah MJ, Tanaka JN, Drain M (1995) What causes the face inversion effect. J Exp Psychol Hum 21:628–634
Feld JE, Sommers MS (2009) Lipreading, processing speed, and working memory in younger and older adults. J Speech Lang Hear R 52:1555–1565
Fisher CG (1968) Confusion among visually perceived consonants. J Speech Hear Res 11:796–804
Freyd JJ (1987) Dynamic mental representations. Psychol Rev 94:427–438
Freyd JJ, Finke RA (1984) Representational momentum. J Exp Psychol Learn 10:126–132
Fridriksson J, Moss J, Davis B, Baylis GC, Bonilha L, Rorden C (2008) Motor speech perception modulates the cortical language areas. Neuroimage 41:605–613
Galantucci B, Fowler CA, Turvey MT (2006) The motor theory of speech perception reviewed. Psychon B Rev 13:361–377
Galantucci B, Fowler CA, Goldstein L (2009) Perceptuomotor compatibility effects in speech. Atten Percept Psycho 71:1138–1149
Gibbon J, Malapani C, Dale CL, Gallistel CR (1997) Toward a neurobiology of temporal cognition: advances and challenges. Curr Opin Biol 2:170–184
Gracco VL (1988) Timing factors in the coordination of speech movements. J Neurosci 8:4628–4639
Gracco VL, Abbs JH (1986) Variant and invariant characteristics of speech movements. Exp Brain Res 65:156–166
Gracco VL, Löfqvist A (1994) Speech motor coordination and control: evidence from lip, jaw, and laryngeal movements. J Neurosci 14:6585–6597
Grasso R, Bianchi L, Lacquaniti F (1998) Motor patterns for human gait: backward versus forward locomotion. J Neurophysiol 80:1868–1885
Jacobs A, Pinto J, Shiffrar M (2004) Experience, context, and the visual perception of human movement. J Exp Psychol Hum 30:822–835
Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14:201–211
Kelso JAS, Saltzman EL, Tuller B (1986) The dynamical perspective on speech production. J Phonetics 14:29–59
Kerzel D, Bekkering H (2000) Motor activation from visible speech: evidence from stimulus response compatibility. J Exp Psychol Hum 26:634–647
Keysers C, Kohler E, Umiltà MA, Nanetti L, Fogassi L, Gallese V (2003) Audiovisual mirror neurons and action recognition. Exp Brain Res 153:628–636
Kohler W (1940) Dynamics in psychology. Liveright, New York
Kohler E, Keysers C, Umiltà MA, Fogassi L, Gallese V, Rizzolatti G (2002) Hearing sounds, understanding actions: action representation in mirror neurons. Science 297:846–848
Kollia HB, Gracco VL, Harris KS (1995) Articulatory organization of mandibular, labial, and velar movements during speech. J Acoust Soc Am 98:1313–1324
Krakow RA (1999) Physiological organization of syllables: a review. J Phonetics 27:23–54
Leder H, Bruce V (2000) When inverted faces are recognized: the role of configural information in face recognition. Q J Exp Psychol 53:513–536
Lehiste I (1970) Suprasegmentals. MIT Press, Cambridge
Liberman AM (1996) Speech: a special code. MIT Press, Cambridge
Liberman AM, Mattingly IG (1985) The motor theory of speech perception revised. Cognition 21:1–36
Liberman AM, Whalen DH (2000) On the relation of speech to language. Trends Cogn Sci 4:187–196
Liberman AM, Cooper FS, Shankweiler DP, Studdert-Kennedy M (1967) Perception of speech code. Psychol Rev 74:431–461
Lyxell B, Rönnberg J (1992) The relation between verbal ability and sentence-based speechreading. Scand Audiol 21:67–72
Mach E (1885) Beiträge zur Analyse der Empfindungen [English translation: Contributions to the analysis of sensations. Open Court (1897), La Salle, IL]
MacMillan NA, Creelman CD (2005) Detection theory: a user's guide, 2nd edn. Lawrence Erlbaum Associates, New York, London
Massaro DW (1998) Perceiving talking faces: from speech perception to a behavioral principle. MIT Press, Cambridge
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Michon JA (1965) Studies in subjective duration. II. Subjective time measurements during task with different information content. Acta Psychol 24:205–219
Middleweerd MJ, Plomp R (1987) The effect of speechreading on the speech-reception threshold of sentences in noise. J Acoust Soc Am 82:2145–2147
Mitkin AA, Pavlova MA (1990) Changing a natural orientation: recognition of biological motion patterns by children and adults. Psychol Beitr 32:28–35
Möttönen R, Watkins KE (2009) Motor representations of articulators contribute to categorical perception of speech sounds. J Neurosci 29:9819–9825
Mowbray GH, Gebhard JW (1955) Differential sensitivity of the eye to intermittent white light. Science 121:173–175
Munhall KG, Ostry DJ, Parush A (1985) Characteristics of velocity profiles of speech movements. J Exp Psychol Hum 11:457–474
Nishitani N, Hari R (2002) Viewing lip forms: cortical dynamics. Neuron 36:1211–1220
Nishitani N, Schürmann N, Amunts K, Hari R (2005) Broca's region: from action to language. Physiology 20:60–69
Noteboom S, Eefting W (1994) Evidence of adaptive nature of speech on the phrase level and below. Phonetica 51:92–98
Ojanen V, Möttönen R, Pekkola J, Jääskeläinen IP, Joensuu R, Autti T, Sams M (2005) Processing of audiovisual speech in Broca's area. Neuroimage 25:333–338
Ornstein R (1969) On the experience of time. Penguin Books, Baltimore
Ostry DJ, Munhall KG (1985) Control of rate and duration of speech movements. J Acoust Soc Am 77:640–648
Paulesu E, Perani D, Blasi V, Silani G, Borghese NA, De Giovanni U, Sensolo S, Fazio F (2003) A functional-anatomical model for lip-reading. J Neurophysiol 90:2005–2013
Pavlova M, Sokolov A (2000) Orientation specificity in biological motion perception. Percept Psychophys 62:889–899
Pekkola J, Ojanen V, Autti T, Jääskeläinen IP, Möttönen R, Tarkiainen A, Sams M (2005) Primary auditory cortex activation by visual speech: an fMRI study at 3T. Neuroreport 16:125–128
Pfitzinger HR, Tamashima M (2006) Comparing perceptual local speech rate of German and Japanese speech. In: Proceedings of the 3rd international conference on speech prosody, pp 105–108
Pulvermüller F, Huss M, Kherif F, Moscoso del Prado Martin F, Hauk O, Shtyrov Y (2006) Motor cortex maps articulatory features of speech sounds. Proc Natl Acad Sci USA 103:7865–7870
Quené H (2007) On the just noticeable difference for tempo in speech. J Phonetics 35:353–362
Rizzolatti G, Arbib MA (1998) Language within our grasp. Trends Neurosci 21:188–194
Rizzolatti G, Craighero L (2004) The mirror-neuron system. Annu Rev Neurosci 27:169–192
Rönnberg J (1995) What makes a skilled speechreader? In: Plant G, Spens K (eds) Profound deafness and speech communication. Whurr Publications, London, pp 393–416
Rönnberg J, Samuelsson S, Lyxell B (1998) Conceptual constraints in sentence-based lipreading in hearing-impaired. In: Campbell R, Dodds D, Burnham DK (eds) Hearing by eye II: advances in the psychology of speechreading and auditory-visual speech. Psychology Press, Hove, pp 143–153
Rosenblum LD, Saldaña HM (1996) An audiovisual test of kinematic primitives for visual speech perception. J Exp Psychol Hum 22:318–331
Ruytjens L, Albers F, van Dijk P, Wit H, Willemsen A (2006) Neural responses to silent lip-reading in normal hearing male and female subjects. Eur J Neurosci 24:1835–1844
Sams M, Aulanko R, Hämäläinen M, Hari R, Lounasmaa OV, Lu S-T, Simola J (1991) Seeing speech: visual information from lip movement modifies activity in the human auditory cortex. Neurosci Lett 127:141–145
Santi A, Servos P, Vatikiotis-Bateson E, Kuratate T, Munhall K (2003) Perceiving biological motion: dissociating visible speech from walking. J Cogn Neurosci 15:800–809
Shaiman S, Adams SG, Kimelman MDZ (1995) Timing relationships of the upper lip and jaw across changes in speaking rate. J Phonetics 23:119–128
Shepard RN (1984) Ecological constraints on internal representation: resonant kinematics of perceiving, imagining, thinking, and dreaming. Psychol Rev 91:417–447
Shepard RN (1994) Perceptual-cognitive universals as reflections of the world. Psychon B Rev 1:2–28
Simion F, Regolin L, Bulf H (2008) A predisposition for biological motion in the newborn baby. Proc Natl Acad Sci USA 105:809–813
Smith A, Goffman L (1998) Stability and patterning of speech movement sequences in children and adults. J Speech Lang Hear R 41:18–30
Stevens JC, Shickman GM (1959) The perception of repetition rate. J Exp Psychol 58:433–440
Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215
Sumi S (1984) Upside-down presentation of the Johansson moving light-spot pattern. Perception 13:283–286
Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd D, Campbell R (eds) Hearing by eye: the psychology of lip-reading. Lawrence Erlbaum Associates, London, pp 3–51
Summerfield Q (1991) Visual perception of phonetic gestures. In: Mattingly IG, Studdert-Kennedy M (eds) Modularity and the motor theory of speech perception. Erlbaum, Hillsdale, pp 117–137
Summerfield Q (1992) Lipreading and audio-visual speech perception. Philos Trans R Soc Lon B 335:71–78
Sussman H (1989) Neural coding of relational invariance in speech: human language analogue to the barn owl. Psychol Rev 96:631–642
Tasko SM, McClean MD (2004) Variations in articulatory movements with changes in speech task. J Speech Lang Hear R 47:85–100
Thomas EC, Brown I Jr (1974) Time perception and the filled-duration illusion. Percept Psychophys 16:449–458
Thompson P (1980) Margaret Thatcher: a new illusion. Perception 9:483–484
Thorstensson A (1986) How is the normal locomotor program modified to produce backward walking? Exp Brain Res 61:644–648
Troje NF, Westhoff C (2006) The inversion effect in biological motion perception: evidence for a "life detector"? Curr Biol 16:821–824
Tuller B, Kelso JAS, Harris KS (1982) Interarticular phasing as an index of temporal regularity in speech. J Exp Psychol Hum 8:460–472
Tuller B, Kelso JAS, Harris KS (1983) Converging evidence for the role of relative timing in speech. J Exp Psychol Hum 9:829–833
Turner TH, Fridriksson J, Baker J, Eoute D Jr, Bonilha L, Rorden C (2009) Obligatory Broca's area modulation associated with passive speech perception. Neuroreport 20:492–496
Turvey MT (1977) Preliminaries to a theory of action with reference to vision. In: Shaw R, Bransford J (eds) Perceiving, acting, and knowing: towards an ecological psychology. Erlbaum, Hillsdale, pp 211–263
Valentine T (1988) Upside-down faces: a review of the effect of inversion upon face recognition. Br J Psychol 79:471–491
Valentine T, Bruce V (1985) What's up? The Margaret Thatcher illusion revisited. Perception 14:515–516
Viviani P (2002) Motor competence in the perception of dynamic events: a tutorial. In: Prinz W, Hommel B (eds) Attention & performance XIX: common mechanisms in perception and action. Oxford University Press, Oxford, pp 406–442
Viviani P, Stucchi N (1989) The effect of movement velocity on form perception: geometric illusions in dynamic displays. Percept Psychophys 46:266–274
Viviani P, Stucchi N (1992) Biological movements look constant: evidence of motor-perceptual interactions. J Exp Psychol Hum 18:603–623
Viviani P, Baud-Bovy G, Redolfi M (1997) Perceiving and tracking kinesthetic stimuli: further evidence of motor-perceptual interactions. J Exp Psychol Hum 23:1232–1252
Viviani P, Figliozzi F, Campione GC, Lacquaniti F (2011) Detecting temporal reversals in human locomotion. Exp Brain Res 214:93–103
Watkins KE, Strafella AP, Paus T (2003) Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia 41:989–994
Welch RB, DuttonHurt LD, Warren DH (1986) Contribution of audition and vision to temporal rate perception. Percept Psychophys 39:294–300
Wilson SM, Saygin AP, Sereno MI, Iacoboni M (2004) Listening to speech activates motor areas involved in speech production. Nat Neurosci 7:701–702
Yin RK (1969) Looking at upside down faces. J Exp Psychol 81:141–145