International Journal of Speech Technology 2, 201-214 (1998). © 1998 Kluwer Academic Publishers. Manufactured in The Netherlands.
An Evaluation of the Diagnostic Rhyme Test

STEVEN L. GREENSPAN
AT&T Labs-Research, 180 Florham Park, NJ 07932
[email protected]

RAYMOND W. BENNETT
Ameritech, Hoffman Estates, IL 60196
[email protected]

ANN K. SYRDAL
AT&T Labs-Research, 180 Florham Park, NJ 07932
[email protected]
Received January 30, 1998; Revised ; Accepted February 27, 1998
Abstract.
The intelligibility of a speech output device is an important predictor of user acceptability. The Diagnostic Rhyme Test (DRT) is an ANSI standard for measuring speech intelligibility (ANSI S3.2-1989). In the DRT, respondents hear a word and choose its equivalent from two visually presented words. The two words differ only in their initial consonant (e.g., veal-feel), and the two consonants differ only in a single distinctive acoustic-phonetic feature (e.g., voicing). To define "distinctive feature", the DRT uses a minimal distinctive feature system, loosely based on the work of Jakobson et al. (1963) and Miller and Nicely (1955). These studies carefully analyzed natural speech errors in various noise environments. Whether or not these studies can be freely applied to alternative forced-choice tests of coded or synthesized speech is an empirical issue. In the present study, the results of a Consonant Identification (CI) task were compared to a previously conducted DRT using the same coding algorithms. The CI data indicated that the low-bit-rate coded speech yielded significantly more multifeature confusions than the uncoded speech. Moreover, the multifeature confusions could not be easily predicted from the single-feature confusions. A fundamental assumption of the DRT is that speech errors are adequately diagnosed by testing single-feature confusions. The results of the present study contradict that assumption. In conclusion, we argue that the application of the DRT (and, more generally, any closed-response choice procedure) to coded or synthesized speech is questionable.
Keywords:
speech intelligibility, diagnostic rhyme test, coded speech, two-alternative forced choice, consonant identification, speech errors
Introduction

Speech quality evaluations occur throughout the development and marketing of new speech processing algorithms. The acceptability and use of a speech output device are strongly affected by its intelligibility. In fact, intelligibility may be the primary factor governing acceptability (although intelligibility and other measures of quality can be independently varied; see Nusbaum et al., 1995; Schmidt-Nielsen, 1994). If the output of a speech communication device is difficult to understand, using it may interfere with other concurrent tasks (see Ralston et al., 1994). Therefore, extensive effort has been devoted to establishing norms for intelligibility testing (see, for example, ANSI, 1960, 1989; Egan, 1948; House et al., 1965; Schmidt-Nielsen,
1994; Voiers, 1983). Until the late 1980s, the PB word test (Egan, 1948) was the only ANSI standard (ANSI, 1960) for evaluating speech intelligibility. The PB word test contains 1000 monosyllabic CVC words equally grouped into 20 lists. Respondents listen to and transcribe each word from a list (the words are usually presented in a carrier phrase). The test has been criticized for being cumbersome and expensive to administer, and subject to learning effects (Schmidt-Nielsen, 1994). However, since 1960, closed-response tests have grown in popularity, and two of these, the Modified Rhyme Test (MRT; House et al., 1965) and the Diagnostic Rhyme Test (DRT; Voiers, 1983), are included in the 1989 ANSI standard. According to Schmidt-Nielsen, closed-response tests are preferable to open-response tests because they are easier to administer, less affected by learning and practice, and less expensive. Over the past several decades, many military and commercial voice output systems have been evaluated and refined using data from DRT and MRT evaluations. The DRT uses a two-alternative forced-choice procedure, while the MRT uses a six-alternative forced-choice procedure. In both tests, respondents listen to a test word (often in a carrier phrase) and then choose its closest match in the visually presented set of alternatives. Percent correct scores on these two types of closed-response tests tend to be highly correlated over a wide spectrum of speech degradations (Voiers, 1983). The two alternatives in the DRT differ only in a single distinctive acoustic feature, as defined by a distinctive feature system developed by Voiers (1983) and shown in Appendix 1. This system is based largely upon work by Jakobson et al. (1963) and Miller and Nicely (1955). For example, a respondent may be presented the sound /bi/ and be asked to choose between "bee" and "pea".
Errors in deciding between these alternatives might indicate that the speech output device poorly transmits the voicing distinction. The system used to construct the DRT classifies consonants along six linguistic distinctions: voicing, nasality, sustention, sibilation, graveness, and compactness. Voiers (1983) clearly demonstrates that the DRT can rationally differentiate the effects of noise, low-pass filters, and babble, and can diagnose differences between speech output devices. For this reason, many consider the DRT a diagnostic tool that can be used to improve intelligibility. However, there are reasons to be concerned about the DRT in particular and closed-response tests in
general. In contrast to the DRT, open-response tests allow listeners greater latitude in specifying what they actually heard and are less subject to ceiling effects (Schmidt-Nielsen, 1994). Are these differences relevant to the diagnosis and evaluation of speech output? The differences between the DRT and open-response tests are not critical if (1) significant differences remain significant in both classes of tests; and (2) the DRT does not miss important diagnostic information that is revealed in open-response tests. The present paper examines the DRT against these criteria.
Experiment: Low-Bit-Rate Coded Speech

The purpose of this experiment is to compare the results of the DRT with those of an open-response test. For an open-response task, the ANSI standard (1960, 1989) recommends the PB word test. In the PB word test, respondents transcribe the entire word that they heard. In contrast, the DRT and MRT are segmental intelligibility tests, i.e., the stimuli are constructed so that alternatives differ by only a single phoneme. In addition to this difference, the PB task has a number of methodological problems, i.e., the results are strongly affected by practice and by familiarity of the target word, and the results are cumbersome to score (Luce, 1987; Schmidt-Nielsen, 1994). To reduce these problems, and to allow a more direct comparison between the DRT and an open-response test, we used a Consonant Identification (CI) test. In this test, respondents hear a consonant-vowel syllable and transcribe the consonant. For example, upon hearing /ba/, a respondent might write, or key press, the letter "B". The use of consonant-vowel stimuli in intelligibility testing has a long history, beginning as early as 1910 (Campbell, 1910). The CI test has been used extensively to study the effects of noise on intelligibility and to assess the underlying acoustic dimensions of speech (Miller and Nicely, 1955; Wang and Bilger, 1973; Nusbaum et al., 1984).
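All of the analyses reported below derive from tallies of stimulus-response pairs in the CI task. As a rough illustration of this bookkeeping (the function names and the toy data are ours, not from the study), CI responses can be accumulated into a confusion matrix and scored for percent correct:

```python
from collections import defaultdict

def build_confusion_matrix(trials):
    """Tally CI trials into counts keyed by (presented, transcribed)."""
    matrix = defaultdict(int)
    for presented, response in trials:
        matrix[(presented, response)] += 1
    return matrix

def percent_correct(matrix):
    """Percent of trials on which the transcription matched the stimulus."""
    total = sum(matrix.values())
    correct = sum(n for (p, r), n in matrix.items() if p == r)
    return 100.0 * correct / total

# Toy data, not the experiment's: /pu/ heard as "p" three times, as "h" once.
trials = [("p", "p"), ("p", "p"), ("p", "p"), ("p", "h")]
m = build_confusion_matrix(trials)
print(percent_correct(m))  # 75.0
```

The same matrix also supports the error-type analyses reported later, since each off-diagonal cell records one specific consonant confusion.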
Method

Subjects.
Subjects were 47 undergraduates recruited from North Central College, a small Methodist college located in Naperville, Illinois. Thirteen to nineteen subjects participated in an experimental session. All subjects were native speakers of English who reported no history of speech or hearing disorders, and no previous experience with low-bit-rate coded speech or with the kind of perceptual tests used in the present
study. Subjects were paid $20 for their participation in the two-hour test session.

Stimuli. On each trial, subjects listened to a single CV syllable. The consonants were the 21 initial consonants that could easily be described to and transcribed by untrained listeners, i.e., /b, ch, d, f, g, h, j, k, l, m, n, p, r, s, sh, t, th (unvoiced), v, w, y, z/ (pronunciations are indicated in Appendix 2). The voiced /th/ was not used because of the potential difficulty of teaching subjects to discriminate it from the unvoiced /th/ in their written responses. Each consonant was presented in three vowel contexts. The three vowels were /a/ (as in the word "cot"), /i/ (as in the word "bee"), and /u/ (as in the word "too"). These vowels are traditionally used in Consonant Identification tasks (Wang and Bilger, 1973; Nusbaum et al., 1984) because they constitute articulatory and acoustic-phonetic extremes for steady-state vowels. Each consonant-vowel combination was presented six times in each half of the experiment; the first half of the experiment used syllables produced by one speaker (JP), and the second half used syllables spoken by the other speaker (DD). Both speakers were male. The speech samples for the two speakers were recorded on different occasions; those of speaker DD had been previously recorded for use in a different experiment, while speaker JP was recorded for the present study. However, the speech of both speakers was processed similarly, with one difference. The speech of speaker DD was recorded through a 3.9 kHz low-pass filter at an 8 kHz sampling rate, but the speech produced by JP was recorded through a 4.2 kHz low-pass filter at a 10 kHz sampling rate and then digitally down-sampled to 8 kHz for subsequent processing. 3 For both speakers, the recordings were blocked by vowel context, i.e., all of the syllables containing /a/ were recorded first, then those containing /i/, and finally those containing /u/. Each digitized syllable was hand-edited into separate files.
Each stimulus file was then codec-filtered, analyzed, and resynthesized by each of the experimental coding algorithms. The filtering and coding algorithms are described in Bronson et al. (1987) and McAuley and Quatieri (1985). After processing, the speech files were recorded onto videocassette tapes via a digital audio processor, in the order in which they were to be presented during the experimental session.

Procedure. Subjects were assigned to one of three experimental speech-coder conditions:
• uncoded but codec-filtered speech;
• speech that had been codec-filtered and then processed by a software version of the 4.8 kbps fully quantized implementable phase harmonic coder;
• speech that had been codec-filtered and then processed by a software version of the 4.8 kbps fully quantized nonphase harmonic coder.

Each experimental session tested only one speech-coder condition, and each session involved 13 to 19 subjects. Due to last-minute cancellations, 19 subjects participated in the session using the uncoded but codec-filtered speech, 15 subjects in the test of the 4.8 phase coder, and 13 subjects in the session using the 4.8 nonphase harmonic algorithm. In each session, subjects sat at a large common table and listened to 756 CV syllables through Beyerdynamic high-fidelity headphones connected to a portable Sony videocassette recorder via a Sony digital audio processor and a Pioneer amplifier. The output volume of the amplifier was calibrated across sessions using a 1 kHz tone. Subjects were informed that they would be listening to short syllables that had been transmitted over experimental voice circuits and then recorded. They were told that each syllable would be a consonant followed by a vowel, and were asked to write the consonant in the appropriate space in a booklet. The set of possible consonants was visible throughout the experiment. Before beginning, the experimenter read aloud the sounds represented by the orthographic code to ensure that the subjects understood the intended mapping between each orthographic letter and each phoneme (see Appendix 2). In addition, attention was drawn to consonants that were represented as two letters (i.e., "ch", "th" and "sh"), and to possible letter-sound confusions (e.g., the initial sound of the word "see" was represented by the letter "S" and not by "C").
These possible confusions involved the letters C for S as in the syllable /si/ (pronounced "see"), W for H as in the syllable /hu/ (pronounced "who"), G for J as in the syllable /ji/ (pronounced as the initial consonant-vowel of "jeep"), and U for Y as in the syllable /yu/ (pronounced "you"). Subjects were told that the syllables would be presented once every five seconds, and that on every fifth trial the trial number would be announced in an uncoded, unfiltered female voice. They were asked to guess if they were not sure which consonant they had heard, not to go back to earlier trials, and, if they happened to accidentally fall behind, to leave blanks and move to the correct trial. (None of the subjects reported losing their place.) A ten-trial practice tape preceded the experimental tape. The experimental tape was divided in two, with each half containing a different male voice. The presentation of the two voices was blocked within each experimental session, and each subject heard only one speech-processing algorithm. Sessions lasted about two hours. Each half of the experiment was broken into three sets of 126 trials (two presentations of each consonant-vowel combination, in constrained-random order). This sequence was used so that learning effects between sets could be easily analyzed. The same order of CV syllables was used in all experimental sessions. The constraints on the randomized order were that no consonant was repeated twice in a row, and that identical consonant-vowel combinations were separated by at least three trials.

Results and Discussion

There are three sources of response error in a Consonant Identification task: (a) misperceptions, (b) guessing, and (c) letter-sound confusions (e.g., writing "W" for the syllable /hu/ because the English word /hu/ is written as "who" and begins with the letter W). Unlike the first two types of errors, errors due to letter-sound confusions are essentially an artifact of the task, and were eliminated as much as possible by treating responses that could be attributed to the four letter-sound confusions mentioned in the Procedure section as correct. 4

A post-experimental review of the stimuli revealed that one speech file, /vi/ produced by speaker DD, had been accidentally corrupted. Therefore, the consonant /v/ in all three vowel contexts for both speakers was eliminated from the statistical analyses of performance. To test that these corrections had little effect on our statistical conclusions, an additional analysis was conducted on the uncorrected data. Because the two analyses produced essentially the same conclusions, only the analysis of the corrected data will be reported.

The main effects of coder, vowel context, speaker, and trial set were all significant (p < .01), as were most of the interactions between these factors. To facilitate the presentation of these results, interactions that accounted for less than 1% of the total within-subjects variance will not be discussed. The intelligibility of the three speech-coding algorithms and the uncoded, codec-filtered speech is summarized in Table 1 as a function of voice and vowel context. The main effect of coding condition (the means relevant to this main effect are presented in the column marked "Average score") was highly significant; F(2, 44) = 106.8, MSe = 0.017, p < .0001. According to a Newman-Keuls multiple range test, the uncoded speech was significantly more intelligible than the two 4.8 kbps coders (p < .01). In addition, the two 4.8 coders were reliably different (p < .05). However, caution must be observed in interpreting this difference because of a strong interaction with voice and vowel context; F(4, 88) = 9.4, MSe = 0.0038, p < .0001. Speech from speaker DD processed through the 4.8 phase coder is significantly more intelligible than the same speech processed via the 4.8 nonphase harmonic coder for all vowel contexts. For speaker JP, this difference is significant only for consonants preceding /u/. The main effect of vowel context was significant; F(2, 126) = 226.2, MSe = 0.0037, p < .0001. A Newman-Keuls multiple range statistic indicated that consonants followed by the vowel /a/ (89.1%) or /u/ (89.1%) were more intelligible than those followed by /i/ (79.7%), p < .01. Although the effects of vowel context were significantly affected by coder condition and voice (all two-way interactions, as well as the triple interaction, were significant, p < .0001), the basic
Table 1. Percent correct Consonant Identification by vowel context.

                             Speaker JP           Speaker DD        Average score
Coder condition           /a/    /i/    /u/    /a/    /i/    /u/    Mean    SE
4.8 kbps phase            85.7   66.3   84.6   86.7   79.4   89.2   82.0   0.54
4.8 kbps nonphase         84.7   66.8   79.6   81.9   76.5   84.2   79.0   0.58
Uncoded, codec-filtered   94.3   89.6   94.7   96.3   91.5   96.9   93.9   0.32
Mean                      88.2   74.2   86.3   88.3   82.5   90.1    --     --
pattern (i.e., consonants preceding an /a/ or /u/ were significantly more intelligible than those preceding an /i/) was found for all combinations of voice and coder condition. This basic pattern of vowel context effects is common for a variety of types of degraded speech, e.g., speech in noise and speech generated by rule from text (see Nusbaum et al., 1984; Wang and Bilger, 1973). The speech produced by the two speakers was reliably different, F(1, 44) = 49.9, MSe = 0.0065, p < .001. Overall, speaker DD (88%) was more intelligible than speaker JP (84%). The interaction between speaker and vowel context was significant, F(2, 88) = 24.7, MSe = 0.0038, p < .001, as was the interaction between coding condition and speaker, F(2, 44) = 5.2, MSe = 0.004, p < .01. The triple interaction between coding condition, speaker, and vowel context was also significant, F(4, 88) = 9.4, MSe = 0.004, p < .01. A useful way to explain this interaction is to examine the differential effect of the /i/ vowel context relative to the /a/ and /u/ vowel contexts. Table 2 summarizes this interaction by calculating the difference in intelligibility between the vowel context /i/ and the vowel contexts /a/ or /u/, for coded and uncoded speech. The intelligibility of syllables containing an /i/, spoken by JP, and processed through the 4.8 kbps coders was significantly lower than would be expected from the main effects and two-way interactions. The main effect of trial set was significant (see Table 3), where trial set is defined as the first,
second, and last blocks of 126 trials for each speaker, F(2, 88) = 14.4, MSe = 0.0021, p < .01. Averaged across both speakers, the average percent correct was 84.8% in the first trial set, 86.4% in the second trial set, and 86.7% in the third trial set. A Newman-Keuls analysis indicated that the scores in the first set were significantly lower than the scores in the two subsequent sets (p < .05); there was no significant difference between the second and third trial sets. Trial set did not reliably interact with coding condition or speaker condition. Nor was the three-way interaction between trial set, vowel context, and speaker condition significant. The remaining two-way and three-way interactions with trial set, and the four-way interaction, were each significant (p < .05) but each accounted for less than 1% of the systematic variance. In addition, analyses of these remaining cases continued to demonstrate a strong and significant difference between the intelligibility of the uncoded and coded speech (p < .05). Thus, the effects of vowel context and voice have little bearing on the finding that 4.8 kbps speech is significantly less intelligible than uncoded speech, and it is this finding that forms the focus of the next section.
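The Table 2 entries below follow directly from the Table 1 means. As a quick check of the arithmetic (a sketch of ours: the scores are transcribed from Table 1, and the two 4.8 kbps coders are pooled for the "coded" rows):

```python
# Per-vowel percent correct, transcribed from Table 1:
# {speaker: {coder: (score for /a/, /i/, /u/)}}
TABLE1 = {
    "JP": {"phase":    (85.7, 66.3, 84.6),
           "nonphase": (84.7, 66.8, 79.6),
           "uncoded":  (94.3, 89.6, 94.7)},
    "DD": {"phase":    (86.7, 79.4, 89.2),
           "nonphase": (81.9, 76.5, 84.2),
           "uncoded":  (96.3, 91.5, 96.9)},
}

def i_deficit(speaker, coders):
    """Mean /i/ score minus the mean of the /a/ and /u/ scores,
    pooled over the listed coder conditions (as in Table 2)."""
    a, i, u = (sum(TABLE1[speaker][c][k] for c in coders) / len(coders)
               for k in range(3))
    return i - (a + u) / 2

print(f'{i_deficit("JP", ["phase", "nonphase"]):.1f}')  # -17.1
print(f'{i_deficit("JP", ["uncoded"]):.1f}')            # -4.9
print(f'{i_deficit("DD", ["uncoded"]):.1f}')            # -5.1
# The pooled coded value for DD comes to -7.55, which Table 2
# rounds to -7.6.
```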
Table 2. Percent correct difference in consonant intelligibility between the /i/ vowel context and the /a/ and /u/ vowel contexts, i.e., average(/i/) - average(/a/, /u/).

                    Speaker JP    Speaker DD
4.8 coded speech      -17.1         -7.6
Uncoded speech         -4.9         -5.1

Comparisons with the DRT. The primary focus of this paper is not the results of the Consonant Identification (CI) task, per se, but rather the implication of these results for the DRT. As mentioned earlier, the differences between the DRT and open-response tests are not critical for evaluating and diagnosing speech output devices if:

• significant differences remain significant in both classes of tests; and
• the DRT does not miss important diagnostic information that is revealed in open-response tests.
The CI task indicated that the uncoded speech was significantly more intelligible than the 4.8 coded speech. However, differences between these same coding
Table 3. Percent correct Consonant Identification by stimulus set.

                             Speaker JP              Speaker DD         Average score
Coder condition           Set 1  Set 2  Set 3     Set 1  Set 2  Set 3   Mean    SE
4.8 kbps phase             76.4   80.5   79.7      84.1   85.3   86.1   82.0   0.54
4.8 kbps nonphase          76.3   77.1   77.7      78.7   81.7   82.1   79.0   0.58
Uncoded, codec-filtered    91.8   93.7   93.1      94.9   94.3   95.6   93.9   0.32
Means                      82.6   84.9   84.6      87.0   88.0   88.8    --     --
algorithms were not significant in a DRT test conducted with the same coding algorithms. The DRT was conducted by Dynastat (the company that developed the DRT) for AT&T. As part of this evaluation, Dynastat also conducted a speech quality evaluation using their Diagnostic Acceptability Measure (DAM; Voiers, 1977; see also Schmidt, 1994). The DAM, like most subjective quality methods, requires participants to listen to an utterance (usually a sentence). After each listening condition, participants rate the experience on a variety of general perceptual dimensions, e.g., intelligibility and pleasantness, and specific perceptual dimensions, e.g., the presence or absence of fluttering in the signal. The DAM has been used extensively for the evaluation and refinement of low-bit-rate speech coding algorithms (Schmidt, 1994). Dynastat conducted both evaluations in 1986-1987 using their standard procedures (see Voiers, 1977; Syrdal, 1987). The principal results of these evaluations are shown in Table 4. Each test was conducted with three female and three male speakers. To simplify comparison with the present experiment, only the results of the male speakers are shown.

Table 4. Intelligibility scores for the CI, DRT, and DAM tasks.

                           CI results            DRT results        DAM results
                           (n = 15, 13, 19)      (n = 8)            (n = 12)
Coder condition            Mean      SE          Mean     SE        Mean    SE
4.8 kbps phase             82.0%    0.54         93.1%   0.64       53.5   1.6
4.8 kbps nonphase          79.0%    0.58         94.3%   0.62       54.2   1.5
Uncoded, codec-filtered    93.9%    0.32         95.7%   0.77       68.0   2.2

There are several interesting differences between these experiments. First, the percent accuracy scores for the CI test are lower and span a greater range than those for the DRT. Second, although the differences between the 4.8 coding algorithms and the codec-filtered speech were highly significant for both the DAM and CI studies, these coding conditions were not well separated in the DRT test. An analysis of the DRT results (involving 12 coder conditions, including the three listed in Table 4) indicated no significant difference in intelligibility between the uncoded and the nonphase harmonic speech, or between the nonphase harmonic and the phase coders. Only the difference between the uncoded and the phase-coded speech was reliable (p < .05). Thus, the results of the CI and DAM tests appear similar to each other, but different from the DRT scores. The obvious problem with the DRT results is that the differences among scores may have been compressed by ceiling effects, as would be expected of any set of
percent correct scores (or scores derived from percent accuracy) that is near 100%. Scores for the DRT are corrected for guessing (see Voiers, 1983). The possibility of ceiling effects is even more obvious if one considers the uncorrected DRT scores: 97.2% for the 4.8 nonphase harmonic coder, 96.6% for the 4.8 phase harmonic coder, and 97.9% for the uncoded speech. 5 Thus, the CI task may be more sensitive than the DRT to differences among coding conditions because the accuracy scores are lower, and therefore less affected by ceiling effects. There are several reasons why the CI task would be expected to yield lower scores than the DRT and also why the CI task might provide more diagnostic information about differences in segmental intelligibility. First, the DRT presents only two response alternatives per trial. This is a small fraction of the possible stimulus set. In contrast, the CI study permitted the full range of 21 alternative responses on each trial. Therefore, it is possible that subjects in the DRT study used minimal acoustic information to discriminate between the two response alternatives. Two additional aspects of the DRT, as conducted by Voiers (1983), increase the likelihood that task-specific strategies might have been used: (a) Participants are used repeatedly, are highly trained, and are therefore knowledgeable about the task and the stimuli, and (b) the two response alternatives on each trial are presented before the test word. Therefore, prior to the presentation of a DRT test word, subjects may be able to focus attention on the exact acoustic-phonetic cue (e.g., voicing) that will discriminate between the two response options when listening to the test word. Another related factor is that the phonological contrast investigated on each DRT trial always involves a single distinctive feature, and therefore many potential perceptual confusions are never tested. 
In contrast, the CI study allows the examination of multiple- and single-feature confusions.
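The relation between the corrected and uncorrected DRT scores quoted above can be checked directly, assuming the standard right-minus-wrong correction for guessing in a two-alternative forced choice, (R - W)/T, which for an uncorrected percentage p equals 2p - 100. This sketch of ours reproduces the corrected scores to within rounding:

```python
def drt_corrected(p_uncorrected):
    """Two-alternative correction for guessing, (R - W) / T,
    expressed in percent: 2p - 100 for an uncorrected percentage p."""
    return 2.0 * p_uncorrected - 100.0

# Uncorrected DRT percentages reported in the text -> corrected values
for label, uncorrected, reported in [
    ("4.8 nonphase", 97.2, 94.3),
    ("4.8 phase",    96.6, 93.1),
    ("uncoded",      97.9, 95.7),
]:
    # each corrected value lands within about 0.1 of the Table 4 score
    print(f"{label}: {drt_corrected(uncorrected):.1f} (reported {reported})")
```

The check also makes the ceiling-effect concern concrete: near 100%, the correction roughly doubles every small difference between conditions without changing their ordering.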
Feature Confusions in the CI Experiment. The DRT task is designed to examine consonant confusions resulting from single phonetic-feature confusions. The validity
of the DRT as a general measure of segmental intelligibility depends on the assumptions that (a) the feature system used to create stimulus pairs reasonably reflects the underlying acoustic-phonetic distinctions used by listeners when hearing the speech samples, (b) an overwhelming majority of their perceptual confusions involve only one feature, and (c) those confusions involving more than one feature can be predicted from data on single-feature confusions. These assumptions are reasonable for uncoded speech transmitted in a quiet environment, because minimal distinctive feature systems are constructed so that highly similar sounds differ by only one feature (see Miller and Nicely, 1955). However, we are not aware of any studies that evaluate these assumptions for low-bit-rate coders. To test them, we categorized the identification errors as single, double, or "3+" feature confusions, using the same feature system that was used to construct the DRT. As shown in Fig. 1, single-feature confusions dominate uncoded speech errors: 76% of the errors could be defined as a single-feature confusion (e.g., voicing or sustention), and 24% as a confusion of two or more perceptual features. These percentages are conditionalized on the total number of errors, not the total number of responses. The result is not surprising. The feature system used in the DRT was modeled on natural speech and is intended to specify approximately independent dimensions of speech. In contrast to the uncoded speech, almost half the errors on the coded speech can be classified as multiple-feature confusions. For both the 4.8 kbps phase and nonphase coded speech, 56% of the errors involved single-feature confusions, and 44% of the errors involved confusions of two or more features. 6
The uncoded speech has a significantly lower proportion of multifeature confusions than either the nonphase or phase-coded speech (z = 10.7, p < .001, and z = 10.3, p < .001, respectively); whereas the difference between the nonphase and phase coders is obviously not significant. In light of these results, can the DRT's focus on single-feature confusions be defended? One defense is to explain multiple-feature confusions as arising from chance combinations of single-feature confusions. If this hypothesis were correct, then the diagnostic information provided by the multiple-feature confusions might be deducible from single-feature confusions. However, if this hypothesis were disconfirmed, it would imply that diagnostic information provided by multiple-feature confusions cannot be derived from tasks that only investigate single-feature confusions (e.g., the DRT). A disconfirmation would also suggest that the distortions introduced by the coding algorithms do not normally occur with uncoded speech, since minimal distinctive-feature taxonomies attempt to define frequently occurring perceptual errors as single-feature confusions.
Phoneme-Specific Analyses. The defense that multiple-feature confusions arise from the chance combination of single-feature confusions predicts that:

• the lower the accuracy score, the more likely the occurrence of multiple-feature confusions; and
• if an incorrect response is a multiple-feature confusion, then a sequence of single-feature confusions leading from the correct response (i.e., the originally articulated consonant) to the incorrect response should be observable.
Figure 1. Feature confusion errors categorized according to Voiers' (1983) feature taxonomy. [Bar chart showing, for each coding algorithm, the percentage of errors that were single-feature versus multi-feature confusions: uncoded (6% errors), 4.8 nonphase (18% errors), and 4.8 phase (22% errors).]
Although the overall results presented in Fig. 1 support the first prediction, a close inspection of the phonemic confusion data argues against it. Appendix 3 presents, for each vowel context and coder, a compilation of the five consonants with the lowest percent accuracy scores. For each of these consonants, the most frequent incorrect response is reported along with the number of features that separate the correct and incorrect response. The frequency of the incorrect response is also reported as a percent of total responses (correct and incorrect) and as a percent of incorrect responses. The mean accuracy score for the consonants associated with a frequent single-feature confusion is 53%; in contrast, the mean accuracy scores for consonants associated with two- and three-or-more-feature confusions are 62% and 64%, respectively. Thus, for the data presented in Appendix 3, the results are the opposite of what the explanation would reasonably predict. An alternative method of converging on the same conclusion is to observe that within each block defined by coder and vowel context, there is little correlation between "% correct" and "# of features confused". The second prediction is more difficult to test, but the following example argues against its validity. Figure 2 presents the confusion networks for two syllables, /da/ and /pu/. Figure 2(a) indicates that on 33% of the trials in which /da/ was presented, participants incorrectly identified the initial consonant as "t", "g", or "th". Both /t/ and /g/ differ from /d/ by a single feature, ignoring graveness, which according to Voiers does "not apply" to /g/. (Graveness does apply to /g/ according to Chomsky and Halle (1965), in which case the /d/-to-/g/ confusion represents a multifeature error.) The /d/-to-/th/ confusion is a multifeature confusion, and accounts for almost 50% of the errors generated by /da/, i.e., 17.2% of the /da/ trials were misclassified as "th".
Can the response "th" be rationalized as a sequence of observable single-feature confusions? No. Consider, for example, the path from /d/ to /g/ to /th/. Even if the /d/-to-/g/ confusion were considered a single-feature confusion, the /g/-to-/th/ confusion is a multifeature error, and the other identification errors associated with /g/ do not provide any clear feature-confusion path to /th/ (in fact, they are further from /th/ in Voiers' feature matrix). Alternatively, the sequence /d/ to /t/ to /th/ might have provided support for the second prediction, but /t/-to-/th/ errors did not occur in this study. No reasonable weighting, or joint probabilities, of the observed single-feature confusions generated by /da/ can produce the observed probability of transcribing /da/ as "th". Thus, the observed
confusion data for/da/ does not support the second prediction. Figure 2(b) provides a second example. The syllable /pu/is correctly categorized as a "p" 95% of the time when the stimulus is not coded, 41% of the time when a nonphase coder is applied, and 63% of the time when a phase coder is applied. Furthermore, the response "h" accounts for 17.2% of the/pu/trials processed by the phase coder (almost 50% of the errors generated by /pu/). Similarly, the response "h" accounts for 18.9% of the/pu/trials processed with the nonphase coder. In contrast, the response "h" only occurred once in the uncoded/pu/trials. Ignoring "graveness" which according to Voiers (1983) does not apply t o / h / o r / k / , can the response "h" be rationalized as a sequence of single-feature confusions in 4.8 phase coder results? Consider the confusion path f r o m / p / t o / f / t o / h / : The sound/fu/was never transcribed as "h". Next consider, the possible path from/p/to/k/to/h/, for which there is some support in the observed data. Transcribing/p/as "k" accounts for 16.7% of the/pu/trials. Transcribing / k / a s "h" accounts for 4% of the/ku/trials. No reasonable weighting of these confusion probabilities can produce the observed probability of transcribing/pu/ as "h" (17% of the/pu/trials; 48% of the observed/pu/ errors). Moreover, the/p/-/h/confusion would never be detected by the DRT, and never contribute to the DRT accuracy scores, but such confusions might indicate a coding problem with initial bursts. These findings suggest that the DRT may be overestimating speech intelligibility and may miss important diagnostic information.
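The arithmetic behind the /p/-to-/k/-to-/h/ argument can be made explicit. Using the phase-coder figures reported in the text, the joint probability of the two observed single-feature steps falls far short of the observed multifeature confusion rate:

```python
# Can the observed /pu/ -> "h" rate be built from the observed
# single-feature steps /p/ -> "k" and /k/ -> "h"?
# Probabilities are taken from the phase-coder results quoted in the text.
p_p_to_k = 0.167           # "k" responses on /pu/ trials
p_k_to_h = 0.04            # "h" responses on /ku/ trials
p_path = p_p_to_k * p_k_to_h   # joint probability of the two-step path

p_observed = 0.172         # "h" responses actually observed on /pu/ trials

print(f"two-step path predicts {p_path:.4f}; observed rate is {p_observed:.3f}")
```

The two-step path predicts well under 1% of trials, more than an order of magnitude short of the 17.2% actually observed, so no weighting of these single-feature steps can account for the "h" responses.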
General Discussion

Although many studies have examined the correlational validity of the DRT against other tests (e.g., the MRT), many of the DRT's underlying assumptions have gone untested. For example, the DRT uses an acoustic-phonetic feature system, loosely based on Jakobson et al. (1963) and Miller and Nicely (1955), to specify word pairs that differ in only one distinctive feature. Jakobson et al. and Miller and Nicely based their taxonomies on careful analyses of natural speech production and perception in various noise environments. Whether or not these taxonomies can be freely applied to alternative forced-choice tests of coded, compressed, or synthesized speech is an empirical issue. In the present study, we evaluated two speech coders against uncoded, but filtered, speech, and compared the results to those of an earlier evaluation that used the DRT and DAM. The motivation was
[Figure 2: confusion networks. Recoverable data from the figure: /da/ to /tha/ probabilities for the different coders: 4.8 phase, 2/7 errors; 4.8 nonphase, 34/91 errors; uncoded, 0/0 errors. /pu/ to /hu/ probabilities for the different coders: 4.8 phase, 31/65 errors; 4.8 nonphase, 34/91 errors; uncoded, 1/10 errors.]
Figure 2. Confusion network (a) for /da/ when coded by the 4.8 nonphase harmonic coder (number of trials = 156) and (b) for /pu/ when coded by the 4.8 phase harmonic coder (number of trials = 180).

to discover whether or not an open-response test would reveal patterns of error substantially different from those of the DRT, a closed-response test. The important empirical comparison between the two methods is not the absolute intelligibility scores generated by each test, but whether or not the two tests lead to the same conclusions about the relative merits and deficits of the different coders. The open-response CI task indicated overall accuracy differences between the 4.8 coders and the
uncoded speech. These differences were also seen with the DAM, a subjective judgment measure, but were not apparent in the closed-response DRT. The major differences between the CI task and the DRT are due to differences in their respective stimulus-response mappings. The DRT uses a two-alternative forced-choice procedure in order to focus the evaluation on single-feature confusions. The CI task does not restrict per-trial responses to a subset of the target vocabulary, and therefore allows multiple-feature
confusions. The analysis of the CI data indicated that the low-bit-rate coded speech generated significantly more multifeature confusions than the uncoded speech. Thus, for these coding algorithms, the DRT missed important stimulus confusions and might be diagnostically misleading. This result could be criticized as being entirely dependent upon the feature system used for classifying errors. If we were simply claiming that coded speech leads to more multifeature errors than uncoded speech, the criticism would be entirely valid. However, the point of the analysis was to evaluate the underlying structure of the DRT. The same feature system used to construct the DRT was used to classify single- and multifeature CI errors. Thus, if the DRT's feature system is not appropriate, then the DRT as it is currently constructed is also inappropriate. If the feature system is appropriate, then the DRT is inappropriate for evaluating some forms of coded speech. Either way, the diagnostic value of the DRT is questionable. In fact, the data suggest that the application of any closed-response procedure to coded or synthesized speech is questionable if the response alternatives are derived from theoretical models of natural speech. Reminiscent of the conclusions of Nusbaum et al. (1984) and Greenspan et al. (1988) for synthetic speech, these data suggest that some forms of coded speech may produce acoustic-phonetic cues that are substantially different from those of natural speech. In such cases, the DRT is probably inappropriate because the structure of the DRT strongly assumes an acoustic-phonetic space that is modeled on natural, uncoded speech. Alternatively, the problem may be that Voiers' (1983) feature system overly simplifies articulatory and acoustic dimensions. For example, voicing is indicated as a binary feature, but as Wickelgren (1966) notes, the articulatory effort necessary to produce voicing is greater for stop consonants than for fricatives, nasals, and semivowels.
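One way to see how a two-alternative format can mask multifeature confusions is with a simple choice model: if a listener's open-response distribution puts mass p_t on the target and p_f on the printed foil, normalizing over the two printed words predicts a forced-choice score of p_t / (p_t + p_f). The model and the response distribution below are our illustration, not the authors' analysis; the distribution is invented to mimic a case, like coded /pu/, in which most errors are multifeature.

```python
# Illustrative choice model: predicted 2AFC accuracy when the foil differs
# from the target by a single feature, given a hypothetical open-response
# distribution. The distribution is invented for illustration only.
open_response = {"p": 0.63, "k": 0.17, "h": 0.17, "f": 0.03}

def two_afc_score(target: str, foil: str) -> float:
    """Predicted forced-choice accuracy: normalize over the two printed words."""
    pt, pf = open_response[target], open_response[foil]
    return pt / (pt + pf)

# A DRT-style pair contrasts the target /p/ with a single-feature foil
# such as /f/; the multifeature "k" and "h" confusions never surface.
print(round(two_afc_score("p", "f"), 2))
```

Even though this hypothetical listener misidentifies the consonant on 37% of open-response trials, the predicted forced-choice score against the single-feature foil is about 0.95, because the mass on the multifeature confusions "k" and "h" is invisible to the two-alternative procedure.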
Intelligibility Testing. Schmidt-Nielsen (1994) argues that closed-response tests are preferable to open-response tests because closed-response tests are easier to administer, less affected by learning and practice, and less expensive. However, in making that assertion, she compared the open-response PB word transcription task to closed-response tasks such as the DRT. Such comparisons are questionable, since the DRT is constructed to focus attention on the initial consonant of a word, whereas the PB word transcription
task focuses attention on the whole word. Comparing the DRT to an open-response task such as the CI task may be more appropriate: both tasks focus attention at the phonemic level. In the present study, participants required little training and were efficient. Administration was simple, as was data coding. Furthermore, no differences were revealed between the second and third trial sets, and the same basic differences among coders were present in all three sets; participants quickly reached asymptotic performance levels. However, it is not the purpose of the present paper to argue that the CI task should replace the DRT. In the present study, the CI task was a tool for evaluating the DRT. Intelligibility, in its common-sense meaning, is not a single stimulus-response dimension. It is the result of many perceptual, cognitive, and response characteristics, and it is unlikely that any single task could adequately measure intelligibility in all its richness. Neither the DRT nor the CI task is adequate for testing phoneme or word intelligibility in all of the various accented and unaccented, and initial, medial, and final, positions. Therefore, it is recommended that evaluations use a variety of testing procedures. In very specific applications with very limited vocabularies, it is best to develop in situ tasks. However, there are business imperatives that demand simple evaluation criteria that can be used across multiple applications and devices. Except in cases in which the application ensures a restricted vocabulary set and highly trained listeners, we recommend the open-response format. Open-response identification tasks are less likely to be biased by theoretical frameworks and more likely to match the processing characteristics of the natural speech environment. Preferably, the criteria for evaluating intelligibility ought to include more than one type of test: at least one focused on phoneme processing and one focused on utterance processing.
Changes in intelligibility as a function of the listener's experience with the voice-output system should also be considered (Greenspan et al., 1988). For diagnostic purposes, closed-response tasks are appropriate as long as the stimulus-response construction makes valid assumptions about the nature of errors for the type of speech being tested. That condition was not met in the present study. It is worth noting that Dynastat has constructed versions of the DRT that examine some multifeature comparisons. Clearly, constructing word pairs in which the initial consonants differ by more than one feature
is possible. However, this does not lessen the criticisms suggested by the present study. The present research argues against any notion that the same multifeature confusions will be likely for different coding algorithms. Even if some multifeature comparisons are added to the DRT, the only principled way of knowing whether or not these are the critical comparisons for the particular speech output algorithms being evaluated is to either develop a valid theory of the conditions that lead to multifeature confusions or to first conduct an open-response test to determine which multifeature confusions are likely to occur. In the absence of any overarching theory of perceptual confusions and speech coding, the safest course of action is to include an open-response test as part of the evaluation and diagnostic criteria.
Appendix 1: The Distinctive Feature System Developed by Voiers (1983) and Used in the DRT Stimulus Construction

                 m  n  v  z  j  b  d  g  w  r  l  y  f  th s  sh ch p  t  k  h
Voicing          +  +  +  +  +  +  +  +  +  +  +  +  -  -  -  -  -  -  -  -  -
Nasality         +  +  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Sustention       -  -  +  +  -  -  -  -  +  +  +  +  +  +  +  +  -  -  -  -  +
Sibilation       -  -  -  +  +  -  -  -  -  -  -  -  -  -  +  +  +  -  -  -  -
Graveness        +  -  +  -  0  +  -  0  +  -  0  0  +  -  -  0  0  +  -  0  0
Compactness      -  -  -  -  +  -  -  +  -  -  0  +  -  -  -  +  +  -  -  +  +
Vowel-likeness   -  -  -  -  -  -  -  -  +  +  +  +  -  -  -  -  -  -  -  -  -

Symbols: Feature present (+), feature absent (-), and does not apply (0). Note: The phonetic symbols used to identify the initial phonemes in the words "jaw", "you", "shoe", "chew", and "thaw" are not conventional. They were used in the present study because they were easy to describe to the experimental subjects and because they correspond to sound-to-spelling rules typically used in English. The similar chart in Voiers (1983) uses the conventional phonetic transcription conventions. It should also be noted that many feature taxonomies exist in the literature; they differ in the number of features and in the feature categories. Voiers' (1983) feature system was used in the present study in order to analyze the CI response errors with the same feature classification system used in the DRT.
Appendix 2: Phonetic Transcription Conventions Used in the Consonant Identification Task

Phonetic symbol  Example    Phonetic symbol  Example
b                bee        p                pea
ch               chew       r                row
d                do         s                see
f                fee        sh               shoe
g                go         t                tea
h                hi         th               thaw
j                jaw        v                vow
k                key        w                way
l                lie        y                you
m                me         z                zoo
n                no

Note: The phonetic symbols used to identify the initial phonemes in the words "jaw", "you", "shoe", "chew", and "thaw" are not conventional. They were used in the present study because they were easy to describe to the experimental subjects and because they correspond to sound-to-spelling rules typically used in English.
Appendix 3: Most Common Consonant Identification Errors by Vowel Context and Coding Algorithm

For each vowel context and coder combination, the five phonemes that had the lowest accuracy scores were identified. For each identified phoneme, the most frequent phonemic confusion was observed and the number of features separating the correct and incorrect responses was calculated.

Vowel context /a/, 4.8 phase (total possible correct responses = 180)
Stimulus  % correct  Most common error  No. of features confused  % of total responses  % of total errors
b         39%        v                  1                         27%                   45%
z         59%        th                 2                         40%                   98%
m         62%        v                  2                         33%                   86%
p         78%        h                  3-4                       14%                   63%

Vowel context /a/, 4.8 nonphase (total possible correct responses = 156)
z         45%        th                 2                         50%                   91%
b         50%        v                  1                         19%                   38%
v         50%        f                  1                         35%                   71%
d         67%        th                 2                         16%                   49%
m         71%        v                  2                         18%                   62%

Vowel context /a/, uncoded (total possible correct responses = 228)
th        61%        s                  1                         20%                   51%
z         89%        y                  3                         5%                    44%
j         92%        ch                 1                         6%                    71%
f         94%        th                 1                         3%                    44%
m         95%        v                  2                         3%                    61%

Vowel context /i/, 4.8 phase (total possible correct responses = 180)
j         21%        ch                 1                         43%                   54%
k         23%        t                  1                         74%                   97%
g         37%        t                  2                         31%                   49%
th        39%        s                  1                         32%                   53%
v         51%        z                  2                         28%                   58%

Vowel context /i/, 4.8 nonphase (total possible correct responses = 156)
j         17%        ch                 1                         38%                   46%
b         28%        p                  1                         32%                   45%
th        28%        f                  1                         51%                   70%
k         37%        t                  1                         60%                   96%
d         39%        t                  1                         38%                   63%

Vowel context /i/, uncoded (total possible correct responses = 228)
th        38%        f                  1                         46%                   75%
v         51%        z                  2                         43%                   88%
k         61%        t                  1                         39%                   99%
j         72%        g                  1                         24%                   86%
d         79%        g                  1                         9%                    44%

Vowel context /u/, 4.8 phase (total possible correct responses = 180)
j         20%        ch                 1                         76%                   95%
b         45%        h                  3-4                       30%                   55%
p         64%        h                  2-3                       17%                   48%
th        74%        s                  1                         15%                   58%
h         77%        f                  1-2                       11%                   46%

Vowel context /u/, 4.8 nonphase (total possible correct responses = 156)
b         25%        h                  3-4                       27%                   36%
j         35%        ch                 1                         24%                   36%
p         42%        h                  2-3                       22%                   38%
z         51%        l                  2                         35%                   72%
v         62%        f                  1                         22%                   59%

Vowel context /u/, uncoded (total possible correct responses = 228)
j         68%        ch                 1                         29%                   92%
h         82%        w                  3-4                       16%                   88%
th        85%        f                  1                         13%                   85%
ch        87%        t                  2                         10%                   78%
sh        94%        s                  1                         4%                    58%
Acknowledgments

The authors hereby express their gratitude to Greg Blonder, Wendy Greenspan, Allen Milewski, and Judy Tschirgi for their encouragement during various stages of this research.
Notes

1. The term closed-response is meant to apply to any test procedure in which the number of response choices permitted on any trial is a subset of the total number of possible stimulus categorizations. Typically in such cases, participants map stimuli onto response categories that are arbitrarily related to the stimulus category, e.g., pressing the "left" or "right" button to choose the word "bee" because on that particular trial the written word "bee" appears on the left or right side of the screen (a given word can appear on either side of the screen). In contrast, open-response procedures allow on any particular trial the full set of response alternatives. Often this is accomplished by letting participants identify the stimulus in the same way that they might identify it in the real world, e.g., naming, transcription, etc. There are few or no special stimulus-response mappings constructed for the experiment.

2. It might be argued that a more direct comparison to the DRT would be a task in which listeners identified the initial consonant of a spoken word. Presenting words in such an experiment can produce at least two types of artifacts: (a) recognition accuracy of a word is affected by the familiarity of the word relative to the familiarity of similar-sounding words (Luce, 1987); and (b) lexical and phonotactic constraints limit the set of possible confusions, i.e., if the initial consonant of a word is poorly transmitted, the listener can use the latter portion of the word to restrict the possible response alternatives (Salasoo and Pisoni, 1985). For example, in English, the word fragment /Iv/ (pronounced to rhyme with "give") can only be preceded by the phonemes /g/, /l/, /s/; and therefore the listener need only choose among these three alternatives.

3. Although the influence of aliasing was not apparent with the 8 kHz sampling rate, the 10 kHz sampling rate was used to further decrease the distortions due to aliasing.

4. For example, "C" was counted as a correct response to the syllable /si/. That this type of correction is reasonable is supported by the observation that the specified letter-sound confusions were very frequent in the expected vowel context but rarely occurred in the other vowel contexts. For instance, "G" responses occurred frequently for the syllable /ji/ but almost never for the syllables /ju/ and /ja/.

5. At worst, the correction provides an overestimate of intelligibility. This is the conservative tactic because, as will be discussed later, the DRT scores are much higher than the CI scores. Thus, the correction decreases overall differences between the DRT and the CI scores. The statistical relationship between coder conditions is the same for both corrected and uncorrected scores.

6. The feature taxonomy used to compute the percent of single- and multiple-feature confusions was based on the feature taxonomy used in Voiers (1983). Voiers uses three values to map features onto phonemes: feature-present, feature-absent, and "does not apply" (see Appendix 1). According to Voiers (1983), graveness "does not apply" to the phonemes /g, l, y, j, sh, ch, k, h/, and compactness "does not apply" to /l/. In other distinctive feature systems (e.g., Chomsky and Halle, 1968), graveness is considered present for some of these phonemes and absent for others. To avoid the debate as to which system is more appropriate, and because the present study is concerned primarily with the DRT, the present analysis ignored the graveness distinction. Thus, the calculation of "number of features confused" (see Fig. 1) is conservative: it reduces the number of multiple-feature confusions.
References

American National Standards Institute. (1960). American standard method for measurement of monosyllabic word intelligibility (ANSI S3.2-1960). New York: American Standards Association.
American National Standards Institute. (1989). Method for measuring the intelligibility of speech over communication systems (ANSI S3.2-1989). New York: American Standards Association.
Bronson, E., Carlone, D., Kleijn, W.B., O'Dell, K., Picone, J., and Thomson, J. (1987). Harmonic coding of speech at 4.8 Kb/s. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2213-2216.
Campbell, G.A. (1910, cited in Schmidt-Nielsen, 1994). Telephonic intelligibility. Phil. Mag., January.
Chomsky, N. and Halle, M. (1968). The Sound Pattern of English. New York: Harper and Row.
Egan, J.P. (1948). Articulation testing. Laryngoscope, 58:955-991.
Greenspan, S.L., Nusbaum, H.C., and Pisoni, D.B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3):421-433.
House, A.S., Williams, C.E., Hecker, M.H.L., and Kryter, K.D. (1965). Articulation-testing methods: Consonantal differentiation with a closed-response set. Journal of the Acoustical Society of America, 37:158-166.
Jakobson, R., Fant, C.G.M., and Halle, M. (1963). Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Luce, P.A. (1987). Structural distinctions between high and low frequency words in auditory word recognition. Unpublished doctoral dissertation, Indiana University.
McAulay, R.J. and Quatieri, T.F. (1985). Mid-rate coding based on a sinusoidal representation of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 945-948.
Miller, G.A. and Nicely, P. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27:338-352.
Nusbaum, H.C., Dedina, M.J., and Pisoni, D.B. (1984). Perceptual confusions of consonants in natural and synthetic CV syllables. Research on Speech Perception: Progress Report No. 10, Speech Research Laboratory, Indiana University, Bloomington, Indiana, pp. 409-422.
Nusbaum, H.C., Francis, A.L., and Henly, A.S. (1995). Measuring the naturalness of synthetic speech. International Journal of Speech Technology, 1:7-19.
Ralston, J.V., Pisoni, D.B., and Mullennix, J.W. (1994). Perception and comprehension of speech. In A. Syrdal, R. Bennett, and S. Greenspan (Eds.), Applied Speech Technology. Boca Raton, FL: CRC Press.
Salasoo, A. and Pisoni, D.B. (1985). Sources of knowledge in spoken word identification. Journal of Verbal Learning and Verbal Behavior, 24:210-234.
Schmidt-Nielsen, A. (1994). Intelligibility and acceptability testing for speech technology. In A. Syrdal, R. Bennett, and S. Greenspan (Eds.), Applied Speech Technology. Boca Raton, FL: CRC Press.
Syrdal, A. (1987). Methods for a detailed analysis of Dynastat DRT results. AT&T Bell Laboratories Technical Memorandum.
Voiers, W.D. (1977). Diagnostic acceptability measure for speech communication systems. In M.E. Hawley (Ed.), Speech Intelligibility and Speaker Recognition, vol. 2. Stroudsburg, PA: Dowden, Hutchinson, and Ross.
Voiers, W.D. (1983). Evaluating processed speech using the diagnostic rhyme test. Speech Technology, 30-39.
Wang, M.D. and Bilger, R.C. (1973). Consonant confusions in noise: A study of perceptual features. Journal of the Acoustical Society of America, 54:1248-1266.
Wickelgren, W.A. (1966). Distinctive features and errors in short-term memory for English consonants. Journal of the Acoustical Society of America, 39(2):388-398.