INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 7, 81–91, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Verbal Descriptors for VoIP Speech Sounds

GREGORY W. CERMAK
User Centered Design, Mailcode LA0MS38, Verizon Laboratories, 40 Sylvan Rd., Waltham, MA 02451, USA
[email protected]
Abstract. Laboratory studies with human observers were used to develop and apply a set of qualitative verbal descriptors for the sound of digital speech signals in a packet network under various loads of packet loss and jitter. The specific set of speech samples used was derived from 24 systems that were combinations of:

– G.729A vs. G.711 codecs,
– Produced by two manufacturers,
– With three levels of packet delay variation ("jitter") introduced by a packet network emulator, and
– Three levels of packet loss, also by means of the emulator.
Each system was represented by a set of speech samples that had been recorded through it. A suite of verbal descriptors developed in a preliminary study was used to distinguish among the set of speech samples, and thereby the systems. Speech samples that had gone through one kind of system had a different profile of ratings than speech that had gone through other kinds of systems. Statistical analysis showed that the descriptors discriminated among all the experimental variables and most of their interactions. The system variables with the greatest effect on sound character were packet loss and a combination of codec and manufacturer. The qualitative descriptors also predicted overall subjective quality of the speech.

Keywords: VoIP, speech quality, packet loss, jitter, network emulator, packet network

Introduction

Voice Over Packet Network

Voice over Internet Protocol (VoIP) still seems to be a technology worth taking seriously, despite recent setbacks in the various telecommunications and Internet industries. Although usage figures from telecommunications service providers are proprietary and rarely available, the trade press continues to carry stories about the imminent large-scale adoption of voice over packet networks (e.g., Barthold, 2001, 2002; Bischoff, 2002; O'Shea, 2002). Fitchard (2002) claims: "Voice over IP has come of age, and new applications from videoconferencing to IP/PBX are eking [sic] into the enterprise world," and reports that specific enterprise applications fall in the following (descending) order of current usage: toll bypass, IP voice services,
IP/PBX, branch office link to a central PBX, and videoconferencing. Work continues on VoIP in various standards bodies (see www.ietf.org under iptel; www.t1.org/t1a1/a13hom.htm; www.itu.int), and a significant fraction of voice traffic worldwide is currently carried by VoIP and other packet networks (Bischoff, 2002). VoIP is in the process of evolving from a technology used by only a small fraction of the population to being part of the national communications infrastructure. Part of that process of evolving includes the growth of methods for characterizing and measuring different levels of VoIP performance.

Characterizing VoIP Performance

Currently, three approaches are used to characterize VoIP performance. One is to characterize the
performance of the system, for example in terms of bit error rate, packet loss, packet delay, packet delay variation, and echo (e.g., Kostas et al., 1998). A second is to subjectively measure the overall quality of the speech throughput of the system, e.g., using the so-called "mean opinion score" (MOS) (ITU, 1996). A third is to objectively measure the quality of the speech throughput with an algorithm that has been tuned to correlate well with subjective judgments of speech quality (e.g., Thorpe and Yang, 1999). The present study introduces a fourth approach, namely the subjective characterization of component qualities of VoIP speech.

There appears to be very little public-domain literature on VoIP characteristics. The Bell System Technical Journal lists a few studies of the speech quality of packetized voice, but these studies do not characterize how speech sounds when transmitted through packet systems. Instead, VoIP quality is described either by system performance measures such as bit error rate, or by overall subjective quality on a single good-bad type scale (Funka-Lea et al., 1998; Goodman and Sundberg, 1983; Jayant, 1981).

Presumably, much more is known about VoIP characteristics than appears in reports in the public domain. At the time that digitized speech was being introduced in commercial telephony, a vigorous research program existed in the major telephone labs. The present study is similar to studies for compressed digital speech reported more than 20 years ago (see Barnwell and Voiers, 1978; McDermott, 1978; Voiers, 1977; Voiers et al., 1976). We reinvent speech quality studies for packetized voice on the grounds that the subjective quality of packetized speech under error conditions may be qualitatively different from the subjective quality of non-packet digital speech under error conditions.
We also may be reinventing speech quality studies because much of the institutional memory of the telecommunications industry has been lost as a consequence of industry restructuring; labs, staff, and libraries have been disbanded or dispersed and information has become more proprietary. We now only discover after the fact that similar studies had been done in other labs. Perhaps documented measures of how VoIP sounds have not been necessary for practical work in the VoIP industry heretofore. Now, however, VoIP has become accepted enough that people in the telecommunications industry often ask what it sounds like under various performance conditions. Demonstrations are one answer to such questions (Committee T1, 2003). However,
demonstrations are not very portable. Also, demonstrations do not permit an analytic characterization of speech quality as a function of various causal influences. Also, for some industry applications, describing VoIP speech quality in words might be important, especially when communicating with customers and end users. Three general circumstances for encounters between a company and its customers are (a) new product development, (b) marketing, and (c) service. In each case, particular attributes or features of the product and its performance are likely to be at issue, and the company and customer require a common vocabulary for describing them. For example, if a customer calls in to a VoIP phone service supplier and wants to describe a problem with the speech quality, some commonly understood terms might be useful to the customer service agent in diagnosing the problem. If the customer, for example, said that the service sounded simply "bad," the customer service agent might inquire whether it sounded bad in a particular way such as having an echo, sounding like static, or syllables disappearing. Or, if a VoIP service provider conducted a tracking study of service quality, that study would most likely be conducted as a survey, and the survey might benefit from a commonly understood vocabulary beyond "good" vs. "bad."

Goals of Current Study

The current study has two general goals, one to develop qualitative descriptors (words) of VoIP under various operating scenarios, and the second to quantitatively relate the descriptors to objective attributes of VoIP performance. More specifically, the current study intends to:

• Develop and validate a set of verbal descriptors for VoIP.
• Quantitatively relate the descriptors to packet loss and packet delay variation ("jitter").
• Quantitatively relate the descriptors to mean opinion score (MOS).
• Using the descriptors, describe how two popular codecs sound under various operating conditions.
The study is intended to apply across a broad range of packet network performance. The study is not intended as an evaluation of particular packet network components. This study also is not intended as a final word on
how to describe the sound of packetized speech; with technology changing, presumably this study should be replicated every few years.

Method

Two-Stage Procedure

Developing Candidate Descriptors. The need to develop verbal descriptors, while rare in telecommunications standards, is common in new product development. The subjective variables or attributes that describe and distinguish (potential) products often are not known and have to be discovered or developed. The general technique in consumer research for developing and then applying descriptors in a new product domain is called the "Voice of the Customer" (Griffin and Hauser, 1993). It consists of two main phases:

– A qualitative phase in which users or consumers are interviewed in order to develop a set of descriptors for the domain;
– A quantitative phase in which a second set of users evaluates a set of products with respect to the descriptors; output consists of both the product evaluations and estimates of the relative importance of the individual descriptors.

Subjects. Thirty-one subjects provided the judgment data in the present experiment, five in the interview phase and 26 in the quantitative phase. The subjects were not company employees (except for the author). Their ages varied from 20 to 58. Females were 17/31 of the sample. Subjects were not screened for technology use other than the trivial screen that they use a telephone. The subjects were drawn from a subject pool characterized by a median income of approximately $50k and a higher than usual education level.

Stimuli

Variables Manipulated. The main goal for assembling the set of speech samples was that they cover the range of subjective qualities for speech through VoIP systems available at the time. The present set of variables and their levels are generous in their coverage.

Codec. Two quite different codecs in common use today were used in the experiment. The G.711 represents simplicity and high performance at the expense of
bandwidth. It does not compress the digitized speech; it runs at 64 kb/s. The G.711 standard also currently does not incorporate any error concealment for lost packets, although a proposal is on the table for incorporating error concealment in future versions of the standard (Kapilow and Perkins, 1999). The G.729A standard does compress speech to 8 kb/s and does incorporate packet loss concealment code, at the expense of requiring more processing time and computing cycles. Specific implementations of these standards can vary in a number of significant details.

Packet Loss. Defining a "realistic" level of packet loss is difficult. The level of packet loss depends on the level of service being offered. Private packet networks presumably operate with low levels of packet loss, while telephony over the Internet is often delivered with high levels of packet loss. If one constrained the packet loss range by specifying a level of service, there is still the problem of getting network operation data: Network operators are cautious about revealing network performance (also see Aracil et al., 1999, p. 262, regarding difficulty estimating packet loss in a non-proprietary network). More certain and accessible is the range of packet loss talked about in the engineering community. Dvorak (2002) considers packet loss in the range 0–2%. Perkins et al. (1999, Table 3) consider packet loss in the range 0–4% in a major network planning model. Also, packet loss in the range 0–10% has been described as being not unusual or alarming (Kostas et al., 1998). In any event, the recordings used in the present study cover the range of 0–5% packet loss.

Jitter. The range of delay variation or jitter was 0–100 ms, as defined in the NIST Net emulator (below). This range is fairly large; it definitely covers the range of jitter discussed at the T1A1 standards meetings.

Manufacturer. Both codecs were represented by two well-known manufacturers.
To the extent that the manufacturers’ implementations of the standards-based codecs differ at all, the differences are proprietary. Thus, it is difficult to say how large a range of manufacturer differences is covered by the collection used here. Two major variables are not covered in this study: delay and echo. We do not have a way of representing delay in recordings of sentences that are delivered by one-way playback. Echo was left out of this set of recordings for two reasons. One is that there are already
Table 1. Design of experiment. Each × represents one of the 24 combinations of variables: the four manufacturer/codec pairs (Manf. A and Manf. B, each with G.711 and G.729A), crossed with packet loss of 0, 1, or 5% and jitter of 0, 50, or 100 ms; 24 of the 36 possible cells were used.
more variables than can easily be dealt with. The other is that echo has such a dominating effect that it could have made it difficult to distinguish effects of any of the other variables.

Combinations of Variables. A full factorial experiment with the four variables above would require 36 stimuli. The present study was designed as a partial factorial using 24 of the 36 possible combinations, as shown in Table 1. This design economized in the amount of recording necessary and in the amount of subjective data to be collected, at the cost of not being fully orthogonal. The SAS∗ (1989) GLM statistical analysis procedure handles non-orthogonal designs.

Eight Sentence Samples Per Combination of Variables. The recordings were eight short pairs of sentences from the phonetically-balanced "Harvard sentences" (IEEE, 1969). An example is: "Oak is strong and also gives shade. Cats and dogs each hate the other." Four samples were in female voices and four were in male voices. The same eight samples were recorded through each of the 24 combinations of variables (or systems) plus in their original form, giving a total of 200 samples.

Production of Sound Samples

– The speech samples were each eight sec long and stored as a 128 kbyte WAV file. The speech had been digitized at 8000 samples per sec and 16 bits per sample.
– A Hammer∗ call emulation system transformed each speech sample into 64 kb/s pulse code modulation (PCM) format and output the sample to the first of a pair of gateways.
– The gateways used in the experiment were an Ascend 6000∗ configured for G.729A or G.711 and a Cisco AS 5300∗ also configured for G.729A or
G.711. These gateways were 1999 models. Note that the point of the experiments is to develop vocabulary for describing the sound of packetized speech; it is not to compare particular products on the market today. The jitter buffers of the gateways for this experiment were set to 180 ms. Each packet contained 20 ms of speech signal, equal to two frames. There was no voice activity detection, and no adaptive jitter buffer. Echo cancellation was implemented. A gateway encoded and converted the speech sample to packets, and output it to an unused subnet of the local ethernet system.
– NIST Net (Carson, 1998) is a freeware network emulator that represents the output performance of an IP network. The emulator examines an input packet stream packet by packet, and perturbs each packet according to a stochastic model. The user of the emulator controls three parameters of the model: (1) the mean delay, in ms, added to each packet (set to 0.0 for all conditions in the present experiment), (2) the standard deviation of a Pareto normal distribution of delays, in ms, added to any packet ("jitter"), and (3) the probability that the packet is discarded. The packet stream exiting the emulator was output to the local lab ethernet. Note that the emulation is a random process: Two speech samples that are modified by the emulator with exactly the same settings are very likely to be perturbed to different degrees and in different parts of their time series. Note that the packet loss generating algorithm in this version of NIST Net uses a constant probability of packet loss; currently there is interest in "bursty" packet loss, in which the loss probability is not constant over time (Dvorak, 2002). Also note that the algorithm for generating delays to apply to the packets is constructed in a way that can change the order of packets when jitter is large, a condition that can occur in operating IP networks.
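The per-packet perturbation just described (random delay, constant-probability loss, possible reordering) can be sketched in a few lines of Python. This is an illustrative stand-in, not NIST Net's code: the Pareto-normal delay distribution is approximated here by a plain Gaussian, the 180 ms jitter buffer is not modeled, and all names and parameter values are assumptions.

```python
import random

# Packet arithmetic from the setup above: 64 kb/s PCM in 20 ms packets.
PCM_RATE_BPS = 64_000                  # G.711: 8000 samples/s x 8 bits
PACKET_MS = 20                         # two frames per packet
PAYLOAD_BYTES = PCM_RATE_BPS // 8 * PACKET_MS // 1000   # 160 bytes of speech

def emulate(n_packets, jitter_sd_ms=50.0, loss_prob=0.01, seed=0):
    """Perturb a packet stream: drop each packet with constant probability,
    add a random delay, and deliver the survivors in arrival-time order."""
    rng = random.Random(seed)
    delivered = []
    for seq in range(n_packets):
        if rng.random() < loss_prob:                      # constant-probability loss
            continue
        send_ms = seq * PACKET_MS
        delay_ms = max(0.0, rng.gauss(0.0, jitter_sd_ms))  # mean added delay 0 ms
        delivered.append((send_ms + delay_ms, seq))
    delivered.sort()                                      # network delivers by arrival
    return delivered

# High-impairment condition from the study: 5% loss, 100 ms jitter.
arrivals = emulate(400, jitter_sd_ms=100.0, loss_prob=0.05)
lost = 400 - len(arrivals)
reordered = sum(later[1] < earlier[1]
                for earlier, later in zip(arrivals, arrivals[1:]))
```

With 100 ms of jitter, delays routinely exceed the 20 ms packet spacing, so the arrival list can contain out-of-order sequence numbers, mirroring the reordering noted above.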
Further, note that the definition of jitter is currently being debated; see Morton (2001). The telecommunications standards world currently seems more interested in definitions based on ranges of delay observed over a time series, whereas the NIST Net emulator defines jitter as the standard deviation of a distribution of delays that is sampled. The standard deviation is roughly one third the size of the range from 5 to 95% of a Gaussian distribution. – The system was symmetric about the emulator: Another gateway to transform packets back into PCM,
Table 2. Fit of the predictor variables packet loss, jitter, and the manufacturer × codec interaction to each descriptor variable.

Descriptor          Model df   R2     F
Staticky            7          0.85   12.47
Choppy              7          0.89   18.12
Warbly              7          0.87   15.90
Slurred             7          0.83   11.17
Garbled             7          0.90   20.55
Whooshy             7          0.49    2.18
Fading              7          0.82   10.13
Lost words          7          0.65    4.18
Dropped syllables   7          0.73    6.12
Distortion          7          0.87   15.10
then back through the Hammer system to digitally record and store the speech files.
– The PCM files were transformed into WAV files in a comparable 8 kHz, 16-bit format (using the editor Cool Edit 96∗). The files were trimmed to 8 sec (approximately 125 kbytes) and switching transients were removed.
– The WAV files were played back to subjects via the Sound App∗ application over Bose∗ speakers in an acoustically-treated room with 38 dB(A) background noise. Data collected using the speakers are highly correlated with data collected using a handset in another experiment with the same recordings (Table 3 below).

Table 3. Fit of linear regression models for MOS.

Variables                  R2
"Total"                    0.91
Garbled                    0.91
Distorted                  0.92
Fades                      0.34
Warbly                     0.86
Staticky, Fades            0.95
Staticky, Choppy, Fades    0.97

Note that many details of the gateways' performance are not accessible or documented; these gateways are not really intended for use in scientific experiments. One cannot say exactly how the gateways implement codec algorithms; codec standards leave some room for engineering creativity. Nor can one say how jitter buffers are implemented. Therefore, one cannot say how often a given level of jitter is transformed into extra packet loss. Nor do we have information about how packet loss concealment was actually implemented in the gateways; the results suggest that one of the gateways may have used packet loss concealment when its G.711 codec was operating. Because of these uncertainties about the particular gateways used in the current studies, and because technology is changing very quickly, one is never quite sure how generalizable results can be from an experiment based on any specific equipment. In particular, the equipment used here may not be very representative of VoIP telephones. Other researchers are encouraged to replicate the present study with other gateways or VoIP phones, other emulators, other samples of subjects, and other interviewers.
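As an aside, the rule of thumb used earlier to relate the two jitter definitions (the standard deviation is roughly one third of a Gaussian's 5-to-95% range) can be checked with Python's standard library:

```python
# Verify: for a Gaussian, sigma is roughly one third of the 5%-95% range.
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.95)      # 95th percentile in sigma units (~1.645)
range_5_95 = 2 * z95                  # width of the 5%-95% interval (~3.29 sigma)
sigma_as_fraction = 1 / range_5_95    # ~0.30, i.e., roughly one third
```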
Interview Procedure

Five subjects were interviewed individually as they listened to recordings of VoIP samples played through various combinations of codec and network performance. The original version of each recording was presented, followed by a test version. Subjects were asked to attend to differences between the original and the test recording, and to describe all the differences they could hear. Each recording was played until the subject could find no more characteristics to describe. The interviewer did not point out specific features of the recordings and did not help subjects with their descriptions. Recordings with maximum packet loss and jitter were presented first. Later, recordings from a different codec and then from a different manufacturer were played. Then recordings with smaller values of packet loss and jitter were played. Each interview lasted roughly an hour.
Rating Speech Samples with Respect to Verbal Descriptors

The instructions to a subject explained that they would be judging 24 telephony systems by means of recorded sentences that had gone through each system. The original recordings were demonstrated. Then the 10 descriptors that had been derived from the interviews were explained (see below). Use of a seven-point rating scale was explained. For each descriptor, the subject was to indicate the extent to which the descriptor applied to the
set of speech samples just presented. The seven-point scale used was:

Amount of effect on speech recording:
1 = None, can't hear it; 4 = Some, intermittent; 7 = A lot, all the time
The speech samples were presented in groups of eight. Subjects rated a system with respect to the 10 descriptors only after all eight samples had played. Subjects were told that their judgment should reflect the system's effect taken across the whole group of sound samples. Subjects were given sheets listing the descriptors so they could keep notes during the playing of a suite of eight samples. The order of the 24 systems was randomized separately for each subject. The order of the suite of eight speech samples was also randomized separately for each subject and system. Subjects delivered their judgment ratings in the same order for each system.

Results

Interview Data

From 14 to 33 descriptors were collected from each individual subject. These descriptors were redundant, both within each subject's list and across subjects. Each subject's list was pruned conservatively, leaving in descriptors in case of doubt. Lists of descriptors were then combined across subjects, again pruning redundant descriptors. The twelve descriptors below emerged from this pruning:

– Staticky/crackly
– Warbly/quavery/quivery
– Slurred/fuzzy
– Garbled
– Choppy/jerky
– Light sabre/hum
– Whooshy/breathy
– Distortion comes and goes
– Sound level fades/muffled
– Interrupted/lost words (individual words, not all of them)
– Syllables are dropped/clipped off
– Echo
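The randomization scheme described above (system order shuffled per subject; sample order shuffled separately per subject and system) amounts to a few lines of code. This sketch is an illustration, not the software actually used in the study; all names are invented:

```python
# Build a per-subject presentation plan: 24 systems in random order,
# each paired with its own random ordering of the 8 speech samples.
import random

N_SYSTEMS, N_SAMPLES = 24, 8

def presentation_plan(subject_id):
    rng = random.Random(subject_id)      # reproducible per subject
    systems = list(range(N_SYSTEMS))
    rng.shuffle(systems)                 # system order, per subject
    plan = []
    for system in systems:
        samples = list(range(N_SAMPLES))
        rng.shuffle(samples)             # sample order, per subject and system
        plan.append((system, samples))
    return plan

plan = presentation_plan(subject_id=1)
```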
After the first three subjects in the second (quantitative) phase of the study, the descriptors echo and light sabre/hum were eliminated because subjects in this phase were not detecting either of those attributes in the recordings. Given that each subject evaluated each of 24 systems with respect to the battery of descriptors, and given that the experimental sessions were running longer than an hour, the benefit of keeping rarely-used descriptors was not worth the cost. The final battery consisted of 10 descriptors.

Previous Data Set of MOS as a Function of Variables Manipulated

A data set was already on hand regarding the judged overall quality of the same 24 systems under study here (Cermak, 2001). These data can be useful for comparison with the descriptor data collected in the present study. Presumably, the overall quality is either wholly or at least partially a function of the component qualities of the sound samples. The data analysis which follows determines the extent to which the component descriptors predict the overall judged quality (MOS) of the speech samples. Further, the analysis determines which of the descriptors are most important in predicting the overall quality data. Note that the previous study used a standard telephone handset for playback rather than speakers.

Examples of Average Profiles

G.729A of Manufacturer A vs. G.711 of Manufacturer B. Figure 1 shows the profiles for the gateways A729 (codec G.729A of Manufacturer A) and B711, averaged across all subjects, for the high impairment condition of 100 ms jitter and 5% packet loss. The B711 was rated higher in the descriptors of harshness (staticky, choppy), and the A729 was rated higher in the descriptors for fading and the complete loss of words and syllables. Analysis of variance (below) shows that the difference in profiles is statistically reliable. The two gateways do sound different under large loads of packet loss and jitter; the profiles in Fig.
1 express that difference in the words of non-expert phone users.

Average Profiles for Low, Medium, and High Levels of Impairment. Figure 2 shows average profiles for zero packet loss and jitter, 1% packet loss and 50 ms jitter, and for 5% packet loss and 100 ms jitter for the A729 codec. For most of the descriptors there is a
Figure 1. Average profiles for codec G.729A of Manufacturer A and G.711 of Manufacturer B at 100 ms jitter and 5% packet loss.
clear increase as the impairments go from low to high, and the middle level of impairment receives a middle score. That is, the descriptors act like reasonable rating scales. Figure 2 also shows that the descriptors are not equal in their ability to discriminate different levels of impairment—"fading" shows a large difference between the low and high conditions, while "whooshy" shows none. Also, one might suspect that the scores for the various descriptors are correlated.

Analytical Results

The utility of the descriptor variables depends in part on their relationships with two other sets of variables: (a) variables that describe how speech transmission was manipulated, i.e., type of codec and manufacturer, packet loss, and jitter; (b) variables that describe how the speech transmission had been evaluated by human judges with regard to overall quality (e.g., MOS). The former set of variables might be viewed as causes of the speech impairments named by the descriptors. The latter variable can be viewed as an effect of the speech impairments.
Factor Analysis. During the experimental sessions some subjects applied the descriptors differentially to each set of speech samples. Most subjects, however, tended to group their responses into categories in which several descriptors would rise and fall as a unit, depending on the speech samples. Given this observation, the resulting dataset should be redundant. A “principal components” version of factor analysis was applied to the 10 descriptor variables represented by 624 observations (24 speech samples from different systems by 26 subjects). Three orthogonal factors accounted for 67% of the variance in the data. The first (and largest) component was fairly clearly a simple “bad-good” factor: All the descriptors correlated positively with it; “Garbled” and “Distortion comes and goes” correlated most highly. During the experimental sessions the subjects seemed to treat these two descriptors in the general sense of “not good,” whereas the other descriptors were used in a more selective way to describe certain kinds of impairment. The second and third components identified descriptors that moved in opposite directions for at least a subset of the samples—“Staticky/crackly,”
Figure 2. Average profiles for codec G.729A for Manufacturer A at zero packet loss and jitter, 1% packet loss and 50 ms jitter, and 5% packet loss and 100 ms jitter.
"Warbly/quavery/quivery," and "Choppy/jerky" moving as a unit in one direction, and "Sound level fades/muffled," "Interrupted/lost words," and "Syllables are dropped/clipped off" moving as a unit in the other direction. "Slurred/fuzzy" and "Whooshy/breathy" also moved as a unit for some subjects, independent of the other descriptors. Thus, the principal components analysis indicated that, on average across the group of subjects, the suite of descriptor variables was composed of about three independent pieces of information or dimensions, the rest being redundant. The primary dimension, a bad-good dimension, appears to coincide with changes in packet loss and jitter—a hypothesis to be examined in the next section. The secondary dimension is bipolar, and might be viewed as two qualitatively different effects. This dimension seems to capture the different personalities of the sounds of the G.729A and G.711 codecs when they are stressed by packet loss and jitter. We use these ideas in the next sections.
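The principal-components computation can be illustrated with a self-contained sketch. Everything in it is synthetic: the three latent factors, the loadings, and the noise level are assumptions chosen only to mimic the structure described above (one general bad-good factor plus two smaller ones), and a pure-stdlib power iteration stands in for a linear-algebra package.

```python
# 624 synthetic "ratings" on 10 descriptors (the study's data shape),
# generated from three latent factors; the leading eigenvalue of the
# correlation matrix shows how much variance one component captures.
import random

rng = random.Random(42)
N, P = 624, 10

rows = []
for _ in range(N):
    f1, f2, f3 = (rng.gauss(0, 1) for _ in range(3))
    row = []
    for j in range(P):
        x = f1                       # general bad-good factor loads on all
        if j < 3:
            x += 0.8 * f2            # "harsh" cluster
        elif j < 6:
            x -= 0.8 * f2            # "dropped speech" cluster (opposite pole)
        if j >= 8:
            x += 0.8 * f3            # slurred/whooshy pair
        row.append(x + rng.gauss(0, 0.5))   # measurement noise
    rows.append(row)

# Correlation matrix of the standardized columns.
means = [sum(r[j] for r in rows) / N for j in range(P)]
sds = [(sum((r[j] - means[j]) ** 2 for r in rows) / N) ** 0.5 for j in range(P)]
z = [[(r[j] - means[j]) / sds[j] for j in range(P)] for r in rows]
corr = [[sum(zr[i] * zr[j] for zr in z) / N for j in range(P)] for i in range(P)]

# Power iteration for the leading eigenvalue/eigenvector.
v = [1.0] * P
for _ in range(200):
    w = [sum(corr[i][j] * v[j] for j in range(P)) for i in range(P)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]
lead_eig = sum(v[i] * sum(corr[i][j] * v[j] for j in range(P)) for i in range(P))
share = lead_eig / P   # fraction of total variance (trace of a corr matrix is P)
```

On data generated this way the leading component typically absorbs roughly half of the total variance, qualitatively matching the dominant bad-good factor reported above.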
System Variables Predict Descriptors. ANOVA (analysis of variance computed with the SAS GLM procedure) was used to determine how well the variables manufacturer, codec, jitter and packet loss predict the individual descriptor variables, and whether some of the descriptors were better predicted by manufacturer and codec and some were better predicted by jitter and packet loss. Several combinations of variables were tried. The basic problem is that there is a four-way interaction of variables going on in a system without very many free parameters: At zero packet loss, all four combinations of manufacturer and codec show almost no amount of any impairments named by the descriptor variables. At high levels of packet loss, all the descriptors increase, but not uniformly. For one of the two manufacturers, but not both, the two codecs diverge greatly in their performance. That is a three-way interaction of manufacturer, codec, and packet loss. This performance
difference appears in one set of descriptors for the G.729A and in another set of descriptors for the G.711. That is the four-way interaction. Because of the complicated interaction of variables in this dataset, it is difficult to choose one model over another. Also, all the models fit the data extremely well, so that there was very little practical advantage of having one model rather than another. The main results of the ANOVA were:

• Jitter was of marginal importance. This was also true in the original study which used these recordings (Cermak, 2001).
• The full model using manufacturer × codec × packet loss predicts the individual descriptors well, but is not very interesting. With only 24 systems and 12 combinations of manufacturer/codec and packet loss, the model is bound to predict well. Also, it does not allow a contrast between the effect of packet loss and the effect of manufacturer/codec.
• The simpler models with just one two-way interaction, either between manufacturer and codec or between codec and packet loss, fit the individual descriptors with R2 values between 0.7 and 0.9 (correlations between about 0.85 and 0.95; see Table 2).
• In the model that allowed a distinction between the gateways and the network impairments, packet loss was a strong predictor of all the descriptors and was the best predictor of the variables that describe a harsh or distorted sound quality: staticky/crackly, warbly/quavery/quivery, choppy/jerky. Manufacturer × codec was the best predictor, or was tied for best, for the descriptors of parts of the speech signal being lost entirely: whooshy/breathy, sound level fades/muffled, interrupted/lost words, syllables are dropped/clipped off. Presumably, this variable is picking up differences among the codecs regarding whether packet loss concealment was activated or not.

Figure 3 shows an example of the core results of this part of the analysis. In the figure are examples of the two qualitatively different categories of descriptor questions, staticky/crackly and sound level fades/muffled, and the four combinations of manufacturer and codec. For this figure, the data are averaged across all packet loss conditions.

Figure 3. Ratings for four codecs on two qualitatively different descriptors.

Consider the descriptor "fading" in the figure. The codec A729 is a significant outlier (worst). Now consider "staticky." The A729 is now the best of the lot and A711 is the worst. The B codecs are in the low-middle range of ratings for all conditions. This pattern of results holds across the rest of the descriptors: The A729 is susceptible to dropping or covering parts of words and sentences, the A711 sounds harsh when there is packet loss, the A729 sounds good when it is not dropping parts of the signal, and the B711 and B729 always do fairly well, not bad and not especially good.

Relationship Between Descriptors and Overall Quality. The final set of relationships to calculate are those between the descriptors and the measure of overall quality of the speech samples, MOS. For this analysis the mean ratings for each of the 24 systems with respect to the descriptor variables and MOS were used. The mean ratings are continuous variables, so the analysis was just straightforward regression. A one-parameter regression does quite well, although it is not very interesting. There are three very similar one-parameter models, all involving an overall "bad-good" variable. Two of these are the descriptors garbled and distortion comes and goes that subjects treated as summaries of their impressions of the samples. The third variable of this type was just a total rating summed across all descriptors. These one-parameter models neglect the qualitative differences among the sounds of the various systems. Among the variables that do capture qualitative speech sound differences, those that describe harshness (staticky, warbly, choppy) correlate with MOS much better than do the variables that describe dropping part of the content of the speech (fades, lost words, lost syllables). Two-parameter models that include one of the "overall" variables always do slightly better than two-parameter models using the more descriptive variables. The best three-parameter models do not include any of the "overall" variables.
Table 3 shows these results.

Conclusions

Study Success

In studies like the present one, there is no way of knowing whether the set of descriptors is "optimal" in some sense. A less ambitious goal is that the descriptors cover the set of attributes that are important for VoIP. However, the set of attributes is not known independently, so there is no way to compare
the current set of descriptors against such a benchmark. A set of descriptors can be useful and interesting even if not optimal. Usually, "interesting" and "useful" are determined by the relationships among the set of descriptors and other variables. Establishing such relationships has been the major point of the analysis of this experiment: between the descriptors and the objective variables that define the VoIP systems used, and between the descriptors and another measure of subjective quality, MOS.

Meaningful Descriptors

The set of speech quality descriptors examined here is meaningful in four senses:

1. The descriptors are systematically related to objective variables that logically could cause changes in how speech sounds over VoIP: codec type and manufacturer, network packet loss, and network packet jitter. The relationships among the descriptors and the objective variables give one a way to talk about the differences between the sound character of the G.729A and the G.711 codecs.

2. The descriptors are very accurate in predicting MOS. Descriptors having to do with harshness of VoIP sounds due to packet loss are the best predictors. Descriptors of how speech sounds when portions of the speech content have been deleted add predictive power, but are not accurate predictors by themselves.

3. The descriptor data are internally consistent; the internal data noise was fairly low in the ANOVA.

4. Face validity: the two-phase method used in the present study produces descriptors that are transparent by design.

Possible Applications

Two potential applications of the descriptors would be in situations of customer contact where customers and telecommunications employees need to describe how VoIP sounds. One instance is customer service; another is market research.
For example, if one were assessing consumer acceptance of a VoIP service using a survey, and if the sound quality of the service were expected to fall below PSTN sound quality, then the best descriptors to use in the survey would be "garbled" and "distorted," perhaps with modifiers such as
"seldom," "often," "slightly," or "completely." In a customer call center where customers report problems, the more discriminating descriptors might be preferred: "staticky/crackly," "choppy/jerky," "warbley/quavery," "sound level fades/muffled," "interrupted/lost words," "syllables are dropped/clipped off."

Notes

B. Khasnabish, K. Ratnam, and W. Yang made the IP network and gateways available for these studies. P. Skelly set up the NIST Net emulator on site and made it available. L. Eggert recruited consumer judges and helped record the modified speech samples. R. Jagadeesan of Cisco Systems∗ kindly provided the original speech samples on which the study is based.

∗ Trademarks and tradenames indicated by "∗" are the trademarks and tradenames of their respective holders.
References

Aracil, J., Morato, D., and Izal, M. (1999). Analysis of Internet services in IP over ATM networks. IEEE Communications Magazine, 37(12):258–266.

Barnwell, T.P. and Voiers, W.D. (1978). Objective measures for speech quality testing. Journal of the Acoustical Society of America, 64(S1):s140 (meeting abstract).

Barthold, J. (2001). Slo-mo packets. Telephony, March 26, 2001 (at www.telephonyonline.com).

Barthold, J. (2002). A telephony-sized appetite. Telephony, June 10, 2002 (at www.telephonyonline.com).

Bischoff, G. (2002). Opening the VoIP floodgates. Telephony, Feb. 11, 2002 (at www.telephonyonline.com).

Carson, M. (1998). Documentation for NIST Net emulator. http://www.antd.nist.gov/itg/nistnet/faq.html and http://www.itl.nist.gov/div892/itg/carson/nistnet/slides/index.htm.

Cermak, G.W. (2001). Subjective quality of speech over packet networks as a function of packet loss, delay and delay variation. International Journal of Speech Technology, 5(1):65–84.

Committee T1 (2003). Descriptors for user-perceived impairments in speech over voice-over-Internet-Protocol (VoIP) networks. Technical Report T1.TR.80. Washington, DC: Standards Committee T1 Telecommunications.

Dvorak, C. (2002). A framework for setting packet loss objectives for VoIP. Contribution T1A1.3/2002-031. Washington, DC: Standards Committee T1 Telecommunications.
Fitchard, K. (2002). Ramping up to VoIP. Telephony, March 11, 2002 (at www.telephonyonline.com).

Funka-Lea, C.A., Janczewski, C.L., Lau, W.C., Nagarajan, R., Wang, Y.-T., and Xin, Z.-L. (1998). QoS routing and performance in packet networks: A visual simulation platform and case study. Bell Labs Technical Journal, 3(4):240–254.

Goodman, D.J. and Sundberg, C.-E. (1983). Combined source and channel coding for variable-bit rate speech transmission. Bell System Technical Journal, 62(7):2017–2036.

Griffin, A. and Hauser, J.R. (1993). The voice of the customer. Marketing Science, 12:1–27.

IEEE (1969). IEEE recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, Sept., 227–246.

ITU-T (1996). Methods for subjective determination of transmission quality. Recommendation P.800. Geneva: International Telecommunications Union.

Jayant, N.S. (1981). Adaptive post-filtering of ADPCM speech. Bell System Technical Journal, 60(5):707–717.

Kapilow, D. and Perkins, M. (1999). Proposal for T1 standard on frame erasure concealment for G.711. Contribution T1A1.7/99-012. Washington, DC: Standards Committee T1 Telecommunications.

Kostas, T.J., Borella, M.S., Sidhu, I., Schuster, G.M., Grabiec, J., and Mahler, J. (1998). Real-time voice over packet-switched networks. IEEE Network, 12(1):18–27.

McDermott, B.J. (1978). Subjective attributes that influence judgments of digital transmission quality. Journal of the Acoustical Society of America, 64(S1):s140 (meeting abstract).

Morton, A.C. (2001). Proposal for delay variation parameters in Rec. Y.1540. Contribution T1A1.3/2001-015. Washington, DC: Standards Committee T1 Telecommunications.

O'Shea, D. (2002). VoIP carriers, vendors stand at the crossroads. Telephony, July 1, 2002 (at www.telephonyonline.com).

Perkins, M.E., Dvorak, C.A., Lerich, B.H., and Zebarth, J.A. (1999). Speech transmission performance planning in hybrid IP/SCN networks. IEEE Communications Magazine, 37(7):126–131.
SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, 4th edition, Vol. 2. Cary, NC: SAS Institute, Inc.

Thorpe, L. and Yang, W. (1999). Performance of current perceptual objective speech quality measures. Proceedings of the 1999 IEEE Workshop on Speech Coding for Telecommunications. Porvoo, Finland: IEEE, pp. 144–146.

Voiers, W.D. (1977). Individual differences in valuation of perceived speech qualities. Journal of the Acoustical Society of America, 62(S1):s5 (meeting abstract).

Voiers, W.D., Sharpley, A.D., and Lake, O.L. (1976). Journal of the Acoustical Society of America, 59(S1):s55 (meeting abstract).