International Journal of Speech Technology 2, 45-59 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.
Factors Affecting Users' Choice of Words in Speech-Based Interaction with Public Technology C. BABER
Industrial Ergonomics Group, School of Manufacturing & Mechanical Engineering, University of Birmingham, Birmingham, B15 2TT G.I. JOHNSON
Technology Development, NCR, Kingsway West, Dundee, DD2 3XX D. CLEAVER
Industrial Ergonomics Group, School of Manufacturing & Mechanical Engineering, University of Birmingham, Birmingham, B15 2TT
Received April 19, 1996; Accepted September 19, 1996
Abstract. This paper reports three studies of factors which affect the choice of words in simple speech-based interactions. It is shown that choice of words is affected by the level of constraint imposed on users, such that variability is much higher when no constraint is applied than when some form of constraint is used, and that variability can be reduced by employing different forms of feedback. In particular, the design of visual and auditory feedback has a bearing on users' choice of words. However, it is proposed that these results do not necessarily indicate that people copy the computer, but arise from users developing appropriate communication protocols in their transactions. The paper concludes that choice of words is subject to a number of factors, that some of these factors can be modified through system design, but that 'out-task' vocabulary and inappropriate use of commands can still present problems. Until we have a better understanding of the linguistics of speech-based interaction with machines, these problems will remain intractable.

Keywords:
vocabulary selection, human factors, automatic speech recognition, public technology
Introduction

In this paper a surprisingly under-researched topic is considered: how do people choose which words to use when speaking to a computer? Three studies were conducted to examine factors influencing people's choice of words when speaking to automatic speech recognition (ASR) technology, and these are discussed in this paper. One reason why the problem has not been addressed is that early applications tended to be industrial (Hollingum and Cassford, 1988), with predefined vocabularies of a few hundred words and trained users (Noyes et al., 1992). Clearly the rapid increase in
telephone-based applications has led to a change in the field, with untrained users and requirements to expand vocabulary size to cope with the range of commands and requests such services will handle. With untrained users comes the problem of people using words which do not lie within the vocabulary of the ASR device, i.e., out-task vocabulary.
Handling 'Out-Task' Vocabulary

The problem of defining appropriate vocabulary has beset ASR research for many years and the speech community recognises the problems of 'out-task'
vocabulary. Not only has the problem been recognised for some time, but a number of approaches have been developed in an effort to deal with it. The following sections present examples of some of the approaches which have been suggested for handling out-task vocabulary.
Using Vocabulary Design. A feature of many applications is the limited choice of words offered to users, e.g., users might be restricted to using the digits 0-9 and 'yes' and 'no'. While this limitation can allow some operations, it necessarily restricts functionality and leads to devices operating in what might be termed 'Spanish Inquisition' mode, with users having to answer a long string of questions before finding the information they require. It is apparent that, even with questions designed to elicit 'yes' or 'no', some users will reply with other words (Baber et al., 1990; Spitz, 1991). Thus, using a restricted vocabulary with apparently unambiguous questions need not eliminate out-task vocabulary¹. An alternative approach is to develop an exhaustive vocabulary, containing all the words which people are likely to use. For example, Dillon et al. (1993) found that, in an application for nursing, recognition accuracy of phrases was superior when a 103-word vocabulary was used in comparison to a 73-word vocabulary².

Using Machine Intelligence. In developing the VODIS train enquiry system, researchers used a frame-based dialogue manager (Peckham, 1989); words are allocated slots in a command structure and the slots are filled as users speak relevant words (or words which can substitute for relevant words). Following parsing, the system prompts the user for more information by asking more specific questions. The approach led to specific user behaviour, e.g., there was a clear interaction between task type and success, with around 80% of callers successful on enquiries involving single destinations and 40% of callers successful on enquiries involving multiple destinations (Cookson, 1988). Furthermore, the majority of callers (91%) used single-word expressions in their transactions.
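The frame-based, slot-filling style of dialogue management can be illustrated with a minimal sketch. The slot names, keyword-to-slot mapping and prompt wording below are invented for illustration; they are not taken from the VODIS implementation.

```python
# Minimal sketch of frame-based slot filling: recognised words fill slots
# in a command frame, and any unfilled slot triggers a more specific prompt.

FRAME = ["origin", "destination", "travel_date"]  # hypothetical slot names

KEYWORDS = {  # hypothetical mapping from recognised word to (slot, value)
    "london": ("destination", "London"),
    "birmingham": ("origin", "Birmingham"),
    "tomorrow": ("travel_date", "tomorrow"),
}

def fill_frame(utterance, slots=None):
    """Parse recognised words into slots; return the updated frame and a
    prompt for the first slot still empty (None when the frame is full)."""
    slots = dict(slots or {})
    for word in utterance.lower().split():
        if word in KEYWORDS:
            slot, value = KEYWORDS[word]
            slots[slot] = value
    for slot in FRAME:
        if slot not in slots:
            return slots, f"Please state the {slot.replace('_', ' ')}."
    return slots, None

# After "I want to go to London tomorrow", the origin slot is still empty,
# so the system would ask the more specific question "Please state the origin."
```

The point of the sketch is the dialogue behaviour described above: the system accepts whatever relevant words it can extract, then narrows its questioning to whatever is missing.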
Cookson (1988) suggests that the ASR system performance was better with isolated than with connected speech and that "Subjects who began their initial VODIS conversation with connected speech soon 'learned' to abandon its use and switched to one or two word responses" (p. 1318).

Word Spotting. Where the 'machine-intelligence' approach involved recognition of all incoming words (and had attendant problems caused by recognition error), an
alternative approach seeks to filter incoming speech in order to focus attention only on relevant words. A number of approaches can be found which fit this general category of 'word-spotting'. The likely success of word-spotting is related to the amount of extraneous speech which has to be filtered by the system. Brems et al. (1995) propose a simple metric of likely success of word-spotting by dividing utterances produced by participants in their study into 1 of 3 categories: (i) expressions which did not contain an acceptable keyword; (ii) expressions which contained the keyword with no more than two additional words; (iii) expressions which contained the keyword and more than two additional words. The implications of this approach are that appropriate, relatively unambiguous keywords can be defined and that people can be encouraged to limit their extraneous speech.
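The three-way metric of Brems et al. (1995) can be sketched as a simple classifier. The keyword set below is invented for illustration and is not taken from their study.

```python
# Sketch of the Brems et al. (1995) metric: each utterance falls into one
# of three categories according to whether it contains an acceptable
# keyword and how many additional words surround it.

KEYWORDS = {"balance", "withdraw", "receipt"}  # illustrative keyword set

def categorise(utterance):
    words = utterance.lower().split()
    keyword_count = sum(1 for w in words if w in KEYWORDS)
    if keyword_count == 0:
        return 1                    # (i) no acceptable keyword
    extra = len(words) - keyword_count
    return 2 if extra <= 2 else 3   # (ii) at most two additional words
                                    # (iii) more than two additional words
```

The higher the proportion of utterances falling into category (ii), the better the prospects for word-spotting on that task.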
Ignoring Out-Task Vocabulary. The final approach to be considered in this section deals with the problem by removing out-task speech from the to-be-processed signal. One way to think of this is as a variation of word-spotting; where word-spotting looked for relevant words, this approach employs a 'wild card' to match all irrelevant words. Hence, the approach seeks to 'recognise' all out-task words as a homogeneous mass and exclude them from analysis.
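The 'wild card' idea can be sketched as follows: every recognised token outside the task vocabulary is matched to a single garbage symbol and excluded from analysis. The vocabulary below is invented for illustration.

```python
# Sketch of the 'wild card' approach: out-task words are all 'recognised'
# as one homogeneous garbage token and then dropped before interpretation.

TASK_VOCABULARY = {"balance", "withdraw", "receipt", "five", "pounds"}

def strip_out_task(utterance, garbage="<garbage>"):
    """Map out-task words to a garbage token, then discard them, leaving
    only in-task words for the recogniser to interpret."""
    tagged = [w if w.lower() in TASK_VOCABULARY else garbage
              for w in utterance.split()]
    return [w for w in tagged if w != garbage]

# For "please can I withdraw five pounds thank you", only the in-task
# words survive for analysis.
```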
Handling Out-Task Vocabulary by Human Factors

The problem of out-task vocabulary is likely to be compounded by the increasing range of products and systems which are intended for public use. Machine-based approaches rely on the implicit assumption that either the designers will know all possible synonyms for a particular command, or that people will be consistent in their choice of words. However, these assumptions could be false on a number of counts: people might not be consistent in their choice of words, and people might hold quite different assumptions from those held by designers about how the ASR device works and how they ought to speak to it. One popular human factors response to the challenge of out-task vocabulary attempts to govern users' selection of words and phrases by designing the information presented to users in such a way as to constrain their choice of words. This approach has been referred to as 'convergence' (Leiser, 1989).
Convergence. Convergence has been observed in human-human communication and involves speakers
mimicking the speech mannerisms of their interlocutors (Giles and Powesland, 1975). Thus, speakers might share similar grammatical constructions, similar pitch contours and similar phrases. From Giles and Powesland's (1975) work, 'convergence' would appear to be an unconscious modification of speaker behaviour, in that the speakers do not necessarily report awareness of changing the manner in which they are speaking. A number of studies suggest that users will copy the command words and command structures presented by ASR after a short period of time (Zoltan et al., 1982; Zoltan-Ford, 1984; Zoltan-Ford, 1991; Ringle and Halstead-Nussloch, 1989; Leiser, 1989). For some, these results support the notion of convergence. However, there is an important difference between the convergence observed in human-human communication and speech-based interaction with computers; while convergence appears to be the result of unconscious modification of speech patterns in human-human communication, it is proposed that 'convergence' in speech-based interaction with computers is the result of a deliberate change in speaking. Baber (1993) proposed that, rather than arising from convergence, the interaction between prompt/feedback and vocabulary selection could be attributable to 'uncertainty reduction'. This assumes that speakers actively engage in reducing uncertainty concerning the capabilities of the ASR device and that information from the system is used in this process. In order to converse at an appropriate level, people need to determine what sort of vocabulary and speech style the computer can handle.
Zoltan-Ford (1991) notes that, while speakers model the length and content of their responses on those provided by the computer, this effect was found in both speech and keyboard conditions, and there was little effect of the type of feedback provided (i.e., some users received feedback which used only content words in response to all commands, while other users received feedback only when their speech matched the computer's style of communication). Further, if people are presented with verbose feedback, they tend to use even shorter phrases than if they are presented with succinct feedback (Baber, 1991).

Relevance. One approach to considering speakers' behaviour with ASR is to apply Grice's (1975) cooperative principle to human-machine interactions. The cooperative principle is based on the assumption that speakers have a duty to help their listeners by producing speech which is as easy to process as possible. The
maxims derived from the cooperative principle are as follows:

QUANTITY
Qn1: Make your contribution as informative as required
Qn2: Do not make your contribution more informative than is required

QUALITY
Ql1: Do not say what you believe to be false
Ql2: Do not say that for which you lack adequate evidence

RELATION
R1: Be relevant

MANNER
M1: Be perspicuous
M2: Avoid obscurity of expression
M3: Avoid ambiguity
M4: Be brief (avoid unnecessary prolixity)
M5: Be orderly
A number of studies indicate that a characteristic of human speech to ASR devices is that commands are short, succinct and highly task specific (Baber and Stammers, 1989; Richards and Underwood, 1984; Hauptmann and Rudnicky, 1988). This seems to support, in particular, the maxims concerning quantity and brevity. This suggests that people make assumptions concerning the capabilities of ASR and construct their speech so as not to cause too many difficulties (as indicated by Cookson, 1988). From this discussion, it is proposed that users of ASR are actively seeking the appropriate communication protocol to employ.

Studies
Speech-based interaction with machines will be shaped by a number of factors, which have an impact on the context of the interaction (Baber, 1993). In this paper, the relationship between some of these contextual factors and vocabulary selection will be explored. It was decided that the task needed for this exercise had to be based on public technology (so that people could draw on previous experience of using similar machines), had to be easy to use (so learning would not interfere with performance), and had to be a sensible candidate for the application of speech. We chose the rather unusual idea of a speech-based automatic-teller machine (ATM) as a vehicle for this work. The speech-based ATM also gave some control over the range of vocabulary, so that it would be possible to conduct the studies without artificially constraining people's choice of words. Furthermore, a speech-based ATM would be sufficiently novel to ensure that participants in the study would be unlikely to have encountered a similar type of application.
Participants were told that the technology was being investigated for use by visually-impaired users but could be available for other user groups. Ostensibly, participants were involved in a concept evaluation exercise. The body of a conventional, free standing ATM was used to house a computer and visual display unit. A telephone handset was attached to the front of the machine.
Study One

The first study employs the Wizard of Oz method. While the method has been criticised as providing a different communicative context to real speech-based interaction with machines (Button, 1990; Button and Sharrock, 1995; Fraser and Gilbert, 1991; Wooffitt and MacDermid, 1995), it provides some advantages for this work: (i) it allows a context within which to speak unrestricted vocabulary, so that each command utterance can receive some feedback; (ii) it does not require dedicated ASR, and so does not require users to engage in error-handling. This, of course, presents a wholly unnatural context. Even with a good Wizard, interpretation will be lenient. In this study, it was felt that this was an advantage. Given that people could say anything, two questions arise: (i) do individuals maintain consistency over time? and (ii) do people share common command words? If the answer to either of these questions is yes, system design can be made less difficult by employing individualised phrases (perhaps triggered by the insertion of a card to identify the user) or by employing common words shared by a given user group. The aim of this study was to investigate the choice of words adopted by people given complete freedom to say what they deemed appropriate to perform specific functions. We were also interested in how consistent speakers would be. If speakers were consistent across trials, then one could propose design recommendations based on the assumption of consistency. Finally, given tasks with which people have some familiarity from day-to-day experience, the task context might be sufficient to constrain choice of words.
Method. In this study 6 people performed a series of 5 transactions. Each transaction consisted of 3 task steps. The age of the participants varied from 23 years to 46 years (with a median of 24 years). 2 of the participants had not used an ATM before, while the remaining 4 used an ATM at least twice a week. None
of the participants had prior experience with the use of ASR. An IBM PS/2 personal computer with colour graphics monitor was housed in an ATM casing. The computer was running dBase IV version 2.0. Feedback was displayed to users in response to single keypresses by the experimenter after the user had spoken a command. A microphone was linked from the handset to an audio-cassette recorder to allow transactions to be recorded. Participants were informed that they were assisting in the evaluation of a speech-based ATM. To this end, a number of messages had been prepared to show to the 'users'. For example, the initial screen said 'enter pin...' and was followed by 'please enter request...'. While these prompt screens could be assumed to have some bearing on the manner in which people might speak, it was felt that some prompting and feedback was necessary in order to ensure that people progressed through the tasks. Participants were told that the machine would respond to spoken commands and that they were free to choose commands which they felt were appropriate to perform the tasks required. For this study, it was decided to use only three tasks: (i) to find out how much money was in a bank account; (ii) to take five pounds from the machine; (iii) to obtain proof of the transaction in (ii). The tasks were selected as being the most commonly performed transactions with an ATM (Burford and Baber, 1994). Participants were allowed to work at their own pace. Each participant performed the set of transactions five times. All transactions were audiotaped and transcribed.
Results. The commands used by each speaker in each trial of the 3 tasks are shown in Appendix 1. No statistical analysis was performed on the data and the results will be handled in descriptive terms in the following section. There was a high degree of consistency across speakers in terms of the choice of word to identify the contents of an account, i.e., all speakers used the word "balance" and there were no alternatives used. Furthermore, there seemed to be little effect of trial on length of utterance; it had been assumed that utterance length might reduce across trials. The only speaker to show marked variation across trials was speaker F (see Appendix 1). The level of consistency across trials is interesting; utterance length and content for speakers A, D and E remains constant across trials, while speaker B changes only for trial B4 (adding
"of account" to the command) and speaker C uses one utterance construction for trials C1 and C2 (dropping "hello" on trial 2) and one utterance construction for trials C3, C4 and C5. This might suggest that word-spotting would be appropriate in this instance. However, recall the measure of appropriateness proposed by Brems et al. (1995), that word-spotting would be most suitable when no more than two words were spoken with the command word. This criterion will be referred to as the command length criterion. The proportion of utterances which fit the command length criterion is 0.37. Only speaker A produced speech which fit this criterion on all trials, although speaker B produced such speech on four of the five trials. Of the remaining speakers, the mean length of utterance was 5.85 words. This suggests that the surrounding 'extraneous speech' could introduce problems for word-spotting. Five of the six participants employed the word "please" in their commands. Previous work suggests that this is more likely to occur when people believe that they are talking to a person rather than a computer (Baber and Stammers, 1989). However, four of the six participants stated, during debriefing, that they believed that they were speaking to a machine and only two stated that they believed they were talking to a person (one of these was speaker A who did not use "please" at all). Furthermore, some utterances used expressions such as "Can I..." (C3, C4, C5, D3, E{1-5}) or "Could I..." (D1, D2, D3, D5, F2, F4) or "I'd like to..." (C2, F1), which further indicates that, in many of these transactions, commands are being issued as requests. For commands to take five pounds from an account, the proportion of utterances complying with the command length criterion is only 0.23. Of the utterances, only A{1-5}, B1, B2, B3, B5 and F4 comprise three words or fewer (although the 'amount' part of C4, D4 and E{1-5} follows the rule).
The remaining utterances contain an average of 5.9 words to indicate that a withdrawal was to be made. To make matters worse, the utterances do not contain a common command word. This task was performed in one of two broad forms: by issuing a command or by issuing a request. If we assume that the command word is "withdraw" or "withdrawal", the utterances which can be said to comply with the vocabulary were A2, A3, A4 and A5 and C3 and C4 (and even here, the speakers do not use these words for all transactions). Interestingly, A{1-5} and E{1-5} show commands which resulted in screen [4] ('State amount to withdraw') appearing. While both speakers A and E were presented with the
same message, speaker A responded to the 'withdraw' word (as did C4), while speaker E did not use this command word even when presented with it (the same seems to be true of D4). An alternative form of utterance used to issue a 'withdraw' command was to omit the word 'withdraw', i.e., to use the command implicitly (as used in A1, B2 and B3). For several transactions, the withdrawal of cash was presented as a request, e.g., "can I have..." (B4, C5, E{1-5} and F5), "could I have..." (D3 and D5), "can I take..." (C2 and F{1-3}) or "could I take..." (D1, D2 and D4). Another form of request incorporates the word please, e.g., "five pounds please" (F4) or "five pounds cash please" (B1 and B5). Some transactions involved a command/request without specifying an amount (with users preferring to respond to screen [4]: A{2-5}, C4, D4 and E{1-5}), while the remaining transactions incorporated the amount in the command/request. Finally, some transactions linked the command/request to the performance of the previous transactions, e.g., by using conjunctions to link the command to the previous 'balance' command, such as "and" (B1, B4, C2, C5, D4), "right" (F2) or "OK" (C1, C4). Other transactions combined the 'withdraw' command/request with the 'obtain receipt' request (B4, E{2-5}, F2). Thus, requests B4 and F2 could be viewed as a single transaction (combining 'balance', 'withdrawal' and 'obtain receipt'), while the other transactions separated 'balance' from 'withdraw' or 'withdraw' from 'obtain receipt'. In the obtain proof of transaction task, all utterances contained the command word "receipt". A minority of utterances obeyed the command length criterion (A{1-5}, B3, B5, D5, F1).
Interestingly, not only were some of the utterances directly linked to the 'withdraw' command (see Appendix 1), but also many of the utterances were linked to the 'withdraw' command using a conjunction, such as "and" (B1, B2, B3, B5, D{1-5}, F3, F4, F5) or "OK" (C1), and E1 terminated the 'withdraw' command using "Thank you" as part of the 'receipt' command. Finally, some of the utterances were phrased as requests, e.g., using "Can" (B1, C3, C4, C5, D3, D4, E1) or "Could" (C1, D1, D2). In terminating the transaction, not all transactions were terminated with a command word. When they did verbally terminate transactions, participants tended to either say "Thank you" (A3, B{1-5}, C{1-5}, E4, E5, F2, F5) or nothing (A1, A2, A4, D{1-5}, E1, E5, F1, F3, F4). A5 used the word "End", and E2 responded to the first screen (accidentally displayed by the experimenter) by saying "OK". Furthermore,
several speakers used the words "thank you" or "OK" during the utterances to either mark boundaries between tasks or to accept feedback.
Conclusions. Given the relative simplicity of performing rudimentary functions using an ATM, such as obtaining a balance and withdrawing cash with a receipt, it is surprising to find such variety in the command constructions used. While there was a high level of consistency for choice of word for 'balance' and 'receipt' tasks, there were clear differences in the manner in which people asked for money. These differences were sufficient to suggest that the only consistent feature of the 'withdraw' commands was the use of an 'amount' term. However, it is not advisable to use the 'amount' term to indicate that a withdrawal should be made, as amounts could be entered for other forms of transaction, such as bill payment, cheque deposit and credit transfer. Thus, it is necessary to determine an appropriate command word for withdrawing cash. Even if a suitable command word could be found, the length of utterances in this study makes the use of word-spotting impractical. Assuming that word-spotting will be more reliable in speech with minimal extraneous words, it would appear that only a third of the utterances presented in this study would fit the command length criterion (of a total of 84 separate utterances, 27 obeyed the command length criterion, i.e., 0.32 of all utterances). If word-spotting could be used to remove consistent 'extraneous' speech, such as "and" or "please", this proportion rises to 32/84 = 0.38. An interesting feature of the utterances is the division between commands for a function and requests for a function. This has a bearing on the choice of construction and on the use of 'polite' words, such as "please" and "thank you". It was not clear from this study what factors affect people's attitudes to the machine, and during debriefing participants were not aware of treating the machine in a specific manner.
On the basis of this study it would seem that, given free range in choice of words to perform tasks on an ATM, people exhibit a high degree of variation in both the selection of words and the construction of commands. This is taken to mean that a speech-based ATM (without further measures to constrain users' speech) could be highly problematic. Furthermore, it is proposed that word-spotting alone need not be a suitable answer to the problems of variability in speech. Finally, the lack of consistency tends to suggest that task context alone need not be sufficient to constrain choice of words. This is, at first glance,
surprising in that the tasks were relatively straightforward and one would anticipate that they would be easy to perform. However, one explanation proposed for this effect runs as follows: while people used common words for objects, i.e., 'balance' and 'receipt', they differed in their choice of words for actions, i.e., 'withdraw cash'. It is possible that the use of words to describe things was found easier than the use of words to perform tasks, especially as participants may have been familiar with performing the task manually. This suggests two factors for discussion and further research: the relationship between words to describe objects and words to perform actions (which seems to parallel the work on concrete/abstract depictions in icon research, cf. Rogers, 1989), and the use of words to perform actions per se (which has some analogy with the notion of speech acts, cf. Searle, 1969; Bunt et al., 1978; Waterworth, 1982; Baber, 1993). However, in this paper, the second study focussed on introducing constraints on speakers.
Study Two

Study one showed that, given no constraints, choice of words and construction of commands/requests is highly variable. In this study, three factors were introduced to examine whether vocabulary could be sufficiently constrained to reduce variability. The first factor was the use of an automatic speech recogniser instead of a human 'Wizard'. This ought to reduce the possibility of people constructing commands which were appropriate for a human (although, as mentioned above, participants in study one claimed to believe that they were talking to a machine). The second factor is a set of command words provided to speakers. From study one, it was noted that people were not consistent in their choice of commands when speaking to the ATM. However, it was apparent that certain commands, while not universal, were popular. For the purposes of this study, therefore, these popular command expressions were used as the vocabulary, e.g., to withdraw £10 a person could say 'take out money' or 'withdraw cash' or 'cash', each of these followed by '10', or could simply say '10'. As the recogniser was speaker dependent, it was necessary for each user to 'train' each of the command expressions prior to the main trial. This was intended to give the users an expanded set of command expressions for use in the trial. Thus, rather than allowing carte blanche to speakers, there was a constraint on what could be said. It is
proposed that this approach takes a limitation of the technology and turns it to advantage; while the users are constrained as to their choice of commands, the approach allows the study to determine whether the command expressions remain variable across speakers (within the limits of available words) or whether speakers tend to become more consistent in their choice of commands. If this consistency could be demonstrated, then it is suggested that speech-based ATMs employ a first-phase recognition to handle these command expressions, and a possible second-phase recognition to perform 'keyword-spotting'. It is assumed that users of speech-based ATMs will be sent instruction booklets or leaflets describing how the machines work, and that within these instructions possible commands will be presented. The aim of introducing this variable is to provide users with some choice while limiting the type of speech which can be used, e.g., providing options which do not employ the word "please". The third factor was the type of screen presented to users. It was decided to use screen design as the independent variable in the experimental trials. Three screens were used, which were intended to represent conditions of 'minimal information', 'cashpoint metaphor', and 'speech menu'. The 'Minimal Information' screen, as its name implies, was essentially blank until the person spoke a command. Recognition of a spoken command led to the 'pressing' of a button and the appearance of the appropriate field. If the user then issued a confirmation command, such as {'ok', 'service', 'next' or 'thank you'}, the screen became blank. The 'Cashpoint metaphor' screen was designed to look as similar to a conventional (UK) ATM as possible. Figure 1 illustrates the design of the main screen. The reason for using this screen design was simply to capitalise on the knowledge that users could be assumed to possess on the basis of previous experience.
A further point to note is that, if a speech-based ATM can be shown to function adequately with current display designs, then it would not be necessary to introduce
Figure 1. Main screen for 'cashpoint metaphor' condition. The screen presented eight labelled buttons: Balance, Cash, Mini-statement, Cash with Receipt, Statement, Deposit, Chequebook, Pay Bills.
Figure 2. Display design for 'speech-menu' condition (visible labels included 'Water Board', 'Cash with Receipt' and 'Balance').
new forms of display to support the speech system. When a command was recognised, the main menu was hidden and a relevant field displayed, e.g., in response to a 'balance' command, the main menu was replaced by a field containing the balance of the account. The main menu was recalled by issuing the {'ok', 'service', 'next' or 'thank you'} command. The rationale behind the speech-menu metaphor display (shown in Fig. 2), was that speakers need to know which words could be spoken at any point in the dialogue. The screen shown in Fig. 1 provides this information only for main commands and does not provide prompting for entering numbers or naming payees for bill payments. The screen shown in Fig. 2, on the other hand, provides all relevant information. The sections of the screen are greyed (or blacked in the current version) to indicate when they are not 'legal'. Thus, the speaker will be guided as to which words to use. While a command led to the replacement of the main menu by an appropriate field in 'cashpoint-metaphor' screen, for the speech-menu screen the main menu remained on the screen and fields were displayed below this box.
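The greying-out of 'illegal' screen regions amounts to mapping each dialogue state onto the vocabulary that is currently active. A minimal sketch follows; the state names and word lists are invented for illustration and are not taken from the study's software.

```python
# Sketch of the 'speech menu' idea: each dialogue state exposes only a
# subset of the vocabulary, mirroring the greyed-out screen regions.

ACTIVE_VOCABULARY = {  # hypothetical state -> legal words mapping
    "main_menu": {"balance", "cash", "statement", "deposit", "pay bills"},
    "amount_entry": {"5", "10", "20", "50"},
    "payee_entry": {"water board", "gas", "electricity", "phone"},
}

def is_legal(state, command):
    """A spoken command is accepted only if it is active (not greyed out)
    in the current dialogue state."""
    return command.lower() in ACTIVE_VOCABULARY.get(state, set())

# In the main menu, 'Balance' is legal but an amount such as '10' is not;
# the reverse holds once the dialogue moves to amount entry.
```

Restricting the active vocabulary in this way both guides the speaker (via the display) and reduces the set of templates the recogniser must discriminate at any point.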
Method. 24 people participated in the study. All participants were either students or staff at the University of Birmingham. 17 of the participants were male and 7 were female. The age of participants ranged from 22 to 48, although the median was 24. Software, programmed using the object-oriented language HyperCard 2.2, was run on an Apple Macintosh Power PC. The HyperCard program basically paired on-screen buttons with various functions; usually the functions involved displaying a field of text, e.g., "Balance: £52.75". Speech recognition was performed using Articulate Systems Inc.'s 'Voice Navigator'. This is a speaker dependent speech recognition device which can be interfaced to the Macintosh. It is particularly useful in developing prototypes and demonstrator systems in that it has an object-oriented scripting language which allows the programmer to pair on-screen button presses with spoken commands. Thus, the speech commands were used to 'press' the buttons on the screen. The buttons were then rendered transparent in order to reduce
screen clutter and so that the functioning of the recogniser was not apparent to the user. Like many commercially available devices, the 'Voice Navigator' allows a recognition threshold to be set, i.e., in order to be recognised, utterances have to match stored templates with a confidence at or above the threshold. Setting the threshold too high can increase the number of false rejects, while setting it too low can increase the number of substitutions. For the purposes of this study, the threshold was set at 70%. Furthermore, following enrolment of the 'Voice Navigator', a practice session was conducted; any word failing to score 90% in this session was re-enrolled. These measures were deemed successful, as the overall error rate (due to both false rejections and substitutions) was 9.69%. Given that speech-based public technology will be used by untrained, inexperienced users, it is felt that a recognition accuracy of 90% is a realistic reflection of such conditions. The first part of the study involved participants 'training' the recogniser to handle their speech. This also allowed for training of users in legal commands. While a speaker-dependent device will obviously not be used for the 'real' system, it was felt to be an adequate means of controlling the demonstrator. 'Training' the 48-word vocabulary (i.e., 4 words for each of the 8 commands, 4 payees, 8 amounts and 4 words to terminate the transaction) took around 2 minutes. 'Voice Navigator' requires three samples of each word (in the mode used in this study). If the three samples are sufficiently similar, a template is created and the word is trained. If there is any discrepancy in the samples, the person needs to repeat the training. For the majority of words, training succeeded within two repetitions (although this interacted with speaker, so that two participants took considerably longer, with up to five repetitions).
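The accept/reject behaviour of such a confidence threshold can be sketched as follows. This is a minimal, hypothetical illustration of template matching against a threshold, not the Voice Navigator's actual implementation; the word names and scores are invented.

```python
def recognise(scores, threshold=0.70):
    """Return the best-matching template word, or None (a rejection).

    scores: hypothetical mapping of template word -> match confidence (0-1).
    Setting the threshold too high yields more false rejects; setting it
    too low yields more substitutions (confident matches to the wrong word).
    """
    word, confidence = max(scores.items(), key=lambda kv: kv[1])
    return word if confidence >= threshold else None

# A clear match is accepted; a weak one is rejected.
print(recognise({"balance": 0.85, "cash": 0.40}))  # balance
print(recognise({"balance": 0.55, "cash": 0.40}))  # None
```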
There were some words which caused problems for most participants, e.g., 'cash with receipt'. Once the words were trained, participants were given a practice session. This involved speaking the trained words to the recogniser in 'test' mode. Any words which were not recognised in this mode were retrained. After the training session, participants were given a set of eight tasks to perform:

1. Log on to the system, i.e., swipe card and say 'hello'.
2. Check the amount of money in a bank account.
3. Obtain a listing of recent transactions on the display.
4. Take £10 out of the machine and obtain proof of the transaction.
5. Pay a £30 cheque into the bank.
6. Pay an electricity bill of £20.
7. Ask for a chequebook/statement to be sent to you.
8. Terminate the session.
Table 1. Proportions of responses for task two.

Command                                        Screen (i)   Screen (ii)   Screen (iii)
Balance                                           0.89         0.62          0.73
Check balance                                     0.13         0.16          0.19
I'd like to check the balance in my account       0            0.08          0
Can I check balance?                              0            0.08          0
As far as possible, the tasks were described using words other than those incorporated in the command expressions. Participants were assigned to one of the three screen conditions using a random function in the software which took the users to one of the three screens from the 'hello' prompt. This meant that neither participant nor experimenter knew in advance which screen would appear.

Results. The choice of words used to perform each of
the eight tasks is considered in relation to the screen used. In order to normalise the data, proportions were used; the tables show the proportion of commands issued using specific command expressions. Due to the sample size and the skew of the data, non-parametric statistical analysis was chosen to examine the data. To this end, a Kruskal-Wallis one-way analysis of variance (ANOVA) was used. Task 1 required people to swipe their card and say the word 'hello' to the recogniser. All participants used the word 'hello' to perform this task and there was no effect of screen. Task 2 required people to check the amount of money in a bank account. From Table 1, it is clear that people use the word 'balance' to perform this transaction. Taking all of the commands issued for this task, the mean command string was 1.3 words long. However, even when people use more than one word in their command, the word 'balance' still features in the command string. The Kruskal-Wallis ANOVA revealed significant variance in these data (H = 6.1, p < 0.05). From the data, one can see that screens (i) and (iii) tended to have the least variation in vocabulary, with screen (ii) showing higher variance. The groups using screens (i) and (iii) used two distinct command expressions, while those using screen (ii) used four command expressions. Thus, one can assume that the variance was attributable to screen (ii). Task 3 required people to obtain a listing of recent transactions on the display. From Table 2, it is clear that the most popular choice of word in this condition was 'ministatement'. If commands featuring
Table 2. Proportions of responses for task three.

Command                          Screen (i)   Screen (ii)   Screen (iii)
Ministatement                       0.72         0.69          0.67
Recent transactions                 0.3          0.15          0.22
Can I have a ministatement          0            0.15          0
I would like a ministatement        0            0.08          0
Look at recent transactions         0            0             0.11

Table 3. Proportions of responses used for task four.

Command                          Screen (i)   Screen (ii)   Screen (iii)
Cash with receipt                   0.07         0.03          0.11
Withdraw                            0            0             0.11
Withdraw 10 pounds                  0            0             0.06
Cash                                0.31         0.22          0.11
Withdraw cash                       0.08         0.08          0.06
Take out money                      0            0             0.06
10 pounds cash                      0            0             0.06
Receipt                             0.04         0.03          0
10 pounds                           0.27         0.08          0.17
10                                  0.23         0.36          0.28
Take out cash                       0            0.03          0
I'd like to have 10 pounds          0            0.03          0
Can I have 10 pounds cash           0            0.03          0
Money                               0            0.03          0
Give me some money                  0            0.05          0
I would like 10 pounds              0            0.03          0

Table 4. Proportions of responses used for task five.

Command                          Screen (i)   Screen (ii)   Screen (iii)
Deposit                             0.58         0.44          0.57
30                                  0.37         0.39          0.43
30 pounds                           0.05         0.05          0
Pay cheque in                       0            0.05          0
Right, can I deposit 30 pounds      0            0.05          0
ministatement, irrespective of other words, are considered, this accounts for 0.77 of the transactions. The alternative command was 'recent transactions', which accounts for the remaining transactions. The results of the Kruskal-Wallis ANOVA revealed no significant variance across screens. However, screen (i) would appear to have the least variation in choice of command expressions, with two expressions, followed by screen (iii), with three expressions, and then screen (ii), with four expressions. While screen (i) and screen (ii) may appear quite different, there is sufficient similarity in variance between (i) and (iii) and between (ii) and (iii) to make the results non-significant. In task 4, the participant was required to take £10 out of the machine and obtain proof of the transaction. The Kruskal-Wallis ANOVA revealed a highly significant variance in the data (H = 246.97, p < 0.001). From Table 3, it would appear that screens (ii) and (iii) have more variation than screen (i). Furthermore, screen (i) used only six command expressions, compared with
Table 5. Proportions of responses used for task six.

Command                                    Screen (i)   Screen (ii)   Screen (iii)
Pay bills                                     0.24         0.17          0.3
Bill                                          0.05         0             0
Pay                                           0            0             0.05
MEB                                           0.29         0.21          0.45
20                                            0.33         0.21          0.2
20 pounds                                     0.05         0.13          0
Can I pay my electricity bill?                0            0.04          0
I would like to pay my electricity bill       0            0.17          0
nine for screen (iii) and twelve for screen (ii). Thus, the variance seems to be highest for screen (ii). Furthermore, there is not a single command word which proves popular for this task. In task 5, participants were required to pay a £30 cheque into the bank. Table 4 shows that the word 'deposit' was used in all conditions. While the Kruskal-Wallis ANOVA showed no significant variance, screen (iii) seems to be slightly more consistent than screen (i), and both are more consistent than screen (ii). The number of command expressions used suggests that screen (i) and screen (iii) were similar (with three and two expressions respectively), while screen (ii) had five expressions. Further, while around half of the commands for each screen used 'deposit', a high proportion of transactions employed an amount term in place of 'deposit'. In a working application, this would require the machine to 'know' that a deposit was being made, i.e., for the task of inserting the deposit envelope to constitute a dialogue act (see Note 3). This introduces questions of how speech could be interleaved with other activities in human-computer interaction. Task 6 required people to pay a bill of £20 to the local electricity supplier (Midlands Electricity Board, MEB). Table 5 shows that, in this task, 'pay bills' would seem to be the most popular command word, but that users were also likely to indicate the payee (MEB) as
Table 6. Proportions of responses used for task seven.

Command              Screen (i)   Screen (ii)   Screen (iii)
Order chequebook        0.5          0.63          0.36
Chequebook              0.5          0.37          0.64
Order statement         0.3          0.37          0.14
Statement               0.7          0.63          0.86

Table 7. Proportions of responses used for task eight.

Command       Screen (i)   Screen (ii)   Screen (iii)
End              0.44         0.25          0.22
Finish           0.11         0.38          0.22
Thank you        0.22         0.38          0
Bye              0.22         0             0
Quit             0            0             0.44
an initial task step. The Kruskal-Wallis ANOVA shows highly significant variance (H = 47.67, p < 0.001), and it is apparent that screen (iii) yielded the greatest consistency. The number of expressions used across screens appears to be quite similar; screen (i) had five, screen (ii) six and screen (iii) four. However, inspection of the highest proportion of commands used for each screen suggests a slightly different pattern of results, e.g., consideration of the 'legal' commands for this task shows the following: screen (i) "Pay Bills" (0.24), "MEB" (0.21), "20" (0.33); screen (ii) "Pay Bills" (0.17), "MEB" (0.21), "20" (0.21); screen (iii) "Pay Bills" (0.3), "MEB" (0.45), "20" (0.2). This suggests that the variation could be traced to the use of 'illegal' command expressions on screens (i) and (ii). In task 7, participants had to request a chequebook or a statement. From the data in Table 6, it is apparent that participants were divided between two command expressions. Interestingly, there was a tendency to use 'order chequebook' and 'statement' when using screens (i) and (ii), but to use 'chequebook' and 'statement' when using screen (iii). Further, while the variance across the screens is statistically significant (H = 7.59, p < 0.05), the screens all used the same four commands. When one considers that the commands are synonymous, it is safe to conclude that there is no practical difference among screens for this command. Task 8 required participants to issue a command to terminate the session. Table 7 shows that most participants used 'legal' words, i.e., 'finish', 'end', 'quit', and this was most apparent in screens (ii) and (iii). However, there was no significant variance between screens (H = 0.53).
All screens show that 'End' was used, i.e., 0.44 for screen (i), 0.25 for screen (ii) and 0.22 for screen (iii). However, screens (i) and (ii) have a high proportion of the use of 'thank you' (0.22 and 0.38), which did not terminate the transaction, but returned the user to the initial state (i.e., the machine was waiting for the next command). Screen (iii) does not show this expression, and only screen (iii) shows 'quit' (0.44).
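The Kruskal-Wallis statistics reported throughout these analyses can be reproduced with a short routine. The following is an illustrative pure-Python sketch; the example data are invented and are not the study's raw observations.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of samples (one per screen condition).

    Pools all observations, ranks them (average ranks for ties), and
    compares mean ranks across groups; a large H indicates that at least
    one group's distribution differs from the others.
    """
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n_total = len(pooled)
    # Assign ranks, averaging over runs of tied values.
    ranks = [0.0] * n_total
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2.0  # average of ranks i+1 .. j
        i = j
    rank_sums = [0.0] * len(groups)
    for (value, g), r in zip(pooled, ranks):
        rank_sums[g] += r
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        rs * rs / len(sample) for rs, sample in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    # Correction for ties.
    ties, i = 0, 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        ties += (j - i) ** 3 - (j - i)
        i = j
    return h / (1 - ties / float(n_total ** 3 - n_total)) if ties else h

# Invented observations for three screen conditions.
print(round(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 2))  # 7.2
```

The resulting H is referred to a chi-square distribution with k - 1 degrees of freedom (k groups), which is where reported p-values such as p < 0.05 come from.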
Table 8. Proportion of commands comprising 3 or fewer words.

Screen   Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Mean
(i)       1.0      1.0      0.75     0.93     1.0      1.0      1.0      1.0     0.96
(ii)      1.0      0.78     0.84     0.77     0.88     0.72     1.0      1.0     0.87
(iii)     1.0      0.92     0.89     0.79     1.0      1.0      1.0      1.0     0.95
In order to consider the degree of consistency with which the tasks were performed using the different screens, the command length criterion was applied to the data. A high proportion of the commands 'fit' the criterion (see Table 8). This is especially true for tasks 1, 7, and 8. Comparing the screens, it appears that screen (ii) produced commands which least closely followed the 'rule', with screens (i) and (iii) having more such commands. If one compares these proportions with those obtained from study one (using the proportions for tasks 2 and 4 gives a mean of 0.3 for study one and 0.87 for study two), one can conclude that people employed a more consistent vocabulary when performing these tasks in study two. The effect could be due either to the forced vocabulary or to the fact that people worked within the constraints of the speech recogniser. If the effect were simply due to the fact that participants had been forced to use a specific, restricted command set, one would not expect to see words or expressions being used which lay outside this set. However, it is clear from the preceding tables that this is not the case: around a quarter of commands (0.24) were invented by participants. Thus, the majority of the commands were selected from the set provided, and it would appear that provision of the command set played a significant role in determining participants' choice of words but was not the sole influence. The occurrence of the 'invented' commands despite the provision of the vocabulary is interesting in that it suggests that people either forget the vocabulary or become 'sidetracked' during the interaction.
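The command length criterion applied in Table 8, i.e., the proportion of commands containing three or fewer words, is straightforward to compute. The sketch below uses invented command strings rather than the study data.

```python
def proportion_within_length(commands, max_words=3):
    """Proportion of command strings containing max_words or fewer words."""
    return sum(1 for c in commands if len(c.split()) <= max_words) / len(commands)

# Invented examples in the style of the task-2 commands;
# two of the three fit the three-word criterion.
commands = ["balance", "check balance", "can I check balance"]
print(proportion_within_length(commands))
```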
Conclusions. From these data, one can distinguish two types of command sets: those that showed significant variance across screens and those that did not. The command sets which showed no variance were task 3 (ministatement), task 5 (deposit), task 7 (order chequebook/statement) and task 8 (end transaction). One characteristic of tasks 3, 5 and 8 was that, for the participants in this study, these were novel functions which had not been used on a conventional ATM. Thus, one might anticipate that the lack of variance across screens was related to the novelty of the task. On the other hand, the more familiar task 7 was performed using only two command expressions. Indeed, the command structures adopted in tasks 3 and 5 also appear quite similar, i.e., there is not a great deal of variation in the choice of commands to perform these tasks. This suggests that some command sets are influenced more by the nature of the task than by the design of the screen. The remaining three command sets did show significant variance, i.e., task 2 (balance), task 4 (withdraw money) and task 6 (pay bills). For the 'balance' command set, screens (i) and (iii) appear to have less variation than screen (ii); the same holds for the 'withdraw money' command set. The fact that balance enquiries and cash withdrawals are the most common forms of ATM transaction (Burford and Baber, 1994) raises an interesting question for these data. While there is significant variance across screens, there are also relatively more command structures for these tasks than for the other tasks in the study. This suggests that screen design only partly explains the result.
One possible explanation is that the words chosen reflect the goal level of the transaction, i.e., if the user's goal is to check the balance, then they will be likely to use the word 'balance' in their command, whereas if their goal is to obtain £10, they will be more likely to use an amount term in their transaction than to use 'withdraw cash'. In the latter instance, a command such as 'withdraw cash' signifies the intent to act rather than the goal, i.e., what the user sees as the outcome of the task. For the 'pay bills' command set, screen (iii) appears to have less variation than screens (i) and (ii). For task 6, screen (iii) presented the options to speakers, which appears to have had the effect of minimising the command set. This suggests that, for task 6 at least, the command set was influenced more by the design of the screen than by the nature of the task. It is most interesting to note the relatively poor performance of screen (ii), which was designed to be
similar to a conventional (UK) ATM. A possible explanation of this effect could be based on the notion of 'transfer of training'. If people are used to the physical interactions of conventional ATM transactions, it might be difficult to verbalise these actions into commands; this difficulty might be sufficient to make performance easier when there is no prompting, as in screen (i). The issue of transfer and verbalisation of actions lies outside the scope of this paper but forms the basis of ongoing work in our research group. The final study in this paper was directed at the issue of auditory feedback.
Study Three

The final study focussed on the use of auditory feedback as a means of constraining choice of words. In this study, participants used the 'speech-menu' display, which was screen (iii) in study two. Auditory feedback was provided by simply replaying speech messages when a button had been activated; in effect this meant that participants had mixed-modality feedback (auditory plus visual). The speech messages were speech-coded recordings of a male speaker, compressed by a ratio of 3:1 and stored in HyperCard. When a button was pressed on the screen, the speech messages were played. In all cases, the speech messages contained the same words as the screen buttons. An additional prompt was designed so that, in response to "ok" or "thank you", the user received the following message: "Which service do you require?"
Method. 10 participants (age range 23-34) performed the same enrolment procedure, and performed the 8 tasks from study two. In addition to choice of words, participants' comments were elicited. Results. In this condition there was a very high level of agreement in terms of choice of words; participants tended to use the legal commands. For task one, all participants used the word 'hello'. For task two, 1 participant used 'check balance' and the other 9 used 'balance'. For task three, participants used the 'recent transactions' command. For task four, participants used 'cash' followed by amount. For task five, participants used 'deposit' followed by amount. For task six, participants used 'pay bills' followed by payee followed by amount. For task seven, participants used the 'chequebook' or 'statement' commands. For task 8, 4 participants used 'end' and 4 used 'finish'. All commands obeyed the command length criterion. An interesting point to note concerns the proportion of ASR device
Table 9. Proportion of errors made by users and ASR with visual-only and combined visual and auditory feedback, using the speech-menu screen.

Feedback condition      User errors   ASR errors
Visual feedback only       0.13          0.14
Combined feedback          0.016         0.12
and user errors observed between study two and study three, as shown in Table 9. The results in Table 9 suggest that the performance of the ASR device was relatively unaffected by the mixed feedback condition, but that user errors were dramatically reduced. When user errors occurred in the mixed feedback condition, they were not attributable to the use of out-task vocabulary but to inappropriate responses to prompts, as illustrated by the following examples:

Example 1:
Machine: Please speak the amount
User: Cash

Example 2:
Machine: Put cheque in slot
User: OK

In Example 1, the speaker uses a command word when an amount is required. In Example 2, the speaker used "OK" to signify accord with an instruction (the consequence of this 'error' was that the deposit function was superseded by the 'Which service do you require?' prompt; the user said 'deposit' and proceeded with the transaction). All participants believed that auditory feedback would significantly enhance usability. For example, one participant noted that auditory feedback made talking to a machine feel more 'natural'. The choice of words in the auditory feedback condition tended to mirror those of the visual feedback condition, with one important exception: there were fewer 'extra' commands in the auditory condition and people tended to employ the popular words.

Conclusions. The use of mixed auditory and visual feedback improved user performance, such that out-task vocabulary was eliminated, but had no impact on recognition accuracy. An interesting feature of the results is that users did not necessarily mimic either the visual display or the auditory feedback in their choice of words, e.g., to obtain a listing of recent transactions, participants used 'recent transactions' rather than the 'ministatement' command on the speech-menu, and to obtain money, participants used 'cash' rather than 'money' on the speech-menu. This adds weight to the assertion that people do not simply copy feedback presented to them but construct commands which appear appropriate to satisfy their goals and the task demands.
Discussion

Auditory feedback appears to be an essential component of a speech-based ATM. Indeed, participants felt that the lack of auditory feedback made the task more difficult. The proportion of commands which followed the command length criterion was around three times higher for command sets in study two than for commands used in study one. While the use of a restricted vocabulary could be seen to play a role in this effect, it is proposed that the results were not solely due to this. The fact that people were using a speech recogniser was felt to have a bearing on how they spoke (which, perhaps, calls into question the validity of the initial 'Wizard of Oz' study). The effects of screen design seemed to have a bearing on command sets following the command length criterion, with screen (ii) producing the lowest proportion of speech which complied with the rule. However, screen layout alone was not sufficient to explain the choice of words. It would seem that choice of words for performing even quite simple tasks with speech recognition is influenced by a number of factors. The provided vocabulary plays a role but does not completely determine the choice of words. This is rather worrying in that it implies that people will attempt to use 'out-task' words, or will use legal words inappropriately. The use of a speech recognition device leads to a reduction in the average length of utterances, i.e., people tend to produce shorter expressions. There seems to be a relationship between the type of task and the variation found in choice of words, with some tasks producing little significant difference across screens and others producing significant levels of variation. The studies indicate that, for the most part, people can restrict their choice of words when performing tasks with speech recognition, that the conventional ATM design (screen (ii)) produced inferior performance to the other screens, and that mixed-modality feedback produced the best performance.
For the purposes of machine design, this suggests that it would not be appropriate to 'bolt' speech recognition onto existing public technology, but that new designs will be required.
The main conclusion from this paper is that the choice of words for performance of speech-based transactions with machines is influenced by a set of factors. These factors include the vocabulary set provided, the nature of the speech recogniser, the design of the feedback, the nature of the tasks being performed and the relationship between user goals and task demands. Some of these factors can be successfully handled by system design. However, 'out-task' words will probably be used despite the best efforts of the designers. This means that it is essential to consider how such words can be gracefully handled. Word-spotting could possibly cope with around 80% to 90% of the transactions in the study, but this still leaves a minority of transactions for which there does not appear to be a simple solution.
Appendix 1: Commands Used in Wizard of Oz Study

NOTE: in the following tables, '--' signifies no response and '(.)' signifies a response embedded in the preceding utterance.
Task 1. Check contents of account (each speaker's responses for Trials 1-5, in order).

A: Balance / Balance / Balance / Balance / Balance
B: Balance, please / Balance, please / Balance, please / Balance of account, please / Balance, please
C: Hello, I'd like to know my balance, please / I'd like to see the balance of my account, please / Hello, can I have the balance of my account, please / Hello, can I have the balance of my account, please / Hello, can I have the balance of my account, please
D: Could I have the balance, please / Could I have the balance, please / Could I have the balance, please / Can I have the balance, please / Could I check the balance, please
E: Can I have a balance, please / Can I have a balance, please / Can I have a balance, please / Can I have a balance, please / Can I have a balance, please
F: I'd like a balance, please / Could I have a balance, please / A balance, please / Could I have my balance, please / The balance
Task 2. Take five pounds from the machine (each speaker's responses for Trials 1-5, in order).

A: Five pounds / Withdraw cash [4] five pounds / Cash withdrawal [4] five pounds / Withdraw cash [4] five pounds / Withdrawal [4] five pounds
B: Five pounds cash, please / And five pounds cash / Five pounds, cash / And can I have five pounds cash with a receipt / Five pounds cash, please
C: OK, could I withdraw five pounds / And can I take take out five pounds / I'd like to withdraw five pounds, please / OK, can I withdraw some money, please [4] five pounds, please / And can I have five pounds
D: Could I take out five pounds, please / Could I take out five pounds, please / Could I have five pounds, please / And could I take out cash, please [4] five pounds, please / Could I have five pounds, please
E: Can I have cash, please [4] five pounds / Can I have cash please with a receipt [4] five pounds / Can I have cash please with a receipt [4] five pounds / Can I have cash please with a receipt [4] five pounds / Can I have cash please with a receipt [4] five pounds
F: Can I take five pounds out please / Right, can I take five pounds out of that please with a receipt / Can I take five pounds please / Five pounds please / Can I have five pounds please
Task 3. Obtain proof of transaction (speakers' responses, in order).

A: Receipt / Receipt / Receipt
B: And can I have a receipt with that / And a receipt also please / And a receipt / Receipt / Receipt
C: OK, could I have a receipt with that too please / With a receipt please / Can I have a receipt for that please / Can I have a receipt as well / Can I have a receipt too
D: And could I have a receipt please / And could I have a receipt please / And can I have a receipt with that please / And can I have a receipt / And a receipt
E: Thank you, can I have a receipt / (.) / (.) / (.)
F: A receipt / (.) / (.) / And a receipt as well / And a receipt / (.) / And a receipt as well / And a receipt as well
Task 4. Terminating the transaction (speakers' responses, in order).

A: -- / -- / Thank you / --
B: Thank you / Thank you / Thank you / Thank you / Thank you
C: Thank you / Thank you / Thank you / Thank you / OK, thank you
D: . / . / .
E: -- / Thank you [1] OK / Thank you / Thank you / --
F: -- / Thank you / -- / -- / Thank you
Notes

1. Kamm (1994) notes that given the prompt "Will you accept the charges (for a collect call)?" only 51% of users responded with the desired word ('yes'), but when the prompt was rephrased as "Say Yes if you will accept the call; otherwise say No", 81% of users said the desired word. While the increase is impressive, suggesting desired words before each question could introduce numerous problems.
2. Dillon et al. (1993) used an initial 'Wizard of Oz' study to define appropriate vocabulary sets. A similar approach is used in study one of this paper.
3. In observations conducted in retail banks as part of another project, it was noted that the majority of over-the-counter transactions involve little verbal communication, with the action of passing payment books, etc., to the cashier constituting the transaction 'request'.
References

Baber, C. (1991). Speech Technology in Control Room Systems: A Human Factors Perspective. Chichester: Ellis Horwood.
Baber, C. (1993). Developing interactive speech technology. In C. Baber and J.M. Noyes (Eds.), Interactive Speech Technology. London: Taylor and Francis, pp. 1-18.
Baber, C. and Stammers, R.B. (1989). Is it natural to talk to computers: An experiment using the Wizard of Oz technique. In E.D. Megaw (Ed.), Contemporary Ergonomics. London: Taylor and Francis, pp. 234-239.
Baber, C., Stammers, R.B., and Usher, D.M. (1990). Error correction requirements in ASR. In E.J. Lovesey (Ed.), Contemporary Ergonomics 1990. London: Taylor and Francis, pp. 454-459.
Brems, D.J., Rabin, M.D., and Waggett, J.L. (1995). Using natural language conventions in the user interface design of automatic speech recognition systems. Human Factors, 37(2):265-282.
Bunt, H.C., Leopold, F.F., Muller, H.F., and van Katwijk, A.F.V. (1978). In search of pragmatic principles in man-machine dialogues. IPO Annual Progress Report 13. Eindhoven: Instituut voor Perceptie Onderzoek, pp. 94-98.
Burford, B.C. and Baber, C. (1994). A user-centred evaluation of a simulated adaptive autoteller. In S.A. Robertson (Ed.), Contemporary Ergonomics 1994. London: Taylor and Francis, pp. 46-51.
Button, G. (1990). Going up a blind alley: Conflating conversation analysis and computational modelling. In P. Luff, N. Gilbert, and D. Frohlich (Eds.), Computers and Conversation. London: Academic Press, pp. 67-90.
Button, G. and Sharrock, W. (1995). On simulacrums of conversation: Toward a clarification of the relevance of conversation analysis for human-computer interaction. In P.J. Thomas (Ed.), The Social and Interactional Dimensions of Human-Computer Interaction. Cambridge: Cambridge University Press, pp. 107-125.
Factors Affecting Users' Choice
Cookson, S. (1988). Final evaluation of VODIS: Voice operated database inquiry system. Proceedings of Speech '88, 7th FASE Symposium. Edinburgh: Institute of Acoustics, pp. 1311-1320.
Dillon, T.W., Norcio, A.F., and DeHaemer, M.J. (1993). Spoken language interaction: Effects of vocabulary size and experience on user efficiency and acceptability. In G. Salvendy and M.J. Smith (Eds.), Human-Computer Interaction: Software and Hardware Interfaces. Amsterdam: Elsevier, pp. 140-145.
Fraser, N.M. and Gilbert, G.N. (1991). Simulating speech systems. Computer Speech and Language, 5:81-99.
Giles, H. and Powesland, P.F. (1975). Speech Styles and Social Evaluation. London: Academic Press.
Grice, H.P. (1975). Logic and conversation. In P. Cole and J.L. Morgan (Eds.), Syntax and Semantics III: Speech Acts. New York: Academic Press, pp. 41-58.
Hauptmann, A.G. and Rudnicky, A.I. (1988). Talking to computers: An empirical investigation. International Journal of Man-Machine Studies, 28:583-604.
Hollingum, J. and Cassford, G. (1988). Speech Technology at Work. Berlin: Springer-Verlag.
Kamm, C. (1994). User interfaces for voice applications. In D.B. Roe and J.G. Wilpon (Eds.), Voice Communication between Humans and Machines. Washington, DC: National Academy Press, pp. 422-442.
Leiser, R.G. (1989). Exploiting convergence to improve natural language understanding. Interacting with Computers, 1:284-298.
Noyes, J.M., Baber, C., and Frankish, C.R. (1992). Industrial applications of ASR. Journal of the American Voice I/O Society, 12:51-68.
Peckham, J. (1989). VODIS: A voice operated database enquiry system. In J. Peckham (Ed.), Recent Developments and Applications of Natural Language Processing. London: Kogan Page, pp. 117-128.
Richards, M.A. and Underwood, K.M. (1984). Talking to machines: How are people naturally inclined to speak? In E.D. Megaw (Ed.), Contemporary Ergonomics 1984. London: Taylor and Francis, pp. 63-67.
Ringle, M.D. and Halstead-Nussloch, R. (1989). Shaping user input: A strategy for natural language dialogue design. Interacting with Computers, 1:227-244.
Rogers, Y. (1989). Evaluating the meaningfulness of icon sets to represent command operations. In M.D. Harrison and A.F. Monk (Eds.), People and Computers: Designing for Usability. Cambridge: Cambridge University Press, pp. 586-603.
Searle, J.R. (1969). Speech Acts. Cambridge: Cambridge University Press.
Sperber, D. and Wilson, D. (1985). Relevance: Communication and Cognition. Oxford: Basil Blackwell.
Waterworth, J.A. (1982). Man-machine speech dialogue acts. Applied Ergonomics, 13:203-207.
Wooffitt, R. and MacDermid, C. (1995). Wizards and social control. In P.J. Thomas (Ed.), The Social and Interactional Dimensions of Human-Computer Interaction. Cambridge: Cambridge University Press, pp. 126-141.
Zoltan, E., Weeks, G.D., and Ford, W.R. (1982). Natural-language communication with computers: A comparison of voice and keyboard inputs. In G. Johannsen and J.E. Rijnsdorp (Eds.), Analysis, Design and Evaluation of Man-Machine Systems. Oxford: Pergamon, pp. 255-260.
Zoltan-Ford, E. (1984). Reducing the variability in natural language interactions with computers. Proceedings of the 28th Annual Meeting of the Human Factors Society. Santa Monica, CA: Human Factors Society, pp. 768-772.
Zoltan-Ford, E. (1991). How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34:527-547.