Algorithmica (2001) 30: 503–526 DOI: 10.1007/s00453-001-0026-6
An Approximate Determinization Algorithm for Weighted Finite-State Automata¹

A. L. Buchsbaum,² R. Giancarlo,³ and J. R. Westbrook²

Abstract. Nondeterministic weighted finite-state automata are a key abstraction in automatic speech recognition systems. The efficiency of automatic speech recognition depends directly on the sizes of these automata and the degree of nondeterminism present, so recent research has studied ways to determinize and minimize them, using analogues of classical automata determinization and minimization. Although, as we describe here, determinization can in the worst case cause poly-exponential blowup in the number of states of a weighted finite-state automaton, in practice it is remarkably successful. In extensive experiments in automatic speech recognition systems, deterministic weighted finite-state automata tend to be smaller than the corresponding nondeterministic inputs. Our observations show that these size reductions depend critically on the interplay between weights and topology in nondeterministic weighted finite-state automata. We exploit these observations to design a new approximate determinization algorithm, which produces a deterministic weighted finite-state automaton that preserves the strings of a weighted language but not necessarily their weights. We apply our algorithm to two different types of weighted finite-state automata that occur in automatic speech recognition systems and in each case provide extensive experimental results showing that, compared with current techniques, we achieve significant size reductions without affecting performance. In particular, for a standard test bed, we can reduce automatic speech recognition memory requirements by 25–35% with negligible effects on recognition time and accuracy.

Key Words. Algorithms, Determinization, Weighted automata.
¹ This paper is abstracted by “Shrinking Language Models by Robust Approximation” in Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP-98), Seattle, WA, vol. 2, pp. 685–688, 1998. The second author's work was supported by AT&T Labs.
² AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA. {alb,jeffw}@research.att.com.
³ Dipartimento di Matematica ed Applicazioni, Università di Palermo, Via Archirafi 34, 90123 Palermo, Italy. [email protected].

Received March 31, 1998; revised January 29, 1999. Communicated by G. F. Italiano. Online publication April 30, 2001.

1. Introduction. Automatic speech recognition (ASR) is typically implemented as a cascade of transformations, depicted in Figure 1. Input speech is digitized by various forms of signal processing, resulting in a sequence of frames: each frame is a feature vector of real numbers representing some window (e.g., 10 milliseconds) of the acoustic waveform. A set of acoustic models is applied to transform the speech frames into a lattice of phones. Each phone is a basic unit of sound, e.g., the /d/ in “dog” or the /b/ in “bog.” Informally, a lattice is a nondeterministic finite-state automaton, with probabilities augmenting the labels on the transitions. For example, a phone lattice, which has transitions labeled by phones, captures the fact that more than one sequence of phones can correspond to the observed speech.
Fig. 1. Block diagram of a speech recognizer. Input speech is digitized into a sequence of feature vectors. An acoustic-phonetic recognizer transforms the feature vectors into phones. A word recognition module transforms the phones into words, with the help of a lexicon. Finally, a grammar is applied to pick the most likely sequence of words.
A lexicon or dictionary, which models how various words in the language are pronounced, is then used to transform the phone lattice into a word lattice. Finally, a language model or grammar, which typically encodes the probability that one word follows a small group of preceding words, is applied to transform the word lattice to reflect those sentences most likely to have been uttered.

An important goal of ASR model engineering is to produce small models that guarantee fast and accurate ASR. In practice, smaller models have usually produced less accurate recognition. There has been recent progress, however, on automatic methods for reducing the sizes of language models while preserving overall recognition performance (accuracy and speed). Lacouture and De Mori [12] and Kenny et al. [11] discuss the size reduction of lexical trees in order to speed ASR systems, but they deal only with unweighted trees. Pereira et al. [16], [17] take an approach that models human language using weighted finite-state automata (WFAs). Informally, a WFA accepts a weighted language, i.e., a set of strings, each of which has a weight assigned [9]. An advantage of this approach is that classical techniques for automata determinization and minimization can be extended to produce WFAs that are smaller yet formally equivalent to their inputs; i.e., each input string is assigned the same weight in both the input and the reduced models. Mohri [15] develops these ideas and reports significant size reductions from their application to ASR word lattices.

Given that determinization of classical NFAs can increase the state space exponentially, Mohri's results are surprising and raise intriguing questions. Why does determinization yield smaller WFAs in ASR, and can understanding this phenomenon lead to algorithms that produce even smaller WFAs that still perform well in ASR systems? In a companion paper [6], we provide the first rigorous asymptotic analysis of the potential state expansion caused by Mohri's approach. We [6] also give additional analyses and observations about the behavior of Mohri's algorithm. Our results show that expansion of WFAs caused by determinization is governed by both the transition weights
and the topology of the graph underlying a given WFA. In this paper we exploit these observations to extend Mohri's approach by relaxing the requirement that the output model be formally equivalent to the input. We introduce a technique that we call approximate determinization, which produces a deterministic WFA that accepts precisely the same strings in a given language but with possibly different weights.

By allowing weights to change arbitrarily, we could, for example, reduce the problem to that of determinizing classical nondeterministic finite-state automata (NFAs) (by setting all weights to be zero). Since weights are crucial to the ASR application, however, such arbitrary changes typically yield poor recognition performance. Our algorithm therefore accepts an approximation parameter, which controls the amount by which weights may change. Applied to WFAs in ASR, the parameter controls the tradeoff between output size and recognition performance. In comparison, previous papers [11], [12] ignore the weights on the WFAs, develop various techniques to reduce the corresponding NFAs, and then re-introduce weights via ad hoc heuristics. Our technique is completely general and is applicable to any ASR model.

Any WFA determinization algorithm must incur exponential size expansion in the worst case, since there are NFAs for which all deterministic equivalents must be exponentially larger. Given this and that we are concerned primarily with the use of WFAs in ASR, we studied the performance of our algorithm experimentally. We applied our algorithm to two different types of ASR lattices. In the first experiment we applied our algorithm to ASR language models and installed the resulting models in an ASR system, which we then applied to process a standard test bed of speech inputs. We achieved 25–35% size reduction compared with Mohri's exact algorithm, yielding similar reductions in ASR memory requirements without affecting ASR time or accuracy. These results were invariant under different parameters used to tune ASR search algorithms. In the second experiment we measured the size reduction of word lattices output by a recognizer applied to a second standard test bed of speech inputs. We found that our algorithm reduced WFA sizes by up to 45% compared with Mohri's exact algorithm, without affecting which sequences of words remained the “best”—most highly weighted—in the WFAs, a measure we motivate in Section 5.

We begin in Section 2 by defining WFAs, describing Mohri's [15] WFA determinization algorithm, and reviewing our observations [6] on the importance of weights to WFA determinization. Section 3 then describes our new approximate determinization algorithm. Sections 4 and 5 report experimental results. We first measure language-model size reduction and the ensuing effects on ASR time, dynamic memory consumption, and word accuracy. We then consider size reduction and preservation of best strings in word lattices. We conclude in Section 6. While the technical discussion is self-contained, some knowledge of ASR is helpful to motivate consideration of the problem of determinizing WFAs. We therefore provide a suitable tutorial in an appendix. We also provide information on how to obtain software and data to run ASR experiments.
2. Weighted Finite Automata and Determinization. A WFA is a tuple A = (Q, q̄, Σ, δ, Qf) such that Q is the set of states, q̄ ∈ Q is the initial state, Σ is the set of labels, δ ⊆ Q × Σ × (ℝ⁺ ∪ {0, ∞}) × Q is the set of transitions, and Qf ⊆ Q is the set of final states. A is also called a lattice in the ASR literature. The topology of A is the
projection π_{Q×Σ×Q}(δ): i.e., the transitions of A without respect to the weights. Denote by #A = |Q| the number of states of A, and let |δ| be the number of transitions of A. We define the size of A, also denoted |A|, to be |Q| + |δ|. A deterministic, or sequential, WFA has at most one transition t = (q1, σ, ν, q2) for any pair (q1, σ); a nondeterministic WFA can have multiple transitions on a pair (q1, σ), differing in target state q2.

Let t = (t1, . . . , tℓ) be some transition sequence, such that ti = (qi−1, σi, νi, qi); t induces string w = σ1 · · · σℓ. We also denote by q0 →ʷ qℓ that there is a transition sequence from q0 to qℓ that induces w; if the context is unambiguous, we drop the w and write q0 → qℓ. String w is accepted by t if q0 = q̄ and qℓ ∈ Qf; w is accepted by A if some t accepts w. Let c(ti) = νi be the weight or cost of ti. Then the weight or cost of t is c(t) = ∑_{i=1}^{ℓ} c(ti). Let T(w) be the set of all transition sequences that accept string w. Then the weight or cost of w is c(w) = min{c(t) : t ∈ T(w)}. The language of A is the set of weighted strings accepted by A: L(A) = {(w, c(w)) : w is accepted by A}.

Not all nondeterministic WFAs admit deterministic equivalents, a fact we discuss in Section 2.2. A determinization algorithm, 𝒜, takes a nondeterministic WFA, A, as input and produces a deterministic WFA, A′, such that L(A′) = L(A), if such an A′ exists; if A does not admit a deterministic equivalent, then 𝒜's behavior on A is undefined. Let π∗(L) = {w : ∃c(w), (w, c(w)) ∈ L}, i.e., the set of strings contained in the weighted language L. We define an approximate determinization algorithm, 𝒜̃, as one that takes a nondeterministic WFA, A, as input and, if A admits a deterministic equivalent, produces a deterministic WFA, A′, such that π∗(L(A′)) = π∗(L(A)); if A does not admit a deterministic equivalent, then 𝒜̃'s behavior on A is undefined.

A motivating example from ASR is the use of WFAs to model human language. A tri-gram grammar, for example, is a stochastic model, over an alphabet of words from a natural language, that assigns a probability Pr(x3 | x1, x2) to each triple (x1, x2, x3) of words: Pr(x3 | x1, x2) is the probability that x3 follows x1 x2 in a sentence. Such probabilities can be derived from analyzing a corpus of sentences from a particular ASR task [10]. A WFA A can realize a tri-gram grammar as follows. There is a state for each pair (x1, x2) of words. For each such state and each word x3, there is a transition ((x1, x2), x3, −log Pr(x3 | x1, x2), (x2, x3)): i.e., a transition labeled x3 from state (x1, x2) to state (x2, x3), the cost of which is the negative log of the corresponding tri-gram probability. Let # be a sentinel value, denoting a “null” word. Then (#, #) is the start state, transitions out of which are weighted by the probabilities that the corresponding words occur first in a sentence. Each state except (#, #) is a final state. A transition sequence, t, in A therefore induces a string (sentence) X of words, and the cost of t is the negative log of the joint probability that all the tri-grams induced by X occur. This probability is used to approximate the probability that X itself occurs as a sentence. This type of WFA is also called a word lattice. Similarly, the output of an ASR system itself can be given as a word lattice, paths through which induce sentences, the probabilities of which are the probabilities that the corresponding sentences were uttered by the speaker. Appendix A discusses the application of such WFAs in a real ASR system and describes why reducing their size and determinizing them is important.
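To make the definitions above concrete, the following Python sketch (our illustration, not code from the paper) represents a WFA over the min-sum semiring and computes c(w) by dynamic programming over prefixes of w. The transitions are those of the WFA of Figure 2(a), with weights read off the worked example in Section 2.1; we assume these are all of its transitions and that q3 is its only final state.

```python
import math
from collections import defaultdict

class WFA:
    """A WFA (Q, q-bar, Sigma, delta, Q_f) over the min-sum semiring."""

    def __init__(self, start, finals, transitions):
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(list)            # (state, label) -> [(weight, target)]
        for q1, sigma, nu, q2 in transitions:    # delta, a subset of Q x Sigma x R+ x Q
            self.arcs[(q1, sigma)].append((nu, q2))

    def cost(self, w):
        """c(w): minimum weight of an accepting transition sequence, or inf."""
        dist = {self.start: 0.0}                 # cheapest cost to each state so far
        for sigma in w:
            nxt = {}
            for q, d in dist.items():
                for nu, q2 in self.arcs[(q, sigma)]:
                    nxt[q2] = min(nxt.get(q2, math.inf), d + nu)
            dist = nxt
        return min((d for q, d in dist.items() if q in self.finals), default=math.inf)

A = WFA("q0", {"q3"},
        [("q0", "a", 3, "q1"), ("q0", "a", 1, "q2"),
         ("q0", "b", 1, "q1"), ("q0", "b", 4, "q2"),
         ("q1", "b", 3, "q3"), ("q2", "b", 1, "q3")])
print(A.cost("ab"), A.cost("bb"))                # 2.0 4.0
```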
We next explain Mohri's [15] algorithm for determinizing WFAs.

2.1. Determinization of Weighted Finite Automata. Mohri's [15] automatic method for determinizing and reducing the size of a WFA is based on classical automata theory. First the input WFA, A, which is normally nondeterministic, is replaced by an equivalent deterministic WFA, A′. From A′, an equivalent deterministic WFA, A″, of minimum size is computed. When applied to WFAs from ASR, most of the size reduction comes during the process of determinization. We first describe Mohri's algorithm for WFA determinization, which we call D. We then present observations about its behavior that motivate our approximate determinization algorithm, D̃, which we describe in Section 3.

Since D is a determinization algorithm, it takes a WFA, A, and produces a deterministic WFA, A′, such that L(A′) = L(A) (if such an A′ exists). This implies that, for any string w accepted by A, the weight of the unique path accepting w in A′ equals the weight of the minimum weight path accepting w in A. D therefore needs to maintain information about path lengths while building A′. We illustrate D with an example and then describe it in general.

Consider the WFA, A, in Figure 2(a). Mohri generalizes the classical power-set construction to determinize A into A′ as follows. From the initial state q0, we can reach states q1 and q2 using the input symbol a. Analogously to the determinization of finite-state automata, we establish a new state {q1, q2} in A′, reachable from q0 with input symbol a. The transitions to q1 and q2, however, have different weights in A. Mohri solves this problem by choosing the smaller weight to be the weight of the transition to {q1, q2} and recording the difference between the two weights in the new state. In the example, the weight of the q0 → q1 transition is 3, and that of the q0 → q2 transition is 1. Therefore, the new transition to {q1, q2} gets weight 1, and the difference of 2 = 3 − 1 is assigned as a remainder to component q1. For completeness, a remainder of 0 = 1 − 1 is assigned to q2. The new state is thus encoded as {(q1, 2), (q2, 0)} in A′. Similarly, from state q0 in A, we can reach states q1 and q2 via symbol b. Again the minimum weight among these transitions is 1, so we assign this weight to the new arc and encode the remainder weights (0 and 3, respectively) in the new state {(q1, 0), (q2, 3)} in A′.
Fig. 2. (a) A nondeterministic WFA, A. Arcs labeled σ/w correspond to transitions labeled σ with weight w. (b) A′: the result of applying D to the WFA of (a). This is derived from Figures 11 and 12 of Mohri [15].
In general, the deterministic WFA, A′, has states of the form q̂ = {(qi1, ri1), . . . , (qin, rin)}. We refer to {qi1, . . . , qin} as the state tuple of q̂ and {ri1, . . . , rin} as the remainder tuple of q̂. Each such q̂ is interpreted as follows. Consider any string w ∈ Σ* such that there is a (single) path inducing w from the start state, q0, to q̂. As in classical automata determinization, there is at least one path inducing w from q0 to each qij in the nondeterministic input, A. Let cj be the weight of the minimum weight path inducing w from q0 to qij in A. Let c be the weight of the path from q0 to q̂ in A′. The remainders are constructed so that rij = cj − c. In this way, all necessary path length information is encoded into the transition weights and remainders in A′.

Returning to the example, consider state {(q1, 2), (q2, 0)} in A′ and the input symbol b. In A, we can reach state q3 from both q1 and q2. Recalling the above discussion of remainders, we consider the sum of the weight of the transition in A (3 for the q1 → q3 transition and 1 for the q2 → q3 transition) plus the remainder associated with the original source state encoded in state {(q1, 2), (q2, 0)} in A′. That is, we consider the sums 3 + 2 = 5 and 1 + 0 = 1. We take the minimum among those values, i.e., 1, as the weight of the transition from {(q1, 2), (q2, 0)} to {(q3, r)} (r to be determined) in A′. Since there is only one destination state (q3) in the original machine, the remainder r is 0, so we encode the new destination state as {(q3, 0)}. Similarly, we construct an arc with weight 3 on symbol b from {(q1, 0), (q2, 3)} to {(q3, 0)}. (3 + 0 = 3, 1 + 3 = 4, and we take the minimum, which is 3.) The end result is shown in Figure 2(b).

Generalizing to an arbitrary WFA A = (Q, q̄, Σ, δ, Qf), D builds an equivalent but deterministic A′ as follows. The start state of A′ is {(q̄, 0)}, which forms an initial set P. While P ≠ ∅, we remove any state q = {(q1, r1), . . . , (qn, rn)} ∈ 2^(Q×(ℝ⁺∪{0,∞})) from P. The remainders encode path length information, as described above. For each σ ∈ Σ, let {q′1, . . . , q′m} be the set of states reachable by σ-transitions out of all the qi. For each transition on σ from a qi to q′j in A, we want to consider the weight of that transition plus the remainder ri. Among all such σ-transitions entering q′j, we want the minimum such sum. For 1 ≤ j ≤ m, therefore, let ρj = min{ri + ν : (qi, σ, ν, q′j) ∈ δ, 1 ≤ i ≤ n} be the minimum of the weights of σ-transitions into q′j from the qi plus the respective ri. Let ρ = min{ρj : 1 ≤ j ≤ m}; ρ is thus the weight of the new transition from q to q′ = {(q′1, s1), . . . , (q′m, sm)} in A′, where sj = ρj − ρ, 1 ≤ j ≤ m, are the new remainders. If q′ is new, then it is added to P. Transition (q, σ, ρ, q′) is added to A′. This is the only σ-transition out of state q, so A′ is deterministic. Note that a given state tuple may occur multiple times, with different remainder tuples.

Let T_A(w) be the set of transition sequences in A that accept a string w ∈ Σ*; let t_A′(w) be the (one) transition sequence in A′ that accepts the same string. Mohri [15] shows that c(t_A′(w)) = min{c(t) : t ∈ T_A(w)}, and thus L(A′) = L(A). Moreover, let T_A(w, q) be the set of transition sequences in A from state q̄ to state q that induce string w. Again, let t_A′(w) be the (one) transition sequence in A′ that induces the same string; t_A′(w) ends at some state {(q1, r1), . . . , (qn, rn)} in A′ such that some qi = q. Mohri [15] shows that c(t_A′(w)) + ri = min{c(t) : t ∈ T_A(w, q)}. Thus each remainder ri encodes the difference between the weight of the shortest path to some state that induces w in A and the weight of the path inducing w in A′, as described above. At least one remainder in each state is thus zero.
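The construction just described translates directly into code. The following sketch is ours; it uses the same WFA encoding as the sketch in Section 2 and omits Mohri's termination analysis, so on a WFA without the twins property of Section 2.2 the loop below may never halt.

```python
import math
from collections import defaultdict, deque

def determinize(start, finals, transitions):
    """Mohri's algorithm D, transcribed from the description above.
    States of A' are frozensets of (state, remainder) pairs."""
    arcs, labels = defaultdict(list), set()
    for q1, sigma, nu, q2 in transitions:
        arcs[(q1, sigma)].append((nu, q2))
        labels.add(sigma)

    q0 = frozenset({(start, 0.0)})               # start state {(q-bar, 0)}
    P, seen = deque([q0]), {q0}
    new_finals, new_transitions = set(), []
    while P:
        q = P.popleft()
        if any(qi in finals for qi, _ in q):
            new_finals.add(q)
        for sigma in labels:
            rho_j = {}                           # target q'_j -> min{r_i + nu}
            for qi, ri in q:
                for nu, qj in arcs[(qi, sigma)]:
                    rho_j[qj] = min(rho_j.get(qj, math.inf), ri + nu)
            if not rho_j:
                continue
            rho = min(rho_j.values())            # weight of the new sigma-transition
            q_next = frozenset((qj, rj - rho) for qj, rj in rho_j.items())
            new_transitions.append((q, sigma, rho, q_next))
            if q_next not in seen:               # new state tuple or new remainders
                seen.add(q_next)
                P.append(q_next)
    return q0, new_finals, new_transitions
```

Applied to the WFA of Figure 2(a), this reproduces the construction traced in the example above: the states {(q0, 0)}, {(q1, 2), (q2, 0)}, {(q1, 0), (q2, 3)}, and {(q3, 0)}.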
Fig. 3. A nondeterministic WFA, A. Arcs labeled σ/w correspond to transitions labeled σ with weight w. A accepts the language L_A = {(abⁿc, 0), (abⁿd, n) : n ≥ 0}.
2.2. Termination and Expansion. Unfortunately, not all WFAs admit deterministic equivalents, and, furthermore, D halts on only a proper subset of those that do. Analogously, Choffrut [7], [8] and Weber and Klemm [21] show that not all string-to-string transducers are determinizable. (Berstel [2, Section IV.6] recasts the results of Choffrut [7] in English.)

We demonstrate a WFA, A, shown in Figure 3, that admits no deterministic equivalent. A accepts the language L_A = {(abⁿc, 0), (abⁿd, n) : n ≥ 0}. Assume there exists a deterministic WFA, A′, accepting L_A. By arguments analogous to those in the proof of the pumping lemma for classical regular languages, A′ must contain a cycle at some state that generates bᵏ for some k > 0. Some state on the cycle must have an outgoing transition on c, and some state on the cycle must have an outgoing transition on d. (They may be the same state.) If no such cycle exists, then A′ cannot accept all the strings contained in L_A. Let the weight assigned to the bᵏ generated by the cycle be w. If w > 0, then A′ accepts a pair (abʲc, w′) for some j > 0 and w′ > 0, which does not occur in L_A. If w = 0, however, then A′ accepts a pair (abʲd, w′) for some j ≠ w′, which also does not occur in L_A. So no such A′ can in fact exist; i.e., A admits no deterministic equivalent.

Now observe the behavior of D when given A in Figure 3 as input. Leaving the details to the reader, D produces an infinite sequence of states of the form {(1, 0), (2, i)}: for i ≥ 1, there is a 0-weighted transition on b from {(1, 0), (2, i)} to {(1, 0), (2, i + 1)}. Therefore, D does not halt. The problem is that A has two cycles that are labeled with the same string but have different costs. Mohri [15] formalizes this observation on cycles and provides conditions under which D halts. These conditions tend to encompass the cases presented in ASR.

Fix two states q and q′ and a string v ∈ Σ*. Let c(q, v, q′) be the minimum of c(t), taken over all transition sequences t from q to q′ inducing v. We refer to c(q, v, q′) as the optimal cost of inducing v from q to q′.

DEFINITION 2.1. Two states, q and q′, of a WFA G are twins if ∀u, v ∈ Σ* such that q̄ →ᵘ q, q̄ →ᵘ q′, q →ᵛ q, and q′ →ᵛ q′, the following holds: c(q, v, q) = c(q′, v, q′). G has the twins property if all pairs q, q′ ∈ Q are twins.
That is, if states q and q′ are reachable from q̄ by a common string, then q and q′ are twins only if any string that induces a cycle at each induces cycles of equal optimal cost. Note that two states having no cycle on a common string are twins. A WFA is trim if every state appears in an accepting path for some string and no transition is weighted ∞. A WFA is unambiguous if there is exactly one accepting path for each accepted string. Mohri [15, Theorems 11 and 12] proves that any WFA G is determinizable if it has the twins property; furthermore, if G is trim and unambiguous, the twins property becomes a necessary and sufficient condition. These conditions tend to encompass the cases presented in ASR, and D will terminate on any such WFA. The twins property for WFAs is analogous to that defined by Choffrut [7], [8] and (in different terms) by Weber and Klemm [21] to identify necessary and sufficient conditions for a string-to-string transducer to admit a sequential transducer realizing the same rational transduction. We separately [6] give the first polynomial time algorithm for testing the twins property for trim and unambiguous WFAs.

We also [6] give the first nontrivial bounds for the potential size expansion incurred by D when it halts. Let D(G) be the result of applying D to G, assuming D halts on G. Let Ḡ be the classical, unweighted NFA corresponding to G. #D(G) must be at least the number of states obtained when Ḡ is determinized. Because of weights, though, state tuples can occur multiple times, with different remainder tuples, and so the classical 2ⁿ determinization upper bound does not apply. We show that #D(G) < 2^{n(2 log n + n log|Σ| + 1)}, for n = #G. Furthermore, if the transition weights are rationals, let C be the maximum such weight, and define ρ = log₂ C to be the number of bits needed to encode each weight. In this case, #D(G) < 2^{n(2 log n + 1 + min{n log|Σ|, ρ})}. Finally, if the graph underlying G is acyclic, then #D(G) < 2^{n log|Σ|}, independent of any assumptions on weights.

When the graph underlying G is cyclic, it is an open problem to establish how tight the above bounds on #D(G) are. We offer the following intuition about the role of cycles in establishing those bounds. Notice that #D(G) is upper bounded by the number of ways to label subsets of Q with values from the set R of remainders generated by the algorithm. As shown by the WFA, A, in Figure 3, R can be infinite. When A satisfies the twins property, however, |R| is bounded by a poly-exponential function of n [6]. The proof is based on the following observations: (a) a remainder value generated by D is the difference between the costs of two paths labeled with the same string; and (b) two cycles labeled with the same string do not affect the value of a remainder because, by the twins property, they have the same cost, and therefore their contribution to the difference in cost of two paths containing them is canceled. Therefore, the twins property guarantees that cycles labeled with the same string are well behaved.

When the graph underlying G is acyclic, the bound on #D(G) is tight. Indeed, we give a matching constructive lower bound [6]. Salomaa and Yu [20] have shown that there exist acyclic, n-state NFAs such that the corresponding deterministic automata require Ω(2^{n log|Σ|/(log|Σ|+1)}) states. (They prove a matching upper bound as well.) Their result also provides a lower bound for #D(G) in the acyclic case.
Given the potential state-space expansion that D can incur, we separately [6] explore why D causes WFAs that occur in ASR to shrink rather than expand. We derived a class of acyclic WFAs whose expansion, when determinized, depends solely on the choice of weights on their transitions. In particular, with uniform weights (equivalent to a classical, unweighted NFA), no expansion occurs, but with appropriate weights, exponential
expansion occurs. Thus, transition weights and topology (the effect of cycles described above) together govern the expansion possible by WFA determinization.
3. Approximate Determinization of Weighted Finite Automata. Given the importance of weights in WFA determinization, we studied the effects of D on a few word lattices generated by the AT&T speech recognizer with a grammar for the Air Travel Information System (ATIS) task, a standard 5000-word vocabulary ARPA test bed [4], in which sentences are restricted to the domain of travel agent interaction: e.g., “Show me flights to Boston,” and, “What is the fare for a first class ticket?” (We give a full definition of the ATIS domain in Section 4.) We found that when state tuples occurred multiple times with different remainder tuples, the remainders tended to be “clustered” in the following sense. Most corresponding remainders were equal; only one or two differed between the tuples, and the differences were small relative to the magnitudes of the remainders.

Consider an input WFA A and two states in the corresponding deterministic WFA, A′, whose remainders are so clustered. Unifying the states would effectively perturb the remainders by small constant factors. If the constants are orders of magnitude smaller than the remainder values themselves, we would not expect this to have an appreciable effect on the overall path weights. In particular, if we consider the language accepted by A to be ordered by weight, then we would expect that order to be mostly preserved by such small perturbations.

This suggests the following approximate determinization algorithm, D̃, which generalizes Mohri's algorithm. In addition to a WFA A, D̃ takes an approximation parameter, ε, as input. When constructing a state q′ = {(q1, r1), . . . , (qℓ, rℓ)} in the determinized WFA A′, if there exists some previously constructed state q″ = {(q1, r′1), . . . , (qℓ, r′ℓ)} such that, for 1 ≤ i ≤ ℓ, |ri − r′i| ≤ ε · min{ri, r′i}, we use q″ in place of q′. We call this relative remainder clustering; the effect is to unify states whose corresponding remainders are relatively close, as described above. A′ is still deterministic and accepts exactly the same strings as A but may assign different weights to those strings.

Relative remainder clustering ensures that as remainder values get smaller, the allowable differences between corresponding remainders also decrease. This exploits the intuition that states with small remainders are more likely than those with large remainders to be on the final shortest path for a given string. By allowing a larger degree of unification among states with larger remainders, we do not expect to affect the shortest paths through the WFA.

Figure 4 gives pseudocode for D̃. The while loop (steps 5–27) removes states from P and, for each such q ∈ P in turn, constructs all the outgoing transitions via the for loop. In the process, new states may be constructed and added to P. Step 9 determines which states in A are reachable via symbol σ from the states encoded in q. The subsequent steps, devised by Mohri [15], determine the appropriate remainders corresponding to each state and form the set T1 of all states so far constructed for A′ whose state tuples are identical to that of the candidate state q′. Steps 17–25 apply our approximation. Step 17 forms the set T2 of all states in T1 whose remainders are within the error tolerance ε of the corresponding remainders in q′. If there is no such state, then q′ is genuinely new, and, in steps 19–21, it is added to P, marked
Fig. 4. Approximate determinization algorithm.
final if appropriate, and an appropriate σ-transition from q to q′ is constructed. If, on the other hand, there exists some state q″ ∈ T2, the σ-transition from q is constructed to q″ instead of q′ (steps 23 and 24).
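In code, relative remainder clustering is a small modification of the exact construction sketched in Section 2.1: before registering a candidate state q′ as new, look for a previously constructed state with the same state tuple whose remainders all pass the ε test. A sketch, with names ours:

```python
def cluster_match(q_new, constructed, eps):
    """Return a previously constructed state that q_new clusters with, or None.

    States are frozensets of (state, remainder) pairs, as in determinize();
    `constructed` maps each state tuple (a frozenset of states) to the list of
    determinized states built so far with that state tuple (the set T1)."""
    rem_new = dict(q_new)
    for q_old in constructed.get(frozenset(rem_new), []):        # T1
        rem_old = dict(q_old)
        if all(abs(rem_new[s] - rem_old[s]) <= eps * min(rem_new[s], rem_old[s])
               for s in rem_new):                                # the T2 test
            return q_old
    return None
```

Inside determinize, one would call cluster_match(q_next, constructed, eps) before treating q_next as new and, on a match, redirect the σ-transition to the returned state, mirroring steps 17–25 of Figure 4. With eps = 0 the test demands equal remainders, and D̃ reduces to D.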
In the next two sections we assess the quality of D̃ experimentally. We first show how it affects recognition performance when applied to language models. We then consider how it preserves the ordering of the best strings in word lattices.

4. Assessment of Recognition Performance. ASR systems are typically evaluated by measuring word accuracy for a domain, which is established by fixing the following variables (see [18] for a thorough discussion of each):

Technology: isolated word, connected word, continuous speech, read speech, fluent speech, etc.
Task: digit strings, airline travel, business newswire, etc.
Task syntax: none, fixed string length, finite-state grammar, etc.
Mode: speaker independent or dependent.
Vocabulary: number and type of words (digits, airline words, unrestricted, etc.).

Two standard ARPA-sponsored test beds are the ATIS (Air Travel Information System) and NAB (North American Business) domains, defined in Table 1.
Table 1. Definition of ATIS and NAB ASR domains.

Domain | Technology        | Task                   | Task syntax          | Mode                | Vocabulary
ATIS   | Continuous speech | Airline travel         | Finite-state grammar | Speaker independent | 5,000 Airline words
NAB    | Read speech       | Business newswire text | Finite-state grammar | Speaker independent | 60,000 Business words
Once a domain is defined, a representative corpus of text is assembled. ATIS has a 20,000-sentence corpus; each sentence is 10 words long on average. Examples are “Show me a flight to Boston” and “I'd like a first class ticket.” NAB has a 250-million-word corpus of longer sentences, such as might be found in The Wall Street Journal, e.g., “AT&T's directors today announced their intention to repurchase up to $4 billion of AT&T common stock and also split the stock 3-for-2 pending the closure of their planned merger with Tele-Communications, Inc.” The corpus, along with a development set of input speech files, is distributed to evaluation participants. The development set might be further divided into training and test sets; the former is meant to be used for system development, and the latter for system testing. Each speech file in the development set contains digitized speech from a known (to the participant) utterance.

Each participant then develops an ASR system for the domain. The corpus is used to develop language models, and the speech files are used to develop acoustic models and evaluate the recognizer during development. Final evaluation is performed by distributing an evaluation set of speech files, each of which is input to the recognizer. The evaluation set might also be divided into training and test sets, to provide additional training material to the participant. In any case, the original transcriptions of the utterances in the test set remain unknown to the participant. The test set is run through the final ASR system, and the output text is evaluated by an independent committee, which knows the utterance represented by each input file.

Scoring is done with the word accuracy measure. Let X be one of the test utterances, let |X| be the number of words in X, let X̃ be the corresponding recognizer output, and let d(X, X̃) be the Levenshtein distance [14] between X and X̃. The word accuracy of X̃ with respect to X is 1 − d(X, X̃)/|X|. (Note that negative word accuracy is possible if sufficient spurious insertions are made.) Word accuracy is averaged over the evaluation-test set to calculate a final score. For example, state-of-the-art, near-real-time ASR typically achieves around 70% word accuracy for the NAB domain.
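Word accuracy is thus one minus the word-level edit distance normalized by the reference length; a short sketch of the scoring computation (our illustration):

```python
def levenshtein(x, y):
    """Word-level Levenshtein distance between token sequences x and y."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        curr = [i]
        for j, yj in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (xi != yj)))    # substitution
        prev = curr
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """1 - d(X, X~)/|X|; can go negative with enough spurious insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    return 1.0 - levenshtein(ref, hyp) / len(ref)

print(word_accuracy("show me a flight to boston",
                    "show me the flights to boston"))     # 1 - 2/6 = 0.666...
```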
The acoustic and language models developed usually remain proprietary. Participants publish the mechanics behind their system, as well as any novel ideas used in building the ASR system and models themselves. Reproducibility of results is thus limited to trying the same algorithmic, engineering, and model-building ideas with one's own models for the same domain, or else assessing how well they extend to other domains. For ideas to be extensible, therefore, they must at least apply generally to any model for a given domain. Determinization (exact or approximate) applies generally to any weighted finite-state model.

We would like to assess the effect of approximate determinization on ASR performance. If A is a nondeterministic WFA and A′ is its deterministic equivalent (L(A′) = L(A)), an ASR system will yield the same results whether A or A′ is used as a language model, because exact determinization (ε = 0) is language preserving. (As discussed in Appendix A, though, the system will run faster using A′.) As we increase the approximation parameter, we expect that word accuracy will degrade somewhat, since some path length information in the model will be lost. On the other hand, we expect the resulting extra model compression to have favorable effects on the computational performance (time and space) of the recognizer.

4.1. Word Accuracy. We measured the effects of approximate determinization on word accuracy by running a fixed set of speech files from the NAB domain through the AT&T speech recognizer [19], using two collections of language models. Each collection was obtained by running a base nondeterministic language model through D̃ with various settings of ε. By recording the resulting word accuracy using each model in a collection, we can observe the effect of varying ε on word accuracy.

We call the two base NAB language models the small model, which was used for near-real-time performance, and the large model, which was used for off-line experiments with the goal of higher accuracy. “Small” and “large” refer to model size; the base small model was 14 MB, and the base large model was 87 MB. The models were extended lexicons, which are described in Appendix A. We ran each model through D̃ with the following settings of ε: 0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, and 20. We ran 300 utterances using these models in the AT&T speech recognizer and determined the word accuracy of the result with respect to both the correct answer and that given by ASR using the corresponding exact model (generated by setting ε to 0, which makes D̃ equivalent to D). For each model, we averaged the results over the 300 utterances. The 300 utterances comprised a complete NAB development-test set, described in Appendix A.3.

Figures 5 and 6 plot the results for the two different NAB models. The middle curves show the relative sizes of each approximate model compared with the corresponding exact model. The bottom curves show that recognition word accuracy did not degrade appreciably. The top curves show that the output of ASR with each approximated model was almost the same as the output with the corresponding exact model: we did not just substitute one mistake for another. The results suggest that while approximate determinization has a considerable effect on model size (up to 25% compression with respect to the exact small model and 35% compression with respect to the exact large model), the effects on word accuracy are negligible. In particular, word accuracy with respect to the correct answer was degraded by less than one percentage point even with an approximation parameter of 20.

4.2. Word Accuracy Versus Computational Resources. ASR systems employ heuristic search methods such as the Viterbi search algorithm, described in Appendix A, which uses a parameterized threshold to reduce the search space. Computational demands, therefore, are expressed as a tradeoff between accuracy and resources. By varying the threshold—the beam width in the Viterbi algorithm—the user can opt for better word accuracy or speedier computation.

To measure the effect of approximate determinization on ASR computational performance, therefore, we adapted the experimental setup described in Section 4.1 to use 13
Fig. 5. Recognition performance for the small NAB model. The middle curve (circles) shows the ratio of approximated model size to minimum exact model size. The top curve (triangles) shows the word accuracy using the approximated models with respect to the results of ASR using the exact model. The bottom curve (squares) shows the word accuracy using the approximated models with respect to the correct answer.
Fig. 6. Recognition performance for the large NAB model. The middle curve (circles) shows the ratio of approximated model size to minimum exact model size. The top curve (triangles) shows the word accuracy using the approximated models with respect to the results of ASR using the exact model. The bottom curve (squares) shows the word accuracy using the approximated models with respect to the correct answer.
Fig. 7. Accuracy versus time for the large NAB model. Results from the exact model are indicated with circles, through which a baseline reference is fitted. Results from approximated models are indicated with crosses. A point (x, y) indicates that by spending x seconds per utterance (averaged over the 300 utterances), we can achieve a word accuracy of y.
different beam widths in the Viterbi algorithm for each run with a given language model. The beam widths ranged from 50% to 150% of the default value used in the AT&T speech recognizer. Because approximation parameters greater than 10 had little observable effect in the previous experiment, we used only the following approximation parameters in the new experiment: 0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2.5, 5, 7.5, and 10. Furthermore, because the previous experiment showed no characteristic difference between the small and large NAB models vis-à-vis the effects of approximate determinization, we used only the large model in the new experiment.

Each run was therefore defined by an approximated model and a beam width. For each run, we averaged not only the word accuracy for the 300 utterances but also the time and dynamic memory—peak memory consumed by the recognizer as it was running—used by the system. We measured both peak virtual and in-core memory. Figure 7 plots accuracy as a function of time. Figures 8 and 9 plot accuracy as a function of dynamic memory; the former measures maximum virtual memory, and the latter measures maximum in-core memory.

In each of the three figures, the results using the exact model are indicated with circles—one for each beam width—with a reference curve fitted through the data points. Each result using an approximated model appears as a cross, one for each combination of approximation parameter and beam width. A mark at point (x, y) in the plots indicates that with x resources, we achieved a word accuracy of y. Thus, crosses above or to the left of the baseline curve represent performance improvements (same accuracy in less time or with less memory) over the exact model. Figures 10 and 11 similarly plot accuracy as a function of the product of time and memory, a measure of raw resources consumed.

The results suggest that approximate determinization does not affect the tradeoff between accuracy and time but does provide a dramatic improvement in dynamic memory
Fig. 8. Accuracy versus maximum virtual memory for the large NAB model. Results from the exact model are indicated with circles, through which a baseline reference is fitted. Results from approximated models are indicated with crosses. A point (x, y) indicates that with x MB of virtual memory, we can achieve a word accuracy of y.
Fig. 9. Accuracy versus maximum in-core memory for the large NAB model. Results from the exact model are indicated with circles, through which a baseline reference is fitted. Results from approximated models are indicated with crosses. A point (x, y) indicates that with x MB of in-core memory, we can achieve a word accuracy of y.
Fig. 10. Accuracy versus time-memory product for the large NAB model (using virtual memory). Results from the exact model are indicated with circles, through which a baseline reference is fitted. Results from approximated models are indicated with crosses. A point (x, y) indicates that by spending x seconds * MB per utterance (averaged over the 300 utterances), we can achieve a word accuracy of y.
Fig. 11. Accuracy versus time-memory product for a large NAB model (using in-core memory). Results from the exact model are indicated with circles, through which a baseline reference is fitted. Results from approximated models are indicated with crosses. A point (x, y) indicates that by spending x seconds * MB per utterance (averaged over the 300 utterances), we can achieve a word accuracy of y.
consumption. Each horizontal level of marks in Figures 8–11 corresponds to a collection of approximated models (one for each setting of ε) used with a particular beam width. That the accuracy remains unchanged for a given beam reflects the static results shown in Figure 6: the memory but not the accuracy decreases as the approximation parameter increases. Figures 10 and 11 suggest an overall improvement in the tradeoff between accuracy and total computational requirements using the approximated models.

Considering horizontal slices of the plots suggests that with a fixed beam width, we can achieve the same accuracy with fewer resources using an approximated model instead of the exact model. One can also consider the knee of each plot to depict the point at which increasing the beam width further has no appreciable effect on accuracy. (As we increase the beam width, accuracy increases at the cost of computational resources.) Approximate determinization moves the knee upwards and to the left, decreasing the point at which larger beams have a positive effect.

Summarizing, the results of both experiments suggest that, when applied to language models, approximate determinization with reasonable (perhaps even surprisingly large) approximation parameters has negligible impact on ASR accuracy and time but offers significant improvement in static and dynamic memory requirements.
5. Assessment of Language Preservation. Our definition of approximate determinization is quite liberal, as it allows an arbitrary assignment of weights to strings in the output. Even a fixed uniform weighting would satisfy the definition, although intuition suggests that an approximate determinization algorithm that weighted all strings uniformly would not be useful in ASR. It would therefore be useful to devise some language theoretic distance measure comparing two weighted languages on the same underlying set of strings.

An appropriate measure from the ASR perspective is based on the n-best paradigm. Rather than report just the most likely sentence uttered, many recognizers produce the n most likely sentences, for some given n. As described in Section 2, the recognizer output can be a word lattice, in which case the n most likely sentences correspond roughly to the n best (minimum cost) strings in the word lattice. We therefore measure the degree of language preservation by calculating how well D̃ preserves the n best strings of word lattices. Let X be a word lattice and let X̃ be the result of applying D̃ to X. The n-best preservation measure of X̃ with respect to X is the maximum n such that the n best strings in X appear, in the same order, as the n best strings in X̃.
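Once the n best strings of each lattice have been extracted (e.g., by a k-shortest-paths routine), the measure is simple to compute; a sketch (ours):

```python
def n_best_preservation(best_original, best_approx, n_max=20):
    """Largest n <= n_max such that the n best strings of the original
    lattice appear, in the same order, among the best strings of the
    approximated lattice.  Inputs are lists sorted by increasing cost."""
    n = 0
    for a, b in zip(best_original[:n_max], best_approx[:n_max]):
        if a != b:
            break
        n += 1
    return n
```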
Fig. 12. Lattice size reduction, showing the ratio of number of states (or transitions) in the approximately determinized lattice to that in the original lattice, aggregated over the 100 ATIS word lattices.
One alternative measure might be an edit distance between the ordered string sequences of the two weighted languages (the original lattice and that produced by D̃). Another might be some norm comparing the weights of corresponding strings in the two weighted languages. Since ASR systems based on stochastic modeling assign a weight to almost every possible string, however, the weighted languages are very large, so computing these measures might be computationally prohibitive. Furthermore, most strings have very low weights compared with the best ones; i.e., the number of sentences in the resulting lattice that are potentially “good” is very small. An edit distance or norm over the entire string sequences, therefore, would mostly measure effects among strings that are meaningless to the ASR process. Our n-best preservation measure addresses the “top” of the string sequence, i.e., the strings that are likely to be good answers. It asks for the maximal n such that the edit distance between the first n strings in the two string sequences is zero.

To measure n-best preservation, we applied D̃ to 100 word lattices, using the following approximation parameters: 0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, and 20. The lattices were generated by running 100 utterances randomly selected from a 1000-utterance ATIS development-test set, described in Appendix A.3, through the AT&T speech recognizer. For each lattice, we recorded the resulting size reduction as well as the n-best preservation measure. We aggregated the results over the test set.

Figure 12 depicts the size reduction obtained from determinizing the word lattices. The x-axis shows different values of ε, and the y-axis compares sizes of the results with those of the input word lattices. Even small values of ε gave reduction factors better than 0.4 for states and 0.2 for transitions. Figure 13 plots n-best preservation as a function of the approximation parameter. Even using large values of ε, D̃ did not affect the best strings induced by the lattices. Recall that setting ε to 0 reduces D̃ to D. The results suggest that, compared with exact determinization, approximate determinization provides considerable extra compression on word lattices without affecting the best strings induced by the lattices, an important consideration for ASR systems.
6. Conclusion. Our experiments suggest that approximate determinization produces language models that are considerably smaller than the minimal formally equivalent models, with negligible effects on ASR performance. The 25–35% saving in model space significantly reduces the dynamic memory consumption by the recognizer as well as the
Fig. 13. Preservation of best strings, showing the maximum n (up to n = 20) such that the n-best sentences in the approximately determinized lattice are identical, in order, to those in the original lattice, aggregated over the 100 ATIS word lattices.
off-line model-building time. The lack of effect on ASR time can be predicted by the discussion in Section A.3. Determinization reduces the number of outgoing transitions, which determines in large part the search time taken by the Viterbi algorithm, to one transition for each alphabet symbol. Approximate determinization does not further reduce this number and does not affect the likelihood calculation. As bigger models are required, however, we expect the size reduction achieved by D̃ to have a positive effect on caching and paging.

Developing a more formal, language theoretic measure of approximation would be useful. In particular, the definition of the approximate determinization algorithm in Section 2 is very general. More useful might be the notion of an ε-approximate determinization algorithm, which would preserve weights of strings up to a factor of ε. Although we discount computing such a measure empirically in Section 5, proving that an algorithm guaranteed such language preservation might enable one to guarantee theoretically such ASR results as we demonstrate experimentally. It would be interesting to extend the idea of approximation to all the WFA operations used in ASR and natural language processing [16], [17]. Finally, exploring approximation methods other than unifying states based on relative remainder clustering might yield other, useful approximate determinization algorithms.
Acknowledgements. We thank David Johnson, Mehryar Mohri, Fernando Pereira, and Giuseppe Riccardi for fruitful discussions. Additionally, Fernando provided us with the ATIS lattices. Mike Riley helped us prepare the experiments of Section 4. We also thank the referees for many constructive suggestions.
Appendix A. A Brief Tutorial on ASR. We give a brief tutorial on ASR to introduce necessary background and terminology and to motivate our consideration of WFA determinization. Rabiner and Juang [18] give an excellent description of the entire ASR process. Buchsbaum and Giancarlo [5] distill some of the algorithmic issues in ASR from a computer science perspective.

Recall the cascade of transformations described in Section 1 and depicted in Figure 1. Each of the transformations can be nondeterministic. For example, the same speech frame can be realized in many different phones, and a word can have multiple pronunciations. Stochastic models are therefore used to implement the various phases in ASR. Two that are important here are weighted finite-state machines, which are used to transform phones to (eventually) sentences, and hidden Markov models, which are used to transform speech frames to phones. A search engine combines the two models to process the input sequence of speech frames. As we detail the process below, it will become evident that WFA determinization and size reduction are critical to effective ASR.

A.1. Weighted Finite-State Machines. Pereira et al. [16], [17] show how to model the ASR cascade as a composition of transductions. Their approach, which we summarize, introduces the use of weighted finite-state machines to realize the various stages of the ASR process. We first introduce the general concepts of weighted transductions and languages, and we then show how they can be applied to ASR.

Given two alphabets Σ and Δ and a semiring R = (K, ⊕, ⊗, 0, 1), a weighted transduction is a function T : Σ* × Δ* → K. Intuitively, a transduction assigns a weight to each pair of strings in Σ* × Δ*. For example, the lexicon can be realized as a transduction between phones and words, with weights allowing a stochastic assignment of probabilities to multiple pronunciations of single words. A weighted language is a function L : Σ* → K. Intuitively, a weighted language assigns a weight to each string in Σ*. For example, tri-gram grammars, described in Section 2, are weighted languages that assign probabilities to sentences.

Given a commutative semiring R = (K, ⊕, ⊗, 0, 1) and two transductions S : Σ* × Δ* → K and T : Δ* × Γ* → K, the composition S ◦ T is defined as follows (we assume that collecting over a countable set, e.g., taking the ⊕-wise sum over strings in Δ*, is well defined for R):

(S ◦ T)(x, w) = ⊕_{y ∈ Δ*} S(x, y) ⊗ T(y, w).

Given a transduction S : Σ* × Δ* → K and weighted languages L1 : Σ* → K and L2 : Δ* → K, the applications of S to L1 and L2 form another type of composition and yield weighted languages Δ* → K and Σ* → K, respectively:

(L1 ◦ S)(y) = ⊕_{x ∈ Σ*} L1(x) ⊗ S(x, y);    (S ◦ L2)(x) = ⊕_{y ∈ Δ*} S(x, y) ⊗ L2(y).

Finally, given two weighted languages L1, L2 : Σ* → K, their intersection is viewed
as a composition, L1 ◦ L2 : Σ* → K, defined by (L1 ◦ L2)(x) = L1(x) ⊗ L2(x). While the above definitions are sound for arbitrary semirings [2], [3], [9] under the assumption regarding collecting over countable sets, the ASR application in particular uses the min-sum semiring, (ℝ⁺ ∪ {0, ∞}, min, +, ∞, 0). We therefore restrict our attention to the min-sum semiring for the rest of this tutorial.

Weighted languages and transductions can be represented by weighted finite-state automata and transducers, respectively. We defined WFAs in Section 2. Weighted finite-state transducers are defined similarly, adding a second alphabet. We omit the formal definition, since we consider only WFAs in our algorithm and experiments.

Pereira et al. [16], [17] apply this general formalism to model ASR as a composition of transductions. Let O be the sequence of speech frames, realized by a weighted language mapping feature vectors to ℝ⁺ ∪ {0, ∞}. Let A be the acoustic models, realized by a weighted transduction from feature vectors to phones. Let D be the lexicon, realized by a weighted transduction from phones to words. Let M be the grammar, realized by a weighted language mapping word strings to ℝ⁺ ∪ {0, ∞}. As with the tri-gram grammars described in Section 2, each transition t in one of the above finite-state machines is weighted by the probability of taking t, given the source state. The product of the probabilities along a path thus equals the probability of inducing the corresponding string along that path. Using negative log probabilities reduces the products to sums, and the minimum weight (or minimum cost) path then corresponds to the most likely sequence of transitions. Using the min-sum semiring, the composition G = ((O ◦ A) ◦ D) ◦ M encodes the ASR result. Specifically, the string induced by the minimum cost path through G is used as an approximation to the most likely sentence uttered. Since the size of the composition of two weighted finite-state automata or transducers X and Y is |X ◦ Y| = Θ(|X| · |Y|), size reduction of these machines is crucial to fast and effective ASR.

Unfortunately, since acoustic features are realized by continuous measurements, the alphabet of feature vectors is infinite, precluding the instantiations of O and A by WFAs. To implement a transduction from feature vectors to phones, therefore, we use a different type of stochastic model.
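In the min-sum semiring, ⊕ is min and ⊗ is +, so composition amounts to adding weights along each matching intermediate string y and minimizing over the choices of y. A toy sketch for transductions given extensionally as Python dicts (our illustration; real systems compose the automata themselves rather than enumerate string pairs):

```python
import math
from collections import defaultdict

def compose(S, T):
    """(S o T)(x, w) = min over y of S(x, y) + T(y, w), for dict-encoded
    transductions S: (x, y) -> weight and T: (y, w) -> weight."""
    result = defaultdict(lambda: math.inf)
    for (x, y), s in S.items():
        for (y2, w), t in T.items():
            if y == y2:
                result[(x, w)] = min(result[(x, w)], s + t)
    return dict(result)

# A tiny lexicon-like transduction (phone strings to words) composed with a
# grammar-like transduction; weights are negative log probabilities.
S = {(("d", "aa", "g"), "dog"): 0.2, (("b", "aa", "g"), "bog"): 0.5}
T = {("dog", "dog"): 1.1, ("bog", "bog"): 2.3}
print(compose(S, T))   # {(('d','aa','g'),'dog'): 1.3, (('b','aa','g'),'bog'): 2.8}
```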
For ASR, we consider $M$ as a string matcher or observer. The probability $\Pr(X \mid M)$ that $M$ matches a string $X = \sigma_1 \cdots \sigma_T \in \Sigma^T$ can be computed via dynamic programming:
$$\Pr(X \mid M) = \prod_{t=1}^{T} \sum_{i=1}^{N} \Pr(q_t = i)\, b_i(\sigma_t),$$
where
$$\Pr(q_t = j) = \begin{cases} \pi_j, & t = 1, \\[2pt] \displaystyle\sum_{i=1}^{N} \Pr(q_{t-1} = i)\, A_{ij}, & t > 1. \end{cases}$$
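This dynamic program translates directly into a few lines of code. The following self-contained sketch (our own naming, illustrative only) maintains the state-occupancy probabilities $\Pr(q_t = i)$ forward in $t$ and accumulates the product over frames.

```python
def match_probability(X, A, pi, b):
    """Sketch of the dynamic program above.

    X  : observation sequence sigma_1 .. sigma_T
    A  : A[i][j], transition probabilities
    pi : initial state distribution
    b  : b(i, sigma), output probability/density for state i
    """
    N = len(pi)
    prob = 1.0
    q = pi[:]  # Pr(q_1 = i) = pi_i
    for t, sigma in enumerate(X):
        if t > 0:  # Pr(q_t = j) = sum_i Pr(q_{t-1} = i) A_{ij}
            q = [sum(q[i] * A[i][j] for i in range(N)) for j in range(N)]
        prob *= sum(q[i] * b(i, sigma) for i in range(N))  # factor for frame t
    return prob
```

For instance, with the toy model `m` from the previous sketch, `match_probability([0.1, 0.9], m.A, m.pi, m.b)` evaluates the two displayed formulas directly.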
In ASR, HMMs are usually used to transform speech frames into phones. These HMMs typically have between three and seven states and are created ("trained," in the literature) by iterated stochastic inference [1]. The advantage of using HMMs in this stage is that the alphabet of speech frames can be infinite; we need only be able to compute $b(\sigma)$ efficiently for any speech frame $\sigma$. There are many types of discrete and continuous distributions that can comprise $b$; Rabiner and Juang [18] discuss this in detail.

A.3. The Search Engine. Recall the cascade of compositions $G = ((O \circ A) \circ D) \circ M$ that realizes the ASR process applied to the sequence $O$ of speech frames. We can exploit the associativity of composition to realize $G$ as $(O \circ A) \circ (D \circ M)$; we refer to $D \circ M$ as an extended lexicon. A set of HMMs can realize the transduction $O \circ A$, as described in Section A.2, and $D \circ M$ can be computed as described in Section A.1. A search engine is used to intersect the two resulting weighted languages.

For example, the AT&T speech recognizer [19] employs the following two-level Viterbi search algorithm, as described by Lee and Rabiner [13]. Let $D' = D \circ M$. The algorithm maintains a list of active states of $D'$; initially, the start state is active. The algorithm proceeds in a time-synchronous manner, matching speech frames from the input. For each frame $f$ and each transition $t = (q, \sigma, \nu, q')$ out of each active state $q$, the HMMs are used to compute the probability that $f$ matches the label, $\sigma$, on $t$. ($O \circ A$ is thus computed as needed; this is referred to as the likelihood calculation.) That probability is used together with the transition probability, $\nu$, on $t$ to determine the probability of taking transition $t$ to its terminal state, $q'$. (Only the highest-probability, that is, lowest-cost, path to $q'$, and the corresponding induced string, is preserved.) This is referred to as the search phase. To reduce the search space, at the cost of accuracy, a threshold is applied: $q'$ is made active if and only if the probability of reaching it exceeds the threshold. The next frame is then processed with respect to the states that are then active. The threshold is usually implemented as a beam width, $B$: the probability of the optimal state-transition sequence is maintained, and states with probabilities within a factor of $B$ of optimal are kept active. A sketch of this beam-pruned search appears below.

Determinizing $D'$, therefore, has a direct role in reducing the time spent in the search phase of the Viterbi algorithm: because the search time is directly proportional to the number of outgoing transitions from the active states, reducing this number reduces the search time. In real applications, there is at least one transition on each symbol (phone) out of each state, so in some sense determinization reduces the search time to near optimal for a given extended lexicon.
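The full recognizer interleaves expansion of lexicon states and HMM states (hence "two-level"). The following simplified one-level sketch, with our own names throughout and a `match_cost` oracle standing in for the HMM likelihood calculation, illustrates only the search phase and beam pruning just described, with weights as negative log probabilities (costs).

```python
def beam_search(start, finals, arcs, frames, match_cost, log_beam):
    """Time-synchronous beam search sketch (costs = negative log probs).

    arcs[q]               : list of (sigma, nu, q2) transitions out of state q
    match_cost(f, sigma)  : cost of frame f matching label sigma
                            (the likelihood calculation, via HMMs)
    log_beam              : beam width B, expressed as a cost offset
    """
    active = {start: (0.0, [])}  # state -> (best cost so far, induced string)
    for f in frames:
        nxt = {}
        for q, (cost, labels) in active.items():
            for sigma, nu, q2 in arcs.get(q, []):
                c = cost + match_cost(f, sigma) + nu    # the search phase
                if q2 not in nxt or c < nxt[q2][0]:     # keep best path only
                    nxt[q2] = (c, labels + [sigma])
        if not nxt:
            return None
        best = min(c for c, _ in nxt.values())
        # Beam pruning: keep states within log_beam of the current optimum.
        active = {q: cw for q, cw in nxt.items() if cw[0] <= best + log_beam}
    reached = [cw for q, cw in active.items() if q in finals]
    return min(reached, key=lambda cw: cw[0]) if reached else None
```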
Appendix B. Obtaining Systems and Data. Most commercially available speech recognition products are tailored more toward application developers than researchers. One product, though, called HTK, provides a low-level toolkit for experimenting with speech recognition algorithms in addition to an application-building interface. It is available from Entropic, Inc., http://www.entropic.com.
A binary version of the finite-state toolkit used to develop the AT&T speech recognizer is available at http://www.research.att.com/sw/tools/fsm. Although approximate determinization is not provided as part of the toolkit, it is easily implemented following the algorithm we present.

The NAB and ATIS corpora, development sets, and evaluation sets are available from the Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu. The experiments reported in Section 4 used the 300 utterances from the NAB development test set comprising Subjects 710–71j, included in LDC's CSR-IV Hub 3 collection. The experiments reported in Section 5 used 100 utterances selected at random from the 1000-utterance test set used in the December 1993 ATIS3 evaluation, included in LDC's ATIS3 Training Data.

References

[1] L. E. Baum and J. A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bulletin of the American Mathematical Society, 73:360–363, 1967.
[2] J. Berstel. Transductions and Context-Free Languages, volume 38 of Leitfäden der angewandten Mathematik und Mechanik LAMM. Springer-Verlag, Berlin, 1979.
[3] J. Berstel and C. Reutenauer. Rational Series and Their Languages, volume 12 of EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, 1988.
[4] E. Bocchieri, G. Riccardi, and J. Anantharaman. The 1994 AT&T ATIS CHRONUS recognizer. In Proc. ARPA Spoken Language Technology Workshop, pages 265–268, 1995.
[5] A. L. Buchsbaum and R. Giancarlo. Algorithmic aspects in speech recognition: An introduction. ACM Journal of Experimental Algorithmics, 2, 1997. http://www.jea.acm.org.
[6] A. L. Buchsbaum, R. Giancarlo, and J. R. Westbrook. On the determinization of weighted finite automata. In Proc. 25th Internat. Coll. on Automata, Languages, and Programming, volume 1443 of Lecture Notes in Computer Science, pages 482–493. Springer-Verlag, Berlin, 1998. To appear in SIAM Journal on Computing.
[7] C. Choffrut. Une caractérisation des fonctions séquentielles et des fonctions sous-séquentielles en tant que relations rationnelles. Theoretical Computer Science, 5:325–337, 1977.
[8] C. Choffrut. Contributions à l'étude de quelques familles remarquables de fonctions rationnelles. Ph.D. thesis, LITP-Université Paris 7, Paris, 1978.
[9] S. Eilenberg. Automata, Languages, and Machines, volume A. Academic Press, San Diego, CA, 1974.
[10] F. Jelinek, R. L. Mercer, and S. Roukos. Principles of lexical language modeling for speech recognition. In S. Furui and M. M. Sondhi, editors, Advances in Speech Signal Processing, chapter 21, pages 651–699. Marcel Dekker, New York, 1992.
[11] P. Kenny, R. Hollan, V. N. Gupta, M. Lenning, P. Mermelstein, and D. O'Shaughnessy. A*-admissible heuristics for rapid lexical access. IEEE Transactions on Speech and Audio Processing, 1:49–57, 1993.
[12] R. Lacouture and R. De Mori. Lexical tree compression. In Proc. 2nd Euro. Conf. on Speech Communication and Technology, volume 2, pages 581–584, 1991.
[13] C.-H. Lee and L. R. Rabiner. A frame-synchronous network search algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1649–1658, 1989.
[14] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[15] M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.
[16] F. Pereira and M. Riley. Speech recognition by composition of weighted finite automata. In E. Roche and Y. Schabes, editors, Finite-State Language Processing, chapter 15. MIT Press, Cambridge, MA, 1997.
[17] F. Pereira, M. Riley, and R. Sproat. Weighted rational transductions and their application to human language processing. In Proc. ARPA Human Language Technology Conf., pages 249–254, 1994.
[18] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall Signal Processing Series. Prentice-Hall, Englewood Cliffs, NJ, 1993.
[19] M. D. Riley, A. Ljolje, D. Hindle, and F. C. N. Pereira. The AT&T 60,000 word speech-to-text system. In Proc. 4th Euro. Conf. on Speech Communication and Technology, volume 1, pages 207–210, 1995.
[20] K. Salomaa and S. Yu. NFA to DFA transformation for finite languages over arbitrary alphabets. Journal of Automata, Languages and Combinatorics, 2(3):177–186, 1997. Also appears in Proc. WIA '96, volume 1260 of Lecture Notes in Computer Science, pages 149–158. Springer-Verlag, Berlin, 1997.
[21] A. Weber and R. Klemm. Economy of description for single-valued transducers. Information and Computation, 118:327–340, 1995.