ISSN 1064-2269, Journal of Communications Technology and Electronics, 2013, Vol. 58, No. 12, pp. 1292–1301. © Pleiades Publishing, Inc., 2013. Original Russian Text © V.N. Sorokin, I.V. Geras’kin, 2013, published in Informatsionnye Protsessy, 2013, Vol. 13, No. 2, pp. 35–47.
INFORMATION TECHNOLOGIES IN TECHNICAL AND SOCIAL-ECONOMICAL SYSTEMS
Vocal-Tract Length Estimation

V. N. Sorokin and I. V. Geras’kin

Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia

Received April 25, 2013
Abstract—Two methods for estimating the vocal-tract length (the length of the equivalent homogeneous acoustic tube) are investigated. The first method calculates the tract length from the difference between the frequencies of adjacent local spectral maxima above 4 kHz. The second calculates it from the average frequency of the second formant, determined from the frequencies of the first three formants. Variants of the analysis are discussed both irrespective of the context and with allowance for known vowels. The probability that the speaker gender is correctly recognized by either method is about 13%, almost independently of knowledge of the context. The probabilities that male and female voices are correctly recognized from the difference of the higher formants are 31 and 25.5%, respectively, regardless of the context, and 37 and 31% with allowance for it. The probabilities of correct recognition of male and female voices reach 27 and 21.5%, respectively, for context-independent recognition based on the average frequency of the second formant, and 43 and 35.5% for context-dependent recognition with a known vowel type.

Keywords: vocal-tract length, gender recognition, speaker recognition

DOI: 10.1134/S1064226913120164
1. INTRODUCTION

When the inverse problem concerning the vocal-tract shape is solved via the variational method, a parametric tract model involving the vocal-tract length L is employed. To reduce the volume of the search for the desired model parameters, it is advisable to estimate the most probable tract length of the current pronunciation in advance. Each speaker is characterized by a definite vocal-tract length in the neutral state, and his control of the larynx height can differ from that of other speakers when different sounds are articulated. Hence, vocal-tract length estimation can be applicable to speaker recognition.

In the problem of speaker-independent automatic speech recognition, it is reasonable to determine the speaker gender beforehand, because the acoustical characteristics of male and female voices are different. Anatomical measurements demonstrate that the difference between the vocal-tract lengths of men and women can reach 25% [7]. When the speaker gender is identified before recognition, the probability of correct speech recognition can be improved.

In these problems, instead of directly estimating the speaker’s vocal-tract length, the technique of compensating the difference between the vocal-tract lengths of various speakers is widely used. This compensation can be performed by transforming speech-signal parameters in the time, frequency, and cepstrum domains. In the time domain, the Mellin transform is employed together with extension or compression of the time axis [18]. In the frequency domain, a linear or nonlinear transformation of the frequency axis is implemented [1, 8]. In a number of cases, such a transformation is applied to either the averaged frequency of the third formant [5] or the average frequency of the first three formants [19]. In the cepstrum domain, the cepstrum averaged over a prolonged time interval, or the cepstrum calculated during the previous frame, is subtracted [13].

Compensation of the speaker’s vocal-tract length is an indirect method of normalizing anatomical parameters. When the tract length is estimated directly, together with compensation of its variations, it is possible to enhance the stability of a speech-recognition system to the individual features of speaker anatomy.

The vocal-tract length has been estimated with the help of tract models of different complexity. In [11], a two-tube tract model was employed for this purpose. In [2, 3], the heuristic estimate L = (L1 + L2 + L3)/3 was suggested, where

L_k = c0(2k − 1)/(4F_k),   (1)

c0 is the sound speed (≈350 m/s), and F_k is the kth resonance frequency. This expression is valid only for a homogeneous acoustic tube, but the calculated estimate correlated with the true vocal-tract length, as was reported by the authors. In [4], the vocal-tract length is estimated by determining the parameters of a hidden Markov model with input formant frequencies. In this case, the error is about 3%.

In this study, the vocal-tract length was estimated via two methods: from the higher formant frequencies and from the second-formant frequency estimated as the average frequency of the first three formants in the vowel interval.

2. HIGHER-ORDER RESONANCE FREQUENCIES

The vocal-tract length is the distance along its medial line from the glottis to the last coordinate of the lips. The medial line is, in turn, the geometrical locus of the midpoints of the shortest straight lines between any point on the immovable surface of the vocal tract and the corresponding point on the movable surface. The medial line position and its length depend on both the larynx height and the vocal-tract shape (Fig. 1). The effective vocal-tract length increases due to the features of sound radiation into space. This increase depends on the mouth-opening area and can reach 5% at a large opening [14].

It has been shown [15] that the resonant frequency of an inhomogeneous acoustic tube grows or diminishes depending on the position of the tract narrowing. For convenience, it is accepted that the initial cross-section area S(x) is perturbed at a single point with coordinate x0; i.e., the perturbation is described by the δ function. Thus, if S1(x) = εδ(x − x0), the first-order correction η1 to the eigenvalue of the perturbed system is found from the expression

η1 = ε{[ψ_i'^(0)(x0)]² − [λ_i^(0) ψ_i^(0)(x0)]²} / (2λ_i^(0)),

where the eigenfunctions ψ_i^(0) and their eigenvalues λ_i^(0) characterize the initial area function of the vocal tract. Hence, the ith resonance frequency is affected mainly by a perturbation applied at either a node of the ith eigenfunction (ψ_i^(0)(x0) = 0) or its antinode (ψ_i'^(0)(x0) = 0). The ith resonance frequency of the perturbed system increases or decreases with the sign of ε: in the first and second cases, the resonant frequency diminishes and grows, respectively, for a narrowing (ε < 0). The ith resonance frequency remains unchanged when

[ψ_i'^(0)(x0)]² = [λ_i^(0) ψ_i^(0)(x0)]².

In this case, i.e., if the tract area is perturbed at a point between the node and antinode of a certain eigenfunction where these terms balance, the eigenvalue is constant. Hence, such an area perturbation weakly affects the HF tract resonances, and these frequencies depend only on the tract length, as in a homogeneous acoustic tube. This property can be used to estimate the vocal-tract length when the higher-order resonant frequencies are measured in the voice signal.

Variations in the tract cross-section area corresponding to its shape in Fig. 1, as well as the first three tract eigenfunctions and the eigenfunctions of a homogeneous tube of the same length, are depicted in Fig. 2.

Certain difficulties are encountered when the vocal-tract lengths are estimated as the homogeneous
Fig. 1. Vocal-tract shape and its medial line for vowel |a| [15].
tube lengths according to HF resonances. On the one hand, the higher the resonant frequency, the weaker its dependence on the vocal-tract shape. On the other hand, transverse waves, with resonances almost indistinguishable from those of plane waves, are observed at high frequencies. As was estimated in [16], plane waves, for which the resonant frequency is coupled with the tube length by (1), travel along a homogeneous tube with hard walls at frequencies of up to 8 kHz. At the same time, according to the estimates established in [6], the transverse waves of a vocal tract arise beginning with 4 kHz. Since the HF components of a speech signal have small amplitudes, they are subject to the influence of noise and depend on the characteristics of the excitation source. For certain voices, especially female ones, the speech-signal spectrum is restricted, i.e., has no resonances higher than the third order.

Fig. 2. Variations in tract cross-section area during the articulation of sound |a|. Continuous curves designate the tract eigenfunctions, and dashed curves correspond to the homogeneous tube eigenfunctions.

An additional problem is that the ordinal resonance number cannot be easily estimated in the speech-signal spectrum. In a number of cases, it is preferable to ignore expression (1) and estimate the tract length from the difference between neighboring formant frequencies in the HF range:

L = c0/[2(F_k − F_{k−1})].   (2)

3. ESTIMATES BASED ON THE ARTICULATION MODEL

To compare different methods for estimating the vocal-tract length from its resonant frequencies, this length and the resonant frequencies must be measured simultaneously. In practice, such measurements can be performed with the help of magnetic resonance tomography. However, access to such data is impeded, and their errors are related to the slice thickness (about 0.5 cm). In this method, when resonant frequencies are estimated according to the calculated formant frequencies, errors are caused by the insufficient duration of a speech signal. Restricted, but quite accurate, data can be obtained using an articulatory speech synthesizer. In [15], data on the resonant frequencies up to the sixth resonance and on the length and shape of the vocal tract for Russian vowels can be found. These data and the vocal-tract length estimates calculated by different methods are summarized in Table 1. Here, ΔF56 = F6 − F5 is the difference between the sixth and fifth resonance frequencies, and F̄2 = (F1 + F2 + F3)/3 is the average frequency of the second formant.

Table 1. Vocal-tract length estimates obtained from its resonant frequencies

Vowel | Vocal-tract length, cm | L_ΔF = c0/(2ΔF56), cm | ε, % | L_F = 3c0/(4F̄2), cm | ε, % | L = (L1 + L2 + L3)/3, cm | ε, % | L6, cm | ε, %
A | 19.36 | 18.64 | –3.7 | 21.44 | 10.7 | 19.94 | 3.0 | 20.32 | 5.0
O | 19.36 | 25.09 | 29.6 | 21.12 | 9 | 21.73 | 12.2 | 20.35 | 5.1
U | 21.12 | 21.28 | 0.8 | 23.80 | 12.7 | 24.47 | 15.8 | 21.70 | 2.7
I | 16.28 | 16.09 | –1.2 | 13.67 | –16 | 15.51 | –4.7 | 15.31 | –6.0
Y | 20.68 | 25.38 | 22.7 | 21.07 | 1.9 | 20.13 | –2.6 | 21.76 | 5.2
E | 19.80 | 26.89 | 35.8 | 19.30 | –2.5 | 18.93 | –4.3 | 22.77 | 15.0
Average absolute error, % | | | 15.6 | | 8.8 | | 7.1 | | 6.5

Note: ΔF56 = F6 − F5 is the difference between the frequencies of the sixth and fifth resonances, and F̄2 = (F1 + F2 + F3)/3 is the average frequency of the second formant.

The least error corresponded to the sixth resonance frequency. However, it is often difficult to determine the ordinal formant number in a real speech signal, which can sometimes lack formants altogether. The formant-indexing problem can be partially solved using the difference between adjacent formant frequencies above 4 kHz. However, there are voices whose frequency spectrum does not reach this value. It should be noted that only tract lengths represented as distances along the medial line are contained in Table 1. For example, the final correction of 3% diminishes the tract-length estimation error based on F6 to 2 (|A|), 2 (|O|), 0.2 (|U|), 2.1 (|Y|), and 11.6% (|E|); at the same time, the error corresponding to |I| increases to –8.7%. In this case, the average absolute error is reduced to 4.4%.

It might be assumed that the tract length estimated from the resonant frequencies of vowel |E| must correspond best to the true tract length, because the tract cross-section area is most uniform during the articulation of this vowel. However, the tract for this vowel considerably narrows below the epiglottis. Hence, the tract-length estimation errors obtained from L6 and L_ΔF in Table 1 turn out to be maximal in comparison with the estimates based on the average resonant frequencies.

The vocal-tract length estimated as the average of the lengths obtained from each of the three formant frequencies is approximately 1.7% greater than the estimate corresponding to the average frequency of the second formant. However, when the formant frequencies are determined from speech signals, their values diverge appreciably from the real frequencies. If these divergences are random and independent, the expression L = 3c0/(4F̄2) can turn out to be advantageous over the formula L = (L1 + L2 + L3)/3.

4. FORMANT FREQUENCY MEASUREMENTS

Until now, severe difficulties have been encountered when formant frequencies are determined as local maxima in the speech-signal spectrum. These difficulties are associated with both the features of sound generation in the vocal tract and external conditions. As a result of the interaction with acoustic oscillations in the subglottal region, either local extrema caused by the resonances of the subglottal region can appear in the vowel spectrum, or the resonant oscillations of the mouth opening, which resemble resonances in the subglottal region, can be suppressed. By analogy, if the nasal-cavity duct is incompletely shut off by the velum, antiresonances and additional peaks can arise in the signal spectrum. In a reverberant room, speech-signal reception is accompanied by the suppression of certain frequencies and the appearance of false spectral peaks. Difficulties related to the estimation of resonant frequencies increase with the use of telephones.

There is a variety of algorithms for estimating the resonant frequencies of a vocal tract, in which the frequencies of the peaks of an amplitude spectrum, the poles of the tract transfer function found via the linear prediction method, and the intervals between zero crossings of the signal (in certain frequency
Fig. 3. Spectrogram of the word odin (abscissa: time, s; ordinate: frequency, Hz, mel scale).

Fig. 4. Amplitudes of the spectrogram (abscissa: frequency, Hz, mel scale; ordinate: intensity).
bands) are employed most frequently [17]. All approaches have advantages and disadvantages that depend on the speech-signal reception conditions, the speaker’s voice type, and the specific phonetic composition of speech.

In this study, formant frequencies are sought using a comb of filters whose properties almost coincide with those of the basilar membrane of the human auditory analyzer. These filters were proposed in [12] and called gammatone filters. In the time domain, the filter response is defined as

g(t) = t^(n−1) e^(−bt) cos(ωt + φ),

where n is the function order (as a rule, n = 4), b characterizes the transmission bandwidth, ω is the central circular frequency, and φ is the phase constant, which is commonly taken to be zero. For this filter, the Laplace transform is defined as [9]

G(s) = 6(−b⁴ − 4b³s − 6b²s² − 4bs³ − s⁴ + 6b²ω² + 12bsω² + 6s²ω² − ω⁴) / (b² + 2bs + s² + ω²)⁴.

In the gammatone system, the frequency scale can be selected arbitrarily, ensuring flexibility in the development of speech-analysis methods.

The dynamic spectrogram, obtained by calculating the envelope of the positive output signal of each filter, is presented in Fig. 3. Positive signals are employed because the internal hair cells of the spiral organ are sensitive only to positive shifts of the basilar membrane. The spectral profile corresponding to the time t = 0.845 s is illustrated in Fig. 4.

The system of gammatone filters provides a high resolution in frequency, which is an advantage when the local spectral maxima are estimated at relatively high frequencies. At the same time, the fundamental-tone harmonics are simultaneously resolved at low frequencies, masking the first-formant position, because an energy peak always corresponds to some harmonic. To eliminate this phenomenon in estimating the first-formant frequency, a cepstral transformation must be carried out. The discrete cepstral-transform coefficients are calculated from the formula

c_n = Σ_{m=1}^{M} [log Y(m)] cos[(πn/M)(m − 1/2)],

where Y(m) is the output signal of the mth filter, c_n is the nth cepstral coefficient, and M is the number of filters. After the number of coefficients was reduced to 7 and 11 (to estimate the first- and second-formant frequencies, respectively), the inverse cepstral transformation was performed, and local energy peaks were revealed in the obtained spectrograms.

It was shown [10] that the problem of determining the resonant frequencies of a vocal tract is ill-posed, and its solution requires additional information. Since the values of the first three formants of Russian vowels are known, the fraction of incorrectly found formants can be calculated by directly estimating the formant frequencies from the spectrogram. This fraction is about 30%. If the combined probability distribution of the three formants of the Russian language is employed, the error still turns out to be high (about 20%). The formant-frequency estimation error was reduced to about 5% only when the formant distribution was chosen separately for each vowel.

The discrete estimates of the formant frequencies of vowel |a| from the word dva, synchronized with the samples of the fundamental-tone frequency, are depicted in Fig. 5. The formant frequency was determined by averaging the most often observed discrete estimates in the specified time interval.
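The measurement chain described above (gammatone filterbank → log channel energies → truncated cepstral transform → peak picking) can be sketched as follows. This is an illustrative numpy reconstruction, not the authors’ code: the sampling rate, the ERB-style bandwidth formula, the linear channel spacing (the paper uses a mel scale), and the synthetic two-formant test signal are all assumptions made for the example; only the gammatone impulse response, the DCT-II cepstrum, and the idea of truncating it (the paper keeps 7 or 11 coefficients) come from the text.

```python
import numpy as np

FS = 16000  # sampling rate, Hz (assumed)

def gammatone_ir(fc, bw, n=4, dur=0.064, fs=FS):
    """g(t) = t^(n-1) exp(-2*pi*bw*t) cos(2*pi*fc*t); phase constant taken as zero."""
    t = np.arange(int(dur * fs)) / fs
    g = t ** (n - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))  # unit-energy normalization

def channel_log_energies(x, centers, fs=FS):
    """Log output energy of each gammatone channel: a coarse spectral profile Y(m)."""
    prof = []
    for fc in centers:
        bw = 24.7 + 0.108 * fc  # ERB-like bandwidth (Glasberg-Moore form, assumed)
        y = np.convolve(x, gammatone_ir(fc, bw, fs=fs), mode="same")
        prof.append(np.log(np.mean(y ** 2) + 1e-12))
    return np.array(prof)

def cepstral_smooth(log_profile, n_keep):
    """DCT-II cepstrum c_n = sum_m log Y(m) cos(pi*n*(m + 1/2)/M), truncated, inverted."""
    M = len(log_profile)
    basis = np.cos(np.pi * np.outer(np.arange(M), np.arange(M) + 0.5) / M)
    c = basis @ log_profile            # forward transform
    c[n_keep:] = 0.0                   # keep only the first n_keep coefficients
    return (c[0] / 2 + basis[1:].T @ c[1:]) * (2.0 / M)  # inverse DCT-II

# Synthetic vowel-like signal: damped oscillations at two "formants" (700 and
# 1200 Hz, assumed) restarted at every glottal pulse of a 100-Hz fundamental.
t = np.arange(int(0.2 * FS)) / FS
phase = (t * 100.0) % 1.0 / 100.0      # time since the last glottal pulse, s
x = np.exp(-250 * phase) * (np.cos(2 * np.pi * 700 * phase)
                            + 0.7 * np.cos(2 * np.pi * 1200 * phase))

centers = np.arange(100.0, 3000.0, 100.0)        # channel center frequencies, Hz
smooth = cepstral_smooth(channel_log_energies(x, centers), n_keep=11)
f_peak = centers[np.argmax(smooth)]              # smoothed spectral maximum, Hz
```

Truncating the cepstrum removes the pitch-harmonic ripple from the channel profile, so the surviving maximum tracks the formant envelope rather than an individual harmonic, which is exactly why the paper applies this step before picking the first formant.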
Fig. 5. Tracks of the formant frequencies of the stressed vowel |a| in the word dva (abscissa: time, s; ordinate: frequency, Hz, mel scale). Crosses and circles designate the estimates obtained from the initial spectrogram and after cepstral transformations, respectively. Dots correspond to the fundamental tone and its first harmonic.

5. EXPERIMENTAL RESULTS AND DISCUSSION

Gender Recognition

Experiments were performed using a database of the single-digit numerals of the Russian language. The speech signals of 216 men and 177 women were recorded under different acoustic conditions and by means of various microphones and ADCs. In total, 285361 pronunciations were used. Since the sixth-formant frequency was determined unreliably, the tract lengths were estimated from various higher-formant frequencies or from the first three formants.

The first stage consisted in estimating the male and female vocal-tract lengths. The distributions of male and female vocal-tract lengths, calculated from the difference between the higher formants over all pronunciations of the numerals, are presented in Fig. 6. As can be seen, the most probable vocal-tract length of men (16 cm) is on average somewhat greater than that of women (15.2 cm), and the distribution for male voices is shifted toward larger tract lengths. These results completely agree with anatomical data. The average male and female vocal-tract lengths (16.39 and 15.97 cm, respectively) are quite plausible, but the minimum and maximum lengths (10 and 29 cm, respectively) probably exceed the real range of anatomical data for adults.

Fig. 6. Distributions of vocal-tract lengths in accordance with the difference between the higher frequencies (abscissa: length, cm; ordinate: probability). Continuous and dashed curves designate the tract lengths of men and women, respectively.

For all pronunciations (i.e., irrespective of the context), the error of gender recognition based on this parameter is about 87%. The minimum gender-recognition error is attained for the stressed vowels in the words pyat’ and devyat’ (83 and 84%, respectively), whereas the error averaged over the estimates for each word is approximately 86.5%, and the maximum error (90.2%) was revealed for vowel |a| in the word dva.

The distributions of male and female vocal-tract lengths, obtained over all pronunciations with the use of the frequencies of the first three formants, are depicted in Fig. 7. It is seen in Fig. 7 that the most probable vocal-tract length of men exceeds that of women, as in Fig. 6, but both values are diminished by approximately 2 cm, i.e., are 14 and 13.5 cm, respectively. The average male and female vocal-tract lengths are, respectively, 15.85 and 15.27 cm, and the range of estimates is reduced to some extent: from 10.5 to 28.5 cm. The most probable tract length is shifted to smaller values because words involving the stressed vowels |i, y, ya, e|, with a high average second-formant frequency F̄2, predominate in the vocabulary. The second peak of estimates, in the neighborhood of 19 cm, corresponds to the words dva and shest’, whose vowels have a relatively low second formant. This
Fig. 7. Distributions of vocal-tract lengths in accordance with the average second-formant frequency (abscissa: length, cm; ordinate: probability). Continuous and dashed curves designate the tract lengths of men and women, respectively.
between the higher formants and the first three formants are almost identical, reaching approximately 87–88%.

Speaker Recognition

In contrast to gender recognition, speaker recognition based on the vocal-tract length exhibits substantially different characteristics.

The average errors of speaker recognition with and without allowance for the context, together with the recognition errors corresponding to particular vowels, calculated by estimating the vocal-tract length from the difference ΔF56 between the higher formant frequencies, are presented in Figs. 8a and 8b. It is seen that only a small number of male and female speakers are well distinguished by this parameter. Moreover, when the vocal-tract length is estimated regardless of the context, the average errors of recognition of male and female speakers are about 68.3 and 74.5%, respectively. The errors averaged over each vowel turn out to be somewhat smaller: 63.1 and 68.8%, respectively. In other words, context-dependent analysis yields a gain of about 5% on average. The minimum errors of male recognition are attained for the stressed vowels in the words pyat’ and devyat’, where the average errors are 59.4 and 59.1%, respectively. By analogy, the minimum errors of female recognition correspond to the stressed vowels in the words dva and pyat’, where the average errors are 62.6 and 66.4%, respectively.

Fig. 8. Errors arising when (a) male and (b) female recognition is performed. Upper curves designate the average errors obtained irrespective of the context, and lower curves correspond to each of the vowels. The speaker numbers are plotted on the abscissa axis; the ordinate shows the average recognition error.

The influence of context-dependent analysis manifests itself in the search for the pairs of best-distinguished speakers. The minimum errors of recognition of each speaker, i.e., the errors achieved for at least one pair of compared speakers, are depicted in Figs. 9a and 9b. The speaker numbering is chosen so that the context-dependent recognition error grows (upper curves). The minimum recognition errors correspond to the lower curves; in this case, the sequence of speaker numbers is identical to that of the upper curves. It is obvious that, in context-dependent recognition, each speaker always has at least one resembling competitor. For men and women, the resemblance probabilities are approximately 15 and 25%, respectively.

Fig. 9. Minimum errors appearing when (a) male and (b) female recognition is implemented.

When speakers are recognized by estimating the vocal-tract length from the frequencies of the first three formants, the average errors of context-independent analysis are 73 and 78.5% for men and women, respectively. These values are worse than those obtained from the analysis based on the difference of the higher formants. At the same time, context-dependent analysis provides a considerable decrease in the recognition error: for men and women, the errors reach on average 57.1 and 64.5%, respectively, as is shown in Fig. 10. Moreover, in context-dependent analysis, the minimum errors of recognition of male and female speakers turned out to be close to zero. The average differences between the most probable vocal-tract lengths of men and women, calculated from all pronunciations of the numerals, were 1.86 and 1.98 cm, respectively. The minimum errors of male recognition are attained for the stressed vowels in the words nol’ and chetyre, where the average errors are 52.7 and 51.3%. By analogy, the minimum errors of female recognition correspond to the stressed vowels in the words sem’ and devyat’, where the average errors are 58.1 and 60.2%, respectively.

Fig. 10. Errors arising when (a) male and (b) female recognition is based on estimation according to the average formant. Upper curves designate the errors averaged over all pronunciations, and lower curves correspond to each of the vowels.

The vocal-tract lengths of male and female speakers, estimated from both the difference of the higher formants and the first three formants for the stressed vowels in different words, are compared in Tables 2 and 3, respectively. The least discrepancy between the estimates from Tables 2 and 3 is achieved in the context-dependent analysis of vowels |i| and |a| in the words odin and dva.
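The two length estimators compared in Tables 2 and 3 reduce to one-line formulas: Eq. (2) applied to ΔF56, and Eq. (1) with k = 2 applied to F̄2. A minimal sketch (Python chosen for illustration; c0 = 350 m/s as in the text) shows that both recover the length of an ideal homogeneous tube exactly, so the discrepancies seen in the tables stem from the vocal tract's departure from a uniform tube and from formant-measurement errors:

```python
# Sketch of the two vocal-tract length estimators discussed in the paper.
# Both are exact for a homogeneous hard-walled tube; c0 = 350 m/s as in the text.

C0 = 350.0  # sound speed, m/s

def length_from_high_formants(f5, f6):
    """Eq. (2): L = c0 / (2 * (F6 - F5)) -- spacing of adjacent higher resonances."""
    return C0 / (2.0 * (f6 - f5))

def length_from_mean_f2(f1, f2, f3):
    """Eq. (1) with k = 2: L = 3*c0 / (4*F2bar), where F2bar = (F1 + F2 + F3) / 3."""
    f2bar = (f1 + f2 + f3) / 3.0
    return 3.0 * C0 / (4.0 * f2bar)

# Resonances of a homogeneous tube of length 17 cm: F_k = (2k - 1) * c0 / (4L).
L_true = 0.17
F = [(2 * k - 1) * C0 / (4 * L_true) for k in range(1, 7)]  # F1..F6, Hz

L_df = length_from_high_formants(F[4], F[5])  # from F5 and F6
L_f2 = length_from_mean_f2(F[0], F[1], F[2])  # from F1..F3
```

For a real vowel the two estimates disagree, as the Difference column of Tables 2 and 3 shows, precisely because the tract is not a uniform tube and the measured formants deviate from the odd-harmonic pattern assumed by both formulas.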
Table 2. Most probable vocal-tract lengths for a male speaker

Word | L_ΔF = c0/(2ΔF56), cm | L_F = 3c0/(4F̄2), cm | Difference of estimates L_F − L_ΔF, cm
Nol’ | 16.5 | 19.5 | 3
Odin | 16 | 14.5 | –1.5
Dva | 15.5 | 18 | 2.5
Tri | 15.5 | 14 | –1.5
Chetyre | 16 | 12.5 | –3.5
Pyat’ | 16.5 | 15.5 | –1
Shest’ | 16.5 | 15.5 | –1
Sem’ | 17 | 14.5 | –2.5
Vosem’ | 16 | 20.5 | 4.5
Devyat’ | 16.5 | 13 | –3.5
Average value | 16.2 | 14.4 | 2.2
Table 3. Most probable vocal-tract lengths for a female speaker

Word | L_ΔF = c0/(2ΔF56), cm | L_F = 3c0/(4F̄2), cm | Difference of estimates L_F − L_ΔF, cm
Nol’ | 15.5 | 19 | 3.5
Odin | 15 | 14 | –1
Dva | 15.5 | 16.5 | –1
Tri | 15.5 | 13 | –2.5
Chetyre | 15.5 | 13.5 | –2
Pyat’ | 15.5 | 14 | –1.5
Shest’ | 15.5 | 14 | –1.5
Sem’ | 15.5 | 12.5 | –3
Vosem’ | 15.5 | 18.5 | 3
Devyat’ | 15.5 | 12 | –3.5
Average value | 15.4 | 14.7 | 2.2

6. CONCLUSIONS

When the vocal-tract length is estimated from both the averaged frequencies of the first three formants and the difference of the higher formants, substantial variations must appear, because different vowels correspond to different larynx heights, mouth openings, and vocal-tract shapes. This feature has clearly manifested itself in the experiments using the context-dependent approach, whereby the probability of correct recognition of a speaker is 25–35% if estimation is based on the higher formants and about 10% if the analysis is performed according to the first three formants.

Soft vowels, during the articulation of which the larynx rises, are dominant in the database used in the experiments described above. Hence, it is not surprising that the tract length is shifted to smaller values and that both the speaker gender and the voices belonging to each gender are poorly identified. A more phonetically balanced database will scarcely improve the probability of correct gender recognition appreciably. In our experiments, its value is about 13%. Even when context-dependent recognition is performed using the known vowel type, the probability of gender recognition is less than 17%, i.e., much smaller than the probabilities known from the literature. For example, in [17], the probability that the gender can be correctly recognized from a single vowel is about 90%. However, such a weak feature can be useful in combination with other parameters.

Vocal-tract length estimation is more promising in speaker-recognition problems, especially in context-independent analysis. In this case, when the tract length is estimated from the difference between the higher formant frequencies, the probabilities of correct recognition of male and female voices are 31 and 25%, respectively. If there are no higher-order resonances in the speech-signal spectrum, this parameter must be excluded. Moreover, if the lower formant frequencies are determined with an error leading to a substantial deviation from the tract
lengths of the given speaker obtained during training, this parameter must also be eliminated.

In context-dependent recognition, where the vowel type is known in advance and the vocal-tract length is estimated from the first three formants, the probabilities of correct recognition of male and female voices reach about 43 and 35.5%, respectively. Under these conditions, the vocal-tract length becomes a sufficiently strong property suitable for combined use with other properties, such as voice-source model parameters.

REFERENCES

1. B. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” J. Acoust. Soc. Am. 55, 1304–1312 (1974).
2. J. Bachorowski and M. J. Owren, “Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech,” J. Acoust. Soc. Am. 106, 1054–1063 (1999).
3. S. Dusan, “Estimation of speaker’s height and vocal tract length from speech signal,” in Proc. of EUROSPEECH’2005 (9th Eur. Conf. on Speech Communication and Technology), Lisbon, Portugal, 2005 (Bonn Univ., Bonn, 2005), pp. 1989–1992.
4. S. Dusan and Li Deng, “Vocal tract length normalization for acoustic-to-articulatory mapping using neural nets,” J. Acoust. Soc. Am. 106, 2181 (1999).
5. E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’96), Atlanta, Georgia, May 7–10, 1996 (IEEE, New York, 1996), pp. 346–348.
6. J. L. Flanagan, Speech Analysis, Synthesis, and Perception (Academic, New York, 1965; Svyaz’, Moscow, 1968).
7. W. T. Fitch and J. Giedd, “Morphology and development of the human vocal tract: A study using magnetic resonance imaging,” J. Acoust. Soc. Am. 106, 1511–1522 (1999).
8. L. Lee and R. Rose, “A frequency warping approach to speaker normalization,” IEEE Trans. Acoust., Speech, Signal Process. 6, 49–60 (1998).
9. Y. R. Leng, H. D. Tran, N. Kitaoka, and H. Li, “Selective gammatone filterbank feature for robust sound event recognition,” in Proc. Interspeech 2010, pp. 2246–2249.
10. A. S. Leonov and V. N. Sorokin, “Unique determination of vocal tract resonance frequencies from a speech signal,” Dokl. Math. 84, 740 (2011).
11. M. Naito, Li Deng, and Y. Sagisaka, “Speaker clustering for speech recognition using vocal tract parameters,” Speech Commun. 36, 305–315 (2002).
12. R. D. Patterson and J. Holdsworth, “A functional model of neural activity patterns and auditory images,” Adv. Speech, Hearing and Language Process. 3, 547–563 (1996).
13. M. Pitz and H. Ney, “Vocal tract normalization equals linear transformation in cepstral space,” IEEE Trans. Speech Audio Process. 13, 930–944 (2005).
14. V. N. Sorokin, Theory of Speech Production (Radio i Svyaz’, Moscow, 1985) [in Russian].
15. V. N. Sorokin, Speech Synthesis (Nauka, Moscow, 1992) [in Russian].
16. V. N. Sorokin, Speech Processes (Narodnoe Obrazovanie, Moscow, 2012) [in Russian].
17. V. N. Sorokin and I. S. Makarov, “Gender recognition from vocal source,” Acoust. Phys. 54, 571 (2008).
18. R. E. Turner, M. A. Al-Hames, D. R. R. Smith, H. Kawahara, T. Irino, and R. D. Patterson, “Vowel normalisation: Time-domain processing of the internal dynamics of speech,” in Dynamics of Speech Production and Perception, Ed. by P. Divenyi (IOS Press, Amsterdam, 2005), pp. 153–170.
19. P. Zhan and M. Westphal, “Speaker normalization based on frequency warping,” in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’97), Munich, Germany, Apr. 21–24, 1997 (IEEE, New York, 1997), pp. 1039–1042.
Translated by S. Rodikov