DOI 10.1007/s11018-017-1135-1 Measurement Techniques, Vol. 59, No. 12, March, 2017
AN ALGORITHM FOR MEASUREMENT OF THE PITCH FREQUENCY OF SPEECH SIGNALS BASED ON COMPLEMENTARY ENSEMBLE DECOMPOSITION INTO EMPIRICAL MODES
A. K. Alimuradov
UDC 004.934
The problem of increasing the precision with which the pitch frequency of speech signals is measured is considered. Existing algorithms for determining this frequency are presented and a new algorithm based on complementary ensemble empirical mode decomposition is developed. The results of the investigations confirm the robustness of the algorithm in the presence of frequency modulation of the pitch of speech signals. Keywords: processing of speech signals, complementary ensemble empirical mode decomposition, pitch frequency.
A speech signal is a nonstationary acoustic signal of complex form whose amplitude and frequency characteristics vary rapidly over time. Such signals consist of vocalized and noise segments, formed respectively by periodic and aperiodic vibrations of the vocal cords. The frequency of the vibrations of the vocal cords is an important information-bearing parameter of speech signals and is referred to as the pitch frequency [1].
A large number of methods of determining the pitch frequency are known today. They may be classified as algorithms that perform measurements in the time, frequency, and time-frequency domains [2–7]. Algorithms based on the autocorrelation function and its modifications (YIN) [2], the Robust Algorithm for Pitch Tracking (RAPT) [5], and the Sawtooth Waveform Inspired Pitch Estimator (SWIPE) [7] are widely used in practice. Their widespread use is due to their functionality, low percentage of gross errors, and the availability of public-access software implementations. However, the capabilities of these algorithms are limited by the low frequency and time resolution of their processing methods, which are not adapted to nonstationary speech signals. Therefore, an adaptive method of processing nonstationary signals, complementary ensemble empirical mode decomposition (CEEMD) [8], is used here for measurements of the pitch frequency.
The objective of the present article is to develop an algorithm for measuring the pitch frequency with the use of the adaptive CEEMD method. The article is a further development of previously published studies devoted to the application of complementary decomposition in problems of processing speech signals [9, 10].
Methods of empirical mode decomposition. Decomposition is an adaptive methodology for expanding a signal into intrinsic functions, called empirical modes [11].
The special feature of the decomposition is that the basis functions used for it are extracted directly from the initial signal. A model of the signal is not specified in advance; the empirical modes are calculated in the course of the decomposition procedure from local features of the signal, such as extrema and zeros, as well as from the internal structure of each particular signal. Thus, empirical modes do not possess a rigorous analytic description, but must satisfy conditions that guarantee a well-defined symmetry and narrow-bandedness of
Penza State University, Penza, Russia; e-mail: [email protected]. Translated from Izmeritel’naya Tekhnika, No. 12, pp. 53–57, December, 2016. Original article submitted February 9, 2016.
0543-1972/17/5912-1316 ©2017 Springer Science+Business Media New York
the basis functions [11]. That is, the total number of extrema must correspond to the total number of zeros to within several units, and the mean value of the upper and lower envelopes, which interpolate the local maxima and local minima, respectively, must be approximately zero. An analytic description of a decomposition into empirical modes has the following form:
$$x(t) = \sum_{i=1}^{I} \mathrm{IMF}_i(t) + r_I(t),$$
where $x(t)$ is the initial signal; $i$, the ordinal number of an empirical mode; $I$, the number of empirical modes; $\mathrm{IMF}_i(t)$, the extracted empirical modes; and $r_I(t)$, the resultant residue. A drawback of such a decomposition that limits its practical application is the mixing of empirical modes, i.e., the presence within a single mode of signal segments that are incommensurable in amplitude and frequency scale. To solve this problem, a method of ensemble empirical mode decomposition was proposed, based on repeated addition of low-amplitude white noise to the initial signal and calculation of the mean of the resulting modes as the final result [12]. The analyzed signal is the sum of the signal and noise:
$$y_j(t) = x(t) + w_j(t);$$
$$\mathrm{IMF}_i(t) = \frac{1}{J}\sum_{j=1}^{J} \mathrm{IMF}_{ji}(t);$$
$$r_I(t) = \frac{1}{J}\sum_{j=1}^{J} r_{jI}(t),$$
where $w_j(t)$ is white noise; $\mathrm{IMF}_{ji}(t)$ and $r_{jI}(t)$ are, respectively, the empirical modes and the resultant residue obtained in the jth decomposition; and $j = 1, 2, ..., J$ is the ordinal number of the decomposition cycle (addition of white noise to the signal). This approach exploits the statistical characteristics of white noise to detect weak periodic or quasiperiodic components of signals. However, the use of white noise in ensemble empirical mode decomposition leaves a residual noise in the signal, and this residue affects the reconstruction of the initial signal [8]. A method based on the addition of white noise with direct and inverse values of the amplitude, i.e., complementary ensemble empirical mode decomposition,
$$\begin{bmatrix} y_j(t) \\ y_j^{*}(t) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \times \begin{bmatrix} x(t) \\ w_j(t) \end{bmatrix},$$
has been proposed as a way of eliminating this drawback [8]; here $y_j^{*}(t)$ is the speech signal to which the white noise has been added with inverted sign. Decomposing a signal by means of CEEMD yields a set of modes that are free of the above drawbacks of empirical mode decomposition and of ensemble empirical mode decomposition. Averaging the modes obtained from each pair of direct and inverse noise values removes the residual white noise completely, regardless of how many noise realizations are used. The amplitude of the added white noise and the number of decomposition cycles are also important parameters in the implementation of ensemble and complementary ensemble empirical mode decomposition.
Algorithm for measuring the pitch frequency of speech signals on the basis of complementary ensemble empirical mode decomposition.
A survey of algorithms for determining the pitch frequency of speech signals that utilize tools for the analysis of nonstationary data is presented in [13–15]. These studies utilize methods of classical and ensemble
Fig. 1. Block diagram of algorithm for measurement of pitch frequency based on complementary ensemble decomposition into empirical modes (EM).
decomposition. Under conditions characterized by pronounced mode mixing and residual noise, the precision with which the pitch frequency is determined depends on correctly identifying the empirical mode containing the pitch. An algorithm for measuring the pitch frequency based on complementary decomposition (Fig. 1) [16, 19] was developed as a result of these investigations. The algorithm basically involves an adaptive decomposition of the speech signal into empirical modes by means of complementary ensemble empirical mode decomposition (CEEMD). From the set of empirical modes thus obtained, the mode containing the pitch is determined from an analysis of the energy logarithm. Next, the frequency is measured by applying the Teager Energy Operator (TEO). Let us consider the basic stages of the algorithm in more detail.
The speech signal $x(n)$ is recorded with the following parameters: record length, not more than 10000 msec; digitization (sampling) frequency, 8000 Hz; quantization word length, 16 bits. Here, $0 < n \le N$, where $n$ is the ordinal number of the discrete time reading and $N$ is the number of readings in the signal.
Complementary ensemble decomposition. Decomposition of the speech signal produces a set of empirical modes $\mathrm{IMF}_i(n)$ and a resultant residue $r_I(n)$.
Determination of energy characteristics of empirical mode. The amplitude distribution of the empirical modes in the time domain may be described by means of the time function of the short-term energy. In the new algorithm, the logarithm of the energy is taken in order to compress the amplitude of the signal over a broad dynamic range, which brings the operation of the algorithm as close as possible to that of the human speech organ:
where $LE_i$ is the energy logarithm of the ith empirical mode.
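As a rough illustration of an energy logarithm, the sketch below computes one value per mode. The exact expression for $LE_i$ is not reproduced in this text, so the form used here (base-10 logarithm of the sum of squared readings) is an assumption, as are the function name `log_energy`, the guard `eps`, and the three toy "modes", which merely stand in for the output of a decomposition.

```python
import math

def log_energy(mode, eps=1e-12):
    """Base-10 logarithm of the total energy (sum of squared readings) of one mode.

    Assumed form of the energy logarithm; eps is a hypothetical guard that
    keeps log10 finite for an all-zero mode.
    """
    energy = sum(v * v for v in mode)
    return math.log10(energy + eps)

# Toy stand-ins for empirical modes at 8000 Hz: a strong oscillation,
# a weaker one, and a near-silent residue (800 readings each).
N = 800
modes = [
    [math.sin(2 * math.pi * 200 * n / 8000) for n in range(N)],
    [0.1 * math.sin(2 * math.pi * 50 * n / 8000) for n in range(N)],
    [0.001 for _ in range(N)],
]
LE = [log_energy(m) for m in modes]
```

Taking the logarithm turns the hundredfold energy ratio between the first two toy modes into a difference of 2, which is the kind of compression of dynamic range the step is meant to provide.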
Fig. 2. Example illustrating determination of empirical mode containing the pitch: a) energy logarithms of empirical mode; b) spectral distribution of fifth empirical mode (1) and initial speech signal (2).
Determination of empirical mode with pitch. This step rests on the assumption that vocalized modes possess greater energy than unvocalized modes. It consists in successively calculating the modulus of the difference between the energy logarithms of the current and the succeeding empirical mode:
$$d_{i,i+1} = |LE_i - LE_{i+1}|,$$
where $d_{i,i+1}$ is the modulus of the difference between the energy logarithms of the current and the succeeding empirical mode. Using threshold processing, we determine the greatest value of $d_{i,i+1}$ in the resulting sequence. This value corresponds to a sharp drop in energy between the vocalized empirical mode containing the pitch and an unvocalized mode [14]. A graphical interpretation of the process of determining the mode containing the pitch is presented in Fig. 2. In accordance with this rule, it follows from Fig. 2a that the fifth empirical mode contains the pitch. An analysis of the spectral distribution of this mode and of the initial speech signal confirms that the mode containing the pitch has been correctly determined: the single harmonic component of the pitch frequency corresponds to the first component of the formant frequencies of the initial speech signal (Fig. 2b).
Segmentation of empirical mode with pitch. This step is a linear division of the mode into segments, or fragments. The new algorithm is based on the assumption that the properties of the speech signal vary slowly over time. This assumption leads to a short-term analysis in which fragments of the empirical mode are separated and processed as if they were short segments of speech signals with distinctive properties. Segmentation of the empirical mode into fragments is performed on the basis of the following formulas:
$$S = N/L; \qquad \mathrm{IMF}_{i,s+1}(n) = \mathrm{IMF}_i[(s \cdot L) + 1;\ (s + 1) \cdot L],$$
where $S$ is the number of fragments in the empirical mode; $N$, the number of discrete readings in the empirical mode; $L$, the number of discrete readings in a single fragment; $\mathrm{IMF}_{i,s+1}(n)$, a fragment of the ith empirical mode; and $s = 0, 1, 2, ..., S - 1$, the ordinal number of a fragment.
Determination of pitch frequency.
This step uses a function that measures the instantaneous energy of a signal, known as the Teager operator, which is simple, efficient, and sensitive to sudden variations in the signal:
Fig. 3. Curve of Teager operator function (1) and oscillogram of fifth empirical mode containing the pitch (2).
$$\mathrm{TEO}(n) = \mathrm{IMF}_{i,PF}^{2}(n) - \mathrm{IMF}_{i,PF}(n-1)\,\mathrm{IMF}_{i,PF}(n+1),$$
where TEO is the Teager operator function and $\mathrm{IMF}_{i,PF}(n)$ is the nth discrete reading of the empirical mode containing the pitch. Adjacent maxima of the Teager operator function (Fig. 3) are used to measure the pitch frequency:
$$f_{PF} = \frac{f_d}{T_{\max,m+1} - T_{\max,m}},$$
where $f_{PF}$ is the pitch frequency; $T_{\max,m+1}$ and $T_{\max,m}$, the positions (ordinal numbers of readings) of adjacent maxima of the Teager operator function; $m$, the ordinal number of a maximum of the Teager operator function; and $f_d$, the digitization frequency.
Investigation of algorithm for measurement of the pitch frequency of speech signals based on complementary ensemble empirical mode decomposition. An experimental investigation of the new algorithm was carried out in the MATLAB 7.0.1 software package in order to estimate the precision of measurements of the pitch frequency. The experimental technique comprises three stages: generation of an input multi-harmonic signal; measurement of the pitch frequency; and comparison of the result with the input signal. The parameters varied in the investigation are the pitch frequency ƒ of the input multi-harmonic signal, the rate v of variation of the pitch frequency, and the parameter of complementary ensemble empirical mode decomposition, i.e., the amplitude A of the added white noise. The number of decomposition cycles J was fixed at 100. The error coefficients GPE and MFPE (Gross Pitch Error and Mean Fine Pitch Error) [17] were used as the estimation criteria. The GPE coefficient is a dimensionless quantity equal to the ratio, in percent, of the number of fragments $S_{GPE}$ in which the measured pitch frequency deviates from the true value by more than 20% to the total number of fragments $S_{PF}$ containing the pitch:
$$GPE = (S_{GPE}/S_{PF}) \cdot 100.$$
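The measurement step described earlier in this section (Teager operator, then the spacing of adjacent maxima) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the article: the relative threshold used to select maxima (`rel_threshold`) is an assumption, and `toy_pitch_mode` is a hypothetical stand-in for the pitch-bearing empirical mode (a damped oscillation that restarts once per pitch period). A pure sinusoid would not do as a test input, since its Teager energy is constant and has no maxima.

```python
import math

def teager(x):
    """Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The two end readings have no neighbour on one side and are left at 0.
    """
    psi = [0.0] * len(x)
    for n in range(1, len(x) - 1):
        psi[n] = x[n] * x[n] - x[n - 1] * x[n + 1]
    return psi

def teo_maxima(psi, rel_threshold=0.5):
    """Positions of local maxima of psi that exceed rel_threshold * max(psi).

    rel_threshold is an assumed selection rule, not taken from the article.
    """
    floor = rel_threshold * max(psi)
    return [n for n in range(1, len(psi) - 1)
            if psi[n] > floor and psi[n] > psi[n - 1] and psi[n] >= psi[n + 1]]

def pitch_from_maxima(maxima, fd):
    """One pitch value per pair of adjacent maxima: fd / (T[m+1] - T[m])."""
    return [fd / (b - a) for a, b in zip(maxima, maxima[1:])]

def toy_pitch_mode(fd=8000, f0=200, fr=1000, periods=10, decay=0.1):
    """Hypothetical pitch-bearing mode: a damped oscillation at fr Hz that
    restarts once per pitch period (fd // f0 readings)."""
    period = fd // f0
    return [math.exp(-decay * (n % period))
            * math.sin(2 * math.pi * fr * (n % period) / fd)
            for n in range(period * periods)]

fd = 8000                                 # digitization frequency, Hz
mode = toy_pitch_mode(fd=fd)              # 10 pitch periods of 40 readings
maxima = teo_maxima(teager(mode))         # one maximum per pitch period
pitch = pitch_from_maxima(maxima, fd)     # instantaneous pitch values, Hz
```

For this toy input the maxima fall one per pitch period, 40 readings apart, so every entry of `pitch` is 8000/40 = 200 Hz. Real modes are noisier, and the threshold and any minimum spacing between maxima would need tuning.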
The mean coefficient MFPE is also a dimensionless quantity, equal to the mean, over the fragments containing the pitch that are free of GPE errors, of the ratio of the difference between the true value $f_{PF\,t}$ and the measured value $f_{PF}$ of the pitch frequency to the true value:
where $S_{FPE}$ is the number of fragments containing the pitch without GPE errors, and $s$ is the ordinal number of a fragment containing the pitch. A test sample of 100 synthesized multi-harmonic signals, each the sum of several harmonic components of a pitch with previously known frequency, was generated for the investigation. In accordance with the technique adopted for the investigations, in the first stage each test signal is subjected to frequency modulation in which the pitch frequency is varied in the
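TABLE 1. GPE and MFPE as Functions of the Amplitude of Added White Noise for Different Rates v

| v, Hz/msec | Estimation criteria | A = 0.10 mV | A = 0.25 mV | A = 0.50 mV | A = 1.00 mV | A = 2.00 mV | A = 3.00 mV |
|---|---|---|---|---|---|---|---|
| 0 | GPE | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | MFPE | 1.68 | 1.20 | 1.31 | 1.37 | 2.67 | 3.95 |
| 0.5 | GPE | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.5 | MFPE | 4.61 | 4.50 | 4.61 | 5.27 | 5.86 | 6.21 |
| 1.0 | GPE | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.0 | MFPE | 5.37 | 5.23 | 5.31 | 5.87 | 6.31 | 6.78 |
| 1.5 | GPE | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.5 | MFPE | 6.41 | 6.23 | 6.39 | 6.92 | 7.24 | 7.85 |
| 2.0 | GPE | 0 | 0 | 0 | 0 | 0 | 5.4 |
| 2.0 | MFPE | 8.94 | 8.75 | 8.83 | 9.21 | 9.84 | 10.26 |
| 2.5 | GPE | 0 | 0 | 0 | 0 | 7.90 | 8.60 |
| 2.5 | MFPE | 10.35 | 10.14 | 10.24 | 10.90 | 11.25 | 11.60 |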
TABLE 2. Comparative Analysis of Results of Measurement of the Pitch Frequency

| v, Hz/msec | Estimation criteria | YIN | RAPT | SWIPE | New algorithm |
|---|---|---|---|---|---|
| 0 | GPE | 0 | 0 | 0 | 0 |
| 0 | MFPE | 3.10 | 2.56 | 2.35 | 1.10 |
| 0.5 | GPE | 0 | 0 | 0 | 0 |
| 0.5 | MFPE | 6.56 | 6.65 | 6.34 | 3.21 |
| 1.0 | GPE | 0 | 0 | 0 | 0 |
| 1.0 | MFPE | 7.30 | 7.25 | 6.81 | 6.54 |
| 1.5 | GPE | 0 | 0 | 0 | 0 |
| 1.5 | MFPE | 8.21 | 8.50 | 7.90 | 8.00 |
| 2.0 | GPE | 2.05 | 0 | 0 | 0 |
| 2.0 | MFPE | 10.20 | 10.10 | 9.67 | 9.54 |
| 2.5 | GPE | 7.60 | 5.30 | 6.20 | 6.50 |
| 2.5 | MFPE | 14.20 | 13.10 | 13.50 | 13.20 |
range 0–2.5 Hz/msec with a step of 0.5 Hz/msec. In the second stage, the amplitude of the added white noise in complementary ensemble empirical mode decomposition is varied in order to determine the optimal value, at which the most precise measurements of the pitch frequency are attained. The results of the investigation of the new algorithm, i.e., the error coefficient GPE and the mean error coefficient MFPE as functions of the amplitude of the added white noise for different rates v of variation of the pitch frequency, are presented in Table 1.
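The two estimation criteria reported in Table 1 can be computed from per-fragment true and measured pitch values roughly as follows. The GPE rule follows the definition in the text (deviation greater than 20% of the true value). For MFPE, the use of the absolute deviation and the scaling to percent are assumptions, since the MFPE expression is not reproduced in this text; the function name `gpe_mfpe` and the sample values are hypothetical.

```python
def gpe_mfpe(true_f, measured_f, gross=0.20):
    """Return (GPE, MFPE), both in percent, over fragments containing the pitch.

    GPE:  share of fragments whose relative deviation exceeds `gross`.
    MFPE: mean relative deviation over the remaining fragments
          (absolute value and percent scaling are assumptions).
    """
    rel = [abs(m - t) / t for t, m in zip(true_f, measured_f)]
    fine = [r for r in rel if r <= gross]          # fragments without gross errors
    gpe = 100.0 * (len(rel) - len(fine)) / len(rel)
    mfpe = 100.0 * sum(fine) / len(fine) if fine else float("nan")
    return gpe, mfpe

# Hypothetical example: four fragments with a true pitch of 100 Hz.
true_f = [100.0, 100.0, 100.0, 100.0]
measured_f = [100.0, 110.0, 90.0, 150.0]
gpe, mfpe = gpe_mfpe(true_f, measured_f)
```

In this example one fragment of four deviates by 50% and counts as a gross error, so GPE is 25%; the other three deviate by 0, 10, and 10%, and their mean gives MFPE.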
From Table 1, it follows that the most precise measurements of the pitch frequency (the column A = 0.25 mV) are achieved with an amplitude A = 0.25 mV. A study of the new algorithm using actual speech signals was carried out with the optimal parameters of complementary ensemble empirical mode decomposition (A = 0.25 mV, J = 100). Initial data: 100 speech signals (the five tonal sounds “a,” “o,” “u,” “e,” and “i,” 20 records each) pronounced by 20 speakers. The results were estimated and compared with those of the previously considered algorithms for measuring the pitch frequency (YIN, RAPT, and SWIPE; Table 2). The results show that at low rates of frequency modulation the new algorithm significantly surpasses the existing algorithms: at v = 0 and 0.5 Hz/msec its MFPE is 1.10 and 3.21, against 2.35–3.10 and 6.34–6.65 for the other algorithms. However, at high rates of variation of the frequency, all the algorithms exhibit roughly comparable values of GPE and MFPE. This indicates that the new algorithm may be recommended for use in problems involving the measurement of pitch frequency.
Conclusion. The problem of increasing the precision of measurements of the pitch frequency of speech signals has been considered in the article. Widely used algorithms for measuring the pitch frequency (YIN, RAPT, and SWIPE) were considered, and results obtained with these algorithms were presented. The need to increase the measurement precision through the use of adaptive processing methods was established. An algorithm for determining the pitch frequency on the basis of an adaptive method of processing nonstationary signals, complementary ensemble empirical mode decomposition, was developed. The basic advantage of the method is its robustness in the presence of frequency modulation of the pitch. A block diagram of the algorithm and a detailed mathematical description of the basic stages of its operation were presented.
An investigation of the algorithm was carried out in order to estimate the precision of measurements of the pitch frequency. Based on the results obtained, it is concluded that the algorithm surpasses existing analogs in the case of low frequency modulation.
REFERENCES
1. V. G. Mikhailov and L. V. Zlatousova, Measurement of the Parameters of Speech, Radio i Svyaz, Moscow (1987).
2. A. Camacho and J. G. Harris, “A sawtooth waveform inspired pitch estimator for speech and music,” J. Acoust. Soc. Amer., 123, No. 4, No. 9, 1638–1652 (2008).
3. E. G. Zhilyakov and A. A. Firsova, “Estimation of the period of the pitch of the sounds of human speech,” Nauch. Vedom. Belgor. Gos. Univ., Ser. Istor. Politol. Ekon. Informat., No. 1 (144), Iss. 25/1, 173–181 (2013).
4. E. Azarov, M. Vashkevich, and A. Petrovsky, “Instantaneous pitch estimation based on RAPT framework,” in: Proc. 20th Europ. Signal Proc. Conf., EUSIPCO (2012), pp. 2787–2791.
5. T. Abe, T. Kobayashi, and S. Imai, “Robust pitch estimation with harmonics enhancement in noisy environment based on instantaneous frequency,” in: Proc. ICSLP96 (1996), Vol. 2, pp. 1277–1290.
6. M. K. Hasan, C. Shahnaz, and S. A. Fattah, “Determination of pitch of noisy speech using dominant harmonic frequency,” in: Proc. IEEE Int. Symp. Circuits and Systems (2003), Vol. 2, pp. 556–559.
7. A. Chevigne and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Amer., 111, No. 4, 1917–1930 (2002).
8. J. R. Yeh, J. S. Shieh, and N. E. Huang, “Complementary ensemble empirical mode decomposition: A novel noise enhanced data analysis method,” Adv. Adapt. Data Analysis, 2, No. 2, 135–156 (2010).
9. A. K. Alimuradov and F. Sh. Murtazov, “Methods of increasing the efficiency of recognition of speech signals in voice control systems,” Izmer. Tekhn., No. 10, 20–24 (2015).
10. A. K. Alimuradov and P. P. Churakov, “Noise-robust speech signals processing for the voice control system based on the complementary ensemble empirical mode decomposition,” in: Int. Sib. Conf. Control and Communications, SIBCON 2015, Omsk, Russia, May 21–23, 2015.
11. N. E. Huang, S. Zheng, and R. L. Steven, “The empirical mode decomposition and the Hilbert spectrum for nonlinear and nonstationary time series analysis,” Proc. Roy. Soc. London. A, 454, 903–995 (1998).
12. W. Zhaohua and N. E. Huang, “Ensemble empirical mode decomposition: A noise-assisted data analysis method,” Adv. Adapt. Data Analysis, 1, No. 1, 1–41 (2009).
13. Sh. Bhawna and K. Sukhvinder, “Distinction between EMD & EEMD algorithm for pitch detection in speech processing,” Int. J. Eng. Trends and Technol., 7, No. 3, 119–125 (2014).
14. G. Schlotthauer, M. E. Torres, and H. L. Rufiner, “A new algorithm for instantaneous f0 speech extraction based on ensemble empirical mode decomposition,” in: 17th Europ. Signal Proc. Conf., EUSIPCO 2009, Glasgow, Scotland (2009), pp. 2347–2351.
15. G. Priyanka and P. Mahendra Kumar, “Determine the pitch markers in speech signal using ensemble empirical mode decomposition,” Int. J. Adv. Res. Comp. Sci. and Software Eng., 2, No. 7, 90–96 (2012).
16. A. K. Alimuradov, “Investigation of frequency-selective properties of methods of decomposition into empirical modes for the purpose of estimating the pitch frequency of speech signals,” Tr. MFTI, 7, No. 3, 56–68 (2015).
17. X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Guide to Algorithms and System Development, Prentice-Hall, Upper Saddle River (2001).