JOURNAL OF SHANGHAI UNIVERSITY   Vol. 3, No. 1, Mar. 1999
Application of Cochlear Model in Speech Analysis/Synthesis Using Sinusoidal Representation

Yuan Jingxian   Wan Wanggen   Yu Xiaoqing
(School of Communication & Information Engineering, Shanghai University)

Abstract   A sinusoidal representation of speech and a cochlear model are used to extract speech parameters in this paper, and a speech analysis/synthesis system controlled by the auditory spectrum is developed with the model. Computer simulation shows that speech can be synthesized with only 12 parameters per frame on average. The method has the advantages of few parameters, low complexity and high performance of speech representation. The synthetic speech has high intelligibility.

Key words   speech analysis/synthesis, sinusoidal representation, cochlear model, auditory spectrum

1
Introduction
With the development of information technology, speech transmission and speech storage are becoming increasingly important. Given the present state of digital speech processing, the development of speech coding methods with high coding efficiency and low distortion is one of the major research topics. One approach to speech coding is to use a speech production model. In order to simplify the processing and implementation, the model should be simple, and the extracted parameters should represent the most basic characteristics of speech. Traditionally, speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear system. In many speech applications it is sufficient to assume that the glottal excitation takes only one of two possible states, i.e., a periodic pulse train or white noise, corresponding to voiced or unvoiced speech respectively. Many such approaches exist, e.g. multipulse excited linear prediction (MPLP) [1], code excited linear prediction (CELP) [2], etc. Instead of using these traditional methods, a sinusoidal representation of speech is used to analyse speech in this paper, so the procedure of speech analysis/synthesis is simplified and the complexity of the algorithm is reduced.
To further increase the coding efficiency and decrease redundancy, the cochlear model is used together with the sinusoidal representation. The purpose of processing speech with a cochlear model is to exploit characteristics of the auditory system such as its lack of restriction on the auditory object, its high sensitivity, and the very low bit-rate of the information stream from the auditory system to the brain. As this paper is mainly concerned with low bit-rate speech coding and speech analysis/synthesis, attention is paid to the last of these characteristics, and the idea of an auditory spectrum is proposed: the auditory spectrum is regarded as the main auditory information transferred from the auditory system to the brain. In this paper, a speech analysis/synthesis system based on a sinusoidal representation and an auditory model is developed, and satisfactory results are presented. Because of the low complexity of the speech analysis/synthesis algorithm, real-time implementation of the system can be achieved with a single-chip DSP.
2
Speech Analysis/Synthesis Model Based on Sinusoidal Representation
The goal of speech analysis/synthesis is to extract parameters which represent the speech most effectively, and to rebuild a waveform as close to the original speech as possible. According to McAulay's theory [3,4], a frame of speech can be expressed by a sine series with a finite number of terms. For any frame of speech we can detect the peaks in the frequency spectrum, extract the frequencies, amplitudes and phases at all peaks, and use these as the parameters transmitted over the channel. At the synthesis end, overlapping and adding is performed to form the synthetic speech.

The frequency content of speech is mainly distributed in the region of 200 Hz - 3400 Hz, so a low-pass filter with a cut-off frequency of 4 kHz is applied to the analog speech input. Digital speech is obtained by sampling the input at 8 kHz. The digital speech is then pre-emphasised, in order to retain more speech information in the high-frequency region and make the synthetic speech more natural. The acquired speech is analysed frame by frame, with a frame length of 30 ms and an overlap of 10 ms between adjacent frames. A Hamming window is applied to each frame, the frequency spectrum of each frame is obtained by a 256-point FFT, and finally the amplitude, frequency and phase of each spectral peak are extracted. At the synthesis end, each sine wave component is obtained by feeding the frequency and phase to a sine wave generator and multiplying by the corresponding amplitude. A frame of speech is synthesized by summing all sine wave components; the frame is then weighted by the Hamming window and appended to the previous frame, yielding the synthetic speech.

With this analysis/synthesis approach it is not necessary to make the voiced/unvoiced decision or to perform pitch detection, and the approach can deal not only with speech but also with many other types of audio signal, such as music [5,6]. Its drawback is that too many parameters must be transmitted, because there are many peaks in the frequency spectrum of a frame of speech; it is not useful for low bit-rate speech coding if all frequency peaks are transmitted without selection.

(Received May 5, 1998. Project supported by the National Natural Science Foundation of China (69501007). Yuan Jingxian, postgraduate student, School of Communication and Information Engineering, Shanghai University, 149 Yanchang Road, Shanghai 200072.)
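The frame-level analysis and synthesis described above can be sketched as follows. This is a minimal NumPy illustration under the stated settings (8 kHz sampling, 30 ms Hamming-windowed frames, 256-point FFT), not the authors' implementation; the simple three-point peak picker and the amplitude normalisation are assumptions of the sketch.

```python
import numpy as np

FS = 8000          # sampling rate (Hz), as stated in the text
FRAME = 240        # 30 ms frame length
NFFT = 256         # FFT size used for the spectrum

def analyze_frame(frame):
    """Return (freqs, amps, phases) of the spectral peaks of one frame."""
    win = np.hamming(FRAME)
    spec = np.fft.rfft(frame * win, NFFT)
    mag = np.abs(spec)
    # a bin is counted as a peak when it exceeds both neighbours
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    freqs = np.array([k * FS / NFFT for k in peaks])
    amps = mag[peaks] * 2.0 / win.sum()      # rough amplitude normalisation
    phases = np.angle(spec)[peaks]
    return freqs, amps, phases

def synthesize_frame(freqs, amps, phases):
    """Rebuild one frame by summing windowed sinusoids."""
    n = np.arange(FRAME)
    frame = np.zeros(FRAME)
    for fk, ak, pk in zip(freqs, amps, phases):
        frame += ak * np.cos(2 * np.pi * fk * n / FS + pk)
    return frame * np.hamming(FRAME)

# a 500 Hz test tone should be recovered among the detected peaks
t = np.arange(FRAME) / FS
x = np.cos(2 * np.pi * 500 * t)
f, a, p = analyze_frame(x)
y = synthesize_frame(f, a, p)
```

The windowed sidelobes also produce small local maxima, which is exactly why the unpruned sinusoidal representation transmits so many parameters.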
3
Cochlear Model and Extraction of Auditory Spectrum
Based on the above discussion, it is necessary to extract the frequencies, amplitudes and phases of all spectral peaks. According to auditory physiology, much redundancy exists in these parameters; that is, when some less important peaks are removed, the synthesized speech still preserves quite good intelligibility. According to auditory theory, sound is mainly analysed by the cochlea, whose characteristics can be represented by the following transfer function [7]:

H_k(z) = A_k a_{0k} (1 - z^{-2}) / (1 + b_{1k} z^{-1} + b_{2k} z^{-2}),    (1)

where

A_k = 2 T F_k / (a_k a_{0k}),   a_{0k} = (1 - b_{1k}) / 2,   b_{1k} = b_k / a_k,   b_{2k} = c_k / a_k,

F_k = 2\rho [ x_k - L + (8L/\pi^2) \sum_i [a_i (2i-1)^2]^{-1} \cos((2i-1)\pi x_k / (2L)) ],

a_k = k_k T^2 + 2 r_k T + 4 m_k,   b_k = 2 k_k T^2 - 8 m_k,   c_k = k_k T^2 - 2 r_k T + 4 m_k.
The corresponding second-order difference cochlear model is

z_k(n) + b_{1k} z_k(n-1) + b_{2k} z_k(n-2) = A_k a_{0k} [ u_s(n) - u_s(n-2) ],    (2)
where z_k(n) is the displacement of the basilar membrane (BM) at position x_k, and u_s(n) is the velocity of the stapes. According to [7], the mechanical vibration at any location along the BM has a band-pass characteristic. For an excitation of unit amplitude, the vibration amplitude of the BM at a given location varies with the excitation frequency; the amplitude-frequency characteristic curves of 50 equidistant locations along the BM are depicted in Fig. 1.

When a signal with a single frequency component is input to the cochlea, it excites a region of the BM. The point of largest amplitude in this region is called the resonant point, and on either side of it the amplitude decreases. Because each location on the BM corresponds to a cochlear filter, a larger excited region contains more cochlear filters; in other words, the amplitude of a single-frequency signal can be determined from the number of cochlear filters it excites. If the input signal has many frequency components, there is always one component with the largest output at each filter, called the dominant frequency of that filter [8]. The more filters for which a component is the dominant frequency, the larger that component's amplitude. A threshold M is therefore set on the number of filters: only if the number of filters sharing the same dominant frequency is not less than M is that dominant frequency chosen as a component of the auditory spectrum, and its amplitude in the auditory spectrum is simply the number of such filters. Different frequency components can of course be the dominant frequencies of different filters. The method of auditory spectrum extraction is depicted in Fig. 2.
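A minimal sketch of how Eq. (2) and the dominant-frequency rule fit together: each channel is a second-order difference equation, and a channel's dominant frequency is the input component to which it responds most strongly. The resonator coefficients below are illustrative stand-ins chosen for simplicity, not the BM-derived coefficients of Eq. (1), and the 20-channel bank and test components are assumptions of the sketch (the paper uses 50 BM locations).

```python
import numpy as np

FS = 8000
M = 3  # threshold on the number of filters sharing a dominant frequency

def biquad(x, b1, b2, g):
    """Second-order difference equation in the form of Eq. (2):
    z(n) = -b1*z(n-1) - b2*z(n-2) + g*(x(n) - x(n-2))."""
    z = np.zeros_like(x)
    for n in range(len(x)):
        z[n] = (-b1 * (z[n - 1] if n >= 1 else 0.0)
                - b2 * (z[n - 2] if n >= 2 else 0.0)
                + g * (x[n] - (x[n - 2] if n >= 2 else 0.0)))
    return z

def resonator_coeffs(fc, r=0.97):
    """Illustrative stand-in for the BM-derived coefficients:
    poles at radius r and angle 2*pi*fc/FS give a band-pass channel."""
    b1 = -2.0 * r * np.cos(2 * np.pi * fc / FS)
    b2 = r * r
    return b1, b2, 1.0 - r

centers = np.linspace(200, 3400, 20)   # channel centre frequencies
components = [300.0, 1000.0, 2500.0]   # input frequency components
n = np.arange(512)

counts = {}
for fc in centers:
    b1, b2, g = resonator_coeffs(fc)
    # the component with the largest output energy is this
    # channel's dominant frequency
    energies = [np.sum(biquad(np.cos(2 * np.pi * fk * n / FS),
                              b1, b2, g) ** 2) for fk in components]
    dom = components[int(np.argmax(energies))]
    counts[dom] = counts.get(dom, 0) + 1

# auditory spectrum: components whose filter count reaches the
# threshold M, with the count itself used as the amplitude
auditory_spectrum = {fk: c for fk, c in counts.items() if c >= M}
```

The design choice mirrors the text: amplitude information is carried by how many channels a component dominates, not by the raw spectral magnitude.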
[Fig. 1 The amplitude-frequency characteristics of the second-order difference cochlear model (gain in dB versus frequency, 0 - 4 kHz)]
[Fig. 2 Extraction of the auditory spectrum: the speech power spectrum is fed in parallel to cochlear filters 1 to 50; the dominant frequency is extracted at the output of each filter, and the auditory spectrum is then formed from the pooled dominant frequencies.]

In analysing speech, we therefore only need to extract the frequencies, amplitudes and phases at the dominant frequency points, and synthetic speech close to the original can be reconstructed. While preserving the main advantages of the conventional sinusoidal representation, the described method reduces the redundancy in the speech by omitting unimportant spectral peaks, and is therefore useful for achieving low bit-rate speech coding.
4
Speech Analysis/Synthesis Approach Based on the Sinusoidal Representation and Cochlear Model
In this approach, the power spectrum of the speech is input to the cochlear filters. The power spectrum is used as input in order to increase the gap between small and large amplitudes in the spectrum and to simplify the peak-picking algorithm. At the output of each filter, the peaks are detected and the dominant frequency is extracted. All the dominant frequencies are then collected, and the number of filters sharing each dominant frequency is counted. If the number of filters having the same dominant frequency is not smaller than the threshold M, that frequency is chosen as a component of the auditory spectrum. In the present work the threshold is set to M = 3, and the total number of filters is 50.

The speech energy is normally concentrated in the low- and mid-frequency regions, which determine the basic content of speech. The high-frequency components mainly affect the clearness of the synthetic speech. If
the high-frequency components are attenuated excessively, the clearness of the synthetic speech will be degraded, making it sound dull to the listener. Experiments show that, although most speech energy is concentrated in the low-frequency region, that region does not contribute much to the clearness of speech [9]: if the components below 1000 Hz are removed with a high-pass filter, about 80% of the speech energy is lost, but the clearness of the speech drops by only about 10%. In order to extract enough high-frequency components to improve the clearness of the synthetic speech, we extract not only the frequencies with the largest amplitudes at the outputs of the cochlear filters, as the primary dominant frequencies, but also those with the second and third largest amplitudes, as the second and third dominant frequencies of those filters. All of these dominant frequencies are used to determine the auditory spectrum. The auditory spectrum of a voiced speech frame, extracted using this method, is depicted in Fig. 3.
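The multi-dominant-frequency idea can be illustrated as follows: for each channel we keep not just the largest local peak but up to the three largest. The idealised band partition and the toy spectrum below are assumptions of the sketch (the paper uses 50 overlapping cochlear filters, not three disjoint bands).

```python
import numpy as np

def top_dominant_bins(power, lo, hi, n_dom=3):
    """Return the bins of the n_dom largest local peaks of `power`
    inside the band [lo, hi): the primary, second and third
    dominant frequencies of one idealised channel."""
    peaks = [k for k in range(max(lo, 1), min(hi, len(power) - 1))
             if power[k] > power[k - 1] and power[k] > power[k + 1]]
    peaks.sort(key=lambda k: power[k], reverse=True)
    return peaks[:n_dom]

# toy power spectrum: strong low-frequency and weak high-frequency peaks
power = np.zeros(128)
for k, amp in [(10, 100.0), (20, 60.0), (40, 20.0), (80, 5.0), (90, 3.0)]:
    power[k] = amp

# idealised channels as simple bands (stand-ins for the cochlear filters)
bands = [(0, 32), (24, 64), (56, 128)]
dominant = []
for lo, hi in bands:
    dominant.extend(top_dominant_bins(power, lo, hi))
```

Because each band reports its own strongest peaks, the weak high-frequency peaks (bins 80 and 90 here) survive even though they would be discarded by a single global ranking, which is exactly the clearness-preserving effect described above.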
[Fig. 3 The auditory spectrum of a voiced speech frame: (a) power spectrum of a voiced speech frame; (b) the auditory spectrum corresponding to (a), in which the multiple dominant frequencies are extracted]

It is observed from the spectrum that there are 13 spectral lines, some of which are closely spaced in certain frequency regions. According to auditory physiology, there exist frequency difference limens [9] in human perception. These limens depend on the frequency region as well as on the sound strength, and are easiest to detect in the low-frequency regions. Experiments have revealed that, if the interval between two spectral lines is smaller than the frequency difference limen, it is unnecessary to transmit both lines. In practice, the frequency resolution in extracting spectral peaks is also limited, so only the basic components need to be transmitted. This is depicted in Fig. 4, in which the interval between two adjacent spectral lines is limited and the locations of the auditory spectrum lines are compared with the envelope of the spectrum.

[Fig. 4 The auditory spectrum of the voiced speech frame corresponding to Fig. 3, with the interval between adjacent spectral lines limited]

At the synthesis end, all frequency components are first added together to produce a frame of speech. The frame is then weighted by a Hamming window and appended to the previously formed frame with a 10 ms overlap. This process is carried out at a rate of 50 frames/s to generate the synthesized speech.
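The synthesis step (sum the sine components, weight with a Hamming window, and append with a 10 ms overlap at 50 frames/s) can be sketched as below; the spectral lines used here are arbitrary test values, not the paper's data.

```python
import numpy as np

FS = 8000
FRAME = 240   # 30 ms frame
HOP = 160     # 20 ms hop: 50 frames/s with a 10 ms overlap

def synth_frame(lines):
    """One frame from auditory-spectrum lines (freq Hz, amplitude, phase)."""
    n = np.arange(FRAME)
    return sum(a * np.cos(2 * np.pi * f * n / FS + p) for f, a, p in lines)

def overlap_add(frames, hop=HOP):
    """Window each synthesized frame with a Hamming window and
    overlap-add it onto the output at hop-sized intervals."""
    out = np.zeros(hop * (len(frames) - 1) + FRAME)
    win = np.hamming(FRAME)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + FRAME] += frame * win
    return out

# five identical test frames: two spectral lines at 440 Hz and 880 Hz
frames = [synth_frame([(440.0, 1.0, 0.0), (880.0, 0.5, 0.0)])
          for _ in range(5)]
speech = overlap_add(frames)
```

The tapered Hamming window makes adjacent frames cross-fade smoothly in the 10 ms overlap region instead of producing audible frame-boundary clicks.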
5
Experiment Results
Combining the sinusoidal representation with the auditory spectrum, we have built the speech analysis/synthesis system depicted in Fig. 5. Using this system, the power spectrum of a frame of synthetic voiced speech corresponding to Fig. 3(a) is obtained and depicted in Fig. 6. A section of original speech, depicted in Fig. 7(a), is analysed with the system, and the section of synthetic speech depicted in Fig. 7(b) is obtained, with high intelligibility and sufficient clearness and naturalness.
[Fig. 5 Speech analysis/synthesis system. (a) Speech analysis system based on the cochlear model: the original speech S_o(n) is weighted by a Hamming window, its power spectrum is passed through the cochlear filters to extract the auditory spectrum, and the speech parameters (frequencies, amplitudes and phases) are obtained. (b) Speech synthesis system based on the auditory spectrum: sine wave generators driven by the frequencies and phases are weighted by the amplitudes, then overlapped and added and de-emphasised to produce the output speech S(n).]
6
Conclusion
This paper has proposed a speech analysis/synthesis method based on a sinusoidal representation and a cochlear model. According to the experimental results, the synthetic speech can be fully understood and has good clearness and naturalness; its quality is close to toll quality (MOS 4.0). With the proposed method it is not necessary to extract the frequencies, amplitudes and phases at all spectral peaks, and loss of high-frequency components and over-extraction of low-frequency components are avoided when the auditory spectrum is extracted using the cochlear model. The voiced/unvoiced decision and pitch detection are also avoided in the speech analysis. Computer simulations show that the average number of auditory spectrum lines per frame is no more than 4; considering the extraction of a frequency, an amplitude and a phase for each line, speech can be well synthesized with no more than 12 parameters per frame.

[Fig. 6 Power spectrum of a frame of synthetic voiced speech corresponding to Fig. 3(a)]

[Fig. 7 A section of speech: (a) a section of original speech; (b) the synthetic speech corresponding to (a)]
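As a sanity check on the parameter count: at most 4 auditory spectrum lines per frame, each described by a frequency, an amplitude and a phase, gives at most 12 parameters per frame at 50 frames/s. The bit allocation below is purely an assumed example, since the paper does not specify any quantization scheme.

```python
lines_per_frame = 4      # average number of auditory spectrum lines
params_per_line = 3      # frequency, amplitude, phase
frame_rate = 50          # frames per second (20 ms hop)

params_per_frame = lines_per_frame * params_per_line
params_per_second = params_per_frame * frame_rate

# purely illustrative: with an assumed 8 bits per parameter this
# representation would cost 12 * 50 * 8 bits per second
assumed_bits_per_param = 8
bitrate = params_per_second * assumed_bits_per_param

print(params_per_frame, params_per_second, bitrate)  # 12 600 4800
```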
References
1  Ozawa K., et al., A study on pulse search algorithms for multipulse excited speech coder realisation, IEEE JSAC, 1986, SAC-4(1): 133-141
2  ITU-T, COM: Draft Recommendation G.729, Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP), 1995: 15-152
3  McAulay R.J. and Quatieri T.F., Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. ASSP, 1986, 34: 744-754
4  Quatieri T.F. and McAulay R.J., Mixed-phase deconvolution of speech based on sine-wave models, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1987: 649-652
5  George E.B., An Analysis-by-Synthesis Approach to Sinusoidal Modelling Applied to Speech and Music Signal Processing, Ph.D. Thesis, Georgia Institute of Technology, 1991
6  George E.B. and Smith M.J.T., An analysis-by-synthesis approach to sinusoidal modelling applied to the analysis and synthesis of musical tones, Journal of the Audio Engineering Society, 1992, 40: 497-516
7  Wan W.G. and Yu X.Q., A second order difference cochlear model, Acta Electronica Sinica, 1995, 23(7): 6-10 (in Chinese)
8  Wan W.G. and Yu X.Q., A speech analysis/synthesis method based on a second-order difference cochlear model, Acta Electronica Sinica, 1998 (in Chinese)
9  Yang X.J., Chi H.S., et al., The Digital Processing of Speech Signal, Electronic Engineering Press, Beijing, 1995: 34-40 (in Chinese)