Annals of Biomedical Engineering, Vol. 29, pp. 587–594, 2001 Printed in the USA. All rights reserved.
0090-6964/2001/29(7)/587/8/$15.00 Copyright © 2001 Biomedical Engineering Society
A Noninvasive Estimation of Hypernasality Using a Linear Predictive Model
DONG KYUN RAH,1 YOUNG IL KO,2 CHULHEE LEE,3 and DEOK WON KIM4
1Department of Plastic Surgery, College of Medicine, 2Graduate Program in Biomedical Engineering, 3Department of Electronics Engineering, College of Engineering, and 4Department of Medical Engineering, College of Medicine, Yonsei University, CPO Box 8044, Seoul 120-752, Korea
(Received 14 March 2000; accepted 8 April 2001)
Abstract: The pronunciation of a speaker with a defective soft palate is marked by hypernasality, and an operation may be necessary to repair the defective soft palate and reduce this hypernasality. An assessment of hypernasality is necessary to quantify the effect of the surgery. The current clinical methods for assessing hypernasality are uncomfortable or require expensive equipment. In this paper, a new quantitative method is proposed to estimate hypernasality. This method requires only a microphone and a personal computer equipped with a sound card. Zeros in the frequency response of the vocal tract system are one of the major characteristics of hypernasality. The proposed method makes use of the fact that a linear predictive model with a typical order for the human vocal tract system is not accurate when the vocal tract system has zeros in its frequency response. Hypernasality was estimated from the distance between the linear predictive cepstral sequences of low- and high-order linear predictive models. The proposed method provides a better correlation (0.58) with nasalance measured by a nasometer than the Teager method (0.44) for all the data. Furthermore, for data with a nasalance higher than 35%, the proposed method showed a higher correlation (0.84) than the Teager method (0.71). Since the proposed method needs only digitized speech data, it is much less invasive and provides an easy and cost-effective evaluation of hypernasality. © 2001 Biomedical Engineering Society. [DOI: 10.1114/1.1380422]
Keywords: Soft palate, Nasalance, Zero, Cepstrum.
INTRODUCTION
The pronunciation of a speaker with a defective soft palate is marked by hypernasality because the oral cavity is not properly separated from the nasal cavity.1 A speaker with pronounced hypernasality has a problem communicating with others. A palatoplasty may be performed to rectify the problem. If nasality could be assessed quantitatively, doctors could accurately evaluate the efficacy of an operation. Nasality is classified into hyponasality and hypernasality.5,8 Hypernasal speech is speech with inappropriately strong resonance in the nasal cavity, while hyponasal speech is associated with abnormally weak resonance in the nasal cavity. Hypernasality is the characteristic nasal sound due to a defective soft palate, and the term "nasality" as used in this paper means hypernasality.
Nasality assessment methods can be classified into three categories. The first category directly observes the velopharyngeal port by x ray or endoscope, and the second assesses nasality by measuring sound pressures. The third category uses speech signal processing techniques, which have not yet been used clinically. This relatively new approach to assessing hypernasality has been proposed recently by some researchers.1 The three categories are briefly described below.
Method of Viewing the Velopharyngeal Port
In this method, static lateral radiographs are used to view the velopharyngeal structures during sustained sounds.1,9 Multiview videofluoroscopy allows doctors to observe the relevant structures during connected speech from several spatial planes.1,16 Flexible fiber-optic nasendoscopy, an invasive technique, allows direct observation of velopharyngeal movements during connected speech.9 These methods have the advantage that doctors can directly observe the anatomy of the soft palate. However, the correlation of data obtained using these methods with listeners' judgment of hypernasality was often poor.1,9,16 Moreover, these methods have the disadvantages that quantitative results cannot be acquired and that they are invasive and require expensive equipment.
Analysis of Pronounced Speech Sound
Since hypernasality arises from an inappropriate nasal–oral cavity coupling, researchers have attempted to use a nasal–oral ratio to assess hypernasality. Horii proposed a measure of nasal coupling called the Horii Oral Nasal Coupling (HONC) index.11 This index was derived from signals measured by an accelerometer attached to
Address correspondence to Deok Won Kim, Yonsei University, Seoul 120-752, Korea. Electronic mail:
[email protected]
RAH et al.
the outside of the nares, and by a microphone positioned in front of the mouth. The correlation of the HONC index with listener judgment of hypernasality was very high (Pearson's product moment correlation coefficient of 0.92).1,10 The nasometer measures a nasal–oral ratio to assess hypernasality. It uses two separate microphones, one positioned in front of the mouth and the other in front of the nostrils. The mouth and nostrils are separated by a plastic plate. The microphones record both the oral and nasal sound pressures, and the nasalance, an index of the nasometer, is calculated using Eq. (1):
\[ \mathrm{Nasalance}\,(\%) = \frac{N}{N+O} \times 100, \tag{1} \]
where N and O represent the nasal and oral sound pressures assessed at the nostrils and mouth, respectively. A nasalance value of 32% or less is considered normal.1,2 A statistically significant relationship has been shown between the clinical judgment of hypernasality and nasalance scores for actual hypernasal speech (Pearson's correlation coefficient: 0.78).1 The nasometer is widely used in the clinical field, but it is an expensive device and, with its headset, is uncomfortable to use.
Speech Signal Processing Technique
Cairns et al. proposed a method to detect hypernasality using the nonlinear Teager energy operator of Eq. (2):1
\[ \Psi[x(n)] = x^{2}(n) - x(n-1)\,x(n+1). \tag{2} \]
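As a minimal sketch of Eq. (2), the Teager energy operator can be applied to a sampled signal with NumPy. The function name `teager_energy` and the test tone are illustrative assumptions and are not taken from Cairns et al.; the two end samples are dropped because Eq. (2) needs both neighbors of each sample.

```python
import numpy as np

def teager_energy(x):
    """Teager energy operator, Eq. (2): psi[x(n)] = x(n)^2 - x(n-1) x(n+1).

    Returns len(x) - 2 values, since the first and last samples
    have only one neighbor.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Illustrative input (an assumption, not data from the paper):
# a single sinusoid with amplitude 2 and digital frequency 0.3 rad/sample.
n = np.arange(200)
tone = 2.0 * np.cos(0.3 * n + 0.1)
psi = teager_energy(tone)
```

For a single sinusoid A cos(Ωn + φ), the operator output is the constant A² sin²(Ω). A signal containing several components additionally produces cross terms, which is the property the Teager method relies on.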
It was assumed that normal speech was composed of several resonances (formants) at various frequencies, while hypernasal speech was composed not only of the resonances of the oral cavity, but also of those of the nasal cavity and of antiformants (zeros). If both normal and nasal speech signals are filtered by a bandpass filter (BPF) centered on the first formant, the signals acquired are composed of only the first formant. If normal speech is filtered by a lowpass filter (LPF) with a cutoff frequency between the first and the second formants, the acquired signal is composed of only the first formant component. In the case of hypernasal speech, however, the acquired signal is composed of the nasal formants and antiformants in addition to the first formant. Therefore, when the Teager operator is applied to the hypernasal speech signal filtered by an LPF, a cross term of the antiformants and nasal formants would be introduced due to the nonlinear property of the Teager operator.1 However, because hypernasal speech filtered by a BPF consists of only the first formant, the result of the Teager operator upon that signal may be different
FIGURE 1. A simplified model of the human vocal tract.
from the result upon hypernasal speech filtered by an LPF.1 Thus, the cross-correlation coefficient between these results of the Teager operator would be less than 1. Since a normal speech signal filtered by an LPF may be composed of only the first formant, the cross-correlation for normal speech would ideally be 1. From these relationships, it is apparent that hypernasality can be detected using the Teager method by calculating the cross-correlation between the results of the Teager operation upon a signal filtered by an LPF and by a BPF.1
LINEAR PREDICTIVE MODEL
Figure 1 represents a simplified model of the human vocal tract. Hypernasality depends upon how much the velum is opened during pronunciation. The pronunciation of a speaker with a defective soft palate is hypernasal because the velum cannot firmly separate the nasal and oral cavities. For modeling the human vocal tract, it was assumed that it was composed of several lossless acoustic tubes with different cross-sectional areas.14 During the pronunciation of a vowel, the excitation signal from the vocal cords was assumed to be an ideal impulse train. Under this assumption, the transfer function of the vocal tract from the glottis to the lips becomes an all-pole system.14 Thus, a linear predictive (LP) model using the all-pole system has been widely used in speech signal analysis. Linear predictive coefficients (LP coefficients) $a_k$ used in an LP model of order M are defined by Eq. (3):
\[ s(n) = \sum_{k=1}^{M} a_k\, s(n-k) + G\,u(n), \tag{3} \]
where s(n) is a speech signal and u(n) is a normalized excitation signal produced by the glottis, which has a maximum value of 1. G is the gain of the excitation signal and also the root mean squared value of the residual error between the predicted and original speech signals. Equation (4) is a Z-transformed version of Eq. (3):12
An Estimation of Hypernasality Using a LP Model
\[ \Theta(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - a_1 z^{-1} - \cdots - a_M z^{-M}} = \frac{G}{A(z)}. \tag{4} \]
The frequency response of the vocal tract $\Theta(e^{j\omega})$ can be obtained by replacing $z$ with $e^{j\omega}$ in Eq. (4). The characteristics of hypernasality include: (1) amplitude reduction of the first formant, (2) presence of zeros in the spectrum due to the coupling of the nasal cavity to the oral cavity, (3) presence of reinforced harmonics (nasal formants) resulting from the sound resonance in the nasal cavity, and (4) a shift of formants.1,4,15 In order to utilize the hypernasal characteristics (1) and (4), it is necessary to find accurate formants, which is not an easy task in practice. Therefore, the characteristics (2) and (3) were utilized to estimate hypernasality in this study.
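The substitution of $e^{j\omega}$ for $z$ in Eq. (4) can be sketched numerically as follows. This is only an illustration under assumptions: the function name `all_pole_response`, the frequency grid, and the single-pole example coefficients are hypothetical and do not come from the paper.

```python
import numpy as np

def all_pole_response(a, gain, n_freqs=512):
    """Evaluate G / A(e^{jw}) with A(z) = 1 - sum_k a_k z^{-k}, as in Eq. (4).

    a      : LP coefficients a_1..a_M
    gain   : excitation gain G
    returns: (omega, response) with omega in [0, pi] rad/sample
    """
    a = np.asarray(a, dtype=float)
    omega = np.linspace(0.0, np.pi, n_freqs)
    k = np.arange(1, len(a) + 1)
    # Each row holds e^{-j w k} for k = 1..M at one frequency w.
    A = 1.0 - np.exp(-1j * np.outer(omega, k)) @ a
    return omega, gain / A

# Hypothetical single-pole model: a_1 = 0.5, G = 1.
omega, response = all_pole_response([0.5], 1.0)
magnitude_db = 20.0 * np.log10(np.abs(response))
```

Because the model has no zeros, the magnitude response can only form peaks. A spectral dip caused by nasal coupling has to be approximated by the surrounding poles, which is the limitation the method exploits.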
METHOD
If we want to model only the vocal-tract portion of the normal speech system, a model of order 8–10 is typically used.3 As mentioned previously, a hypernasal speech signal will have zeros in its spectrum due to the coupling of the nasal cavity to the oral cavity, and extra poles due to the resonance of the nasal cavity.1,4,15 Thus, a hypernasal speech signal cannot be modeled accurately by an LP model, since the LP model does not include zeros. However, it can be shown that a zero in the spectrum can be modeled by means of an infinite number of poles, as in Eq. (5):3
\[ 1 - z_0 z^{-1} = \frac{1}{1 + \sum_{k=1}^{\infty} z_0^{k} z^{-k}}. \tag{5} \]
Thus, if an LP model of a higher order than the typical orders of 8–10 is used to model the hypernasal speech signal, we can expect that a more accurate representation of the hypernasal speech signal will be achieved. Furthermore, since it is known that nasal resonance introduces extra poles in the spectrum,1,4 an LP model with a higher order than the typical order of 8–10 is needed to model a speech signal with nasal resonance. Based on these facts, we propose an algorithm to estimate the hypernasality of speech using the LP coefficients of high- and low-order LP models. In other words, in the case of a hypernasal speech signal, there will be a large difference between the spectra obtained using the LP coefficients of the high- and low-order LP models.
Preprocessing
A Hamming window was applied to speech data sampled at a rate of 12.5 kHz to obtain a frame of 30 ms. Typically, a speech signal over a 30 ms interval can be assumed to be stationary.12 For preprocessing, a pre-emphasis filter was applied to the speech signal. In the proposed method, the following optimal pre-emphasis filter P(z) proposed by Gray and Markel6 was used:
\[ P(z) = 1 - \mu z^{-1}, \qquad \mu = \frac{r_s(1;m)}{r_s(0;m)}, \tag{6} \]
where $r_s(\eta;m)$ is the short-term autocorrelation at lag $\eta$ for the mth frame. The reasons for employing a pre-emphasis filter were twofold. First, it eliminates the scattering effect that is introduced when the speech signal is transmitted from the lips through the air. Second, it also removes the spectral component of the larynx from the speech signal.6
Selection of Orders of the LP Model
Since it is known that four formants are effective within the frequency range 0–5000 Hz for vowels,4,13 the order of the LP model would typically be 8–10 for a model of only the vocal tract portion of the normal speech system.3 Therefore, the order of 10 was selected as the low order of the LP model for normal speech. In the linear prediction method, LP coefficients are calculated by minimizing the mean squared error (mse) between the speech signal $s(n)$ and the estimated speech signal $\hat{s}(n)$. The estimated speech signal $\hat{s}(n)$ and the estimation error $e(n)$ are defined as follows:3
\[ \hat{s}(n) = \sum_{i=1}^{M} a(i)\, s(n-i), \qquad e(n) = s(n) - \hat{s}(n), \tag{7} \]
where M is the order of the LP model and a(i) is an LP coefficient. The minimum mse is obtained if and only if the error is orthogonal to the signal.3 Comparing Eq. (7) with Eq. (3), e(n) can be interpreted as the excitation signal of the human glottis.14 For a vowel, e(n) is a quasiperiodic signal with the period of the pitch of the speech signal. If the order M of the LP model is larger than P, which is the nearest integer to the pitch period of the speech divided by the sampling period, then s(n−P) includes e(n−P) according to Eq. (3), and e(n−P) would be approximately the same as e(n) due to the quasiperiodic property of e(n). In other words, the orthogonality condition would be violated. Therefore, the order M of the LP model needs to be smaller than the pitch period P of s(n). Otherwise, the spectrum of the LP model may include the spectrum of the excitation signal.
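The orthogonality condition and the pitch-based bound on the order can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: it solves the autocorrelation normal equations directly with NumPy, and the helper names, the random test frame, and the direct solver are all hypothetical choices made for clarity.

```python
import numpy as np

def lp_normal_equations(frame, order):
    """LP coefficients a(1..M) minimizing the squared error of Eq. (7),
    using autocorrelations of the (zero-extended) frame."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # Toeplitz system: sum_k a(k) r(|i - k|) = r(i), i = 1..M.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def prediction_error(frame, a):
    """e(n) = s(n) - sum_i a(i) s(n - i), over the zero-extended frame."""
    x = np.asarray(frame, dtype=float)
    return np.convolve(x, np.concatenate(([1.0], -np.asarray(a))))

def pitch_period_in_samples(fs_hz, f0_hz):
    """P: nearest integer to (pitch period) / (sampling period)."""
    return int(round(fs_hz / f0_hz))

# Placeholder frame (random numbers, not speech): 375 samples = 30 ms at 12.5 kHz.
rng = np.random.default_rng(0)
frame = rng.standard_normal(375) * np.hamming(375)
order = 10
a = lp_normal_equations(frame, order)
e = prediction_error(frame, a)

# Inner products of e(n) with s(n - i), i = 1..M; all should be ~0.
padded = np.concatenate((frame, np.zeros(order)))
inner = [np.dot(e[i:], padded[:len(e) - i]) for i in range(1, order + 1)]

P = pitch_period_in_samples(12500, 300)  # 42 samples for a 300 Hz pitch
```

With a sampling rate of 12.5 kHz and a pitch of 300 Hz, P evaluates to 42 samples, so the LP order should stay below that value to keep excitation harmonics out of the model spectrum.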
For a speech signal with a sampling frequency of 12.5 kHz, we found that the spectrum of an LP model with an order of 34–38 did not contain the excitation signal components. It was also observed that, for high-pitched speech, the spectrum of an LP model with an order larger than 40 included the excitation signal components, which appear as periodic harmonics in the spectrum.
Calculation of Distance Between LP Cepstral Sequences
In order to calculate the LP coefficients, we used one of the most widely used methods, the autocorrelation method, since a fast algorithm for it already exists, the Levinson–Durbin recursion.3 As the proposed algorithm uses the difference between the spectra obtained using the LP coefficients of high- and low-order LP models to estimate hypernasality, we needed a distance measure between the spectra. While there are several methods available to compute the difference or distance between spectra, the LP cepstrum is one of the most widely used in speech signal processing.3,10 The real LP cepstrum c(n) is defined by Eq. (8):3
\[ c(n) = \begin{cases} \log G, & n = 0, \\ a_n + \displaystyle\sum_{k=1}^{n-1} \left(\frac{k}{n}\right) c(k)\, a_{n-k}, & n > 0, \\ 0, & n < 0, \end{cases} \tag{8} \]
where G is the gain of the excitation signal in Eqs. (3) and (4), and $a_k$ is an LP coefficient of the model of order M, which was assumed to be zero for k < 0 or k > M. Let $c_L(n)$ and $c_H(n)$ be the LP cepstral sequences of the low- and high-order LP models, respectively. Then, the geometric distance between the LP cepstral sequences is calculated using Eq. (9):3,12
\[ \sum_{n=0}^{\infty} \left[ c_H(n) - c_L(n) \right]^{2} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \log\left|\Theta_H(\omega)\right| - \log\left|\Theta_L(\omega)\right| \right]^{2} d\omega, \tag{9} \]
where $\Theta(\omega)$ can be obtained from $\Theta(z)$ in Eq. (4) with $z$ replaced by $e^{j\omega}$. As can be seen in Eq. (9), the geometric distance between the two LP cepstral sequences represents the difference of the two spectra of the LP models. Thus, we used the geometric distance between the two LP cepstral sequences as a measure of the similarity of the two spectra by calculating the left side of Eq. (9). In order to compute the distance using Eq. (9), we need to sum an infinite number of terms. However, it has been reported that sufficient accuracy can be obtained if the upper limit of the summation is set to three times the order of the LP model.7 In this paper, the upper limit of the summation was set at five times the order of the LP model to ensure accuracy. Figure 2(a) shows the real cepstrum of a speech signal calculated by the conventional method.3,14 Figures 2(b) and 2(c) show two LP cepstral sequences calculated by Eq. (8) for the same speech signal. The orders of the LP models of Figs. 2(b) and 2(c) are 10 and 38, respectively. In Fig. 2(a), a peak can be noticed near the 50th sample. This peak is thought to be caused by the excitation signal from the glottis. On the other hand, no peak and a very weak peak can be seen near the 50th sample in Figs. 2(b) and 2(c), respectively. The above results suggest that the effect of the excitation signal can be minimized by using the LP cepstrum in estimating the condition of the nasal and oral cavities.
FIGURE 2. Comparison between the conventional real cepstrum and the real cepstrum of LP model. (a) Real cepstrum of human speech signal; (b) real cepstrum of LP model with the order of 10; (c) real cepstrum of LP model with the order of 38.
In this study, the speech signal was assumed to be the pronunciation of a vowel lasting longer than 1 s. Typically, a speech signal of 1 s consisted of 65 frames when the length of a frame was 30 ms and the frame was shifted by 15 ms, giving 50% overlap. Each frame with a length of 30 ms can be assumed to be stationary.12 In the experiments, we took five consecutive frames in the middle of the recorded speech signal and computed the distance for each frame using Eq. (9); the five calculated distances were averaged to obtain the spectral difference. Figure 3 shows a flow chart for the preprocessing and the calculation of the distance for speech signals.
RESULTS
Simulated Speech Signal
In order to evaluate the proposed algorithm, we applied the algorithm to simulated speech signals of a
FIGURE 4. Distance depending on the value of k for the simulated speech signal: (a) without pole-zero cancellation; (b) with pole-zero cancellation.
FIGURE 3. Flow chart for calculating the distance for speech signal.
vowel /#/. The characteristics of /#/ are shown in Table 1.3,13,15 The spectral zero of hypernasal /#/ is known to be located at 2400 Hz.15 Therefore, the transfer function of the pronunciation /#/ can be written as Eq. (10):
\[ H(z) = \frac{\left( k e^{j(2\pi f_z/f_s)} z^{-1} - 1 \right)\left( k e^{-j(2\pi f_z/f_s)} z^{-1} - 1 \right)}{\displaystyle\prod_{l=1}^{3} \left( a_l e^{j(2\pi f_l/f_s)} z^{-1} - 1 \right)\left( a_l e^{-j(2\pi f_l/f_s)} z^{-1} - 1 \right)}, \tag{10} \]
where $f_1$, $f_2$, and $f_3$ are 730, 1090, and 2600 Hz, respectively, from Table 1, and $f_z$ is 2400 Hz. The values of $a_1$, $a_2$, and $a_3$ were found by trial and error so that each formant had the bandwidth shown in Table 1. The sampling rate $f_s$ of the simulated signal in Eq. (10) was selected to be 10 kHz. The hypernasality of the simulated signal varied depending on the value of k, which ranged from 0 to 2. If the zero is located on the unit circle (k = 1), the strongest hypernasality would be produced. As an excitation signal, a unit impulse train with a fundamental frequency of 200 Hz was used. From our speech data, it was observed that the pitch frequency was 119.6 ± 23.5 Hz for adult males and 241.0 ± 47.3 Hz for adult females and children. Since we did not consider adult males, adult females, and children separately, we determined that 200 Hz would be a good choice of pitch frequency that could be applied to all three cases.
TABLE 1. Spectral characteristics of /#/.
Formant   Frequency (Hz)   Bandwidth (Hz)
First     730              60
Second    1090             50
Third     2600             102
For the simulated speech signals, LP models of orders 10 and 38 were selected. The distance between the spectra of the two LP models was then calculated using Eq. (9). In Fig. 4(a), the distance increases as k in Eq. (10) increases from 0 to 1, and decreases as k increases from 1 to 2, as expected. It should be noted that the distance increases slowly in the range of k = 0–0.6, and increases rapidly in the range of k = 0.6–1.0.
There is a special case in which a formant is canceled by a spectral zero of the hypernasal sound. Due to the cancellation or reduction, the distance is considerably reduced. In this case, the discrimination of hypernasality would be poor and the performance of the proposed method would be degraded. Figure 4(b) shows such an example, in which the zero of /#/ at 2400 Hz and the third formant at 2440 Hz almost coincide. The distance decreases rapidly at k = 0.94. We selected 2400 Hz as the zero frequency because it had been reported that the zeros could be located at 2400 Hz.15 The location of the zero due to nasalance, however, varies depending on the
FIGURE 5. Distance between two spectra for k = 0.0, 0.6, 0.8, and 1.0 with low order fixed to 10 and high orders ranging from 10 to 38.
shapes of the nasal and oral cavities and the degree of the nasalance.3 The performance of our method for estimating hypernasality can be predicted from Fig. 4. As the distance increases slowly for a small k, the discrimination may be poor for weak hypernasality. However, as the distance increases rapidly for a large k, the discrimination may be good for strong hypernasality. Since a hypernasal sound with a nasalance of 32% or below cannot be clearly discriminated by listeners,4 the relatively poor performance of the proposed method in the case of weak hypernasality would not be a serious problem.
Figure 5 shows the distances between the LP model of low order 10 and LP models of various high orders, ranging from 10 to 38, for k = 0.0, 0.6, 0.8, and 1.0 with $f_3$ = 2600 Hz in Eq. (10). The distances increase slowly for the small values of k = 0.0 and 0.6, but increase rapidly for the larger value of k = 1.0. As expected from Fig. 4, the discrimination is poor for values of k smaller than 0.6. For example, the distances between the LP model of low order 10 and the LP model of high order 36 for k = 0, 0.6, 0.8, and 1.0 were calculated to be 0.23, 0.24, 0.25, and 0.33, respectively. Figure 5 shows a steady increase over the order of 40. However, it has no special meaning, because the spectrum of an LP model whose order is higher than the number of samples in the pitch period would include the components of the excitation signal, and the inclusion of these components makes the difference between the spectra of the low- and high-order LP models large even if the speech signal has no nasalance. In this study, we limited the maximum order to 38–40 to avoid this phenomenon and to maximize the sensitivity of the proposed method.
Figure 6(a) shows the distances between the LP model of low order 10 and various high-order LP models ranging from 10 to 60. Figure 6(b) shows the differences of the distances for k = 0.0 and 1.0. As an excitation signal, a unit impulse train with a fundamental frequency of 300 Hz was used. It was assumed that human pitch did not exceed 300 Hz in normal speech. According to the criterion for the high order of the LP model proposed in this paper, the maximum high order was 42. In Fig. 6(b), a large increase can be noticed at the order of 44 for each k. This can be interpreted as a large distance produced by the presence of an excitation signal component in the LP spectrum when too high an order of LP model is used. Therefore, it seems reasonable to restrict the maximum order of the LP model to 40.
FIGURE 6. (a) Distance between two spectra for k = 0.0 and 1.0, $f_3$ = 2600 Hz, $f_s$ = 12 500 Hz, and pitch frequency of 300 Hz with low order fixed to 10 and high orders ranging from 10 to 60; (b) differences of the distances for k = 0.0 and k = 1.0.
Human Speech Signals
The human speech data used in this study were the recorded signals of velopharyngeal incompetence (VPI) patients at the Institute of Logopedics and Phoniatrics, Yonsei University College of Medicine (CSL model 4300B, KAY Elemetric Co.). The sampling rate was 50 kHz with 8 bits of resolution. The pronounced sound of speech was /#/. The recorded speech data of 24 VPI patients with measured nasalance scores were used to evaluate the performance of our proposed method by comparing the computed distance with the nasalance measured by the nasometer. The distribution of ages and genders of the VPI patients is shown in Table 2. For processing the speech signal, the amplitude of the signal was normalized to a maximum value of 1, filtered by an LPF with a cutoff frequency of 5 kHz, and down-sampled by 4, resulting in a sampling rate of 12.5 kHz. The reason for this down-sampling is that a sampling rate of about 10 kHz is widely used in speech signal processing to reduce processing time. The maximum order of the LP model with the sampling rate of 12.5 kHz was limited to 40 because it was assumed that human pitch does not exceed 300 Hz in normal speech. The size of the frame was set to 375 samples, which is equivalent to 30 ms.
TABLE 2. Distribution of ages and genders of 24 VPI patients.
         1–10 years old   10–30 years old   Over 30 years old
Male     5                6                 1
Female   6                4                 2
Figures 7(a) and 7(b) show the spectra of the LP model of normal speech (nasalance 2%) with the orders of 10 and 34, respectively, while Figs. 7(c) and 7(d) show those of the hypernasal speech (nasalance 46%). It can be seen that the difference between the two spectra of the two LP models is larger for the hypernasal speech than for the normal speech. Figure 8 shows the correlations between the distances calculated by the proposed method and the nasalances measured by the nasometer. The vertical axis of Fig. 8 represents the nasalance measured using the nasometer and the horizontal axis the distance calculated using our method for the pronunciation of /ö/. Our method was applied to the recorded speech signal with orders of 34, 36, 38, and 40 as the high order of the LP model and with the low order fixed to 10. The Pearson's correlation coefficients between the distance and the nasalance were 0.49, 0.58, 0.54, and 0.50, respectively. The reason for the lower correlation coefficients for the high orders of
FIGURE 8. Correlation coefficients between nasalance and distance for the high orders of 34, 36, 38, and 40 with the low order fixed to 10: (a) high order of 34; (b) high order of 36; (c) high order of 38; (d) high order of 40.
38 and 40 compared to that for order 36 may be the unwanted contribution of the excitation signal to the LP spectrum, which results from violating the orthogonality condition when the order is increased too far. In Fig. 8, the correlations between the distance and the nasalance are not very high for the entire data set. Although some reverse correlation can be noticed for nasalance between 20 and 30, the correlation appears high for nasalance above 35. For the data with nasalance higher than 35, the proposed method, with the high order 36 and the low order 10, provided a much higher correlation (0.84). Since a nasalance of less than 32 may be considered normal, the low correlation for nasalance less than 32 may not be a critical problem for detecting the hypernasality of a human speech signal. Figure 9 shows the correlation between nasalance and the correlation ratio using the Teager method for the data presented in Fig. 8. Pearson's correlation coefficient was 0.44, which is lower than that of our proposed method, 0.58. The Teager method also showed a higher correlation of 0.71 for data with a nasalance higher than 35, but this is lower than that of our method (0.84).
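The distance whose correlation with nasalance is reported above can be sketched end to end for a single frame. This is a hedged illustration, not the authors' code. It assumes that pre-emphasis is applied before the Hamming window, that the summation is truncated at five times the high order, and that the widely used LP-to-cepstrum recursion for an all-pole model is appropriate; all function names are hypothetical, and the random frame is placeholder data rather than speech.

```python
import numpy as np

def pre_emphasize(frame):
    """P(z) = 1 - mu z^-1 with mu = r(1)/r(0) estimated on the frame (Eq. 6)."""
    x = np.asarray(frame, dtype=float)
    r0 = np.dot(x, x)
    mu = np.dot(x[:-1], x[1:]) / r0 if r0 > 0 else 0.0
    y = x.copy()
    y[1:] -= mu * x[:-1]
    return y

def lp_coefficients(x, order):
    """Autocorrelation method via the Levinson-Durbin recursion.
    Returns (a, G) with a[k-1] = a_k and G the rms prediction error.
    Assumes a non-silent frame (r(0) > 0)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    a = np.zeros(order + 1)            # a[k] stores a_k; a[0] is unused
    err = r[0]                         # prediction error power
    for m in range(1, order + 1):
        refl = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / err
        prev = a.copy()
        a[m] = refl
        a[1:m] = prev[1:m] - refl * prev[m - 1:0:-1]
        err *= 1.0 - refl ** 2
    return a[1:], np.sqrt(err)

def lp_cepstrum(a, gain, n_terms):
    """c(0) = log G; c(n) = a_n + sum_{k=1}^{n-1} (k/n) c(k) a_{n-k}, n > 0,
    with a_n taken as zero for n > M."""
    order = len(a)
    c = np.zeros(n_terms + 1)
    c[0] = np.log(gain)
    for n in range(1, n_terms + 1):
        acc = a[n - 1] if n <= order else 0.0
        for k in range(max(1, n - order), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c

def cepstral_distance(frame, low_order=10, high_order=36, factor=5):
    """Truncated left side of Eq. (9) for one frame."""
    x = pre_emphasize(frame) * np.hamming(len(frame))
    n_terms = factor * high_order      # truncation point (assumption)
    c_low = lp_cepstrum(*lp_coefficients(x, low_order), n_terms)
    c_high = lp_cepstrum(*lp_coefficients(x, high_order), n_terms)
    return float(np.sum((c_high - c_low) ** 2))

# Placeholder frame: 375 samples (30 ms at 12.5 kHz) of random numbers.
rng = np.random.default_rng(0)
frame = rng.standard_normal(375)
distance = cepstral_distance(frame)
```

To follow the procedure described in the Method section, this per-frame value would be computed for five consecutive frames taken from the middle of the recording and then averaged.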
FIGURE 7. The spectra of LP models for different orders and nasalances: (a) order of 10 and 2% nasalance; (b) order of 34 and 2% nasalance; (c) order of 10 and 49% nasalance; (d) order of 34 and 49% nasalance.
FIGURE 9. Correlation coefficients between nasalance and the correlation coefficient of the Teager method using the same data as Fig. 8.
CONCLUSION
In this paper, we proposed an easy and cost-effective method to evaluate hypernasality. The method is much less intrusive than the current clinical methods, which require expensive equipment. The proposed method provides a correlation coefficient of 0.58 with nasalance, which is higher than that of the Teager method, 0.44. Although the correlation obtained by the proposed method is better than that of the Teager method, there is still room for improvement. A possible reason for the low correlation of the proposed method might be that formant magnitude could be reduced by zeros. More study is needed for various vowels to solve this problem. However, the correlation coefficient between the distance and nasalance was 0.84 for the data with nasalance higher than 35. Since a nasalance of 32 or less is considered normal, our method could be used for screening hypernasality and evaluating the effects of surgery for VPI and cleft palate patients when a nasometer is not available. Our proposed method does not require an expensive system such as the nasometer, but only a PC and a microphone. Since our method needs only digitized speech data, it would also be useful for evaluating hypernasality in recorded data not assessed by a nasometer.
REFERENCES
1. Cairns, D. A., J. H. L. Hansen, and J. E. Riski. A noninvasive technique for detecting hypernasal speech using a nonlinear operator. IEEE Trans. Biomed. Eng. 43:35–45, 1996.
2. Dalston, R. M., D. W. Warren, and E. T. Dalston. Use of nasometry as a diagnostic tool for identifying patients with velopharyngeal impairment. Cleft Palate J. 28:184–188, 1991.
3. Deller, J. R., J. G. Proakis, and J. H. L. Hansen. Discrete-time Processing of Speech Signals, 1st ed. New York: Macmillan, 1993, pp. 125–127, 266–286, 374–377.
4. Dickson, D. R. An acoustic study of nasality. J. Speech Hear. Res. 5:103–111, 1962.
5. Gibb, A. G. Hypernasality (rhinolalia aperta) following tonsil and adenoid removal. J. Laryngol. Otol. 72:443, 1958.
6. Gray, A. H., and J. D. Markel. A spectral flatness measure for studying the autocorrelation method of linear prediction of speech analysis. IEEE Trans. ASSP 22:207–217, 1974.
7. Gray, A. H., and J. D. Markel. Distance measures for speech processing. IEEE Trans. ASSP 24:380–391, 1976.
8. Green, M. C. L. Speech of children before and after removal of adenoids. J. Speech Hear. Disord. 22:361, 1957.
9. Hirschberg, J. Velopharyngeal insufficiency. Folia Phoniatr. 38:221–276, 1986.
10. Horii, Y. An accelerometric measure as a physical correlate of perceived hypernasality in speech. J. Speech Hear. Res. 26:476–480, 1983.
11. Horii, Y., and J. Lang. Distributional analysis of an index of nasal coupling (HONC) in simulated hypernasal speech. Cleft Palate J. 18:279–285, 1981.
12. Mammone, R. J., X. Zhang, and R. P. Ramachandran. Robust speaker recognition. IEEE Signal Process. Mag. 13:58–71, 1996.
13. Peterson, G. E. Parameters of vowel quality. J. Speech Hear. Res. 4:10–29, 1961.
14. Rabiner, L. R., and R. W. Schafer. Digital Processing of Speech Signals, 1st ed. Englewood Cliffs, NJ: Prentice Hall, 1978, pp. 82–99.
15. Schwartz, M. F. The acoustics of normal and nasal vowel production. Cleft Palate J. 5:125–140, 1968.
16. Skolnick, M. L., and E. R. Cohn. Videofluoroscopic Studies of Speech in Patients with Cleft Palate, 1st ed. Berlin: Springer, 1989.