Determination of a vocal source by the spectral ratio method


APPLIED PROBLEMS

Determination of a Vocal Source by the Spectral Ratio Method

V. N. Sorokin (a),* and A. S. Leonov (b),**

(a) Institute for Information Transmission Problems, Russian Academy of Sciences, Bol’shoi Karetnyi per. 19, Moscow, 127994 Russia
(b) National Research Nuclear University MEPhI, Kashirskoe sh. 31, Moscow, 115409 Russia
e-mail: *[email protected], **[email protected]

Abstract. The inverse problem with respect to functions proportional to a voice source and volume velocity of the air flow through the glottis is solved as follows: we compute the inverse Fourier transform of the regularized fraction of short-term speech-signal spectra at intervals with an opened (closed) glottis, minimizing the optimality criterion with respect to the regularization parameter and the glottis opening (closing) time. The optimality criterion for solutions includes the values of the volume velocity and its time derivative at the ends of the interval with an opened glottis and the total value of the negative volume velocity. To obtain an empirical error estimate for the solution, experiments using synthesized signals with various parameters, direct measurements of the glottis, and signals synchronously recorded through pairs of microphones of different types are performed. The most probable determination error for the volume velocity is less than 5% for synthetic sources; if the area of the glottis of the source is measured experimentally, then the said error is about 10%. The discrepancy of solutions for the same signal synchronously recorded through a pair of microphones of different types is less than 10%.

Keywords: voice source, inverse problem, short-term spectrum
DOI: 10.1134/S105466181701014X

Received March 10, 2016

ISSN 1054-6618, Pattern Recognition and Image Analysis, 2017, Vol. 27, No. 1, pp. 139–151. © Pleiades Publishing, Ltd., 2017.

1. INTRODUCTION

A voice source bears information about individual biological characteristics of the speaker (e.g., gender, peculiarities of the articulation-control system, and emotional and physical state). This makes the pulse shape of the voice source useful in applications. For example, in speech (speaker) recognition problems, recognition errors can be reduced if the gender is recognized in advance by means of the voice source. When checking the state of traffic controllers, pilots, and drivers, one needs to estimate the level of stress, fatigue, or drug intoxication by means of the voice source. Voice source parameters can be used to detect pathologies of the larynx. Since voice source parameters are assumed to depend only slightly on the phonetic content of the speech signal, they could be used to develop speaker verification (identification) systems independent of language and context.

A voice source can be defined as the force exciting acoustic oscillations in the vocal tract while an airflow is passing through the glottis. This force is proportional to the derivative of the airflow volume velocity with respect to time. It depends on variations of the area of the glottis caused by auto-oscillations of the vocal folds. For applications, time variation data for the volume velocity, its derivative, and the area of the glottis are used.

To develop methods for finding the pulse shape of a voice source, the following assumption is usually made: the source characteristics do not depend on articulatory and acoustic processes in the vocal tract. Another assumption is that the formant frequencies (or poles of the z-transform) are equal to the resonance frequencies of the vocal tract in both intervals (i.e., of the opened and closed glottis). Usually, the latter assumption is not satisfied because, for an opened glottis, the boundary conditions (and, therefore, the resonance frequencies of the tract) vary (see [1, 2]). Also, the resonance frequency substantially affects the measured frequency of the first resonance of the vocal tract (see [3]). However, treated as a first approximation, these assumptions allow us to use relatively simple methods of analysis based on the linear speech production theory (see [4]). According to that theory, the so-called residual signal, obtained by removing the resonance frequencies of the vocal tract from the speech signal in voiced segments, contains information about the voiced excitation. The computation of the residual signal is called inverse filtering.

Several inverse filtering methods estimate the resonance frequencies of the tract in the interval of the closed glottis inside the fundamental period (see [5]). To do this, the fundamental period has to be segmented into the intervals of the opened and closed glottis (see segmentation methods in [3, 6]). An error occurs when the closing time of the glottis is determined, and this error substantially affects the analysis results. Moreover, the duration of the interval of the closed glottis can be very small (or equal to zero). Other inverse filtering methods use a simultaneous iterative estimate of the voice source and the parameters of the vocal tract (see [7–9]). This relaxes the accuracy requirements for finding the intervals of the opened and closed glottis.

In [10], another method (different from inverse filtering methods) of finding the pulse shape of a voice source is proposed; it assumes that the speech signal is a maximum-phase function of time if the glottis is opened and a minimum-phase function of time if the glottis is closed. Using this method, one has to determine only the closing time of the glottis, while the voice source is reconstructed as the inverse transform of the complex cepstrum with all its components with positive indices set equal to zero.

In [11], it is shown that the general form of the problem of finding the shape of a voice source is ill-posed: as a rule, its solution is not unique and is unstable with respect to data perturbations. To decrease the instability of solutions, the investigated pulses of a voice source are approximated by parametric mathematical models. In [9, 12–15], the so-called LF model (see [16]) is used. It turns out that this model describes male voices better than female ones. Our experiments show that the search for the parameters of the LF model is frequently an ill-posed problem and the obtained approximate solution is unstable.

It is possible to impose more restrictions on the shape of the area of the glottis than on the pulse shape of the voice source. In [17, 18], this is taken into account and the following two-stage procedure is proposed. At the first stage, the inverse filtering method is used to find the volume velocity $W_{vs}$ of the voice source.
At the second stage, it is converted to the area of the glottis by inversion of the equation of flow through the glottis (see Eq. (1) below). Then the area of the glottis is approximated by the $S_{vs}$ model, described as follows:

$$
S_{vs}(t) =
\begin{cases}
S_{\max} \sin^{p}\dfrac{\pi t}{2 t_1}, & 0 \le t \le t_1; \\[2mm]
S_{\max} \cos^{q}\dfrac{\pi (t - t_1)}{2 (t_2 - t_1)}, & t_1 \le t \le t_2; \\[2mm]
0, & t_2 \le t \le T_0,
\end{cases}
$$

where $t_1$ is the time when the area of the glottis is the largest, $t_2$ is the duration of the interval of the opened glottis, $p$ and $q$ are parameters, and $T_0$ is the fundamental period. In [19], properties of this model are considered.

There are other perturbation sources for speech signals: additive noise in the communication channel, the distance between the speaker and the microphone, the amplitude-frequency characteristics of the microphone, the reverberation of the housing, and errors in determining the initial (final) time of the opened glottis. This might lead to the conclusion that no exact reconstruction of the pulse shape of a voice source from the speech signal is possible. In the present paper, we attempt to find the pulse shape of a voice source numerically; the algorithm is based on a comparison of the spectra of a speech signal in the intervals of the closed and opened glottis. The solution error is empirically estimated in experiments using synthesized signals with various parameters, direct area measurements for the glottis, and signals synchronously recorded with a pair of microphones of different types.

2. PHYSICAL MEASUREMENTS AND COMPUTER SIMULATIONS OF VOICE SOURCES

It is known that an inverse problem can have a unique stable solution if sufficient information about the properties of the solution is used to find it. In the voice source problem, those properties have to be obtained from an analysis of the physics of the inverted process. Therefore, to solve the inverse problem of finding the pulse shape of a voice source, one has to find its corresponding properties determined by the physics of phonation.
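For reference, the two-parameter glottal area model $S_{vs}(t)$ introduced at the end of the Introduction can be evaluated numerically. The following is only a minimal sketch: the function name `glottal_area` and the parameter values in the usage note are hypothetical and are not taken from the paper.

```python
import numpy as np

def glottal_area(t, s_max, t1, t2, p, q):
    """Evaluate the S_vs area model on times t within one period (0 <= t <= T0)."""
    t = np.asarray(t, dtype=float)
    area = np.zeros_like(t)                      # zero on the closed interval t2 < t <= T0
    opening = (t >= 0.0) & (t <= t1)
    closing = (t > t1) & (t <= t2)
    area[opening] = s_max * np.sin(np.pi * t[opening] / (2.0 * t1)) ** p
    # Clip tiny negative cosine values caused by rounding before a fractional power.
    base = np.cos(np.pi * (t[closing] - t1) / (2.0 * (t2 - t1)))
    area[closing] = s_max * np.maximum(base, 0.0) ** q
    return area
```

For example, `glottal_area(np.linspace(0, 0.01, 161), 0.2, 0.004, 0.006, 2.0, 1.5)` returns one period sampled at 161 points, with the maximum area reached at `t1`; these numbers are illustrative assumptions only.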

2.1. Intermediate Measurements

In [20], a high-speed filming method was first proposed to register the shape of the glottis in the auto-oscillation regime. Later, digital image processing methods were applied to measure the area of the glottis during auto-oscillations (see [21]). Area variations of the glottis can also be measured by the transillumination method (see [22]) and by high-speed endoscopy of oscillations of the vocal folds. In those experiments, oscillatory asymmetry is found for the left and right folds; it was also found that their phases are shifted with respect to each other. Bending vibrations along the vocal folds are observed; sometimes, the second and third eigenfunctions of the elastic deformations are visible.

Various methods have been developed to measure the airflow velocity directly in the vocal tract: the so-called Rothenberg mask (see [23]), a reflectionless tube (see [24, 25]), a thermometric sensor (see [26]), etc. In [27], a Rothenberg mask is applied to record the flow velocity for phonation in different regimes. It is found that complete closing of the vocal folds does not always occur. Measurements of the pulse shape of the volume velocity obtained by means of the reflectionless tube show that the maximum of its derivative can exceed its minimum in magnitude and that an additional extremum of the volume velocity might occur right after opening of the glottis.

Few data are available on the properties of a voice source because methods of measuring the airflow and the area of the glottis are complicated. The known data do not cover all possible kinds of dependences of the vocal tract parameters on the resonance frequency, phonation regime, gender, and physiological characteristics of the speaker. Only by combining all accessible research methods for voice sources can more or less error-free information be obtained.

2.2. Excised Larynx

It is possible to observe the auto-oscillation shape of the vocal folds with an excised human larynx (see initial experiments in [28–30]). In [31], it was found that the displacement amplitude for the upper surface of the vocal folds can exceed their oscillation amplitude by two to four times. Similar phenomena are observed in [32]. Further experiments demonstrate various kinds of auto-oscillation instability (see [33, 34]), including theoretically predicted bifurcations [35]. This means that the shapes of neighboring pulses of a voice source can differ substantially from each other. This effect worsens the estimates of voice source parameters when several pulses are averaged. On the other hand, the presence of differences between neighboring pulses and the size of those differences can characterize the individuality of the speaker.

2.3. Physical Models of Voice Sources

In excised larynx experiments, the study of the impact of various mechanical and geometric properties of voice sources is restricted because such experiments are hard to set up. Another way to investigate the properties of voice sources is to use physical models instead of an excised larynx. Experimenting with such models, one can manipulate the model geometry and parameters to refine various oscillation aspects of the vocal folds. Pioneering experiments with physical models of voice sources are presented in [36–38]. Those models have proved quite useful for studying the behavior of an aerodynamic flow entering and exiting the glottis. It has been found that flow vortex formation occurs above the upper surface of the vocal folds.
In [39–44], the impact of mechanical parameters of vocal fold tissues has been estimated. In particular, bending deformations along the vocal folds are observed.

2.4. Mathematical and Computer Models of Voice Sources

Physical models are also limited in their ability to study parameter variations of voice sources. Therefore, several mathematical models (of various complexity levels) have been developed. The initial models (the one-mass model in [45] and the two-mass model in [46]) treat the vocal folds as a system with lumped parameters. In spite of its simplicity, the one-mass model is useful to study, e.g., the impact of fold asymmetry on the pulse shape of the volume velocity (see [47]). The two-mass model takes into account the phase displacement between the oscillations of the lower and upper edges of the vocal folds. Various modifications of the two-mass model are used. Three-mass and five-mass models have been developed (see [48, 49]). The properties of many-mass models are close to those of models with distributed parameters, which are used in many works as well (see [50–52]). In [1, 53, 54], an analytic description is presented for 3D elastic deformations of the vocal folds via eigenfunctions with respect to each dimension. In [55], a detailed study of auto-oscillations of the vocal folds is described: it was found that more than one maximum of the flow volume velocity can exist in the opened glottis interval.

3. FLOWS THROUGH THE GLOTTIS: MATHEMATICAL MODELS

Experiments with an excised larynx and physical and mathematical models of a voice source provide essential information about its properties. However, to solve the inverse problem of finding the pulse shape of the volume velocity through the glottis (or its time derivative) from a speech signal, phonation must be studied in more detail.

Acoustic oscillations are excited both for the opened and closed glottis. If the glottis is open, then a force $F_{vs}$ proportional to the derivative of the volume velocity of the airflow occurs at the exit from the glottis (this is the voice source itself). If the glottis is closed, then a force $F_p$ (a so-called piston source) proportional to the derivative of the volume velocity of the airflow occurs due to the vertical motion of the vocal folds:

$$
F = F_{vs} + F_p, \qquad F_{vs} = \rho_0 h_{vs} W'_{vs}(t), \qquad F_p = \rho_0 h_v S_v V'_v(t),
$$

where $\rho_0$ is the air density, $h_{vs}$ is the depth of the glottis, $W_{vs}(t)$ is the flow volume velocity through the glottis, $S_v$ is the upper surface area of the vocal folds, $h_v$ is the average displacement of the upper surface of the vocal folds, and $V_v$ is the average displacement velocity of the upper surface of the vocal folds (see [1]). It was found in [31] that the displacement amplitude of the upper surface of the vocal folds exceeds their oscillation amplitude by several times. In [23, 56], excitation is observed in the closed glottis interval. This phenomenon is described further in other works. In [57], the properties of piston sources are studied in detail with a 3D model of elastic oscillations of the vocal folds; it is found that the piston source essentially affects the shape of the voice source. In particular, there exist relations between the parameters of vocal fold tissues and the resonance frequency treated as a derivative of the flow $W_{vs}(t)$ such that additional extrema can occur in the closed glottis interval. The presence of a piston source contradicts the assumption about the existence of free damped oscillations in the closed glottis interval and causes errors in the method used in the current paper for estimating the pulse shape of a voice source.

In the general case, a flow through the glottis is described by the Navier–Stokes equations. They can be solved analytically only under specific conditions, which are not satisfied in the case of the glottis. For a real speech signal, the exact three-dimensional shape of the glottis is not known; therefore, the numerical solution of these equations contains an error that cannot be estimated. In [1], the following one-dimensional airflow equation is obtained (it takes into account the geometric parameters of the voice source and acoustic characteristics of the vocal tract):

$$
\rho_0 h_{vs} W'_{vs} + k_{vs} W_{vs} + \frac{\rho_0 c_x}{2 S_{vs}} W_{vs}^2 = (P_l - P_{vt}) S_{vs}, \tag{1}
$$

where $k_{vs}$ is the coefficient of viscous friction for capillary channels, $k_{vs} = 12 \mu l_{vs}^2 / S_{vs}^2$, $\mu$ is the coefficient of viscosity of air, $l_{vs}$ is the length of the vocal folds, $c_x$ is the coefficient of the decrease in pressure at the exit from the glottis, $P_l$ is the pressure under the vocal folds, and $P_{vt}$ is the vocal tract pressure above the vocal folds. If the area of the glottis is very small, i.e., $S_{vs} \to 0$, then the term with the coefficient of viscous friction dominates in the equation, which gives

$$
W_{vs} = (P_l - P_{vt}) S_{vs}^3 / (12 \mu l_{vs}^2), \tag{2}
$$

so that the volume velocity is proportional to the third power of the area of the glottis, while its derivative with respect to time is proportional to its second power.

Pressure $P_l$ is the sum of the pressure $P_L$ created in the lungs by contraction of the diaphragm and the acoustic pressure $P_{al}$ created by resonances of the infraglottic cavity: $P_l = P_L + P_{al}$. The lung pressure $P_L$ changes slowly, while the pressure $P_{al}$ changes according to the velocity of acoustic oscillations under the glottis. The first resonance of the infraglottic cavity has the greatest impact on $P_{al}$. In [58], the first three resonances of the infraglottic cavity are measured: $F_{1sbg} = 640$ Hz, $F_{2sbg} = 1400$ Hz, $F_{3sbg} = 2850$ Hz. Further measurements for a large number of speakers show that $F_{1sbg}$ ranges from 550 to 660 Hz (see [59, 60]).

The pressure over the glottis consists of three components: $P_{vt} = P_{vt0} + P_{avt} + P_t$. Here $P_{vt0}$ is the pressure determined by the resistance of the vocal tract to the direct (constant) flow. If the vocal tract has no constrictions with a low cross-sectional area, then $P_{vt0}$ is equal to the atmospheric pressure. If the cross-section in the constrictions of the vocal tract has a low area, then this pressure changes relatively slowly (with the velocity of articulatory motions). Similarly to the infraglottic cavity pressure, $P_{avt}$ is the acoustic pressure created by low resonances of the vocal tract. The first resonance frequency is in the range from 250 to 900 Hz (depending on the particular vowel). Computer simulation and analysis of the pulse shape of the voice excitation by means of the inverse filtering method show that the joint impact of infraglottic cavity resonances and resonances over the vocal folds creates so-called ripples in the airflow $W_{vs}$ (see [61]). Moreover, it frequently occurs that the rate at which the flow increases becomes lower while the rate at which it decreases becomes higher.

The pressure $P_t$ is created by flow turbulence at the exit from the glottis. If the Reynolds number Re at the exit from the glottis exceeds the threshold $\mathrm{Re}_{crit}$, then wideband noise occurs; the frequency of the first maximum of its spectrum is approximately $f_{1t} = 0.085\, W_{vs} l_{vs} / S_{vs}^2$. In the range of observable parameters, the frequency $F_{1t}$ of the greatest energy in the spectrum of the turbulent source lies in the range from 1300 to 1600 Hz. Therefore, the spectrum of a speech signal in the opened glottis interval is distorted by turbulent noise.
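To illustrate how Eq. (1) reduces to Eq. (2) for a small glottal area, the sketch below solves Eq. (1) with the inertial term dropped ($W'_{vs} = 0$), which leaves a quadratic equation in $W_{vs}$. This quasi-steady simplification, the function names, and all numerical constants (air density, viscosity, fold length, $c_x$, SI units) are assumptions made for illustration; the formulas for $k_{vs}$ and Eq. (2) are implemented exactly as printed above.

```python
import math

RHO0 = 1.2     # air density, kg/m^3 (assumed value)
MU = 1.8e-5    # viscosity of air, Pa*s (assumed value)

def quasi_steady_flow(area, dp, l_vs=0.015, c_x=1.0):
    """Nonnegative root W_vs of Eq. (1) with W'_vs = 0, for dp = P_l - P_vt >= 0."""
    if area <= 0.0:
        return 0.0
    k_vs = 12.0 * MU * l_vs ** 2 / area ** 2
    # Quadratic a*W^2 + k*W - dp*S = 0 with a = rho0*c_x/(2*S), so 4*a*dp*S = 2*rho0*c_x*dp.
    # The root is written in a form that avoids cancellation when k_vs is large.
    disc = math.sqrt(k_vs ** 2 + 2.0 * RHO0 * c_x * dp)
    return 2.0 * dp * area / (k_vs + disc)

def viscous_limit_flow(area, dp, l_vs=0.015):
    """Small-area limit, Eq. (2)."""
    return dp * area ** 3 / (12.0 * MU * l_vs ** 2)
```

For a very small area the two functions agree closely, while for larger areas the quadratic term in Eq. (1) reduces the flow below the viscous-limit value.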

4. QUANTITATIVE PROPERTIES OF VOICE SOURCES

Using various ways to measure the parameters of a flow through the glottis, together with mathematical modeling of auto-oscillations of the vocal folds, we can determine restrictions useful for solving the inverse problem of finding the pulse shape of the voice source.

Due to the physics of phonation, the flow velocity is nonnegative while the glottis is open: $W_{vs}(t) \ge 0$, $t_{op} \le t \le t_{cl}$, where $t_{op}$ is the opening time of the glottis and $t_{cl}$ is its closing time. At the opening (closing) time of the glottis, its area is equal to zero and we have $W_{vs}(t_{op}) = 0$, $W_{vs}(t_{cl}) = 0$. It is found that there exist phonation regimes in which the vocal folds do not join completely and the volume velocity of the flow through the glottis does not vanish. However, a speech signal contains no constant flow component, and one can assume that $W_{vs} = 0$ if $W_{vs} = \mathrm{const}$.

Apart from the volume velocity of the flow, its derivative vanishes at the ends of the interval of the opened glottis as well. It follows from (2) that $W'_{vs}(t_{op}) \to 0$ and $W'_{vs}(t_{cl}) \to 0$ as $S_{vs} \to 0$, i.e., in the domain of very low values of the area of the glottis the derivative of the volume velocity vanishes both at the opening time of the glottis and at its closing time. If complete joining of the folds is not achieved, then the derivatives at the ends of the interval of the opened glottis are different from zero, but their values are very low.

It is usually assumed that the closing velocity of the glottis exceeds its opening velocity and, therefore, $|\min W'_{vs}| \ge \max W'_{vs}$. However, direct flow measurements (in particular, measurements in [24] using a reflectionless tube and measurements in [62] using inverse filtering) show that there are observations such that $|\min W'_{vs}| < \max W'_{vs}$. Thus, the ratio $R_{ww} = \max W'_{vs} / |\min W'_{vs}|$ can be either greater or less than 1, and neither a minimum nor a maximum value of $R_{ww}$ is defined. In particular, it is hard to set such restrictions because it is not guaranteed that the pulse shape of the volume velocity is unimodal. As we know from Sections 2.1–2.4, bending vibrations along the vocal folds and phase displacements between the lower and upper fold edges can lead to rather complicated pulse shapes for the volume velocity. Indeed, examples of $W_{vs}$ with two local maxima are found in experiments with inverse filtering (see [62]) and in the results of computer simulation (see [55]); the first maximum can be either greater or less than the second. This property restricts the applicability of any parametric model that creates a unimodal shape for the volume velocity pulse.

Another quantitative constraint is provided by data on the distribution of the fraction $Q = (t_{cl} - t_{op}) / T_0$ of the fundamental period occupied by the opened glottis. In [27], it has been found that the average value of $Q$ depends on the phonation regime and lies between 0.57 and 0.75 for men and between 0.69 and 0.78 for women. In [63], the estimate $0.47 < Q < 0.84$, based on measurements from high-speed films (5000 fps) and electroglottograms, is presented. In [3], the database of electroglottogram measurements from [64] is used for three speakers (two men and one woman): the lower boundary $Q_{\min} = 0.25$ of the distributions of $Q$ is obtained. The range of expected values of $Q$ can be represented by the inequality $0.25 \le Q \le 1$ because there are phonation regimes in which the glottis area does not decrease to zero (this is particularly true for female voices).
Individual characteristics of a voice source for different speakers are expressed by the parameter

$$
R_{tt} = \frac{t_{op}^{\max} - t_{op}}{t_{cl} - t_{op}^{\max}},
$$

where $t_{op}$ is the glottis opening time, $t_{op}^{\max} = \operatorname{argmax}(W'_{vs})$ is the time when the airflow velocity is maximum, and $t_{cl}$ is the glottis closing time. For men, $R_{tt}$ is between 1.5 and 2.0; for women, it is between 1.38 and 1.67. If the volume velocity pulse has two maxima, then $R_{tt}$ can be less than 1.

5. RADIATION RESISTANCE

When solving the problem of determining the pulse shape of a voice source, one has to take into account the distance to the receiver of the speech signal and the type of receiver. In [65], it is shown that low frequencies increase in inverse proportion to the distance to the microphone (in the acoustic near field). This should be taken into account when analyzing signals received from mobile and landline phones. Also, the characteristics of a speech signal depend on whether the microphone receives the pressure, its gradient, or the velocity of air particles. Any handset of a landline phone contains an air chamber in front of the microphone; its acoustic characteristics depend on its volume and affect the spectrum of the speech signal. Because of those effects, the most complicated problem of voice source analysis is apparently the unknown characteristic of the communication channel, including the type of microphone. Below, we assume that the microphone is a pressure receiver; other factors are disregarded.

The radiation resistance $Z_l$ from a speaker’s mouth can be approximated (the error is about 10%) as

$$
Z_l = \frac{1}{2} \left( \frac{\omega r_l}{c_0} \right)^2 - j \frac{8 \omega r_l}{3 \pi c_0},
$$

where $r_l$ is the equivalent radius of the oral fissure and $c_0$ is the sound velocity (see [65]). The imaginary term is equivalent to the derivative with respect to time of the acoustic pressure on the lips. Therefore, simple integration of the speech signal is usually applied in problems of determining the pulse shape of a voice source; the full approximation is hard to use in the time domain because the first term depends on the second power of the frequency. Using the electroacoustical analogs for the radiation resistance proposed in [65], we obtain the following expression for the transfer function $H$ from the velocity of the airflow to the acoustic pressure, which can then be interpreted in the time domain:

$$
H(j\omega) = \frac{P}{V} = -\frac{1 - j \omega T_1}{1 + j \omega T_2},
$$

where $P$ is the acoustic pressure at the exit of the mouth, $V$ is the velocity of air particles, $T_1 = 0.5 \times 10^{-5} S$, $T_2 = 2.3 \times 10^{-5} S$, $S$ is the oral fissure area, and $0 < S < 5$ cm². Then the dependence of the velocity $V$ on the pressure $P$ in the complex frequency domain is as follows:

$$
V (1 - j \omega T_1) = -P (1 + j \omega T_2).
$$

Taking into account that multiplication by $j\omega$ in the frequency domain corresponds to differentiation in the time domain, we express this dependence as follows:

$$
T_1 V' - V = P + T_2 P'.
$$

Let us approximate this differential equation by the finite difference method and assign

$$
V'(t) \approx [V(t) - V(t - \Delta t)] / \Delta t, \qquad P'(t) \approx [P(t) - P(t - \Delta t)] / \Delta t.
$$

This yields the recursive equation

$$
V(t) = \frac{\Delta t}{\Delta t - T_1} \left[ \frac{T_2}{\Delta t} P(t - \Delta t) - P(t) \left( 1 + \frac{T_2}{\Delta t} \right) - \frac{T_1}{\Delta t} V(t - \Delta t) \right]. \tag{3}
$$

In this paper, we assume that the microphone used delivers the pressure $P(t)$ and that the speech signal is preprocessed according to (3). All subsequent operations (including computation of the spectrum) are applied to the signal $f(t) = V(t)$. For $t = 0$, it is assumed that $t = 0 + \Delta t$.
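The preprocessing step described by recursion (3) can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function name `pressure_to_velocity` is hypothetical, the oral fissure area is taken in cm² as in the definitions of $T_1$ and $T_2$, and the initial value $V = 0$ at the first sample is an assumption.

```python
import numpy as np

def pressure_to_velocity(p, dt, area_cm2):
    """Convert a sampled pressure signal p (step dt, seconds) to velocity via recursion (3)."""
    t1 = 0.5e-5 * area_cm2
    t2 = 2.3e-5 * area_cm2
    if dt <= t1:
        # The factor dt / (dt - T1) is undefined or changes sign in this case.
        raise ValueError("sampling step dt must exceed T1")
    p = np.asarray(p, dtype=float)
    v = np.zeros_like(p)            # assumption: V = 0 at the first sample
    gain = dt / (dt - t1)
    for i in range(1, len(p)):
        v[i] = gain * ((t2 / dt) * p[i - 1]
                       - p[i] * (1.0 + t2 / dt)
                       - (t1 / dt) * v[i - 1])
    return v
```

Note that the homogeneous part of the recursion multiplies the previous value by $-T_1 / (\Delta t - T_1)$, so the scheme decays only when $\Delta t > 2 T_1$; for a constant pressure the output then settles at $V = -P$, in agreement with $T_1 V' - V = P + T_2 P'$.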

6. SEARCH FOR THE SOURCE

There are two main methods for determining the pulse shape of a voice source: the parametric approach, which postulates a time form of the source depending on unknown parameters, and the spectral approach, which finds the spectral function of the source from a speech signal. Using parametric models to solve the inverse problem of finding a voice source promotes stabilization of the sought solution. On the other hand, this can impose physically improbable shapes on the solution. One more problem in analyzing a voice source in the time domain is to determine the phase of a speech signal, because some amplifiers can invert a signal. By applying the spectral approach to a voice source, one can dispense with its parametric models.

All methods for determining the shape of a voice source yield solutions in arbitrary units. In this paper, the functions of the voice source and its volume velocity are normalized to 1, but the terms “voice source” and “volume velocity” are preserved.

In [66], the following ratio of short-term spectra of the speech signal for an opened and closed glottis within the fundamental period is proposed to determine the spectrum of the source:

$$
S_{vs}^{\alpha(\delta)}(j\omega) = \frac{S_{cl}^{*}(j\omega)\, S_{op}(j\omega)}{S_{cl}^{*}(j\omega)\, S_{cl}(j\omega) + \alpha(\delta)(1 + \omega^2)}. \tag{4}
$$

Here $\alpha(\delta)$ is the regularization parameter, which has to be selected (in a special way) depending on the error $\delta$ in measuring a speech signal and computing the spectra (see [67]), $S_{op}(j\omega)$ and $S_{cl}(j\omega)$ are the short-term spectra in the opened and closed glottis intervals, respectively, and $S_{cl}^{*}(j\omega)$ is the complex conjugate of $S_{cl}(j\omega)$. An advantage of this approach is its invariance (for $\alpha = 0$) or low sensitivity with respect to multiplicative distortions of the speech channel, i.e., to the channel impact, in the ideal case. Another advantage is its independence from possible inversions of a speech signal. Undoubtedly, such an approach is an idealization (taking into account the properties of a speech signal in the intervals of the opened and closed glottis). The only way to check its efficiency is by experiment.

It is assumed that if there is no additive noise, then the inverse discrete Fourier transform of $S_{vs}^{\alpha(\delta)}(j\omega)$ corresponds to the voice source function $G(t) = W'_{vs}(t)$ in the time segment $[t_{op}, t_{cl}]$ provided that the initial conditions are zero. However, for any fundamental period apart from the first, the interval of the opened glottis of a speech signal contains damped oscillations $f_f$ inherited from the previous period:

$$
f_f(\tilde{t}) = \sum_{n=1}^{N} \left( k_{1n} \cos(\omega_n \tilde{t}) + k_{2n} \sin(\omega_n \tilde{t}) \right) e^{-b_n \tilde{t}}, \qquad 0 \le \tilde{t} \le T_0, \qquad \tilde{t} = t - t_{op}, \tag{5}
$$

where $t_{op}$ is the opening time of the glottis, $N$ is the number of resonance frequencies $F_n$ taken into account, $\omega_n = 2\pi F_n$, $b_n$ are the damping coefficients, $T_0$ is the duration of the current fundamental period, and $t$ is the current time. Therefore, those oscillations should be compensated before we apply Eq. (4) to find the shape of the source. Such a compensation is possible as follows: measure the resonance frequencies of the vocal tract and their damping coefficients and amplitudes in the previous interval of the closed glottis, compute $f_f(t)$, and subtract it from the original speech signal $f(t)$: $\tilde{f}(t) = f(t) - f_f(t)$. The coefficients in Eq. (5) are computed to satisfy the conditions $\tilde{f}(0) = 0$, $\tilde{f}'(0) = 0$. Since those conditions can be met by various choices of the coefficients $k_{1n}$, $k_{2n}$, they are defined from the initial conditions and the relative amplitudes

$$
A_n = \frac{|S_{cl}(j\omega_n)|}{\sum_{i=1}^{N} |S_{cl}(j\omega_i)|}
$$

of peaks of the modulus of the spectrum $S_{cl}(j\omega)$ as follows:

$$
k_{1n} = A_n f(0), \qquad k_{2n} = \frac{A_n [f'(0) + b_n f(0)]}{\omega_n}.
$$

To obtain the damping coefficients $b_n$ of the resonance oscillations, we can use the linear prediction method, though their stability is not guaranteed. Another variant is to use the averaged dependence of the width of the resonance band on the resonance frequency obtained in experiments (see [68, 69]). In this paper, data from [57] are used.

The initial conditions affect the shape of the spectrum in the closed glottis interval as well. To compensate for this impact, one can compute the spectrum $S_{cl}(j\omega)$ of the impulse response of a system of second-order differential equations with eigenfrequencies and dampings defined in the current fundamental period. However, this strengthens the impact of the channel characteristic, which worsens the stability of the speaker recognition system with respect to changes in the microphone.

In papers devoted to analyzing the pulse shape of voice sources, it has been noted that the choice of a particular method to compute the short-term spectrum significantly affects the result (see, e.g., [6]). Different durations of the intervals of the opened and closed glottis and a small number of samples in short intervals substantially hinder computation of the spectrum and affect the error of the solution. In our study, we considered several different ways of computing the spectrum. The most acceptable results are obtained by periodization of signals in the intervals of the opened and closed glottis, such that the duration of 256 samples is fixed.

In [66], the inverse Fourier transform of $S_{vs}^{\alpha(\delta)}(j\omega)$ is used to find the extrema of a voice source and volume velocity. The speaker recognition error varies from 5% for one word to 0.1% for a ten-word sequence. Those results can be treated as acceptable from the practical viewpoint, but a detailed study of the process of solving the inverse problem shows that the method has to be substantially modified.

To find $S_{vs}^{\alpha(\delta)}(j\omega)$ from (4), one has to fix the opening time $t_{op}$ and the closing time $t_{cl}$ of the glottis. In this work, the algorithm described in [3] is used for this. Such times are determined with an error. Computer simulation shows that the regularization parameter $\alpha$ depends on the shape of the spectrum of a speech signal, i.e., on the phonetic composition. The oral fissure area $S$ depends on the type of pronounced vowel as well. Fixing the value of any such parameter inevitably worsens the solution to the inverse problem. Therefore, it is necessary to find a criterion $K$ and values of those parameters that are optimal from the viewpoint of this criterion.
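The regularized spectral ratio of Eq. (4), combined with the fixed 256-sample periodization mentioned above, can be sketched as follows. Several points are assumptions and not taken from the paper: "periodization" is interpreted here as cyclic repetition of each segment up to 256 samples, $\omega$ is taken in rad/s from the sampling step, and the function names are hypothetical.

```python
import numpy as np

def periodize(segment, n_fft=256):
    """Repeat a segment cyclically (or truncate it) to exactly n_fft samples."""
    return np.resize(np.asarray(segment, dtype=float), n_fft)

def regularized_spectral_ratio(f_open, f_closed, alpha, dt, n_fft=256):
    """Return the spectrum of Eq. (4) and its inverse FFT (a source estimate in arbitrary units)."""
    s_op = np.fft.fft(periodize(f_open, n_fft))
    s_cl = np.fft.fft(periodize(f_closed, n_fft))
    omega = 2.0 * np.pi * np.fft.fftfreq(n_fft, d=dt)   # assumed frequency units: rad/s
    denom = np.abs(s_cl) ** 2 + alpha * (1.0 + omega ** 2)
    ratio = np.conj(s_cl) * s_op / denom
    # For real inputs the ratio is Hermitian-symmetric, so the inverse FFT is real up to rounding.
    source = np.real(np.fft.ifft(ratio))
    return ratio, source
```

With `alpha = 0`, multiplying both segments by the same constant (including a sign inversion) leaves the ratio unchanged, which is the channel invariance noted for Eq. (4); a positive `alpha` damps the ratio, most strongly at high frequencies.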
The criterion $K$ is formed from the properties of the volume velocity function and its derivative at the ends of the interval with the opened glottis (see Sections 3 and 4). Also, one has to take into account the fact that if the glottis is open, then the airflow moves from the lungs to the vocal tract and this direction never changes to the opposite: $W_{vs}(t) \ge 0$, $t_{op} \le t \le t_{cl}$. When the inverse problem is solved, negative values of the volume velocity can nevertheless occur. They correspond to the airflow from the vocal tract to the lungs; i.e., it is possible that there are times such that $W_{vs} < 0$. Let us denote the mean of such flow by $W_{neg}$,

$$W_{neg} = \frac{1}{t_{cl} - t_{op}} \int_{t_{op}}^{t_{cl}} W_{vs}\, dt, \qquad W_{vs} < 0,$$

and impose the requirement that $W_{neg} = \min$.
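As an illustration of the definition of the mean negative flow, the sketch below integrates only the negative part of a sampled volume velocity over the open interval and divides by the interval length. The function name, the argument layout, and the use of the trapezoidal rule are assumptions made for this example.

```python
import numpy as np


def mean_negative_flow(w, t, t_op, t_cl):
    """Integrate the negative part of w over [t_op, t_cl] and divide by the interval length."""
    w = np.asarray(w, dtype=float)
    t = np.asarray(t, dtype=float)
    inside = (t >= t_op) & (t <= t_cl)
    neg = np.minimum(w[inside], 0.0)  # keep samples with W < 0, zero elsewhere
    tt = t[inside]
    # Trapezoidal rule written out explicitly
    integral = np.sum(0.5 * (neg[1:] + neg[:-1]) * np.diff(tt))
    return integral / (t_cl - t_op)
```

A solution that never goes negative inside the open interval yields exactly zero, so this quantity penalizes only physically implausible backflow.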


Then the optimality criterion is defined as follows:

$$K = W_{vs}^2(t_{op}) + W_{vs}^2(t_{cl}) + d_1\left\{[W_{vs}'(t_{op})]^2 + [W_{vs}'(t_{cl})]^2\right\} + d_2 W_{neg}^2 \to \min, \qquad (6)$$

where $d_1$ and $d_2$ are coefficients that set the weight of each term and account for its dimension. In our experiments, $d_1 = d_2 = 0.1$. To find the volume velocity $W_{vs}(t)$, we integrate the voice source function $G(t)$; therefore, $W_{vs}(t_{op}) = 0$. To minimize the optimality criterion $K = K(S, t_{op}, t_{cl}, \alpha)$, we use methods for finding the constrained minimum (see [70, 71]) under the following constraints:

$$0.1\ \text{cm}^2 < S < 5\ \text{cm}^2, \qquad 0 < \alpha < 1,$$

$$t_{cl}^{(0)} - 0.03T_0 < t_{cl} < t_{cl}^{(0)} + 0.03T_0, \qquad t_{op}^{(0)} - 0.1T_0 < t_{op} < t_{op}^{(0)} + 0.1T_0,$$

where $t_{cl}^{(0)}$ and $t_{op}^{(0)}$ are the glottis closing and opening times according to [3]. Note that, in general, the minimum value of criterion $K$ for the sought solution does not guarantee a physically plausible result. For example, the value of the volume velocity when the glottis closes can be unacceptably large. Therefore, once problem (6) is solved, it is necessary to verify the quality of the solution using physically reasonable qualitative criteria. The following requirements are taken as such criteria: the obtained solution is rejected if (a) at least one of the values $W_{vs}(t_{cl}) / \max(W_{vs}(t))$ and $W_{neg} / \int_{t_{op}}^{t_{cl}} W_{vs}(t)\, dt$ exceeds its a priori determined threshold or (b) the restrictions $0.25 \le Q \le 0.9$ and $1 \le R_{tt} \le 2$ are violated.

7. NUMERICAL EXPERIMENTS

Below, we describe experiments on the impact of various conditions on errors in solving the inverse problem to determine the shape of a voice source via a speech signal. At the first stage, synthesized signals simulating vowels are used. For such signals, there is no radiation resistance, additive noise, or distortion of the communication channel; the initial and final times of the voice source action are known; and the excitation source itself is also known. This makes it possible to estimate the method error by comparing the obtained solutions with the shape of the synthetic excitation source. Six synthetic vowels corresponding to the phonetic characteristics of the cardinal Russian vowels are used. To synthesize speech signals, we sum the oscillations at four fixed resonance frequencies; the oscillations are excited by a source (see [72]) modified such that its integral is equal to zero. Tables 1 and 2 list the resonance frequencies and the width of the resonance band.

Table 1. Formant frequencies (Hz)

F1     F2     F3     F4
600    1200   2300   3500
500    910    2320   2630
408    860    2040   2760
390    2272   3100   4000
486    1380   1870   2570
490    1350   2230   2770

Table 2. Bandwidth (Hz)

ΔF1    ΔF2    ΔF3    ΔF4
80     50     80     100
100    50     70     90
150    40     50     70
50     70     80     80
80     50     50     60
70     40     60     80

The signals are synthesized for various combinations of the voice source parameter values: the values of the fundamental frequency F0 are 83.3, 100, 125, 167, 250, and 333 Hz; the values of the fraction Q of the duration of the opened glottis interval over the fundamental period are 0.25, 0.5, 0.75, and 0.9; the values of the fraction of time with the greatest value of the volume velocity over the duration of the opened glottis interval are 1, 1.25, 1.5, and 2; and the values of the fraction of the greatest voice source value over its smallest are 0.2, 0.4, 0.6, and 0.8. In total, 4608 pulses were studied.
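The synthesis procedure described above can be illustrated with a short sketch that sums the responses of four damped resonances, taken from the first rows of Tables 1 and 2, to a zero-integral excitation. The sampling rate, the raised-cosine flow pulse standing in for the source of [72], the mapping of a bandwidth ΔF to the decay rate πΔF, and all function names are assumptions made for illustration only.

```python
import numpy as np

FS = 16000                             # sampling rate, Hz (assumption)
F = (600.0, 1200.0, 2300.0, 3500.0)    # first row of Table 1, Hz
DF = (80.0, 50.0, 80.0, 100.0)         # first row of Table 2, Hz


def source_period(f0, q, fs=FS):
    """One period of a stand-in source: discrete derivative of a raised-cosine flow pulse."""
    n_period = int(round(fs / f0))
    n_open = int(round(q * n_period))
    flow = np.zeros(n_period)
    k = np.arange(n_open)
    flow[:n_open] = 0.5 * (1.0 - np.cos(2.0 * np.pi * k / n_open))
    # The differences telescope to flow[-1], which is zero whenever q < 1
    return np.diff(flow, prepend=0.0)


def resonator_response(freq, bandwidth, n, fs=FS):
    """Impulse response of one damped resonance with decay rate pi * bandwidth."""
    t = np.arange(n) / fs
    return np.exp(-np.pi * bandwidth * t) * np.sin(2.0 * np.pi * freq * t)


def synthesize(f0=100.0, q=0.5, n_periods=8, fs=FS):
    """Return (source, speech): a periodic source and the sum of four resonance outputs."""
    source = np.tile(source_period(f0, q, fs), n_periods)
    speech = np.zeros(source.size)
    for freq, bw in zip(F, DF):
        h = resonator_response(freq, bw, source.size, fs)
        speech += np.convolve(source, h)[: source.size]
    return source, speech
```

Because the excitation is known exactly in such a setup, any reconstructed pulse can be compared directly with `source` and its running sum.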

Figures 1 and 2 show examples of synthetic signals and excitation sources and compare the original and computed sources. From Fig. 2 (where thin lines represent solutions), it is clear that, in the considered case, the shapes of the computed excitation source and of its volume velocity vary little from pulse to pulse; i.e., the solution is stable from one pulse to another. The solution is, however, delayed in time. In the problem of speaker recognition via a voice source, we are interested in the pulse shape and can disregard the delay of the solution. If this delay is compensated by minimizing the mean-square error between the original model pulse shape and the computed pulse shape of the volume velocity, then the relative mean-square error in reconstructing the pulse shape of the volume velocity is less than 0.035 for every pulse except the first (see Table 3). The reconstruction error for the voice source is several times larger than that for the volume velocity, but it is still acceptable. Figure 3 shows the distribution of the mean-square error of the flow volume velocity between the original excitation source and the solution to the inverse problem; all possible combinations of voice source parameters are taken into account. We see that the most probable error is less than 5%, but there are also many solutions with large errors. At this stage, it is confirmed that the volume velocity error is several times smaller than the error of its derivative. For the derivative, the most probable error is about 20%, and the probability of an error decreases rapidly as the error itself increases. Usually, large errors occur for short intervals containing small numbers of samples.

Fig. 1. Excitation source and synthesized signal with arbitrary units along the y axis. [Panels: Vocal source, Speech signal; time axis in s.]

Fig. 2. Normalized excitation source and volume velocity (thick curve) and solution to inverse problem for eight pulses (thin curves). [Panels: Vocal source, Volume velocity; time axis in ms.]

Table 3. Determination errors: voice sources and volume velocity functions for synthesized signals

Pulse            1      2      3      4      5      6      7      8
Source           0.152  0.162  0.164  0.164  0.164  0.164  0.164  0.164
Volume velocity  0.051  0.031  0.034  0.034  0.033  0.034  0.034  0.034

The next series of experiments uses data on the area $S_{vs}$ of the glottis; the area is measured according to the methods of [21], and a speech signal is recorded synchronously. The volume velocity of the airflow through the glottis and its derivative with respect to time, i.e., the voice source, are computed according to relation (1). Figure 4 displays the original signals for the solution to the inverse problem; the greatest value of the area of the glottis is assumed to be 0.2 cm², while the pressure drop across the glottis is assumed to be 1500 Pa. For the example in Fig. 4, the solution to the inverse problem is shown in Fig. 5. Once the delay is compensated, the error in determining the volume velocity (in the considered case) is 0.1245, 0.1024, 0.0866, and 0.0738, while the error in determining the shape of the voice source is much larger: 0.4605, 0.5098, 0.4172, and 0.3151. Similar error relations hold for other solutions in this series of experiments.

One more series of experiments uses a database containing synchronous recordings of a speech signal through two microphones of different types located at different distances from the speaker. The following sound receivers are used: the handset of a stationary phone, a headset, a directional microphone, an omnidirectional microphone, and a cardioid microphone. As mentioned above, the use of (4) is expected to decrease the scatter of solutions with respect to the voice source caused by the impact of the microphone characteristics. In [73], the inverse filtering method is used to determine the pulse shape of a voice source; it was found that solutions for the same utterance synchronously recorded via microphones of different types have low similarity. If the inverse problem is solved according to (4), then the greatest discrepancy in the solutions is 0.4. Figure 6 shows an example of solutions for a handset–directional microphone pair: the mean discrepancy between pulses from the first and second microphone is 0.34. Here, the duration of each volume velocity pulse is normalized to 100 samples to compare pulse shapes. Apparently, for signals from different microphones, variations in the pulse shape of a vocal source compensate each other, so that the discrepancy between microphones for pulse shapes averaged over a vowel segment is much smaller. Figure 7 shows the averaged pulse shapes for the handset–directional microphone pair: the mean-square error of the volume velocity is 0.07, while the error of the vocal source is 0.24.

8. DISCUSSION

The proposed method of solving the inverse problem of determining the pulse shape of voice sources has been tested on various kinds of input data.
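The shape comparisons reported for these experiments involve two operations named in the text: normalizing each pulse to 100 samples and compensating the delay by minimizing the mean-square error. A minimal sketch of both follows. It assumes linear interpolation for the length normalization, integer circular shifts for the delay search, amplitudes that are already normalized, and a relative error defined as the squared difference divided by the energy of the reference; none of these details is specified in the paper, and the function names are hypothetical.

```python
import numpy as np


def normalize_length(pulse, n=100):
    """Resample a pulse to n samples by linear interpolation."""
    pulse = np.asarray(pulse, dtype=float)
    old = np.linspace(0.0, 1.0, pulse.size)
    new = np.linspace(0.0, 1.0, n)
    return np.interp(new, old, pulse)


def delay_compensated_error(ref, est, max_lag=20):
    """Smallest relative mean-square error over circular shifts of est; returns (error, lag)."""
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    best_err, best_lag = np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        err = np.sum((ref - np.roll(est, lag)) ** 2) / np.sum(ref ** 2)
        if err < best_err:
            best_err, best_lag = err, lag
    return best_err, best_lag
```

The returned lag is the shift that has to be applied to the estimate, so a solution delayed by five samples is reported with a lag of minus five.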

Fig. 3. Error distribution for determining volume velocity for synthesized signals. [Axes: Probability versus Discrepancy, %.]

Fig. 4. Sound pressure oscillogram, area of glottis, flow volume velocity, and its derivative. [Panels: Speech signal, Vocal slit area, Volume velocity, Vocal source.]

Fig. 5. Measured (thick curves) and computed (thin curves) volume velocity and its derivative. [Panels: Normalized volume velocity, Normalized volume source.]

For synthetic signals preserving the pulse shape of a voice source from one period to another, the solution to the inverse problem differs slightly for each fundamental period. The reason is that the problem of determining the pulse shape of a voice source is unstable with respect to variations in the data, e.g., those caused by sampling. The smoothing that occurs in the transition from the voice source to the volume velocity decreases the errors of the solution. Note that we treat the volume velocity and the voice source as signals proportional to the true volume velocity of the airflow and to its derivative with respect to time, respectively, because the absolute values cannot be reconstructed due to the properties of the wave equation of the vocal tract. The error of the input data in the shape of the spectra $S_{op}(j\omega)$ and $S_{cl}(j\omega)$ increases as the intervals of the opened (closed) glottis become shorter and the difference in their durations increases. Therefore (see Fig. 1), a substantial part of the solutions have large errors, though small errors (below 5%) are more probable. Physical measurements of the velocity of the flow through the glottis show that very short intervals with the closed glottis can occur; for such cases, it is reasonable to expect that the solution either has a large error or is rejected.
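The smoothing effect of passing from the voice source to the volume velocity can be seen in a small numerical illustration. The sketch below perturbs a model source with a rapidly alternating error and integrates it back to a volume velocity. The raised-cosine pulse, the perturbation, and its amplitude are assumptions chosen only to make the effect visible; they are not taken from the experiments.

```python
import numpy as np


def relative_error(ref, est):
    """Euclidean norm of the difference divided by the norm of the reference."""
    return np.linalg.norm(est - ref) / np.linalg.norm(ref)


n = 200
k = np.arange(n)
flow = 0.5 * (1.0 - np.cos(2.0 * np.pi * k / n))  # model volume velocity pulse
source = np.diff(flow, prepend=0.0)               # its discrete derivative

ripple = 0.01 * (-1.0) ** k                       # sample-to-sample perturbation
source_noisy = source + ripple
flow_from_noisy = np.cumsum(source_noisy)         # integration back to volume velocity

err_source = relative_error(source, source_noisy)
err_flow = relative_error(flow, flow_from_noisy)
```

Here `err_flow` comes out much smaller than `err_source`, since summation cancels errors that change sign from sample to sample. A slowly varying bias would instead accumulate under integration, so the example covers only the high-frequency part of the error.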

It would be most informative to use data where the shape of the voice source is known. Unfortunately, the volume of data where the area of the glottis or the volume velocity is measured synchronously with the recording of a speech signal is very small; such data can be regarded as inaccessible. This prevents us from obtaining estimates statistically comparable with those for synthetic signals. Therefore, the results of the experiment with a known area of the glottis presented above merely show that a relatively small solution error is attainable; they cannot be used to estimate the scatter of solutions under various conditions.

In speaker recognition problems, it is assumed that the probability of correct recognition mostly depends on the difference between the training and recognition conditions (in particular, the difference in microphone types). If (4) contains no additive noise and no regularization terms of the type $\alpha(1 + \omega^2)$, then, in theory, the ratio of spectra is invariant with respect to multiplicative signal distortions. However, regularization is necessary. Thus, we can only speak about weakening the impact of the channel characteristics (instead of invariance). Moreover, additive channel noise always exists in real conditions and cannot be completely eliminated; this again disrupts the invariance. Also, suppression of additive noise always distorts a speech signal, which further increases the error of the solution. Experiments with signals synchronously recorded from a pair of microphones of different types under real reverberation and channel-noise conditions show that a moderate difference in the solutions can be achieved for the volume velocity provided that the solutions are averaged over the whole vowel segment. The estimates of solution errors for the flow volume velocity presented above are only isolated examples; under real conditions, much higher determination errors for a source (compared with the case of synthetic signals) should be expected. Further investigations should demonstrate whether the described analysis method for voice sources is efficient in speaker recognition problems.

Fig. 6. Normalized volume velocity for handset (─) and directional microphone (---). Segment of vowel /a/ in syllable /fan/. Normalized time samples are marked along the x axis.

Fig. 7. Mean functions of normalized volume velocity (top) and normalized voice source (bottom) for handset (─) and directional microphone (---). Segment of vowel /a/ in syllable /fan/. Normalized time samples are marked along the x axis.

CONCLUSIONS

We have demonstrated an algorithm using segmented speech signals to find functions proportional to a voice source and to the volume velocity of the airflow through the glottis for any fundamental period. The algorithm is based on minimization of the optimality criterion for the sought solution with respect to the parameters determining a speech signal in the considered period (such as the oral fissure area and the opening and closing times of the glottis). The criterion is computed via the volume velocity and its derivative at the opening (closing) of the glottis. To find the function of a voice source and volume velocity (for the given parameters), a specialized procedure is used in which the source is defined as the inverse Fourier transform of the regularized fraction of the short-term speech-signal spectra in the intervals of the opened (closed) glottis within the current fundamental period. The experimental error in determining the optimal volume velocity is several times smaller than the corresponding error for the optimal voice source. The discrepancy in the solutions with respect to the volume velocity found in analyzing speech signals synchronously recorded through different types of microphones is substantially less than the discrepancy in the solutions for a voice source.

REFERENCES

1. V. N. Sorokin, Theory of Speech Production (Radio i Svyaz’, Moscow, 1985) [in Russian].
2. A. S. Leonov, I. S. Makarov, and V. N. Sorokin, “Frequency modulations in the speech signal,” Acoust. Phys. 55 (6), 876–887 (2009).
3. V. N. Sorokin, “Segmentation of the period of the fundamental tone of a voice source,” Acoust. Phys. 62 (2), 244–254 (2016).
4. J. D. Markel and A. H. Gray, Linear Prediction of Speech (Springer-Verlag, 1976).
5. D. Wong, J. Markel, and A. Gray, “Least squares glottal inverse filtering from the acoustic speech waveform,” IEEE Trans. Acoust., Speech, Signal Processing 27, 350–355 (1979).
6. T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, “Detection of glottal closure instants from speech signals: a quantitative review,” IEEE Trans. Audio, Speech, Language Processing 20 (3), 994–1006 (2012).
7. P. Milenkovic, “Glottal inverse filtering by joint estimation of an AR system with a linear input model,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-34 (1), 28–42 (1986).
8. P. Alku, J. Svec, E. Vilkman, and F. Sram, “Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering,” Speech Commun. 11, 109–118 (1992).
9. Q. Fu and P. Murphy, “Robust glottal source estimation based on joint source-filter model optimization,” IEEE Trans. Audio, Speech, Language Process. 14 (2), 492–501 (2006).
10. T. Drugman, B. Bozkurt, and T. Dutoit, “Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation,” Speech Commun. 53, 855–866 (2011).
11. A. S. Leonov and V. N. Sorokin, “On the uniqueness of determination of a vocal source from a speech signal and formant frequencies,” Dokl. Math. 85 (3), 432–435 (2012).
12. A. Isaksson and M. Millnert, “Inverse glottal filtering using a parametrized input model,” Signal Process. 18 (4), 435–446 (1989).
13. H. Strik and L. Boves, “On the relation between voice source parameters and prosodic features in connected speech,” Speech Commun. 11, 167–174 (1992).
14. D. Childers and Ch. Ahn, “Modeling the glottal volume velocity waveform for three voice types,” J. Acoust. Soc. Am. 97 (1), 505–519 (1995).
15. H. Strik, B. Cranen, and L. Boves, “Fitting a LF-model to inverse filter signals,” in Proc. Eurospeech Conf. (Berlin, 1993), pp. 103–106.
16. G. Fant, L. Liljencrants, and Q. Lin, “A four parameter model of glottal flow,” STL-QPSR 4, 1–13 (1985).
17. V. N. Sorokin and I. S. Makarov, “Gender recognition from vocal source,” Acoust. Phys. 54 (4), 571–578 (2008).
18. V. N. Sorokin, A. A. Tananykin, and V. G. Trunov, “Speaker recognition using vocal source model,” Pattern Recogn. Image Anal. 24 (1), 156–173 (2014).
19. A. S. Leonov and V. N. Sorokin, “Two parametric voice source models and their asymptotic analysis,” Acoust. Phys. 60 (3), 323–334 (2014).
20. D. W. Farnsworth, “High-speed motion pictures of the human vocal cords,” Bell Lab. Rec. 18 (7), 203–208 (1940).
21. D. G. Childers, A. Paige, and A. Moore, “Laryngeal vibration patterns. Machine-aided measurements from high-speed film,” Archiv. Otolaryngol. 102, 407–410 (1976).
22. L. Lisker, A. S. Abramson, F. S. Cooper, and M. H. Malcolm, “Transillumination of the larynx in running speech,” J. Acoust. Soc. Am. 45 (6), 1544–1546 (1969).
23. M. Rothenberg, “A new inverse filtering technique for deriving the glottal air flow during voicing,” J. Acoust. Soc. Am. 53 (6), 1632–1645 (1973).
24. M. M. Sondhi, “Measurement of a glottal waveform,” J. Acoust. Soc. Am. 57, 228–232 (1975).
25. R. B. Monsen and A. M. Engebretson, “Study of variations in the male and female glottal wave,” J. Acoust. Soc. Am. 62 (4), 981–993 (1977).
26. K. Kitajima, N. Isshiki, and M. Tanabe, “Use of a hot-wire flow meter in the study of laryngeal function,” Stud. Phonolog. 12, 25–30 (1978).
27. E. B. Holmberg, R. E. Hillman, and J. S. Perkell, “Glottal airflow and transglottal air measurements for male and female speakers in soft, normal, and loud voice,” J. Acoust. Soc. Am. 84 (2), 511–529 (1988).
28. J. van den Berg, “Myoelastic-aerodynamic theory of voice production,” J. Speech Hear. 1, 227–244 (1957).
29. J. van den Berg and T. S. Tan, “Results of experiments with human larynxes,” Pract. Otorhinolaryngol. 21, 425–450 (1959).
30. J. van den Berg, “Sound productions in isolated human larynges,” Ann. New York Acad. Sci. 155, 18–27 (1960).
31. T. Baer, “Observation of vocal fold vibration: measurement of excised larynges,” in Vocal Fold Physiology, Ed. by K. N. Stevens and M. Hirano (Univ. of Tokyo, Tokyo, 1981), pp. 119–133.
32. R. Boessenecker, D. A. Berry, J. Lohscheller, U. Eysholdt, and M. Döllinger, “Mucosal wave properties of a human vocal fold,” Acta Acust. Acust. 93, 815–823 (2007).
33. J. J. Jiang, Y. Zhang, and C. N. Ford, “Nonlinear dynamics of phonations in excised larynx experiments,” J. Acoust. Soc. Am. 114, 2198–2205 (2003).
34. I. T. Tokuda, J. G. Horáček, and H. Herzel, “Comparison of biomechanical modeling of register transitions and voice instabilities with excised larynx experiments,” J. Acoust. Soc. Am. 122, 519–531 (2007).
35. I. Steinecke and H. Herzel, “Bifurcations in an asymmetric vocal fold model,” J. Acoust. Soc. Am. 97, 1571–1578 (1995).
36. X. Pelorson, A. Hirschberg, R. R. van Hassel, A. P. J. Wijnands, and Y. Auregan, “Theoretical and experimental study of quasisteady flow separation within the glottis during phonation. Application to a modified two-mass model,” J. Acoust. Soc. Am. 96, 3416–3431 (1994).
37. X. Pelorson, A. Hirschberg, A. P. J. Wijnands, and H. Bailliet, “Description of the flow through in-vitro models of the glottis during phonation,” Acta Acust. 3, 191–202 (1995).
38. R. Titze, S. S. Schmidt, and M. R. Titze, “Phonation threshold pressure in a physical model of the vocal fold mucosa,” J. Acoust. Soc. Am. 97, 3080–3084 (1995).
39. N. Ruty, X. Pelorson, A. Van Hirtum, I. Lopez-Artega, and A. Hirschberg, “An in vitro setup to test the relevance of low-order vocal fold models,” J. Acoust. Soc. Am. 121, 479–490 (2007).
40. J. Neubauer, Z. Zhang, R. Miraghaie, and D. A. Berry, “Coherent structures of the near field flow in a self-oscillating physical model of the vocal folds,” J. Acoust. Soc. Am. 121, 1102–1118 (2007).
41. Z. Zhang, “Restraining mechanisms in regulating glottal closure during phonation,” J. Acoust. Soc. Am. 130, 4010–4019 (2011).
42. Z. Zhang, “The influence of material anisotropy on vibration at onset in a three dimensional vocal fold model,” J. Acoust. Soc. Am. 135 (3), 1480–1490 (2014).
43. Z. Zhang, J. Neubauer, and D. A. Berry, “Physical mechanisms of phonation onset: a linear stability analysis of an aeroelastic continuum model of phonation,” J. Acoust. Soc. Am. 122 (4), 2279–2295 (2007).
44. A. Mendelsohn and Z. Zhang, “Phonation threshold pressure and onset frequency in a two layer physical model of the vocal folds,” J. Acoust. Soc. Am. 130, 2961–2968 (2011).
45. J. L. Flanagan and L. L. Landgraf, “Self-oscillating source for vocal tract synthesizer,” IEEE Trans. Audio Electroacoust. AU-16, 57–64 (1968).
46. K. Ishizaka and J. L. Flanagan, “Synthesis of voiced sounds from a two-mass model of the vocal cords,” Bell. Syst. Techn. J., No. 5, 1233–1268 (1972).
47. J. C. Lucero, J. Schoentgen, J. Haas, P. Luizard, and X. Pelorson, “Self-entrainment of the right and left vocal fold oscillators,” J. Acoust. Soc. Am. 137 (4), 2036–2046 (2014).
48. T. Wurzbacher, M. Döllinger, R. Schwarz, U. Hoppe, U. Eysholdt, and J. Lohscheller, “Spatiotemporal classification of vocal fold dynamics by a multimass model comprising time dependent parameters,” J. Acoust. Soc. Am. 123 (4), 2324–2334 (2008).
49. A. Yang, J. Lohscheller, D. A. Berry, S. Becker, U. Eysholdt, D. Voigt, and M. Döllinger, “Biomechanical modeling of the three dimensional aspects of human vocal fold dynamics,” J. Acoust. Soc. Am. 127 (2), 1014–1031 (2010).
50. Q. Xue, X. Zheng, R. Mittal, and S. Bielamowicz, “Computational modeling of phonatory dynamics in a tubular three dimensional model of the human larynx,” J. Acoust. Soc. Am. 132, 1602–1613 (2012).
51. Z. Zhang and T. Luu, “Asymmetric vibration in a two-layer vocal fold model with left-right stiffness asymmetry: experiment and simulation,” J. Acoust. Soc. Am. 132 (3), 1626–1635 (2012).
52. Z. Zhang, “Regulation of glottal closure and airflow in a three-dimensional phonation model: Implications for vocal intensity control,” J. Acoust. Soc. Am. 137 (3), 898–910 (2015).
53. J. R. Titze, “The human vocal cords: a mathematical model. Part 1,” Phonetica 28, 129–170 (1973).
54. J. R. Titze, “The human vocal cords: a mathematical model. Part 2,” Phonetica 29, 1–21 (1974).
55. I. R. Titze, The Myoelastic Aerodynamic Theory of Phonation (National Center for Voice and Speech, Iowa City, 2006).
56. D. G. Childers and Ch.-F. Wong, “Measuring and modeling vocal source-tract interaction,” IEEE Trans. Biomed. Eng. 41 (7), 663–671 (1994).
57. V. N. Sorokin, Speech Synthesis (Nauka, Moscow, 1992) [in Russian].
58. G. Fant, K. Ishizaka, J. Lindqvist, and J. Sundberg, “Speech analysis and speech production. Subglottal formants,” STL-QPSR, No. 1, 1–12 (1972).
59. S. M. Lulich, J. R. Morton, H. Arsikere, M. S. Sommers, G. K. F. Leung, and A. Alwan, “Subglottal resonances of adult male and female native speakers of American English,” J. Acoust. Soc. Am. 132 (4), 2592–2602 (2012).
60. H. Arsikere, G. K. F. Leung, S. M. Lulich, and A. Alwan, “Automatic estimation of the first three subglottal resonances from adults’ speech signals with application to speaker height estimation,” Speech Commun. 55, 51–70 (2013).
61. G. Fant, “Some problem in voice source analysis,” Speech Commun. 13, 7–22 (1993).
62. D. Childers and Ch. Ahn, “Modeling the glottal volume velocity waveform for three voice types,” J. Acoust. Soc. Am. 97 (1), 505–519 (1995).
63. D. G. Childers, D. M. Hicks, G. P. Moore, L. Eskenazi, and A. L. Lalwani, “Electroglottography and vocal folds physiology,” J. Speech Hearing Res. 33, 245–254 (1990).
64. CMU ARCTIC speech synthesis databases. http://festvox.org/cmu/arctic
65. J. L. Flanagan, Speech Analysis, Synthesis, and Perception (Springer Verlag, 1965).
66. V. N. Sorokin, A. S. Leonov, and V. G. Trunov, “Speaker recognition regardless of context and language on a fixed set of competitors,” Pattern Recogn. 26 (2), 450–459 (2016).
67. A. N. Tikhonov and V. Ya. Arsenin, Methods for Solving Incorrect Problems (Nauka, Moscow, 1979) [in Russian].
68. O. Fujimura and J. Lindqvist, “Sweep-tone measurements of vocal tract characteristics,” J. Acoust. Soc. Am. 49 (2), 541–558 (1971).
69. G. Fant, “Vocal tract wall effects, losses and resonance bandwidth,” STL-QPSR, Nos. 2–3, 28–52 (1973).
70. R. H. Byrd, M. E. Hribar, and J. Nocedal, “An interior point algorithm for large-scale nonlinear programming,” SIAM J. Optimiz. 9 (4), 877–900 (1999).
71. J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. (Springer Verlag, 2006).
72. T. V. Ananthapadmanabha, “Acoustic analysis of voice source dynamics,” STL-QPSR, Nos. 2–3, 1–24 (1984).
73. V. N. Sorokin and I. S. Makarov, “Reversed problem for voice source,” Inf. Protsessy 6 (4), 375–395 (2006). www.jip.ru

Translated by A. Muravnik

Victor Nikolaevich Sorokin. Born 1938. Graduated from Moscow Aviation Institute in 1963. Senior research fellow at Institute for Information Transmission Problems of Russian Academy of Sciences, Doctor of Physical and Mathematical Sciences. Author of three monographs and about 150 scientific papers (Theory of Speech Production, 1985; Speech Synthesis, 1992; Speech Processes, 2012). Scientific interests: theory of speech production, automatic speech and speaker recognition, and speech synthesis.


Aleksandr Sergeevich Leonov. Born 1948. Graduated from Moscow State University in 1972. Received candidate’s degree in 1975 and doctoral degree in 1988. Professor of National Research Nuclear University (MEPhI). Scientific interests: mathematical physics, mathematical modeling, and methods for solving inverse and ill-posed problems. Author of three books and more than 140 scientific papers.
