Multimed Tools Appl
DOI 10.1007/s11042-014-2156-2

Head motion synthesis from speech using deep neural networks

Chuang Ding · Lei Xie · Pengcheng Zhu

Received: 18 March 2014 / Revised: 14 May 2014 / Accepted: 19 May 2014
© Springer Science+Business Media New York 2014
Abstract This paper presents a deep neural network (DNN) approach for head motion synthesis, which can automatically predict the head movement of a speaker from his/her speech. Specifically, we realize the speech-to-head-motion mapping by learning a DNN from audio-visual broadcast news data. We first show that a generatively pre-trained neural network significantly outperforms a conventional randomly initialized network. We then demonstrate that filter bank (FBank) features outperform mel frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) in head motion prediction. Finally, we discover that extra training data from other speakers used in the pre-training stage can improve the head motion prediction performance for a target speaker. Our promising results in speech-to-head-motion prediction can be used in talking avatar animation.

Keywords Head motion synthesis · Deep neural network · Talking avatar · Computer animation
C. Ding · L. Xie (corresponding author)
School of Computer Science, Northwestern Polytechnical University, Xi'an, China

P. Zhu
School of Software and Microelectronics, Northwestern Polytechnical University, Xi'an, China

1 Introduction

When talking, we often move our heads and exhibit various facial expressions. Non-verbal cues, e.g., hand gestures, facial expressions and head motions, are used to express feelings, give feedback and engage in human-human communication. Hence, natural head motion is an indispensable factor for a computer-animated talking avatar to look lifelike [17, 33, 34]. Previous research has discovered the connection between speech and the accompanying
head motion. Busso et al. [2] have reported a high correlation between head movements and acoustic features via canonical correlation analysis (CCA) [8]. Munhall et al. [23] have suggested that head motion is important in speech perception, and appropriate head motion can significantly enhance speech perception for a talking avatar [40]. Therefore, much effort has been devoted to head motion generation for talking avatars in recent years. According to the input, head motion synthesis can be divided into speech-based [3, 37] and text-based approaches [40, 41]. Text-based approaches analyze the relationship between head motion and the rhythmic structure of text, establish rules or an association model, and realize head motion through a generation algorithm. Speech-based approaches usually record bi-modal data (audio-visual or audio-MoCap) of a talking person and establish a model between head motion and acoustic features; based on the model, head motion can then be predicted from the acoustic input.

Head motion synthesis can be addressed as either a classification or a regression task. Graf et al. [11] categorize head motion into three patterns: a nod around one axis, a nod with overshoot, and an abrupt swing in one direction. Busso et al. [2] partition continuous head motion trajectories into several patterns using Linde-Buzo-Gray vector quantization (LBG-VQ). In the synthesis phase, the acoustic input is decoded into a sequence of head motion patterns. Motivated by hidden Markov model (HMM) based speech synthesis, Hofer et al. [16] train head-pattern HMMs and generate smooth head motion trajectories under the maximum likelihood estimation (MLE) criterion. Classification methods rely not only on the definitions of typical head motion patterns [29] but also on the accurate recognition of these patterns. In addition, the association between speech and head motion is essentially a non-deterministic, many-to-many mapping problem. As a result, the head motion pattern recognition accuracy usually remains very low. Therefore, recent research indicates that it is more appropriate to regard speech-to-head-motion synthesis as a regression problem [21]; specifically, our previous work [21] used a back propagation neural network to seek a direct and continuous mapping from acoustic speech to head motion. Compared with the HMM-based approaches, a simple one-hidden-layer MLP (multi-layer perceptron) can significantly improve the head motion prediction accuracy and the naturalness of the head movement of a talking avatar.

In the past several years, deep neural networks (DNN) and deep learning methods have been successfully used in many tasks, such as speech recognition [13], natural language processing [22] and computer vision [14]. For example, the DNN-HMM approach has boosted speech recognition accuracy significantly [39]. Zhao et al. [42] have applied DNNs to the articulatory inversion problem, in which continuous articulatory movements are accurately predicted from acoustic speech. Compared with traditional neural networks, the success of a DNN mainly lies in its many-layer deep architecture and in how those layers are optimized. Historically, MLPs used only one or two hidden layers due to limited computation power and the difficulty of optimizing many layers. Deeper networks have only become practical recently through the use of clusters or graphics processing units (GPUs) and, most importantly, the discovery of layer-wise training, either generative or discriminative.
Each layer in a DNN nonlinearly transforms its input representation into a higher-level, more abstract representation that better models the underlying factors of the data. Therefore, lower-level representations of the input, e.g., pixels in images [5] and filter banks (FBank) in speech [9], can be effectively used to further boost performance.

In this paper, we address the speech-to-head-motion synthesis problem using deep neural networks. We learn the speech-to-head-motion mapping with a DNN from audio-visual data of anchorpersons in broadcast news and drive the head movement of a talking avatar from the predicted head motion. Significant performance gains in head motion prediction are achieved. Specifically, our contributions are as follows:

– We investigate the best architecture of a neural network, i.e., depth and width, for the speech-to-head-motion synthesis task.
– We evaluate the effectiveness of different acoustic features, including mel frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC) and the lower-level representation of speech, FBank.
– We examine whether extra data from other subjects are beneficial to the head motion prediction performance in both the pre-training and fine-tuning stages.
The rest of the paper is organized as follows. Section 2 illustrates the architecture of the proposed speech-to-head-motion synthesis system. Section 3 describes the deep neural network approach for speech-to-head-motion synthesis. Experimental analysis is presented in Section 4. Finally, Section 5 draws the conclusions and presents the future directions.
2 System overview

Figure 1 illustrates the block diagram of the speech-to-head-motion synthesis system, which consists of a training phase and an animation phase. The training phase learns the speech-to-head-motion mapping using a DNN from an audio-visual corpus (audio-visual data of anchorpersons in broadcast news). Given the DNN model, the animation phase converts input audio to head motion that is used to drive a talking avatar.

[Fig. 1 Architecture of the speech-to-head-motion synthesis system. Training phase: audio and video from an audio-visual bimodal speech corpus pass through acoustic feature extraction and head motion extraction (3D rotation angles), and the DNN model is trained by generative pre-training and discriminative fine-tuning. Animation phase: acoustic features extracted from new audio are mapped to 3D rotation angles by the speech-to-head-motion DNN and played back as head animation in a talking avatar system.]

In the first step of the training phase, acoustic features and head motions are extracted through the acoustic feature extraction and head motion extraction modules,
respectively. We use the IntraFace [36] tool to track the anchor's face in the video and obtain the 3-dimensional head rotation angles around the x-axis, y-axis and z-axis, namely nod, yaw and roll, as shown in Fig. 2. Subsequently, we train a DNN through a pre-training step and a fine-tuning step. To pre-train the DNN, we train a series of Restricted Boltzmann Machines (RBMs) using the acoustic features and stack these RBMs up to form a deep belief network (DBN) architecture. The RBM pre-training procedure is used to initialize the weights of the DNN. Then we use the back propagation algorithm to discriminatively fine-tune the model, which builds up the correspondence between the acoustic features and the head motion. The animation phase is quite simple: given the acoustic features extracted from a new speech waveform, the three head rotation angles are estimated by the DNN, and these parameters are used to drive a talking avatar with synchronized speech and head motion.
[Fig. 2 The three Euler angles (nod, yaw and roll) tracked by IntraFace [36]]

3 Speech-to-head-motion synthesis with deep neural network

Our previous work has shown that a shallow neural network (a one-hidden-layer MLP) achieves superior performance in speech-to-head-motion synthesis [21]. This naturally motivates us to use a deep neural network, with a more powerful structure and a more effective learning method, to further push forward the performance of the speech-to-head-motion mapping. In this section, we first introduce the definition of a DNN and then describe how to train it through generative pre-training and discriminative fine-tuning [7, 13, 15, 39].

3.1 Deep neural network

A deep neural network is essentially a multi-layer perceptron (MLP), i.e., a feed-forward neural network that maps sets of input data onto a set of outputs. In our case, the input
and output are acoustic features and head rotation angles, respectively. An MLP usually consists of an input layer, a hidden layer and an output layer, and the nodes in each layer are fully connected to the nodes in the adjacent layers. A DNN generalizes the MLP to multiple hidden layers, as shown in Fig. 3. The input layer has no computation capability, as it simply attaches the observations to the network. Each hidden layer takes in the activations of the layer below and computes a new set of nonlinear activations for the layer above. The output layer generates either a value (in regression) or a posterior vector (in classification) from the activations of the last hidden layer. Each hidden layer computes its activation $h_l$ via a linear transformation using a weight matrix $W_l$ and a bias vector $b_l$, followed by a nonlinear function $f_l(\cdot)$:
$$h_l = f_l(W_l h_{l-1} + b_l), \quad 1 \le l < L \qquad (1)$$
where the nonlinear function $f_l(\cdot)$ usually operates element-wise on the input vector. The commonly used activation function is the logistic sigmoid. Each sigmoid hidden unit can be regarded as carrying out a logistic regression feature extraction process [24] that refines the input representation into a better one. The output layer (the $L$th layer) plays the functional role of predicting either a value or a class label; in our study, head rotation angles are the targets to be predicted. The output layer carries out a linear transformation similar to the hidden layers, using a weight matrix $W_L$ and a bias vector $b_L$, but a different task-dependent nonlinear function is usually adopted. For regression tasks like our speech-to-head-motion mapping, a linear or sigmoid function is often used. For classification tasks, the softmax function is adopted, which converts values of arbitrary range into a probabilistic representation. In summary, the parameters of an $L$-layer network are $(W_1, b_1), (W_2, b_2), \ldots, (W_L, b_L)$. They are usually randomly initialized and then discriminatively updated using the error back propagation (BP) algorithm [27].

[Fig. 3 The structure of a deep neural network: an input layer (layer 1), several hidden layers and an output layer (layer L)]
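For concreteness, the forward pass of Eq. (1) with a linear output layer can be sketched as follows (an illustrative NumPy snippet of our own, not the implementation used in the experiments; the 429-200-3 layer sizes follow the setup in Section 4):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    # Hidden layers (1 <= l < L): h_l = sigmoid(W_l h_{l-1} + b_l), cf. Eq. (1)
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    # Linear output layer (layer L): predicted nod, yaw and roll angles
    return weights[-1] @ h + biases[-1]

# 429-D input (39-D MFCC x 11 frames), one 200-unit hidden layer, 3 outputs
rng = np.random.default_rng(0)
sizes = [429, 200, 3]
weights = [rng.normal(0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
angles = forward(rng.normal(size=429), weights, biases)  # 3 rotation angles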
However, gradient-based BP is effective only for one or two hidden layers [25, 26]. With more than two hidden layers, the vanishing-gradient problem usually leads the BP algorithm into poor local optima [10]. A better training method that can fully exploit the training information to build multiple layers of nonlinear feature abstraction is highly in demand.

3.2 Deep belief network based pre-training

As just mentioned, an effective training algorithm is essential to the success of a DNN. Hinton et al. proposed a training method using deep belief networks (DBN) [15], which provides a practical way of building deep layered networks and has triggered great interest in learning deep models. The key to learning deep models lies in unsupervised generative pre-training using Restricted Boltzmann Machines (RBMs), which are stacked to form a DBN. This generative pre-training stage leads the model into a region of parameter space that is close to a good optimum and hence enables the learning of deep models with better generalization.

3.2.1 Restricted Boltzmann machine

An RBM can be considered a special type of Markov Random Field (MRF) that has one layer of (typically Bernoulli) stochastic hidden units and one layer of (typically Bernoulli or Gaussian) stochastic visible units. It can be viewed as a bipartite graph in which the visible units, which represent observations, are connected to binary stochastic hidden units via undirected weighted connections, as shown in Fig. 4-(1). There are no visible-visible or hidden-hidden connections, and all visible units are connected to all hidden units. RBMs have an efficient training procedure, which makes them suitable as building blocks for DBNs.

[Fig. 4 An RBM (1) and stacking up RBMs to form a DBN (2); a simplified model representation is used]

In an RBM, the joint distribution $p(\mathbf{v}, \mathbf{h}; \theta)$ over the visible units $\mathbf{v}$ and hidden units $\mathbf{h}$, given the model parameters $\theta$, is defined via an energy function $E(\mathbf{v}, \mathbf{h}; \theta)$, i.e.,

$$p(\mathbf{v}, \mathbf{h};\theta) = \frac{\exp(-E(\mathbf{v},\mathbf{h};\theta))}{Z}, \qquad (2)$$
where
$$Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h};\theta)) \qquad (3)$$
is a normalization factor, and the marginal probability that the model assigns to a visible vector $\mathbf{v}$ is
$$p(\mathbf{v};\theta) = \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h};\theta))}{Z}. \qquad (4)$$
For a Bernoulli-Bernoulli RBM, the energy function is defined as
$$E(\mathbf{v},\mathbf{h};\theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j \qquad (5)$$
where $w_{ij}$ represents the symmetric interaction term between visible unit $v_i$ and hidden unit $h_j$, $b_i$ and $a_j$ are the bias terms, and $I$ and $J$ are the numbers of visible and hidden units, respectively. The conditional probabilities can be efficiently calculated as
$$p(h_j = 1 \mid \mathbf{v}; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big) \qquad (6)$$
$$p(v_i = 1 \mid \mathbf{h}; \theta) = \sigma\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i\Big) \qquad (7)$$
where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the logistic sigmoid function.
Similarly, for a Gaussian-Bernoulli RBM, the energy is
$$E(\mathbf{v},\mathbf{h};\theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2}\sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j. \qquad (8)$$
The corresponding conditional probabilities become
$$p(h_j = 1 \mid \mathbf{v}; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big) \qquad (9)$$
$$p(v_i \mid \mathbf{h}; \theta) = \mathcal{N}\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i,\; 1\Big) \qquad (10)$$
where $v_i$ takes real values and follows a Gaussian distribution with mean $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and unit variance. Gaussian-Bernoulli RBMs can be used to convert real-valued stochastic variables into binary stochastic variables, which can then be further processed using Bernoulli-Bernoulli RBMs [12].

3.2.2 Generative training of an RBM

Taking the gradient of the log likelihood $\log p(\mathbf{v};\theta)$, we can derive the update rule for the RBM weights as
$$\Delta w_{ij} = E_{\mathrm{data}}(v_i h_j) - E_{\mathrm{model}}(v_i h_j), \qquad (11)$$
where $E_{\mathrm{data}}(v_i h_j)$ is the expectation observed in the training set and $E_{\mathrm{model}}(v_i h_j)$ is that same expectation under the distribution defined by the model. Unfortunately, $E_{\mathrm{model}}(v_i h_j)$ is intractable to compute, so the contrastive divergence (CD) approximation to the gradient is used, where $E_{\mathrm{model}}(v_i h_j)$ is replaced by running the Gibbs sampler, initialized at the data, for one full step. The steps in approximating $E_{\mathrm{model}}(v_i h_j)$ are as follows [39] (a code sketch follows the list):
– Initialize $\mathbf{v}_0$ at the data;
– Sample $\mathbf{h}_0 \sim p(\mathbf{h} \mid \mathbf{v}_0)$;
– Sample $\mathbf{v}_1 \sim p(\mathbf{v} \mid \mathbf{h}_0)$;
– Sample $\mathbf{h}_1 \sim p(\mathbf{h} \mid \mathbf{v}_1)$.
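As an illustration, these four steps and the update of Eq. (11) can be written as follows for a Bernoulli-Bernoulli RBM (a NumPy sketch of our own; following common practice, the final hidden probabilities are used in place of sampled states, which is an assumption rather than something stated in the text):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.05, rng=np.random.default_rng(0)):
    # v0: (n, I) batch of visible data; W: (I, J); a: (J,) hidden biases; b: (I,) visible biases
    p_h0 = sigmoid(v0 @ W + a)                  # p(h|v0), Eq. (6)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0  # sample h0
    p_v1 = sigmoid(h0 @ W.T + b)                # p(v|h0), Eq. (7)
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0  # sample v1
    p_h1 = sigmoid(v1 @ W + a)                  # p(h|v1)
    n = v0.shape[0]
    # Eq. (11): E_data(v_i h_j) - E_model(v_i h_j), averaged over the batch
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    a += lr * (p_h0 - p_h1).mean(axis=0)
    b += lr * (v0 - v1).mean(axis=0)
    return W, a, b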
Then $(\mathbf{v}_1, \mathbf{h}_1)$ is a sample from the model, serving as a very rough estimate of $E_{\mathrm{model}}(v_i h_j)$, which would otherwise require $(\mathbf{v}_\infty, \mathbf{h}_\infty)$, a true sample from the model. The use of $(\mathbf{v}_1, \mathbf{h}_1)$ to approximate $E_{\mathrm{model}}(v_i h_j)$ gives rise to the single-step contrastive divergence (CD-1) algorithm [31].

3.2.3 Stacking up RBMs to a DBN

Stacking a number of RBMs learned layer by layer from the bottom up gives rise to a DBN, as shown in Fig. 4-(2). The stacking procedure is as follows [15]. After learning a Gaussian-Bernoulli RBM (used in our study) or a Bernoulli-Bernoulli RBM, we treat the activation probabilities of its hidden units as the data for training the Bernoulli-Bernoulli RBM one layer up. The activation probabilities of the second-layer Bernoulli-Bernoulli RBM are then used as the visible data input for the third-layer Bernoulli-Bernoulli RBM. In this way, we grow the network to the desired depth. This greedy procedure achieves approximate maximum likelihood learning. Note that the learning procedure is unsupervised and requires no target labels, in our case the head motions.

3.3 Discriminative fine-tuning of DNN

A randomly initialized target layer is added on top of the DBN, resulting in a DNN [13]. For regression tasks, a linear or sigmoid function is often used, while for classification tasks the softmax function is adopted. In our approach, a linear regression layer is used, and the output of this layer corresponds to the head rotation angles. The DBN-based pre-training procedure described in Section 3.2 is used to initialize the other layers of the DNN [28]. This generative pre-training strategy leads the model into a region of parameter space that is close to a good optimum. As with conventional shallow MLPs, the standard back propagation algorithm is then used to adjust, or fine-tune, the whole DNN model. A sketch of the complete training recipe follows.
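The sketch below (again illustrative and our own, reusing cd1_update and sigmoid from above) outlines greedy layer-wise pre-training followed by BP fine-tuning; for brevity, every layer is treated as Bernoulli-Bernoulli, whereas the paper uses a Gaussian-Bernoulli RBM for the real-valued first layer, and momentum, learning-rate scaling and mini-batching are omitted:

def pretrain(data, hidden_sizes, epochs=10, rng=np.random.default_rng(1)):
    # Greedy layer-wise pre-training (Section 3.2.3); no head-motion labels used
    weights, biases = [], []
    for J in hidden_sizes:
        I = data.shape[1]
        W, a, b = rng.normal(0, 0.01, (I, J)), np.zeros(J), np.zeros(I)
        for _ in range(epochs):
            W, a, b = cd1_update(data, W, a, b)
        weights.append(W.T)            # reuse RBM weights as feed-forward W_l
        biases.append(a)
        data = sigmoid(data @ W + a)   # hidden probabilities feed the next RBM
    return weights, biases

def finetune(x, y, weights, biases, lr=0.05, epochs=50, rng=np.random.default_rng(2)):
    # Add a randomly initialized linear output layer (3 angles), then BP (Section 3.3)
    weights = weights + [rng.normal(0, 0.01, (y.shape[1], weights[-1].shape[0]))]
    biases = biases + [np.zeros(y.shape[1])]
    for _ in range(epochs):
        hs = [x]                                        # forward pass
        for W, b in zip(weights[:-1], biases[:-1]):
            hs.append(sigmoid(hs[-1] @ W.T + b))
        pred = hs[-1] @ weights[-1].T + biases[-1]
        delta = (pred - y) / len(x)                     # squared-error gradient
        for l in range(len(weights) - 1, -1, -1):       # back propagation
            gW, gb = delta.T @ hs[l], delta.sum(axis=0)
            if l > 0:
                delta = (delta @ weights[l]) * hs[l] * (1.0 - hs[l])
            weights[l] = weights[l] - lr * gW
            biases[l] = biases[l] - lr * gb
    return weights, biases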
4 Experiments
4.1 Corpus and experiment setup

An audio-visual database is essential for mapping between speech and head motion. In previous studies, subjects were asked to read a specifically designed corpus, with a number of markers on the face for head pose tracking [4, 6, 18], i.e., the MoCap approach. The glued markers may lead to unnatural head movements and less accurate analysis results. In this study, we carefully collected a relatively large audio-visual database from NBC English broadcast news. Specifically, we collected video clips of in-studio anchors, consisting of 103 minutes of data from a target anchorperson and 120 minutes of data from another 10 anchorpersons. We aim to predict the head movement of the target anchorperson with a trained DNN. The data assignment is summarized in Table 1, where the training data come not only from the target speaker but also from other speakers.
Table 1 Audio-visual speech corpus

Source            English anchor newscast video from NBC
Task              Predict a target anchor's head movements from speech
Training data     93 minutes from the target anchor (male); 120 minutes from other 10 anchors (5 males & 5 females)
Testing data      10 minutes from the target anchor
Acoustic feature  MFCC, LPC and FBank
Head rotation     Nod, yaw and roll
The video frame rate is 25 fps and the sampling rate of the audio is 44.1 kHz. Audio is further down-sampled to 16 kHz for acoustic feature extraction. We extract MFCC, LPC and FBank features using the HTK toolkit [38]. The frame window length is set to 25 ms with an overlap of 15 ms. In the experiments, a context window of 11 acoustic frames (5 left frames, 1 current frame and 5 right frames) is used as the DNN input. Hence, the number of units in the input layer is equal to D × 11, where D denotes the dimension of the frame-level acoustic feature vector. We use IntraFace [36] to track the anchor's face in the video clips and get the rotation Euler angles. Figure 5 shows the head rotation trajectories extracted from a video clip in the database. Both the acoustic features and the head motion angles are processed by utterance-level normalization, which subtracts their respective global means and divides by 4 times the standard deviation for each dimension.

[Fig. 5 Head motion rotation trajectories (nod, yaw and roll angles over frames) extracted from a video clip in NBC English broadcast news]
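To make the input assembly concrete, the context-window stacking and the utterance-level normalization described above can be sketched as follows (a NumPy snippet of our own; edge-padding at utterance boundaries is our assumption, as the paper does not specify how boundaries are handled):

import numpy as np

def utterance_normalize(x):
    # Subtract the per-utterance global mean and divide by 4x the std, per dimension
    return (x - x.mean(axis=0)) / (4.0 * x.std(axis=0))

def context_window(frames, left=5, right=5):
    # Stack 5 left + current + 5 right frames -> a D*11 vector per frame
    T = frames.shape[0]
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")  # assumption
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

mfcc = np.random.randn(400, 39)                  # placeholder 39-D MFCC frames
inputs = context_window(utterance_normalize(mfcc))
assert inputs.shape == (400, 39 * 11)            # 429-D DNN inputs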
4.2 Evaluation criteria

We evaluate how well the predicted head movements from speech match the head movements tracked from video (the ground truth). Specifically, performance is quantitatively measured in terms of canonical correlation analysis (CCA) [30], the average correlation coefficient (ACC) [20] and the mean square error (MSE) [1]. CCA is a multivariate statistical model that facilitates the study of the inter-relationships among sets of multiple dependent variables and multiple independent variables [19].
A CCA value of less than 0.3 indicates little or no correlation between the two sets of variables; a value between 0.3 and 0.5 indicates that correlation exists but is relatively weak; and a value greater than 0.5 indicates significant correlation. ACC is computed as
$$\mathrm{ACC} = \rho(\mathbf{O}^v, \hat{\mathbf{O}}^v) = \frac{1}{T \cdot d} \sum_{t=1}^{T} \sum_{i=1}^{d} \frac{(o^v_{t,i} - \mu_{o^v_i})(\hat{o}^v_{t,i} - \mu_{\hat{o}^v_i})}{\rho_{o^v_i}\, \rho_{\hat{o}^v_i}}, \qquad (12)$$
and the MSE is defined as
$$\mathrm{MSE} = \|\hat{\mathbf{O}}^v - \mathbf{O}^v\| = \frac{1}{T} \sum_{t=1}^{T} \|\hat{\mathbf{o}}^v_t - \mathbf{o}^v_t\|, \qquad (13)$$
where $\mathbf{o}^v_t$ and $\hat{\mathbf{o}}^v_t$ denote the actual and the predicted head rotation angles; $o^v_{t,i}$ and $\hat{o}^v_{t,i}$ are their $i$th coefficients, respectively; $\mu$ and $\rho$ are their means and standard deviations; $d = 3$ in this paper; and $T$ is the total number of frames. MSE shows the parameter prediction errors, while CCA and ACC describe how similar in shape the predicted trajectory is to the ground truth.
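For concreteness, ACC (Eq. (12)) and the error measure of Eq. (13) can be computed as follows (a NumPy sketch of our own; CCA itself is omitted here, since standard statistical packages provide it):

import numpy as np

def acc(O, O_hat):
    # Eq. (12): average of the per-dimension normalized cross-correlations,
    # taken over all T frames and d = 3 rotation angles
    zo = (O - O.mean(axis=0)) / O.std(axis=0)
    zh = (O_hat - O_hat.mean(axis=0)) / O_hat.std(axis=0)
    return (zo * zh).mean()

def mse(O, O_hat):
    # Eq. (13): mean per-frame Euclidean error between predicted and actual
    # rotation-angle vectors (O, O_hat: arrays of shape (T, 3))
    return np.linalg.norm(O_hat - O, axis=1).mean()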
4.3 Experimental results

We compare the proposed DNN approach with the conventional artificial neural network (ANN) approach (random initialization) using MFCC features. The input layer of both networks has 429 visible units (39-D MFCC features with static, first- and second-order delta components; 11 frames). Both networks have 1 hidden layer with 160 hidden units, and the final output layer has 3 units, corresponding to the 3 head rotation angles. The ANN is initialized randomly and trained using 50 iterations of back propagation. The DNN is initialized using layer-by-layer generative pre-training and then discriminatively trained using 50 iterations of back propagation. The learning rate is initialized as 0.05 and decreased by a scale factor of 0.99 in each iteration, with a momentum of 0.5. Back propagation is done using stochastic gradient descent in mini-batches of 128 training examples. The training data is 93 minutes from the target speaker and the testing data is 10 minutes from the same speaker, as shown in Table 1. Please note that both the ground-truth head rotation trajectory and the predicted one are smoothed using a moving average before the calculation of CCA, ACC and MSE (the same smoothing method is used in all experiments).

Table 2 Comparison of DNN and ANN in terms of CCA, ACC and MSE

Method   CCA      ACC       MSE
DNN      0.5127   0.4117    0.2258
ANN      0.3688   0.2638    0.2315
Random   0.0936   -0.0383   0.2701

The experimental results are shown in Table 2. From the results, we can clearly see that, with the help of generative pre-training, the DNN system significantly outperforms
the ANN system: the CCA is 0.5127 and the ACC is 0.4117 for the DNN, while for the ANN the CCA is 0.3688 and the ACC is 0.2638. The DNN system also brings an apparent reduction in MSE. For comparison, the performance of a randomly generated trajectory is also given in Table 2 (named Random). Figure 6 shows the head nod trajectories generated by the different methods for a test clip. We can see that the trajectory generated by the DNN is much closer to the ground-truth trajectory. The nod, yaw and roll trajectories can be used to drive a talking avatar. Figure 7 shows some snapshots from a synthesized head movement of a talking avatar and their corresponding face images in the broadcast news.

[Fig. 6 Nod trajectories generated by different approaches (ground truth, ANN, DNN, random) for a clip in the test set]

4.4 Effects of DNN depth and width

We carry out experiments to evaluate the performance of the DNN approach with different layer depths and widths. The network input and the data assignment are kept the same as those in Section 4.3. In the training of all networks with different depth/width configurations, the learning rate is 0.05 and the momentum 0.99. For clarity, we use CCA as the evaluation criterion.
[Fig. 7 Some snapshots from a synthesized head movement sequence (top) and their corresponding face images in the broadcast news (bottom)]
Table 3 Effects of DNN depth and width in head motion prediction in terms of CCA

Width   Depth=1   Depth=2   Depth=3   Depth=4   Depth=5
100     0.5178    0.4924    0.4843    0.4621    0.4350
200     0.5234    0.4933    0.4802    0.4728    0.4580
300     0.4998    0.4812    0.4650    0.4547    0.4322
400     0.4853    0.4686    0.4532    0.4486    0.4290
The results are shown in Table 3. As can be seen, the best network configuration is one hidden layer with 200 nodes, which achieves the highest CCA (0.5234). We also notice that CCA decreases as the network depth increases, and that further expansion of the network width does not bring performance improvement. This may be explained by two reasons: (1) the relationship between acoustic speech and head motion is probably a one-layer nonlinear mapping; (2) the size of our training data is still limited and deeper networks are overfitted.

4.5 Comparison of different acoustic features

In this experiment, we compare different acoustic features to show their abilities in predicting head motion. Specifically, 39-dimensional MFCC, 36-dimensional LPC (12 LPC + 12 Delta + 12 DeltaDelta) and 26-dimensional filter bank (FBank) features are investigated. As previous studies in speech recognition have shown that FBank outperforms MFCC [9], we aim to find out whether this is also the case in speech-to-head-motion prediction. Experimental configurations are listed in Table 4. The performance of the different acoustic features is shown in Table 5. Although the MSE values achieved by DNNs using the different acoustic inputs are quite close, the DNN system with the FBank features shows superior performance in terms of CCA and ACC. This means the FBank-predicted head motion trajectories are closer in shape to the ground-truth trajectories. Figure 8 shows the nod trajectories generated by DNNs with different acoustic inputs for a clip in the test set. From this figure, we can see that the MFCC and FBank features have similar trends in most frames, but in some places, e.g., frames 50-120, FBank yields a shape closer to the ground truth than MFCC. The results demonstrate that the FBank feature is more suitable for the speech-to-head-motion synthesis task. This conclusion is consistent with DNN-based speech recognition, where FBank also outperforms MFCC [9].
Table 4 Experimental configurations for DNNs with MFCC, LPC and FBank features

                             MFCC        LPC         FBank
Input                        39*11=429   36*11=396   26*11=286
Depth/Width                  1/200       1/140       1/120
Learning Rate/Scale Factor   0.05/0.99   0.05/0.99   0.075/0.99
Mini-batch                   128         128         128
Momentum                     0.5         0.5         0.5
BP Iterations                50          50          50
Table 5 Comparison of DNN with MFCC, LPC and FBank acoustic features

Feature   CCA      ACC      MSE
MFCC      0.5234   0.4372   0.2255
LPC       0.3751   0.2738   0.2350
FBank     0.5403   0.5189   0.2253

[Fig. 8 Nod trajectories generated by DNNs with different acoustic features (ground truth, FBank, LPC, MFCC) for a clip in the test set]
4.6 Effects of extra training data

In this experiment, we investigate how extra training data from other speakers affect the performance of the DNN during the generative pre-training phase and the discriminative fine-tuning phase. Our objective is to see whether data from other speakers can improve the head motion prediction of the target speaker. The experimental configurations are summarized in Table 6. Please note that FBank features are used in these experiments due to their superior performance described in Section 4.5.

Table 6 Experimental configurations for different settings of training data

Setting          Pre-training                          Fine-tuning
Single speaker   93 mins from target speaker           93 mins from target speaker
Training A       93 mins from target speaker           93 mins from target speaker
                 + 120 mins from other 10 speakers
Training B       93 mins from target speaker           93 mins from target speaker
                 + 120 mins from other 10 speakers     + 120 mins from other 10 speakers

We tune the network structure and hyperparameters to obtain the best performance. For Training A, the best configuration is as follows: the DNN has 1 hidden layer with 120 hidden units; a learning rate of 0.05 is used, scaled by a factor of 0.99 each epoch; back propagation is done using stochastic gradient descent in mini-batches of 128 training examples with a momentum of 0.5; and 50 iterations of back propagation are performed in the fine-tuning stage. The best configuration for Training B is the same as that of Training A except that the learning rate is 0.05. Results are summarized in Table 7.

Table 7 Comparison of different amounts of training data on DNN

Setting          CCA      ACC      MSE
Single Speaker   0.5403   0.5189   0.2253
Training A       0.5451   0.5201   0.2243
Training B       0.5129   0.5158   0.2317

From the table, we can observe that in Training A, with the help of the extra training data from other speakers in the pre-training step, CCA increases from 0.5403 to 0.5451 and ACC increases from 0.5189 to 0.5201. This performance gain suggests that the generative pre-training phase achieves a better network initialization with more data from other speakers and learns speech patterns common to different speakers. By contrast, in Training B, if we use the extra data in both the pre-training and fine-tuning phases, we see clear performance degradations: CCA decreases from 0.5403 to 0.5129 and ACC decreases from 0.5189 to 0.5158. This suggests that different speakers have different head motion patterns when talking, and that adding data from other speakers in the supervised fine-tuning step may degrade the head motion prediction performance for the target speaker. Similar phenomena have been observed in DNN-based multi-task speech recognition [9]. Figure 9 shows the head nod trajectories generated by the different training settings for a clip in the test set.

[Fig. 9 Nod trajectories generated by different training settings (ground truth, Training A, Single Speaker, Training B) for a clip in the test set]
5 Conclusions and future work

In this paper, we address the speech-to-head-motion synthesis problem with deep neural networks. Using a DNN learned from audio-visual data, our approach can predict the head movement of a speaker from his/her speech. This can be effectively used in a talking avatar system accompanied by realistic head motion [33, 35]. We have investigated the problem through three important aspects: the ideal structure of the network, the most effective acoustic feature and the training strategy. Our study leads to several important conclusions. First, a generatively pre-trained neural network significantly outperforms a conventional randomly initialized network in head motion prediction. Second, similar to the conclusions drawn in speech recognition, FBank has the best ability in head motion synthesis compared with MFCC and LPC. Third, extra training data from other speakers used in the pre-training stage can improve the head motion prediction performance for a target speaker, but the extra data does not help if used in both the pre-training and fine-tuning stages.

Future work can be devoted to the following two aspects. First, previous studies have shown that a typical head motion pattern may span a relatively long period; this motivates us to
investigate the head motion prediction ability of an even longer span of acoustic speech input. Second, since prosodic aspects of speech, e.g., intonational and durational cues [32], are highly related to head movements, we plan to use prosodic features together with the current acoustic features to further improve head motion synthesis performance.

Acknowledgments This work was supported by the National Natural Science Foundation of China (61175018) and the Fok Ying Tung Education Foundation (131059).
References

1. Allen DM (1971) Mean square error of prediction as a criterion for selecting variables. Technometrics 13(3):469-475
2. Busso C, Narayanan SS (2007) Interrelation between speech and facial gestures in emotional utterances: a single subject study. IEEE Trans Audio Speech Lang Process 15(8):2331-2347
3. Busso C, Deng Z, Neumann U, Narayanan S (2005) Natural head motion synthesis driven by acoustic prosodic features. Comput Animat Virtual Worlds 16(3-4):283-290
4. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335-359
5. Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207-3220
6. Cruz-Neira C, Sandin DJ, DeFanti TA, Kenyon RV, Hart JC (1992) The CAVE: audio visual experience automatic virtual environment. Communications of the ACM 35(6):64-72
7. Dahl GE, Yu D, Deng L, Acero A (2011) Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 4688-4691
8. Dehon C, Filzmoser P, Croux C (2000) Robust methods for canonical correlation analysis. In: Data analysis, classification, and related methods, Springer, pp 321-326
9. Deng L, Li J, Huang JT, Yao K, Yu D, Seide F, Seltzer M, Zweig G, He X, Williams J et al (2013) Recent advances in deep learning for speech research at Microsoft. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 8604-8608
10. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: International conference on artificial intelligence and statistics, pp 249-256
11. Graf HP, Cosatto E, Strom V, Huang FJ (2002) Visual prosody: facial movements accompanying speech. In: 5th IEEE international conference on automatic face and gesture recognition, IEEE, pp 396-401
12. Hinton G (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):926
13. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-r, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82-97
14. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504-507
15. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527-1554
16. Hofer G, Shimodaira H (2007) Automatic head motion prediction from speech data. In: INTERSPEECH, pp 722-725
17. Jia J, Zhang S, Meng F, Wang Y, Cai L (2011) Emotional audio-visual speech synthesis based on PAD. IEEE Trans Audio Speech Lang Process 19(3):570-582
18. Kuratate T, Munhall KG, Rubin P, Vatikiotis-Bateson E, Yehia H (1999) Audio-visual synthesis of talking faces from speech production correlates. In: EuroSpeech
19. Lattin JM, Carroll JD, Green PE (2003) Analyzing multivariate data. Thomson Brooks/Cole, Pacific Grove
20. Lee Rodgers J, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42(1):59-66
21. Li B, Xie L, Zhu P (2013) Head motion generation for speech driven talking avatar. J Tsinghua Univ (Sci & Tech) 53(6):898-902
22. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
23. Munhall KG, Jones JA, Callan DE, Kuratate T, Vatikiotis-Bateson E (2004) Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol Sci 15(2):133-137
24. Reynolds DA, Campbell WM (2008) Text-independent speaker recognition. In: Springer handbook of speech processing, pp 763-782
25. Rosenblatt F (1961) Principles of neurodynamics: perceptrons and the theory of brain mechanisms. Technical report, DTIC Document
26. Rumelhart DE, Hinton GE, Williams RJ (1985) Learning internal representations by error propagation. Technical report, DTIC Document
27. Rumelhart DE, Hinton GE (1988) Learning representations by back-propagating errors. MIT Press, Cambridge
28. Salakhutdinov R, Hinton GE (2009) Deep Boltzmann machines. In: International conference on artificial intelligence and statistics, pp 448-455
29. Sargin ME, Yemez Y, Erzin E, Tekalp AM (2008) Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation. IEEE Trans Pattern Anal Mach Intell 30(8):1330-1345
30. Thompson B (2005) Canonical correlation analysis. In: Encyclopedia of statistics in behavioral science
31. Tieleman T (2008) Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th international conference on machine learning, ACM, pp 1064-1071
32. Xie L (2008) Discovering salient prosodic cues and their interactions for automatic story segmentation in Mandarin broadcast news. Multimedia Syst 14(4):237-253
33. Xie L, Liu Z-Q (2007) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimedia 9(3):500-510
34. Xie L, Liu Z-Q (2007) A coupled HMM approach for video-realistic speech animation. Pattern Recogn 40(10):2325-2340
35. Xie L, Sun N, Fan B (2013) A statistical parametric approach to video-realistic text-driven talking avatar. Multimed Tools Appl. doi:10.1007/s11042-013-1633-3
36. Xiong X, De la Torre F (2013) Supervised descent method and its applications to face alignment. In: IEEE conference on computer vision and pattern recognition (CVPR), IEEE, pp 532-539
37. Yehia HC, Kuratate T, Vatikiotis-Bateson E (2002) Linking facial animation, head motion and speech acoustics. J Phon 30(3):555-568
38. Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Valtchev V, Woodland P (1997) The HTK book, vol 2. Entropic Cambridge Research Laboratory, Cambridge
39. Yu D, Deng L (2011) Deep learning and its applications to signal and information processing. IEEE Signal Process Mag 28(1):145-154
40. Zhang S, Wu Z, Meng HM, Cai L (2007) Facial expression synthesis using PAD emotional parameters for a Chinese expressive avatar. In: Affective computing and intelligent interaction, Springer, pp 24-35
41. Zhang S, Wu Z, Meng HM, Cai L (2007) Head movement synthesis based on semantic and prosodic features for a Chinese expressive avatar. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2007), vol 4. IEEE, pp IV-837
42. Zhao K, Wu Z, Cai L (2013) A real-time speech driven talking avatar based on deep neural network. In: Signal and information processing association annual summit and conference (APSIPA), 2013 Asia-Pacific, IEEE, pp 1-4
Chuang Ding received the B.E. degree in computer science and technology from Northwestern Polytechnical University, Xi'an, China, in 2013. He is currently a master's student in the School of Computer Science, Northwestern Polytechnical University, Xi'an, China. His current research interests include talking avatar animation, speech synthesis and pattern recognition.
Lei Xie received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xi'an, China, in 2004. He is currently a Professor with the School of Computer Science, Northwestern Polytechnical University, Xi'an, China. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a Visiting Scientist. From 2004 to 2006, he was a Senior Research Associate in the Center for Media Technology, School of Creative Media, City University of Hong Kong, Hong Kong. From 2006 to 2007, he was a Postdoctoral Fellow in the Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He has published more than 90 papers in major journals and conference proceedings, such as the IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, INFORMATION SCIENCES, PATTERN RECOGNITION, ACL, Interspeech, ICPR, ICME and ICASSP. He has served as program chair and organizing chair of several conferences. He is a Senior Member of the IEEE. His current research interests include speech and language processing, multimedia and human-computer interaction.
Pengcheng Zhu received the B.E. degree in software engineering from Northwestern Polytechnical University (NWPU), Xi'an, China, in 2013. He is now a master's student in the School of Software and Microelectronics, NWPU. His main research interests include talking avatar animation, speech synthesis and pattern recognition.