J Eng Math DOI 10.1007/s10665-014-9689-2
Application of extended Kalman filter for improving the accuracy and smoothness of Kinect skeleton-joint estimates Jody Shu · Fumio Hamano · John Angus
Received: 12 June 2013 / Accepted: 21 January 2014 © Springer Science+Business Media Dordrecht 2014
Abstract The Kinect sensor is a powerful tool for applications that require machine vision and voice recognition. The sensor has the capability to detect and track up to two individuals within its field of view and output 20 key 3D “skeleton” joints on these individuals at 30 frames per second. Moreover, the sensor also has a sound-localizing array of microphones that is used to compute the azimuth of any primary sound source within its range. While these skeleton data have good accuracy most of the time, the 20 tracking points exhibit a high level of jitter due to noise and estimation error, and when a subject moves slightly out of the field of view of the sensor for a short period of time, there is no built-in capability to continue the tracking by extrapolating the positions of these points. In addition, the sensor does not take advantage of the sound source angle when the subject being tracked is speaking. In this work, tracking with the sensor is improved by applying an extended Kalman filter. This filter smooths out the jitter, adds the capability to continue tracking for a short period of time when the subject moves out of range of the sensor, and improves the accuracy of the tracking by incorporating the information contained in the sound source angle from the sensor. The efficacy of the filter is demonstrated by applying it to the skeleton head joint, a tracking point near the center of the subject’s head.

Keywords Machine vision · Pattern recognition · Sensors · Sound-enhanced tracking
J. Shu · F. Hamano
College of Engineering, California State University Long Beach, Long Beach, CA, USA

J. Angus (corresponding author)
Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA, USA
e-mail: [email protected]

1 Introduction

This paper presents a method for improving the estimation of key human body locations designated in the Kinect skeleton. A single human subject is considered, for example a lecturer, moving about in the field of view of the sensor and speaking intermittently. This method uses an extended Kalman filter to integrate standard Kinect skeleton data and sound direction, resulting in significantly improved real-time location estimation. The Kinect sensor was originally introduced by Microsoft as an accessory for its Xbox video gaming system. Much has been written recently about the Kinect sensor and its evolution. The technology behind the Kinect was
Fig. 1 Kinect coordinate system and its sensor locations
Fig. 2 Top view of Kinect depth camera field of view
developed by PrimeSense [1], and an excellent overview of the details of how the sensor works may be found in [2]. The sensor is equipped with a linear array of four microphones, an internal digital signal processor, an RGB camera, an infrared camera, and an infrared projector. The infrared devices combine to provide depth information for objects in the view field. This is accomplished by employing epipolar geometry [3], that is, computing disparities between two images taken from different perspectives. The two images are created by the infrared projector and infrared camera as follows. The infrared projector emits a laser beam through a diffraction grating. This produces a pseudorandom pattern of laser dots on any object in the view field. An image of these dots on a flat surface at a known distance is stored internally in the sensor. When the infrared camera senses a pattern of dots on an image in the view field, this pattern is correlated with the dots on the stored image to locate them within the stored image. The disparity between these two images can thus be computed and the depth information for the object in the view field determined. The intrinsic parameters of the Kinect’s infrared camera (e.g., focal length, lens distortion, offsets) then allow full three-dimensional (3D) coordinates to be computed for a dense pattern of dots on any object within view of the sensor. A schematic of the Kinect and its sensor locations, along with its right-handed coordinate system, is shown in Fig. 1. The sensor’s base sits parallel to the (x, z) plane, with the x axis running through the sensor’s video and audio sensor arrays. The positive z axis defines the sensor’s direction of view, and the y axis runs perpendicular to the base of the sensor. In terms of its ability to track objects, the default field of view for the sensor is shown in Fig. 2.
The angle of view is roughly 57.5° in the horizontal direction, and objects between (roughly) 2.6 and 13.1 ft from the sensor can be resolved for depth (i.e., their z coordinate can be estimated). In the vertical direction [i.e., in the (y, z) plane], the angle of view is 43.5°, and the sensor has a motorized mechanism that can tilt the sensor up or down 27°, thereby potentially extending the vertical angle of view to 97.5°. The microphone array is used for sound localization. By measuring the time delay of arrival of a sound source between pairs of microphones in the array, the azimuth in the (x, z) plane of the sound source can be computed. With this sound source direction, the microphone array can be turned into a directional microphone via beamforming to produce an optimal sound signal from the source. See [4] for details on how sound localization and beamforming are accomplished. These capabilities, along with proprietary algorithms, allow the player to interact with the Xbox in real time using speech and gestures to respond and issue commands to the game. The most innovative feature of Kinect is its skeleton stream. This concept originated from the work of Shotton et al. [5,6]. Applied to the Kinect sensor, their approach was based on sampling Kinect data from sensing a large number of humans in various postures. Using that database, they trained a random forest classifier (Breiman [7]) to recognize 20 distinct body parts from a set of
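The sensor's internal localization algorithm is proprietary, but the geometry behind time-delay-of-arrival (TDOA) azimuth estimation can be sketched for a single far-field microphone pair. The function name and microphone spacing below are illustrative assumptions, not actual Kinect parameters.

```python
import math

def azimuth_from_tdoa(delta_t, mic_spacing, c=343.0):
    """Far-field TDOA model for one microphone pair: the extra path length
    to the farther microphone is c*delta_t = mic_spacing*sin(theta), so
    theta = asin(c*delta_t / mic_spacing). Angle returned in radians."""
    s = c * delta_t / mic_spacing
    s = max(-1.0, min(1.0, s))  # clamp numerical noise at end-fire angles
    return math.asin(s)

# A source 30 degrees off boresight, mics 0.2 m apart (illustrative values):
dt = 0.2 * math.sin(math.radians(30.0)) / 343.0
print(round(math.degrees(azimuth_from_tdoa(dt, 0.2)), 1))  # → 30.0
```

In practice the Kinect fuses delays from all pairs in its four-microphone array and reports a single azimuth with a confidence value, as described in [4].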
Fig. 3 Screen captures of Kinect skeleton superimposed in real time. a Upper torso on depth map. b Upper torso on color image. c Full skeleton on depth map
measured features (see also [2]). Those body parts are the left and right ankles, left and right elbows, left and right feet, left and right hands, head, center hip, left and right hips, left and right knees, shoulder center, left and right shoulders, spine, and left and right wrists. Having identified the 3D points in each body part, they used a mean-shift algorithm to identify the center of mass of each body part. These centers of mass are the so-called 20 skeleton joints associated with the body parts in 3D coordinates. The random forest classifier algorithm runs in real time using Kinect data as input and produces the skeleton stream from human subjects detected in the view field. The Kinect can detect up to six people in its field of view and track the skeleton joints of up to two people. The 3D coordinates of the 20 joints for the tracked subjects are computed as part of the skeleton stream and delivered at a default rate of 30 frames per second. These coordinates are accurate enough for recognizing and incorporating a player’s gestures into Xbox games or for doing simple gesture detection and recognition for other applications. For a seated subject near the sensor, with only the upper torso in view, Kinect can be switched to operate in near mode, tracking only the upper torso. Figure 3 shows a screen capture of the upper torso of a seated subject with the Kinect skeleton superimposed in real time on the depth map (Fig. 3a) and color image (Fig. 3b). These two frames were captured at different instants of time. The third image (Fig. 3c) is a depth image showing the full skeleton with all 20 joints. Recently, Kinect has become an object of interest to those developing other types of systems that require machine vision: surveillance, motion detection, mouseless/keyboardless interface with computers, and even a motorized skateboard controlled by the rider’s gestures (see [8]). 
For example, in Shu [9], the concept and a working prototype of an autonomous video camera system for instructional technology were presented using a system based on Kinect, and in Martin et al. [10], a system using Kinect was designed to teach factory workers proper lifting and carrying ergonomics. In this work, the focus is on enhancing the skeleton stream produced by Kinect. Kinect has a limited viewing area for the depth camera (Fig. 2), so the skeleton stream will drop out at the boundaries of this region. Also, the skeleton joints are subject to random and other estimation errors, causing considerable high-frequency variation in the estimated positions provided. This makes smooth tracking difficult. The angle of “view” for sound detection is considerably wider than the visual angle of view, namely 100°. For a subject who is speaking, the Kinect estimate of the sound source angle gives the direction from the sensor to the (x, z) coordinate of the subject’s skeleton head joint; this means that sound could be useful for improving the accuracy of skeleton tracking and potentially continuing the tracking outside of the 57.5° angle of view for the depth camera. This fact is later exploited in the approach to enhancing the tracking accuracy of Kinect. The application focuses on tracking a single subject, specifically, tracking the subject’s head joint as the subject moves about. It is assumed that the subject is speaking intermittently, as a lecturer might do during a presentation. It is also assumed that this speech, when present, is the dominant source of sound that the sensor picks up. That is, all other sound is relatively low-amplitude ambient noise. The accuracy of the tracking of skeleton joints by Kinect is limited by the hardware design of the Kinect sensor and the algorithms Microsoft chose to implement for estimating the joints’ coordinates. Shotton et al. [5,6] measure
accuracy as the mean average precision, which is the mean (over all skeleton joints) of the average percentage of true positives in identifying the joints. Their experiments show that mean average precisions of 80 and 85 % are achievable for Kinect when the distance threshold (the closeness at which a true positive is scored) is set at 10 and 20 cm, respectively. Unfortunately, this provides only upper bounds of 10 and 20 cm on the actual skeleton-joint estimation errors (i.e., the actual distance between the estimated skeleton joints and their true positions). Outside of improving the calibration of the sensor (see [11], for example), little has appeared in the open literature on improving the estimation accuracy of the skeleton tracking. The Kinect software development kit (SDK), a library of software for interfacing with the sensor, has smoothing algorithms for skeleton tracking that appear to be based on standard time-series methods (e.g., weighted moving averages). These merely decrease the “jitter” in the skeleton joint estimates at the expense of increased latency. However, there appear to be no algorithms (or internal firmware) specifically designed to enhance accuracy beyond that which is inherent in the hardware and skeleton-joint identification algorithms. The approach considered here is therefore new: it increases the estimation accuracy of the skeleton joints by using an extended Kalman filter that models the position and velocity of a skeleton joint as the state and uses the sound source angle as an observable when it is available. In the following sections, an extended Kalman filter is described that shows promise in improving skeleton-joint estimation and smoothness. Attention is restricted to the head joint of the skeleton, but in principle the concept could be applied to other joints as well. It is demonstrated that with this filter, skeleton tracking is smoother and more accurate.
Evidence is also presented that the filter is able to continue tracking accurately for short periods when the skeleton stream becomes unavailable.
2 Review of the Kalman filter

The Kalman filter is a recursive algorithm for estimating a set of unknown parameters (i.e., the state) based on a set of observed measurements. The basic setup for the standard (linear) filter is as follows (see also [12,13]):

X_{k+1} = F X_k + w_k,   (1)
Y_k = H X_k + v_k.   (2)

Here, the subscript indicates successive equally spaced discrete times, X_0, X_1, ... denote successive state vectors in R^{n×1}, and Y_0, Y_1, ... denote successive observation vectors in R^{m×1}. The matrices F ∈ R^{n×n} and H ∈ R^{m×n} are known and may also vary with k, although for now it will be assumed that they do not. The error sequences v_0, v_1, ... and w_0, w_1, ... are assumed to be white-noise sequences (i.e., zero-mean, uncorrelated sequences) independent of one another. The covariance matrices for these are defined by

Q = E[w_k w_k^T],   (3)
R = E[v_k v_k^T].   (4)

Typically, the state vector is not directly observed and must be estimated from the observation vectors. Specifically, the objective is to compute, at each step k, the estimate

X̂_k = E(X_k | Y_0, Y_1, ..., Y_k).   (5)
If the error distributions are unknown, the preceding conditional expectation can be replaced with a linear least-squares estimate that depends on the observed data. The recent work [14] gives a general derivation of both the standard and extended Kalman filters in the least-squares context. If the error distributions are Gaussian (normal), then the two approaches (least-squares and conditional expectation) yield the same estimates. Thus, the state estimates may be derived assuming that the errors are Gaussian, with the understanding that if the errors are not Gaussian, the resulting state estimates are simply linear least-squares estimators. This relatively simple approach to deriving filter equations
does not seem to be widely known and so is presented here for its pedagogic value and in the interest of keeping the paper self-contained. The following lemma is needed for the derivation of the recursive estimation equations.

Lemma 1 Let (X^T, Y^T)^T consist of two jointly distributed Gaussian random column vectors with mean vector (a^T, b^T)^T and covariance matrix

[ Σ_11  Σ_12 ]
[ Σ_21  Σ_22 ],

both of which follow the corresponding partitioning of (X^T, Y^T)^T. Then the conditional expectation is

E(X|Y) = a + Σ_12 Σ_22^† (Y − b),

where Σ_22^† is the pseudoinverse (which is equal to the regular inverse, if it exists) of Σ_22. Moreover,

E[(X − E(X|Y))(X − E(X|Y))^T] = Σ_11 − Σ_12 Σ_22^† Σ_21.

Proof For the nonsingular case, let A = Σ_12 Σ_22^{−1}. Then Cov(X − AY, Y) = Σ_12 − A Σ_22 = 0. Since X and Y are jointly Gaussian, this choice of A makes X − AY and Y independent (since uncorrelated normal random variables are independent). Hence, E(X − AY | Y) = a − Ab by independence. On the other hand, by the linearity of the conditional expectation operator, it is also true that E(X − AY | Y) = E(X|Y) − AY. Putting these two together gives the formula for the conditional expectation E(X|Y). The formula for E[(X − E(X|Y))(X − E(X|Y))^T] follows from the independence of Y and X − AY. In the singular case, it is sufficient to show that the equation Σ_12 − A Σ_22 = 0, or its transpose Σ_21 − Σ_22 A^T = 0, has a solution in A (which implies that substitution of A = Σ_12 Σ_22^† solves the equation Σ_12 − A Σ_22 = 0). Let C(Σ_22) denote the column space of Σ_22, and let P be the projection matrix onto C(Σ_22). Then Cov(Y − PY) = 0 since P Σ_22 = Σ_22. This implies that Y − b = P(Y − b) with probability 1, and hence Σ_21 = E[(Y − b)(X − a)^T] = E[P(Y − b)(X − a)^T] = P Σ_21, which implies that the columns of Σ_21 are also in C(Σ_22). Hence there is a solution to Σ_21 − Σ_22 A^T = 0, completing the proof.
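Lemma 1 is easy to sanity-check numerically. The following sketch is illustrative only (the random covariance and variable names are not from the text); it verifies that A = Σ_12 Σ_22^† decorrelates X − AY from Y, which is the key step of the proof.

```python
import numpy as np

# Partition a random positive-definite covariance as in Lemma 1 and verify
# that A = S12 @ pinv(S22) satisfies Cov(X - A Y, Y) = S12 - A @ S22 = 0.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
S = B @ B.T                              # positive definite by construction
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

A = S12 @ np.linalg.pinv(S22)
print(np.allclose(S12 - A @ S22, 0))     # → True

# Residual covariance of X given Y, per the second claim of the lemma:
resid = S11 - S12 @ np.linalg.pinv(S22) @ S21
print(np.allclose(resid, resid.T))       # symmetric, as a covariance must be → True
```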
The state estimate can be computed recursively by updating the previous estimate each time a new observation becomes available. Denote by X̂_k^− the estimated state vector just prior to the observation of Y_k, and let X̂_k denote the estimate of the state incorporating the new observation Y_k. Then, conditional on the observations prior to Y_k, the joint distribution of (X_k^T, Y_k^T)^T is Gaussian, with mean vector

( X̂_k^−, H X̂_k^− )^T

and covariance matrix

[ P_k^−      P_k^− H^T        ]
[ H P_k^−    H P_k^− H^T + R ],

where P_k^− is the covariance matrix of X_k conditioned on the observations prior to Y_k, that is,

P_k^− := E[ (X_k − X̂_k^−)(X_k − X̂_k^−)^T | Y_{k−1}, Y_{k−2}, ..., Y_0 ].   (6)

The updated state estimate X̂_k and covariance matrix P_k, incorporating the observation Y_k, can be obtained by applying the preceding lemma with Σ_11 = P_k^−, Σ_12 = P_k^− H^T, Σ_21 = Σ_12^T, Σ_22 = H P_k^− H^T + R, a = X̂_k^−, b = H X̂_k^−. These are given by
X̂_k = X̂_k^− + P_k^− H^T (H P_k^− H^T + R)^† (Y_k − H X̂_k^−)   (7)

and

P_k := E[ (X_k − X̂_k)(X_k − X̂_k)^T | Y_k, Y_{k−1}, ..., Y_0 ] = P_k^− − P_k^− H^T (H P_k^− H^T + R)^† H P_k^−.   (8)

The matrix

K_k := P_k^− H^T (H P_k^− H^T + R)^†   (9)

is referred to as the Kalman gain matrix. Using the state equation,

X̂_{k+1}^− = F X̂_k   (10)

and

P_{k+1}^− = F P_k F^T + Q.   (11)

The recursive computations for the Kalman filter are summarized as follows:

1. Initialization step: input prior estimates X̂_0^−, P_0^−.
2. Measurement update step:

   K_k = P_k^− H^T (H P_k^− H^T + R)^†,   (12a)
   X̂_k = X̂_k^− + K_k (Y_k − H X̂_k^−),   (12b)
   P_k = (I − K_k H) P_k^−.   (12c)

3. Time update step:

   X̂_{k+1}^− = F X̂_k,   (13a)
   P_{k+1}^− = F P_k F^T + Q.   (13b)
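The three steps above translate directly into code. The following function is an illustrative sketch (its name and interface are not from the text); it performs one measurement update and one time update, using a pseudoinverse as in (12a).

```python
import numpy as np

def kalman_step(x_prior, P_prior, y, F, H, Q, R):
    """One cycle of (12a)-(13b): measurement update followed by time update."""
    S = H @ P_prior @ H.T + R                            # innovation covariance
    K = P_prior @ H.T @ np.linalg.pinv(S)                # gain, Eq. (12a)
    x_post = x_prior + K @ (y - H @ x_prior)             # Eq. (12b)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior    # Eq. (12c)
    return x_post, P_post, F @ x_post, F @ P_post @ F.T + Q  # Eqs. (13a)-(13b)

# Scalar example: a random walk observed directly (F = H = 1, Q = 0, R = 1).
x_post, P_post, x_next, P_next = kalman_step(
    np.array([0.0]), np.array([[1.0]]), np.array([2.0]),
    np.array([[1.0]]), np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]]))
print(x_post, P_post)  # → [1.] [[0.5]]
```

With equal prior and measurement variances, the update splits the difference between the prior (0) and the observation (2) and halves the variance, as expected.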
3 Extended Kalman filter with application to filtering skeleton head point

The focus here is on applying the Kalman filter principle to tracking the skeleton head joint. The state vector is taken to be the true 3D coordinates of the skeleton head joint and their velocities. Denoting these (without the discrete time subscript k) as x, y, z, ẋ, ẏ, ż and using the Kinect coordinate system (in meters), the state vector at time step k is

X_k = ( x_k, y_k, z_k, ẋ_k, ẏ_k, ż_k )^T.   (14)

The state evolution is modeled by

X_{k+1} = F X_k + w_k,   (15)
where the matrix F is given in block form by

F = [ I_{3×3}   Δ I_{3×3} ]
    [ 0_{3×3}   I_{3×3}   ],   (16)

where Δ is the width of the time step in seconds. The Kinect skeleton frame becomes available at 30 Hz, so Δ = 1/30 s. The skeleton frame provides a measurement of the skeleton head joint, whose coordinates are denoted (without subscripts) by x̃, ỹ, z̃. When the tracked subject is speaking, the Kinect audio stream provides a measurement of the sound source angle from its microphone array. Assuming the sound source is actually the speech from the subject, the sound source angle would be exactly

θ = arctan(x/z),   (17)

where x and z are the true x and z coordinates of the skeleton head joint. However, the measured sound source angle θ̃ has an error, so it is modeled as

θ̃ = arctan(x/z) + error   (18)

in radians. Thus, the measurement of the sound source angle also provides information about the x and z coordinates in the state vector, and the measurement model can be formulated as

Y_k ≡ ( x̃_k, ỹ_k, z̃_k, θ̃_k )^T = ( x_k, y_k, z_k, arctan(x_k/z_k) )^T + v_k := h(X_k) + v_k.   (19)

This measurement model is related to the state variables by the nonlinear function h. To handle this, the idea of the extended Kalman filter is followed: the function h is linearized, and the matrix H in the filter is replaced by the Jacobian of h, evaluated at the current state estimate (which makes it dependent on k). The only portion of h that is nonlinear is the fourth row, involving the arc tangent. Taking partial derivatives and evaluating them at the state estimate X̂_k^−, the fourth row of the H matrix becomes

( ∂ arctan(x̂_k^−/ẑ_k^−)/∂x̂_k^−, ∂ arctan(x̂_k^−/ẑ_k^−)/∂ŷ_k^−, ∂ arctan(x̂_k^−/ẑ_k^−)/∂ẑ_k^−, 0, 0, 0 )
= ( (1/ẑ_k^−)/(1 + (x̂_k^−/ẑ_k^−)²), 0, −(x̂_k^−/(ẑ_k^−)²)/(1 + (x̂_k^−/ẑ_k^−)²), 0, 0, 0 ),   (20)

and the H matrix, which now depends on the time step k, is

H_k = [ 1  0  0  0  0  0 ]
      [ 0  1  0  0  0  0 ]
      [ 0  0  1  0  0  0 ]
      [ (1/ẑ_k^−)/(1 + (x̂_k^−/ẑ_k^−)²)  0  −(x̂_k^−/(ẑ_k^−)²)/(1 + (x̂_k^−/ẑ_k^−)²)  0  0  0 ].   (21)
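The analytic fourth row of H_k can be verified against a finite-difference derivative of arctan(x/z). This check is illustrative only and is not part of the original text.

```python
import numpy as np

# Compare the closed-form fourth row of H_k from (20)-(21) with a central
# finite-difference derivative of theta(x, z) = arctan(x / z).
def h4_analytic(x, z):
    d = 1.0 + (x / z) ** 2
    return np.array([(1.0 / z) / d, 0.0, -(x / z ** 2) / d, 0.0, 0.0, 0.0])

def h4_numeric(x, z, eps=1e-6):
    dx = (np.arctan((x + eps) / z) - np.arctan((x - eps) / z)) / (2 * eps)
    dz = (np.arctan(x / (z + eps)) - np.arctan(x / (z - eps))) / (2 * eps)
    return np.array([dx, 0.0, dz, 0.0, 0.0, 0.0])

print(np.allclose(h4_analytic(0.5, 2.0), h4_numeric(0.5, 2.0), atol=1e-8))  # → True
```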
The recursive extended Kalman filter for this model is then described by the following steps:

1. Initialization step: input prior estimates X̂_0^−, P_0^−.
2. Measurement update step:

   K_k = P_k^− H_k^T (H_k P_k^− H_k^T + R)^†,   (22a)
   X̂_k = X̂_k^− + K_k (Y_k − h(X̂_k^−)),   (22b)
   P_k = (I − K_k H_k) P_k^−.   (22c)

3. Time update step:

   X̂_{k+1}^− = F X̂_k,   (23a)
   P_{k+1}^− = F P_k F^T + Q.   (23b)
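Putting (14)–(23b) together, a minimal sketch of the head-joint EKF might look as follows. The noise values follow the numbers reported in Sect. 5; the use of arctan2 (equivalent to arctan(x/z) for z > 0, as holds in front of the sensor) is an implementation choice for numerical robustness, and the function names are illustrative.

```python
import numpy as np

# State: (x, y, z, xdot, ydot, zdot); measurement: (x, y, z, theta).
DT = 1.0 / 30.0                                   # skeleton frame period (s)
F = np.block([[np.eye(3), DT * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])     # Eq. (16)
Q = 1e-6 * np.eye(6)
R = np.diag([0.001, 0.001, 0.001, 0.01])

def h(x):
    """Measurement function (19): skeleton position plus sound source angle."""
    return np.array([x[0], x[1], x[2], np.arctan2(x[0], x[2])])

def jacobian(x):
    """H_k of (21), evaluated at the prior state estimate."""
    Hk = np.zeros((4, 6))
    Hk[0, 0] = Hk[1, 1] = Hk[2, 2] = 1.0
    d = 1.0 + (x[0] / x[2]) ** 2
    Hk[3, 0] = (1.0 / x[2]) / d                   # d(theta)/dx
    Hk[3, 2] = -(x[0] / x[2] ** 2) / d            # d(theta)/dz
    return Hk

def ekf_step(x_prior, P_prior, y):
    """One cycle of (22a)-(23b); returns the next prior state and covariance."""
    Hk = jacobian(x_prior)
    K = P_prior @ Hk.T @ np.linalg.pinv(Hk @ P_prior @ Hk.T + R)  # (22a)
    x_post = x_prior + K @ (y - h(x_prior))                       # (22b)
    P_post = (np.eye(6) - K @ Hk) @ P_prior                       # (22c)
    return F @ x_post, F @ P_post @ F.T + Q                       # (23a)-(23b)
```

Feeding each 30 Hz frame's skeleton coordinates and sound source angle through `ekf_step` yields the smoothed head-joint track.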
4 Handling missing data

Several events can cause the loss of the skeleton estimates of the head position: moving outside of the 57.5° region where the skeleton is computed, moving beyond the maximum depth that Kinect can detect and measure, or moving too close to the Kinect sensor. Likewise, the adaptive sound source angle will be lost if there is insufficient sound (e.g., the subject stops speaking) or the subject’s voice is not loud enough to overcome omnidirectional ambient noise. When such losses of data occur, it is important to be able to continue to track the speaker. In [15] and [16], it is shown that the filter can “coast” (i.e., maintain estimation integrity) when portions of the observation vector are unobserved by simply performing the following modifications in the measurement update step given previously:

1. The elements in the observation vector Y_k that are missing are replaced by 0.
2. The corresponding rows of h(X̂_k^−) (i.e., those matching the missing data in Y_k) are set to 0.
3. The corresponding rows of H_k (i.e., those matching the missing data in Y_k) are set to 0.
4. The off-diagonal elements of R involving missing data are replaced by 0.

These rules are summarized in Table 1 for the specific version of the extended Kalman filter used here. When this filter deals with missing measurement data, i.e., missing sound data or missing skeleton data, it becomes equivalent to the extended Kalman filter designed without the corresponding measurement channel. The nonzero diagonal elements of R, i.e., the variances for missing data, are kept for algorithmic convenience to keep R nonsingular; their particular values have no effect on the resulting estimates. Modification steps 1–3 effectively eliminate the measurement channels with missing observation data.
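Rules 1–4 can be sketched as a pre-processing step for the measurement update. The function name and the boolean-mask interface below are assumptions for illustration, not part of the original text.

```python
import numpy as np

def mask_for_missing(y, h_x, Hk, R, missing):
    """Apply rules 1-4 for one frame. `missing` flags the channels of the
    4-vector (x, y, z, theta) that are unavailable this frame."""
    y, h_x, Hk, R = y.copy(), h_x.copy(), Hk.copy(), R.copy()
    m = np.asarray(missing, dtype=bool)
    y[m] = 0.0                        # rule 1: zero the missing observations
    h_x[m] = 0.0                      # rule 2: zero the matching rows of h
    Hk[m, :] = 0.0                    # rule 3: zero the matching rows of H_k
    for i in np.flatnonzero(m):       # rule 4: zero off-diagonals of R that
        r_ii = R[i, i]                #         involve missing channels, but
        R[i, :] = 0.0                 #         keep the diagonal entry so R
        R[:, i] = 0.0                 #         stays nonsingular
        R[i, i] = r_ii
    return y, h_x, Hk, R
```

With all four channels flagged missing, the masked update leaves the state untouched and the filter reduces to pure time-update coasting.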
5 Results with Kinect data

The smoothness and tracking accuracy of the extended Kalman filter are examined in this section. To perform this experiment, ground-truth trajectories in the (x, z) plane were set up with respect to the Kinect sensor origin, allowing for an examination of how the extended-Kalman-filtered position estimates, as well as the Kinect skeleton head point estimates, deviated from these trajectories. It is difficult to measure a known trajectory precisely. Strategies that map out points along the trajectory on the floor are fraught with error, since the subject’s feet may follow the path faithfully while the head does not. Also, measurement errors in the coordinates of the ground-truth path are likely to be large. The strategy followed was to take a piece of twine of known length (6 and 12 ft), attach it to the rigid base of the sensor stand, and then have the subject, facing the sensor at all times, move along the semicircular path traced out while holding the end of the twine taut at the skeleton head joint. Note that in Fig. 4, which shows the subject in the experimental setup, the end of the cord is held directly on the forehead at the skeleton head joint. In fact, the 3D coordinates of the head joint are
Table 1 Structures for handling missing observations in extended Kalman filter

Original:
  Y_k = ( x̃_k, ỹ_k, z̃_k, θ̃_k )^T
  h(X_k) = ( x_k, y_k, z_k, arctan(x_k/z_k) )^T
  H_k: as in (21), with fourth row ( (1/ẑ_k^−)/(1 + (x̂_k^−/ẑ_k^−)²), 0, −(x̂_k^−/(ẑ_k^−)²)/(1 + (x̂_k^−/ẑ_k^−)²), 0, 0, 0 )
  R = [ R_11 R_12 R_13 R_14; R_21 R_22 R_23 R_24; R_31 R_32 R_33 R_34; R_41 R_42 R_43 R_44 ]

Sound data missing:
  Y_k = ( x̃_k, ỹ_k, z̃_k, 0 )^T
  h(X_k) = ( x_k, y_k, z_k, 0 )^T
  H_k: fourth row set to zero
  R = [ R_11 R_12 R_13 0; R_21 R_22 R_23 0; R_31 R_32 R_33 0; 0 0 0 R_44 ]

Skeleton data missing:
  Y_k = ( 0, 0, 0, θ̃_k )^T
  h(X_k) = ( 0, 0, 0, arctan(x_k/z_k) )^T
  H_k: first three rows set to zero, fourth row as in (21)
  R = [ R_11 0 0 0; 0 R_22 0 0; 0 0 R_33 0; 0 0 0 R_44 ]
Fig. 4 Experimental setup
roughly at the center of mass of the subject’s head, as estimated by Kinect’s algorithms. To adjust for this, roughly 4 in. was added to the radii of the trajectories. The subject was instructed to start from beyond the 57.5° skeleton coverage area, move through the 57.5° coverage area, and then move outside the 57.5° coverage area again, at all times staying on the ground-truth trajectory. As a result, there was data dropout in the skeleton tracking for those brief periods outside the 57.5° coverage area. The subject made two passes along this trajectory. On the first pass, the subject did not speak. On a separate pass, the subject moved along the trajectory while speaking (reciting the numbers 1, 2, 3, etc.). No effort was made to control ambient noise during the experiment. For the extended Kalman filter, Q was taken to be a diagonal matrix with all entries equal to 10^{−6}, and R was taken to be a diagonal matrix with diagonal entries 0.001, 0.001, 0.001, 0.01. The filter was initialized by setting X̂_0^− equal to the initial skeleton head point estimate with zero initial velocities, and P_0^− equal to a diagonal matrix with entries 10^{−4}. Positions are in meters, velocities in meters per second, and angles in radians (converted to degrees in plots). Figure 5 shows the smoothness of tracking afforded by the filter, as compared to the skeleton head point estimate provided by Kinect. This run is at a radius of 12 ft, with the subject speaking. The subject was instructed to move slowly and deliberately, trying to maintain the head position at a constant elevation. The oscillations due
Fig. 5 Comparison of extended Kalman filter and Kinect estimates, in meters. a x trajectory. b y trajectory. c z trajectory. d Sound source angle
to walking are small but still apparent in both the skeleton and filter estimates of the y coordinate (Fig. 5b), which measures the vertical displacement of the head joint, but are mostly eliminated from the x (Fig. 5a) and z (Fig. 5c) coordinates, which are related by z = √(r² − x²) along the true trajectory (with r = 12 ft). It can be seen here that the filter estimates of all three spatial coordinates show a much smoother trajectory than the unmodified Kinect skeleton estimates. The sound source angle was also estimated using the filter estimates of the (x, z) coordinates from the formula

θ = arctan(x/z),   (24)

since the sound was emanating from (roughly) the subject’s head joint. A much smoother estimate of the sound source angle is likewise obtained, as compared to that provided directly by the Kinect digital signal processor (Fig. 5d). To measure the accuracy, the absolute error of the estimated position (either from the skeleton or from the filter) from the true semicircular path was calculated. The overall measure of accuracy was taken to be the mean of these values over all the points along the path. These calculations were performed using the following formulas.
For a given estimated point on the trajectory P_i = (x_i, z_i), the average absolute deviation of this point from the true trajectory is calculated as

D(P_i) = ( | |x_i| − √(r² − z_i²) | + | z_i − √(r² − x_i²) | ) / 2,   (25)

where r is the radius of the true trajectory with respect to the sensor origin. The mean and standard error (SE)¹ of these values were also calculated. Because there are data dropouts (for which the coordinates of the skeleton point are replaced by 0s), to be fair, the error measures for both the filter and the skeleton were calculated only when the skeleton data were available. Otherwise, there would be some very large deviations for the skeleton data in regions where the filter is able to continue to provide reasonable estimates owing to coasting or the availability of sound source data. Two runs were made with the filter at each radius: one with the subject speaking and one with the subject silent. The two plots of the trajectories are shown in Fig. 6 for the two different radii (6 and 12 ft). Also given are the mean errors with their SEs, all in meters. In both sets of runs, the confidence threshold for Kinect’s adaptive sound source angle is set at 0.1. The confidence measure here is computed by Kinect and is a value between 0 and 1, expressing the confidence that the angle computed by Kinect is correct. This means that the filter treats any source angle data with an associated confidence of less than 0.1 as missing. This has little effect on those runs in which the subject is speaking. However, for runs where the subject is silent, it allows the filter to possibly use the faint sounds of the subject moving (e.g., the sound of footsteps), if available. A separate run was made at a 6 ft radius with the subject speaking to examine whether the filter could continue to track accurately when the skeleton data stream became unavailable. In this run, the subject continued moving at the end of the semicircular trajectory for a few steps after the Kinect skeleton stream stopped producing data. This region is shown on the accuracy plot in Fig. 7, where it is seen that the skeleton estimate begins to become inaccurate (showing a trajectory perpendicular to the true trajectory) just as the 57.5° boundary is crossed, yet the filter estimate continues to track the true trajectory, based on the sound source data, which are still available, and on the previously estimated velocity state. A noticeable improvement is seen in the accuracy plots for the filter in the speaking runs (Fig. 6a, c) compared to the nonspeaking runs (Fig. 6b, d). This illustrates the position accuracy improvement due to the use of the sound source angle data. The accuracy results are summarized in Table 2. The numbers are the mean errors in meters, and the numbers in parentheses below them are the corresponding estimated standard errors. The sample sizes n for the runs are also given. The extended Kalman filter appears to be more accurate than the unmodified Kinect skeleton in both nonspeaking and speaking runs, and in the speaking runs these differences are the most dramatic. In addition, there appears to be a significant increase in accuracy of the extended Kalman filter for the speaking runs, as compared to the nonspeaking runs, indicating that sound can significantly enhance the accuracy of the extended Kalman filter estimates. Finally, the accuracy appears to diminish with increasing distance, which is a known phenomenon with Kinect. The preceding data can be used to calculate the statistical significance of the differences between the speaking and nonspeaking runs for the extended Kalman filter at a fixed distance. Since the speaking and nonspeaking runs are separate, the filter errors are independent between these runs, so the statistical significance may be computed by calculating the z-score

z = (m_1 − m_2) / √(SE_1² + SE_2²),   (26)

¹ The standard error is the standard deviation of the values D(P_i), i = 1, 2, ..., n, divided by √n.
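Equations (25) and (26) are straightforward to compute. The sketch below (function names are illustrative) reproduces the 6 ft speaking-versus-nonspeaking z-score from the accuracy statistics reported in this section.

```python
import math

def deviation(x, z, r):
    """Average absolute deviation (25) of an estimated point (x, z) from a
    semicircular ground-truth path of radius r centered at the sensor."""
    dx = abs(abs(x) - math.sqrt(max(r**2 - z**2, 0.0)))
    dz = abs(z - math.sqrt(max(r**2 - x**2, 0.0)))
    return 0.5 * (dx + dz)

def z_score(m1, se1, m2, se2):
    """Two-sample z-score (26) for independent mean errors."""
    return (m1 - m2) / math.sqrt(se1**2 + se2**2)

# A point lying exactly on an r = 2 m circle deviates by (numerically) zero:
print(round(deviation(1.2, 1.6, 2.0), 12))  # → 0.0
# 6 ft EKF runs, speaking vs. not speaking, means and SEs as reported below:
print(round(z_score(0.012431, 0.0008464, 0.025795, 0.0012375), 2))  # → -8.91
```

The computed value matches the −8.91 reported in the text for the 6 ft runs.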
Fig. 6 Accuracy comparison of extended Kalman filter (EKF) and Kinect estimates. Accuracy statistics are mean deviations with standard errors in parentheses. a 6 ft, speaking; EKF: 0.012431 (0.0008464); Kinect: 0.079557 (0.0037208). b 6 ft, not speaking; EKF: 0.025795 (0.0012375); Kinect: 0.075994 (0.0035948). c 12 ft, speaking; EKF: 0.07626 (0.003722); Kinect: 0.16183 (0.0044305). d 12 ft, not speaking; EKF: 0.14486 (0.0042684); Kinect: 0.18374 (0.0040104). Axes and accuracy statistics are in meters. In all cases, the Kinect adaptive sound source angle confidence threshold is set at 0.1

Fig. 7 Filter tracking in absence of skeleton stream data; axes in meters
Table 2 Summary of accuracy results; mean error in meters, with standard error in parentheses

Distance   Condition       Extended Kalman filter   Kinect skeleton
6 ft       S  (n = 184)    0.012431 (0.0008464)     0.079557 (0.0037208)
6 ft       NS (n = 188)    0.025795 (0.0012375)     0.075994 (0.0035948)
12 ft      S  (n = 226)    0.07626 (0.003722)       0.16183 (0.0044305)
12 ft      NS (n = 364)    0.14486 (0.0042684)      0.18374 (0.0040104)

S = speaking, NS = not speaking.
where m₁ is the mean error for the speaking run, m₂ is the mean error for the nonspeaking run, SE₁ is the standard error for the speaking run, and SE₂ is the standard error for the nonspeaking run. For the 6 ft runs the z-score is −8.91, and for the 12 ft runs it is −12.54. Both are highly significant, lying well below the 0.0001 percentile of the standard normal distribution. Similarly, the effect of distance (less accuracy at 12 ft than at 6 ft) is highly significant for both speaking and nonspeaking runs, for both the Kinect skeleton and the extended Kalman filter. Finally, the extended Kalman filter is compared with the Kinect skeleton for the four combinations of distance and sound: (6 ft, speaking), (6 ft, not speaking), (12 ft, speaking), and (12 ft, not speaking). Since the filter estimates are highly correlated with the corresponding skeleton estimates, they are treated as paired observations by forming a single time series of the difference between the filter and skeleton series. The corresponding mean differences, with standard errors in parentheses, are, respectively, −0.0671 (0.0038), −0.0502 (0.0036), −0.0856 (0.0053), and −0.0389 (0.0030), all highly significant.

6 Conclusions

There are no published specifications for Kinect skeleton tracking accuracy. The intent here has been simply to show that, whatever the inherent accuracy of Kinect skeleton tracking, it can be improved significantly by applying optimal filtering techniques. Whether these improvements are sufficient depends on the application. The inherent accuracy of the skeleton is probably adequate for recognizing simple gestures based on gross motor coordination. For recognizing fine motor gestures, however, any gain in accuracy and smoothness enhances the effectiveness of the sensor.
Accordingly, an extended Kalman filter was developed and its efficacy examined for smoothing and increasing the accuracy of the head joint of the Kinect sensor’s skeleton. The extended Kalman filter state contains both position and velocity, and advantage is taken of measurements supplied by the Kinect skeleton and sound source angle. The results of these experiments are significant. The extended Kalman filter substantially smooths the Kinect’s skeleton head joint position and sound source angle estimates. More importantly, the experiments demonstrate that the accuracy of tracking is significantly increased using the filter, particularly when sound direction is taken into account. Referring to Fig. 8, at a distance of 6 ft, average accuracy improves from 8 cm for the Kinect skeleton estimate to 1.5 cm for the extended Kalman filter when the subject being tracked is speaking, and from 7.6 to 2.6 cm when the subject is not speaking. At a distance of 12 ft, average accuracy improves from 16 cm for the Kinect skeleton estimate, to 7.5 cm for the extended Kalman filter when the subject being tracked is speaking, and from 18 to 14 cm when the subject is not speaking. Comparing the effect of including the observed sound source angle in the extended Kalman filter, it was shown that at 6 ft, the filter estimates improved from an average accuracy of 2.5 cm when the subject was not speaking to an average accuracy of 1.5 cm when the subject was speaking. At 12 ft, the filter accuracy improved from an average of 15 cm when the subject was not speaking to an average of 7.5 cm when the subject was speaking. It was found that the effect of distance was also significant, with less accuracy at 12 ft than at 6 ft, which is consistent with the Kinect literature. One could argue that the extra 1 cm improvement at 6 ft achieved by using the sound source angle in the extended Kalman filter does not seem worth it. 
Fig. 8 Summary of experimental results

However, this result is especially important with regard to the use of the Kinect sensor as a natural user interface to a computer. For example, in future applications, this extra centimeter
of accuracy could matter in tracking the user's fingertips to manipulate a virtual touch screen. The improvement from 15 to 7.5 cm at 12 ft is even more important: it occurs near the limit of the sensor's ability to estimate depth (and hence 3D position) accurately. Clearly, the added sound source observation has a large impact on the accuracy of the sensor's tracking. There are some limitations to the scope of these results. So that an accurate ground-truth trajectory could be maintained, only the (x, z) coordinate of the subject's head joint was tracked, along a known semicircular trajectory, with the subject moving slowly and deliberately. Further experiments allowing more complex motion would show how broadly these results apply; the difficulty would lie in controlling and accurately measuring more complex true trajectories. The approach could easily be extended to additional (even all) skeleton joints. For example, the head, chest, and two shoulder joints could be combined in a single Kalman filter with 24 states (spatial coordinates plus velocities). The positions of these joints are highly correlated, and joints close to the central core of the body have (x, z) coordinates relatively close to those of the head joint, so when the subject is speaking, the observed sound source angle would likely improve the accuracy of all the joint locations. Again, though, measuring accuracy in such an experiment would be difficult, as the ground-truth trajectories of the four joints would be hard to know and control.
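To make the single-joint filter concrete, the following is a minimal Python sketch, not the authors' implementation: a constant-velocity extended Kalman filter over the head joint's (x, z) coordinates that fuses the skeleton position measurement with a sound azimuth, modeled here under the assumption theta = atan2(x, z). The noise covariances are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

DT = 1.0 / 30.0  # Kinect skeleton frame period (30 fps)

# Constant-velocity model for the state [x, z, vx, vz].
F = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = 1e-4 * np.eye(4)        # process noise (illustrative)
R_pos = 5e-3 * np.eye(2)    # skeleton (x, z) measurement noise (illustrative)
R_ang = np.array([[1e-3]])  # sound-azimuth measurement noise (illustrative)

def predict(xhat, P):
    """Time update: propagate state and covariance one frame ahead."""
    return F @ xhat, F @ P @ F.T + Q

def update_position(xhat, P, z_pos):
    """Linear measurement update with the skeleton (x, z) position."""
    H = np.array([[1.0, 0, 0, 0],
                  [0, 1.0, 0, 0]])
    S = H @ P @ H.T + R_pos
    K = P @ H.T @ np.linalg.inv(S)
    xhat = xhat + K @ (z_pos - H @ xhat)
    P = (np.eye(4) - K @ H) @ P
    return xhat, P

def update_angle(xhat, P, theta):
    """EKF update with the sound azimuth theta = atan2(x, z),
    linearized about the current state estimate."""
    x, z = xhat[0], xhat[1]
    r2 = x * x + z * z
    H = np.array([[z / r2, -x / r2, 0.0, 0.0]])  # Jacobian of atan2(x, z)
    S = H @ P @ H.T + R_ang
    K = P @ H.T @ np.linalg.inv(S)
    resid = np.array([theta - np.arctan2(x, z)])
    xhat = xhat + K @ resid
    P = (np.eye(4) - K @ H) @ P
    return xhat, P
```

When the skeleton stream drops out because the subject leaves the field of view, the position update is simply skipped: repeated calls to predict extrapolate along the last estimated velocity, and, while the subject is speaking, update_angle continues to correct the bearing, which is the behavior illustrated in Fig. 7.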
Since both smoothness and accuracy improved in the experiment through the use of the filter, with or without the benefit of the sound source angle, it is speculated that there would be improvements in smoothness and accuracy if additional skeleton joints were added to the filter, or if the true trajectory were more complex. Since these investigations would be costly and require more sophisticated and expensive equipment for measuring the true trajectories, these questions are left for possible future research. Finally, it is noted that Microsoft has developed the next-generation Kinect sensor, the Kinect 2, which has many new enhancements in its sensing capabilities, including increased depth accuracy, range, and resolution. This device was not yet available to the public as of the writing of this article. However, the extended Kalman filter idea that has been discussed in this work should be applicable to Kinect 2.