Machine Vision and Applications DOI 10.1007/s00138-016-0801-7
ORIGINAL PAPER
Directional coherence-based spatiotemporal descriptor for object detection in static and dynamic scenes Wonjun Kim1 · Jae-Joon Han2
Received: 15 December 2015 / Revised: 18 April 2016 / Accepted: 10 July 2016 © Springer-Verlag Berlin Heidelberg 2016
Abstract This paper presents a simple yet powerful local descriptor, the so-called histograms of space–time dominant orientations (HiSTDO). Specifically, our HiSTDO is composed of two main components, i.e., the dominant orientation and its coherence, which represents how intensively gradients in the local region are distributed along the space–time dominant orientation. By incorporating these two components into a histogram, we obtain our HiSTDO descriptor. In contrast to previous methods, which are vulnerable to background clutter and camera noise, HiSTDO faithfully encodes the space–time shape of underlying structures even under such challenging conditions and can thus be applied efficiently to various applications (e.g., object and action detection). Experimental results on diverse datasets demonstrate that the proposed descriptor is effective for human action detection as well as object detection.

Keywords Space–time descriptor · Dominant orientation · Coherence · HiSTDO · Object and action detection
Correspondence: Wonjun Kim ([email protected])

1 Electronics Engineering, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea
2 Samsung Advanced Institute of Technology (SAIT), 130 Samsung-ro, Suwon-si, Gyeonggi-do 16678, Korea

1 Introduction

With the increasing demand for high-level scene understanding in images and videos, there has been considerable interest in developing a simple and robust space–time descriptor. Such a descriptor can be considered one of the key prerequisites for many applications, including object detection and recognition, surveillance, image and video retrieval, and human–computer interfacing. For this task, diverse models based on image gradients have been proposed over the last few decades. Most notably, Lowe [1] proposed the scale-invariant feature transform (SIFT), which utilizes gradient statistics extracted from the local region with a multiscale analysis. The SIFT descriptor has a great ability to describe a given scene even under varying illumination and partial occlusion. Subsequently, Dalal and Triggs [2] introduced a simplified version of SIFT referred to as histograms of oriented gradients (HOG). Since the HOG descriptor is data driven (i.e., it makes no assumption about model parameters for the feature distribution) and can be densely represented in a local region, it has inspired many researchers to employ it for various image processing tasks [3–7]. In particular, Felzenszwalb et al. [3] applied HOG to the deformable part-based model with a latent SVM, and their method achieved great success in object detection. Although SIFT and HOG descriptors effectively capture local shapes in a given image, the pixel-wise gradient information is vulnerable to various challenging conditions such as the presence of noise and cluttered backgrounds. On the other hand, some concepts employed for static images have been successfully extended to the space–time domain for video analysis. Scovanner et al. [8] proposed the three-dimensional SIFT descriptor, which is defined using spatiotemporal gradients obtained from local volumes. Their concept was generalized by exploiting the integral video and a three-dimensional orientation quantization based on regular polyhedrons [9], called HOG3D in the literature. Although these approaches are simple and widely used for video applications, they are quite sensitive to the motion of background clutter because they depend primarily on gradient magnitudes obtained in a pixel-wise manner. In contrast, Marszalek et al. [33] investigated histograms of optical flow (HOF) as a local descriptor for video representation.
Fig. 1 Comparison of the discriminative power over a selected region (white rectangular box). Results are shown for a HOG [2], b the 2-D version of HOG3D [9], and c the proposed HiSTDO for the spatial domain. Red bars indicate values for the original image while blue bars are for the noisy one. Note that bin 5 is the quantization level corresponding to 90°. Best viewed in color (color figure online)
Specifically, they accumulate the magnitude of optical flow onto the corresponding direction in a similar way to the original HOG [2]. Furthermore, Dalal et al. [10] proposed motion boundary histograms (MBH) by computing derivatives separately for the horizontal and vertical components of the optical flow. HOF and MBH have been popularly employed for human action recognition; however, these descriptors capture only the temporal information (i.e., without considering the spatial shape), which yields a significant drop in recognition performance under cluttered backgrounds. Moreover, the computation of optical flow is quite expensive, and the results heavily depend on the choice of regularization method [11]. Even though HOG and its variants have been successfully adopted to represent local structure for various computer vision applications, they still suffer from ambiguities caused by noise and highly textured backgrounds. For example, the energy of HOG and the 2-D version of HOG3D [9], which is highly concentrated on a particular orientation, starts to spread over the whole range of orientations when noise is involved, leading to a significant drop in discriminative power for detection tasks (see Fig. 1). In other words, previous methods fail to represent the same structure of the given image consistently, as shown in Fig. 1. In this paper, we propose a novel local descriptor, called histograms of space–time dominant orientations (HiSTDO). The proposed approach, which extends our previous work that considered only spatial gradients [12], provides a unified framework for efficiently describing the local shape in the space–time domain. The key idea of the proposed method is that the directional characteristics based on the dominant orientation can capture the relevant shape, which is highly coherent with the underlying structure of the given image or video.
Therefore, our HiSTDO descriptor well preserves the structural information of the given scene even in the presence of clutter and significant distortions, which is desirable for robust object and action detection in a wide range of scenes. Note that this paper extends our previous work [12] and differs in the following respects: (1) we extend the concept of the coherence histogram to the spatiotemporal domain by efficiently exploiting the temporal gradient and a 2-D histogram representation; (2) the technical details for action detection based on our unified framework are explained in detail; and (3) we provide comparative evaluations against representative action detection methods using various metrics, both qualitatively and quantitatively. The rest of this paper is organized as follows. The unified framework for the proposed method is introduced in Sect. 2. The performance of object and action detection on various datasets is demonstrated in Sect. 3. The conclusion follows in Sect. 4.
2 The proposed local descriptor

2.1 Space–time dominant orientation and its coherence

The motivation of our new unified approach is to find a more effective way to represent the local region of a given image or video, under the assumption that the human visual system (HVS) is highly adaptive in extracting the dominant orientation when recognizing texture and motion patterns. To this end, we employ the 3-D structure tensor, which efficiently summarizes the dominant orientation and the energy along this direction (i.e., coherence) based on the space–time local gradient field, defined as follows:

$$S^k(i) = \begin{bmatrix} \sum_{j} I_x^k(j)^2 & \sum_{j} I_x^k(j) I_y^k(j) & \sum_{j} I_x^k(j) I_t^k(j) \\ \sum_{j} I_x^k(j) I_y^k(j) & \sum_{j} I_y^k(j)^2 & \sum_{j} I_y^k(j) I_t^k(j) \\ \sum_{j} I_x^k(j) I_t^k(j) & \sum_{j} I_y^k(j) I_t^k(j) & \sum_{j} I_t^k(j)^2 \end{bmatrix}, \quad (1)$$

where I_x^k, I_y^k, and I_t^k denote the gradients in the horizontal, vertical, and temporal directions at the kth frame, respectively. Each summation in (1) is taken over the local region B_i (e.g., 5 × 5 pixels) centered at the ith pixel position, i.e., j ∈ B_i. Note that temporal gradients can be simply computed as I_t^k(j) = I^k(j) − I^{k−τ}(j) (τ = 3 in our work). It is worth noting that the 2-D structure tensor for static image-based applications is a specific form of (1) in which all the temporal gradients are set to zero (i.e., I_t^k(j) = 0). The usefulness of the 3-D structure tensor defined in (1) for our task stems from the fact that the eigenvector corresponding to the largest eigenvalue of S^k(i), i.e., e^1, indicates the dominant orientation of the local region.
Furthermore, the relative discrepancy between the largest eigenvalue (i.e., λ1) and the others (i.e., λ2 and λ3) of S^k(i) indicates how intensively gradients in the local region are distributed along the space–time dominant direction. Note that only two eigenvalues (i.e., λ1 and λ2), obtained from the 2 × 2 matrix S^k(i), are required for static image analysis. Therefore, we first compute the dominant orientation and its coherence at each pixel position in the spatial domain (i.e., I_t^k(j) = 0 in (1)) as follows:

$$\theta = \tan^{-1}\left(e_y^1 / e_x^1\right), \quad \text{where } \mathbf{e}^1 = [e_x^1, e_y^1]^T, \quad (2)$$

$$c_s = \lambda_1 - \lambda_2, \quad (3)$$
where c_s indicates our coherence; the larger the value of c_s, the higher the coherence. For the space–time analysis, we decompose the three-dimensional eigenvector [i.e., the first eigenvector of (1)], which is defined in the space–time domain (i.e., the xyt-space), into two sub-directional components (see Fig. 4) and compute the dominant orientation as follows:

$$\theta = \tan^{-1}\left(\frac{e_y^1}{e_x^1}\right), \quad \phi = \tan^{-1}\left(\frac{e_t^1}{\sqrt{(e_x^1)^2 + (e_y^1)^2}}\right), \quad (4)$$
where e^1 = [e_x^1, e_y^1, e_t^1]^T denotes the eigenvector corresponding to the largest eigenvalue of (1), which indicates the dominant orientation of the given local volume. It should be emphasized that the relationship between spatial and temporal information is implicitly encoded in our 3-D structure tensor-based scheme. Note that the average of gradients does not guarantee a reliable measure of the dominant orientation, since aligned but oppositely oriented gradients would cancel out in this average [13]. In what follows, we define the coherence of the space–time dominant orientation at each pixel position as follows:

$$c_t = \lambda_1 - (\lambda_2 + \lambda_3). \quad (5)$$
Note that λ3 will be zero when our scheme is applied to static images. For better understanding, we illustrate the distributions of spatiotemporal gradients obtained from three different regions in Fig. 2. As can be seen, spatiotemporal gradients generated by salient motions (i.e., waving hands) are intensively distributed along the dominant orientation compared to those of non-salient motions caused by background clutter. Moreover, the proposed method yields quite high coherence for region ② compared to the highly textured background (③), which indicates its ability to extract the shape information of the local region even without motion. These properties are desirable for describing the structural information in a wide range of scenes, even with complex backgrounds and camera noise. In the following subsection, we explain in detail our local space–time descriptor, defined by combining these two components (i.e., the dominant orientation and its coherence) into a histogram.

Fig. 2 Distributions of space–time gradients obtained from three different regions containing salient motions (i.e., waving hands) (①), strong edges (②), and motions caused by the background clutter (③). The coherence of the space–time dominant orientation is shown in each subfigure. Note that the visually important regions (e.g., regions ① and ②) yield high coherence along the dominant orientation compared to the background clutter (③)

2.2 HiSTDO: a local space–time descriptor

2.2.1 HiSTDO in the spatial domain

For the spatial domain, we simply incorporate the dominant orientation and its coherence into a 1-D histogram over a given sub-block of the image. Specifically, we compute these two components by using (2) and (3) over all the pixels belonging to the given sub-block. Note that the dominant orientation is quantized into K levels in the range [0°, 180°], since the sign of orientations is neglected due to the symmetric characteristic. Then, we build the histogram by accumulating the coherence value c_s of each pixel onto the corresponding dominant orientation and subsequently normalize it by its L_2-norm. This normalized histogram is our HiSTDO descriptor in the spatial domain, numerically defined as f = (f_1, f_2, ..., f_K) with:

$$f_k = \frac{E_k}{\sqrt{\sum_{l=1}^{K} (E_l)^2 + \varepsilon}}, \qquad E_k = \sum_{\substack{(x,y) \in W \\ \theta(x,y) \in k}} c_s(x, y). \quad (6)$$
Here c_s(x, y) and θ(x, y) denote the coherence value and the quantized dominant orientation at pixel position (x, y), respectively; ε is a small positive constant, and W is the set of pixels in the sub-block.
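To make the construction concrete, the following Python sketch builds the spatial HiSTDO histogram of a sub-block along the lines of (1)–(3) and (6). It is an illustrative re-implementation, not the authors' code: the 5 × 5 tensor window, the finite-difference gradients, and the bin-edge handling are our own assumptions.

```python
import numpy as np

def histdo_spatial(block, K=8, eps=1e-6, win=5):
    """Illustrative spatial HiSTDO for one sub-block (Eqs. (1)-(3), (6))."""
    Iy, Ix = np.gradient(block.astype(np.float64))  # simple image gradients
    r = win // 2
    hist = np.zeros(K)
    H, W = block.shape
    for y in range(r, H - r):
        for x in range(r, W - r):
            gx = Ix[y - r:y + r + 1, x - r:x + r + 1].ravel()
            gy = Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()
            # 2-D structure tensor: Eq. (1) with temporal gradients set to zero
            S = np.array([[gx @ gx, gx @ gy],
                          [gx @ gy, gy @ gy]])
            lam, vec = np.linalg.eigh(S)     # eigenvalues in ascending order
            e1 = vec[:, -1]                  # dominant orientation (Eq. (2))
            theta = np.degrees(np.arctan2(e1[1], e1[0])) % 180.0
            c_s = lam[1] - lam[0]            # coherence (Eq. (3))
            k = min(int(theta * K / 180.0), K - 1)
            hist[k] += c_s                   # accumulate coherence (Eq. (6))
    return hist / np.sqrt(np.sum(hist ** 2) + eps)  # L2 normalization
```

A sliding-window detector would evaluate this descriptor on every sub-block of the window and concatenate the results, as described next.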
Fig. 3 Feature vector generation based on HiSTDO descriptors. Note that the dimension of the feature vector is K × N
For the task of object detection, we define the feature vector by concatenating the HiSTDO descriptors obtained from each sub-block in the image patch (see Fig. 3), given as:

$$\mathbf{F} = (\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_N), \quad (7)$$
where N denotes the total number of sub-blocks in the image patch and f_n denotes the HiSTDO descriptor of the nth sub-block. Thus, the dimension of our feature vector becomes K × N for the given image patch. The overall procedure is also shown in Fig. 3.

2.2.2 HiSTDO in the space–time domain

For the space–time domain, we similarly define our HiSTDO descriptor by encoding the space–time dominant orientation and its coherence into a 2-D histogram. First, these two components are computed based on (4) and (5) over all the pixels belonging to the given sub-block. For efficient binning of the histogram, the spatiotemporal dominant orientation is decomposed into two sub-directions with angles θ and φ, as mentioned in the previous subsection (see Fig. 4). Note that θ and φ are quantized into P and Q levels in the ranges [0°, 180°] and [−90°, 90°], respectively, in our implementation. Then, we build the histogram by accumulating the coherence value c_t of each pixel onto the corresponding bin (m, n), where m and n index the quantized orientations obtained from θ and φ, respectively (i.e., m = 1, 2, ..., P and n = 1, 2, ..., Q), and subsequently normalize it by its L_2-norm. This normalized histogram is our HiSTDO descriptor in the space–time domain, numerically defined as f = (f_{1,1}, f_{1,2}, ..., f_{P,Q}) with:

$$f_{m,n} = \frac{E_{m,n}}{\sqrt{\sum_{p=1}^{P}\sum_{q=1}^{Q}(E_{p,q})^2 + \varepsilon}}, \quad (8)$$

$$E_{m,n} = \sum_{\substack{(x,y) \in W \\ \hat{\theta}(x,y) \in m,\ \hat{\phi}(x,y) \in n}} c_t(x, y), \quad (9)$$

where W is the set of pixels in the sub-block as mentioned above, θ̂(x, y) and φ̂(x, y) denote the quantized orientations of each sub-direction, respectively, and ε is a small positive constant.
Fig. 4 Overall procedure for generation of the HiSTDO descriptor in the space–time domain. Note that the space–time dominant orientation is decomposed into two sub-directions and the corresponding coherence is accumulated into the 2-D histogram
Algorithm 1 HiSTDO: a novel space–time local descriptor

Input: a sub-block of a given image or video
for (x, y) = (1, 1) to (X, Y) do  (X and Y are the width and height of the sub-block)
    Compute the space–time gradients I_x, I_y, I_t.
    Generate the structure tensor (1) and conduct the eigen-analysis.
    Compute the dominant orientation and its coherence at pixel position (x, y):
      i) 2-D case (using (2) and (3)):
         θ(x, y) = tan⁻¹(e_y^1 / e_x^1),  c_s(x, y) = λ1 − λ2.
      ii) 3-D case (using (4) and (5)):
         θ(x, y) = tan⁻¹(e_y^1 / e_x^1),  φ(x, y) = tan⁻¹(e_t^1 / √((e_x^1)² + (e_y^1)²)),
         c_t(x, y) = λ1 − (λ2 + λ3).
end for
Generate the histogram:
    i) Quantize θ(x, y) and φ(x, y).
    ii) Accumulate c_s (or c_t) onto the corresponding bin.
Normalize the histogram by its L_2-norm.
Output: f (HiSTDO descriptor)
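The following Python sketch mirrors the 3-D branch of Algorithm 1 for a single sub-block at frame k. The tensor window size, the clamping of negative coherence values to zero, and the uniform bin edges are our own assumptions rather than details specified above; the input volume must contain at least τ + 1 frames.

```python
import numpy as np

def histdo_spacetime(volume, P=8, Q=5, eps=1e-6, tau=3, win=5):
    """Illustrative space-time HiSTDO (Eqs. (1), (4), (5), (8)-(9)).

    volume: 3-D array (frames, height, width); built at the last frame.
    """
    vol = volume.astype(np.float64)
    Iy, Ix = np.gradient(vol[-1])         # spatial gradients at frame k
    It = vol[-1] - vol[-1 - tau]          # temporal gradient I^k - I^{k-tau}
    r = win // 2
    hist = np.zeros((P, Q))
    H, W = vol.shape[1:]
    for y in range(r, H - r):
        for x in range(r, W - r):
            G = np.stack([Ix[y - r:y + r + 1, x - r:x + r + 1].ravel(),
                          Iy[y - r:y + r + 1, x - r:x + r + 1].ravel(),
                          It[y - r:y + r + 1, x - r:x + r + 1].ravel()])
            S = G @ G.T                   # 3-D structure tensor, Eq. (1)
            lam, vec = np.linalg.eigh(S)  # ascending eigenvalues
            e1 = vec[:, -1]               # dominant space-time orientation
            theta = np.degrees(np.arctan2(e1[1], e1[0])) % 180.0    # Eq. (4)
            phi = np.degrees(np.arctan2(e1[2], np.hypot(e1[0], e1[1])))
            c_t = lam[2] - (lam[1] + lam[0])                        # Eq. (5)
            m = min(int(theta * P / 180.0), P - 1)
            n = min(int((phi + 90.0) * Q / 180.0), Q - 1)
            hist[m, n] += max(c_t, 0.0)   # accumulate coherence (clamp assumed)
    return hist.ravel() / np.sqrt(np.sum(hist ** 2) + eps)         # Eq. (8)
```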
Our spatiotemporal HiSTDO descriptor can also be applied to action detection in videos. In a similar way to the spatial scheme, we first divide the given window into overlapped sub-blocks and define the feature vector F_k at the kth frame by concatenating the HiSTDO descriptors obtained from all the sub-blocks as follows:

$$\mathbf{F}_k = (\mathbf{f}_1^k, \mathbf{f}_2^k, \ldots, \mathbf{f}_N^k), \quad (10)$$

where N denotes the total number of sub-blocks and f_l^k denotes the HiSTDO descriptor of the lth sub-block. To represent the given volume, we concatenate all F_k vectors obtained from the previous L frames (e.g., 30 frames), where L is set to the length of the query volume, i.e., F = (F_k, F_{k−1}, ..., F_{k−L+1}), for action detection. Thus, the dimension of the feature vector becomes P × Q × N × L for each local volume as well as for the query volume. For the sake of completeness, the overall procedure of the proposed method is summarized in Algorithm 1.
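A hypothetical assembly of the volume feature F, reusing the histdo_spacetime sketch above, could look as follows; sub_volumes(f) is an assumed helper returning the N overlapped sub-volumes of the sliding window at frame f, not a function defined in the paper:

```python
import numpy as np

def volume_feature(sub_volumes, k, L=30):
    """Concatenate F_k, F_{k-1}, ..., F_{k-L+1} into one vector (Eq. (10))."""
    feats = []
    for f in range(k, k - L, -1):
        # F_f: HiSTDO descriptors of all N sub-volumes at frame f
        F_f = np.concatenate([histdo_spacetime(v) for v in sub_volumes(f)])
        feats.append(F_f)
    return np.concatenate(feats)  # length P * Q * N * L
```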
Fig. 5 HOG (red) and HiSTDO (blue) for two selected regions. Even though both equally represent the structured region (①), HOG yields ambiguities in the highly textured region (②), which contains randomly distributed gradient directions with large magnitudes (best viewed in color) (color figure online)
2.3 Properties of HiSTDO descriptors

The most important advantage of the proposed method is its ability to encode the underlying structural information into a histogram-based feature format. Unlike previous approaches, our HiSTDO descriptor accumulates the coherence, which is defined based on the local characteristics of space–time texture patterns, rather than pixel-wise statistical information such as the magnitude of space–time gradients. For example, gradient vectors of two selected regions are shown in Fig. 5. Yellow arrows in each region indicate the direction of gradients computed at each pixel position, and black arrows represent their coherence; the thickness of a black arrow denotes the degree of coherence, while its direction describes the dominant orientation. From Fig. 5, we can see that the HiSTDO descriptor successfully suppresses the highly textured region (②), which, although not visually important, causes high ambiguity for feature matching due to its randomly distributed orientations with large magnitudes. Therefore, the proposed subspace analysis of space–time gradient fields paves the way for describing the underlying structure of the given image and video. In addition, the proposed HiSTDO descriptor implicitly contains both spatial texture and motion information through the subspace analysis of the 3-D structure tensor. In contrast to explicit schemes that estimate motion vectors directly from consecutive image sequences, which is time-consuming, the proposed framework offers the flexibility of letting simple histogram-based features capture even complex motions as well as dominant texture patterns. Even though motion descriptors such as HOF [33] and MBH [10] have been widely employed for action recognition [20,21], they consider only the temporal information (i.e., motion patterns) and thus cannot be adopted for static image-based applications. In contrast, the proposed descriptor provides a unified framework that works in both the spatial and space–time domains. To confirm the robustness of the proposed method, we show some examples under various challenging conditions in Figs. 6 and 7.
Fig. 6 Invariance and robustness of our HiSTDO descriptor under various challenging conditions. Note that WGN means white Gaussian noise. Coherence maps are represented consistently across the various environments (brighter colors denote higher coherence). Dominant orientations are quantized into 8 levels (e.g., bin 5 contains orientations from 90° to 112.5°). Note that object deformation (e.g., fish-eye warping) makes a meaningful difference in the proposed HiSTDO descriptor (see the pink-colored bar) (color figure online)
Fig. 7 Some examples of our HiSTDO descriptor in the space–time domain under various challenging conditions. a Input. b Contrast change. c Gaussian noise (0,0.05). The HiSTDO is quantized into eight and five levels in θ and φ directions, respectively. Note that the underlying structure of the local region is reliably preserved by using our HiSTDO descriptors
As can be seen, the proposed HiSTDO descriptor reliably preserves its discriminative power even in the presence of brightness/contrast changes and noise. Note that the corresponding coherence maps are represented consistently in our approach (see the middle part of Fig. 6). In contrast, object deformation changes the underlying structure (see the last column of Fig. 6) and thus yields a significant difference in the HiSTDO descriptor compared to the original image (e.g., the accumulated energy disperses in the case of sub-block ①, as shown in the bottom part of Fig. 6). In Fig. 7, the shape of motion patterns is successfully preserved even under significant noise.
Therefore, we can confirm that our HiSTDO descriptor is well suited to capturing the underlying structure of a given image or video. Owing to the strength of the HiSTDO descriptor in encoding the space–time shape context, we apply it to detection problems in diverse images and videos (i.e., object and action detection), as explained in the following section.
3 Experimental results

In this section, we demonstrate the efficiency and robustness of our HiSTDO descriptor for object and action detection on various datasets. To this end, we employ a sliding window-based template matching scheme without training to localize the target object or action. Based on the experimental results, we confirm that the proposed descriptor is effective for detection tasks in both images and videos.
Fig. 8 Overall procedure of the action detection framework. a HiSTDO features of the query volume (dimension: P × Q × N × L). b Similarity computation based on the sliding window scheme (top) and the corresponding ρ_i² values (bottom). c Detection result in the test video
3.1 Similarity computation

For template matching, we introduce a simple sliding window-based scheme, which computes the similarity between the query and the local window extracted from the test image or video. Specifically, we first divide the given window into overlapped sub-blocks and define the feature vector F as explained in Sects. 2.2.1 and 2.2.2. To compute the similarity between two feature vectors, we adopt the correlation metric, given as [12]:

$$\rho_i = \rho(\mathbf{F}_Q, \mathbf{F}_{T,i}) = \frac{\langle \mathbf{F}_Q, \mathbf{F}_{T,i} \rangle}{\sqrt{\langle \mathbf{F}_Q, \mathbf{F}_Q \rangle \langle \mathbf{F}_{T,i}, \mathbf{F}_{T,i} \rangle}}, \quad (11)$$

where F_Q and F_{T,i} denote the feature vectors obtained from the query and the ith local region of the test video defined by the sliding window, respectively, and ⟨·, ·⟩ denotes the inner product of two given vectors. We use ρ_i² ∈ [0, 1], which has been shown to provide better contrast in the detection results. For localization of the target object or action, we apply the non-maxima suppression technique [15] to the ρ_i² values and cut off outputs below a pre-defined threshold. The multiscale analysis is performed similarly, following the guideline introduced in [16]. As an example, Fig. 8 shows the action detection framework based on our HiSTDO descriptors. Note that object detection can be achieved with the same scheme using only one frame (i.e., a static image). If the similarity ρ_i² is larger than the pre-defined threshold, we determine that the corresponding window includes the target object or action.
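As a sketch under the same notation, the correlation score of (11) and the thresholding step can be written as follows; the threshold value is a placeholder of our own choosing, and non-maxima suppression [15] is omitted for brevity:

```python
import numpy as np

def rho_squared(F_q, F_t, eps=1e-12):
    """Squared correlation of Eq. (11), bounded in [0, 1]."""
    rho = (F_q @ F_t) / np.sqrt((F_q @ F_q) * (F_t @ F_t) + eps)
    return rho * rho

def detect(F_query, window_features, thresh=0.5):
    """Return indices of sliding windows whose score exceeds the threshold."""
    scores = np.array([rho_squared(F_query, F) for F in window_features])
    return np.flatnonzero(scores > thresh), scores
```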
Fig. 9 ROC curves with a various quantization levels of dominant orientations and b various sizes of the sub-block
3.2 Object detection in images

In this subsection, we show the performance of the proposed HiSTDO descriptor for object detection in various images. First, we tested our approach on 30 images containing 135 frontal faces of various sizes [12], which were randomly sampled from the MIT-CMU frontal face dataset [14]. Specifically, we analyzed the effects of various choices for the quantization levels of dominant orientations and sub-block sizes, as shown in Fig. 9. From the ROC curves based on recall and 1 − precision values, our HiSTDO descriptor shows the best performance with eight bins (i.e., K = 8) and (W/6) × (H/6) pixels for the size of the sub-block, where W and H denote the width and height of the image patch (i.e., sliding window), respectively. Based on these parameters, we also tested the proposed descriptor for car detection by using the UIUC car dataset [16], which is provided at two types of scales (i.e., single and multiple scales). Some detection results on both datasets are shown in Fig. 10. As can be seen, the HiSTDO-based detector provides reliable results even under illumination conditions different from those of the query images and in highly textured backgrounds. Moreover, the proposed method successfully captures the target object under partial occlusion [see Fig. 10(b)].
Fig. 10 Examples of face and car detection results on a the MIT-CMU frontal face [14] and b the UIUC car [16] datasets, respectively. Only one query patch is used for matching with all target images. Note that faces and cars are successfully detected by utilizing our HiSTDO descriptor even under various illumination conditions as well as partial occlusions
To quantitatively evaluate the detection performance, we employ the F-measure, which gives equal importance to recall and precision, as follows [16]:

$$F = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}. \quad (12)$$
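For reference, the F-measure of (12) is the harmonic mean of recall and precision; a direct transcription (with illustrative inputs only, not values reported in this paper) is:

```python
def f_measure(recall, precision, eps=1e-12):
    """Harmonic mean of recall and precision, Eq. (12)."""
    return 2.0 * recall * precision / (recall + precision + eps)

# illustrative values only: f_measure(0.90, 0.91) ~= 0.905
```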
We first compared the proposed HiSTDO descriptor with the conventional HOG [2] and HOGM (i.e., the 2-D version of HOG3D [9], which accumulates the mean of gradient magnitudes) descriptors on the MIT-CMU face dataset, as shown in Fig. 11(a) and Table 1, respectively. Specifically, we achieved a 90.5 % detection rate by F-measure, which is better than the previous HOG-based descriptors while being comparable with approaches recently proposed in this field (e.g., cascade-based [17,18] and normalized difference-based [19] methods). For the UIUC car dataset, we compared the HiSTDO-based detector with other approaches introduced in the literature [16,22,23]; the corresponding results are shown in Table 2. Even though our HiSTDO-based detector does not adopt any training scheme, it performs comparably against state-of-the-art training-based methods [16,22,23] owing to the strength of the proposed HiSTDO descriptor in representing the underlying image structure. Furthermore, we demonstrate detection results for images collected from Web sites, including various objects such as hands and human poses, as shown in Fig. 12. It is easy to see that various target objects are correctly localized even in the presence of background clutter and diverse illumination conditions.

3.3 Action detection in videos

In this subsection, we demonstrate the performance of the proposed HiSTDO descriptor for action detection in various videos.
Fig. 11 Performance comparison between the proposed HiSTDO and other descriptors. a ROC curves on the MIT-CMU frontal face dataset. b Some detection results in noisy images [red: HiSTDO, blue: HOG, green: HOGM (i.e., 2-D version of HOG3D)] (color figure online)

Table 1 Detection rate on the MIT-CMU frontal face dataset

Methods     HOG [2]          HOGM [9]         HiSTDO
F-measure   84.6 %           74.7 %           90.5 %

Methods     Soft cas. [17]   Surf cas. [18]   LPD [19]
F-measure   92.9 %           92.1 %           86.4 %
Table 2 Performance comparison by detection rate on the UIUC single-scale (S) and multiscale (M) car test set

Detection rate   [16]     [22]     [23]     Ours
Single (S)       77.1 %   94.0 %   97.5 %   92.5 %
Multi (M)        44.0 %   93.5 %   95.0 %   83.6 %
To this end, we employ two representative datasets for action detection, constructed by CMU [25] and MSR [26]. Specifically, the CMU dataset includes five actions in total, i.e., pickup, one-hand wave, two-hand wave, jumping jack, and push button, with complex backgrounds (e.g., moving crowds and camera motion).
Fig. 12 Some examples of generic object detection results. Note that various objects (e.g., hands and human poses) are successfully detected even in the presence of clutter and various illumination conditions
Table 3 Performance variation according to parameters P and Q

Settings   P = 4, Q = 5    P = 6, Q = 5   P = 8, Q = 5
F          0.512           0.541          0.557

Settings   P = 10, Q = 5   P = 8, Q = 3   P = 8, Q = 7
F          0.554           0.527          0.559
The MSR dataset contains three actions, i.e., boxing, clapping, and two-hand waving, captured under varying illumination with background clutter. Based on extensive experiments, as shown in Table 3, we confirm that our HiSTDO descriptor shows the best performance with eight and five bins for θ and φ (i.e., P = 8 and Q = 5) and (W/4) × (H/4) × L pixels for the size of the sub-volume (where W, H, and L denote the width and height of the sliding window and the duration of the query video, respectively), considering that the dimension of the feature vector affects the processing time. For a dense representation, sub-volumes in the sliding window are overlapped by half of the sub-volume size in the horizontal and vertical directions, respectively. Note that we downsample all videos to 160 × 120 pixels, as introduced in [25]. We first tested our space–time descriptor on the CMU action dataset. Some examples of action detection by the proposed method are shown in Fig. 13. As can be seen, our HiSTDO descriptor successfully describes the space–time shape information even in complex backgrounds. To confirm the efficiency of the HiSTDO descriptor for action detection, we quantitatively compared it with other descriptors proposed in the literature, namely flow-based [24], shape and flow-based [25], and HOG3D-based [9] methods. The corresponding results based on recall and precision are shown in Fig. 14 and Table 4. We can see that the proposed method reliably finds various target actions compared to previous approaches.
Fig. 13 Some detection results on the CMU action dataset. Query actions are shown in the left part, while the corresponding detection results in target videos are illustrated in the right part. Note that our HiSTDO descriptor-based scheme successfully localizes query actions in various videos
Fig. 14 ROC curves on the CMU action dataset. a Pickup. b One-hand wave. c Two-hand wave. d Jumping jack. e Push button. Note that flow-based [24], shape and flow-based [25], HOG3D-based [9], and our HiSTDO-based results are illustrated as black, red, green, and blue lines, respectively (best viewed in color) (color figure online)
Table 4 Performance comparison by the F-measure

                    Pickup   One-hand   Two-hand   Jump    Push
Flow [24]           0.195    0.094      0.395      0.178   0.250
Shape + flow [25]   0.537    0.400      0.583      0.358   0.467
HOG3D [9]           0.324    0.428      0.574      0.555   0.515
Ours                0.423    0.507      0.629      0.660   0.565
Fig. 15 Some detection results by the proposed method on the MSR dataset (top: boxing, middle: clapping, bottom: two-hand waving). Query actions are obtained from the KTH action dataset [31]. Note that our HiSTDO descriptor-based scheme successfully localizes query actions in various videos even with complex and moving backgrounds
Table 5 Confusion matrix on the MSR action dataset

[27]       Boxing   Clapping   Waving   Ground truth
Boxing     40       0          41       81
Clapping   2        21         28       51
Waving     11       0          60       71

[26]       Boxing   Clapping   Waving   Ground truth
Boxing     61       8          12       81
Clapping   2        38         11       51
Waving     10       0          61       71

Ours       Boxing   Clapping   Waving   Ground truth
Boxing     74       2          5        81
Clapping   9        36         6        51
Waving     6        1          64       71

Table 6 Recognition rate (average precision) on the MSR action dataset

Methods                              Average precision (%)
Yuan et al. 2009 [27]                59.6
Siva and Xiang 2011 [28]             71.2
Tian et al. 2012 [26]                78.8
Roshtkhari and Levine 2013 [29]      79.8
Adeli-Mosabbeb and Fathy 2015 [30]   81.5
Ours                                 85.7
In addition, we evaluated the performance of our space–time descriptor on the MSR action dataset; several detection results are shown in Fig. 15. It is worth noting that the HiSTDO-based detector successfully captures the target action in various indoor and outdoor environments. For the quantitative analysis on this dataset, we compared our HiSTDO descriptor with the sub-volume-based [27] and hierarchical filtered motion-based [26] methods; the corresponding results are shown in Tables 5 and 6, respectively.
Note that the numbers in Table 5 indicate total counts of detected actions. To show the efficiency of the proposed descriptor, we compared the average precision of our approach with five representative methods [26–30]; the results are shown in Table 6. Among 203 actions from 54 videos, we achieve an 85.7 % (= 174/203) recognition rate, which outperforms previous approaches [26,27], as shown in Tables 5 and 6. Note that actions from the same categories of the KTH dataset [31] are employed as queries, as guided in [26]. From the experimental results, we can see that various target actions are correctly localized based on the proposed descriptor even in the presence of background clutter and varying illumination. Moreover, we applied the proposed HiSTDO descriptor to action recognition on the UCF11 dataset [32]. To compare the performance of the descriptors only, we utilized the same structure as the baseline method introduced in [33], i.e., training and testing with the codebook representation built on interest points.
Table 7 Average precision on the UCF11 dataset

Methods                               Average precision (%)
SIFT + Harris3D + Pruning [33]        71.2
HOG + KLT [34]                        71.0
Color SIFT + Optical flow [35]        73.2
HiSTDO (proposed) + Harris3D          75.6
Multiple features + MIL [36]          75.2
HOG + HOF + MBH + trajectories [34]   84.2
Convolutional ISA [37]                75.8
As shown in Table 7, the proposed HiSTDO descriptor yields a better recognition rate than the HOG (or SIFT)-based descriptors [33–35] while showing a recognition rate comparable to other approaches [36,37] proposed in this field. Therefore, we conclude that our HiSTDO descriptor has a great ability to represent the underlying structure of the local region in the space–time domain and can be successfully incorporated into detection frameworks for image and video applications. To evaluate the processing time, we tested our HiSTDO descriptor on a high-end PC (Intel i7, 3.4 GHz, with 4 GB RAM). Since the proposed method requires the SVD operation at each pixel position, it takes more processing time than HOG and its variants. Specifically, our HiSTDO descriptor takes about 0.16 s for an image of 320 × 240 pixels, while most previous descriptors can be constructed within 0.035 s. To reduce the computational cost, more advanced approaches for computing the SVD [38,39] can be incorporated into the proposed scheme. Note that the storage size for HiSTDO does not increase compared to previous descriptors such as HOG and HOG3D, since the number of bins for the histogram representation is unchanged.
4 Conclusion

A simple and novel local descriptor, the so-called histograms of space–time dominant orientations (HiSTDO), has been proposed in this paper. The key idea of the proposed method is that the shape of motions as well as texture patterns can be efficiently approximated by exploiting the dominant orientation and its coherence obtained from the local region in the space–time domain. By incorporating these two components into a histogram, we efficiently describe the underlying structure of the local region in a given image or video. To justify the efficiency and robustness of the proposed HiSTDO descriptor, we applied it to object and action detection frameworks. From the experimental results on various datasets, we confirm that our HiSTDO descriptor faithfully represents the shape of target objects in images and videos even in the presence of background clutter.
Our future work is to incorporate the proposed descriptor into various learning schemes to resolve a wide range of recognition problems.
References

1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 886–893 (2005)
3. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
4. Zeng, C., Ma, H., Ming, A.: Fast human detection using mi-SVM and a cascade of HOG-LBP features. In: Proceedings IEEE international conference on image processing (ICIP), pp. 3845–3848 (2010)
5. Zhang, J., Huang, K., Yu, Y., Tan, T.: Boosted local structured HOG-LBP for object localization. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 1393–1400 (2011)
6. Jia, W., Hu, R.-X., Lei, Y.-K., Zhao, Y., Gui, J.: Histogram of oriented lines for palmprint recognition. IEEE Trans. Syst. Man Cybern.: Syst. 44(3), 385–395 (2014)
7. Pang, Y., Zhang, K., Yuan, Y., Wang, K.: Distributed object detection with linear SVMs. IEEE Trans. Cybern. 44(11), 2122–2133 (2014)
8. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings ACM international conference on multimedia, pp. 357–360 (2007)
9. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: Proceedings British machine vision conference (BMVC), pp. 995–1004 (2008)
10. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Proceedings European conference on computer vision (ECCV), pp. 428–441 (2006)
11. Baker, S., Roth, S., Scharstein, D., Black, M.J., Lewis, J., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proceedings IEEE international conference on computer vision (ICCV), pp. 1–8 (2007)
12. Kim, W., Yoo, B., Han, J.-J.: HDO: a novel local image descriptor. In: Proceedings IEEE international conference on image processing (ICIP), pp. 5671–5675 (2014)
13. Kim, W., Kim, C.: Spatiotemporal saliency detection using textural contrast and its applications. IEEE Trans. Circuits Syst. Video Technol. 24(4), 646–659 (2014)
14. Rowley, H., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 22–38 (1998)
15. Devernay, F.: A non-maxima suppression method for edge detection with sub-pixel accuracy. Technical report, INRIA, no. RR-2724 (1995)
16. Agarwal, S., Awan, A., Roth, D.: Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1475–1490 (2004)
17. Bourdev, L., Brandt, J.: Robust object detection via soft cascade. In: Proceedings IEEE computer vision and pattern recognition (CVPR), vol. 2, pp. 236–243 (2005)
18. Li, J., Wang, T., Zhang, Y.: Face detection using SURF cascade. In: Proceedings IEEE computer vision and pattern recognition workshops (CVPRW), pp. 2183–2190 (2011)
19. Liao, S., Jain, A.K., Li, S.: A fast and accurate unconstrained face detector. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 211–223 (2016)
20. Wang, H., Klaser, A., Schmid, C., Liu, C.-L.: Action recognition by dense trajectories. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 3169–3176 (2011)
21. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings IEEE international conference on computer vision (ICCV), pp. 3551–3558 (2013)
22. Kapoor, A., Winn, J.: Located hidden random fields: learning discriminative parts for object detection. In: Proceedings European conference on computer vision (ECCV), pp. 302–315 (2006)
23. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. Int. J. Comput. Vision 77(1), 259–289 (2008)
24. Shechtman, E., Irani, M.: Space-time behavior-based correlation, or how to tell if two underlying motion fields are similar without computing them? IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 2045–2056 (2007)
25. Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proceedings IEEE international conference on computer vision (ICCV), pp. 1–8 (2007)
26. Tian, Y., Cao, L., Liu, Z., Zhang, Z.: Hierarchical filtered motion for action recognition in crowded videos. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 42(3), 313–323 (2012)
27. Yuan, J., Liu, Z., Wu, Y.: Discriminative subvolume search for efficient action detection. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 2442–2449 (2009)
28. Siva, P., Xiang, T.: Weakly supervised action detection. In: Proceedings British machine vision conference (BMVC), pp. 65.1–65.11 (2011)
29. Roshtkhari, M.J., Levine, M.D.: Human activity recognition in videos using a single example. Image Vis. Comput. 31(11), 864–876 (2013)
30. Adeli-Mosabbeb, E., Fathy, M.: Non-negative matrix completion for action detection. Image Vis. Comput. 39(7), 38–51 (2015)
31. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings IEEE conference on pattern recognition (ICPR), pp. 32–36 (2004)
32. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos "in the wild". In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 1996–2003 (2009)
33. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 2929–2936 (2009)
34. Wang, H., Klaser, A., Schmid, C., Liu, C.-L.: Action recognition by dense trajectories. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 3169–3176 (2011)
35. Reddy, K., Shah, M.: Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24, 971–981 (2013)
36. Ikizler-Cinbis, N., Sclaroff, S.: Object, scene and actions: combining multiple features for human action recognition. In: Proceedings European conference on computer vision (ECCV), pp. 494–507 (2010)
37. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: Proceedings IEEE computer vision and pattern recognition (CVPR), pp. 3361–3368 (2011)
38. Chetverikov, D., Axt, A.: Approximation-free running SVD and its application to motion detection. Pattern Recogn. Lett. 31(9), 891–897 (2010)
39. Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing dominant singular value decomposition. SIAM J. Sci. Comput. 35(3), 1641–1668 (2013)
Wonjun Kim received the B.S. degree in electrical engineering from Sogang University, Seoul, Korea, the M.S. degree from the Department of Information and Communications, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, and the Ph.D. degree from the Computational Imaging Laboratory, Department of Electrical Engineering, KAIST, in 2006, 2008, and 2012, respectively. From September 2012 to February 2016, he was a Research Staff Member at the Samsung Advanced Institute of Technology (SAIT), Gyeonggi-do, Korea. Since March 2016, he has been with the Department of Electronics Engineering, Konkuk University, Seoul, Korea, where he is currently an Assistant Professor. His research interests include image and video understanding, computer vision, pattern recognition, and biometrics, with an emphasis on saliency detection and face and action recognition. He has served as a regular reviewer for over 25 international journals, including the IEEE Transactions on Image Processing, the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Multimedia, Machine Vision and Applications, and Pattern Recognition.

Jae-Joon Han received the B.S. degree in electronic engineering from Yonsei University, Korea, in 1997, the M.S. degree in electrical and computer engineering from the University of Southern California, Los Angeles, in 2001, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2006. He was a Teaching Assistant and then a Research Assistant with the School of Electrical and Computer Engineering, Purdue University, from 2001 to 2006, and remained at Purdue University as a Post-Doctoral Fellow in 2007. Since 2007, he has been with the Samsung Advanced Institute of Technology (SAIT), Gyeonggi-do, Korea, as a Principal Researcher. His research interests include statistical machine learning and data mining, computer vision, and real-time recognition technologies. He also participated in the development of standards such as ISO/IEC 23005 (MPEG-V) and ISO/IEC 23007 (MPEG-U), and served as the Editor of ISO/IEC 23005-1/4/6.