Multimed Tools Appl DOI 10.1007/s11042-016-4115-6
Anomaly detection using sparse reconstruction in crowded scenes

Ang Li 1,2, Zhenjiang Miao 1,2, Yigang Cen 1,2, Yi Cen 3

Received: 4 May 2016 / Revised: 23 September 2016 / Accepted: 1 November 2016
© Springer Science+Business Media New York 2016
Abstract In this paper, we propose an algorithm for anomaly detection in crowded scenes using sparse representation over normal bases. First, the histogram of maximal optical flow projection (HMOFP) feature is extracted from a set of normal training data. Then, the online dictionary learning algorithm is used to train an optimal dictionary with proper redundancy, which is better than a dictionary simply composed of the HMOFP features of the whole training data. To detect the normalness of a frame, the l1-norm of the sparse reconstruction coefficients is used as the Reconstruction Coefficient Sparsity (RCS). Our algorithm is effective for both global abnormal events (GAE) and local abnormal events (LAE). We evaluate our method on three benchmark datasets: the UMN dataset, the PETS2009 dataset and the UCSD Ped1 dataset. Experimental results show that, compared with the most popular methods, our algorithm achieves good results, especially for pixel-level local abnormal event localization.

Keywords HMOFP · Online dictionary learning · Sparse representation · Abnormal events · Crowded scenes
* Ang Li ([email protected])
Zhenjiang Miao ([email protected])
Yigang Cen ([email protected])
Yi Cen ([email protected])

1 Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
3 School of Information Engineering, Minzu University of China, Beijing 100081, China
1 Introduction

Nowadays, with the advancement of people's public safety awareness and the falling cost of surveillance equipment, more and more surveillance cameras are used in public places such as markets, stadiums, museums, airports and train stations. Research on crowd behaviors in public scenes has drawn more and more attention and has become a hot topic in the field of computer vision. Depending on the scale, abnormal events can be classified into global abnormal events (GAE), where the whole scene is abnormal, and local abnormal events (LAE), where a local behavior differs from that of the other parts of the scene. The density of people in crowded scenes is usually high. As a result, traditional algorithms such as individual object segmentation and tracking, which consider an individual in isolation, often face difficult situations such as the overlapping of pedestrians due to interactions among them. According to the established models, the algorithms of crowd video analysis can be classified into three main categories [22, 31]: (1) macroscopic modeling, where low-level features such as optical flow and spatio-temporal gradients are used; (2) microscopic modeling, where specific methods are utilized, such as the particle filter (PF) framework, occlusion handling, etc.; (3) crowd event detection. During the last decade, studies on abnormal behavior detection have actively evolved by taking advantage of recent developments in related fields, such as Computer Vision (CV), Pattern Recognition (PR), Soft Computing (SC), Mathematical Modeling (MM), Biomedical Information (BI), Image Signal Processing (ISP), Data Mining (DM), Computational Intelligence (CI) and Artificial Intelligence (AI) [21]. Especially in the CV field, plenty of automated crowd analysis techniques have been proposed, such as people counting/density estimation models, tracking in crowded scenes, crowd behavior understanding models, etc. [8].
Usually, the crowd in a scene is treated as a single entity for analysis. The status of the crowd is updated as normal or abnormal based on the dynamics of the whole crowd. However, if the motion pattern of a crowd is unstructured, i.e., the motion of the crowd appears to be random [18], current algorithms are not very effective. Moreover, the understanding and modeling of crowd behaviors remain immature despite the considerable advances achieved in human activity analysis. In this paper, to detect abnormal events in surveillance video, we propose a framework that extracts the motion features of the crowd and utilizes sparse reconstruction theory to fulfill the task, which is effective for both GAE and LAE. The rest of the paper is organized as follows. In Section 2, the problem formulation is described. In Section 3, related works are briefly reviewed. In Section 4, the extraction of motion features for GAE or LAE is presented. Section 5 introduces the training sample optimization and the dictionary construction method. The abnormal event detection algorithm is presented in detail in Section 6. In Section 7, experimental results of both GAE detection and LAE detection are provided. Conclusions are presented in Section 8.
2 Problem formulation

By considering different application scenarios for abnormal event detection, we define the problem of video anomaly detection as follows. Suppose we are provided a training set $H = [H_1, H_2, \ldots, H_{N_0}]$, where $N_0$ is the number of training samples and $H_i \in \mathbb{R}^M$ is the feature vector describing a normal training sample ($M$ is the feature dimension), which can be an image patch, a color histogram, a mixture of dynamic textures, our proposed motion context, etc. Given a test sample $H_{tet}$, our task is to design a function that determines whether $H_{tet}$ is normal or abnormal, that is,

$$f : H_{tet} \rightarrow \{\text{normal}, \text{abnormal}\} \tag{1}$$
According to (1), for a given scene, once the normal events are defined, abnormal event detection can be treated as a binary classification problem. An important issue in abnormal event detection is the video event representation. A high-dimensional feature is usually preferred to better represent the event. However, to fit a good probability model, the required amount of training data grows rapidly with the feature dimension d, so it is unrealistic to collect enough training data for density estimation in practice. Sparse representation is suitable for representing high-dimensional samples with less training data. This motivates us to detect abnormal events via a sparse reconstruction from normal ones. Given an input test sample $H_{tet} \in \mathbb{R}^M$, we reconstruct it by a sparse linear combination of an over-complete normal basis set $D_T \in \mathbb{R}^{M \times K}$, where $M < K$:

$$\min \|z_t\|_1 \quad \text{s.t.} \quad H_{tet} = D_T z_t \tag{2}$$

where $z_t$ is the vector of reconstruction coefficients. A normal event is likely to generate sparse reconstruction coefficients $z_t$, while an abnormal event is dissimilar to any of the normal bases and thus generates a dense representation. The Reconstruction Coefficient Sparsity (RCS) is then used in our proposed method to quantify the normalness:

$$S_w = \|z_t\|_1 \tag{3}$$

It can be seen that a normal frame has a small RCS value, while an abnormal frame usually generates a large RCS. Therefore, the RCS can be adopted as an anomaly measurement for this binary classification problem. To handle LAE and GAE, a spatial-temporal basis or a spatial basis is used, respectively, which provides a general way of representing different types of abnormal events. Moreover, the online dictionary learning algorithm is used to train an optimal dictionary, so that the dictionary is more representative and compact.
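To make the RCS computation concrete, here is a minimal numpy sketch (an illustration only: the greedy OMP below stands in for the sparse solver, `n_nonzero` is an assumed sparsity limit, and `omp`/`rcs` are hypothetical helper names, not code from the paper):

```python
import numpy as np

def omp(D, x, n_nonzero=10, tol=1e-8):
    """Greedy orthogonal matching pursuit: approximate x with a sparse code z
    over the columns of D (a stand-in for a solver of Eq. (2))."""
    residual = x.astype(float).copy()
    support, coef = [], np.zeros(0)
    z = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < tol:
            break
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z[support] = coef
    return z

def rcs(D, x, n_nonzero=10):
    """Reconstruction Coefficient Sparsity (Eq. (3)): S_w = ||z_t||_1."""
    return float(np.abs(omp(D, x, n_nonzero)).sum())
```

A test feature that matches a few normal atoms yields a small `rcs` value; a feature unlike any atom needs many dense coefficients and yields a larger one.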
3 Related work

In order to eliminate the world representation layer, which can be a significant source of errors for modeling algorithms, an approach that models directly at the pixel level was described in [9]. Social force model based abnormal crowd behavior detection was introduced in [14, 30]. In this model, moving particles were treated as individuals based on optical flow, and the interaction force between every two particles was considered as a social force. In the classification process, bag-of-words and z-score approaches based on the histogram of oriented social force (HOSF) were presented. In [32, 33], a model named the social attribute-aware force model was proposed. In this model, in order to improve the effectiveness of simulating the interaction behaviors of the crowd, social characteristics of crowd behaviors were taken into account, and the scene scale was efficiently estimated in an unsupervised manner. Then, the concepts of social disorder and congestion
were introduced. Finally, according to the social force obtained by an online fusion strategy, the model was constructed. In [7], SIFT features were extracted for the Bag of Words (BoW) model with the Spatial Pyramid Matching (SPM) kernel; then an SVM classifier was used for cross-scene abnormal event detection. In [19], based on the fact that abnormal events occur rarely while frequently occurring events are normal in general human perception, an unsupervised proximity-clustering algorithm for abnormal event detection in video sequences was proposed. In [24], where labeled information about normal events was limited and information about abnormal events was unavailable, a projection subspace associated with detectors was discovered using both labeled and unlabeled segments. In wireless sensor networks, it has been observed that instead of being transient, most abnormal events persist over a considerable period of time; a technique for handling data in a segment-based manner was therefore introduced in [27]. Without using any tracking or motion features, a feature extraction and event detection method was presented in [6], where features were extracted from foreground blobs and then used in SVM-based models for real-time event detection. In [20], a novel abnormal event detection framework based on the newly developed spatial-temporal co-occurrence Gaussian mixture models (STCOG) was proposed, which requires a short training period and has a fast processing speed. The above approaches can successfully detect abnormal events, but they are limited in some aspects: some models are complicated to establish, and others cost too much time in the detection process. The concept of entropy is also used for abnormal event detection in some models. In [5], an algorithm based on particle entropy and the Gaussian Mixture Model (GMM) that effectively represents the crowd distribution in crowded scenes was proposed.
Based on the idea that human beings tend to exhibit random motion patterns during abnormal situations, a general-purpose human motion analysis (HMA) method was described in [10]. Using basic statistical quantities, angular and linear displacements of limb movements were characterized in this model. In addition, the entropy of the Fourier spectrum was used to measure the randomness of the abnormal behavior. In [16], two entropy methods, namely the network entropy and the normalized relative network entropy (NRNE), were compared and utilized to classify different network behaviors. Optical flow approximates the motion vector at each pixel, which reflects the relative displacement of a moving object between two moments; it is important and useful in video surveillance and abnormal event detection. A method using the histogram of optical flow was described in [26], and a similar pixel-based motion feature for abnormal event detection was proposed in [11]. Unlike most existing approaches for abnormal event detection, sparse representation has attracted more and more attention in recent years. In [17], a method to detect abnormal events by sparse subspace clustering was proposed. In [3], a model for anomaly detection was described that utilized the Sparse Reconstruction Cost (SRC) over a normal dictionary to measure the normalness of the tested frames; this work is effective for both GAE and LAE. In [12], a model formed by mixtures of dynamic textures of normal crowd behavior was presented. For the construction of the redundant dictionary, several optimization methods have been presented, such as the K-SVD algorithm [1], the dictionary learning method based on the modified K-SVD algorithm [2], the Fisher discrimination dictionary learning (FDDL) model [29], the latent dictionary learning (LDL) method [28], and non-negative matrix factorization (NMF) based on the robust Earth Mover's Distance (EMD) [34].
4 Motion feature extraction

The optical flow field is the apparent motion on the surface of grayscale images, which reflects the movement information between two consecutive frames. Optical flow provides the direction and amplitude of moving objects in a scene and can describe the behavior of people well. In this paper, we adopt the Horn-Schunck (HS) method to compute the optical flow of frame images and propose a novel motion feature descriptor, called the Histogram of Maximal Optical Flow Projection (HMOFP). Figure 1 briefly shows the process for computing the HMOFP feature. First, the HS method is utilized to extract the optical flow field of the current frame based on two consecutive frames. Then, the optical flow field is divided into patches. Finally, the maximal optical flow projection for each direction (i.e., the HMOFP of one patch) is obtained. More details of the process are described as follows. As shown in Fig. 2, the optical flow field of the sth frame is divided into m image patches with overlap areas. Each patch contains B1 × B2 pixels. Then we deal with the optical flow in each patch as follows. The range 0°–360° is segmented into p bins, so the optical flow vector of each pixel in an image patch belongs to exactly one bin, and each bin may contain several optical flow vectors. We project all optical flow vectors in the same bin onto the angle bisector of this bin. Then the length of the maximal projection vector is selected as the feature descriptor. For example, in Fig. 3a, there are two vectors $\vec{on}_1$ and $\vec{on}_2$ falling into the first bin. It is easy to see that the projection of $\vec{on}_2$ onto the angle bisector of the first bin is longer than the projection of $\vec{on}_1$. Thus, the length of the projection vector $\vec{on}_2'$ is selected as the feature descriptor of the first bin.
In this way we can obtain the feature descriptor vector of each image patch, which is called the histogram of maximal optical flow projection (HMOFP):

$$h_i = \left[ h_i^1, h_i^2, \ldots, h_i^p \right]^T \in \mathbb{R}^{p \times 1} \tag{4}$$

where $h_i$ ($1 \le i \le m$) denotes the feature descriptor of the ith patch as shown in Fig. 3b, and $h_i^j$ ($1 \le j \le p$) denotes the maximal amplitude of all projection vectors in the jth bin. The feature descriptor $h_i$ is a vector in $\mathbb{R}^p$, and its p elements are scalars. To realize the detection of both global abnormal events (GAE) and local abnormal events (LAE), we use two types of bases with different structures, as shown in Fig. 3c. For GAE, we choose the spatial basis Type A to describe the feature of a whole frame. Each frame is divided into 4 × 4 patches. After the HMOFP of each patch is obtained, these feature vectors are concatenated to build the HMOFP of the frame. For LAE, after the location of the patch in the current frame is determined, the 8 neighbor patches adjoining it in the spatial domain and the
Fig. 1 The process for computing the HMOFP feature
Fig. 2 Patch-division of the optical flow field belonging to the sth frame
patch in the same location of the next frame are selected to construct the basis Type B. The HMOFP extraction for this basis is then similar to that in GAE detection. To describe a crowded scene well, two factors are needed: explicit directions and the moving distance along each direction. Segmenting the 2D space into p bins provides ample information to describe the directions of moving people. Since many optical flow vectors may fall into one bin, a unique direction should be selected to represent all vectors in that bin; here, the angle bisector is selected as the unique direction of each bin, so p directions are obtained for the p bins. Since there may be far more than one optical flow vector in each bin, in order to enhance the distinction between normal and abnormal scenes, we select the maximal vector projection rather than the sum of all vector projections on the bisector as the motion feature descriptor. Ignoring the background area, the amplitudes of the motion vectors belonging to the normal area are small in a normal frame, while the motion vectors corresponding to the abnormal area are large in an abnormal frame. Usually, the number of normal motion vectors is much larger than the number of abnormal motion vectors. If we used the sum of all projection vectors on the angle bisector as the feature descriptor of each bin, the accumulation of the many small motion vectors in a normal frame could mask the small number of large motion vectors in an abnormal frame, i.e., the sum of all projection vectors in each bin of a normal frame is likely to be close to that of an abnormal frame. Thus, in order to improve the distinguishability
Fig. 3 (a) The calculation of HMOFP in direction bins (b) Components of the feature descriptor of the ith patch (c) The spatial basis Type A and the spatial-temporal basis Type B
between abnormal and normal frames, we select the length of the maximal projection vector as the feature descriptor of each bin, as demonstrated in Fig. 3a.
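The per-bin maximal-projection rule described above can be sketched as follows (a simplified numpy illustration; patch division and the optical flow computation itself, e.g. by the HS method, are assumed to happen upstream, and `hmofp_patch` is a hypothetical helper name):

```python
import numpy as np

def hmofp_patch(u, v, p=18):
    """HMOFP of a single patch: for each of the p direction bins, keep the
    maximal projection of the patch's flow vectors onto that bin's bisector.
    u, v: arrays of horizontal/vertical optical-flow components of the patch."""
    u, v = np.ravel(u), np.ravel(v)
    ang = np.mod(np.arctan2(v, u), 2.0 * np.pi)        # direction of each vector
    mag = np.hypot(u, v)                               # amplitude of each vector
    width = 2.0 * np.pi / p
    bins = np.minimum((ang // width).astype(int), p - 1)
    centers = (np.arange(p) + 0.5) * width             # angle bisector of each bin
    h = np.zeros(p)
    for j in range(p):
        m = bins == j
        if m.any():
            # projection of every vector in bin j onto the bin's bisector
            h[j] = np.max(mag[m] * np.cos(ang[m] - centers[j]))
    return h
```

A Type A frame descriptor would then concatenate `hmofp_patch` over the 4 × 4 patch grid; Type B would additionally stack the spatial and temporal neighbor patches.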
5 Dictionary construction

In this section, we show how to construct the dictionary. Consider a given initial training set

$$TR = [tr_1, tr_2, \ldots, tr_N] \tag{5}$$

where $tr_i$ ($1 \le i \le N$) denotes a frame for GAE or a patch of the ith frame for LAE, called a training datum in this paper. The corresponding feature pool of TR is

$$H = [H_1, H_2, \ldots, H_{N_0}] \in \mathbb{R}^{M \times N_0} \tag{6}$$

where $H_i$ ($1 \le i \le N_0$) denotes the HMOFP of $tr_i$, called a training sample in this paper, and $M$ is the length of the HMOFP of the basis for GAE or LAE. Note that the optical flow of the ith frame is calculated from the ith and (i + 1)th frames; thus, on the right side of (6), the maximal subscript is $N_0 = N - 1$.

In the feature pool, some training samples may be useless for the representation of the normal events and should be deleted. Consider the optimization problem

$$\min \|s_j\|_0 \quad \text{s.t.} \quad H s_j = H_j,\ s_j(j) = 0 \quad (j = 1, 2, \ldots, N_0) \tag{7}$$

where

$$s_j = \left[ s_{j1}, s_{j2}, \ldots, s_{jN_0} \right]^T \in \mathbb{R}^{N_0} \tag{8}$$

Problem (7) is NP-hard. Following [4], we relax the $\ell_0$-norm optimization problem to

$$\min \|s_j\|_1 \quad \text{s.t.} \quad H s_j = H_j,\ s_j(j) = 0 \quad (j = 1, 2, \ldots, N_0) \tag{9}$$

In matrix form, it can be rewritten as

$$\min \|S\|_1 \quad \text{s.t.} \quad HS = H,\ \mathrm{diag}(S) = 0, \quad S \in \mathbb{R}^{N_0 \times N_0} \tag{10}$$

where

$$S = [s_1, s_2, \ldots, s_{N_0}] \tag{11}$$

We utilize the orthogonal matching pursuit (OMP) method [23] to solve (10). The optimal solution is denoted as

$$S^* = \left[ s_R^{1*}, s_R^{2*}, \ldots, s_R^{N_0 *} \right]^T \tag{12}$$

where $s_R^{j'*}$ ($j' = 1, 2, \ldots, N_0$) is the j'th row of $S^*$. We calculate the $\ell_2$-norm of each row $s_R^{j'*}$; if the result is 0, i.e.,

$$\left\| s_R^{j_0 *} \right\|_2 = 0, \quad 1 \le j_0 \le N_0 \tag{13}$$

then $H_{j_0}$ is deleted from the feature pool. The optimized feature pool is denoted as $H^* = [H_1^*, H_2^*, \ldots, H_{K_0}^*] \in \mathbb{R}^{M \times K_0}$ ($K_0 < N_0$). These optimization steps delete the training samples that are never used for the representation of the other samples, so that the feature pool becomes more compact for the normal events. Moreover, we hope the dictionary used in the final abnormal event detection can represent the normal samples well, i.e., the training samples in the feature pool can be sparsely reconstructed by the dictionary. Thus, the online dictionary learning algorithm [13] is utilized to generate an optimal dictionary with proper redundancy, so that the atoms in the dictionary are more representative of the normal features.
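The pruning of Eqs. (7)-(13) can be sketched as follows (an illustrative numpy implementation under simplifying assumptions: a small greedy OMP stands in for the solver of [23], `n_nonzero` is an assumed sparsity limit, and `prune_feature_pool` is a hypothetical helper name):

```python
import numpy as np

def omp(D, x, n_nonzero=5):
    """Tiny greedy OMP solver (a stand-in for the method of [23])."""
    residual = x.astype(float).copy()
    support, coef = [], np.zeros(0)
    z = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < 1e-10:
            break
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z[support] = coef
    return z

def prune_feature_pool(H, n_nonzero=5):
    """Eqs. (7)-(13): sparse-code each sample over the others and drop every
    sample whose row of S* has zero l2-norm (it is never used by the others).
    H: M x N0 feature pool, one HMOFP sample per column."""
    N0 = H.shape[1]
    S = np.zeros((N0, N0))
    for j in range(N0):
        idx = [k for k in range(N0) if k != j]    # enforce s_j(j) = 0
        S[idx, j] = omp(H[:, idx], H[:, j], n_nonzero)
    keep = np.linalg.norm(S, axis=1) > 0          # rows of S* with nonzero norm
    return H[:, keep]
```

The retained columns would then be passed to online dictionary learning [13] to produce the final dictionary $D_T$.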
6 Abnormal event detection

Suppose that in a given scene there is a set of training frames

$$F = [f_1, f_2, \ldots, f_N] \tag{15}$$

which describes the normal behaviors of people in crowded scenes. Our proposed procedure for abnormal event detection based on the histogram of maximal optical flow projection (HMOFP) feature is as follows.
Step 1 Depending on the situation (GAE or LAE), we obtain TR from F by (5). Then the optical flow $OP = [op_1, op_2, \ldots, op_{N_0}]$ is calculated at each pixel of $tr_i$ ($1 \le i \le N_0$) by the HS method, i.e.,

$$[tr_1, tr_2, \ldots, tr_N]_{a \times b \times N} \xrightarrow{\text{HS}} [op_1, op_2, \ldots, op_{N_0}]_{a \times b \times N_0} \tag{16}$$

where $a \times b$ is the size of $tr_i$.

Step 2 Extract the HMOFP feature of $tr_i$ ($1 \le i \le N_0$) in the training set. The obtained feature set is denoted as $H = [H_1, H_2, \ldots, H_{N_0}]$. The process is denoted as

$$[op_1, op_2, \ldots, op_{N_0}]_{a \times b \times N_0} \rightarrow [H_1, H_2, \ldots, H_{N_0}]_{M \times N_0} \tag{17}$$

Step 3 Based on the HMOFP feature, delete the useless columns in the feature pool by (7)-(13). Then the optimized dictionary $D_T$ is obtained by the online dictionary learning algorithm introduced in Section 5.

Step 4 Extract the HMOFP feature of the testing datum $te_t$, i.e., $H_{tet}$, and calculate its sparse reconstruction coefficient vector $z_t$ by (2), $\min \|z_t\|_1$ s.t. $H_{tet} = D_T z_t$, which can be solved by the OMP method. Then the RCS value $S_w$ of $te_t$ is computed by (3), $S_w = \|z_t\|_1$.

Step 5 For the detection of GAE, i.e., $te_t = f_t$, we go to Step 6 directly. For the detection of LAE, i.e., $te_t$ is a patch of the testing frame $f_t$, we repeat Steps 1-4 until the last patch of the frame $f_t$. The RCS values of all patches are added together; the sum is still denoted as $S_w$ for convenience.

Step 6 The frame $f_t$ is detected as normal if the following criterion is satisfied:

$$S_w < \eta \tag{18}$$

where $\eta$ is a user-defined threshold that controls the sensitivity of the algorithm. The whole procedure is illustrated in Fig. 4.
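Steps 4-6 can be condensed into a few lines of code (a hedged sketch; `detect` and `sparse_code` are hypothetical names, and any solver for (2), e.g. OMP, can be plugged in as `sparse_code`):

```python
import numpy as np

def detect(H_test, D_T, sparse_code, eta):
    """Steps 4-6: compute the RCS S_w = ||z_t||_1 of every test feature over
    the learned dictionary D_T, then flag the sample as abnormal when
    S_w >= eta (Eq. (18) labels it normal when S_w < eta).
    sparse_code(D, x) -> coefficient vector z."""
    S_w = np.array([np.abs(sparse_code(D_T, x)).sum() for x in H_test])
    return S_w, S_w >= eta
```

For LAE (Step 5), the per-patch RCS values of a frame would first be summed and the threshold applied to the frame-level sum.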
Fig. 4 The flowchart of the proposed abnormal event detection algorithm
7 Experimental results

To evaluate the effectiveness of the proposed method, we perform experiments on several public datasets. The Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are the most commonly used criteria for abnormal event detection. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which measures the accuracy over multiple threshold values. TPR and FPR are defined as follows:

$$\mathrm{TPR} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{19}$$

where True Positive (TP) denotes that an abnormal event is labeled correctly as abnormal, and False Negative (FN) denotes that an abnormal event is labeled incorrectly as normal; TPR reflects the percentage of positive samples that are correctly classified during the test. In addition,

$$\mathrm{FPR} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}} \tag{20}$$

where False Positive (FP) denotes that a normal event is labeled incorrectly as abnormal, and True Negative (TN) denotes that a normal event is labeled correctly as normal; FPR reflects the percentage of negative samples that are incorrectly classified during the test.
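As a concrete illustration, the ROC curve and AUC used throughout this section can be computed from per-frame anomaly scores with a threshold sweep (a minimal numpy sketch; `roc_auc` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def roc_auc(scores, labels):
    """Sweep a threshold over the anomaly scores (here, RCS values), apply
    Eqs. (19)-(20) at each threshold, and integrate TPR over FPR with the
    trapezoid rule to obtain the AUC. labels: 1 = abnormal, 0 = normal."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels).astype(bool)
    thr = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    tpr = np.array([((scores >= t) & pos).sum() / pos.sum() for t in thr])
    fpr = np.array([((scores >= t) & ~pos).sum() / (~pos).sum() for t in thr])
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

Perfectly separated scores give an AUC of 1.0; scores inverted with respect to the labels give 0.0.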
7.1 Global abnormal event detection based on the UMN dataset

In this section, we evaluate our method for global abnormal event detection on the UMN dataset [25]. There are three different crowded scenes in the UMN dataset, named lawn, indoor and plaza. The total number of frames is 7739, at a 320 × 240 resolution. The normal events are people walking randomly in the scene, and the abnormal events are people running away at the same time. In our experiments, the image patch size is set to 80 × 60 with no overlap between neighboring patches. The range 0°–360° is divided evenly into 18 bins, i.e., p = 18. Using the spatial basis Type A shown in Fig. 3c, an HMOFP feature of length 288 is obtained. The initial dictionary is a Discrete Cosine Transform (DCT) matrix of size 288 × 576, trained on the first 400 normal frames of each scene.
7.1.1 Detection in the lawn scene

The video sequence of the lawn scene contains 1453 frames in total. Since the calculation of the optical flow field of the current frame needs the next frame, only the optical flow fields of the first through the 1452nd frames can be obtained; thus, our results cover 1452 frames. Two typical frames of the different events in the lawn scene are shown in Fig. 5. The detection result and the ROC curve of the lawn scene are shown in Fig. 6. The AUC is 99.76%.
Fig. 5 Two different events in the lawn scene: (a) the normal event; (b) the abnormal event
7.1.2 Detection in the indoor scene

The video sequence of the indoor scene contains 4144 frames in total. Two typical frames of the different events in the indoor scene are shown in Fig. 7. The detection result and the ROC curve of the indoor scene are shown in Fig. 8. The AUC is 95.78%.
Fig. 6 (a) The classification result of the lawn scene (b) The ROC curve of the lawn scene
Fig. 7 Two different events in the indoor scene: (a) the normal event; (b) the abnormal event
7.1.3 Detection in the plaza scene

The video sequence of the plaza scene contains 2142 frames in total. Two typical frames of the different events in the plaza scene are shown in Fig. 9. The detection result and the ROC curve of the plaza scene are shown in Fig. 10. The AUC is 98.64%.
Fig. 8 (a) The classification result of the indoor scene (b) The ROC curve of the indoor scene
Fig. 9 Two different events in the plaza scene: (a) the normal event; (b) the abnormal event
7.1.4 Performance comparison

The comparison between our HMOFP-based algorithm and several popularly used methods is shown in Table 1. Our algorithm outperforms Optical Flow [14], NN [3], STCOG [20] and HOFO [26], and it is comparable to the remaining two methods in the indoor scene.

Fig. 10 (a) The classification result of the plaza scene (b) The ROC curve of the plaza scene
Table 1 The comparison of our proposed algorithm with other popularly used methods for the detection of GAE (AUC)

Scene    Ours      HOFO [26]   Sparse [3]   STCOG [20]
Lawn     99.76%    98.45%      99.5%        93.62%
Indoor   95.78%    90.37%      97.5%        77.59%
Plaza    98.64%    98.15%      96.4%        96.61%

NN [3]: 93%;  Optical flow [14]: 84%;  SF [14]: 96%  (single AUC values, not broken down by scene)
7.2 Global abnormal event detection based on the PETS2009 dataset

In this section, we evaluate our method for global abnormal event detection on the PETS2009 dataset [15]. In the following experiments, we choose specific scenes of interest as the targets of the detection process. In the PETS2009 dataset, the image resolution is 768 × 576. The image patch size is set to 384 × 288 with no overlap between neighboring patches. The range 0°–360° is divided evenly into 18 bins, i.e., p = 18. Using a spatial basis similar to Type A shown in Fig. 3c, an HMOFP feature of length 72 is obtained. The initial dictionary is a DCT matrix of size 72 × 144. Our experiments and the detection results are as follows.
7.2.1 People running detection

In this part, the training set contains 50 frames (Frame 0 to Frame 49) of the video sequence Time 14-31 and 61 frames (Frame 0 to Frame 60) of the video sequence Time 14-17, where people are walking from right to left and from left to right, respectively. The normal testing set includes 104 frames (Frame 0 to Frame 37 and Frame 108 to Frame 173) of Time 14-16. 119 frames (Frame 38 to Frame 107 and Frame 174 to Frame 222) of Time 14-16 are labeled as abnormal for testing, in which people are running in one direction. The two different scenes are shown in Fig. 11. The detection result and the ROC curve are shown in Fig. 12. The AUC is 91.23%.
Fig. 11 Two different scenes in the same location: (a) the normal scene; (b) the abnormal scene
Fig. 12 (a) The classification result of the testing frames (b) The ROC curve of the testing frames
7.2.2 People scatter detection

In this part, the training set is the video sequence Time 14-16 (Frame 0 to Frame 222), where people are walking or running in one direction. The normal testing set includes 41 frames (Frame 48 to Frame 88) of Time 14-17. 41 frames (Frame 337 to Frame 377) of Time 14-33 are labeled as abnormal for testing, in which people are scattering in all directions. The two different scenes are shown in Fig. 13. The detection result and the ROC curve are shown in Fig. 14. The AUC is 99.97%.
Fig. 13 Two different scenes in the same location: (a) the normal scene; (b) the abnormal scene
Fig. 14 (a) The classification result of the testing frames (b) The ROC curve of the testing frames
7.2.3 People movement detection

In this part, the training set is the video sequence Time 14-27 (Frame 0 to Frame 225), where people are talking and standing in a relatively fixed location. The normal testing set includes 108 frames (Frame 226 to Frame 333) of Time 14-27. 108 frames (Frame 0 to Frame 107) of Time 14-16 are labeled as abnormal for testing, in which people are walking or running in one direction. The two different scenes are shown in Fig. 15. The detection result and the ROC curve are shown in Fig. 16. The AUC is 98.38%.
Fig. 15 Two different scenes in the same location: (a) the normal scene; (b) the abnormal scene
Fig. 16 (a) The classification result of the testing frames (b) The ROC curve of the testing frames
7.3 Local abnormal event detection

In this section, we evaluate our method for local abnormal event detection and localization on the UCSD Ped1 dataset (http://www.svcl.ucsd.edu/projects/anomaly). There are 34 short clips in the training set. The testing set contains 36 short clips, of which a subset of 10 clips comes with manually generated pixel-level binary masks used to verify the regions containing anomalies. Each clip contains 200 frames at a 158 × 238 resolution. As shown in Fig. 17, the normal frames contain only pedestrians; in the abnormal scenes, commonly occurring anomalies include bikes, skaters, wheelchairs and cars. In the experiments, the image patch size is set to 26 × 34 with a 4-pixel overlap between neighboring patches in the horizontal direction. The range 0°–360° is divided into 18 bins, i.e., p = 18. Using the spatial-temporal basis Type B shown in Fig. 3c, the length of the HMOFP feature is 180. The initial dictionary is a DCT matrix of size 180 × 720.
Fig. 17 The normal event (a) and the abnormal events (b) ~ (e) in the UCSD Ped1 dataset
Fig. 18 The ROC curve of the LAE detection using frame-level groundtruth
7.3.1 Abnormal event detection based on the frame-level groundtruth

If a frame contains at least one abnormal pixel, it should be recognized as abnormal. The detection is compared with the frame-level groundtruth annotation of each frame. The detection ROC curve under this criterion is shown in Fig. 18. The performances of our algorithm and other popularly used methods are shown in Table 2, including the AUC and the Equal Error Rate (EER). Note that this evaluation carries a risk of co-occurrences of erroneous detections and abnormal events, i.e., normal features may be incorrectly detected as anomalies in abnormal frames. As shown in Table 2, our algorithm outperforms the methods of Ren et al. [17], Adam et al. [12] and MPPCA [12] in both EER and AUC. For the AUC values, only Sparse [3] and MDT [12] are better than our algorithm.
7.3.2 Abnormal event localization based on the pixel-level groundtruth
If at least 40% of the truly abnormal pixels are detected, the frame is considered to be detected correctly; otherwise, it is counted as a false positive. The pixel-level criterion is stricter but more accurate. The detection results are compared with the pixel-level groundtruth masks. The ROC curve under this criterion is shown in Fig. 19. The performance of our algorithm and other popular methods is shown in Table 3, including the AUC and the Equal Detected Rate (EDR). As shown in Table 3, our algorithm achieves the best results for both EDR (ours 53.82% > 48.49% [17]) and AUC (ours 58.23% > 48.23% [17]).
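The 40% rule above can be stated compactly in code. This is a hypothetical helper for illustration (the function name, mask layout and threshold parameter are our assumptions), not the authors' implementation:

```python
import numpy as np

def pixel_level_hit(pred_mask, gt_mask, min_overlap=0.40):
    """Return True if the predicted anomaly mask covers at least 40% of the
    truly abnormal pixels in the groundtruth mask (the pixel-level criterion)."""
    pred = np.asarray(pred_mask, bool)
    gt = np.asarray(gt_mask, bool)
    if not gt.any():                  # frame with no true anomaly:
        return not pred.any()         # any detection would be a false positive
    covered = np.logical_and(pred, gt).sum() / gt.sum()
    return bool(covered >= min_overlap)
```

For example, with 10 truly abnormal pixels, a prediction covering 5 of them (50%) counts as a correct detection, while one covering only 3 (30%) counts as a false positive.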
Table 2 The comparison of our proposed algorithm with other popularly used methods for the detection of LAE

Criteria | Ours   | Sparse [3] | Ren et al. [17] | Adam et al. [12] | MDT [12] | SF-MPPCA [12] | MPPCA [12] | SF [14]
EER      | 35.01% | 19%        | 46.44%          | 38%              | 25%      | 32%           | 40%        | 31%
AUC      | 70.62% | 86%        | 54.26%          | 56.63%           | 81.8%    | 67.25%        | 59%        | 67.5%
Fig. 19 The ROC curve of the LAE localization using pixel-level groundtruth
Table 3 The comparison of our proposed algorithm with other popularly used methods for the localization of LAE

Criteria | Ours   | Sparse [3] | Ren et al. [17] | Adam et al. [12] | MDT [12] | SF-MPPCA [12] | MPPCA [12] | SF [14]
EDR      | 53.82% | 46%        | 48.49%          | 24%              | 45%      | 28%           | 18%        | 21%
AUC      | 58.23% | 46.1%      | 48.23%          | 13.3%            | 44.1%    | 21.3%         | 20.5%      | 17.9%
In the frame-level evaluation, a frame is marked as positive if at least one abnormal feature is found, while in the pixel-level evaluation a stricter criterion is applied: a frame can only be marked as a correct positive if at least 40% of the truly abnormal features are reported. The frame-level evaluation carries a risk of coincidence: if a normal feature in an abnormal frame is incorrectly detected as an anomaly, it still contributes to a higher AUC score. For this reason, the pixel-level criterion is stricter but more accurate. Our method obtains an AUC of 58.23% in the pixel-level evaluation, which is the best result (improving the AUC by 10 percentage points over the previous best score of 48.23% [17]), and it also achieves a good result in the frame-level evaluation (an AUC of 70.62%).
8 Conclusions
In this work, an algorithm for abnormal event detection in crowded scenes was proposed for both GAE and LAE. Based on the HMOFP feature, two spatial-temporal bases were introduced for the global scale and the local scale, respectively. Moreover, dictionary construction based on the online dictionary learning method and the sparse reconstruction cost were presented. Compared with popular algorithms, our proposed algorithm obtained good results for abnormal event detection and localization, especially for pixel-level abnormal event localization.
From Table 1, we can see that the performance of the proposed method is not the best for global abnormal event detection in the indoor scene. From Table 2, we can see that our proposed method has limited performance for local abnormal event detection based on the frame-level groundtruth. Improving the performance in both situations is part of our future work. Another direction of our forthcoming research is optimizing the proposed method to decrease its processing time, so that it can operate in real time.
Acknowledgments This work is supported by the 973 Program (no. 2011CB302203), NSFC (nos. 61572067, 61272028, 61273274, 61672089, 61602538, and 61572064), PXM2016_014219_000025, the National Key Technology R&D Program of China (no. 2012BAH01F03), NSFB (no. 4123104), the Beijing Municipal Natural Science Foundation (no. 4162050), and the Natural Science Foundation of Guangdong Province (no. 2016A030313708).
References
1. Aharon M, Elad M, Bruckstein A (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322
2. Bi C, Wang H, Bao R (2014) SAR image change detection using regularized dictionary learning and fuzzy clustering. IEEE Int Conf Cloud Comput Intell Syst (CCIS): 327–330
3. Cong Y, Yuan J, Liu J (2011) Sparse reconstruction cost for abnormal event detection. IEEE Conf Comput Vision Pattern Recogn (CVPR): 3449–3456
4. Donoho DL, Tsaig Y (2006) Extensions of compressed sensing. Signal Process 86(3):533–548
5. Gu X, Cui J, Zhu Q (2014) Abnormal crowd behavior detection by using the particle entropy. Optik-Int J Light Electron Optics 125(14):3428–3433
6. Haque M, Murshed M (2010) Panic-driven event detection from surveillance video stream without track and motion features. IEEE Int Conf Multimed Expo (ICME): 173–178
7. Hung T, Lu J, Tan Y (2013) Cross-scene abnormal event detection. IEEE Int Sym Circ Syst (ISCAS): 2844–2847
8. Junior JSJ, Musse S, Jung C (2010) Crowd analysis using computer vision techniques. IEEE Signal Process Mag 27(5):66–77
9. Kosmopoulos D, Chatzis SP (2010) Robust visual behavior recognition. IEEE Signal Process Mag 27(5):34–45
10. Lee CP, Lim KM, Woon WL (2010) Statistical and entropy based abnormal motion detection. IEEE Student Conf Res Dev (SCOReD): 192–197
11. Li A, Miao Z, Cen Y, Wang T, Voronin V (2015) Histogram of maximal optical flow projection for abnormal events detection in crowded scenes. Int J Distrib Sensor Netw. doi:10.1155/2015/406941
12. Mahadevan V, Li W, Bhalodia V et al (2010) Anomaly detection in crowded scenes. IEEE Conf Comput Vision Pattern Recogn (CVPR): 1975–1981
13. Mairal J, Bach F, Ponce J, Sapiro G (2009) Online dictionary learning for sparse coding. Proc 26th Ann Int Conf Mach Learn (ACM): 689–696
14. Mehran R, Oyama A, Shah M (2009) Abnormal crowd behavior detection using social force model. IEEE Conf Comput Vision Pattern Recogn (CVPR): 935–942
15. PETS (2009) Performance evaluation of tracking and surveillance (PETS) 2009 benchmark data. http://www.cvg.reading.ac.uk/PETS2009/a.html
16. Quan Q, Hong-Yi C, Rui Z (2009) Entropy based method for network anomaly detection. IEEE Pacific Rim Int Symp Depend Comput: 189–191
17. Ren H, Moeslund TB (2014) Abnormal event detection using local sparse representation. IEEE Int Conf Adv Video Sign Based Surveill (AVSS): 125–130
18. Rodriguez M, Ali S, Kanade T (2009) Tracking in unstructured crowded scenes. IEEE Int Conf Comput Vision: 1389–1396
19. Sandhan T, Srivastava T, Sethi A, Jin Y (2013) Unsupervised learning approach for abnormal event detection in surveillance video by revealing infrequent patterns. IEEE Int Conf Imag Vision Comput N Z (IVCNZ): 494–499
20. Shi Y, Gao Y, Wang R (2010) Real-time abnormal event detection in complicated scenes. IEEE Int Conf Pattern Recogn (ICPR): 3653–3656
21. Sjarif NNA, Shamsuddin SM, Hashim SZ (2011) Detection of abnormal behaviors in crowd scene: a review. Int J Adv Soft Comput Appl 3(3):1–33
22. Thida M, Yong YL, Climent-Pérez P, Eng H-l, Remagnino P (2013) A literature review on video analytics of crowded scenes. Intell Multimed Surveill: 17–36
23. Tropp J, Gilbert AC (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans Inf Theory 53(12):4655–4666
24. Tziakos I, Cavallaro A, Xu L (2010) Local abnormal detection in video using subspace learning. IEEE Int Conf Adv Video Sign Based Surveill (AVSS): 519–525
25. UMN, Unusual crowd activity dataset of University of Minnesota, Department of Computer Science and Engineering. http://mha.cs.umn.edu/movies/crowd-activity-all.avi
26. Wang T, Snoussi H (2014) Detection of abnormal visual events via global optical flow orientation histogram. IEEE Trans Inform Forensics Sec 9(6):988–998
27. Xie M, Hu J, Guo S (2015) Segment-based anomaly detection with approximated sample covariance matrix in wireless sensor network. IEEE Trans Parallel Distrib Syst 26(2):574–583
28. Yang M, Dai D, Shen L et al (2014) Latent dictionary learning for sparse representation based classification. IEEE Comput Vision Pattern Recogn (CVPR): 4138–4145
29. Yang M, Zhang L, Feng X et al (2014) Sparse representation based Fisher discrimination dictionary learning for image classification. Int J Comput Vis 109(3):209–232
30. Yen S, Wang C (2013) Abnormal event detection using HOSF. Int Conf IT Converg Secur (ICITCS): 1–4
31. Zhan B et al (2008) Crowd analysis: a survey. Mach Vis Appl 19(5–6):345–357
32. Zhang Y, Qin L, Yao H, Huang Q (2012) Abnormal crowd behavior detection based on social attribute-aware force model. IEEE Int Conf Imag Process (ICIP): 2689–2692
33. Zhang Y, Qin L, Yao H, Huang Q (2015) Social attribute-aware force model: exploiting richness of interaction for abnormal crowd detection. IEEE Trans Circ Syst Video Technol 25(7):1231–1245
34. Zhu X, Liu J, Wang J, Li C, Lu H (2014) Sparse representation for robust abnormality detection in crowded scenes. Pattern Recogn 47(5):1791–1799
Ang Li received the Bachelor's degree in 2011 from the Harbin Institute of Technology. He is now a Ph.D. student at Beijing Jiaotong University. His research interests include Compressed Sensing, Video Processing, Abnormal Event Detection, Sparse Reconstruction, Low-rank Matrix Reconstruction, etc.
Zhenjiang Miao received his master's and doctoral degrees from Beijing Jiaotong University in 1990 and 1994, respectively. From 1994 to 2004, he studied and worked in France and Canada. In France, he first carried out post-doctoral research at the Toulouse National Institute of Technology (INPT) and then worked at the National Academy of Sciences (INRA). In Canada, he first worked at the National Research Council Institute for Information Technology (IIT-NRC), and then at Nortel and RIM, where he developed the wireless communication equipment DMS-MTX and the 3G handheld device BlackBerry 6750, products that now have a wide range of global sales and applications. He has published over 70 academic papers. He is currently a professor and Ph.D. supervisor, and has presided over the national 973 project "visual media interaction and integration" and other important national research projects.
Yigang Cen received the Ph.D. degree in Control Science & Engineering in 2006 from the Huazhong University of Science & Technology. In September 2006, he joined the Signal Processing Centre, Nanyang Technological University, Singapore, as a research fellow. He is currently a professor and Ph.D. supervisor at Beijing Jiaotong University. From Jan. 2014 to Jan. 2015, he was a visiting scholar in the Department of Computer Science, University of Missouri, in the US. His research interests include Compressed Sensing, Sparse Representation, Low-rank Matrix Reconstruction, Wavelet Construction Theory, etc.
Yi Cen received the Ph.D. degree in 2014 from the Beijing University of Posts and Telecommunications. He is currently a lecturer in the School of Information Engineering, Minzu University of China. His research interests include Compressed Sensing, Sparse Representation, Low-rank Matrix Reconstruction, etc.