Multimed Tools Appl DOI 10.1007/s11042-013-1763-7
Loitering detection using an associating pedestrian tracker in crowded scenes

Yunyoung Nam
© Springer Science+Business Media New York 2013
Abstract This paper presents a loitering detection method using an associating pedestrian tracker in public areas. We analyze spatio-temporal characteristics to monitor people and generate alerts when loitering persons are detected. To determine and adjust a time threshold for raising an alarm, we obtain the mean time of stay for normal and abnormal situations. In addition, we consider optimal thresholds for staying time and escaping time to deal with various conditions. For object identification, we measure the mean square error and histograms of oriented gradients. In order to trace moving objects continuously, the HSI color model and a combination of Euclidean distance, color difference, and shape difference are measured based on consistent labeling tracking. To evaluate the performance of our method, we show detection results on the PETS2007 dataset using thresholds obtained by the proposed methods. Our experiments show promising results, with an average recall rate of 75.45 % and an average precision rate of 87.12 % for loitering objects. We also compared the proposed method to other reported methods, and the experimental results showed a significant improvement in precision.

Keywords Abnormality · Loitering · Optimal threshold · Moving objects · Object identification · Spatio-temporal database · HSI color model
This research is supported by the International Collaborative R&D Program of the Ministry of Knowledge Economy (MKE), the Korean government, as a result of the Development of Security Threat Control System with Multi-Sensor Integration and Image Analysis Project, 2010-TD-300802-002.

Y. Nam, Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, MA 01607, USA. e-mail: [email protected]

1 Introduction

The demand for video surveillance systems is increasing immensely in various fields of security applications. Advances in closed-circuit television (CCTV) technology are turning video surveillance equipment into the most valuable loss prevention, safety, and security tool available today for both commercial and residential applications. Traditional CCTV
requires many operators to continuously monitor a significant number of cameras in areas that need security, such as military installations, roads, and airports. In contrast, intelligent surveillance systems with relatively few operators can provide automated services such as abrupt incursion detection, robbery monitoring, people counting, and loitering detection. Loitering detection aims to prevent vandalism and terrorist attacks as well as to identify potentially dangerous people. For instance, individuals loitering near access doors may be looking to “tailgate” an authorized member of staff in order to gain access to the premises or to a restricted area [6, 17]. Loitering is the act of remaining in a particular public place for a protracted time, which is prohibited by local governments in several countries. There is often another criminal statute or ordinance which can be applied specifically to control aggressive begging, soliciting prostitution, drug dealing, blocking entries to stores, public drunkenness, or being a public nuisance. As a consequence, we need to characterize the configuration of objects' appearances in terms of paths, which we call loitering patterns.

Loitering detection consumes a lot of time because of the large amount of data that a human operator has to analyze, and eventually the operator may become distracted. In [5, 23] and most commercial products, a loitering event is signaled by locating and tracking an individual when he or she stays in the field of view (FOV) of the monitoring camera under temporal constraints. However, this approach requires many training images to build the classifier, and the threshold value cannot be reused under different environmental conditions. Thus, we consider several events to learn the threshold values under various conditions.

In this paper, we propose an optimal time-threshold determination for raising a loitering alarm, which is desired in real-time systems. We utilize the mean square error and histograms of oriented gradients (HOG) [8] for object identification. To trace objects continuously, the hue saturation intensity (HSI) color model and a combination of Euclidean distance, color difference, and shape difference are measured based on consistent labeling tracking.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 presents an overview of the application model. Section 4 describes our proposed method. Section 5 presents the implementation and our proposed algorithm for loitering detection. Section 6 shows experimental results of the proposed methods. Finally, Section 7 discusses some issues and concludes the paper, summarizing open issues and topics for future work.
2 Related work

Numerous researchers [1, 4, 12, 14] have developed approaches to detect loitering in public spaces such as bus stops and subway platforms, mainly to detect drug dealers. However, existing loitering detection systems did not recognize loitering automatically; instead, they relied on stepping frame-by-frame through the recorded sequence, using either purely manual or semi-automated tools to characterize the targets in the scene. The manual method is quite straightforward, but it is particularly tedious and time-consuming for marking target boundaries, especially when the operation must be performed on many thousands of targets. The semi-automatic approach utilizes a detection and tracking algorithm to find and identify likely corresponding targets from the previous frame, allowing manual intervention to correct any algorithmic errors.

In the foreground detection task, many approaches based on background subtraction have been proposed to detect foreground objects. Such methods differ mainly in the type
of background model and the procedure used to update the model. In [22], a simple change detection module was used that considers pixel differences between consecutive frames and accumulates frame-by-frame decisions. In [2], a scheme based on chromaticity distortion was presented. In [15], Li et al. proposed using spectral, spatial, and temporal features, incorporated in a Bayesian framework, to characterize the background appearance at each pixel. Recently, the Gaussian mixture model (GMM) has commonly been used in machine learning and classification to extract the foreground of a video scene [3, 21, 24, 25]. In this paper, GMM-based foreground detection is used to deal with slow illumination changes, periodic motions, long-term scene changes, and camera noise from cluttered backgrounds; a minimal sketch follows below. A good review of background modeling methods can be found in [19].
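As a concrete illustration, the following minimal sketch shows GMM-based foreground extraction with OpenCV's MOG2 background subtractor of the kind used here; the input file name and parameter values are illustrative assumptions, not values prescribed by our method.

```python
import cv2

cap = cv2.VideoCapture("pets2007_cam3.avi")  # hypothetical input clip
# Mixture-of-Gaussians background model; shadow detection lets us drop
# shadow pixels (marked 127) from the foreground mask.
bgs = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                         detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bgs.apply(frame)            # 0 = background, 127 = shadow, 255 = fg
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
    foreground = cv2.bitwise_and(frame, frame, mask=mask)  # noise masked out
cap.release()
```

The morphological opening removes isolated noisy pixels so that the remaining connected components can be grouped into blobs.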
Some researchers [9, 20] used the trajectory obtained by tracking each observed object in the scene to characterize anomalous situations. These approaches achieve low performance in crowded scenes due to the complexity of the problem: a crowded scene may contain hundreds or even thousands of objects, and object detection and tracking suffer from problems such as occlusion, varying proximity of objects, and similar appearance. Previous approaches have reported many difficulties, including shadows, occlusions, nonrigid targets, and varying lighting conditions. Moreover, these approaches have been demonstrated successfully only on longer temporal windows, and because of their more global features it has been difficult to extend them to cluttered and partially occluded images in the real world. Furthermore, none of the approaches described in the literature have considered the optimal time threshold for raising a loitering alarm in crowded scenes to reduce the number of false positives in the detection of loitering objects. The proposed approach demonstrates the potential advantage of considering optimal threshold values for staying time and escaping time based on system parameter adaptation. In the next section, we present an overview of the problem description and application model.

3 Problem description and application model

3.1 Problem description

In general, a loitering detection method suffers from excessive false positives, which can result in many false alarms [9, 14]. Such experimental results are probably due to segmentation errors in each frame caused by shadows, occlusions, nonrigid targets, and varying lighting conditions. The robustness of features based on human shape and body models relies heavily on the performance of foreground human segmentation and people tracking, which are hard-to-solve problems due to dynamic backgrounds and occlusions. Most trajectory-based analysis systems rely on perfect tracking [16]; further analysis is meaningless if objects cannot be accurately localized and tracked. Due to broken tracks caused by occlusions, algorithms for handling tracking noise must be developed for a completely unsupervised system in the real world. In particular, for crowded scenes, continuous tracking of individual objects is not possible because of occlusions or failures.

The main contributions of this paper are as follows. The adaptive Gaussian mixture modeling method is used to extract the foreground of the video scene and to exclude noise by applying the foreground as a mask. In order to deal with various loitering conditions, the optimal threshold values for staying time and escaping time are determined based on system parameter adaptation. The mean square error and HOG are measured to identify moving
objects. For each foreground blob, the HSI color model and a combination of Euclidean distance, color difference, and shape difference are measured to consistently label blobs.

3.2 Application model

The proposed system consists of two main parts: pedestrian detection and loitering detection. To detect loitering objects, it is necessary to determine and adjust system parameters such as the time threshold of loitering for raising an alarm. This work aims to automatically detect loitering pedestrians; i.e., a loitering event is signaled by locating and tracking an individual when a pedestrian stays in the field of view of the monitoring camera under temporal constraints. In order to achieve this, moving objects should be analyzed based on their spatio-temporal characteristics.

Figure 1 shows the outline of the basic operations of a loitering object detection approach using spatial information. Once moving objects are identified, each blob $B_i$ is represented by its rectangular bounding box, which has three types of parameters: points $C_{i,k}^j$, width $w_i^j$, and height $h_i^j$, where $k$ indexes the four corners and the bottom center of the box and $j$ is the frame number. The proposed method maps events of interest from image coordinates to real-world coordinates by assuming that the bottom center of the blob or mean-shift tracker bounding box touches the ground. As shown in the figure, we use the bottom center $C_{i,5}^j$ of $B_i$ to obtain the coordinates of each blob in the $j$th frame. All blob vectors are acquired from the tracker and organized in the matrix $AB$ for all blobs. Each blob is classified into one of two vectors: $SB$ for still blobs and $PB$ for pedestrian blobs. All pedestrian blobs are analyzed to detect possible loitering object candidates. For example, as seen in the figure, when $B_1$ goes inside the restricted area, its position $C_{1,5}^j$ is stored in the pedestrian buffer $B_{ped}$ for each frame $j$. When $B_2$ is walking on the edge of the area, the blob may not be detected for a few frames. If $B_3$ leaves the area, it is removed from the matrix, so the index matrix returns to zero and the alarm is cancelled.

Figure 2 shows the outline of the basic operations of a loitering object detection approach using temporal information. Basically, if the staying time $T_{in}^i$ of a pedestrian exceeds the threshold value $T_{th}$, the pedestrian has been inside the area long enough to be considered loitering, and consequently the alarm starts. Although this definition is reasonable for most common cases, it is difficult to detect occluded objects in crowded environments and is unnecessary in many applications.
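Concretely, the per-blob bookkeeping described above (the $B_{ped}$ position buffer and the staying and escaping times) can be sketched as follows; the record layout and function names are our own illustrative choices, not the paper's notation.

```python
from dataclasses import dataclass, field

@dataclass
class Blob:
    label: int
    x: float; y: float            # bottom-center point C_{i,5} (ground contact)
    w: float; h: float            # bounding-box width and height
    t_in: int = 0                 # frames spent inside the ROI (staying time)
    t_out: int = 0                # consecutive frames spent outside the ROI
    trace: list = field(default_factory=list)  # per-frame positions (B_ped)

def update(blob: Blob, inside_roi: bool, t_out_th: int) -> bool:
    """Advance the counters for one frame; return False when the blob has
    left the ROI for longer than the escaping-time threshold and its record
    should be dropped from the pedestrian buffer."""
    if inside_roi:
        blob.t_in += 1
        blob.t_out = 0
        blob.trace.append((blob.x, blob.y))
        return True
    blob.t_out += 1
    return blob.t_out <= t_out_th   # tolerate short escapes (e.g., blob B_2)
```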
Fig. 1 Outline of basic operations of a loitering object detection approach using spatial information
Fig. 2 Outline of basic operations of a loitering object detection approach using temporal information
The proposed system calculates the escaping time threshold $T_{out\_th}$. When an object goes outside the area for longer than $T_{out\_th}$, the timer may be reset and the object missed; after splitting, an object may leave the area before the alarm starts. Any object can move out of a region of interest (ROI) for a moment while the object is tracked in the field of view. To deal with this case, each escaping time $T_{out}^i$ of $B_i$ is used to determine whether $T_{out}^i$ is less than the escaping time threshold $T_{out\_th}$ or not.

4 Proposed approach

4.1 Overview of loitering object detection

To construct the loitering object detection algorithm, we analyze the spatial and temporal characteristics of each frame in the video sequence. The detection of loitering objects in our method consists of three phases: location analysis, temporal change analysis, and spatial change analysis. Figure 3 illustrates a block diagram of the loitering detection approach. The first phase involves blob detection and pedestrian detection for object classification
Fig. 3 Block diagram of the loitering detection approach
using the background modeling and subtraction (BGS) model. In order to achieve robustness, real-time processing, and high accuracy, the model integrates mixture-of-Gaussians background subtraction and filtering performed using the functions provided by the OpenCV library [7]. To perform adaptive region-level background learning and updating, the model was developed with two different methods: pixel-based and non-motion-based background updates. In addition, we applied an ROI to each camera to exclude regions that are too far away, and adaptive filters that estimate the object size and the average human height to eliminate any objects that are too small or too large. In the second phase, spatio-temporal information is validated, where the context information surrounding the static region is exploited to identify a loitering object. Finally, the loitering object is verified in the third phase to ensure that the object stays a minimum amount of time before an alert is fired. Figure 4 illustrates a flow chart of the loitering detection approach.

4.2 Pedestrian detection and tracking

The proposed method proceeds cyclically, analyzing one frame of the video sequence at a time. These modules separate the foreground from the background to identify the areas occupied by moving people and to obtain the properties of the bodies in blobs for each frame. Using this information, only the pedestrians are selected by eliminating other blobs such as abandoned objects. Given the list of blobs that populate each frame, in each vector the module stores only the blobs whose central coordinates $(x, y)$ are located within the areas occupied by moving people.

If there is more than one person in an investigated scene, the proposed algorithm must distinguish between the different pedestrian blobs to correctly fill them into the matrix $B_{ped}^i$, which is constructed to store, in order, the coordinates of the pedestrian blobs only. Pedestrians are distinguished with an associating pedestrian tracker, which simply compares the geometrical x-coordinate positions of the pedestrians. In order to trace pedestrians effectively, Kalman filter tracking [27] was applied to the pedestrian tracking method; a sketch is given below.
Fig. 4 Outline of the proposed loitering detection approach
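The following is a minimal sketch of a per-pedestrian constant-velocity Kalman tracker [27] built on OpenCV's `cv2.KalmanFilter`, with state $(x, y, v_x, v_y)$ and the blob's bottom-center point as the measurement; the noise covariances are illustrative assumptions.

```python
import numpy as np
import cv2

kf = cv2.KalmanFilter(4, 2)                       # 4 state dims, 2 measured
kf.transitionMatrix = np.array([[1, 0, 1, 0],     # x += vx
                                [0, 1, 0, 1],     # y += vy
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2     # assumed values
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def track_step(measurement):
    """Predict the pedestrian position for this frame, then correct with the
    observed bottom-center point of the associated blob, if one was found."""
    predicted = kf.predict()
    if measurement is not None:                   # blob detected this frame
        kf.correct(np.array(measurement, dtype=np.float32).reshape(2, 1))
    return float(predicted[0]), float(predicted[1])
```

Prediction without correction lets the tracker coast through the few frames in which a pedestrian blob is lost at the edge of the area.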
4.3 Staying time in region of interest

To measure the staying time in the ROI, the method uses five point-type inputs: the bottom-center position of a blob and four points that define the area. The area is characterized by four points $P_1, P_2, P_3, P_4$, which yield the following four straight lines used to determine whether a blob is located in the area or not:

$$a_i = \begin{cases} \dfrac{P_i^y - P_{i+1}^y}{P_i^x - P_{i+1}^x}, & \text{if } 0 < i < 4 \\[2ex] \dfrac{P_i^y - P_1^y}{P_i^x - P_1^x}, & \text{otherwise} \end{cases} \qquad (1)$$

where $P_i^x$ and $P_i^y$ are the coordinates of ROI point $P_i$, and

$$b_i = P_i^y - a_i \cdot P_i^x \qquad (2)$$

To determine whether the coordinates of a point are in the ROI or not, two criteria are calculated by

$$x_i = \frac{P^y - b_i}{a_i} \ \text{ if } i \text{ is an even number}, \qquad y_i = a_i \cdot P^x + b_i \ \text{ otherwise}, \qquad (3)$$

where $P^x$ and $P^y$ are the central coordinates of a moving blob. To mark the points, we introduce the following indicator $Z$ to decide whether the moving blob is inside or not:

$$Z = \begin{cases} 1, & \text{if } y_{min} \le P^y \le y_{max} \text{ and } x_{min} \le P^x \le x_{max}, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $y_{max} = \max(y_i)$, $x_{max} = \max(x_i)$, $y_{min} = \min(y_i)$, $x_{min} = \min(x_i)$, and $\max(e)$ and $\min(e)$ are the maximum and minimum values of the elements $e$. A sketch of this membership test is given below.
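Putting (1)–(4) together, the following sketch tests whether a blob's bottom center lies inside the quadrilateral ROI. It assumes, as the equations implicitly do, that no ROI edge is exactly vertical or horizontal (so every $a_i$ is finite and nonzero); the reconstruction of the inside condition is ours.

```python
def inside_roi(P, px, py):
    """P: four ROI corners [(x, y), ...] in order; (px, py): blob bottom center.
    Returns the indicator Z of (4): 1 if the point is inside, 0 otherwise."""
    a, b = [], []
    for i in range(4):
        x1, y1 = P[i]
        x2, y2 = P[(i + 1) % 4]
        a.append((y1 - y2) / (x1 - x2))   # (1): slope of each edge line
        b.append(y1 - a[i] * x1)          # (2): intercept of each edge line
    # (3): crossings of the horizontal/vertical lines through the point
    xs = [(py - b[i]) / a[i] for i in (0, 2)]   # one opposite pair of edges
    ys = [a[i] * px + b[i] for i in (1, 3)]     # the other pair of edges
    # (4): inside iff the point lies between both pairs of crossings
    return int(min(xs) <= px <= max(xs) and min(ys) <= py <= max(ys))
```

For a convex ROI this amounts to the usual point-in-polygon test restricted to a quadrilateral.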
4.4 Associating threshold adaptation

In each frame, $B_{ped}$ is reconstructed to store, in order, the coordinates of only the pedestrian blobs that are inside the area. Each row represents a frame, the coordinates of each blob are stored in two columns, and the matrix can be filled with up to $N_{max}$ blob columns. If blob $B_i$ enters the area, its position $C_{i,5}^j$ is stored in the first row of $B_{ped}$, and as long as the blob stays in the area the matrix keeps updating, adding new rows. If $T_{in}^i$ of a given blob exceeds $T_{th}$, a pedestrian has been in the area for the time threshold, so the alarm turns on. If a blob leaves the area, it is removed from the matrix, so the index matrix returns to zero and the alarm is cancelled.

In some cases there are problems in recognizing pedestrians: for example, a blob may not be detected for a few frames when a pedestrian walking on the edge of the area goes outside it for just a few frames. In this case, the pedestrian's actions may cause the removal of the blob coordinates from the matrix, and consequently a warranted alarm activation is suppressed even though there is a danger. In order to avoid this behavior, $T_{out}^i$ is also used as a threshold: if the blob goes outside the area or no coordinates are available, its record is not removed from the buffer for $T_{out}^i$ frames.

Figure 5 shows the staying time distribution in the PETS2007 dataset, which is composed of nine scenarios (S00–S08) captured from four types of cameras.

Fig. 5 Experimental results of staying time distribution: (a) the empirical distribution and (b) the fitted normal distribution $f(T_{in}^i)$, each comparing S0 (control, solid) with S1–S8 (dashed); the x-axis is time (frames) and the y-axis is probability
In this paper, we used the third camera to avoid occlusions among pedestrians in crowded scenes. In the dataset, the S00 scenario is a control sequence in which none of the defined events (loitering, theft, unattended luggage) takes place. We used S00 and S01–S08 to obtain each $T_{in}^i$ in the normal and abnormal situations, respectively. $T_{in}^i$ of S00–S08 is obtained by measuring the movement time in the ROI. After collecting a set of such sequences, the distribution of their $T_{in}^i$ is obtained as the probability $P(T_{in}^i)$, calculated by

$$P(T_{in}^i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(T_{in}^i-\mu)^2/2\sigma^2}, \qquad (5)$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $T_{in}^i$. In the figure, $T_{in}^i$ of S00 is drawn with blue solid lines and $T_{in}^i$ of S01–S08 is dashed black. The distribution of $T_{in}^i$ in S00 ranges from 0 to 153, with mean 62 and standard deviation 57.6. On the other hand, the distribution of $T_{in}^i$ in S01–S08 ranges from 0 to 454, with mean 93.2 and standard deviation 114. In the next section, we explain the comparison between the results on the dataset.
5 Loitering detection

5.1 Time threshold

5.1.1 Staying time threshold

For loitering detection, $T_{in}^i$ is computed by

$$T_{in}^i = t_{in}^i \cdot fps, \qquad (6)$$

where $t_{in}^i$ represents the time spent in the ROI and $fps$ represents the frame rate. In addition, the risk $R$ is calculated by

$$R = \frac{N(i) \times \rho}{T_{in}^i}, \qquad (7)$$

where $N(i)$ is the number of frames of loitering and $\rho$ is the rate for risk computation.

As mentioned in the previous subsection, the threshold is the most critical parameter for distinguishing appearances in the object identification and loitering detection problems. In some of the literature [4], the threshold value is acquired off-line by various methods [11]. However, this approach requires many training images to build the classifier, and the resulting threshold value cannot be used under other conditions, such as different environmental conditions. Therefore, we propose a method that considers two specific events to learn the threshold value and handle various conditions. The probabilities $P_{nor}(T_{in}^i)$ and $P_{abn}(T_{in}^j)$ of the normal and abnormal situations obtained in Section 4.4 are illustrated in Fig. 5b. To compute the associating threshold, the following formulas are used:

$$P_{nor}(T_{in}^i) = \frac{1}{\sigma_{nor}\sqrt{2\pi}}\, e^{-(T_{in}^i-\mu_{nor})^2/2\sigma_{nor}^2} \qquad (8)$$

$$P_{abn}(T_{in}^j) = \frac{1}{\sigma_{abn}\sqrt{2\pi}}\, e^{-(T_{in}^j-\mu_{abn})^2/2\sigma_{abn}^2} \qquad (9)$$

$T_{th}$ is determined by solving the following equations:

$$P_{nor}(T_{th}) = P_{abn}(T_{th}) \qquad (10)$$

$$e^{(T_{th}-\mu_{nor})^2/2\sigma_{nor}^2 \,-\, (T_{th}-\mu_{abn})^2/2\sigma_{abn}^2} = \frac{\sigma_{abn}}{\sigma_{nor}} \qquad (11)$$
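Taking logarithms of (10)–(11) reduces the intersection of the two Gaussians to a quadratic in $T_{th}$. The sketch below solves it numerically with the S00 and S01–S08 statistics measured in Section 4.4; the root-selection rule (keep the positive crossing) is our assumption.

```python
import numpy as np

def staying_time_threshold(mu_n, sd_n, mu_a, sd_a):
    """Solve P_nor(T) = P_abn(T) for T: equating the log-densities gives the
    quadratic a*T^2 + b*T + c = 0; keep the positive crossing."""
    a = 1.0 / (2 * sd_a**2) - 1.0 / (2 * sd_n**2)
    b = mu_n / sd_n**2 - mu_a / sd_a**2
    c = mu_a**2 / (2 * sd_a**2) - mu_n**2 / (2 * sd_n**2) + np.log(sd_a / sd_n)
    return max(r.real for r in np.roots([a, b, c]) if r.real > 0)

# Normal (S00): mu = 62, sigma = 57.6; abnormal (S01-S08): mu = 93.2, sigma = 114
print(round(staying_time_threshold(62.0, 57.6, 93.2, 114.0)))   # -> 132
```

With the measured statistics this yields $T_{th} \approx 132$ frames, the value used in the experiments of Section 6.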
5.1.2 Escaping time threshold

The escaping time threshold $T_{out\_th}$ is calculated by solving the following two equations:

$$P_{nor}(T_{in}^i) = P_{abn}(T_{in}^j) \qquad (12)$$

$$P_{nor}(T_{in}^i) = \frac{1}{\sigma_{nor}\sqrt{2\pi}}\, e^{-(T_{in}^i-\mu_{nor})^2/2\sigma_{nor}^2} = 0 \qquad (13)$$

There is no number that satisfies this equation. Thus, we used the following inequality instead:

$$\frac{1}{\sigma_{nor}\sqrt{2\pi}}\, e^{-(T_{in}^i-\mu_{nor})^2/2\sigma_{nor}^2} < \epsilon, \qquad (14)$$
where $\epsilon$ is determined by the size of the normal dataset and calculated by the cumulative distribution function (CDF). The CDF gives the probability over the interval $(-\infty, T_{in}^i]$ as follows:

$$\Phi(T_{in}^i) = \int_{-\infty}^{T_{in}^i} P_{nor}(t)\,dt \qquad (15)$$

We can use the approximate value for the X-intercept $T_{in}^x$, which is obtained by

$$T_{in}^x = \arg\max_i T_{in}^i. \qquad (16)$$

However, this does not take the moving time limitation into consideration. In this paper, we instead use the minimum value of $T_{in}^i$ as $T_{in}^x$ once its cumulative percentage exceeds $\varphi$, as follows:

$$T_{in}^x = \arg\min_i T_{in}^i, \quad \text{if } \Phi(T_{in}^i) > \varphi. \qquad (17)$$
Finally, $T_{out\_th}$ is calculated by

$$T_{out\_th} = T_{in}^x - T_{th}. \qquad (18)$$
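A minimal sketch of (15)–(18): the cutoff $T_{in}^x$ is read off the inverse CDF of the normal-situation Gaussian, and the escaping time threshold is the gap to $T_{th}$. The value $\varphi = 0.95$ is an illustrative assumption; with the Section 4.4 statistics it gives a threshold close to the $T_{out\_th} = 27$ frames used in Section 6.

```python
from statistics import NormalDist

def escaping_time_threshold(mu_n, sd_n, t_th, phi=0.95):
    # (15), (17): smallest T_in whose cumulative percentage exceeds phi
    t_in_x = NormalDist(mu_n, sd_n).inv_cdf(phi)
    return t_in_x - t_th                           # (18)

print(round(escaping_time_threshold(62.0, 57.6, t_th=132.0)))   # ~25 frames
```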
In the next section, we show experimental results on the PETS2007 dataset using the $T_{th}$ and $T_{out\_th}$ obtained by the equations above.

5.2 Object identification

After background subtraction, each moving object is identified by its color histogram in our system. When an object moves in the monitored area, pixel-level subtraction separates the background and object images. However, the subtracted data contain numerous useless noisy and ungrouped pixels; these pixels should be eliminated, and the grouped pixels should be combined into blobs representing moving objects. To match the color histogram of each blob, we use the mean square error (MSE) to quantify the difference between the color histogram values $H_a$ and $H_b$ of two colors $a$ and $b$, i.e., between the values implied by an estimator and the true values of the quantity being estimated. The MSE is calculated by

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (H_a - H_b)^2, \qquad (19)$$
where $n$ is the total number of bins. In addition, HOG is used for pedestrian detection. A detection window with an overlapping grid of HOG descriptors, whose combined feature vector is used in a conventional support vector machine (SVM) [18, 26] window classifier, provides the human detection vector. The combined vectors are fed to a linear SVM for object/non-object classification, and the detection window is scanned across the image at all positions and scales. Although our current linear SVM detector is reasonably efficient (processing 240×160 and 480×320 scale-space images in less than a second), there is still room for optimization; to further speed up detection, it would be useful to develop a coarse-to-fine or rejection-chain style detector based on HOG descriptors. We tested our detector on the PETS2007 dataset, using S0 and S2 as the training set containing 4,500 frames and S01, S03, S04, S06, S07, and S08 as the testing set.
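The two identification cues of this subsection can be sketched as follows; the MSE comparison follows (19) directly, and the pedestrian detector uses OpenCV's bundled HOG people detector as a stand-in for our trained linear SVM. The stride, padding, and scale values are assumptions.

```python
import numpy as np
import cv2

def hist_mse(ha, hb):
    """(19): mean square error between two n-bin color histograms."""
    ha, hb = np.asarray(ha, float), np.asarray(hb, float)
    return float(np.mean((ha - hb) ** 2))

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_pedestrians(frame):
    """Scan the HOG detection window across all positions and scales and
    return (x, y, w, h) boxes around detected people."""
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    return [tuple(r) for r in rects]
```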
5.3 Occluded object tracking

In order to solve the occlusion problem, we use the HSI color model until objects come into an occlusion situation, and we attempt to measure noise according to the change in blob surface during merging. A blob is defined as a group of objects such as persons or cars, acting as a container that can hold one or more objects. For each foreground blob, the color model is initialized, and a combination of Euclidean distance, color difference, and shape difference is used to label the blobs consistently for the rest of the sequence. The model accommodates partial occlusions, pose variations, and illumination changes through partial updating at every frame.

There are two major approaches [13] for dealing with occlusion using a single camera. The first is the merge-split (MS) approach, which merges the detected occluding blobs into a single new blob; from that point on, the attributes of the original objects are encapsulated in the new blob. The second is the straight-through (ST) approach, which continues to track the individual blobs containing only one object through the occlusion without attempting to merge them. We use the MS approach in this paper.

When a new set of blobs is computed for a frame, an association with the set of blobs from the previous frames is required, and this association can be unrestricted: with each new frame, blobs can enter, exit, merge, and split. The system detects that two or more people have merged into a group when the total number of blobs in the frame has decreased and two or more blobs in the previous frame overlap with a single blob in the current frame. Conversely, a group split is detected when the total number of blobs in the frame has increased and several blobs in the current frame overlap with a group blob in the previous frame. In order to assign labels after a split, each blob involved in the splitting is segmented as if it were still the group with all its components. Assuming that each person can be present in only one of the blobs involved in the splitting process, a person is concluded to be present in the blob that contains the largest number of pixels labeled with his or her label.

5.4 Consistent labeling tracking

In general, background subtraction produces a reference image from an initial image and compares the current image with the reference image to obtain a foreground image. The reference image can be produced easily by averaging the initial background images. However, if a fixed reference image is used, ghosts can appear in it; on the other hand, if the reference image is updated rapidly, a stopped object can be absorbed into the background reference image and can no longer be traced continuously. In order to solve this problem, we utilize color histograms, object distance, moving direction, and object size. For each frame of the sequence, the tracking information of each object is continuously updated according to the results of object detection in the current frame. All objects in the current frame are compared with the objects in the previous frame, and each object currently being traced is updated with the information from a detected object if the dissimilarity between the two objects is smaller than that of any other pairing. A sketch of this association step follows.
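A minimal sketch of the association step, combining Euclidean distance, color difference, and shape difference into a single score; the weights and the use of OpenCV's HSV space as a stand-in for HSI are illustrative assumptions.

```python
import numpy as np
import cv2

def color_histogram(patch, bins=16):
    """Hue/saturation histogram of a blob patch (HSV as a stand-in for HSI)."""
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def combined_distance(prev, cur, w1=0.5, w2=0.3, w3=0.2):
    d_pos = np.hypot(prev["x"] - cur["x"], prev["y"] - cur["y"])    # Euclidean
    d_col = float(np.mean((prev["hist"] - cur["hist"]) ** 2))       # color diff
    d_shp = abs(prev["w"] - cur["w"]) + abs(prev["h"] - cur["h"])   # shape diff
    return w1 * d_pos + w2 * d_col + w3 * d_shp

def associate(tracked, detections):
    """Give each detection the label of the most similar tracked blob."""
    for det in detections:
        best = min(tracked, key=lambda t: combined_distance(t, det))
        det["label"] = best["label"]
```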
5.5 Loitering object detection

Once a moving object is identified, its moving time is recorded in $PB$, either by updating the vector if the object has already been recorded or by adding a new record if it is a new moving object. After recording, a loitering event can be detected by comparing the moving times of the objects. When an object candidate is identified, the object is compared to $AB$. The loitering object candidate is identified as one of the objects in the vector, and its moving time is updated, if the difference between the two objects is smaller than or equal to a pre-defined associating threshold; otherwise, the object candidate is added to $PB$ as a new object. As a result, loitering behavior is detected by comparing the moving time against the loitering strategy.

The loitering object detection algorithm is summarized in Algorithm 1. In the initialization step, $B_{ped}^i$ and the threshold values are set to 0. Depending on the ROI, $T_{th}$ and $T_{out\_th}$ are determined in a training phase for the normal situation. The coordinates $C_{bott}^i.x$ and $C_{bott}^i.y$ of blobs are stored in $PB_x^i$ and $PB_y^i$ while a moving object is located in the ROI. If the object moves out of the ROI, $T_{out}^i$ is increased in each frame. If the staying time of $B_{ped}^i$ is greater than $T_{th}$, the alarm starts while the pedestrian is in the ROI: a red rectangle around the loitering pedestrian appears on the image, and the text “Alarm! Loitering” is displayed in red on the loitering pedestrian. All results are stored in the indicator $K$ depending on the events.
Algorithm 1 Loitering object detection algorithm
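The listing of Algorithm 1 appeared as a figure in the original; the following Python sketch is our reconstruction of its control flow from the description above (reusing the `Blob` record and `inside_roi` test sketched earlier), not the authors' exact pseudocode.

```python
def detect_loitering(frames_blobs, roi, T_th, T_out_th):
    """frames_blobs: per-frame lists of Blob records. Returns per-frame values
    of the indicator K used in Section 6: -1 no object, 0/1 object outside or
    inside the ROI, 2 loitering alarm (the blink state 3 is omitted here)."""
    K = []
    for frame_blobs in frames_blobs:
        if not frame_blobs:
            K.append(-1)                       # no object in this frame
            continue
        k = 0
        for b in frame_blobs:
            if inside_roi(roi, b.x, b.y):      # membership test of (1)-(4)
                b.t_in += 1                    # accumulate staying time
                b.t_out = 0
                k = max(k, 2 if b.t_in > T_th else 1)  # alarm when T_in > T_th
            else:
                b.t_out += 1                   # accumulate escaping time
                if b.t_out > T_out_th:         # escaped for too long:
                    b.t_in = 0                 # reset the staying-time counter
        K.append(k)
    return K
```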
6 Experiments

In this section, we discuss the results on the datasets S01, S03, S04, S06, S07, and S08 of the PETS2007 benchmark data. For most of the clips, the overall illumination is low, and the locations of bright highlights on the walls and floor move. Additionally, camera 3 was moved slightly for clips S05 through S08. Based on the characteristics of the scenes and loitering objects, we determine whether the current image points belong to the background, moving objects, or loitering objects. To mark the points, we used the following indicator:

$$K_{i,i-1} = \begin{cases} -1, & \text{no object}, \\ 0, & \text{an object is not in the ROI}, \\ 1, & \text{an object is in the ROI}, \\ 2, & \text{a pedestrian in the ROI for more than } T \text{ seconds}, \\ 3, & \text{a pedestrian in the ROI for more than } T \text{ seconds and blink timer seconds}, \end{cases} \qquad (20)$$

where $i$ and $i-1$ refer to the current and previous frames, respectively. For the remainder of each sequence, the start time of any alarm is logged and compared to ground-truth data to evaluate the numbers of true positive ($TP$), false positive ($FP$), and false negative ($FN$) alarms. Algorithm performance is assessed using the weighted harmonic mean $F$ of recall $r$ and precision $p$, as described below:

$$r = \frac{TP}{TP + FN} \qquad (21)$$

$$p = \frac{TP}{TP + FP} \qquad (22)$$

$$F = \frac{(\alpha + 1)\,rp}{r + \alpha p} \qquad (23)$$
where $\alpha$ is the recall bias, a weighting of recall relative to precision declared in each scenario definition; $\alpha$ ranges from 0.35 to 75 depending on the scenario. $F$ depends on $\alpha$, which determines the influence of the detection rate (recall) relative to that of the false alarm rate on the value of $F$. A higher recall bias is used to assess systems for the 'Event Recording' role, since in this role false alarms are a less significant problem; knowledge of the recall bias value enables manufacturers to optimize their systems for either role under evaluation. In this paper, $\alpha$ is set to 35 from empirical observation.

Figure 6 shows the detection results on the PETS2007 dataset (S01, S03, S04) when $T_{th}$ is 75 and $T_{out\_th}$ is 10. These $T_{th}$ and $T_{out\_th}$ were determined empirically; the values depend on the surroundings and the difference between two objects, and are only obtainable through experiment. In Fig. 6a, the object loiters from the 489th frame to the 3658th frame; the rate of missed detections is 52.08 %, but the precision is 99.8 %. In Fig. 6b, the object loiters from the 300th frame to the 2049th frame; the recall and precision are 17.66 % and 51.33 %, respectively. In this scene, two persons are loitering in the ROI, and due to occlusion the blob is not classified as a human blob, which explains the low recall and precision. In Fig. 6c, the object loiters from the 848th frame to the 2448th frame; the recall and precision are 20.24 % and 32.83 %, respectively. In this scene, three persons are loitering in the ROI, and partial occlusion increased the chance of false alarms.
Fig. 6 Detection results on the PETS2007 dataset (S01, S03, S04): panels (a)–(c) plot the indicator against the frame number for $T_{th} = 75$, $T_{out\_th} = 10$, together with the ground truth
Figure 7 shows detection results on the PETS2007 dataset (S01, S03, S04) based on the proposed method, with $T_{th}$ and $T_{out\_th}$ of 132 and 27, respectively, as obtained by (10) and (18). In S01, S03, and S04, previously missed objects are detected for 1484, 692, and 346 additional frames, respectively, and the $FN$ values are reduced by 1533, 741, and 396. However, the number of false alarms increases by 23, 251, and 208.

Figure 8 shows detection results on the PETS2007 dataset (S06, S07, S08) when $T_{th}$ is 75 and $T_{out\_th}$ is 10. In Fig. 8a, the object loiters from the 339th frame to the 1800th frame; the rate of missed detections is 40.36 %, but the precision is 78.99 %. In this scene, two persons are talking in the ROI and one person takes away his luggage. The blobs that contain multiple people are divided into individuals when occlusion happens; the projection histogram is combined with top-head candidates to achieve robust results. In Fig. 8b, the object loiters from the 586th frame to the 1433rd frame; the recall and precision are 95.99 % and 89.94 %, respectively. In this scene, a woman loiters in the ROI and leaves without her bag. Due to a short $T$, people passing through the ROI are classified as loitering objects, which explains the low precision. Also, a woman is standing in the ROI from the 2651st frame to the end of the sequence; the loitering alarm is raised and produces the low recall. In Fig. 8c, the object loiters from the 457th frame to the 1091st frame; the recall and precision are 64.09 % and 98.31 %, respectively. In this scene, a person stops in the ROI and does not move for a while. Because of the stopped object, the blob is classified as a ghost and absorbed into the background image, which explains the low recall.

Figure 9 shows the detection results on the PETS2007 dataset (S06, S07, S08) based on the proposed method ($T_{th}$ is 132 and $T_{out\_th}$ is 27). In S06 and S08, previously missed objects are detected for 207 and 81 additional frames, respectively; in S07, however, 59 fewer frames are detected. In addition, the $FN$ values in S06 and S08 are reduced by 257 and 131, while the number of false alarms increases by 66 and 15. On the other hand, even though the $FN$ value in S07 increases by 9, the number of false alarms is reduced by 503.

Table 1 shows the detection results for $T_{th}$ values of 75 and 132 and $T_{out\_th}$ values of 10 and 27. The values 75 and 10 were set manually; the values 132 and 27 were obtained by the equations in Sections 5.1.1 and 5.1.2. The precision of the proposed method using the associating threshold is greater than that of the basic method using manually given thresholds. In the case of S06, a person standing and walking on the edge of the area can go outside the area for just a few frames, which produces low recall and precision. Figures 10 and 11 show tracking results for detected loitering objects in the datasets.

We compared our proposed method with other related methods to evaluate its performance. Our method was compared to the method of Dalley 2009 [9, 10] as a baseline for PETS2007 performance evaluation. Dalley 2009 presented $TP$ and $FN$ including the temporal errors, and we measured the temporal errors of that method. It has been widely tested in real-time city surveillance as well as on several public datasets, forming a strong baseline. In Dalley 2009, three kinds of events were reported: loitering events, left-luggage events, and theft or reattended-luggage events. From their results, the Dalley 2009 method did not detect loitering events in S06, S07, and S08. For S00, there were no events that took place (and our system raised no alarms). For S02, we measured the third loitering clip. In S04, the Dalley 2009 method detected one loitering event and two false positive alarms, so from the S04 results we cannot measure recall and precision. In S05, the Dalley 2009 method did not track the couple all the way back to their entrance time, yielding a high temporal error in the detected loitering event. Thus, we chose S01 and S03 for the comparison between the proposed method and the Dalley 2009 method.
Fig. 7 Detection results on the PETS2007 dataset (S01, S03, S04) based on the proposed method: panels (a)–(c) plot the indicator against the frame number for $T_{th} = 132$, $T_{out\_th} = 27$, together with the ground truth
Fig. 8 Detection results on the PETS2007 dataset (S06, S07, S08): panels (a)–(c) plot the indicator against the frame number for $T_{th} = 75$, $T_{out\_th} = 10$, together with the ground truth
Fig. 9 Detection results on the PETS2007 dataset (S06, S07, S08) based on the proposed method: panels (a)–(c) plot the indicator against the frame number for $T_{th} = 132$, $T_{out\_th} = 27$, together with the ground truth
Table 1 Detection results with Tth = 75, 132 and Tout_th = 10, 27 (FBF: frame-by-frame, CLT: consistent labeling tracking)

| Sets | Method | Tth | Tout_th | Frames | Events | Din | Dout | Dloitering | TP | FN | FP | Recall | Precision | F |
|------|--------|-----|---------|--------|--------|------|------|------------|------|------|-----|---------|-----------|---------|
| S01 | FBF | 75 | 10 | 3841 | 2491 | 1629 | 862 | 1522 | 1519 | 1651 | 110 | 47.92 % | 99.80 % | 48.62 % |
| S01 | CLT | 132 | 27 | 3841 | 3465 | 3254 | 211 | 3006 | 3003 | 118 | 133 | 96.22 % | 99.90 % | 96.32 % |
| S03 | FBF | 75 | 10 | 2851 | 1083 | 692 | 393 | 602 | 309 | 1441 | 70 | 17.66 % | 51.33 % | 17.98 % |
| S03 | CLT | 132 | 27 | 2851 | 1630 | 1347 | 283 | 1001 | 1001 | 700 | 321 | 58.85 % | 100 % | 59.52 % |
| S04 | FBF | 75 | 10 | 3360 | 1270 | 1107 | 163 | 987 | 324 | 1277 | 210 | 20.24 % | 32.83 % | 20.45 % |
| S04 | CLT | 132 | 27 | 3360 | 1958 | 1732 | 226 | 1182 | 670 | 881 | 418 | 43.20 % | 56.68 % | 43.49 % |
| S06 | FBF | 75 | 10 | 2625 | 1583 | 1239 | 344 | 1104 | 872 | 590 | 136 | 59.64 % | 78.99 % | 60.05 % |
| S06 | CLT | 132 | 27 | 2625 | 1999 | 1516 | 483 | 1368 | 1079 | 333 | 202 | 76.42 % | 78.87 % | 76.48 % |
| S07 | FBF | 75 | 10 | 2880 | 2075 | 1049 | 1026 | 905 | 814 | 34 | 814 | 95.99 % | 89.94 % | 95.81 % |
| S07 | CLT | 132 | 27 | 2880 | 2201 | 1121 | 1080 | 779 | 755 | 43 | 311 | 94.61 % | 96.92 % | 94.67 % |
| S08 | FBF | 75 | 10 | 2880 | 831 | 551 | 137 | 414 | 407 | 228 | 72 | 64.09 % | 98.31 % | 64.72 % |
| S08 | CLT | 132 | 27 | 2880 | 708 | 624 | 84 | 540 | 488 | 97 | 87 | 83.42 % | 90.37 % | 83.59 % |
Fig. 10 Selected tracking results from the PETS2007 dataset (S01, S03, S04): (a) frame #1040, (b) frame #1015, (c) frame #1173
As shown in Table 2, our proposed approach yields 7.23 % and 63.24 % improvements in precision over Dalley 2009. Table 3 shows the computational time for image resolutions of 240×160 and 480×320; the test videos were encoded at 25 frames per second. For instance, the S01 video has 3,841 frames, so to make real-time application possible, image sequences should be processed at a frame rate above 25 fps, i.e., the computational time should be below 153,640 ms. As the table shows, the computational time on every dataset was less than the running time of that dataset. We have made the test sets available at https://sites.google.com/site/yynams/.
Fig. 11 Selected tracking results from the PETS2007 dataset (S06, S07, S08): (a) frame #703, (b) frame #682, (c) frame #715
Table 2 Comparison of the precision of the proposed method to Dalley 2009 for PETS2007 (S01 and S03)

| Sets | Dalley 2009 | Proposed method |
|------|-------------|-----------------|
| S01 | 93.25 % | 99.99 % |
| S03 | 61.26 % | 100 % |
7 Conclusion and discussion

In this paper, we have presented a robust loitering detection approach for real-time applications in crowded scenes. In order to detect loitering objects, we analyzed spatio-temporal characteristics including temporal and spatial changes. We measured the staying time and escaping time for normal and abnormal situations to reduce false alarms; after measuring them, the distributions of the staying time were obtained from its probability. We used the mean square error and HOG for pedestrian detection. In the experiments, we showed detection results on the PETS2007 dataset using thresholds obtained by the proposed methods. Our approach yields 84 % and 27.8 % improvements in recall and precision over the frame-by-frame stepping method using manually given thresholds. To evaluate its effectiveness, we compared our approach with other approaches; it yields 7.23 % and 63.24 % improvements in precision over Dalley 2009.

Most of the proposed techniques for loitering object detection rely on tracking information. These methods do not work in complex environments such as scenes involving crowds and large amounts of occlusion, and most do not deal with large crowds, which produce severe occlusion and location dependency. Thus, a decision has to be made in light of location dependency as to whether false positives or misses are preferable. Furthermore, proper thresholds for object tracking are necessary to secure an effective view and to manage resources in an autonomous surveillance system. Proper threshold determination is a critical issue, since the threshold decides whether objects are loitering or not, and it is important to choose environmental parameters that achieve a high detection rate. We improved our methods so that they calculate these parameters and thresholds automatically using an incremental learning phase. Due to broken tracks caused by occlusions, algorithms for handling tracking noise must still be developed for a completely unsupervised system in the real world; in particular, for crowded scenes, continuous tracking of individual objects is not possible because of occlusions or failures. In future work, we will extend our spatio-temporal features to 3-dimensional space and improve our method for various complex environments.

Table 3 Computational time (ms) with respect to the different resolutions

| Resolution | S1 | S3 | S4 | S6 | S7 | S8 |
|------------|---------|---------|---------|---------|---------|---------|
| Raw data | 153,640 | 114,040 | 134,400 | 105,000 | 115,200 | 115,200 |
| 240×160 | 55,139 | 38,652 | 51,282 | 56,016 | 33,154 | 39,075 |
| 480×320 | 148,946 | 104,970 | 126,639 | 97,461 | 107,690 | 113,293 |
References

1. Ardö H, Åström K (2007) Multi sensor loitering detection using online Viterbi. In: Tenth IEEE international workshop on performance evaluation of tracking and surveillance
2. Beynon MD, Van Hook DJ, Seibert M, Peacock A, Dudgeon D (2003) Detecting abandoned packages in a multi-camera video surveillance system. In: Proceedings of the IEEE conference on advanced video and signal based surveillance, AVSS '03. IEEE Computer Society, Washington, p 221. http://dl.acm.org/citation.cfm?id=937803.937872
3. Bird N, Atev S, Caramelli N, Martin R, Masoud O, Papanikolopoulos N (2006) Real time, online detection of abandoned objects in public areas. In: Proceedings 2006 IEEE international conference on robotics and automation, ICRA 2006, pp 3775–3780. doi:10.1109/ROBOT.2006.1642279
4. Bird N, Masoud O, Papanikolopoulos N, Isaacs A (2005) Detection of loitering individuals in public transportation areas. IEEE Trans Intell Transp Syst 6(2):167–177. doi:10.1109/TITS.2005.848370
5. Black J, Velastin S, Boghossian B (2005) A real time surveillance system for metropolitan railways. In: IEEE conference on advanced video and signal based surveillance, AVSS 2005, pp 189–194. doi:10.1109/AVSS.2005.1577265
6. Choi YJ, Kim KJ, Nam Y, Cho WD (2008) Retrieval of identical clothing images based on local color histograms. In: Third international conference on convergence and hybrid information technology, ICCIT '08, vol 1, pp 818–823. doi:10.1109/ICCIT.2008.116
7. Intel Corporation (2012) SourceForge.net: Open Computer Vision Library. http://sourceforge.net/projects/opencvlibrary/
8. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE computer society conference on computer vision and pattern recognition, CVPR 2005, vol 1. IEEE, pp 886–893
9. Dalley G, Wang X, Grimson WEL (2007) Event detection using an attention-based tracker. In: PETS workshop. IEEE, pp 71–78
10. Dalley GE (2009) Improved robustness and efficiency for automatic visual site monitoring. Ph.D. thesis, Massachusetts Institute of Technology
11. Duda R, Hart P, Stork D (2001) Pattern classification. Wiley. http://books.google.com/books?id=YoxQAAAAMAAJ
12. Gasserm G, Bird N, Masoud O, Papanikolopoulos N (2004) Human activities monitoring at bus stops. In: Proceedings 2004 IEEE international conference on robotics and automation, ICRA 04, vol 1, pp 90–95. doi:10.1109/ROBOT.2004.1307134
13. Gabriel PF, Verly JG, Piater JH, Genon A (2003) The state of the art in multiple object tracking under occlusion in video sequences. In: Advanced concepts for intelligent vision systems. Citeseer, pp 166–173
14. Huang CH, Wu YT, Shih MY (2009) Unsupervised pedestrian re-identification for loitering detection. In: Wada T, Huang F, Lin S (eds) Advances in image and video technology, third Pacific Rim symposium, PSIVT 2009, Tokyo, Japan, January 13–16, 2009. Lecture notes in computer science, vol 5414. Springer, pp 771–783
15. Li L, Huang W, Gu IYH, Tian Q (2004) Statistical modeling of complex backgrounds for foreground object detection. IEEE Trans Image Process 13(11):1459–1472. doi:10.1109/TIP.2004.836169
16. Morris B, Trivedi M (2008) A survey of vision-based trajectory learning and analysis for surveillance. IEEE Trans Circ Syst Video Technol 18(8):1114–1127
17. Nam Y, Rho S, Park J (2012) Intelligent video surveillance system: 3-tier context-aware surveillance system with metadata. Multimed Tools Appl 57:315–334. doi:10.1007/s11042-010-0677-x
18. Osuna E, Freund R, Girosit F (1997) Training support vector machines: an application to face detection. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, 1997. IEEE, pp 130–136
19. Piccardi M (2004) Background subtraction techniques: a review. IEEE Int Conf Syst Man Cybern 4:3099–3104. doi:10.1109/ICSMC.2004.1400815
20. Piciarelli C, Micheloni C, Foresti G (2008) Trajectory-based anomalous event detection. IEEE Trans Circ Syst Video Technol 18(11):1544–1554. doi:10.1109/TCSVT.2008.2005599
21. Porikli F (2007) Detection of temporarily static regions by processing video at different frame rates. In: IEEE conference on advanced video and signal based surveillance, AVSS 2007, pp 236–241. doi:10.1109/AVSS.2007.4425316
22. Sacchi C, Regazzoni C (2000) A distributed surveillance system for detection of abandoned objects in unmanned railway environments. IEEE Trans Veh Technol 49(5):2013–2026. doi:10.1109/25.892603
23. Siebel NT, Maybank SJ (2002) Fusion of multiple tracking algorithms for robust people tracking. In: Proceedings of the 7th European conference on computer vision, part IV, ECCV '02. Springer-Verlag, London, pp 373–387. http://dl.acm.org/citation.cfm?id=645318.649351
24. Stauffer C, Grimson W (2000) Learning patterns of activity using real-time tracking. IEEE Trans Pattern Anal Mach Intell 22(8):747–757. doi:10.1109/34.868677
25. Tian YL, Lu M, Hampapur A (2005) Robust and efficient foreground analysis for real-time video surveillance. In: IEEE computer society conference on computer vision and pattern recognition, CVPR 2005, vol 1, pp 1182–1187. doi:10.1109/CVPR.2005.304
26. Vapnik V (2000) The nature of statistical learning theory. Springer
27. Welch G, Bishop G (1995) An introduction to the Kalman filter. Technical report, University of North Carolina at Chapel Hill
Yunyoung Nam received the B.S., M.S., and Ph.D. degrees in computer engineering from Ajou University, Korea, in 2001, 2003, and 2007, respectively. He was a Senior Researcher in the Center of Excellence in Ubiquitous Systems (CUS) from 2007 to 2009 and a Research Professor in the CUS at Ajou University from 2010 to 2011. He also spent time as a Visiting Scholar at the Center of Excellence for Wireless & Information Technology (CEWIT), Stony Brook University, New York, USA, and was a Postdoctoral Associate at Stony Brook University from 2011 to 2012. He is currently a Postdoctoral Fellow in the Department of Biomedical Engineering at Worcester Polytechnic Institute, Worcester, Massachusetts, USA. His research interests include multimedia databases, ubiquitous computing, image processing, pattern recognition, context-awareness, conflict resolution, wearable computing, intelligent video surveillance, cloud computing, and biomedical signal processing.