J Sign Process Syst DOI 10.1007/s11265-017-1265-3
Real-Time Embedded Motion Detection via Neural Response Mixture Modeling Mohammad Javad Shafiee1 · Parthipan Siva2 · Paul Fieguth1 · Alexander Wong1
Received: 28 July 2016 / Revised: 16 March 2017 / Accepted: 29 June 2017 © Springer Science+Business Media, LLC 2017
Abstract Deep neural networks (DNNs) have shown significant promise in many fields, including computer vision. Although previous research has demonstrated the ability of DNNs, utilizing these types of networks for real-time applications on embedded systems is not possible without specialized hardware such as GPUs. In this paper, we propose a new approach to real-time motion detection in videos that leverages the power of DNNs while maintaining the low computational complexity needed for real-time performance on existing embedded platforms without specialized hardware. The rich deep features extracted from the neural responses of an efficient, stochastically-formed deep neural network (StochasticNet) are utilized to construct Gaussian mixture models (GMM) to detect motion in a scene. The proposed Neural Response Mixture (NeRM) model was embedded on an Axis surveillance camera, and results demonstrate that the proposed NeRM approach can improve GMM performance (i.e., with fewer false detections and less noise) in modeling the foreground and background compared to other state-of-the-art approaches applied for motion detection on embedded systems, while operating at real-time performance.

Keywords Deep learning · StochasticNet · Embedded device · Motion detection · Convolutional neural network

Mohammad Javad Shafiee [email protected]
Parthipan Siva [email protected]
Paul Fieguth [email protected]
Alexander Wong [email protected]

1 Vision & Image Processing Research Group, University of Waterloo, Waterloo, ON, Canada
2 Aimetis Corporation, Waterloo, ON, Canada
1 Introduction

One of the basic functionalities provided by a motion detection framework is the ability to record video when a moving object is detected in the camera's field of view. This functionality helps reduce the storage required for saving video, by recording only when motion is detected by the camera. Furthermore, motion detection allows the activity in a scene to be quickly summarized and reviewed to identify unusual events or security breaches. The importance of such functionalities has motivated surveillance camera manufacturers (e.g., Axis, Samsung, etc.) to embed motion detection algorithms directly on the camera. However, the computational resources on these cameras are very limited, which forces the embedded motion detection algorithms to be very simple pixel change detection algorithms. Pixel change detection algorithms model the pixel color using an on-line Gaussian mixture model [32] and then identify any pixel values which do not conform to the modeled Gaussians as "in-motion" (i.e., the pixel has changed value due to a moving object in the scene). Gaussian mixture models (GMM) [39] represent a simple, real-time algorithm that can perform motion detection directly on a camera. Although the GMM approach runs in real time using color as the feature (i.e., red, green, blue), utilizing only color in the model has several drawbacks. Mainly, the use of color results in false positives (false motion detections) due to illumination changes in the scene
(e.g., indoor light flickering, shadows, overhanging clouds passing by, and strong sunlight), or subtle motion from waving background objects (e.g., branches and leaves of trees moving in the wind). Different strategies have been proposed to address the illumination change problems associated with popular motion detection algorithms [1, 15, 31]; however, such methods are not effective in dealing with subtle motions. Statistical background subtraction methods [6, 24, 32, 37] have addressed, to some extent, the false positives caused by subtle motions and dynamic backgrounds. However, their performance is highly dependent on an appropriate choice of learning rate to update the background model and account for gradual illumination changes. As a result, these methods are prone to large errors (false alarms) when subjected to sudden illumination changes in the background. More importantly, the computational complexity of such methods restricts their usage on embedded devices. One of the most feasible solutions to this problem is to utilize illumination-invariant features, such as different color spaces or texture features [2], in the GMM model. The use of different features allows for a more robust motion detection algorithm while maintaining low computational complexity. For example, Matsuyama et al. [20] used the correlation between two blocks in the image, based on a normalized vector distance function, for background modeling. Mason and Duric [19] utilized edge and color histograms in a block to model the background. The local binary pattern (LBP) [23] is another well-known texture feature for capturing background statistics. The use of these different color or texture features has shown improvements in motion detection accuracy. However, these features are all hand-crafted based on our understanding of the visual cortex.
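As a concrete illustration of how cheap such hand-crafted texture features are, a basic 8-neighbour LBP (radius 1) needs only a few array shifts per pixel. This numpy sketch is for illustration only; it is not necessarily the exact LBP variant of [23] or the one used in the experiments below.

```python
import numpy as np

def lbp8(img):
    """8-neighbour local binary pattern (radius 1): each pixel receives an
    8-bit code obtained by thresholding its neighbours against the centre."""
    H, W = img.shape
    c = img[1:-1, 1:-1]                       # centres (borders skipped)
    # Neighbours clockwise from top-left; each contributes one bit.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offs):
        n = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        code |= (n >= c).astype(np.uint8) << bit
    return code
```

The resulting codes depend on relative, not absolute, intensities, which is why texture features of this kind tolerate illumination changes better than raw color.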
As a result, these types of features are still somewhat illumination-variant, and their generic nature limits their ability to comprehensively capture the unique traits of objects (e.g., people, vehicles, animals, etc.) in real-world surveillance environments. Recent developments in deep learning and deep neural networks (DNNs) [38] have demonstrated the potential of this machine learning tool in different applications. DNNs are computational models in which several processing layers with different levels of abstraction are combined to represent the input data. Convolutional neural networks (CNNs) [16–18] are the most well-known deep learning framework applied to structured learning applications such as computer vision, image processing and speech recognition. CNNs have been applied to image classification [7, 8, 33], pose estimation [34] and speech recognition [13, 21, 26], and have outperformed conventional machine learning approaches in all of these applications. The main factor in the success of deep learning frameworks is the computational model incorporated to abstract the
input data. This abstraction procedure automatically learns to represent the input data without needing hand-crafted features. Recent work demonstrated that deep features obtained from convolutional neural networks (CNNs) [17] learned on large natural image datasets can be used to obtain significant improvements over hand-crafted texture features such as the histogram of gradients [9]. Girshick et al. [9] applied pre-trained CNNs in a hierarchical framework to compute region proposals based on deep features, reporting a 30% improvement on the PASCAL VOC object detection problem. Gupta et al. [11] extracted deep features via a CNN learned from RGB-D images for object detection and image segmentation. Razavian et al. [25] examined deep features extracted from CNNs as generic descriptors for different recognition tasks and reported consistently superior results compared to highly-tuned, state-of-the-art systems in all visual classification tasks over various datasets. Deep features have thus demonstrated great applicability, achieving significant performance improvements over state-of-the-art methods in several computer vision tasks, and hold great potential for improving motion detection performance. Nevertheless, leveraging them for real-time performance on embedded systems has so far been all but impossible without integrating custom GPUs or specialized processors designed to accelerate DNNs. Not only are the vast majority of surveillance cameras not equipped with GPUs or specialized deep processors, their embedded CPU capabilities are also far inferior to those of most modern computers, further prohibiting the use of existing DNN architectures for real-time embedded motion detection. As such, alternative approaches to leveraging DNNs for improved real-time, embedded motion detection are highly desirable. Efficient deep neural networks are a fairly new field of research in the deep learning community.
Nevertheless, several studies have been conducted in this area to produce deep neural networks that are efficient in both memory and run-time computational complexity. Low-rank matrix approximation [14] is a technique that exploits cross-channel or filter redundancy to construct a low-rank basis and decrease the computational complexity of convolutional neural networks. Network pruning, hashing and compression [3, 12] are other approaches proposed to address the huge memory requirements of deep neural networks. Structural sparsity [36] and statistical and evolutionary approaches [27, 29, 30] to generating efficient deep neural networks are further algorithms proposed in this field. The main contribution of this paper is a novel approach to motion detection that leverages the power of DNNs while maintaining the low computational complexity necessary for real-time embedded performance without the need for
specialized hardware. This paper is an extension of [28], with a more comprehensive explanation and experimental results. The proposed method is compared with state-of-the-art algorithms used on embedded systems for the motion detection application. In the proposed Neural Response Mixture (NeRM) model, rich deep features are extracted from the neural responses of a highly efficient deep neural network called a StochasticNet [29], in which the synaptic connectivity is sparsely formed in a stochastic manner. Such StochasticNets have been shown to achieve the same level of modelling accuracy as general DNNs while containing only a small fraction of the synaptic connectivity, thus greatly reducing computational complexity. These deep features, obtained from StochasticNets pre-trained on large natural image datasets, are then used to construct Gaussian mixture models in an unsupervised manner to model the background based on past frame sequences. Given its low computational complexity compared to existing DNN approaches, NeRM was embedded onto Axis surveillance cameras to demonstrate that strong motion detection accuracy (i.e., fewer false alarms and less noise) can be achieved while operating at real-time performance. It is worth noting that preliminary results for the proposed framework were reported in [28]; this paper provides a more thorough experimental evaluation and discussion of the approach. The paper is organized as follows. The theory and design considerations behind NeRM, along with implementation details on the embedded system, are discussed in Section 2. Experimental results, where we examine the proposed NeRM framework on very difficult video datasets for motion detection, are reported and discussed in Section 3. Finally, conclusions are drawn in Section 4.
2 Methodology

The proposed NeRM framework for motion detection is described in detail as follows. First, the motivation behind the Neural Response Mixture (NeRM) model, and the model itself, are explained in detail. Next, the problem of motion detection via background modelling is described, along with how the NeRM outputs can be utilized to tackle it. Third, the probabilistic framework used to form the StochasticNets in NeRM is explained, and finally the implementation details of NeRM on an embedded system are discussed.

2.1 Neural Response Model

It has been shown that deep convolution kernels can extract informative features from the input image for different applications. Although the features extracted with this approach are powerful, the main issue is the computational complexity
of the extraction procedure. Deep networks are reasonably fast, and real-time, with parallel implementations on GPUs. However, for industrial applications such as embedded video analytics, it is not possible to take advantage of GPUs, and only a low-power processing unit is available on surveillance cameras. The experiments in [29] showed that a sparse receptive field can produce the same performance as a conventional convolutional neural network, where the receptive fields of a deep neural network are formed sparsely to make the computational complexity of the framework as low as possible. Motivated by those results, we propose a new approach that obtains neural responses and utilizes them as rich features for embedded motion detection on surveillance cameras. The neural responses (NeRs) are extracted via a sparse deep neural network whose sparse receptive fields are formed by a stochastic procedure in a one-time, off-line training step. The obtained NeRs are used in an adaptive learning framework to detect moving objects in the scene. The main focus here is to tackle the motion detection problem on embedded devices, which mostly use a Gaussian mixture model (GMM) to learn the background model and detect moving objects; to this end, the framework is explained in the context of mixture models and the proposed algorithm is named the Neural Response Mixture (NeRM) model. However, it is worth noting that the proposed framework can be applied to various other learning tasks. The first step in the NeRM framework is to extract rich deep features with which to build a reliable GMM model of the background that can capture the unique traits of objects (e.g., people, vehicles, animals, trees, etc.) in real-world surveillance environments. Training a DNN for the task of video motion detection requires a large training set covering different scenarios such as lighting changes, weather conditions, camera jitter, etc.,
with full manual annotation indicating the pixels in motion (corresponding to objects such as people, vehicles, etc.). Obtaining such a large manually annotated dataset is highly difficult. However, DNNs can be trained on an extensive image dataset such as ImageNet, with millions of images for object classification over hundreds of object categories. It is worth noting that this dataset (i.e., ImageNet) is not related to the motion detection problem: ImageNet [5] is a large-scale ontology of images built upon the backbone of the WordNet structure, and is mostly utilized for image classification tasks. Such training allows the neural responses of the DNN to provide powerful deep features that can better characterize the unique traits of objects, many of which are present in real-world surveillance environments. Considering that true motion in videos is caused by moving objects of interest, such as those found in the aforementioned image datasets, we are motivated to leverage the neural responses of DNNs trained in
this manner to provide rich features for GMM modelling of the background. Specifically, we take the first synaptic layer of a highly efficient StochasticNet trained on the ImageNet dataset as a primitive, low-level feature representation that can isolate important features required for object classification. Therefore, the neural responses of the first synaptic layer at all pixels in the frame can be used as features to distinguish motion caused by objects moving in the scene. It is worth noting that the formation of the StochasticNets used in the NeRM framework is a one-time, off-line procedure which is not implemented on the embedded system; the final formed StochasticNet is transferred to the embedded system afterwards, as described in Section 2.2.2.

2.2 Motion Detection via Background Modelling

Motion is detected in the scene by evaluating each pixel's likelihood of belonging to the background based on a background model. A common approach to background modelling for motion detection on an embedded system is the Gaussian mixture model (GMM). At time t, pixel \bar{x}_i^t \in X^t (where X^t = \{\bar{x}_1^t, \cdots, \bar{x}_n^t\}) of frame t is classified as background if its probability of being background is larger than 0.5. The Gaussian mixture model is formulated as

P(\bar{x}_i^t = \mathrm{bg}) = \sum_{m=1}^{M} \omega_{i,m}^{t} \cdot \mathcal{N}_m\!\left(\bar{x}_i^t;\, \bar{\mu}_{i,m}^{t},\, \bar{\sigma}_{i,m}^{t\,2}\right)    (1)

where the GMM contains M normal distributions with mean \bar{\mu}_{i,m}^{t} and standard deviation \bar{\sigma}_{i,m}^{t} at time t, and \omega_{i,m}^{t} encodes the weight of normal distribution m in the GMM of pixel i at time t. The probability of being background ("bg") is evaluated via (1). At each frame, the normal distributions \mathcal{N}_m(\bar{x}_i^t; \bar{\mu}_{i,m}^{t}, \bar{\sigma}_{i,m}^{t\,2}) are updated based on which mixture component the pixel was assigned to:

\omega_{i,m}^{t+1} = \omega_{i,m}^{t} + \alpha \cdot \left(1 - \omega_{i,m}^{t}\right)    (2)

\bar{\mu}_{i,m}^{t+1} = \bar{\mu}_{i,m}^{t} + \frac{\alpha}{\omega_{i,m}^{t}} \cdot d_{i,m}^{t}    (3)

\bar{\sigma}_{i,m}^{t+1\,2} = \bar{\sigma}_{i,m}^{t\,2} + \frac{\alpha}{\omega_{i,m}^{t}} \cdot \left(d_{i,m}^{t} - \bar{\sigma}_{i,m}^{t\,2}\right)    (4)

d_{i,m}^{t} = \frac{\left(\bar{x}_i^t - \bar{\mu}_{i,m}^{t}\right)^2}{\bar{\sigma}_{i,m}^{t\,2}}    (5)

where d_{i,m}^{t} represents the distance between the new sample (pixel) \bar{x}_i^t and the m-th normal distribution, and \alpha encodes the learning rate. It is worth noting that \bar{x}_i^t can be the pixel intensity or a set of features extracted from pixel i at frame t. \bar{x}_i^t is commonly modeled with the RGB (red, green and blue) pixel intensities in embedded systems because
there are no additional feature computation costs. However, while computationally efficient, pixel intensity is highly sensitive to illumination changes, subtle motion in the background, and camera sensor noise. Texture features have been explored to mitigate some of these issues, but their generic, hand-crafted nature limits their ability to comprehensively capture the unique traits of objects (e.g., people, vehicles, animals, etc.) in real-world surveillance environments, thus limiting their robustness. Motivated to leverage the recent advances in DNNs while maintaining the low computational complexity needed for real-time embedded performance without specialized hardware, we propose a novel method for motion detection via background modeling based on rich deep features obtained from the neural responses of a highly efficient, stochastically-formed deep neural network known as a StochasticNet [29]. Such deep features provide a highly discriminative and powerful mechanism for modeling the background while maintaining low computational complexity, which is the key to real-time embedded performance. An overview of the proposed Neural Response Mixture (NeRM) framework for motion detection is shown in Fig. 1. First, the neural responses of a StochasticNet pre-trained on a large natural image dataset (e.g., ImageNet [5]) are extracted from the input video frame. A Gaussian mixture model (GMM) is then constructed based on the extracted neural responses to act as the background model of the scene. Finally, motion is detected by evaluating each pixel's likelihood of belonging to the background under this constructed neural response GMM model. One of the main contributions of the proposed approach is the network formation procedure, which is carried out within a probabilistic framework to find the optimal neuron connectivity. In the following, the network formation is explained in the context of StochasticNet formation [29].
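The per-pixel mixture bookkeeping that NeRM feeds its features into is compact enough to sketch directly. The class below follows the update style of Eqs. (2)-(5) (Stauffer-Grimson-style) for a single scalar pixel; the match threshold, initial variance, mode-replacement rule and the weight-sum surrogate for the P > 0.5 test of Eq. (1) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

class PixelGMM:
    """Per-pixel background mixture in the spirit of Eqs. (1)-(5).

    A single-pixel, scalar-feature sketch; the paper applies this per pixel
    to neural-response features, and uses only two modes on the camera.
    """

    def __init__(self, n_modes=2, alpha=0.05, init_var=30.0):
        self.alpha = alpha
        self.init_var = init_var
        self.mu = np.zeros(n_modes)
        self.var = np.full(n_modes, init_var)
        self.w = np.full(n_modes, 1.0 / n_modes)

    def update(self, x, match_thresh=9.0):
        # Variance-normalised squared distance to each mode, as in Eq. (5).
        d = (x - self.mu) ** 2 / self.var
        m = int(np.argmin(d))
        if d[m] < match_thresh:                  # x explained by mode m
            self.w[m] += self.alpha * (1.0 - self.w[m])           # Eq. (2)
            rho = self.alpha / self.w[m]
            self.mu[m] += rho * (x - self.mu[m])                  # mean pull
            self.var[m] += rho * ((x - self.mu[m]) ** 2 - self.var[m])
        else:                                    # no mode matches:
            k = int(np.argmin(self.w))           # recycle the weakest mode
            self.mu[k], self.var[k], self.w[k] = x, self.init_var, 0.05
        self.w /= self.w.sum()                   # keep weights normalised

    def is_background(self, x, match_thresh=9.0):
        # Practical surrogate for the P > 0.5 test of Eq. (1): total
        # weight of the modes that explain x.
        d = (x - self.mu) ** 2 / self.var
        return float(self.w[d < match_thresh].sum()) > 0.5
```

On the camera the same bookkeeping runs in fixed-point arithmetic over neural-response features rather than raw intensities (Section 2.2.2).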
2.2.1 StochasticNet Formation

The efficient and rich deep features in the NeRM framework are obtained from a sparsely formed deep network. The StochasticNet formation procedure starts from a network pre-trained on ImageNet. In this work, we leverage knowledge gained from traditional deep convolutional neural networks designed for image classification to form highly efficient StochasticNets that are adapted for background modelling and optimized for minimal synaptic connectivity for real-time embedded performance. The stochastic formation of the sparse network utilizes the prior information provided by a deep convolutional neural network pre-trained on the ImageNet dataset for the task of image classification. A StochasticNet is then formed by stochastically selecting a very small set of
Figure 1 Motion detection based on the NeRM framework. The neural responses from a highly efficient StochasticNet are used as rich deep features. These deep features are then used to construct the Gaussian mixture model that models the background. Finally, motion is detected in the scene by evaluating each pixel's likelihood of belonging to the background under this constructed neural response GMM model. The flow diagram is taken from [28].
synaptic connections from this pre-trained deep convolutional neural network. Selection is based on an energy function guided by a smaller annotated dataset with ground-truth motion detection labels, considering the weights of the connections within the pre-trained network and adapting them as synaptic connections in a StochasticNet geared to the task of background modeling. The goal is to obtain the most representative deep features for modeling the background in a video scene with a Gaussian mixture model, while limiting the number of synaptic connections in the formed StochasticNet to enable real-time embedded performance. Here, we define an energy function that rewards correct detections while penalizing the number of synaptic connections in the formed StochasticNet:
E_l = \frac{1}{S} \sum_{i=1}^{T} \sum_{j=1}^{N} \delta\!\left(\hat{b}_{ij}^{l} = b_{ij}\right)    (6)

where S encodes the number of synaptic connections in the StochasticNet, b_{ij} is the ground-truth label for pixel j at frame i, and \hat{b}_{ij}^{l} encodes the estimated label (i.e., the in-motion or not-in-motion classification from the GMM) for pixel j at frame i via the extracted neural response features at iteration l. T represents the total number of frames in the training video, N is the number of pixels in each frame, and \delta(\cdot) is the Dirac (indicator) function. The energy function takes two objectives into account when forming the StochasticNet and generating the NeRM features: the first is to form a StochasticNet that is as sparse as possible, while the second is to select the synaptic connectivities that yield the best representation for the NeRM features in the model.

The off-line, stochastic formation of the StochasticNet used for neural response extraction is an iterative procedure: in each iteration, new synaptic connectivities, stochastically selected from the deep convolutional neural network, are added using a stochastic acceptance-rejection criterion based on the energy gradient (\Delta E) between consecutive iterations (see Algorithm 1). The approach gives a candidate synaptic connection a greater chance of being formed if the connection does not decrease the energy function E(\cdot), where T is a controlling parameter that adjusts the acceptance probability. To improve the sampling, an additional criterion is added to the sampling step so that stronger (higher-weight) connections have a greater chance of entering the sparse network. The justification for this measure is that neurons with higher weights usually have more impact on the output and should therefore have a greater chance of being selected for the sparsely formed network. Algorithm 1 summarizes the stochastic formation procedure used to form StochasticNets for NeRM. As shown in line 8 of Algorithm 1, each neuron has a specific probability of being selected for the sparsely formed network, determined by its weight in the pre-trained network; neurons with lower weights are less likely to enter the StochasticNet. To support the sampling process, the weights in the pre-trained network are normalized by the maximum weight value, and \sigma is a controlling factor that determines how strongly the weight values affect the sampling procedure. The StochasticNet is formed from a pre-trained network; however, this network was trained not for motion detection but for image classification. In other words, the utilized network is an off-the-shelf network and no additional training was applied. The formation of the StochasticNet is an off-line procedure performed only once, and the resulting StochasticNet model is used for all types of input data reported in this paper; it is therefore possible to consider the proposed approach unsupervised.
For the implementation of NeRM used in our experiments, a deep convolutional neural network based on the AlexNet [17] architecture, trained on ImageNet, is utilized to form a StochasticNet with just a small portion of the synaptic connectivity of the AlexNet architecture. A set of 200 frames from the "Baseline" category of the Change Detection dataset [10] is used as the small annotated dataset for the formation procedure. The StochasticNet formation in this study is implemented via the MatConvNet framework [35].
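The acceptance-rejection loop described above can be sketched as follows. The paper's Algorithm 1 scores candidate connections with Eq. (6), i.e., by running the NeRM-GMM pipeline against ground-truth motion labels; here the energy is abstracted into a caller-supplied `energy_fn`, and the exact sampling form (the exponential weighting in `sigma`, the Metropolis-style acceptance rule and all parameter values) is an illustrative assumption rather than the paper's exact algorithm.

```python
import numpy as np

def form_stochasticnet(weights, energy_fn, budget, T=0.1, sigma=0.5,
                       iters=500, seed=0):
    """Stochastically select a sparse subset of pre-trained synapses.

    weights   : flattened |W| of a pre-trained layer.
    energy_fn : callable(mask) -> scalar energy (Eq. (6) in the paper;
                any surrogate works for this sketch).
    budget    : target number of retained synaptic connections.
    """
    rng = np.random.default_rng(seed)
    w = np.abs(weights) / np.abs(weights).max()   # normalise by the max weight
    p = np.exp(w / sigma)                         # heavier synapses favoured
    p /= p.sum()
    mask = np.zeros(weights.size, dtype=bool)
    E = energy_fn(mask)
    for _ in range(iters):
        if mask.sum() >= budget:                  # sparsity budget reached
            break
        j = int(rng.choice(weights.size, p=p))    # weight-biased candidate
        if mask[j]:
            continue
        mask[j] = True
        dE = energy_fn(mask) - E
        # Accept if the energy does not decrease; otherwise accept with
        # probability exp(dE / T), with T controlling the acceptance rate.
        if dE >= 0 or rng.random() < np.exp(dE / T):
            E += dE
        else:
            mask[j] = False                       # reject the candidate
    return mask
```

Because the energy is evaluated only on a small annotated set and the loop runs once off-line, the cost of this search never touches the camera.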
2.2.2 Embedded System Implementation Details
There are many camera manufacturers with open platforms for developing embedded applications; however, Axis cameras and their third-party development platform are among the most popular and mature platforms currently available in the surveillance industry. The latest Axis cameras are available with the ARTPEC-5 chipset (MIPS 1004Kc V2.12 CPU) running a stripped-down version of Linux. While our implementation will work on any Axis camera that supports embedded development and has the ARTPEC-5 chipset, we test our algorithms on the Axis Q7436 Encoder.1 Our approach is implemented in C++ and compiled with the Axis development SDK. The algorithm has two main parts: the StochasticNet (3 layers: convolutional, ReLU and pooling) and the Gaussian mixture model (GMM) built on the deep features. Most deep neural networks employ floating-point computations in the convolutional layer. While most desktop computers have a floating-point unit (FPU) to handle such operations, the majority of surveillance cameras do not have a dedicated FPU, so floating-point computations are significantly slower. To overcome this issue, we form StochasticNets on servers using floating-point computations and port the networks to the camera using a 32-bit fixed-point representation with 16 bits dedicated to the fractional component. To reduce the computational complexity of the GMM modeling, we employ only two modes and the same fixed-point representation as for the convolutional layer. It is worth noting that increasing the number of modes in the GMM sharply increases the computational complexity, making real-time operation on embedded systems infeasible.

1 http://www.axis.com/ca/en/products/axis-q7436.
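The 32-bit fixed-point format described above (16 integer bits, 16 fractional bits, often written Q16.16) boils down to scaling by 2^16 and re-shifting after multiplies. A small sketch, with Python integers standing in for the camera's int32 and overflow handling omitted:

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS        # Q16.16: 16 integer bits, 16 fractional bits

def to_fixed(x: float) -> int:
    """Quantise a float to Q16.16 (an int32 on the camera)."""
    return int(round(x * SCALE))

def to_float(q: int) -> float:
    return q / SCALE

def fx_mul(a: int, b: int) -> int:
    # The raw product carries 32 fractional bits; shift 16 of them away.
    return (a * b) >> FRAC_BITS

def fx_div(a: int, b: int) -> int:
    # Pre-shift the numerator so the quotient keeps 16 fractional bits
    # (positive operands assumed in this sketch).
    return (a << FRAC_BITS) // b
```

With 16 fractional bits the quantisation step is 2^-16 ≈ 1.5e-5, which sets the precision available to the convolution accumulators and GMM statistics; every operation reduces to integer multiply/shift, avoiding the missing FPU entirely.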
3 Result & Discussion

The proposed NeRM framework was evaluated on the Change Detection.Net (CD.Net) dataset [10] and compared with three different hand-crafted features: RGB, the contrast histogram (CHist) [4] and LBP [22], all utilized within the same Gaussian mixture model (GMM) framework. It is worth noting that all methods except LBP were evaluated using the same codebase that runs on the embedded system, but were executed on a personal computer when processing the change detection dataset. Since the implementation is the same as on the embedded system, the results can be considered as the results of
the embedded system with the same accuracy. The LBP framework was implemented in Matlab.
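All comparisons in this section are scored with seven counts-based measures (Sp, Re, Pr, F-Measure, PWC, FPR and FNR, defined in Section 3.2); they reduce to ratios of the per-sequence TP, TN, FP and FN totals. A minimal helper, computed exactly as the text defines them (note that the text's FPR and FNR both use TP + FN as the denominator):

```python
def motion_metrics(tp, tn, fp, fn):
    """The seven CD.Net-style measures, following Eqs. (7)-(13) of the text."""
    pr = tp / (tp + fp)                      # precision, Eq. (9)
    re = tp / (tp + fn)                      # recall, Eq. (8)
    return {
        "Sp": tn / (tn + fp),                # specificity, Eq. (7)
        "Re": re,
        "Pr": pr,
        "F": 2 * pr * re / (pr + re),        # region F-measure, Eq. (10)
        "PWC": 100.0 * (fn + fp) / (tp + fn + fp + tn),   # Eq. (11)
        "FPR": fp / (tp + fn),               # as printed in Eq. (12)
        "FNR": fn / (tp + fn),               # Eq. (13)
    }
```

For example, motion_metrics(80, 900, 10, 20) gives Re = 0.8, Pr ≈ 0.889 and PWC ≈ 2.97.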
3.1 Dataset

CD.Net [10] is one of the largest motion detection datasets, covering a variety of challenging scenarios such as bad weather conditions, night vision, dynamic backgrounds, shadows and thermal cameras. The dataset contains more than 90,000 manually labeled ground-truth frames. It provides several diverse, realistic sets of videos covering a wide range of detection challenges, including indoor and outdoor scenarios. The following categories were selected to evaluate the methods: dynamic background, camera jitter, intermittent object motion, shadows, thermal, challenging weather, low frame-rate, acquisition at night, and air turbulence. We excluded the PTZ category, since it is not usually considered an objective of general motion detection problems, and the baseline category, since we use it for our training procedure.

3.2 Evaluation Measures

Competing algorithms are evaluated via several quantitative measures considered standard for the motion detection problem:

– Specificity (Sp), the number of pixels the algorithm correctly selected as background over the total number of true background pixels,

  Sp = TN / (TN + FP)    (7)

– Recall (Re), the number of pixels correctly selected as moving over the total number of true moving pixels,

  Re = TP / (TP + FN)    (8)

– Precision (Pr), the number of pixels correctly selected as moving over the total number of pixels selected as moving,

  Pr = TP / (TP + FP)    (9)

– F-Measure, the conventional region F-measure evaluated on the region-of-interest specified by the ground-truth images,

  F = (2 × Pr × Re) / (Pr + Re)    (10)

– Percentage of Wrong Classifications (PWC), the percentage of pixels, both moving and background, that were classified wrongly over the whole image,

  PWC = 100 × (FN + FP) / (TP + FN + FP + TN)    (11)

– False Positive Rate (FPR), the ratio of pixels wrongly selected as moving over the total number of true moving pixels,

  FPR = FP / (TP + FN)    (12)

– False Negative Rate (FNR), the ratio of pixels incorrectly selected as background over the total number of true moving pixels,

  FNR = FN / (TP + FN)    (13)

where TP represents the number of pixels correctly classified as moving, TN is the number of correctly classified background pixels, FP encodes the number of pixels wrongly classified as moving, and FN is the number of pixels incorrectly classified as background. Together, these measures comprehensively illustrate the behavior of the evaluated methods across different situations and complex scenarios.

3.3 Competing Methods

The main objective here is to take advantage of neural responses as deep features, which are not hand-crafted and not biased by human assumptions, to address the motion detection problem via a simple framework such as a GMM. To evaluate the performance of the proposed NeRM framework, it is compared with three different hand-crafted features that are common choices for motion detection and modeling. The same GMM approach was utilized for all competing methods to show the superiority of the neural responses over hand-crafted features:

– RGB, the three channels of red, green and blue are used as the feature set in the GMM framework.
– Contrast Histogram (CHist-RGB), the contrast histogram [4] features extracted from the image, combined with the RGB features via equal weighting, are fed into the GMM framework to model the moving targets.
– Local Binary Pattern (LBP-RGB), the conventional LBP [22] features, combined with the RGB color features (equal weighting), are used to model the moving targets.

3.4 Quantitative Analysis
Tables 1 and 2 show the quantitative comparison of the proposed NeRM framework against the other competing approaches. As the first experiment, we analyze the proposed
Table 1 Quantitative comparison on the CD.Net dataset. The proposed NeRM framework is tested in two different scenarios: (I) neural responses as features fed into the GMM (NeRM), and (II) neural responses combined with the RGB channels as features fed into the GMM (NeRM-RGB). The proposed method is compared with RGB features and RGB features with a morphological post-processing (RGB-Morph).

Category                  Method     Sp    Re    Pr    F-Measure  FPR    PWC     FNR
badWeather                RGB        0.99  0.73  0.77  0.77       0.002  0.675   0.265
                          RGB-Morph  0.99  0.76  0.79  0.79       0.002  0.615   0.232
                          NeRM       0.99  0.58  0.65  0.65       0.005  1.082   0.410
                          NeRM-RGB   0.99  0.74  0.84  0.78       0.002  0.685   0.256
cameraJitter              RGB        0.91  0.70  0.41  0.41       0.082  9.038   0.298
                          RGB-Morph  0.87  0.81  0.38  0.38       0.124  12.735  0.188
                          NeRM       0.97  0.45  0.46  0.46       0.020  4.124   0.547
                          NeRM-RGB   0.97  0.53  0.51  0.51       0.023  4.103   0.465
dynamicBackground         RGB        0.96  0.73  0.32  0.32       0.039  4.144   0.261
                          RGB-Morph  0.92  0.89  0.25  0.25       0.078  7.840   0.102
                          NeRM       0.99  0.40  0.40  0.40       0.003  1.195   0.593
                          NeRM-RGB   0.98  0.69  0.52  0.52       0.015  1.887   0.302
intermittentObjectMotion  RGB        0.98  0.45  0.47  0.47       0.014  5.321   0.546
                          RGB-Morph  0.98  0.52  0.54  0.54       0.013  4.784   0.476
                          NeRM       0.98  0.39  0.42  0.42       0.015  5.795   0.608
                          NeRM-RGB   0.98  0.42  0.54  0.43       0.016  5.695   0.575
lowFramerate              RGB        0.95  0.67  0.41  0.41       0.040  4.787   0.324
                          RGB-Morph  0.95  0.75  0.42  0.42       0.048  5.254   0.245
                          NeRM       0.99  0.62  0.53  0.53       0.003  1.447   0.377
                          NeRM-RGB   0.99  0.61  0.56  0.50       0.008  2.022   0.388
nightVision               RGB        0.97  0.53  0.38  0.38       0.025  3.473   0.465
                          RGB-Morph  0.97  0.56  0.39  0.39       0.025  3.454   0.434
                          NeRM       0.98  0.54  0.43  0.43       0.017  2.711   0.453
                          NeRM-RGB   0.97  0.57  0.33  0.39       0.027  3.601   0.425
shadow                    RGB        0.98  0.74  0.74  0.72       0.013  2.460   0.256
                          RGB-Morph  0.98  0.82  0.74  0.74       0.015  2.386   0.172
                          NeRM       0.96  0.72  0.65  0.65       0.030  4.149   0.274
                          NeRM-RGB   0.98  0.75  0.72  0.73       0.013  2.508   0.249
thermal                   RGB        0.99  0.56  0.67  0.62       0.007  2.452   0.435
                          RGB-Morph  0.98  0.66  0.67  0.67       0.010  4.192   0.338
                          NeRM       0.97  0.58  0.55  0.55       0.022  5.964   0.419
                          NeRM-RGB   0.95  0.69  0.59  0.61       0.041  6.044   0.307
turbulence                RGB        0.98  0.80  0.40  0.42       0.011  1.240   0.197
                          RGB-Morph  0.98  0.85  0.40  0.40       0.016  1.708   0.143
                          NeRM       0.99  0.61  0.70  0.70       0.000  0.234   0.380
                          NeRM-RGB   0.99  0.71  0.67  0.63       0.001  0.338   0.284
Overall                   RGB        0.97  0.66  0.50  0.50       0.026  3.954   0.339
                          RGB-Morph  0.96  0.74  0.49  0.51       0.037  4.774   0.259
                          NeRM       0.98  0.54  0.61  0.53       0.013  2.967   0.451
                          NeRM-RGB   0.98  0.63  0.59  0.57       0.016  2.987   0.361
Table 2 Comparison of hand-crafted features with the proposed framework on the CD.Net dataset. RGB channels are utilized as extra features fed into the GMM in all competing methods.

Category                  Method     Sp    Re    Pr    F-Measure  FPR     PWC     FNR
badWeather                RGB        0.99  0.73  0.77  0.77       0.0025  0.675   0.265
                          CHist-RGB  0.95  0.86  0.47  0.59       0.044   4.473   0.130
                          LBP-RGB    0.99  0.24  0.35  0.35       0.009   2.185   0.751
                          NeRM-RGB   0.99  0.74  0.84  0.78       0.002   0.685   0.256
cameraJitter              RGB        0.91  0.70  0.41  0.41       0.082   9.038   0.298
                          CHist-RGB  0.55  0.97  0.09  0.16       0.447   43.08   0.024
                          LBP-RGB    0.95  0.41  0.38  0.38       0.040   6.213   0.581
                          NeRM-RGB   0.97  0.53  0.51  0.51       0.023   4.103   0.465
dynamicBackground         RGB        0.96  0.73  0.32  0.32       0.039   4.144   0.261
                          CHist-RGB  0.56  0.98  0.02  0.05       0.434   42.848  0.014
                          LBP-RGB    0.99  0.63  0.59  0.59       0.007   1.016   0.361
                          NeRM-RGB   0.98  0.69  0.52  0.52       0.015   1.887   0.302
intermittentObjectMotion  RGB        0.98  0.45  0.47  0.47       0.014   5.321   0.546
                          CHist-RGB  0.89  0.72  0.31  0.40       0.107   12.217  0.273
                          LBP-RGB    0.85  0.64  0.39  0.39       0.147   15.621  0.354
                          NeRM-RGB   0.98  0.42  0.54  0.43       0.016   5.695   0.575
lowFramerate              RGB        0.95  0.67  0.41  0.41       0.040   4.787   0.324
                          CHist-RGB  0.72  0.94  0.15  0.24       0.278   27.331  0.056
                          LBP-RGB    0.91  0.25  0.13  0.13       0.082   1.072   0.746
                          NeRM-RGB   0.99  0.61  0.56  0.50       0.008   2.022   0.388
nightVision               RGB        0.97  0.53  0.38  0.38       0.025   3.473   0.465
                          CHist-RGB  0.84  0.88  0.18  0.27       0.158   15.852  0.112
                          LBP-RGB    0.93  0.39  0.13  0.13       0.069   7.981   0.604
                          NeRM-RGB   0.97  0.57  0.33  0.39       0.027   3.601   0.425
shadow                    RGB        0.98  0.74  0.74  0.72       0.013   2.460   0.256
                          CHist-RGB  0.92  0.92  0.39  0.51       0.071   7.347   0.078
                          LBP-RGB    0.96  0.77  0.61  0.61       0.032   4.047   0.229
                          NeRM-RGB   0.98  0.75  0.72  0.73       0.013   2.508   0.249
thermal                   RGB        0.99  0.56  0.67  0.62       0.007   2.452   0.435
                          CHist-RGB  0.98  0.72  0.67  0.67       0.016   4.145   0.272
                          LBP-RGB    0.95  0.62  0.61  0.61       0.042   5.573   0.374
                          NeRM-RGB   0.95  0.69  0.59  0.61       0.041   6.044   0.307
turbulence                RGB        0.98  0.80  0.40  0.42       0.011   1.240   0.197
                          CHist-RGB  0.83  0.97  0.03  0.06       0.164   16.355  0.020
                          LBP-RGB    0.98  0.48  0.57  0.57       0.010   1.254   0.516
                          NeRM-RGB   0.99  0.71  0.67  0.63       0.001   0.338   0.284
Overall                   RGB        0.97  0.66  0.50  0.50       0.026   3.954   0.339
                          CHist-RGB  0.80  0.89  0.26  0.33       0.191   19.295  0.109
                          LBP-RGB    0.95  0.49  0.42  0.39       0.049   5.996   0.502
                          NeRM-RGB   0.98  0.63  0.59  0.57       0.016   2.987   0.361
method is compared with RGB channels as features fed into the GMM in two different scenarios: I) RGB, utilizing the RGB features and reporting the motion detection results without any post-processing; II) RGB-Morph, applying post-processing morphological operations to the motion detection results, where two erosion and two dilation operations with a window size of 3 × 3 are applied to the output of the GMM before evaluating the results. The NeRM framework is likewise evaluated in two scenarios: I) NeRM, feeding the neural response features into the GMM and evaluating the motion detection results; II) NeRM-RGB, combining the neural response features with the RGB channels and utilizing them in the GMM to detect moving targets. In both scenarios the NeRM framework utilizes a sparse deep network with 500 neural connectivities.

As evident in Table 1, applying morphological operations to the motion detection results (i.e., RGB-Morph) improves the F-measure by 1%, demonstrating the effectiveness of morphological operations. The morphological operations help to remove false negatives by filling the holes in moving objects, as observed in the increase of the Recall value. It can also be observed that the NeRM approach leads RGB-Morph by 2% in F-measure, demonstrating that the neural response features produce more robust outputs: moving targets are detected with fewer false positives, as observed in the increased precision and the decreased FPR. Combining the neural response features with the RGB channels as extra features improves the performance of the motion detection framework by 7% in F-measure, and the precision increases by 9% compared to RGB, which shows that the proposed framework has far fewer false positives. Table 1 thus demonstrates that combining the neural response features with the RGB channel information leads to better performance than RGB alone.
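The RGB-Morph post-processing step described above (two erosions followed by two dilations with a 3 × 3 window) can be sketched with plain NumPy; the helper names are illustrative, and a real deployment would more likely use a library routine such as OpenCV's morphology operators:

```python
import numpy as np

def erode3x3(mask):
    """Binary 3x3 erosion: a pixel stays on only if its whole 3x3 neighbourhood is on."""
    p = np.pad(mask, 1, mode="constant", constant_values=0).astype(bool)
    out = np.ones_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy : p.shape[0] - 1 + dy, 1 + dx : p.shape[1] - 1 + dx]
    return out

def dilate3x3(mask):
    """Binary 3x3 dilation: a pixel turns on if any pixel in its 3x3 window is on."""
    p = np.pad(mask, 1, mode="constant", constant_values=0).astype(bool)
    out = np.zeros_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy : p.shape[0] - 1 + dy, 1 + dx : p.shape[1] - 1 + dx]
    return out

def rgb_morph(motion_mask):
    """Post-process a GMM motion mask: two erosions followed by two dilations."""
    m = motion_mask.astype(bool)
    for _ in range(2):
        m = erode3x3(m)
    for _ in range(2):
        m = dilate3x3(m)
    return m
```

This erosion-then-dilation ordering (a morphological opening) removes isolated false-positive pixels while restoring the extent of large connected motion regions.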
Therefore, further experiments were conducted in which the RGB channels are used as extra features in all methods (i.e., NeRM-RGB, CHist-RGB and LBP-RGB). It is worth noting that the contrast histogram (CHist) and LBP features were computed over non-overlapping windows because of their computational complexity. Table 2 compares the performance of the proposed NeRM framework with the other hand-crafted feature approaches when the RGB channels are used as extra features. It can be observed that the proposed framework outperforms the other methods by at least 7% in terms of F-measure. NeRM also produces results with fewer false positives, as evident from the FPR, which decreases by at least 1%. The proposed approach also leads the other competing methods in the percentage of wrong classifications (PWC), which reflects the overall classification accuracy of the methods.
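The measures reported in Tables 1 and 2 can be computed directly from the per-pixel confusion counts; a minimal sketch (the function name and the example counts are illustrative):

```python
def motion_metrics(tp, fp, tn, fn):
    """CD.Net-style evaluation measures from per-pixel confusion counts."""
    pr = tp / (tp + fp)                            # Precision
    re = tp / (tp + fn)                            # Recall
    f_measure = 2 * pr * re / (pr + re)            # F-Measure
    sp = tn / (tn + fp)                            # Specificity
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)  # Percentage of Wrong Classifications
    fpr = fp / (fp + tn)                           # False Positive Rate
    fnr = fn / (tp + fn)                           # False Negative Rate
    return {"Pr": pr, "Re": re, "F": f_measure, "Sp": sp,
            "PWC": pwc, "FPR": fpr, "FNR": fnr}

# Example: a frame with 80 true-positive, 20 false-positive,
# 880 true-negative and 20 false-negative pixels.
m = motion_metrics(tp=80, fp=20, tn=880, fn=20)
```

Note that FNR = 1 − Re and FPR = 1 − Sp, which is a handy consistency check when reading the tables.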
3.5 Qualitative Analysis

Figures 2 and 3 show the qualitative comparison of the competing methods on different categories. As seen in Fig. 2, NeRM detects the moving targets with less noise in complex situations such as the "CameraJitter" example. In this example the camera shudders slightly while capturing, which fools motion detection methods into assigning some background regions as moving targets. It can be observed that the proposed NeRM algorithm is more robust than the other competing methods at correctly classifying those regions as background (non-motion). "Bad Weather" (fourth row) is a good example of a noisy situation, where very tiny objects (snow) float through the scene; these should be treated as background but are usually detected as moving targets. This example demonstrates the effectiveness of NeRM compared to the other methods in such situations. This robustness can be justified by the fact that the NeRM framework extracts rich features from a small region rather than a single pixel (unlike RGB) and uses an optimized, trained feature extractor (unlike CHist and LBP), so it classifies pixels correctly in noisy environments. Figure 3 shows further visual evaluations of the proposed method and the competing algorithms. As seen, the proposed framework captures moving targets more accurately when the background scene is also moving, as evidenced by "Dynamic Background" (second row). "Low Frame Rate" demonstrates a situation in which the moving targets appear to move quickly because the video is captured at 1 FPS. As seen, NeRM detects moving targets with fewer false alarms than the other methods.

3.6 Discussion

The proposed method was evaluated quantitatively and qualitatively in the two previous sections.
Figure 2 Qualitative comparison of the competing methods on videos captured in bad weather conditions, with camera jitter, and with thermal cameras (columns: Image, Ground Truth, RGB, LBP-RGB, CHist-RGB, NeRM-RGB). Results show that the proposed method outperforms the other approaches under these conditions. The blockiness artifact in the CHist and NeRM results is due to the downsampling inherent in their feature extraction step.

Figure 3 Qualitative results for complex situations (rows: NightVideos, LowFrameRate, DynamicBackground, Shadow; columns: Image, Ground Truth, RGB, LBP-RGB, CHist-RGB, NeRM-RGB). The competing methods are compared on video categories which are considered difficult conditions for motion detection; the comparison demonstrates that the proposed NeRM approach performs better than the other algorithms.

The proposed method outperforms the other competing approaches (i.e., those based on hand-crafted features) because deep neural networks, through their comprehensive training procedure, can abstract input images more effectively than hand-crafted features. We take advantage of this type of feature and address the motion detection problem by proposing a real-time framework. The proposed framework forms a sparse neural network via StochasticNet such that only a limited set of synaptic connectivities is incorporated to produce the neural responses. In this way, the computational complexity of deep neural networks is reduced to the point that they can be deployed on an embedded system with real-time efficiency.

A StochasticNet comprises significantly fewer synaptic connectivities than a network with full connectivity. To examine the impact of the number of synaptic connectivities on the running time, two experiments were conducted: one on a regular CPU and one on an embedded system. Figure 4a shows the running time versus the number of synaptic connectivities on a CPU, and Fig. 4b shows the running time versus the number of synaptic connectivities when the algorithm is run on an embedded device. The tested CPU is an Intel Core i7-2770QM at 2.20 GHz. It can be observed that a network with 500 synaptic connectivities can be used within our motion detection framework at ∼200 FPS on the Intel CPU. However, the frame rate on the Intel CPU drops to 3 FPS when a full network (i.e., the pre-trained AlexNet) with all synaptic connectivities (i.e., 34,848 synaptic connectivities) is utilized. Figure 4b shows the running
Figure 4 Running time, for a video resolution of 352 × 240, on a laptop CPU and an embedded device: (a) running time vs. the number of synaptic connectivities on an Intel Core i7-2770QM, 2.20 GHz CPU; (b) running time vs. the number of synaptic connectivities on the Axis Q7436 Video Encoder.
time experiments of the NeRM framework on an embedded device. As mentioned before, an Axis Q7436 encoder with the ARTPEC-5 CPU was utilized to obtain the results. As seen, the proposed motion detection algorithm can process frames at 10 FPS with a sparse network of 500 synaptic connectivities formed via StochasticNet. All conducted experiments for NeRM-RGB are based on a StochasticNet framework comprising ∼500 synaptic connectivities, which performs motion detection at ∼10 FPS.

The convolutional layer of the full network (the pre-trained AlexNet) comprises 34,848 weights; processing one frame therefore requires 1,214,383,104 multiplications and the same number of additions, with both the weight matrix and the input matrix in dense form. The formed StochasticNet, in contrast, has 500 weights, resulting in 250,000 multiplications and additions per frame; since the weight matrix is very sparse, the computation is applied as a sparse–dense product. Multiplication is the most important operation in this process: embedded environments usually do not provide a floating-point unit (FPU), which makes multiplication very slow, so reducing the number of operations speeds up the process considerably.

The proposed method is also compared with the pre-trained AlexNet network to analyze the effect of sparsification on accuracy. Both methods are run within the same GMM framework with the same parameter settings. Table 3 shows the comparison between NeRM-RGB and the complete AlexNet network structure (AlexNet), in which all synaptic connectivities are utilized without any sparsification (the same network structure as NeRM-RGB, but with full synaptic connectivity). The results illustrate that pre-trained deep neural networks with full network structures are not always useful for all applications.
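As a quick sanity check on these operation counts: both reported multiplication counts equal the square of the corresponding number of weights (34,848² = 1,214,383,104 and 500² = 250,000). The snippet below, with illustrative variable names, verifies the arithmetic:

```python
# Verify the per-frame operation counts quoted in the text.  The reported
# multiplication counts equal the square of the number of convolutional
# weights; this is an observation about the reported numbers, not a claim
# about the exact accounting used in the paper.

alexnet_weights = 34848        # weights in the convolutional layer of AlexNet
stochastic_weights = 500       # synaptic connectivities kept by StochasticNet

alexnet_mults = alexnet_weights ** 2        # multiplications per frame, dense
stochastic_mults = stochastic_weights ** 2  # multiplications per frame, sparse

reduction = alexnet_mults / stochastic_mults  # relative saving in multiplications
```

The roughly 4,857-fold reduction in multiplications is what makes the sparse network viable on FPU-less embedded processors, even though the measured frame-rate gain is smaller because of the other fixed costs in the pipeline.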
The impact of the StochasticNet formation can be observed by comparing these two methods: the formation process significantly improves the modeling efficiency of deep-feature-based motion detection by specializing the receptive field structures for the motion detection application, while maintaining accuracy and running time. It is worth noting that the GMM framework utilizes only two modes, and that a full network structure transforms the input frame into a more complex feature domain than the NeRM framework does. However, the formed StochasticNet model is optimized (formed) for the purposes of motion detection based on the defined energy function, so the extracted features are more appropriate for this task. The last experiment was conducted to demonstrate the effect of the number of modes on the performance of the motion detection algorithm. The same features extracted from videos of the "baseline" category (via the pre-trained AlexNet network) were modeled with two- and three-mode GMMs to examine the effect of additional modes on GMM performance. The results in Table 4 show that the number of GMM modes does not have a significant impact on the performance of motion detection models operating on features extracted via the pre-trained AlexNet. The neural responses extracted via the pre-trained AlexNet contain high-frequency features, which raise the variance and cause false motion detections, although they make it easier to detect all true motion. The precision and recall of the "dynamicBackground" and "intermittentObjectMotion" video categories in Table 3 support this statement, as their recall values are higher than their precisions. In contrast, the neural responses extracted from a StochasticNet lose some of the high-frequency features, resulting in a more
Table 3 Comparison of the AlexNet deep neural network and the NeRM framework on the CD.Net dataset. RGB channels are utilized as extra features fed into the GMM in both methods, and both frameworks are examined with the same GMM approach and the same parameter settings. NeRM-RGB is the sparsely connected network formed with 500 connectivities; AlexNet is the fully connected pre-trained AlexNet without any sparsification.

Category                  Method    Sp    Re    Pr    F-Measure  FPR    PWC     FNR
badWeather                NeRM-RGB  0.99  0.74  0.84  0.78       0.002  0.685   0.256
                          AlexNet   0.95  0.59  0.48  0.41       0.048  5.174   0.409
cameraJitter              NeRM-RGB  0.97  0.71  0.78  0.73       0.006  1.998   0.280
                          AlexNet   0.67  0.84  0.66  0.72       0.014  2.158   0.154
dynamicBackground         NeRM-RGB  0.98  0.53  0.51  0.51       0.023  4.103   0.465
                          AlexNet   0.64  0.87  0.11  0.20       0.323  31.597  0.122
intermittentObjectMotion  NeRM-RGB  0.98  0.69  0.52  0.52       0.015  1.887   0.302
                          AlexNet   0.96  0.94  0.03  0.06       0.359  35.430  0.054
lowFramerate              NeRM-RGB  0.99  0.42  0.54  0.43       0.016  5.695   0.575
                          AlexNet   0.69  0.56  0.49  0.46       0.039  6.948   0.432
nightVision               NeRM-RGB  0.97  0.61  0.56  0.50       0.008  2.022   0.388
                          AlexNet   0.83  0.84  0.14  0.22       0.308  30.886  0.154
shadow                    NeRM-RGB  0.98  0.57  0.33  0.39       0.027  3.601   0.425
                          AlexNet   0.97  0.67  0.17  0.21       0.161  16.711  0.324
thermal                   NeRM-RGB  0.95  0.75  0.72  0.73       0.013  2.508   0.249
                          AlexNet   0.98  0.83  0.54  0.64       0.029  3.611   0.161
turbulence                NeRM-RGB  0.99  0.69  0.59  0.61       0.041  6.044   0.307
                          AlexNet   0.81  0.68  0.72  0.68       0.011  3.754   0.319
Overall                   NeRM-RGB  0.98  0.71  0.67  0.63       0.001  0.338   0.284
                          AlexNet   0.83  0.94  0.03  0.06       0.187  18.715  0.054
Table 4 Comparison of the AlexNet deep neural network modeled via 2- and 3-mode GMM models on videos of the "baseline" category of the Change Detection dataset. As seen, increasing the number of modes of the GMM when the neural responses are extracted by a pre-trained AlexNet does not have a significant impact on the modeling accuracy. It is also worth noting that increasing the number of modes imposes extra computational complexity, which makes deployment on embedded systems impractical.

Method            Recall  Specificity  FPR     FNR     PWC     Precision  F-Measure
AlexNet (2-Mode)  0.6811  0.9655       0.0344  0.3188  5.0421  0.3498     0.4108
AlexNet (3-Mode)  0.6724  0.9664       0.0335  0.3275  5.0115  0.3519     0.4092
stable performance with fewer false motion detections, at the cost of potentially missing some true motion pixels, as is evident in Table 3. As seen, AlexNet achieved a better overall Recall value, while the StochasticNet approach achieved better precision. Since the dataset contains far more non-motion pixels per frame than motion pixels per frame, reducing the false positives, even at the cost of missing some true motion pixels, results in a better F-Measure. On the other hand, utilizing more than two modes in the GMM model imposes extra computational complexity, which increases the running time significantly. As a result, obtaining neural responses via a sparse network is considered the best option, as it preserves model accuracy and real-time processing simultaneously.
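The per-pixel mixture model whose mode count is discussed above follows the classic online GMM update of Stauffer and Grimson [32]. The sketch below is a minimal single-pixel, single-feature illustration (the class name, parameter values and the foreground rule are illustrative assumptions, not the paper's settings):

```python
import math

class OnlinePixelGMM:
    """Toy Stauffer-Grimson-style [32] background model for one scalar feature."""

    def __init__(self, n_modes=2, alpha=0.05, match_sigmas=2.5, init_var=900.0):
        self.alpha = alpha                   # learning rate
        self.match_sigmas = match_sigmas     # match threshold in std-devs
        self.init_var = init_var
        self.means = [0.0] * n_modes
        self.vars = [init_var] * n_modes
        self.weights = [1.0 / n_modes] * n_modes

    def update(self, x):
        """Fold observation x into the mixture; return True if x is foreground."""
        # Find the first mode that x matches (within match_sigmas std-devs).
        matched = None
        for k in range(len(self.means)):
            if abs(x - self.means[k]) <= self.match_sigmas * math.sqrt(self.vars[k]):
                matched = k
                break
        if matched is None:
            # No mode explains x: replace the lowest-weight mode with a new one.
            k = min(range(len(self.weights)), key=lambda i: self.weights[i])
            self.means[k], self.vars[k], self.weights[k] = x, self.init_var, 0.05
        else:
            # Update the matched mode's mean and variance.
            k = matched
            self.means[k] += self.alpha * (x - self.means[k])
            self.vars[k] += self.alpha * ((x - self.means[k]) ** 2 - self.vars[k])
        # Raise the hit mode's weight, decay the others, and renormalise.
        for i in range(len(self.weights)):
            hit = 1.0 if i == k else 0.0
            self.weights[i] += self.alpha * (hit - self.weights[i])
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
        # Foreground if nothing matched, or the matched mode has low weight.
        return matched is None or self.weights[matched] < 0.2
```

A model of this kind is maintained for every pixel in the GMM frameworks of Tables 1–4, with the feature being a neural response and/or a colour channel rather than a plain scalar intensity; `n_modes` is the mode count examined in Table 4.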
4 Conclusion

A motion detection algorithm was proposed that takes advantage of neural responses from deep neural networks. The proposed framework utilizes the neural responses in a Gaussian mixture model (NeRM) framework to model the background of a scene and detect moving objects, and is feasible on embedded systems. The proposed NeRM method takes advantage of sparse synaptic connectivities, resolving the computational complexity of running a deep neural network on embedded systems while maintaining its performance and accuracy. Experimental results showed that the proposed framework can perform motion detection on an embedded system at ∼10 FPS while outperforming competing methods that use RGB pixel intensities or hand-crafted texture features as input features for the GMM modeling. This approach can open a new avenue for the use of deep neural networks on embedded systems, which has wide applicability in different industrial problems. The proposed method was examined in the RGB color space since the pre-trained network was trained in the RGB color space; however, more appropriate color spaces for motion detection exist and could improve the performance of the model, which is suggested as a future direction for extending the proposed method.
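To make the overall pipeline concrete, the sketch below is an illustrative toy implementation, not the authors' code: a stochastically sparsified convolution kernel produces a per-pixel neural response, and a per-pixel Gaussian background model thresholds that response to obtain a motion mask. The kernel size, sparsity level and all parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stochastically sparsified convolution kernel (StochasticNet-style sketch):
# only ~10% of the synaptic connectivities are kept.
kernel = rng.normal(size=(7, 7)) * (rng.random((7, 7)) < 0.1)

def neural_response(frame):
    """Valid-mode 2-D convolution of a grayscale frame with the sparse kernel."""
    h, w = frame.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(frame[y:y + kh, x:x + kw] * kernel)
    return out

class GaussianBackground:
    """Single-Gaussian-per-pixel background model over the neural responses."""

    def __init__(self, shape, alpha=0.05):
        self.mean = np.zeros(shape)
        self.var = np.full(shape, 1e3)   # start with wide variances
        self.alpha = alpha

    def apply(self, response):
        """Return the foreground mask for `response`, then update the model."""
        fg = np.abs(response - self.mean) > 2.5 * np.sqrt(self.var)
        self.mean += self.alpha * (response - self.mean)
        self.var += self.alpha * ((response - self.mean) ** 2 - self.var)
        self.var = np.maximum(self.var, 1e-2)  # variance floor for stability
        return fg
```

In the actual NeRM framework the responses of many sparse receptive fields are combined with the RGB channels and modeled with a multi-mode GMM; this sketch keeps one kernel and one Gaussian per pixel purely for clarity.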
References 1. Barnich, O., & Van Droogenbroeck, M. (2011). Vibe: a universal background subtraction algorithm for video sequences. Transactions on Image Processing, 20(6). 2. Bouwmans, T., El Baf, F., & Vachon, B. (2008). Background modeling using mixture of gaussians for foreground detection-a survey. Recent Patents on Computer Science, 1(3). 3. Chen, W., Wilson, J.T., Tyree, S., Weinberger, K.Q., & Chen, Y. (2015). Compressing neural networks with the hashing trick. In ICML (pp. 2285–2294). 4. Chen, Y., Chen, C., Huang, C., & Hung, Y. (2007). Efficient hierarchical method for background subtraction. Pattern Recognition, 40(10). 5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In IEEE computer vision and pattern recognition (CVPR). 6. Elgammal, A., Harwood, D., & Davis, L. (2000). Non-parametric model for background subtraction. In European conference on computer vision (ECCV). Springer. 7. Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. (2014). Scalable object detection using deep neural networks. In Conference on computer vision and pattern recognition (CVPR). 8. Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1915–1929. 9. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Conference on computer vision and pattern recognition (CVPR). IEEE. 10. Goyette, N., Jodoin, P., Porikli, F., Konrad, J., & Ishwar, P. (2012). Changedetection.net: a new change detection benchmark dataset. In Computer vision and pattern recognition workshops (CVPRW). IEEE. 11. Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (ECCV). Springer. 12.
Han, S., Mao, H., & Dally, W. (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv:1510.00149. 13. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., & et al. (2012). Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. Signal Processing Magazine, 29(6), 82–97. 14. Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Speeding up convolutional neural networks with low rank expansions. In British machine vision conference (BMVC).
15. Jenifa, R., Akila, C., & Kavitha, V. (2012). Rapid background subtraction from video sequences. In International conference on computing, electronics and electrical technologies (ICCEET). IEEE. 16. Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). Densecap: fully convolutional localization networks for dense captioning. In Conference on computer vision and pattern recognition (CVPR). 17. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS). 18. Mahendran, A., & Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. In International journal of computer vision (IJCV). 19. Mason, M., & Duric, Z. (2001). Using histograms to detect and track objects in color video. In Applied imagery pattern recognition workshop. IEEE. 20. Matsuyama, T., Ohya, T., & Habe, H. (1999). Background subtraction for non-stationary scenes. Department of Electronics and Communications, Graduate School of Engineering, Kyoto University, Sakyo, Kyoto, Japan. 21. Mikolov, T., Deoras, A., Povey, D., Burget, L., & Černocký, J. (2011). Strategies for training large scale neural network language models. In IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 196–201). IEEE. 22. Ojala, T., Pietikäinen, M., & Harwood, D. (1996). A comparative study of texture measures with classification based on featured distributions. Pattern Recognition. 23. Ojala, T., Pietikäinen, M., & Mäenpää, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 24(7). 24. Oliver, N., Rosario, B., & Pentland, A. (2000). A bayesian computer vision system for modeling human interactions. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8). 25.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In Conference on computer vision and pattern recognition workshops (CVPR). IEEE. 26. Sainath, T., Mohamed, A., Kingsbury, B., & Ramabhadran, B. (2013). Deep convolutional neural networks for LVCSR. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 8614–8618). IEEE. 27. Shafiee, M.J., Mishra, A., & Wong, A. (2016). Deep learning with Darwin: evolutionary synthesis of deep neural networks. arXiv:1606.04393. 28. Shafiee, M.J., Siva, P., Fieguth, P., & Wong, A. (2016). Embedded motion detection via neural response mixture background modeling. In Computer vision and pattern recognition workshop. IEEE. 29. Shafiee, M.J., Siva, P., & Wong, A. (2016). StochasticNet: forming deep neural networks via stochastic connectivity. IEEE Access (99). 30. Shafiee, M.J., & Wong, A. (2016). Evolutionary synthesis of deep neural networks via synaptic cluster-driven genetic encoding. arXiv:1609.01360. 31. Siva, P., Shafiee, M.J., Li, F., & Wong, A. (2015). PIRM: fast background subtraction under sudden, local illumination changes via probabilistic illumination range modelling. In International conference on image processing (ICIP). IEEE. 32. Stauffer, C., & Grimson, W. (1999). Adaptive background mixture models for real-time tracking. In Computer society conference on computer vision and pattern recognition (CVPR) (Vol. 2). IEEE.
33. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Conference on computer vision and pattern recognition (CVPR) (pp. 1–9). 34. Tompson, J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems (pp. 1799–1807). 35. Vedaldi, A., & Lenc, K. (2015). MatConvNet: convolutional neural networks for MATLAB. 36. Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. In Advances in neural information processing systems (pp. 2074–2082). 37. Wren, C., Azarbayejani, A., Darrell, T., & Pentland, A. (1997). Pfinder: real-time tracking of the human body. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 19(7). 38. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. 39. Yu, G., Sapiro, G., & Mallat, S. (2012). Solving inverse problems with piecewise linear estimators: from gaussian mixture models to structured sparsity. IEEE Transactions on Image Processing.
Mohammad Javad Shafiee received the B.Sc. and M.Sc. degrees in computer science and artificial intelligence from Shiraz University, Shiraz, Iran, in 2008 and 2011 respectively; and the Ph.D. degree in systems design engineering from the University of Waterloo, Waterloo, ON, Canada in 2017. He is currently a postdoctoral fellow at University of Waterloo. His main focus is on statistical learning and graphical models, such as conditional random fields, Markov random fields, deep learning, and convolutional neural networks. His research interests include computer vision, machine learning, and biomedical image processing.
Parthipan Siva received the B.A.Sc. and M.A.Sc. degrees in systems design engineering from the University of Waterloo, ON, Canada, and the Ph.D. degree in computer science from the Queen Mary University of London. He is currently a Senior Computer Vision Scientist at Aimetis Corporation. He has over ten years of industrial experience in developing real-time video analytics software for surveillance applications. His research interests include image segmentation, video analytics, and pattern recognition, with a focus on activity detection and tracking for surveillance applications.
Paul Fieguth received the B.A.Sc. degree from the University of Waterloo, Ontario, Canada, in 1991 and the Ph.D. degree from the Massachusetts Institute of Technology, Cambridge, in 1995, both degrees in electrical engineering. He joined the faculty at the University of Waterloo in 1996, where he is currently Professor and Department Chair in Systems Design Engineering and a Co-Director of the Vision and Image Processing Research Group. His research interests include statistical signal and image processing, hierarchical algorithms, data fusion, and the interdisciplinary applications of such methods. He is the author of a 2010 Springer textbook on Statistical Image Processing and Multidimensional Modeling, and a 2017 Springer textbook on Complex Systems.
Alexander Wong received the B.A.Sc. degree in computer engineering, the M.A.Sc. degree in electrical and computer engineering, and the Ph.D. degree in systems design engineering from the University of Waterloo, Waterloo, ON, Canada, in 2005, 2007, and 2010, respectively. He is currently the Canada Research Chair of Medical Imaging Systems, the Co-Director of the Vision and Image Processing Research Group, and an Associate Professor with the Department of Systems Design Engineering, University of Waterloo. He has authored over 400 refereed journal and conference papers, and patents, in various fields, such as computational imaging, artificial intelligence, computer vision, and multimedia systems. His current research interests revolve around computational imaging and artificial intelligence, with a focus on integrative computational imaging systems for biomedical imaging and operational artificial intelligence. He has received two Outstanding Performance Awards, a Distinguished Performance Award, an Engineering Research Excellence Award, a Sandford Fleming Teaching Excellence Award, an Early Researcher Award from the Ministry of Economic Development and Innovation, a Best Paper Award at the NIPS Workshop on Efficient Methods for Deep Neural Networks, two Best Paper Awards by the Canadian Image Processing and Pattern Recognition Society (CIPPRS), a Distinguished Paper Award by the Society of Information Display, a Best Paper Award for the Conference of Computer Vision and Imaging Systems (CVIS), two Magna Cum Laude Awards and one Cum Laude Award from the Annual Meeting of the Imaging Network of Ontario, and the Alumni Gold Medal.