SCIENCE CHINA Information Sciences
RESEARCH PAPERS · Special Focus
December 2011 Vol. 54 No. 12: 2522–2529 doi: 10.1007/s11432-011-4493-3
Part-based on-road vehicle detection using hidden random field

ZHANG XueTao1, HE YongJian1,2 & WANG Fei1*

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China;
2 Xi'an Communication Institute, Xi'an 710049, China

*Corresponding author (email: [email protected])

Received June 15, 2011; accepted September 26, 2011
Abstract  This paper addresses the problem of detecting on-road vehicles in still images captured by onboard cameras. We model detection as a labelling inference procedure and incorporate a part-based representation of vehicle rear-ends within a probabilistic model based on a hidden random field. Representing objects with parts is inherently well suited to dealing with occlusions. In the proposed model, the part labels form a hidden layer in the graphical model, and our approach can find the latent parts automatically, without explicit part annotations during training. Experiments on a database of real images yield promising results.

Keywords  vehicle detection, hidden random field, part-based model
Citation Zhang X T, He Y J, Wang F. Part-based on-road vehicle detection using hidden random field. Sci China Inf Sci, 2011, 54: 2522–2529, doi: 10.1007/s11432-011-4493-3
1  Introduction
Vehicle detection based on vision sensors is a basic technology for autonomous vehicles. In order to drive fast and safely, the ego-vehicle should monitor the behavior of surrounding vehicles; among them, the vehicles in front are the most important for safety. In addition, the environment perception system of an autonomous vehicle relies heavily on the performance of vehicle detection. However, since the shape and appearance of vehicles and the lighting conditions vary greatly across scenes, it is crucial to develop computer vision algorithms that can precisely and robustly detect surrounding vehicles, providing effective information for the autonomous vehicle system. This paper addresses the problem of detecting both clean and partially occluded front vehicles using an onboard camera. We model both local parts and structure for vehicle detection under a discriminative graphical model. The proposed model is powerful in coping with uncertainties due to occlusion and noise, which often occur in urban traffic scenes. Moreover, the model can discover parts automatically, which greatly relieves the effort of collecting training images. Compared with sliding window based approaches, representing objects by a sparse set of parts can exploit the spatial interactions between parts, which not only helps object detection but also allows detection under partial occlusion. For example, knowing the location of the left rear wheel can help find the right one. Furthermore, using the information from the left rear wheel and the top-left corner of the car's rear-end, together with their interaction, we can confidently infer that there is a car.
A key aspect of the proposed learning framework for the part-based detection task is that parts are found automatically, and therefore only coarse labels of training images are required. This can greatly reduce the human labeling effort, as only the vehicle instances in the images need to be manually annotated. Although the learned parts are not guaranteed to be semantically meaningful, the algorithm models whichever parts are automatically chosen for the task, so detection can be more accurate.

In general, most approaches to vehicle detection can be divided into two stages: a feature-based stage to generate vehicle candidates, and an appearance-based stage for validation. Usually, the latter uses pre-designed vehicle templates to capture holistic characteristics of the vehicle region. A thorough review of this topic can be found in [1]. Recently, Aytekin and Altug developed a vehicle detection and tracking system in which the shadows of vehicles were used to generate candidate regions, which were then verified by detecting vertical edges [2]. Song et al. extracted regions of interest by edge-based texture analysis and classified these candidates using the AdaBoost algorithm. Sivaraman and Trivedi trained an AdaBoost classifier with Haar-like features in an active learning framework [3]. However, this kind of approach is sensitive to uncertainties caused by background noise and occlusion. In urban areas, the large amount of edge information from buildings in the background degrades the candidate generation stage. Moreover, as the validation requires a complete description of the vehicle region, incomplete information is harmful to the classifier.

Another set of vehicle detection methods is known as local methods, which represent an object as a collection of parts. These approaches have been widely studied recently in the computer vision and machine learning communities. Usually, considerable human effort is needed to label the parts used in classifier training [4, 5]. Other methods learn the parts by clustering similar image patches and form a "bag of words" representation that ignores the geometric structure entirely [6, 7]. Additional structural constraints can also be added to the model, which is helpful for discriminating the object [8, 9]. While generative approaches successfully model the distribution of spatially coherent parts [10], discriminative models can be more effective when the relations between observations and labels are complicated and hard to describe in a generative way [5, 11].
2  Probabilistic graphical model
We aim at locating all the instances of vehicles in a still image. As the image is captured by an in-car forward-looking camera, only the rear-ends of leading vehicles can be seen. Traditional methods are based on sliding window classifiers, which use common classifiers, such as SVM and AdaBoost, to determine whether a local window or patch contains an instance of a vehicle. However, it is often the case that one vehicle is occluded by another, and classifiers based on holistic appearance information cannot correctly distinguish these partial vehicle patches from the background. Therefore, we resort to part-based models, in which various parts of the image are classified into object parts or background. We adopt a probabilistic framework based on the recently proposed hidden random field (HRF) [12], in which the object is modeled by a flexible collection of parts conditioned on image observations. In our approach, local observations are extracted around detected interest points. Our task is to infer the class label (i.e., object or background) of each observation. Let V denote the collection of all local observations. Each observation corresponds to a site in the random field. The observed features extracted from these sites are represented by x = {x_i}, i ∈ V. For each site i ∈ V, there is a class label y_i ∈ {0, 1}, where y_i = 1 indicates that the observation associated with this site comes from an instance of a vehicle, and y_i = 0 from the background. Note that, at this stage, we do not indicate which instance a local observation belongs to; some post-processing operations, which are used to separate instances, will be introduced in section 4. One possible way to model the interaction between neighboring sites is to smooth the labels; that is, to assume that neighboring sites usually have the same class labels. However, in this situation the performance relies heavily on single-site classifiers, so some strong general local features are required, and it is hard to incorporate the geometric structure of the object into the probabilistic framework. In the present work, we use the HRF. The model augments the conditional random field with a hidden layer which is not observed during training.
Figure 1  The proposed graphical model. All part variables h are conditioned on the image x. There are also links between the class labels y and h. Note that, in our situation, the graph is irregular.
The variables in this layer are used to indicate which parts the local observations belong to. Interactions are then constructed between these part variables, which enhances the modeling power of the framework. Let h = {h_i}, i ∈ V, denote the set of these variables; each h_i indicates the part the ith observation belongs to. In this paper, we extend the HRF by replacing the linear functions in the unary potentials with logistic regression functions [11, 13]. Although other classifiers could also be used, such as the random decision trees in [14] and the neural network in [15], a separate step would be needed to learn the parameters in the unary potentials; by modeling with logistic regression classifiers, these parameters can be learned simultaneously within the HRF framework. Another difference is that we carefully choose edge features for the pairwise potentials, which incorporate both appearance and geometric information on the edges. An example of the graphical model is depicted in Figure 1. The conditional distribution of the class labels y and part labels h given the image x is defined as

$$P(y, h \mid x; \theta) = \frac{1}{Z(\theta, x)} \prod_{i \in V} e^{\phi(h_i, x)}\, \delta(y(h_i) = y_i) \prod_{(i,j) \in E} e^{\psi(h_i, h_j, x)}, \qquad (1)$$
where θ represents the learned parameters of the model, and E is the set of edges between pairs of neighboring part labels. The unary potentials φ(h_i, x) measure the compatibility between local observations and part labels; this measurement takes the form of multiclass logistic regression. The pairwise potentials ψ(h_i, h_j, x) measure the compatibility between neighboring part labels given the local observations. Finally, the potentials δ(y(h_i) = y_i) enforce the deterministic mapping from part labels to class labels. Our final purpose is therefore to estimate the posterior probability distribution of the class labels for all the local observations, that is,

$$P(y \mid x; \theta) = \sum_h P(y, h \mid x; \theta). \qquad (2)$$
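To make the factorization in (1) and the marginalization in (2) concrete, the following is a minimal sketch (not the authors' implementation) that evaluates P(y | x; θ) by brute-force enumeration over the hidden part labels on a toy graph. The potential values, the graph, and the part-to-class mapping y(·) are all placeholder assumptions; enumeration is tractable only because the toy graph is tiny, and the paper instead uses belief propagation (section 3).

```python
import itertools
import numpy as np

# Toy setup: 3 sites, part labels {0, 1, 2}, where part 0 is the background part.
# part_to_class implements the deterministic mapping y(h_i): part 0 -> class 0.
num_sites, num_parts = 3, 3
edges = [(0, 1), (1, 2)]
part_to_class = lambda h: 0 if h == 0 else 1

rng = np.random.default_rng(0)
unary = rng.normal(size=(num_sites, num_parts))      # placeholder phi(h_i, x) values
pairwise = rng.normal(size=(num_parts, num_parts))   # placeholder psi(h_i, h_j, x) values
pairwise = (pairwise + pairwise.T) / 2               # symmetric interactions, as in section 2.2

def joint_score(h):
    """Unnormalized product from Eq. (1), without the delta factors."""
    log_s = sum(unary[i, h[i]] for i in range(num_sites))
    log_s += sum(pairwise[h[i], h[j]] for i, j in edges)
    return np.exp(log_s)

assignments = list(itertools.product(range(num_parts), repeat=num_sites))
Z = sum(joint_score(h) for h in assignments)         # partition function Z(theta, x)

def class_posterior(y):
    """P(y | x; theta) from Eq. (2): sum over part assignments consistent with y."""
    return sum(joint_score(h) for h in assignments
               if all(part_to_class(h[i]) == y[i] for i in range(num_sites))) / Z

print(class_posterior((1, 1, 0)))                    # posterior of one class labelling
```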
2.1  Unary potentials
For the unary potentials, instead of a linear function, we use the multiclass version of the logistic function for its simplicity and efficiency. For the ith local observation in image x, the potential is modeled using a local discriminative model that outputs the association of this site with part class h_i. Thus the potential function φ(h_i, x) can be written as

$$\phi(h_i, x) = \sum_{k=0}^{H} \delta(h_i = k) \log F(h_i = k \mid x), \qquad (3)$$
and the function F is in the form of multiclass logistic regression. For each site i, let f_i(x) be a function that maps the observations x to a feature vector, f_i : x → R^l, and add a bias term to the model which is always one. The final unary feature vector is therefore defined as g_i(x) = [1, f_i(x)]. Note that in the case of object detection, the vector g_i(x) encodes the appearance-based features of the ith site (or part). The multiclass logistic regression function then has the following form:
$$F(h_i = k \mid x) = \begin{cases} \dfrac{\exp(w_k^T g_i(x))}{1 + \sum_{l=1}^{H} \exp(w_l^T g_i(x))}, & \text{if } k > 0, \\[2ex] \dfrac{1}{1 + \sum_{l=1}^{H} \exp(w_l^T g_i(x))}, & \text{if } k = 0. \end{cases} \qquad (4)$$
Here the w_k are the model parameters for k = 1, . . . , H, and H is the number of parts used to describe the object. For an (H + 1)-class classification problem, one needs only H independent hyperplanes in the feature space, so the parameter w_k for k = 0 is set to 0.
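As an illustration of (4), the following sketch, assuming plain NumPy and placeholder feature and parameter values, evaluates the unary classifier with the bias term prepended and w_0 fixed to zero:

```python
import numpy as np

def unary_probabilities(f_i, W):
    """F(h_i = k | x) from Eq. (4).

    f_i : feature vector f_i(x) of length l (placeholder values in this sketch).
    W   : H x (l + 1) parameter matrix; row k-1 holds w_k for k = 1..H.
          w_0 is implicitly zero, so part 0 contributes exp(0) = 1 to the numerator.
    """
    g_i = np.concatenate(([1.0], f_i))             # bias term: g_i(x) = [1, f_i(x)]
    scores = W @ g_i                               # w_k^T g_i(x) for k = 1..H
    denom = 1.0 + np.sum(np.exp(scores))           # 1 + sum_l exp(w_l^T g_i(x))
    return np.concatenate(([1.0], np.exp(scores))) / denom  # probabilities for k = 0..H

H, l = 4, 176                                      # e.g. the 176-dim unary features of section 4.2
rng = np.random.default_rng(0)
probs = unary_probabilities(rng.normal(size=l), rng.normal(size=(H, l + 1)) * 0.01)
assert np.isclose(probs.sum(), 1.0)                # a proper distribution over the H + 1 parts
```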
2.2  Pairwise potentials
The pairwise potentials model the interaction between the part labels at two neighboring sites given the observations. We use the multiclass generalized Potts model for these potentials:

$$\psi(h_i, h_j, x) = \sum_{k=0}^{H} \sum_{l=0}^{H} v_{kl}^T q_{ij}(x)\, \delta(h_i = k)\, \delta(h_j = l). \qquad (5)$$
Here, q_ij(x) is the pairwise relational vector for a site pair (i, j), and the v_kl are the model parameters. In this HRF-based model, the image data can be used in the constraints that model the interaction between neighboring sites, so both geometric and appearance features can be included in the pairwise potentials. In our model, we do not consider asymmetric interactions between the parts, e.g., that some part is always above another part. This implies that v_kl = v_lk for k, l = 0, 1, . . . , H. The interaction potential in (5) is a generalization of the Potts model: when v_kl = 0 for k ≠ l, and all the elements of the vector v_kk are set to zero except the bias term, we recover the Potts model from (5).
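The following sketch, again with placeholder parameters, evaluates the generalized Potts potential (5) and shows how zeroing the off-diagonal v_kl and keeping only the diagonal bias recovers the ordinary Potts model; the vector q_ij is a stand-in for the 58-dimensional pairwise features of section 4.2.

```python
import numpy as np

def pairwise_potential(h_i, h_j, q_ij, V):
    """psi(h_i, h_j, x) from Eq. (5): only the (h_i, h_j) term of the double sum survives."""
    return V[h_i, h_j] @ q_ij

H, d = 4, 58
rng = np.random.default_rng(0)
V = rng.normal(size=(H + 1, H + 1, d))             # placeholder v_kl vectors
V = (V + V.transpose(1, 0, 2)) / 2                 # symmetry v_kl = v_lk (no asymmetric interaction)

q_ij = np.concatenate(([1.0], rng.normal(size=d - 1)))  # bias element first, as in g_i(x)

# Special case: ordinary Potts model. Off-diagonal v_kl are zero; each diagonal v_kk keeps
# only the bias element, so psi = gamma when h_i == h_j and 0 otherwise.
gamma = 0.5
V_potts = np.zeros_like(V)
for k in range(H + 1):
    V_potts[k, k, 0] = gamma

assert pairwise_potential(1, 1, q_ij, V_potts) == gamma
assert pairwise_potential(1, 2, q_ij, V_potts) == 0.0
```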
3  Parameter learning and inference
The set of parameters of our model is θ = {w_{k=1,...,H}, v_{kl=1,...,H}}. To prevent over-fitting, we use the maximum a posteriori (MAP) criterion to estimate these parameters, assuming a Gaussian prior over the parameters, i.e., P(θ) = N(θ; 0, σ²I). Given a set of N i.i.d. labeled training images D = {(x^{(1)}, y^{(1)}), . . . , (x^{(N)}, y^{(N)})}, the MAP estimates of the parameters are obtained by minimizing the following objective function:

$$L(\theta) = -\sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}; \theta) + \frac{1}{2\sigma^2} \theta^T \theta = -\sum_{n=1}^{N} \log \frac{\sum_h P(y^{(n)}, h \mid x^{(n)}; \theta)}{Z(\theta)} + \frac{1}{2\sigma^2} \theta^T \theta, \qquad (6)$$

where

$$P(y^{(n)}, h \mid x^{(n)}; \theta) = \prod_{i \in V} e^{\phi_i(h_i, x^{(n)})}\, \delta(y(h_i) = y_i^{(n)}) \prod_{(i,j) \in E} e^{\psi_{ij}(h_i, h_j, x^{(n)})}. \qquad (7)$$
Thus, we can learn the parameters using gradient descent. Define the log likelihood of the nth training image as l_n(θ) = log P(y^{(n)} | x^{(n)}; θ). Its derivatives with respect to the parameters w_k in the unary potential functions can be written as

$$\frac{\partial l_n(\theta)}{\partial w_k} = \frac{1}{P(y^{(n)} \mid x^{(n)}; \theta)} \sum_h \frac{\partial P(y^{(n)}, h \mid x^{(n)}; \theta)}{\partial w_k} = \sum_{i \in V} \left( E_{P(h_i \mid x^{(n)}, y^{(n)}; \theta)}\!\left[\frac{\partial \phi_i}{\partial w_k}\right] - E_{P(h_i \mid x^{(n)}; \theta)}\!\left[\frac{\partial \phi_i}{\partial w_k}\right] \right). \qquad (8)$$
It follows that the derivatives can be computed using P(h_i = k | x^{(n)}, y^{(n)}; θ) and P(h_i = k | x^{(n)}; θ), which can be obtained by loopy belief propagation (LBP). Note that there are two types of nodes in the graph, i.e., the class label variables and the part label variables; when inferring the marginals, the message passing among both types of variables should be considered. Similarly, the derivatives with respect to the parameters v_kl in the pairwise potential functions can be written as

$$\frac{\partial l_n(\theta)}{\partial v_{kl}} = \sum_{(i,j) \in E} \left( E_{P(h_i = k, h_j = l \mid x^{(n)}, y^{(n)}; \theta)}\!\left[\frac{\partial \psi_{ij}}{\partial v_{kl}}\right] - E_{P(h_i = k, h_j = l \mid x^{(n)}; \theta)}\!\left[\frac{\partial \psi_{ij}}{\partial v_{kl}}\right] \right). \qquad (9)$$
These derivatives can also be expressed in terms of marginal probabilities that can be obtained using LBP.
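The overall training loop can then be summarized as below. This is a minimal sketch under stated assumptions, not the authors' code: the clamped-minus-unclamped gradient of the log likelihood, whose marginals the paper obtains via LBP, is passed in as a black box, and the learning rate is a placeholder.

```python
import numpy as np

def map_gradient_step(theta, grad_loglik, sigma2, lr=0.01):
    """One gradient-descent step on the MAP objective L(theta) of Eq. (6).

    grad_loglik : gradient of sum_n l_n(theta), assembled from Eqs. (8)-(9) as the
                  difference between clamped expectations under P(h | x, y; theta) and
                  unclamped expectations under P(h | x; theta); the required marginals
                  are obtained with loopy belief propagation in the paper.
    sigma2      : variance of the Gaussian prior; it contributes theta / sigma2 to the
                  gradient of the regularizer (1 / 2 sigma^2) theta^T theta.
    """
    grad_L = -grad_loglik + theta / sigma2
    return theta - lr * grad_L

# Placeholder illustration: with a zero log-likelihood gradient, the update simply
# shrinks theta toward the prior mean, as the regularizer in Eq. (6) dictates.
theta = np.ones(10)
theta = map_gradient_step(theta, grad_loglik=np.zeros(10), sigma2=1.0)
assert np.allclose(theta, 0.99)
```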
4  Experimental results

4.1  Data set
In order to evaluate the proposed approach effectively, we collected images with a camera mounted on our testing vehicle [16]. The camera looks forward, so the rear-ends of vehicles driving in front of the ego-vehicle can be seen in the image. All the images in our dataset were captured on the roads around Xi'an city, under different weather conditions. There are 842 images in total, each of size 480×270. Most of the images contain more than one vehicle: 586 images contain two instances, 107 contain three instances, and the rest contain one instance, for a total of 1642 vehicles in the data set. We only consider the situation where one vehicle is occluded by another; there are 118 images of this kind. During training, we only use images with no occlusion. Therefore, in our experiment, we randomly chose 500 such images for training, each containing one, two or three vehicles. Additionally, we need a validation set to determine the regularization factor, so we randomly split the remaining images into two sets, each containing 172 images. We thus have 969 vehicles for training, 317 for validation and 356 for testing. The occurrences of vehicle instances in the images are indicated by manually obtained bounding boxes. For the occluded vehicles, the size of the bounding boxes is estimated from the visible parts. The interest points inside the bounding boxes are considered object points, i.e., their labels are y_i = 1. Figure 2 shows some example images from the database.

Figure 2  Example images in the database.
4.2  Feature extraction
In this paper, the part labels of local observations are not known during training. For convenience, we need only one type of feature vector for all object parts. This descriptor should be robust to scale, illumination and color changes, occlusion and other uncertainties. SIFT features [17] meet these requirements and can be incorporated into our model easily. Therefore, for the unary features, we compute
• a SIFT descriptor with 4×4 spatial and 8 orientation bins;
• color features with 4×4 spatial bins in the CIELAB color space, computed within a window at each site.
In total, we get a feature vector with 4 × 4 × 8 + 4 × 4 × 3 = 176 elements for each site.
We also utilize appearance information between two neighboring sites for the pairwise potentials. We compute both appearance and geometric features as follows:
• the maximum ratio between the scales of two neighboring local observations, i.e., max(s_i/s_j, s_j/s_i), where s_i and s_j are the scales under which the SIFT descriptors are computed;
• the direction from the lower left site to the upper right one;
• the pairwise appearance features extracted from the region between the two sites. The region is an oriented rectangle whose direction is determined by the positions of the two sites. The long edge length of the rectangle is the distance between the sites, and the short edge length is half of the long one. This region is resized to 32 × 32 so that we can compute a histogram of gradient orientations with 8 bins;
• the difference between the color features of the two neighboring sites.
Thus, the feature vector for each pair of neighboring sites is 1 + 1 + 8 + 4 × 4 × 3 = 58 in length.
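As a sanity check on the dimensions above, the following sketch assembles the two descriptors. The extraction routines are stand-ins (random arrays rather than actual SIFT, CIELAB or gradient computations), so only the layout and lengths reflect the paper:

```python
import numpy as np

def unary_features(sift_desc, lab_patch):
    """176-dim unary descriptor: 4x4x8 SIFT bins plus 4x4 CIELAB color bins (3 channels)."""
    assert sift_desc.shape == (4, 4, 8)        # standard 128-dim SIFT layout
    assert lab_patch.shape == (4, 4, 3)        # color features per spatial bin
    return np.concatenate([sift_desc.ravel(), lab_patch.ravel()])   # 128 + 48 = 176

def pairwise_features(s_i, s_j, direction, hog8, color_i, color_j):
    """58-dim pairwise descriptor: scale ratio + direction + 8-bin HOG + color difference."""
    scale_ratio = max(s_i / s_j, s_j / s_i)    # symmetric in the two sites
    return np.concatenate([[scale_ratio, direction], hog8, color_i - color_j])

rng = np.random.default_rng(0)
u = unary_features(rng.random((4, 4, 8)), rng.random((4, 4, 3)))
p = pairwise_features(1.5, 2.0, 0.3, rng.random(8), rng.random(48), rng.random(48))
assert u.size == 176 and p.size == 58          # 1 + 1 + 8 + 48 = 58
```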
4.3  Performance evaluation
Our model was trained on the data set described above. We use the MAP criterion for training and select the variance parameter σ² of the Gaussian prior by comparing the performance of the model on the validation set. Additionally, as we do not distinguish which instance the local observations belong to, a postprocessing step is needed to separate multiple instances. This was accomplished by a simple clustering procedure. For each testing image, we inferred not only the class labels but also the part labels of the local observations. The postprocessing was then performed as follows:
• We clustered the non-background sites according to the inferred class labels. Isolated clusters were considered noise, and all their associated local observations were removed.
• The remaining non-background sites were then clustered according to the inferred part labels.
• Beginning with the smallest part label, we connected neighboring parts in increasing order until no non-repeated part was left.
• Finally, each set of connected parts was considered a detected vehicle.
Then we compared the maximal bounding box of each detection with the ground truth, and a detection with more than 30% area overlap is considered correct, as illustrated by the sketch after this section.

In this work, the number of parts was set manually. We first tested the performance with different numbers of parts. Intuitively, a larger number of parts can capture richer information. Table 1 shows the detection accuracy for the experiments with various numbers of parts. Occlusion is one of the reasons for the poor performance of the model when the number of parts is small. Larger numbers yield better performance, but the computational load increases greatly; it is therefore necessary to develop an algorithm that selects the best number of parts in the future.

Table 1  Detection accuracy (%) for the experiments with different numbers of parts

Number of parts       5       10      15
Detection accuracy    85.39   90.17   91.85

Next, we compared our model with an AdaBoost-based detector, one of the most widely used vehicle detection methods in the ITS community. In our experiment, we used the approach presented in [18]. First, a set of candidate regions of interest is generated according to the vanishing points and edge information; then an SVM classifier is trained on boosted Gabor features. In this approach, the features extracted from the candidate regions describe the holistic appearance characteristics of the vehicle. In this experiment, we used the images in [18] for training and tested the classifier on the same images; a detection accuracy of 87.3% was reached. Figure 3 shows some results of the comparison.

Figure 3  Some results of comparison between the proposed method and the AdaBoost-based detector. The images in the left column show the results of our approach: small circles indicate the interest points on the vehicles, while interest points from the background are represented by crosses. The images in the right column give the results of the AdaBoost-based detector; the rectangles in these images are the candidate regions of interest. The dark rectangles in all images are the bounding boxes that contain vehicles.
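As an illustration of the acceptance test above, here is a minimal sketch of the 30% overlap criterion for axis-aligned boxes. The original text does not specify the normalization, so measuring the intersection against the ground-truth area is an assumption of this sketch:

```python
def overlap_ratio(det, gt):
    """Intersection area over ground-truth area for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))  # horizontal intersection
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))  # vertical intersection
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return ix * iy / gt_area

def is_correct_detection(det, gt, threshold=0.30):
    """A detection counts as correct if it overlaps more than 30% of the ground truth."""
    return overlap_ratio(det, gt) > threshold

assert is_correct_detection((0, 0, 80, 80), (30, 30, 100, 100))       # ~51% overlap
assert not is_correct_detection((0, 0, 60, 60), (30, 30, 100, 100))   # ~18% overlap
```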
5  Conclusions and future work

In this paper, we present an on-road vehicle detection algorithm based on a hidden random field. The vehicle instance is represented by a set of parts, which can be found automatically by our model. With this representation and carefully selected pairwise features, our model can detect occluded instances. Testing results on our database indicate its effectiveness. One disadvantage of the proposed method is its efficiency: labelling all the local observations is relatively slow, and we will investigate more efficient inference algorithms. Additionally, our approach is not very tolerant to scale changes; therefore, multiscale features and appropriate modifications of the model should be studied in the future.
Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 90920301).
References

1 Sun Z H, Bebis G, Miller R. On-road vehicle detection: a review. IEEE Trans Pattern Anal Mach Intell, 2006, 28: 694–711
2 Aytekin B, Altug E. Increasing driving safety with a multiple vehicle detection and tracking system using ongoing vehicle shadow information. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Istanbul, Turkey, 2010. 3650–3656
3 Sivaraman S, Trivedi M M. Active learning based monocular vehicle detection for on-road safety systems. In: Proceedings of IEEE Intelligent Vehicles Symposium, Xi'an, China, 2009. 399–404
4 Crandall D, Felzenszwalb P, Huttenlocher D. Spatial priors for part-based recognition using statistical models. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005. 10–17
5 Bergtholdt M, Kappes J, Schmidt S, et al. A study of parts-based object class detection using complete graphs. Int J Comput Vision, 2010, 87: 93–117
6 Agarwal S, Roth D. Learning a sparse representation for object detection. In: Proceedings of European Conference on Computer Vision, Copenhagen, Denmark, 2002. 97–101
7 Sivic J, Russell B, Efros A, et al. Discovering objects and their locations in images. In: Proceedings of IEEE International Conference on Computer Vision, Beijing, China, 2005. 370–375
8 Ronfard R, Schmid C, Triggs B. Learning to parse pictures of people. In: Proceedings of European Conference on Computer Vision, Copenhagen, Denmark, 2002. 700–714
9 Ramanan D, Forsyth D A, Zisserman A. Strike a pose: tracking people by finding stylized poses. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005. 271–278
10 Fergus R, Perona P, Zisserman A. Object class recognition by unsupervised scale-invariant learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, 2003. 39–45
11 Kumar S, Hebert M. Discriminative random fields. Int J Comput Vision, 2006, 68: 179–201
12 Szummer M. Learning diagram parts with hidden random fields. In: Proceedings of IEEE International Conference on Document Analysis and Recognition, Edinburgh, Scotland, 2003. 1188–1193
13 Kumar S, Hebert M. Multiclass discriminative fields for part-based object detection. In: Proceedings of Snowbird Learning Workshop, Utah, USA, 2004
14 Winn J, Shotton J. The layout consistent random field for recognizing and segmenting partially occluded objects. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 2006. 37–42
15 He X M, Zemel R, Carreira-Perpinan M. Multiscale conditional random fields for image labelling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, USA, 2004. 695–702
16 Cheng H, Zheng N N, Zhang X T, et al. Interactive road situation analysis for driver assistance and safety warning systems: frameworks and algorithms. IEEE Trans Intell Transport Syst, 2007, 8: 157–167
17 Lowe D. Distinctive image features from scale-invariant keypoints. Int J Comput Vision, 2004, 60: 91–110
18 Cheng H, Zheng N N, Sun C, et al. Boosted crucial Gabor features applied to vehicle detection. In: Proceedings of IEEE International Conference on Pattern Recognition, Hong Kong, China, 2006. 662–665

ZHANG XueTao was born in 1981. He received the bachelor's degree in information engineering and the master's degree in automation science and technology from Xi'an Jiaotong University, Xi'an, China, in 2003 and 2006, respectively. He is now a Ph.D. candidate at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include computer vision and pattern recognition, especially object detection and recognition and probabilistic graphical models.

HE YongJian was born in 1975. He is a Ph.D. candidate at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China. Currently, he is a teacher at Xi'an Communication Institute, Xi'an, China, and an expert of the General Staff Innovation Workstation. His research interests include pattern recognition, artificial intelligence, computer vision and image processing.

WANG Fei was born in 1975. He received the master's degree in communication and information systems from Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 2002, and the Ph.D. degree in pattern recognition and intelligent systems from Xi'an Jiaotong University, Xi'an, China, in 2009. Currently, he is an associate professor at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include machine vision, shape matching and retrieval, and computer graphics. Dr. Wang is a member of the IEEE Computer Society and a member of CCF YOCSEF.
Supporting Information

122011-660-video  The supporting information is available online at info.scichina.com and www.springerlink.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.