SCIENCE CHINA Information Sciences
RESEARCH PAPERS · Special Focus
December 2011 Vol. 54 No. 12: 2522–2529 doi: 10.1007/s11432-011-4493-3
Part-based on-road vehicle detection using hidden random field

ZHANG XueTao1, HE YongJian1,2 & WANG Fei1*

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China;
2 Xi'an Communication Institute, Xi'an 710049, China

*Corresponding author (email: [email protected])

Received June 15, 2011; accepted September 26, 2011
Abstract  This paper addresses the problem of detecting on-road vehicles in still images captured by onboard cameras. We model detection as a labelling inference procedure and incorporate a part-based representation of vehicle rear-ends within a probabilistic model based on a hidden random field. Representing objects with parts is inherently well suited to dealing with occlusions. In the proposed model, the part labels form a hidden layer in the graphical model, and our approach can find the latent parts automatically, without explicit part annotations during training. Experiments on a database of real images yield promising results.

Keywords  vehicle detection, hidden random field, part-based model
Citation Zhang X T, He Y J, Wang F. Part-based on-road vehicle detection using hidden random field. Sci China Inf Sci, 2011, 54: 2522–2529, doi: 10.1007/s11432-011-4493-3
1  Introduction
Vehicle detection based on vision sensors is a basic technology for autonomous vehicles. In order to drive fast and safely, the ego-vehicle should monitor the behavior of surrounding vehicles; among them, the vehicles in front are the most important for safety. In addition, the environment perception system of an autonomous vehicle relies heavily on the performance of vehicle detection. However, since the shape and appearance of vehicles and the lighting conditions vary greatly across scenes, it is crucial to develop computer vision algorithms that can precisely and robustly detect surrounding vehicles, providing effective information for the autonomous vehicle system. This paper addresses the problem of detecting both clean and partially occluded front vehicles using an onboard camera. We model both local parts and structure for vehicle detection under a discriminative graphical model. The proposed model is powerful in coping with uncertainties due to occlusion and noise, which often occur in urban traffic scenes. Moreover, the model can discover parts automatically, which greatly relieves the effort of collecting training images. Compared with sliding window based approaches, representing objects by a sparse set of parts can exploit the spatial interactions between parts, which not only helps object detection but also allows detection under partial occlusion. For example, knowing the location of the left rear wheel can help find the right one. Furthermore, using the information from the left rear wheel and the top-left corner of the car's rear-end, together with their interaction, we can confidently infer that there is a car.
A key aspect of the proposed learning framework for the part-based detection task is that parts are found automatically, and therefore only coarse labels of training images are required. This can greatly reduce the human labeling effort, as only the vehicle instances in the images need to be manually annotated. Although the learned parts are not guaranteed to be semantically meaningful, the algorithm models whichever parts are automatically chosen for the task, so detection can be more accurate.

In general, most approaches to vehicle detection can be divided into two stages: a feature-based stage to generate vehicle candidates, and an appearance-based stage for validation. Usually, the latter uses pre-designed vehicle templates to capture holistic characteristics of the vehicle region. A thorough review of this topic can be found in [1]. Recently, Aytekin and Altug developed a vehicle detection and tracking system in which the shadows of vehicles were used to generate candidate regions, which were then verified by detecting vertical edges [2]. Song et al. extracted regions of interest by edge-based texture analysis and classified these candidates using the AdaBoost algorithm. Sivaraman and Trivedi trained an AdaBoost classifier with Haar-like features in an active learning framework [3]. However, this kind of approach is sensitive to uncertainties caused by background noise and occlusion. In urban areas, the large amount of edge information from buildings in the background degrades the candidate generation stage. Moreover, as the validation requires a complete description of the vehicle region, incomplete information is harmful to the classifier.

Another set of vehicle detection methods is known as local methods, which represent an object as a collection of parts. These approaches have been widely studied recently in the computer vision and machine learning communities. Usually, considerable human effort is needed to label the parts used in classifier training [4, 5]. Other methods learn the parts by clustering similar image patches and form a "bag of words" representation that ignores the geometric structure entirely [6, 7]. Additional structural constraints can also be added to the model, which is helpful for discriminating the object [8, 9]. While generative approaches successfully model the distribution of spatially coherent parts [10], discriminative models can be more effective when the relations between observations and labels are complicated and hard to describe in a generative way [5, 11].
2  Probabilistic graphical model
We aim at locating all the instances of vehicles in a still image. As the image is captured by an in-car forward-looking camera, only the rear-ends of leading vehicles can be seen. Traditional methods are based on sliding window classifiers, which use common classifiers, such as SVM and AdaBoost, to determine whether a local window or patch contains an instance of a vehicle. However, it is often the case that one vehicle is occluded by another, and classifiers based on holistic appearance information cannot correctly distinguish these partial vehicle patches from the background. Therefore, we resort to part-based models, in which various parts of the image are classified into object parts or background. We adopt a probabilistic framework based on the recently proposed hidden random field (HRF) [12], in which the object is modeled by a flexible collection of parts conditioned on image observations. In our approach, local observations are extracted around detected interest points. Our task is to infer the class label (i.e., object or background) of each observation. Let V denote the collection of all local observations. Each observation corresponds to a site in the random field. The observed features extracted from these sites are represented by x = {x_i}, i ∈ V. For each site i ∈ V, there is a class label y_i ∈ {0, 1}, where y_i = 1 indicates that the observation associated with this site comes from an instance of a vehicle, and y_i = 0 from the background. Note that, at this stage, we do not indicate which instance a local observation belongs to; some post-processing operations, which are used to separate instances, will be introduced in section 4. One possible way to model the interaction between neighboring sites is to smooth the labels; that is, to assume that neighboring sites usually have the same class labels. However, in this situation the performance relies heavily on single-site classifiers, so some strong general local features are required, and it is hard to incorporate the geometric structure of the object into the probabilistic framework. In the present work, we use the HRF. The model augments the conditional random field with a hidden layer which is not observed during training.
Figure 1  The proposed graphical model. All part variables h are conditioned on the image x. There are also links between the class labels y and h. Note that, in our situation, the graph is irregular.
The variables in this layer are used to indicate which parts the local observations belong to. Interactions are then constructed between these part variables, which enhances the modeling power of the framework. Let h = {h_i}, i ∈ V, denote the set of these variables; each h_i indicates the part the ith observation belongs to. In this paper, we extend the HRF by replacing the linear functions in the unary potentials with logistic regression functions [11, 13]. Although other classifiers could also be used, such as the random decision trees in [14] and the neural network in [15], a separate step would be needed to learn the parameters in the unary potentials; by modeling with logistic regression classifiers, these parameters can be learned simultaneously within the HRF framework. Another difference is that we carefully choose edge features for the pairwise potentials, which incorporate both appearance and geometric information on the edges. An example of the graphical model is depicted in Figure 1. The conditional distribution of the class labels y and part labels h given the image x is defined as

$$P(y, h \mid x; \theta) = \frac{1}{Z(\theta, x)} \prod_{i \in V} e^{\phi(h_i, x)}\, \delta(y(h_i) = y_i) \prod_{(i,j) \in E} e^{\psi(h_i, h_j, x)}, \qquad (1)$$
where θ represents the learned parameters of the model, and E is the set of edges between pairs of neighboring part labels. The unary potentials φ(h_i, x) measure the compatibility between local observations and part labels; this measurement takes the form of multiclass logistic regression. The pairwise potentials ψ(h_i, h_j, x) measure the compatibility between neighboring part labels given the local observations. Finally, the potentials δ(y(h_i) = y_i) enforce the deterministic mapping from part labels to class labels. Our final purpose is therefore to estimate the posterior probability distribution of the class labels for all the local observations, that is,

$$P(y \mid x; \theta) = \sum_h P(y, h \mid x; \theta). \qquad (2)$$
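To make the factorization in (1) and the marginalization in (2) concrete, the following is a minimal sketch (not the authors' implementation) that evaluates P(y | x; θ) by brute-force enumeration over the hidden part labels on a toy graph. The potential values, the graph, and the part-to-class mapping y(·) are all placeholder assumptions; enumeration is tractable only because the toy graph is tiny, and the paper instead uses belief propagation (section 3).

```python
import itertools
import numpy as np

# Toy setup: 3 sites, part labels {0, 1, 2}, where part 0 is the background part.
# part_to_class implements the deterministic mapping y(h_i): part 0 -> class 0.
num_sites, num_parts = 3, 3
edges = [(0, 1), (1, 2)]
part_to_class = lambda h: 0 if h == 0 else 1

rng = np.random.default_rng(0)
unary = rng.normal(size=(num_sites, num_parts))      # placeholder phi(h_i, x) values
pairwise = rng.normal(size=(num_parts, num_parts))   # placeholder psi(h_i, h_j, x) values
pairwise = (pairwise + pairwise.T) / 2               # symmetric interactions, as in section 2.2

def joint_score(h):
    """Unnormalized product from Eq. (1), without the delta factors."""
    log_s = sum(unary[i, h[i]] for i in range(num_sites))
    log_s += sum(pairwise[h[i], h[j]] for i, j in edges)
    return np.exp(log_s)

assignments = list(itertools.product(range(num_parts), repeat=num_sites))
Z = sum(joint_score(h) for h in assignments)         # partition function Z(theta, x)

def class_posterior(y):
    """P(y | x; theta) from Eq. (2): sum over part assignments consistent with y."""
    return sum(joint_score(h) for h in assignments
               if all(part_to_class(h[i]) == y[i] for i in range(num_sites))) / Z

print(class_posterior((1, 1, 0)))                    # posterior of one class labelling
```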
2.1  Unary potentials
For the unary potentials, instead of a linear function, we use the multiclass version of the logistic function for its simplicity and efficiency. For the ith local observation in image x, the potential is modeled using a local discriminative model that outputs the association of this site with part class h_i. Thus the potential function φ(h_i, x) can be written as

$$\phi(h_i, x) = \sum_{k=0}^{H} \delta(h_i = k) \log F(h_i = k \mid x), \qquad (3)$$
and the function F is in the form of multiclass logistic regression. For each site i, let f_i(x) be a function that maps the observations x to a feature vector, f_i : x → R^l, and add a bias term to the model which is always one. The final unary feature vector is therefore defined as g_i(x) = [1, f_i(x)]. Note that in the case of object detection, the vector g_i(x) encodes the appearance-based features of the ith site (or part). The multiclass logistic regression function then has the following form:
$$F(h_i = k \mid x) = \begin{cases} \dfrac{\exp(w_k^T g_i(x))}{1 + \sum_{l=1}^{H} \exp(w_l^T g_i(x))}, & \text{if } k > 0, \\[2ex] \dfrac{1}{1 + \sum_{l=1}^{H} \exp(w_l^T g_i(x))}, & \text{if } k = 0. \end{cases} \qquad (4)$$
Here the w_k are the model parameters for k = 1, . . . , H, and H is the number of parts used to describe the object. For an (H + 1)-class classification problem, one needs only H independent hyperplanes in the feature space, so the parameter w_k for k = 0 is set to 0.
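As an illustration of (4), the following sketch, assuming plain NumPy and placeholder feature and parameter values, evaluates the unary classifier with the bias term prepended and w_0 fixed to zero:

```python
import numpy as np

def unary_probabilities(f_i, W):
    """F(h_i = k | x) from Eq. (4).

    f_i : feature vector f_i(x) of length l (placeholder values in this sketch).
    W   : H x (l + 1) parameter matrix; row k-1 holds w_k for k = 1..H.
          w_0 is implicitly zero, so part 0 contributes exp(0) = 1 to the numerator.
    """
    g_i = np.concatenate(([1.0], f_i))             # bias term: g_i(x) = [1, f_i(x)]
    scores = W @ g_i                               # w_k^T g_i(x) for k = 1..H
    denom = 1.0 + np.sum(np.exp(scores))           # 1 + sum_l exp(w_l^T g_i(x))
    return np.concatenate(([1.0], np.exp(scores))) / denom  # probabilities for k = 0..H

H, l = 4, 176                                      # e.g. the 176-dim unary features of section 4.2
rng = np.random.default_rng(0)
probs = unary_probabilities(rng.normal(size=l), rng.normal(size=(H, l + 1)) * 0.01)
assert np.isclose(probs.sum(), 1.0)                # a proper distribution over the H + 1 parts
```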
2.2  Pairwise potentials
The pairwise potentials model the interaction between the part labels at two neighboring sites given the observations. We use the multiclass generalized Potts model for these potentials:

$$\psi(h_i, h_j, x) = \sum_{k=0}^{H} \sum_{l=0}^{H} v_{kl}^T q_{ij}(x)\, \delta(h_i = k)\, \delta(h_j = l). \qquad (5)$$
Here, q_ij(x) is the pairwise relational vector for a site pair (i, j), and the v_kl are the model parameters. In this HRF-based model, the image data can be used in the constraints that model the interaction between neighboring sites, so both geometric and appearance features can be included in the pairwise potentials. In our model, we do not consider asymmetric interactions between the parts, e.g., that some part is always above another part. This implies that v_kl = v_lk for k, l = 0, 1, . . . , H. The interaction potential in (5) is a generalization of the Potts model: when v_kl = 0 for k ≠ l, and all the elements of the vector v_kk are set to zero except the bias term, we recover the Potts model from (5).
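The following sketch, again with placeholder parameters, evaluates the generalized Potts potential (5) and shows how zeroing the off-diagonal v_kl and keeping only the diagonal bias recovers the ordinary Potts model; the vector q_ij is a stand-in for the 58-dimensional pairwise features of section 4.2.

```python
import numpy as np

def pairwise_potential(h_i, h_j, q_ij, V):
    """psi(h_i, h_j, x) from Eq. (5): only the (h_i, h_j) term of the double sum survives."""
    return V[h_i, h_j] @ q_ij

H, d = 4, 58
rng = np.random.default_rng(0)
V = rng.normal(size=(H + 1, H + 1, d))             # placeholder v_kl vectors
V = (V + V.transpose(1, 0, 2)) / 2                 # symmetry v_kl = v_lk (no asymmetric interaction)

q_ij = np.concatenate(([1.0], rng.normal(size=d - 1)))  # bias element first, as in g_i(x)

# Special case: ordinary Potts model. Off-diagonal v_kl are zero; each diagonal v_kk keeps
# only the bias element, so psi = gamma when h_i == h_j and 0 otherwise.
gamma = 0.5
V_potts = np.zeros_like(V)
for k in range(H + 1):
    V_potts[k, k, 0] = gamma

assert pairwise_potential(1, 1, q_ij, V_potts) == gamma
assert pairwise_potential(1, 2, q_ij, V_potts) == 0.0
```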
3  Parameter learning and inference
The set of parameters of our model is θ = {w_{k=1,...,H}, v_{kl=1,...,H}}. To prevent over-fitting, we use the maximum a posteriori (MAP) criterion to estimate these parameters, assuming a Gaussian prior over the parameters, i.e., P(θ) = N(θ; 0, σ²I). Given a set of N i.i.d. labeled training images D = {(x^{(1)}, y^{(1)}), . . . , (x^{(N)}, y^{(N)})}, the MAP estimates of the parameters are obtained by minimizing the following objective function:

$$L(\theta) = -\sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}; \theta) + \frac{1}{2\sigma^2} \theta^T \theta = -\sum_{n=1}^{N} \log \frac{\sum_h P(y^{(n)}, h \mid x^{(n)}; \theta)}{Z(\theta)} + \frac{1}{2\sigma^2} \theta^T \theta, \qquad (6)$$

where

$$P(y^{(n)}, h \mid x^{(n)}; \theta) = \prod_{i \in V} e^{\phi_i(h_i, x^{(n)})}\, \delta(y(h_i) = y_i^{(n)}) \prod_{(i,j) \in E} e^{\psi_{ij}(h_i, h_j, x^{(n)})}. \qquad (7)$$
Thus, we can learn the parameters using gradient descent. Define the log likelihood of the nth training image as l_n(θ) = log P(y^{(n)} | x^{(n)}; θ). Its derivatives with respect to the parameters w_k in the unary potential functions can be written as

$$\frac{\partial l_n(\theta)}{\partial w_k} = \frac{1}{P(y^{(n)} \mid x^{(n)}; \theta)} \sum_h \frac{\partial P(y^{(n)}, h \mid x^{(n)}; \theta)}{\partial w_k} = \sum_{i \in V} \left( E_{P(h_i \mid x^{(n)}, y^{(n)}; \theta)}\!\left[\frac{\partial \phi_i}{\partial w_k}\right] - E_{P(h_i \mid x^{(n)}; \theta)}\!\left[\frac{\partial \phi_i}{\partial w_k}\right] \right). \qquad (8)$$
It follows that the derivatives can be computed using P(h_i = k | x^{(n)}, y^{(n)}; θ) and P(h_i = k | x^{(n)}; θ), which can be obtained by loopy belief propagation (LBP). Note that there are two types of nodes in the graph, i.e., the class label variables and the part label variables; when inferring the marginals, the message passing among both types of variables should be considered. Similarly, the derivatives with respect to the parameters v_kl in the pairwise potential functions can be written as

$$\frac{\partial l_n(\theta)}{\partial v_{kl}} = \sum_{(i,j) \in E} \left( E_{P(h_i = k, h_j = l \mid x^{(n)}, y^{(n)}; \theta)}\!\left[\frac{\partial \psi_{ij}}{\partial v_{kl}}\right] - E_{P(h_i = k, h_j = l \mid x^{(n)}; \theta)}\!\left[\frac{\partial \psi_{ij}}{\partial v_{kl}}\right] \right). \qquad (9)$$
These derivatives can also be expressed in terms of marginal probabilities that can be obtained using LBP.
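The overall training loop can then be summarized as below. This is a minimal sketch under stated assumptions, not the authors' code: the clamped-minus-unclamped gradient of the log likelihood, whose marginals the paper obtains via LBP, is passed in as a black box, and the learning rate is a placeholder.

```python
import numpy as np

def map_gradient_step(theta, grad_loglik, sigma2, lr=0.01):
    """One gradient-descent step on the MAP objective L(theta) of Eq. (6).

    grad_loglik : gradient of sum_n l_n(theta), assembled from Eqs. (8)-(9) as the
                  difference between clamped expectations under P(h | x, y; theta) and
                  unclamped expectations under P(h | x; theta); the required marginals
                  are obtained with loopy belief propagation in the paper.
    sigma2      : variance of the Gaussian prior; it contributes theta / sigma2 to the
                  gradient of the regularizer (1 / 2 sigma^2) theta^T theta.
    """
    grad_L = -grad_loglik + theta / sigma2
    return theta - lr * grad_L

# Placeholder illustration: with a zero log-likelihood gradient, the update simply
# shrinks theta toward the prior mean, as the regularizer in Eq. (6) dictates.
theta = np.ones(10)
theta = map_gradient_step(theta, grad_loglik=np.zeros(10), sigma2=1.0)
assert np.allclose(theta, 0.99)
```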
4  Experimental results

4.1  Data set
In order to evaluate the proposed approach effectively, we collected images with a camera mounted on our testing vehicle [16]. The camera looks forward, so the rear-ends of vehicles driving in front of the ego-vehicle can be seen in the image. All the images in our dataset were captured on the roads around Xi'an city, under different weather conditions. There are 842 images in total, each of size 480×270. Most of the images contain more than one vehicle: 586 images contain two instances, 107 contain three instances, and the rest contain one instance, for a total of 1642 vehicles in the data set. We only consider the situation where one vehicle is occluded by another; there are 118 images of this kind. During training, we only use images with no occlusion. Therefore, in our experiment, we randomly chose 500 such images for training, each containing one, two or three vehicles. Additionally, we need a validation set to determine the regularization factor, so we randomly split the remaining images into two sets, each containing 172 images. We thus have 969 vehicles for training, 317 for validation and 356 for testing. The occurrences of vehicle instances in the images are indicated by manually obtained bounding boxes. For the occluded vehicles, the size of the bounding boxes is estimated from the visible parts. The interest points inside the bounding boxes are considered object points, i.e., their labels are y_i = 1. Figure 2 shows some example images from the database.

Figure 2  Example images in the database.
4.2  Feature extraction
In this paper, the part labels of local observations are not known during training. For convenience, we need only one type of feature vector for all object parts. This descriptor should be robust to scale, illumination and color changes, occlusion and other uncertainties. SIFT features [17] meet these requirements and can be incorporated into our model easily. Therefore, for the unary features, we compute
• a SIFT descriptor with 4×4 spatial and 8 orientation bins;
• color features with 4×4 spatial bins in the CIELAB color space, computed within a window at each site.
In total, we get a feature vector with 4 × 4 × 8 + 4 × 4 × 3 = 176 elements for each site.
We also utilize appearance information between two neighboring sites for the pairwise potentials. We compute both appearance and geometric features as follows:
• the maximum ratio between the scales of two neighboring local observations, i.e., max(s_i/s_j, s_j/s_i), where s_i and s_j are the scales under which the SIFT descriptors are computed;
• the direction from the lower left site to the upper right one;
• the pairwise appearance features extracted from the region between the two sites. The region is an oriented rectangle whose direction is determined by the positions of the two sites. The long edge length of the rectangle is the distance between the sites, and the short edge length is half of the long one. This region is resized to 32 × 32 so that we can compute a histogram of gradient orientations with 8 bins;
• the difference between the color features of the two neighboring sites.
Thus, the feature vector for each pair of neighboring sites is 1 + 1 + 8 + 4 × 4 × 3 = 58 in length.
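As a sanity check on the dimensions above, the following sketch assembles the two descriptors. The extraction routines are stand-ins (random arrays rather than actual SIFT, CIELAB or gradient computations), so only the layout and lengths reflect the paper:

```python
import numpy as np

def unary_features(sift_desc, lab_patch):
    """176-dim unary descriptor: 4x4x8 SIFT bins plus 4x4 CIELAB color bins (3 channels)."""
    assert sift_desc.shape == (4, 4, 8)        # standard 128-dim SIFT layout
    assert lab_patch.shape == (4, 4, 3)        # color features per spatial bin
    return np.concatenate([sift_desc.ravel(), lab_patch.ravel()])   # 128 + 48 = 176

def pairwise_features(s_i, s_j, direction, hog8, color_i, color_j):
    """58-dim pairwise descriptor: scale ratio + direction + 8-bin HOG + color difference."""
    scale_ratio = max(s_i / s_j, s_j / s_i)    # symmetric in the two sites
    return np.concatenate([[scale_ratio, direction], hog8, color_i - color_j])

rng = np.random.default_rng(0)
u = unary_features(rng.random((4, 4, 8)), rng.random((4, 4, 3)))
p = pairwise_features(1.5, 2.0, 0.3, rng.random(8), rng.random(48), rng.random(48))
assert u.size == 176 and p.size == 58          # 1 + 1 + 8 + 48 = 58
```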
4.3  Performance evaluation
Our model was trained on the data set described above. We use the MAP criterion for training and select the variance parameter σ² of the Gaussian prior by comparing the performance of the model on the validation set. Additionally, as we do not distinguish which instance the local observations belong to, a postprocessing step is needed to separate multiple instances. This was accomplished by a simple clustering procedure. For each testing image, we inferred not only the class labels but also the part labels of the local observations. The postprocessing was then performed as follows:
• We clustered the non-background sites according to the inferred class labels. Isolated clusters were considered noise, and all their associated local observations were removed.
• The remaining non-background sites were then clustered according to the inferred part labels.
• Beginning with the smallest part label, we connected neighboring parts in increasing order until no non-repeated part was left.
• Finally, each set of connected parts was considered a detected vehicle.
Then we compared the maximal bounding box of each detection with the ground truth, and a detection with more than 30% area overlap is considered correct, as illustrated by the sketch after this section.

In this work, the number of parts was set manually. We first tested the performance with different numbers of parts. Intuitively, a larger number of parts can capture richer information. Table 1 shows the detection accuracy for the experiments with various numbers of parts. Occlusion is one of the reasons for the poor performance of the model when the number of parts is small. Larger numbers yield better performance, but the computational load increases greatly; it is therefore necessary to develop an algorithm that selects the best number of parts in the future.

Table 1  Detection accuracy (%) for the experiments with different numbers of parts

Number of parts       5       10      15
Detection accuracy    85.39   90.17   91.85

Next, we compared our model with an AdaBoost-based detector, one of the most widely used vehicle detection methods in the ITS community. In our experiment, we used the approach presented in [18]. First, a set of candidate regions of interest is generated according to the vanishing points and edge information; then an SVM classifier is trained on boosted Gabor features. In this approach, the features extracted from the candidate regions describe the holistic appearance characteristics of the vehicle. In this experiment, we used the images in [18] for training and tested the classifier on the same images; a detection accuracy of 87.3% was reached. Figure 3 shows some results of the comparison.

Figure 3  Some results of comparison between the proposed method and the AdaBoost-based detector. The images in the left column show the results of our approach: small circles indicate the interest points on the vehicles, while interest points from the background are represented by crosses. The images in the right column give the results of the AdaBoost-based detector; the rectangles in these images are the candidate regions of interest. The dark rectangles in all images are the bounding boxes that contain vehicles.
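As an illustration of the acceptance test above, here is a minimal sketch of the 30% overlap criterion for axis-aligned boxes. The original text does not specify the normalization, so measuring the intersection against the ground-truth area is an assumption of this sketch:

```python
def overlap_ratio(det, gt):
    """Intersection area over ground-truth area for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))  # horizontal intersection
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))  # vertical intersection
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return ix * iy / gt_area

def is_correct_detection(det, gt, threshold=0.30):
    """A detection counts as correct if it overlaps more than 30% of the ground truth."""
    return overlap_ratio(det, gt) > threshold

assert is_correct_detection((0, 0, 80, 80), (30, 30, 100, 100))       # ~51% overlap
assert not is_correct_detection((0, 0, 60, 60), (30, 30, 100, 100))   # ~18% overlap
```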
5  Conclusions and future work

In this paper, we present an on-road vehicle detection algorithm based on a hidden random field. The vehicle instance is represented by a set of parts, which can be found automatically by our model. With this representation and carefully selected pairwise features, our model can detect occluded instances. Testing results on our database indicate its effectiveness. One disadvantage of the proposed method is its efficiency: labelling all the local observations is relatively slow, and we will investigate more efficient inference algorithms. Additionally, our approach is not very tolerant to scale changes; therefore, multiscale features and appropriate modifications of the model should be studied in the future.
Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 90920301).
References

1 Sun Z H, Bebis G, Miller R. On-road vehicle detection: a review. IEEE Trans Pattern Anal Mach Intell, 2006, 28: 694–711
2 Aytekin B, Altug E. Increasing driving safety with a multiple vehicle detection and tracking system using ongoing vehicle shadow information. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Istanbul, Turkey, 2010. 3650–3656
3 Sivaraman S, Trivedi M M. Active learning based monocular vehicle detection for on-road safety systems. In: Proceedings of IEEE Intelligent Vehicles Symposium, Xi'an, China, 2009. 399–404
4 Crandall D, Felzenszwalb P, Huttenlocher D. Spatial priors for part-based recognition using statistical models. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005. 10–17
5 Bergtholdt M, Kappes J, Schmidt S, et al. A study of parts-based object class detection using complete graphs. Int J Comput Vision, 2010, 87: 93–117
6 Agarwal S, Roth D. Learning a sparse representation for object detection. In: Proceedings of European Conference on Computer Vision, Copenhagen, Denmark, 2002. 97–101
7 Sivic J, Russell B, Efros A, et al. Discovering objects and their locations in images. In: Proceedings of IEEE International Conference on Computer Vision, Beijing, China, 2005. 370–375
8 Ronfard R, Schmid C, Triggs B. Learning to parse pictures of people. In: Proceedings of European Conference on Computer Vision, Copenhagen, Denmark, 2002. 700–714
9 Ramanan D, Forsyth D A, Zisserman A. Strike a pose: tracking people by finding stylized poses. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005. 271–278
10 Fergus R, Perona P, Zisserman A. Object class recognition by unsupervised scale-invariant learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, 2003. 39–45
11 Kumar S, Hebert M. Discriminative random fields. Int J Comput Vision, 2006, 68: 179–201
12 Szummer M. Learning diagram parts with hidden random fields. In: Proceedings of IEEE International Conference on Document Analysis and Recognition, Edinburgh, Scotland, 2003. 1188–1193
13 Kumar S, Hebert M. Multiclass discriminative fields for part-based object detection. In: Proceedings of Snowbird Learning Workshop, Utah, USA, 2004
14 Winn J, Shotton J. The layout consistent random field for recognizing and segmenting partially occluded objects. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 2006. 37–42
15 He X M, Zemel R, Carreira-Perpinan M. Multiscale conditional random fields for image labelling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, USA, 2004. 695–702
16 Cheng H, Zheng N N, Zhang X T, et al. Interactive road situation analysis for driver assistance and safety warning systems: frameworks and algorithms. IEEE Trans Intell Transport Syst, 2007, 8: 157–167
17 Lowe D. Distinctive image features from scale-invariant keypoints. Int J Comput Vision, 2004, 60: 91–110
18 Cheng H, Zheng N N, Sun C, et al. Boosted crucial Gabor features applied to vehicle detection. In: Proceedings of IEEE International Conference on Pattern Recognition, Hong Kong, China, 2006. 662–665

ZHANG XueTao was born in 1981. He received the bachelor's degree in information engineering and the master's degree in automation science and technology from Xi'an Jiaotong University, Xi'an, China, in 2003 and 2006, respectively. He is now a Ph.D. candidate at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include computer vision and pattern recognition, especially object detection and recognition and probabilistic graphical models.

HE YongJian was born in 1975. He is a Ph.D. candidate at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China. Currently, he is a teacher at Xi'an Communication Institute, Xi'an, China, and an expert of the General Staff Innovation Workstation. His research interests include pattern recognition, artificial intelligence, computer vision and image processing.

WANG Fei was born in 1975. He received the master's degree in communication and information systems from Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 2002, and the Ph.D. degree in pattern recognition and intelligent systems from Xi'an Jiaotong University, Xi'an, China, in 2009. Currently, he is an associate professor at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include machine vision, shape matching and retrieval, and computer graphics. Dr. Wang is a member of the IEEE Computer Society and a member of CCF YOCSEF.
Supporting Information

122011-660-video  The supporting information is available online at info.scichina.com and www.springerlink.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.