SCIENCE CHINA Information Sciences
. RESEARCH PAPER .
March 2015, Vol. 58 032107:1–032107:13 doi: 10.1007/s11432-014-5151-3
Probabilistic modeling of scenes using object frames

SU Hao1,2* & YU Adams Wei3

1School of Mathematics, Beihang University, Beijing 100191, China;
2Department of Computer Science, Stanford University, Stanford, CA 94305, USA;
3Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Received February 11, 2014; accepted June 10, 2014; published online November 26, 2014
Abstract In this paper, we propose a probabilistic scene model using object frames, each of which is a group of co-occurring objects with fixed spatial relations. In contrast to standard co-occurrence models, which mostly explore the pairwise co-existence of objects, the proposed model captures the spatial relationships among groups of objects. Such information is closely tied to the semantics of the underlying scenes, which allows us to perform object detection and scene recognition in a unified framework. The proposed probabilistic model has two major components. The first models the dependencies between object frames and objects by adapting the Latent Dirichlet Allocation model from text analysis. The second component characterizes the dependencies between object frames and scenes by establishing a mapping between global image features and object frame distributions. Experimental results show that the induced object frames are both semantically meaningful and spatially consistent. In addition, our model significantly improves the performance of object recognition and scene retrieval.

Keywords Bayes network, scene understanding, object frame, probabilistic model
Citation Su H, Yu A W. Probabilistic modeling of scenes using object frames. Sci China Inf Sci, 2015, 58: 032107(13), doi: 10.1007/s11432-014-5151-3
1
Introduction
The detection and modeling of co-occurring objects in images is an active area of computer vision research. Existing work has focused on the use of object co-occurrences to improve the detection/segmentation of individual objects [1–6]. In this paper, we propose to model object co-occurrences in the context of both object recognition and scene recognition. We introduce the notion of object frames, each of which is a collection of co-occurring objects with fixed mutual spatial relations. This is motivated by the fact that the spatial relations between co-occurring objects, which indicate how these objects interact with each other, are closely tied to the underlying scene semantics. For example, when a lamp, table, and chair appear together in a workspace, they are often arranged in a similar pattern (see Figure 1). We introduce a probabilistic framework for modeling object frames. The proposed model has two major components. First, we adapt Latent Dirichlet Allocation [7], which is a popular model for document-topic-term relations in text analysis, to model the dependencies among images, object frames,

* Corresponding author (email: [email protected])
© Science China Press and Springer-Verlag Berlin Heidelberg 2014
Figure 1 (Color online) In two workspace scenes, we observe objects with a consistent spatial layout, e.g., lamps are on the tables, and chairs are adjacent to the tables. The three objects are correlated with each other by the function of the workspace. Our algorithm uses the concept of such object frames to represent images and recognize scenes.
and objects. Unlike most existing approaches, which utilize co-occurrence information as energy terms in an inference phase, our probabilistic model explicitly considers object frames as latent variables. This strategy enables the co-occurrence information to be utilized by tuning the probabilistic inference setting. The second component models the dependencies between object frames and scenes. Specifically, we represent scenes by their global image features, such as GIST [8]. Following the idea of Dirichlet Multinomial Regression [9], we learn a mapping from the global image features to the prior of the object frame distribution (OFD). This mapping not only serves as a regularizer in model learning, but also enables global features to contribute to the prediction of objects through object frames.

We conducted various experiments to analyze and evaluate our probabilistic model. The results, reported in Section 4, show that the learned object frames are both semantically meaningful and spatially consistent. We also show that our probabilistic model can be used to significantly improve the performance of object recognition and scene retrieval. We highlight our contributions as follows:

• The proposed model automatically discovers “object frames”, each of which consists of a group of co-occurring objects with consistent mutual spatial relations; this makes our model a useful tool for summarizing image collections.

• The proposed model naturally integrates information from individual object detectors and global image features. This integration not only improves the learning of “object frames,” but also improves the performance of object detection via probabilistic inference.

• The proposed model outputs a highly compact encoding of the image, in which rich semantic information is preserved. This encoding provides an effective means of retrieving images of similar scenes.
2
Related work
There is a large body of work in the field of scene modeling. In terms of image representation, existing work falls into two categories. The most prevalent approach is to build an image representation based upon statistics of low-level image properties (e.g., SIFT [10], filterbanks [11,12], and GIST [13]). An alternative approach is to encode images as structural compositions of semantically meaningful intermediate representations [14–17]. Pioneering work such as Classeme [18] and Object Bank [19] uses objects as the intermediate representations, and shows promising potential in capturing the semantics of images. Our work is similar in spirit to the latter approach: the proposed model uses groups of objects, i.e., object frames, as an intermediate representation of the image contents. The key difference is that we also model the spatial relations between objects.
Table 1  Variable definitions

Variable  Definition
w    Object
z    Object frame index
θ    Object frame distribution in an image
φ    Object distribution in object frames
α    Parameter of Dirichlet prior over θ
β    Parameter of Dirichlet prior over φ
λ    Parameter of logistic transfer function
f    Image feature
s    Detection score
T    Confidence of detection score
ψ    Topic model/background model switch
ρ    Parameter of Bernoulli prior over ψ
l    Observed object location
L    Latent object location in an image
c    Origin of object frame in an image
τ    Parameter of Gaussian prior over c
b    Object location relative to an object frame
Topic models have been widely used for scene modeling in computer vision [20–24]. Most approaches in this area are based on the bag of visual words (BoW) representation. The BoW representation first maps local appearance descriptors to a set of visual words by vector quantization, and then models the image as a histogram of these visual words. By drawing an analogy from visual words in images to words in documents, topic models can be utilized to discover the statistical patterns of visual word occurrences. Our approach models the topics of an image in a different manner. Specifically, we model the topics of images based on the presence of object frames, i.e., groups of co-occurring objects with consistent mutual spatial relations. Compared with the visual word-based approaches, our model more closely resembles how human beings understand scenes.

Another closely related line of research explores context in object and scene recognition [1–6,25–27]. However, most of these approaches only explore pairwise object co-occurrences. Two exceptions are recent work by Sadeghi and Farhadi [28], and Li et al. [29], who showed that recognizing groups of correlated objects can foster individual object detection. In general, our approach differs from these approaches in the modeling of object co-occurrence and subsequent object detection. Specifically, [28] requires meticulous human annotation in the training stage, whereas our model automatically groups objects into object frames, and [29] uses a Hough transform-based bottom-up method to greedily group co-occurring objects. Our probabilistic model exhibits two advantages. First, we take fuzzy object detectors as model input, while [28,29] require clean object labels for training. Second, our probabilistic model can utilize global image features to predict object frames. In [30], geometric phrases are built to parse the 3D scene layout from a single image.
The phrases are obtained by a heuristic approach, which neither explicitly models detector noise nor offers statistical soundness. As a side effect of the generative modeling principle of Bayes Nets, our approach supports the zero-detector setting (Section 4) and can guess the presence of objects, which is impossible for [29,30]. In the experimental section, we show that our model outperforms [29] in an object recognition experiment on the PASCAL VOC2007 dataset. It is worth noting that [31–34] also use probabilistic approaches to model the interaction between a scene and its objects. However, these approaches are purely generative, and hence lack the ability to leverage the recent progress in discriminative object detection algorithms.
3
The Bayes Net model
In this section, we propose a Bayes Net model. In a Bayes Net, each node corresponds to a random variable, and each edge encodes a dependency, i.e., a conditional probability among random variables. These conditional probabilities are determined in the learning phase. Once the model has been learned, we can use probabilistic inference to determine the assignment probabilities of the random variables. In the remainder of this section, we introduce our Bayes Net model, then describe how to train it and apply it to object detection. Table 1 and Figure 2 show the proposed Bayes Net model. We explain each step in the following subsections. Subsection 3.1 introduces the basic structure of our model, which is based on Latent Dirichlet
Figure 2 (Color online) Full Model. Our model has the ability to simultaneously consider the global image features, fuzzy object detection results, and spatial arrangement of objects in scenes under a unified Bayes Net framework. We use colors to indicate different groups of variables, i.e., red for global feature variables, orange for object spatial layout variables, and green for fuzzy object detection result variables. Details of each group of variables and their dependencies are given in Figures 4 and 5.
Figure 3  The Bayes Net of LDA, which is the basic structure of our model.
Allocation (LDA) [7]. Subsections 3.2–3.4 describe how this model is enhanced into the final model. In particular, Subsection 3.2 shows how global image features are incorporated into LDA, and Subsection 3.3 addresses fuzzy object labeling by incorporating object detector scores. Finally, Subsection 3.4 introduces our approach to modeling the spatial configuration of objects.

3.1 Basic structure
LDA [7] is a powerful model for extracting semantically valid topics from document collections. In LDA, each document is a mixture of topics, and each topic is simply a distribution over different terms (words or phrases). The LDA model has many favorable properties. For example, the topic distribution maximally preserves the semantic information, because the MAP estimation of the topic distribution minimizes the KL-divergence between the empirical word distribution and the predicted word distribution. Readers may refer to [35] for further theoretical analysis.

Similar to LDA, we model each image as a mixture of object frames, where each object frame is a collection of objects (see Figure 3). In our model, object frames correspond to topics in LDA, and objects correspond to terms. Note that our analogy is significantly different from most previous LDA-based models in computer vision, in which visual words correspond to terms. In the remainder of this paper, we will use w to denote objects, z to denote object frames, θ to denote the distribution of object frames in images, and φ to denote the object distribution in each object frame. θ and φ are sampled from Dirichlet priors parameterized by α and β, respectively. z and w are sampled from categorical distributions parameterized by θ and φz, respectively.
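As a concrete illustration, this generative process can be sketched in a few lines of numpy. The function name, the tiny two-frame/four-object vocabulary, and all numeric values below are our own invention for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_image(alpha, phi, n_objects):
    """Sample the objects of one image under the basic (LDA-style) model.

    alpha : (G,) Dirichlet prior over object frames
    phi   : (G, V) per-frame categorical distributions over object classes
    Returns the frame assignment z and object label w for each object slot.
    """
    G, V = phi.shape
    theta = rng.dirichlet(alpha)                 # object-frame distribution of the image
    z = rng.choice(G, size=n_objects, p=theta)   # frame index for each object slot
    w = np.array([rng.choice(V, p=phi[g]) for g in z])  # object class per slot
    return z, w

# Toy run: 2 object frames over a 4-object vocabulary.
phi = np.array([[0.70, 0.20, 0.05, 0.05],   # frame 0: mostly objects 0 and 1
                [0.05, 0.05, 0.20, 0.70]])  # frame 1: mostly objects 2 and 3
z, w = generate_image(alpha=np.ones(2), phi=phi, n_objects=5)
```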
3.2 Integrating global image features
Global features such as GIST are an effective means of characterizing similar scenes. As each image is modeled as a mixture of object frames, we enhance our basic model to characterize the dependencies between global image features and the OFD. In other words, we impose the assumption that images with similar global features have a similar OFD. Motivated by Dirichlet Multinomial Regression [9] in text analysis, we modify α (which is a constant in the standard LDA model) to depend on global image features:

αIt = log[1 + exp(λt^T fI)],
Figure 4 (Color online) (a) Variables and dependencies related to global image features in the full model; (b) variables and dependencies related to object detector scores and the background model in the full model.
where I and t index the images and object frames, respectively. fI is the vector containing global image features for image I, and λt is a latent vector variable associated with each object frame t. Note that we have used the logistic transfer function [36] lgt(λt^T fI), where lgt(y) = log[1 + exp(y)]. This can be viewed as a soft thresholding of y, and guarantees that αIt is positive. In this setting, λt can be roughly interpreted as the weight of the “predictor” for object frame t. Figure 4(a) shows the modified part of the Bayes Net.
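This feature-dependent prior is just a softplus applied to a linear predictor, and can be computed stably in one line of numpy. The function name and the toy shapes below are our own, not from the paper:

```python
import numpy as np

def frame_prior(F, Lam):
    """Map global image features to Dirichlet prior parameters over object frames.

    F   : (N, D) global feature vectors (e.g., GIST), one row per image
    Lam : (G, D) one weight vector lambda_t per object frame
    Returns alpha with alpha[I, t] = log(1 + exp(lambda_t^T f_I)), always > 0.
    """
    y = F @ Lam.T                    # (N, G) linear "predictor" scores
    return np.logaddexp(0.0, y)      # numerically stable log(1 + exp(y))

# Toy inputs: 2 images with 2-dimensional global features, 2 object frames.
F = np.array([[1.0, -2.0], [0.5, 0.5]])
Lam = np.array([[0.3, 0.1], [-0.2, 0.4]])
alpha = frame_prior(F, Lam)          # every entry is strictly positive
```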
3.3 Modeling noisy object labels using confidence scores
We expand our model to incorporate fuzzy object labels, i.e., object labels associated with confidence scores. Such object labels typically come from object detectors with calibrated detector responses representing confidence scores. In this setting, with the presence of noisy data, a good model should better explain the observations with higher confidence. Formally, let x = (w, s) be an observation, where w is an object and s ∈ [0, 1] is a score indicating the confidence of the observation. In the following, we introduce a background model to explain noisy object labels. We also introduce a switching variable ψ conditioned on s to select between the topic model and the background model (Figure 4(b)).

The background model. We assume that w is generated by a multinomial background distribution parameterized by m [37].

The topic/background switching variable ψ. For each observation x, let ψ be a switching variable that conforms to a Bernoulli distribution parameterized by ρ. When ψ = 1, we sample w using the topic model; otherwise, the background model is used. We further assume that w and s are conditionally independent given ψ, so that

P(w|s) = Σ_ψ P(w|ψ) P(ψ|s).  (1)
We have defined P(w|ψ) for the topic model (ψ = 1) and the background model (ψ = 0). Let us now consider P(ψ|s).

Modeling P(ψ|s). ψ is a Bernoulli variable conditioned on ρ. We add a conjugate prior for ρ conditioned on the detection score s, such that ρ ∼ Beta(T s + 1, T (1 − s) + 1). Here, T represents our trust in the confidence scores. By definition, a higher value of s pushes ρ closer to 1, and a higher value of T keeps ρ closer to the original detection score s.
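Under this Beta prior, the mean of ρ is (T s + 1)/(T + 2), which makes both effects easy to check numerically. The function name below is our own, hypothetical shorthand:

```python
def switch_prob(s, T):
    """Expected probability that an observation is routed to the topic model.

    With rho ~ Beta(T*s + 1, T*(1 - s) + 1), the mean of rho (and hence the
    expected Bernoulli switch probability) is (T*s + 1) / (T + 2).
    """
    return (T * s + 1.0) / (T + 2.0)

# A confident detection is more likely explained by the topic model...
assert switch_prob(0.9, T=10) > switch_prob(0.2, T=10)
# ...and a larger T keeps E[rho] closer to the raw score s.
assert abs(switch_prob(0.9, T=100) - 0.9) < abs(switch_prob(0.9, T=5) - 0.9)
```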
3.4 Conditional spatial Dirichlet multinomial regression: incorporating object locations in 3D
In this section, we enable our model to capture the spatial layout of objects in scenes. In real-world scenes, objects typically exhibit a consistent spatial layout. For example, microwaves usually appear on top of tables, whereas chairs appear adjacent to tables. Such patterns in scene layout have proven to be quite useful for scene recognition. For example, the Spatial Pyramid Model (SPM) significantly outperforms a BoW model by imposing a pyramid structure [38]. Most existing methods model 2D object locations by decomposing an image into grid cells and conducting a pooling process
[Figure 5 diagram: the object frame coordinate system (X′, Y′, origin C) inside the world coordinate system (X, Y).]

Figure 5  Left: Object frame coordinate system and world coordinate system. C is the origin of the object frame coordinate system (cg in our model), and its location conforms to a Gaussian distribution. CL is the relative location of an object in an object frame (bgw in our model), which conforms to a Gaussian distribution. Right: Variables and dependencies related to the spatial model in the full model.

[Figure 6 panels: learned object frames plotted as Y (m) versus log Z (log m); the panels show frames such as {curtain, wardrobe, bed, window}, {chair, table}, {mountain, sky, lake}, and {building, car, road}.]

Figure 6  Learned object frames and example images labeled with each object frame (best viewed in color and zoomed in). Each corner corresponds to one object frame. The top row shows objects in each object frame placed in the world coordinate system. The center of each arrow (object) is the expected location of the object in the world coordinate system, computed as E[l] = E[τg] + E[bgw]. The bottom row illustrates the object frame located in an example image, where the bounding boxes are given by object detectors.
within each cell. In contrast, we model the 3D location of objects, by which we can generate a more informative visualization (see Figure 5). To parameterize the 3D location of objects, we assign to each object frame a translational vector that describes the location of this object frame in a fixed world coordinate system. The location of an object is its relative location in the object frame plus the displacement of the object frame (see Figure 6). The
technical details of this parameterization are as follows.

Transforming 2D detection to 3D location. We assume that baseline object detectors already predict the locations of objects. Let w denote an object category, (x, y) be the coordinates of the center of the object bounding box, and W, H be the width and height of the box. Following the approach in [39], we can apply coordinate transformations to represent the object locations in 3D world coordinates:

lx = (x/H) Hw,  ly = (y/H) Hw,  lz = (f/H) Hw,  (2)
where f is the distance from the observer to the image plane, which is set to 1 by default. Hw is the physical height of an object of class w, and is assumed to be constant; this height is manually specified for each type of object. Following [25], we ignore lx, as the x-coordinate is not informative. We assume that ly and lz are independent, and model ly and lz as Gaussian and log-Gaussian distributions, respectively. Let k index the candidate windows generated by the baseline detectors. We can use the transformation described in (2) to find the location variable lwk = (ly, log lz). We further let Lwk = (Ly, log Lz) denote the latent (actual) location of the object, which is unobservable. Then, lwk|L ∼ N(Lwk, Σlw).

Modeling object frame location. As mentioned above, we create a world coordinate system for each object frame. Let cg = (ly, log lz) denote the origin of the coordinate system of object frame g. Then, c is parameterized as a Gaussian distribution: cg|τ ∼ N(τg, Σcg).

Relating the locations of objects and object frames. We use cg as the anchor point to represent the location of objects in object frame g. Specifically, we assume that bgw = (ly, log lz) is the prior location distribution of object w in object frame g, relative to the object frame center c. Then,

Lwk | c, g, z = g, ψwk = 1 ∼ N(cg + bgw, ΣLgw).  (3)
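The 2D-to-3D lifting of (2) and the frame-relative location of (3) can be sketched as follows. The function name and the assumed physical height of a chair are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def lift_detection(x, y, H, H_w, f=1.0):
    """Lift a 2D detection to 3D world coordinates via Eq. (2).

    (x, y) : center of the bounding box; H : box height in pixels
    H_w    : assumed physical height of the object class (manually specified)
    f      : observer-to-image-plane distance (1 by default)
    """
    l_x = x * H_w / H
    l_y = y * H_w / H
    l_z = f * H_w / H
    # l_x is discarded as uninformative; the model uses (l_y, log l_z).
    return np.array([l_y, np.log(l_z)])

# A chair (assumed H_w = 1.0 m) detected 100 px tall, centered at (50, -20):
loc = lift_detection(x=50.0, y=-20.0, H=100.0, H_w=1.0)

# Frame-relative modeling, as in Eq. (3): the expected object location is the
# object frame origin plus the frame-relative offset (toy values).
tau_g = np.array([0.0, 1.0])    # frame origin (l_y, log l_z)
b_gw = np.array([-0.2, 0.3])    # object's offset within the frame
expected_loc = tau_g + b_gw
```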
3.5 Model inference and application settings
As the proposed Bayes Net model incorporates non-conjugate priors, we must select an approximate inference algorithm. In our experiments, we use an adapted Variational Message Passing [40,41] algorithm. As variational inference for graphical models has been widely studied, several existing software packages accomplish this task. Our inference uses one such package, Infer.Net [40], developed by Microsoft Research Cambridge. Essentially, Infer.Net includes a set of conditional probabilities as building blocks, so that Bayes Nets constructed by combining these building blocks can be inferred. In particular, for our Bayes Net, all conditional probability functions are recognized by Infer.Net, except those related to global image features. Specifically, if λ is given (and hence α is fixed), then Infer.Net can conduct inference on all other variables. Note that if θ is given, then the optimal λ can be estimated by solving a logistic regression problem. Therefore, we maximize the posterior probability by applying an iterative optimization scheme, which alternates between the following two steps:

Step 1: Given the λ estimate from Step 2 of the previous iteration, use Variational Message Passing in Infer.Net to perform MAP inference on all other variables;

Step 2: Given the θ estimate from Step 1, solve a logistic regression problem to obtain λ.

This approach is essentially an alternating optimization scheme for maximizing the overall posterior probability. Hence, the posterior over all variables increases after every iteration, and the algorithm converges. The proposed Bayes Net model can address model learning and various scene recognition tasks (such as object recognition) by setting an appropriate level of observability for the random variables. In the following, we briefly describe how to set the observability of variables in the different inference tasks; this will be used in Section 4.

Model learning with object labels.
In this setting, we assume that ψ is observable, and the inference task is to estimate φ, the logistic transfer function parameter λ, and the parameters related to the spatial layout.

Model learning with missing object labels. In this setting, some values of ψ are unobservable, and the parameters to be estimated are as above. Because our model is generative, we can naturally impute
missing object labels. More specifically, if detector scores can be acquired, we can set informative priors by providing s and choosing an appropriate value of T. In the extreme, it is possible to learn a model purely from object detector scores.

Object recognition. We fix φ, λ, and spatial parameters such as τ and b. Our goal is to infer ρ. As before, if detection scores are provided, we can set informative priors by providing s.

Scene retrieval. We use the distribution of object frames θ as the image representation. Thus, we fix φ, λ, and spatial parameters such as τ and b, and our goal is to infer θ.
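The alternating scheme of Subsection 3.5 can be sketched abstractly. The skeleton below is generic: the toy quadratic objective and its closed-form block maximizers stand in for the true posterior, VMP inference, and logistic regression, and all names are our own:

```python
def alternating_map(infer_step, regress_step, objective, lam0, n_iters=20):
    """Skeleton of the two-step scheme: Step 1 infers the latent variables
    given lambda; Step 2 refits lambda given the inferred variables. Each
    step maximizes the same objective over its own block of variables, so
    the objective is non-decreasing and the loop converges."""
    lam, history = lam0, []
    for _ in range(n_iters):
        theta = infer_step(lam)      # Step 1: stand-in for MAP inference
        lam = regress_step(theta)    # Step 2: stand-in for logistic regression
        history.append(objective(theta, lam))
    return lam, history

# Toy stand-in objective: f(theta, lam) = -(theta - lam)^2 - (lam - 3)^2,
# whose block-wise maximizers are theta* = lam and lam* = (theta + 3) / 2.
lam_final, hist = alternating_map(
    infer_step=lambda lam: lam,
    regress_step=lambda theta: (theta + 3.0) / 2.0,
    objective=lambda theta, lam: -(theta - lam) ** 2 - (lam - 3.0) ** 2,
    lam0=0.0,
)
```

Each iteration provably does not decrease the toy objective, mirroring the convergence argument given above.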
4
Experiments
In this section, we present experimental results from our probabilistic model. We begin by describing the experimental setup, before reporting the results of a user study to evaluate the key features of our model. Finally, we show the effectiveness of our model in object detection and scene retrieval applications.

4.1 Experimental setup
In our experiments, we used a testbed of three datasets: SUN09, PASCAL VOC2007, and MIT Indoor. The SUN09 dataset [25] is particularly challenging, containing diverse objects and scenes. Over 100 object classes are annotated in this dataset, and about 8700 images have over 90% of their area covered by objects. A subset of the images includes scene labels. The PASCAL VOC2007 dataset is a standard benchmark for object detection tasks. There are over 9600 images for training, testing, and validation, and a total of 20 object classes are annotated. The MIT Indoor dataset is a benchmark for scene recognition tasks. It contains 67 scene classes and a total of 15620 images. For the SUN09 and MIT Indoor datasets, our object detectors were trained by a DPM model using external data1). The object detectors for the PASCAL VOC2007 dataset were trained using PASCAL VOC2007 training data2). For global features, we used GIST, HoG2x2, LBP, and Tiny Images. By default, we learned 16 object frames with β = 1.

4.2 Analysis of the learned probabilistic model
Object frames are semantically meaningful. We evaluated the meaningfulness of the learned object frames by performing a user study. For each object frame, we asked five participants how likely this group of objects was to be found together in the same scene. Possible scores ranged from 1 (never) to 9 (very likely). The score for each object frame was the average assigned by the five subjects. To simplify the user study, we reduced the number of objects by performing a hierarchical clustering of all objects. The merged objects are shown in Figure 7. To better understand their behavior, we evaluated object frames obtained under five different settings: (1) our full model with object label supervision; (2) our full model with object detector responses; (3) our model without global image features; (4) a correlation-based baseline model3); (5) random grouping of objects into object frames. The results are summarized in Figure 8. We can see that our model, with or without object label supervision, outperforms the baseline models. In addition, the object frame quality is not significantly affected when object label supervision is absent. This is because the errors from the object detectors are averaged across images.

The mapping between scenes and object frames is meaningful. We conducted another user study to evaluate the mapping between natural scenes and the learned object frames. In this study, we provided participants with a list of natural scenes, such as a beach, street, or office (see Figure 9(a) for a complete list), and a list of object frames learned by our model. Each object frame is identified by its associated objects. For example, object frame 1 is characterized by. We represent the relation between object frames

1) The set of detectors is provided by [25], and includes 111 object detectors.
2) The set of detectors is provided by http://people.cs.uchicago.edu/~rbg/latent/, and includes 20 object detectors.
3) This baseline method simply computes the correlation coefficients between objects, and groups objects accordingly.
Figure 7 (Color online) A heat matrix of the learned object frames in φ. Each row corresponds to an object frame (16 in total), and each column corresponds to an object. Object names are displayed at the bottom. Higher values in a row indicate that the object is more likely to appear.
Figure 8 (Color online) Semantic coherency of object frames, as rated in the user study. The higher the bar, the higher the quality of the object frames.
and scenes using an indicator matrix whose rows represent object frames and whose columns represent scenes. Each subject was asked to mark each entry in this matrix as either 1 (meaning the corresponding object frame is likely to be contained in the corresponding scene) or 0. We aggregated the results from the five participants in a heat matrix, where the value of each entry represents how many times an object frame was voted as being associated with a scene. Note that each column of the heat matrix provides the ground-truth distribution of the object frames of the corresponding scene. We then compared the learned OFD of each scene type with the ground-truth distribution. The learned OFD is simply the average over all images of the same scene type. As shown in Figure 9(b), the learned OFD closely resembles that obtained from the user study.

4.3 Object recognition
We evaluated the object-detection performance of our probabilistic model on two benchmark datasets, PASCAL VOC2007 and SUN09. We compared our model with a baseline detection model [42] and the hContext model [43]. Figure 10 shows the accuracy of different models in predicting the top-N objects, where objects are sorted by their probability of occurring (an evaluation protocol proposed by [25]). Our model yields a significant improvement over hContext and DPM. For example, on the PASCAL VOC2007 dataset, our model outperforms hContext by 10% and DPM by 18% when predicting the top-2 objects. We also compared the mean Average Precision (AP) of all models. On the SUN09 dataset, DPM and hContext achieved AP scores of 39.34% and 53.45%, respectively. Our model scored 58.07%, which is superior to both state-of-the-art methods. On the PASCAL VOC2007 dataset, we also compared our results with those given by
Figure 9 (Color online) (a) Association between scenes and object frames under human supervision. Rows correspond to object frames and columns correspond to scenes. (b) Association between scenes and object frames from images learned by our algorithm. Rows and columns are as in (a). In both (a) and (b), scenes (columns) are reordered as leaves of a hierarchical clustering.
Figure 10 (Color online) (a) Presence prediction accuracy for PASCAL VOC2007; (b) presence prediction accuracy for SUN09; (c) presence prediction without running object detectors on SUN09.
Table 2  Mean AP of the top 10 candidates for scene retrieval

Method                              SUN09    MIT Indoor
GIST (L1)                           0.67     0.30
GIST (L2)                           0.69     0.30
SPM (L2)                            0.71     0.33
SPM (L1)                            0.71     0.34
ObjectBank (L1)                     0.73     0.37
OFD (L1) (zero detector setting)    0.74     0.38
OFD (L1)                            0.76     0.39
Groups [29]. The AP scores of DPM, hContext, and Groups are 32.30%, 34.20%, and 34.60%, whereas the proposed method scored 35.10%. Again, our model yields the best performance. These experimental results justify the intuition that our probabilistic model, which integrates contextual object relations and global image features, can significantly improve the accuracy of object detection.

We also tested the performance of our probabilistic model without the use of object detectors. This is a favorable setting, because it does not require any object detectors to be run on the test images. As shown in Figure 10, even without running any object detectors, our model still predicted the presence of objects with good accuracy. In particular, we only used several tens of object frames, which is far fewer than the number of object detectors (111 object classes). This experiment shows that the presence of a large set of objects can be approximated from a small set of object frames.

4.4 Scene retrieval
The above observations indicate that the OFD can be used to measure the scene similarity between images. In this section, we show that the OFD provides a powerful image descriptor for measuring similarities between images. We evaluate the performance of this image descriptor against the standard GIST, SPM, and ObjectBank image descriptors using the SUN09 dataset4) and the MIT Indoor dataset. The MIT Indoor dataset is particularly challenging, as it contains many fine-grained scene classes.

To build the OFD descriptor, we split a given dataset into a training set and a testing set5). We then learned 20 object frames from the training set using our model. For each test image, we built an image descriptor using the zero object detector setting described in Subsection 4.3. We also tried running object detectors on every test image. Experimental results show that the zero object detector setting is much more efficient, and only slightly degrades image retrieval performance (see Table 2).

To conduct an evaluation, we randomly selected 30% of the test images as queries. For each query image and each image descriptor, we collected the top 10 candidates based on the L1 or L2 descriptor distance. Table 2 provides the mean AP of the returned images. It is clear that the OFD descriptor outperforms GIST and SPM, which are based on global image features. The performance of OFD and ObjectBank is similar; however, OFD is much more efficient, as it does not run object detectors on the test images. In addition, the dimension of the OFD (20 in our case) is usually much smaller than that of other image descriptors, e.g., GIST (128) and ObjectBank (1000–10000). Thus, OFD is a powerful and compact image descriptor for measuring image similarities.
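Retrieval with the OFD descriptor reduces to nearest-neighbor search under the L1 distance. A minimal sketch, with an invented 4-dimensional OFD gallery (the real descriptor is 20-dimensional):

```python
import numpy as np

def retrieve(query_ofd, gallery_ofd, k=10):
    """Rank gallery images by L1 distance between object-frame distributions."""
    d = np.abs(gallery_ofd - query_ofd).sum(axis=1)   # L1 distance per image
    return np.argsort(d)[:k]                          # indices of top-k matches

# Three gallery images, each described by a toy 4-dimensional OFD.
gallery = np.array([[0.70, 0.10, 0.10, 0.10],
                    [0.10, 0.70, 0.10, 0.10],
                    [0.60, 0.20, 0.10, 0.10]])
query = np.array([0.62, 0.18, 0.10, 0.10])
ranked = retrieve(query, gallery, k=3)   # image 2 is the closest match
```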
5 Conclusion
In this paper, we proposed a probabilistic model for scene recognition. We assumed that each image is a mixture of object frames, and that each object frame is a group of objects with consistent mutual spatial relations. A Bayes net was designed to automatically discover these object frames. Experimental results show that the learned object frames are both semantically meaningful and spatially consistent.

4) Only a subset of images in SUN09 have scene annotations, so we conducted the experiment on this subset.
5) Specifically, for the SUN09 dataset, 30% of images were used as training data, and the rest were used as testing data; the MIT Indoor dataset was split as described in [44].
In addition, our model significantly improves object recognition and scene retrieval performance over state-of-the-art methods.
References
1 Desai C, Ramanan D, Fowlkes C C. Discriminative models for multi-class object layout. Int J Comput Vis, 2011, 95: 1–12
2 Divvala S K, Hoiem D, Hays J H, et al. An empirical study of context in object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 1271–1278
3 Galleguillos C, McFee B, Belongie S, et al. Multi-class object localization by combining local contextual interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010. 113–120
4 Marszalek M, Schmid C. Semantic hierarchies for visual object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, 2007. 1–7
5 Rabinovich A, Vedaldi A, Galleguillos C, et al. Objects in context. In: Proceedings of the 11th IEEE International Conference on Computer Vision, Rio de Janeiro, 2007. 1–8
6 Sivic J, Russell B C, Zisserman A, et al. Unsupervised discovery of visual object class hierarchies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, 2008. 1–8
7 Blei D M, Jordan M I. Modeling annotated data. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2003. 127–134
8 Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis, 2001, 42: 145–175
9 Mimno D, McCallum A. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 2008. 411–418
10 Lowe D G. Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, 1999. 1150–1157
11 Perona P, Malik J. Scale-space and edge detection using anisotropic diffusion. IEEE Trans Pattern Anal Mach Intell, 1990, 12: 629–639
12 Leung T, Malik J. Representing and recognizing the visual appearance of materials using three-dimensional textons. Int J Comput Vis, 2001, 43: 29–44
13 Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis, 2001, 42: 145–175
14 Farhadi A, Endres I, Hoiem D, et al. Describing objects by their attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 1778–1785
15 Lampert C H, Nickisch H, Harmeling S. Learning to detect unseen object classes by between-class attribute transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 951–958
16 Ferrari V, Zisserman A. Learning visual attributes. In: Proceedings of the 21st Annual Conference on Neural Information Processing Systems, Vancouver, 2007. 433–440
17 Kumar N, Berg A C, Belhumeur P N, et al. Attribute and simile classifiers for face verification. In: Proceedings of the 12th IEEE International Conference on Computer Vision, Kyoto, 2009. 365–372
18 Torresani L, Szummer M, Fitzgibbon A. Efficient object category recognition using classemes. In: Proceedings of the 11th European Conference on Computer Vision, Heraklion, 2010. 776–789
19 Xing E P, Li L-J, Su H, et al. Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, 2010. 1378–1386
20 Bosch A, Zisserman A, Munoz X. Scene classification via pLSA. In: Proceedings of the 9th European Conference on Computer Vision, Graz, 2006. 517–530
21 Fei-Fei L, Perona P. A Bayesian hierarchy model for learning natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2005. 524–531
22 Sudderth E, Torralba A, Freeman W T, et al. Learning hierarchical models of scenes, objects, and parts. In: Proceedings of the IEEE International Conference on Computer Vision, Beijing, 2005. 1331–1338
23 Li L-J, Socher R, Fei-Fei L. Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 2036–2043
24 Zhu J, Li L-J, Fei-Fei L, et al. Large margin learning of upstream scene understanding models. In: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, 2010. 2586–2594
25 Choi M J, Lim J J, Torralba A, et al. Exploiting hierarchical context on a large database of object categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010. 129–136
26 Carbonetto P, de Freitas N, Barnard K. A statistical model for general contextual object recognition. In: Proceedings of the 8th European Conference on Computer Vision, Prague, 2004. 350–362
27 Boben M, Fidler S, Leonardis A. Evaluating multiclass learning strategies in a hierarchical framework for object detection. In: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, 2009. 531–539
28 Sadeghi M A, Farhadi A. Recognition using visual phrases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2011. 1745–1752
29 Li C, Parikh D, Chen T. Automatic discovery of groups of objects for scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 2735–2742
30 Choi W, Chao Y W, Pantofaru C, et al. Understanding indoor scenes using 3D geometric phrases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 33–40
31 Zhao Y B, Zhu S-C. Scene parsing by integrating function, geometry and appearance models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 3119–3126
32 Yao B. I2T: image parsing to text description. Proc IEEE, 2010, 98: 1485–1508
33 Lin L, Wu T, Porway J, et al. A stochastic graph grammar for compositional object representation and recognition. Pattern Recogn, 2009, 42: 1297–1307
34 Liu X. Integrating spatio-temporal context with multiview representation for object recognition in visual surveillance. IEEE Trans Circuits Syst Video Technol, 2011, 21: 393–407
35 Ding C, Li T, Peng W. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal, 2008, 52: 3913–3927
36 Low Y, Agarwal D, Smola A J. Multiple domain user personalization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2011. 123–131
37 Mei Q, Zhai C X. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2005. 198–207
38 Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2006. 2169–2178
39 Hoiem D, Efros A A, Hebert M. Putting objects in perspective. Int J Comput Vis, 2008, 80: 3–15
40 Minka T, Winn J, Guiver J, et al. Infer.NET, version 2.1.30904, 2008
41 Winn J, Bishop C M. Variational message passing. J Mach Learn Res, 2005, 6: 661–694
42 Felzenszwalb P F, Girshick R B, McAllester D, et al. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell, 2009, 32: 1627–1645
43 Choi M J, Lim J J, Torralba A, et al. Exploiting hierarchical context on a large database of object categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010. 129–136
44 Quattoni A, Torralba A. Recognizing indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 413–420