Multimed Tools Appl DOI 10.1007/s11042-016-4167-7
Land-use classification with biologically inspired color descriptor and sparse coding spatial pyramid matching

Tian Tian1 · Yun Zhang2 · Hao Dou3 · Hengjian Tong1
Received: 22 May 2016 / Revised: 20 September 2016 / Accepted: 15 November 2016 © Springer Science+Business Media New York 2016
Abstract Land-use classification using remote sensing images plays a key role in many applications such as urban mapping and geospatial object detection. With the rapid development of satellite sensors, high-resolution images which exhibit more detailed textures can now be acquired. How to effectively represent these images and recognize the categories of land-use/land-cover scenes has become a challenging task. In this paper, we propose a novel biologically inspired descriptor combined with sparse coding spatial pyramid matching (ScSPM) for land-use classification. A color processing pipeline is first presented to simulate the opponent responses of the human visual system. By extending the scale invariant feature transform (SIFT) to the processed color channels, a descriptor that jointly extracts color and shape information from land-use images is proposed. Then the ScSPM model is employed to incorporate the local descriptors of an image, followed by a linear kernel support vector machine (SVM) for image classification. Performance evaluation on the publicly available LULC data set demonstrates that the proposed method achieves better classification accuracy than other reference methods.

Keywords Land-use classification · Biologically inspired descriptor · Scale invariant feature transform (SIFT) · Sparse coding spatial pyramid matching (ScSPM)
Hengjian Tong
[email protected]

1 Hubei Key Laboratory of Intelligent Geo-Information Processing, College of Computer Science, China University of Geosciences, No. 388 Lumo Road, Wuhan 430074, China
2 Beijing Electro-Mechanical Engineering Institute, Beijing 100074, China
3 School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China
1 Introduction

With the development of satellite imaging sensors, remote sensing images of high spatial resolution and great volume are continuously produced [29]. These images are usually called very high resolution (VHR) images; they exhibit finer textures that carry more abundant spatial and structural information [47]. However, accompanying the increase in spatial resolution, higher intraclass variability and lower interclass disparity occur, which pose a big challenge for image representation and classification [40, 49]. Due to the complexity and diversity of land-cover/land-use types, land-use classification has drawn great interest in the remote sensing field [25]. It plays an important role in a wide range of applications, such as geospatial object detection, urban planning and vegetation mapping [8, 39]. Although many efforts have been made to develop image description and classification methods, the interpretation of remote sensing images remains one of the most challenging problems in this field. Many traditional local features from computer vision have been applied to land-use remote sensing image classification [7, 30, 38]. Local features such as SIFT (scale invariant feature transform) [26] and HOG (histogram of oriented gradients) [11] extract bottom-level information and form the basic representation of an image. These features are designed to be robust to rotation, scaling, occlusion, illumination changes and other variations, and efforts to develop better local description methods have never stopped during the past decade [3, 4, 19, 48]. To incorporate local features for scene-level classification, many models have been proposed over the years. The most typical ones are the bag of visual words (BOW) model [35] and its variations [20, 41, 45]. The BOW model, originating from text analysis, has been successfully applied to different image and scene classification tasks.
The basic approach of BOW treats each image as a bag of visual words (a collection of unordered local features): by extracting local features, mapping features to visual words and computing the histogram of all mapping codes, the representation of an image is obtained. BOW is simple and robust to spatial variations, but some of its drawbacks remain to be solved. To this end, spatial pyramid matching (SPM) [20], sparse coding SPM (ScSPM) [41] and the spatial pyramid co-occurrence kernel (SPCK) [45] have been proposed. SPM introduces absolute spatial context to deal with the location information of local features, while SPCK employs relative spatial relations to incorporate local features. ScSPM improves SPM in the vector quantization step and the histogram pooling method, and is more advanced in image representation. A publicly available land-use data set is provided by Yang et al. [44], and in their work a comprehensive investigation of BOW and its spatial extensions is carried out: SIFT, global color descriptors and homogeneous texture descriptors combined with BOW, SPM and SPCK are tested and analyzed for land-use classification. Based on their data set and work, many other researchers have developed methods for this application. Chen et al. propose a pyramid-of-spatial-relatons (PSR) model to incorporate spatial information into the BOW framework [7]. Cheng et al. use partlets, a library of pretrained part detectors, for mid-level element discovery [8]. Zhong et al. employ a probabilistic topic model for semantic allocation level multi-feature fusion [49]. Li et al. simulate the visual cortex and propose a two-layer framework to learn and extract land-use image features [23]. All of these previous works have succeeded in enhancing classification accuracy; nevertheless, they mainly focus on the processing of scene-level representation.
Although this part plays a significant role in improving classification results, the most fundamental steps, which involve local feature extraction and local feature pooling, should not be neglected. For land-use classification, the description capability of local
features is still limited. Inspired by the ability of the human visual system, in this paper we propose a biologically inspired feature that jointly extracts color and shape information. Moreover, we employ sparse coding SPM instead of the classical BOW model for mid-level feature pooling, which is validated to show better classification performance. More precisely, we first simulate the responses of single- and double-opponent cells to form a color processing pipeline for land-use images; the SIFT feature is then computed on the biologically processed data. Subsequently, sparse coding spatial pyramid matching is employed to incorporate the color extended SIFT, and the final feature vector representing each image is put into a support vector machine (SVM) for training and testing. Our contributions in this paper include the following aspects. First, we propose a biologically inspired local description approach; compared to previous biological color processing methods, our simulation of opponent cells is more reasonable, and the local feature has better capacity for land-use image description. Second, we introduce sparse coding SPM into land-use classification, which has lower quantization error and stronger representation ability compared to the conventional BOW model. Finally, we present land-use classification based on the biologically extended feature and sparse coding SPM. The proposed method improves on state-of-the-art methods in the local feature description step and the final part of the image representation framework, which helps to produce better classification results. The rest of this paper is organized as follows: Section 2 introduces related work on local feature extraction and feature incorporating models. Section 3 describes the details of the proposed method for land-use classification. In Section 4, experimental results on a publicly available data set are presented.
Finally, conclusions of this paper are drawn in Section 5.
2 Related work

For land-use classification, the most closely related work includes local feature extraction algorithms and spatial context incorporating models. SIFT [26] is regarded as a benchmark in the field of feature extraction and has been successfully implemented in many computer vision applications. Over the past decades, a number of developments have been proposed on the basis of SIFT, aiming at faster construction speed or better description. For the former purpose, outstanding representatives include SURF [3], BRISK [22] and FREAK [1]; for the latter, many approaches apply SIFT to processed color channels, such as HSV-SIFT [4] and OpponentSIFT [36]. As for spatial context incorporating models, the BOW model [10, 12] has become the most widely used one in object and scene classification since it was transplanted from the field of natural language processing. The simple BOW does work in image classification applications, but its neglect of spatial information and relations has left room for improvement. Therefore, several variants [20, 41, 45] of BOW have been presented over the past years to utilize the spatial context of local features.
2.1 The SIFT feature and SIFT-based color descriptors

SIFT, proposed by Lowe, detects keypoints using Gaussian scale pyramids and describes a local image patch by constructing a gradient orientation histogram. It has outstanding invariance to rotation and scaling, and has been employed in image registration, image retrieval, object recognition and tracking [27, 28]. In image and scene classification
applications, dense keypoint sampling is found to perform much better than any keypoint detector [4]. As a result, in classification and recognition problems, dense SIFT is usually used instead of the original SIFT feature, i.e., the description step of SIFT is retained whereas the keypoint detection is replaced by a division into overlapping grids. For each point on the grid, a histogram of local gradient directions over a local neighborhood is computed as the descriptor. With 8 quantized orientation bins and 4 × 4 grid neighbor points, a 128-dimensional dense SIFT feature is obtained. Though the employment of SIFT has proved effective in object and scene classification [14, 20], feature extraction remains a challenging problem in these applications. Some improvements have been made as variants of SIFT, such as PCA-SIFT [19] and GLOH [31]; nevertheless, they are all constructed on grayscale images. Research on the primate visual system suggests that color information may play an important role in scene parsing and figure-ground segmentation [18], hence the consideration of color has attracted increasing attention. For SIFT-based color descriptors, color is mainly processed in two ways. The first splits images into multiple color channels and extracts SIFT on each channel; HSV-SIFT [4], OpponentSIFT [36] and CSIFT [5] are all of this kind. The other approach, such as HueSIFT [37], concatenates SIFT computed from grayscale images with histograms constructed in another color space. The difference between these two methods is whether color and shape information are processed separately. Based on work in neurophysiology, two functional classes of color-sensitive neurons have been discovered, described as Single-Opponent (SO) and Double-Opponent (DO) neurons [34]. They are involved in color perception and exhibit different characteristics: SO cells show strong selectivity for color opponency (e.g. red vs. green) and weak tuning for spatial opponency (i.e., orientation), while DO cells tend to be responsive to both color and spatial opponency [48]. SO cells mainly process the surface information of images, whereas DO cells are concerned with boundary extraction and function in color constancy [9]. On the basis of this knowledge, a hierarchical model of color processing that mimics the pathway from the retina to the visual cortex of human beings has been proposed [15, 42]. The feedforward model consists of three layers: the cone layer, the Ganglion/LGN layer and the cortex layer. The cone layer corresponds to the trichromatic receptor theory, where an image is split into three individual channels; SO cells then function to obtain multiple opponent channels in the next layer, and DO cells in the last layer produce responses sensitive to both color and orientation. As SODOSIFT [48] has validated, the hierarchical color processing pipeline outperforms algorithms that simply concatenate color and spatial information separately, and has become a promising way to deal with color ahead of spatial information description. The opponent response simulation model has been successfully applied in color constancy analysis and boundary detection [16, 43], and is capable of functioning in more applications concerning image representation.
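A minimal sketch of the dense-SIFT description step outlined above (8 orientation bins over a 4 × 4 cell grid, giving 128 dimensions). The patch-level interface and function name are ours, and the Gaussian weighting and trilinear interpolation of the full SIFT descriptor are omitted for brevity:

```python
import numpy as np

def dense_sift_patch(patch, n_cells=4, n_bins=8):
    """Describe one grid patch with gradient orientation histograms.

    The patch is divided into n_cells x n_cells cells; each cell
    contributes an n_bins orientation histogram weighted by gradient
    magnitude, giving a 4*4*8 = 128-dim descriptor as in dense SIFT.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)          # angles in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    h, w = patch.shape
    desc = np.zeros((n_cells, n_cells, n_bins))
    cy, cx = h // n_cells, w // n_cells
    for i in range(n_cells):
        for j in range(n_cells):
            b = bins[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx].ravel()
            m = mag[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx].ravel()
            desc[i, j] = np.bincount(b, weights=m, minlength=n_bins)
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

Applied at every node of an overlapping grid, this yields the set of local descriptors that the pooling models of Section 2.2 aggregate.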
2.2 The BOW model and its extensions

The BOW model stems from text analysis and document classification, wherein a document is represented by the occurrence frequencies of its words: the order of words is ignored and the histogram of word frequencies is used for text classification. Inspired by its success in language processing, BOW has been applied to image representation for object and scene recognition [35]. Local features extracted from the training image set are grouped into a number of clusters; the centroids of these clusters, i.e., the visual words, form a visual dictionary or codebook. By quantizing all the local features of an image with the dictionary,
codes of visual word occurrence are then obtained. Finally, the image can be represented with a histogram of visual words by properly pooling the aforementioned codes. The pipeline of image classification with the BOW model can be summarized in four steps: visual dictionary building; local feature encoding; code pooling and histogram construction; and classifier training and testing. Building the dictionary is essentially a clustering process, in which the k-means algorithm is usually adopted to learn the visual words. In the feature encoding step, the standard BOW model employs vector quantization, a hard assignment that searches the nearest visual word for each feature vector [7]. Due to this hard assignment, counting the frequencies of visual codes directly yields the histogram. If the encoding is instead a soft quantization, the codes of local features have to be pooled with some strategy to obtain a histogram vector. Popular pooling strategies include average pooling and maximum pooling [24], and the latter usually achieves higher accuracy in image classification [41]. One major limitation of BOW is that it does not consider the spatial relations of local features when constructing histograms from the visual codes, although the locations of local features play important roles in image classification. On this account, SPM [20] is proposed as an improvement of BOW. SPM partitions an image into hierarchically coarse-to-fine grids and computes a weighted histogram at each level of resolution. Larger weights are assigned to finer regions, and the histograms of different levels are concatenated to form the final vector. The number of matches at each level l is computed by:

$$I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \sum_{m=1}^{M} \min\left(H_X^l(i, m),\, H_Y^l(i, m)\right) \qquad (1)$$

and the pyramid matching kernel for two images is as follows:

$$K(X, Y) = I^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left(I^l - I^{l+1}\right) \qquad (2)$$

where $H_X^l$ and $H_Y^l$ denote the histograms of images X and Y at level l, $H_X^l(i, m)$ and $H_Y^l(i, m)$ are the counts of visual word m that fall into the i-th cell of the grid, $D = 4^l$ is the number of cells, and $I^l$ abbreviates $I(H_X^l, H_Y^l)$. SPM is conventionally computed at two or three levels, and one-level SPM with no division into subregions actually degrades to the BOW model. SPM has shown better performance than BOW on several challenging data sets [13, 17] with only a slight increase of computational consumption, thus it has become as popular as BOW for mid-level feature aggregation. SPM improves BOW on spatial pooling with the SPM kernel; in fact, other kernels can also be used in the code pooling step. Motivated by gray-level co-occurrence matrices (GLCM) for texture description, a spatial pyramid co-occurrence kernel is proposed to deal with the spatial dependence of visual codes [44]. SPCK incorporates relative spatial information, and other methods with similar ideas include the visual word correlograms proposed by Savarese et al. [33]. Both BOW and SPM adopt hard vector quantization; as a result, large quantization error easily occurs when mapping a local descriptor to one visual word. To remove the one-to-one mapping constraint, Yang et al. employ sparse coding in the feature encoding step and present the ScSPM algorithm [41]. The sparse constraint is beneficial for capturing salient patterns and achieving good reconstruction; moreover, it is supported by research on
image statistics and biological visual systems [41, 46]. Sparse features are superior in linear separability; as a consequence, a linear SVM can be employed for fast training and testing. Compared to SPM, ScSPM adopts max pooling instead of average pooling, which has better robustness to local spatial variations. These advantages indicate that ScSPM may be a better choice than BOW in many computer vision applications.
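The level-weighted matching of (1)-(2) can be sketched in a few lines. The pyramid layout here (a list of per-level (cells × words) count arrays) is our assumed representation, not a layout the paper prescribes:

```python
import numpy as np

def spm_kernel(hist_x, hist_y, L=2):
    """Pyramid match kernel for two spatial pyramids, Eqs. (1)-(2).

    hist_x[l] / hist_y[l] are (4**l, K) arrays of visual-word counts
    for the 4**l cells at level l; the finest level gets weight 1.
    """
    # I^l: summed histogram intersections over all cells at level l
    I = [np.minimum(hist_x[l], hist_y[l]).sum() for l in range(L + 1)]
    k = I[L]                              # finest level, weight 1
    for l in range(L):                    # coarser levels, halved weights
        k += (I[l] - I[l + 1]) / 2 ** (L - l)
    return k
```

For identical pyramids whose counts are conserved across levels, the kernel simply returns the total count, since every level difference vanishes.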
3 Method

In this section, we present the proposed land-use classification method based on biologically inspired color SIFT and sparse coding spatial pyramid matching. We start by laying out the common structure of the simulated biological processing system, and then describe the formulation of the functional cells. Subsequently the pipeline of constructing the local descriptor is introduced, and finally the approaches of image representation with ScSPM and classification with SVM are provided.
3.1 Biologically inspired processing of color information

According to evidence provided by physiological studies, biological color information processing can be modeled by a three-layer feedforward structure. This structure is shown in Fig. 1, where the opponent response simulation is illustrated using the R-G channel as a sole example. The cone layer corresponds to an operation of color channel splitting, and the remaining two layers simulate SO and DO cells respectively. On account of the simplicity of the cone layer processing, we only detail the processing pipeline for the computation of the SO and DO descriptors. Classical opponent theories of color vision emphasize two main chromatic axes: Red-Green (R-G) and Yellow-Blue (Y-B) [21]. Other opponent channels can be further involved; we consider two more channels here: Red-Cyan (R-C) and White-Black (Wh-Bl), where C is obtained by combining G and B, and Wh and Bl are luminance-based channels related to all of the RGB channels. Though whether to model the R-C channel as an independent one remains controversial, it has been reported that the addition of these two opponent channels consistently improves description performance [48].
3.2 Response simulation of Single-Opponent (SO) neurons

Receptive fields (RFs) of SO cells can be seen as the combination of the RFs of two opposite cone cells [42]. Thus, the responses of cone cells on the retina are simulated first: images are split into RGB channels, and a Gaussian function is employed to form the RFs of the cone cells.
Fig. 1 Biologically inspired hierarchical structure of color processing: the input image is split into r, g and b channels in the cone layer; the Ganglion/LGN layer produces single-opponent responses (e.g. SO(r-g) and SO(g-r)); and the cortex layer produces double-opponent responses (e.g. DO(rg))
Gaussian functions of various scales are convolved with each RGB channel, and the process is formulated as:

$$G(x, y; \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (3)$$

$$\widetilde{R}(x, y; \sigma_R) = R(\lambda) * G(x, y; \sigma_R) \qquad (4)$$

$$\widetilde{G}(x, y; \sigma_G) = G(\lambda) * G(x, y; \sigma_G) \qquad (5)$$

$$\widetilde{B}(x, y; \sigma_B) = B(\lambda) * G(x, y; \sigma_B) \qquad (6)$$

where $G(x, y; \sigma)$ represents a Gaussian function of scale $\sigma$, which determines the area of the receptive field, and $\widetilde{R}(x, y; \sigma_R)$ and the other two denote the outputs of the cone-like responses. The four opponent channels R-G, Y-B, R-C and Wh-Bl are included in the simulation of SO cells, wherein $Y = (R + G)/2$, C is obtained from G and B, and Wh and Bl are computed from R, G and B. For each opponent channel, the characteristics of an SO cell can be simulated by combining the responses of the corresponding cone cells. This is written in the following form:

$$\begin{bmatrix} SO_{(R \pm G)} \\ SO_{(R \pm C)} \\ SO_{(B \pm Y)} \\ SO_{(Wh \pm Bl)} \end{bmatrix} = \begin{bmatrix} \pm 1/\sqrt{2} & \mp 1/\sqrt{2} & 0 \\ \pm 2/\sqrt{6} & \mp 1/\sqrt{6} & \mp 1/\sqrt{6} \\ \pm 1/\sqrt{6} & \pm 1/\sqrt{6} & \mp 2/\sqrt{6} \\ \pm 1/\sqrt{3} & \pm 1/\sqrt{3} & \pm 1/\sqrt{3} \end{bmatrix} \begin{bmatrix} \widetilde{R}(x, y; \sigma_R) \\ \widetilde{G}(x, y; \sigma_G) \\ \widetilde{B}(x, y; \sigma_B) \end{bmatrix} \qquad (7)$$

where the "+" (resp. "−") sign indicates an excitation (resp. inhibition) of the corresponding response. Moreover, a half-squaring operation is imposed on the spatio-chromatic opponent response, due to the positive firing rates of neurons; therefore, only the positive response is finally retained.
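The SO stage just described ((3)-(7) plus half-squaring) can be sketched as follows. This is a minimal numpy/scipy sketch under our own assumptions: a single cone scale σ is shared by all three channels, only the excitation phase of each opponent pair is kept, and the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Opponent weight matrix of Eq. (7), excitation phase only;
# rows: R-G, R-C, B-Y, Wh-Bl.
W = np.array([[1/np.sqrt(2), -1/np.sqrt(2),  0.0],
              [2/np.sqrt(6), -1/np.sqrt(6), -1/np.sqrt(6)],
              [1/np.sqrt(6),  1/np.sqrt(6), -2/np.sqrt(6)],
              [1/np.sqrt(3),  1/np.sqrt(3),  1/np.sqrt(3)]])

def single_opponent(rgb, sigma=1.0):
    """Simulate SO responses: Gaussian cone smoothing per channel
    (Eqs. 4-6), the opponent combination of Eq. (7), then
    half-squaring (only positive firing rates are kept)."""
    cones = np.stack([gaussian_filter(rgb[..., c].astype(float), sigma)
                      for c in range(3)], axis=-1)
    so = np.tensordot(cones, W, axes=([-1], [1]))   # H x W x 4 channels
    return np.maximum(so, 0.0)                      # half-squaring
```

A uniform red input, for instance, yields constant positive responses on the R-G, R-C and Wh-Bl channels and zero after rectification wherever the opponent combination is negative.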
3.3 Response simulation of Double-Opponent (DO) neurons

The receptive field of a DO neuron can be regarded as the superposition of two oriented cells with reverse phases, as shown in Fig. 2. This way of simulation is easily implemented, yet quite reasonable. For a single oriented cell, a Gabor filter with a sinusoidal wave is employed to imitate its RF structure. The Gabor function has the following formulation:

$$Gab(x, y; \sigma, \varphi) = \exp\left(-\frac{x_0^2 + \gamma^2 y_0^2}{2\sigma^2}\right) \times \sin\left(\frac{2\pi}{\lambda} x_0 + \varphi\right) \qquad (8)$$

$$x_0 = x\cos\theta + y\sin\theta, \qquad y_0 = -x\sin\theta + y\cos\theta \qquad (9)$$
Fig. 2 Simulation of DO cell’s RF based on two oriented cells with reverse phases
where $\sigma$ determines the area of the receptive field, $\varphi$ indicates its phase, $\gamma$ defines the aspect ratio, and $\theta$ indicates the orientation of the filter. The response of a DO cell of a certain orientation is obtained as follows. First, Gabor filters of this orientation with 0° and 180° phases are convolved with the input SO responses respectively. Then a half-squaring rectification is applied to the convolution outputs. Finally, the responses of the two filters are summed to imitate the DO cell's response. Taking the R-G opponent channel as an example, the above process can be described as:

$$F_{(R-G)} = SO^{+}_{(R-G)} * Gab(x, y; \sigma, \varphi_{0}) \qquad (10)$$

$$F_{(G-R)} = SO^{+}_{(G-R)} * Gab(x, y; \sigma, \varphi_{180}) \qquad (11)$$

$$DO_{(RG)} = F^{+}_{(R-G)} + F^{+}_{(G-R)} \qquad (12)$$

where $SO^{+}_{(R-G)}$ and $SO^{+}_{(G-R)}$ represent the responses of the SO channels after half-squaring, $F^{+}_{(R-G)}$ and $F^{+}_{(G-R)}$ are the rectified convolutional responses of the Gabor filters with different phases, and $DO_{(RG)}$ denotes the final DO-like response of the R-G opponent channel. The same operation is applied to the other opponent channels, producing four DO responses for the input image.
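A minimal sketch of the DO stage ((8)-(12)) for one opponent channel and one orientation. The defaults below (σ = 4, aspect ratio 0.3, 9 × 9 window) follow the parameter values reported in Section 4; the Gabor wavelength λ is not stated in the text, so `lam = 2 * sigma` is our assumption, and the function names are ours:

```python
import numpy as np
from scipy.ndimage import convolve

def gabor(sigma=4.0, gamma=0.3, theta=0.0, phase=0.0, lam=None, size=9):
    """Gabor RF of one oriented cell (Eq. 8); lam defaults to
    2 * sigma, an assumption not fixed by the paper."""
    if lam is None:
        lam = 2.0 * sigma
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    x0 = x * np.cos(theta) + y * np.sin(theta)          # Eq. (9)
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(x0 ** 2 + gamma ** 2 * y0 ** 2) / (2 * sigma ** 2))
    return env * np.sin(2 * np.pi * x0 / lam + phase)

def double_opponent(so_pos, so_neg, theta=0.0):
    """DO response of one opponent channel (Eqs. 10-12): convolve the
    two half-squared SO maps with 0- and 180-degree-phase Gabors,
    half-square the outputs, then sum."""
    f_pos = np.maximum(convolve(so_pos, gabor(theta=theta, phase=0.0)), 0)
    f_neg = np.maximum(convolve(so_neg, gabor(theta=theta, phase=np.pi)), 0)
    return f_pos + f_neg
```

Running this for each of the four opponent pairs yields the four DO response maps used by the descriptor of Section 3.4.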
3.4 Color extended SIFT

Based on the above simulation model of human color perception, we propose a color extended descriptor that fuses color and shape information together. Since the widely-used SIFT feature has shown good performance on local structure description, we carry out the fusion on the basis of the SIFT description method. Standard SIFT is constructed by calculating gradient orientation histograms of grayscale images, whereas we employ the responses of the DO channels to build the orientation histograms here. First, the input image is processed according to Sections 3.1 – 3.3, so that the horizontal and vertical DO responses of the four opponent channels are obtained. For each channel, amplitudes and angles are computed from the horizontal and vertical responses. Then the orientation histogram weighted by amplitudes is computed for each of the four channels. Finally, the histograms of the channels are concatenated, and the color extended SIFT descriptor is obtained.
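The descriptor construction just described can be sketched as follows. This is a simplified illustration under our own naming: it keeps only the amplitude-weighted orientation histogram per opponent channel and omits SIFT's 4 × 4 spatial cell layout and the dense grid:

```python
import numpy as np

def channel_orientation_hist(do_h, do_v, n_bins=8):
    """Orientation histogram for one opponent channel, built from its
    horizontal and vertical DO responses: angles weighted by
    amplitudes, as in the SIFT description step."""
    mag = np.hypot(do_h, do_v)                           # amplitudes
    ang = np.mod(np.arctan2(do_v, do_h), 2 * np.pi)      # angles
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def color_extended_sift(do_responses, n_bins=8):
    """Concatenate the per-channel histograms of the four opponent
    channels into one color-extended descriptor (sketch only)."""
    desc = np.concatenate([channel_orientation_hist(h, v, n_bins)
                           for h, v in do_responses])
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

With 8 bins and four channels the sketch gives a 32-dimensional vector; the full descriptor additionally repeats this per spatial cell, as in standard SIFT.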
3.5 Classification with ScSPM and SVM

As mentioned in Section 2.2, ScSPM is one of the extensions that improves BOW and SPM in many computer vision applications. On account of its good performance in image classification and object recognition, we adopt ScSPM to incorporate the local color extended descriptors. An image is densely sampled into overlapped grids, and the color extended SIFT is computed on each patch. Learning the visual word dictionary in ScSPM is similar to the BOW and SPM models, but the vector quantization (VQ) of the feature encoding step is different: hard VQ is replaced by sparse coding, which has the following objective function:

$$\min_{U, V} \sum_{m=1}^{M} \left\| x_m - u_m V \right\|^2 + \lambda |u_m| \quad \text{subject to} \quad \|v_k\| \le 1, \; \forall k = 1, 2, \ldots, K \qquad (13)$$
where M is the number of local descriptors, $u_m$ indicates the projective coefficient of the m-th descriptor on dictionary V, and $v_k$ is a visual word out of a total of K; typically a unit L2-norm constraint is applied to $v_k$. ScSPM uses the L1-norm as the sparse regularization on the projective coefficients, which is well known as the Lasso in statistics. This optimization problem has many conventional solutions that iteratively and alternately optimize over each coefficient [41]. After the coding step, max pooling on the absolute sparse codes is applied instead of the average pooling adopted by SPM. It is reported that max pooling has stronger biophysical support and empirically outperforms other pooling strategies in image categorization [41]. The max pooling function is defined as:

$$z = \max_{m = 1, \ldots, M} \left\{ |u_m| \right\} \qquad (14)$$

where z is the obtained histogram and $u_m$ is the m-th sparse code of a total of M features. Compared to (2), the linear kernel of two images X and Y is depicted as:

$$K(X, Y) = \langle z_i, z_j \rangle = \sum_{l=0}^{2} \sum_{s=1}^{2^l} \sum_{t=1}^{2^l} \left\langle z_i^l(s, t),\, z_j^l(s, t) \right\rangle \qquad (15)$$

where $\langle z_i, z_j \rangle = z_i^{T} z_j$, and $z_i^l(s, t)$ is the max pooling statistic of the sparse codes in the (s, t)-th cell at scale level l. Consequently, a linear-kernel SVM, which costs only linear training and testing computation, can be employed for classification:

$$f(z) = \sum_{i=1}^{n} \alpha_i K(z, z_i) + b = \left( \sum_{i=1}^{n} \alpha_i z_i \right)^{T} z + b \qquad (16)$$
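The pooling step of (14)-(15) can be sketched as follows; the data layout (sparse codes plus descriptor positions normalised to [0, 1)) and the function name are our assumptions. The linear kernel of (15) is then simply a dot product of two such pooled vectors:

```python
import numpy as np

def scspm_pooled_vector(codes, positions, L=2):
    """Max-pool sparse codes over a spatial pyramid (Eq. 14 applied
    per cell), concatenating all cells of all levels.

    codes:     (M, K) sparse codes of M local descriptors
    positions: (M, 2) descriptor coordinates normalised to [0, 1)
    """
    parts = []
    for l in range(L + 1):
        n = 2 ** l                                       # n x n cells
        cell = np.minimum((positions * n).astype(int), n - 1)
        for s in range(n):
            for t in range(n):
                mask = (cell[:, 0] == s) & (cell[:, 1] == t)
                if mask.any():
                    parts.append(np.abs(codes[mask]).max(axis=0))
                else:
                    parts.append(np.zeros(codes.shape[1]))
    return np.concatenate(parts)
```

With L = 2 and K visual words, the pooled vector has (1 + 4 + 16) · K dimensions, and K(X, Y) of (15) is `zx.dot(zy)`.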
4 Experiments

4.1 Data set

We employ the LULC data set [44] in our experiments, as it is the most widely used publicly available database. This data set includes 21 land-use/land-cover classes of remotely sensed images, ranging from natural lands to urban constructions. Each category contains 100 images measuring 256 × 256 pixels, with a pixel resolution of 0.3 m, in R-G-B color space (visible bands). Sample images of the LULC data set are shown in Fig. 3.
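For illustration, a five-fold split of such a data set (100 images per class, 21 classes), as used in the experimental protocol of Section 4.2, might be sketched as follows; the function name and seeding are ours:

```python
import numpy as np

def five_fold_indices(n_per_class, n_classes, n_folds=5, seed=0):
    """Split each class's images evenly across folds; in each round,
    four folds are used for training and one for evaluation."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in range(n_classes):
        # global indices of this class, in random order
        idx = rng.permutation(n_per_class) + c * n_per_class
        for f in range(n_folds):
            folds[f].extend(idx[f::n_folds])
    return [np.array(sorted(f)) for f in folds]
```

Each of the five folds then holds 20 images per class, and every image appears in exactly one fold.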
4.2 Experimental setup

The proposed method and other approaches are compared by performing multi-class classification with SVM; a widely used implementation, LIBSVM [6], is employed here. Following Yang et al. [44], five-fold cross-validation is performed: the data set is randomly split into five equal subsets, four of which are used for classifier training and the remaining one for evaluation. Experiments are repeated five times, and the average classification rates are recorded as the final results. Besides the proposed biologically inspired descriptor, we use the standard SIFT as the major performance reference. Moreover, a greedy color SIFT feature that simply integrates gray-level SIFT from each color channel is implemented as a reference method of color utilization. Additionally, two color-related descriptors mentioned in the literature [44] are involved for comparison: the H RGB and H HLS color histograms, computed in RGB and HLS color space respectively; each dimension is quantized into 8 bins, yielding a 512-dimensional histogram descriptor. The histogram intersection kernel (HIK) is applied to these histogram features in the SVM classification. The GIST feature [32], proposed for scene classification, is included as well; since the original GIST is designed for grayscale images, we concatenate descriptors obtained from different spectral bands [2]. The conventional BOW and SPM models are tested together with the ScSPM model; for BOW and SPM, the Gaussian radial basis function (RBF) kernel is adopted for SVM, with HIK as a supplementary kernel. Due to memory limitations, we randomly extract 100 thousand patches from all the images (roughly equal quantities from each class) for dictionary training. Parameters such as dictionary size and number of pyramid levels are tuned to produce the best result.

Fig. 3 Sample images of the 21 categories in the LULC data set. From left to right and top to bottom, the classes are: (1) agricultural, (2) airplane, (3) baseball diamond, (4) beach, (5) buildings, (6) chaparral, (7) dense residential, (8) forest, (9) freeway, (10) golf course, (11) harbor, (12) intersection, (13) medium residential, (14) mobile home park, (15) overpass, (16) parking lot, (17) river, (18) runway, (19) sparse residential, (20) storage tanks, (21) tennis court
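The histogram intersection kernel used for the color histograms can be sketched as a precomputed Gram matrix (the function name is ours):

```python
import numpy as np

def hik_gram(H1, H2):
    """Histogram intersection kernel matrix: entry (a, b) is the sum
    of elementwise minima of histogram a of H1 and histogram b of H2.
    Suitable for SVM training with a precomputed kernel."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=-1)
```

Such a precomputed Gram matrix can be handed directly to an SVM implementation that accepts custom kernels, as LIBSVM does.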
4.3 Results and comparisons

Twelve method combinations in total are tested, and the classification results are given in Fig. 4. The proposed biological color extended descriptor is denoted by BCSIFT, and SIFT RGB stands for the greedy SIFT integrating SIFT from each channel. Each method is marked as a combination of feature, pooling model and SVM kernel, wherein LK stands for the linear kernel SVM. ScSPM employs only the linear kernel SVM, on account of efficiency and its design intention. The two color histogram features and the GIST feature are global features, so no pooling model is necessary for them. As shown in Fig. 4, BCSIFT with ScSPM and linear SVM achieves the highest accuracy. Under the same pooling model and SVM kernel, the proposed biological descriptor outperforms the standard SIFT and greedy SIFT, as well as the two types of color histograms and the GIST feature. Considering the local feature incorporating models only, ScSPM exhibits obvious superiority compared to BOW and SPM. In addition, the intersection kernel shows better performance in our experiments for BOW and
Fig. 4 Classification results of the proposed method and state-of-the-art approaches on LULC data set. Each method is marked as “feature + pooling model + SVM kernel”. Blue bars are SIFT based results, red ones represent results with the biological color extended SIFT, and green ones represent the rates with two global color histograms and the global GIST feature (no pooling models)
SPM as expected, since HIK is inherently appropriate for histogram-based features (SIFT is essentially an orientation histogram).

The effect of the two most important parameters is shown in Fig. 5. As it is time-consuming to train a dictionary, parameters are not exhaustively tested. The range of dictionary size follows the BOW and SPM experiments in the literature [44], where dictionary sizes from 500 to 1000 produce different changes in classification accuracy for BOW and SPM; moreover, generating a dictionary takes more time as its size increases. Dictionary sizes within this range are therefore tested, given the similarity between ScSPM and the aforementioned two models. It can be inferred from Fig. 5 that the dictionary size of ScSPM has a similar effect on land-use classification as it does for BOW. Initially the classification rate rises rapidly as the size increases, and this trend slows in the middle section. Finally, a larger number of visual words is no longer beneficial for improving performance; moreover, a large size is not worthwhile considering the computational consumption. Another parameter that has great impact on the results is the sparsity parameter γ of ScSPM: the bigger γ is, the sparser the codes will be. The empirical values suggested in the literature [41] do not fit land-use classification, so several smaller values are tested here (Fig. 5). We find that a sparsity around 20 %, with γ set to less than 0.075, yields good results for the proposed method. The performance decreases severely as γ increases, and remains uninfluenced when γ is small enough. This validates that an excessive sparsity requirement on feature quantization results in insufficient image representation.

Fig. 5 Evaluation of classification accuracy for two parameters: dictionary size (with γ = 0.05) and the γ value indicating the sparsity of ScSPM (with dictionary size 1000)

Other parameters have also been considered and are discussed as follows. As there are too many coefficients in biological models, we adopt a group of reasonable values according to common biological simulation functions and applications. The Gaussian function employed for imitating the cone's RF (Eq. (3)) uses a 7 × 7 window and a standard deviation of 1. The Gabor function (Eq. (8)) is built with an aspect ratio of 0.3, a Gaussian window of 9 and σ of 4. When the R-C opponent channel is removed from the feature description, the performance decreases by 1.7 %; therefore all four opponent channels are necessary for better results. A pyramid level of 3 is shown to yield the best result, with only a slight superiority over the suboptimal setting.

Figure 6 shows the performance evaluation over each category of the LULC database. SIFT with BOW, SIFT with ScSPM and BCSIFT with ScSPM are involved in the figure. The proposed combination achieves better accuracy than SIFT with BOW in 19 out of 21 classes, and the biologically inspired feature is superior to SIFT in 15 out of 21 (with one draw).

Fig. 6 Performance comparison of the proposed method with standard SIFT and the BOW model on each category
5 Conclusions

In this paper, we propose a biologically inspired descriptor that exploits color information. Combined with the sparse coding SPM model to incorporate local features, it enables classification of remotely sensed land-use/land-cover images. By extending SIFT over a novel color processing pipeline that imitates the human visual system, the proposed local feature jointly captures the color and shape information of remote sensing images. Moreover, the use of ScSPM improves classification performance compared to the commonly used BOW and SPM models. Experimental results show that the proposed method outperforms several conventional approaches on the LULC data set, contributing to both the local feature extraction and the mid-level feature pooling steps of land-use classification. As sparsity is an inherent characteristic of biological responses, how
to imitate the sparse responses of neurons and extend sparse coding to feature building is worth studying for a more concise and precise representation. Other future work includes an evaluation of biological coefficient selection and a reduction of computational complexity.

Acknowledgments This work is supported by the Natural Science Foundation of Hubei Province under Grant 2016CFB278, the China Postdoctoral Science Foundation under Grant 2016M602390, the Open Research Project of the Hubei Key Laboratory of Intelligent Geo-Information Processing under Grant KLIGIP1608, and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan).
References

1. Alahi A, Ortiz R, Vandergheynst P (2012) FREAK: Fast retina keypoint. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 510–517
2. Avramović A, Risojević V (2016) Block-based semantic classification of high-resolution multispectral aerial images. SIViP 10(1):75–84
3. Bay H, Tuytelaars T, Van Gool L (2006) SURF: Speeded up robust features. In: Computer Vision–ECCV 2006. Springer, pp 404–417
4. Bosch A, Zisserman A, Muñoz X (2008) Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Intell 30(4):712–727
5. Burghouts GJ, Geusebroek JM (2009) Performance evaluation of local colour invariants. Comput Vis Image Underst 113(1):48–62
6. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27
7. Chen S, Tian Y (2015) Pyramid of spatial relatons for scene-level land use classification. IEEE Trans Geosci Remote Sens 53(4):1947–1957
8. Cheng G, Han J, Guo L, Liu Z, Bu S, Ren J (2015) Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans Geosci Remote Sens 53(8):4238–4249
9. Conway BR, Chatterjee S, Field GD, Horwitz GD, Johnson EN, Koida K, Mancuso K (2010) Advances in color science: from retina to behavior. J Neurosci 30(45):14955–14963
10. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV, Prague, vol 1, pp 1–2
11. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol 1, pp 886–893
12. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol 2, pp 524–531
13. Fei-Fei L, Fergus R, Perona P (2006) One-shot learning of object categories. IEEE Trans Pattern Anal Mach Intell 28(4):594–611
14. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70
15. Gao S, Yang K, Li C, Li Y (2013) A color constancy model with double-opponency mechanisms. In: Proceedings of the IEEE International Conference on Computer Vision, pp 929–936
16. Gao S, Yang K, Li C, Li Y (2015) Color constancy using double-opponency. IEEE Trans Pattern Anal Mach Intell 37(10):1973–1985
17. Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset
18. Hurlbert AC (1989) The computation of color. Tech. rep., DTIC Document
19. Ke Y, Sukthankar R (2004) PCA-SIFT: A more distinctive representation for local image descriptors. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol 2, pp II–506
20. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2, pp 2169–2178
21. Lennie P, Krauskopf J, Sclar G (1990) Chromatic mechanisms in striate cortex of macaque. J Neurosci 10(2):649–669
22. Leutenegger S, Chli M, Siegwart RY (2011) BRISK: Binary robust invariant scalable keypoints. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp 2548–2555
23. Li Y, Tao C, Tan Y, Shang K, Tian J (2016) Unsupervised multilayer feature learning for satellite image scene classification. IEEE Geosci Remote Sens Lett 13(2):157–161
24. Liu L, Wang L, Liu X (2011) In defense of soft-assignment coding. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp 2486–2493
25. Liu P, Choo KKR, Wang L, Huang F (2016) SVM or deep learning? A comparative study on remote sensing image classification. Soft Computing. doi:10.1007/s00500-016-2247-2
26. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol 2, pp 1150–1157
27. Ma J, Zhou H, Zhao J, Gao Y, Jiang J, Tian J (2015a) Robust feature matching for remote sensing image registration via locally linear transforming. IEEE Trans Geosci Remote Sens 53(12):6469–6481
28. Ma J, Zhao J, Yuille AL (2016) Non-rigid point set registration by preserving global and local structures. IEEE Trans Image Process 25(1):53–64
29. Ma Y, Wu H, Wang L, Huang B, Ranjan R, Zomaya A, Jie W (2015b) Remote sensing big data computing: challenges and opportunities. Future Gener Comput Syst 51:47–60
30. Mekhalfi ML, Melgani F, Bazi Y, Alajlan N (2015) Land-use classification with compressive sensing multifeature fusion. IEEE Geosci Remote Sens Lett 12(10):2155–2159
31. Mikolajczyk K, Schmid C (2005) A performance evaluation of local descriptors. IEEE Trans Pattern Anal Mach Intell 27(10):1615–1630
32. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
33. Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2, pp 2033–2040
34. Shapley R, Hawken MJ (2011) Color in the cortex: single- and double-opponent cells. Vis Res 51(7):701–717
35. Sivic J, Zisserman A (2003) Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, pp 1470–1477
36. Van De Sande KE, Gevers T, Snoek CG (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596
37. Van De Weijer J, Schmid C (2006) Coloring local feature extraction. In: Computer Vision–ECCV 2006. Springer, pp 334–348
38. Wang L, Song W, Liu P (2016a) Link the remote sensing big data to the image features via wavelet transformation. Clust Comput 19(2):793–810
39. Wang L, Yan J, Ma Y, Zomaya A (2016b) PipsCloud: High performance cloud computing for remote sensing big data management and processing. Future Generation Computer Systems. doi:10.1016/j.future.2016.06.009
40. Wang L, Zhang J, Liu P, Choo KKR, Huang F (2016c) Spectral–spatial multi-feature-based deep learning for hyperspectral remote sensing image classification. Soft Computing, pp 1–9
41. Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp 1794–1801
42. Yang K, Gao S, Li C, Li Y (2013) Efficient color boundary detection with color-opponent mechanisms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2810–2817
43. Yang K, Gao S, Guo C, Li C, Li Y (2015) Boundary detection using double-opponency and spatial sparseness constraint. IEEE Trans Image Process 24(8):2565–2578
44. Yang Y, Newsam S (2010) Bag-of-visual-words and spatial extensions for land-use classification. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 270–279
45. Yang Y, Newsam S (2011) Spatial pyramid co-occurrence for image classification. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp 1465–1472
46. Yu K, Zhang T, Gong Y (2009) Nonlinear learning using local coordinate coding. In: Advances in Neural Information Processing Systems, pp 2223–2231
47. Zhang F, Du B, Zhang L (2015) Saliency-guided unsupervised feature learning for scene classification. IEEE Trans Geosci Remote Sens 53(4):2175–2184
48. Zhang J, Barhomi Y, Serre T (2012) A new biologically inspired color image descriptor. In: Computer Vision–ECCV 2012. Springer, pp 312–324
49. Zhong Y, Zhu Q, Zhang L (2015) Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery. IEEE Trans Geosci Remote Sens 53(11):6207–6222
Tian Tian received her B.S. in Electronics Information Engineering from Huazhong University of Science and Technology (HUST) in 2009 and her Ph.D. in Control Science and Engineering from HUST in 2015. She is currently a postdoctoral researcher with the College of Computer Science, China University of Geosciences (Wuhan). Her major interests include computer vision and remote sensing image processing.
Yun Zhang received her Ph.D. in Control Science and Engineering from HUST in 2014. She is with the Beijing Electro-Mechanical Engineering Institute. Her research interests mainly focus on biological vision and target recognition.
Hao Dou is currently a Ph.D. student in the School of Automation, Huazhong University of Science and Technology. His research interests include image classification and object detection.
Hengjian Tong received his Ph.D. from the College of Information Engineering, China University of Geosciences (Wuhan) in 2003. He is with the College of Computer Science, China University of Geosciences, where he serves as assistant dean. His research interests focus on remote sensing image segmentation and classification.