Nat Comput DOI 10.1007/s11047-013-9365-x
Kernel clustering using a hybrid memetic algorithm

Yangyang Li · Peidao Li · Bo Wu · Licheng Jiao · Ronghua Shang
© Springer Science+Business Media Dordrecht 2013
Abstract This paper proposes a novel kernel clustering algorithm that uses a hybrid memetic algorithm to cluster complex, unlabeled, and linearly non-separable datasets. The kernel function transforms nonlinear data into a high-dimensional feature space; this increases the probability that the patterns become linearly separable in the transformed space and simplifies the associated data structure. According to the distributions of various datasets, three local learning operators are designed; meanwhile, double mutation operators are incorporated into the local learning operators to further enhance global exploration and to effectively overcome premature convergence. Performance comparisons of the proposed method with the k-means, kernel k-means, global kernel k-means and spectral clustering algorithms on artificial datasets and UCI datasets indicate that the proposed clustering algorithm outperforms the compared algorithms.

Keywords Kernel cluster · Memetic algorithm · Local learning · UCI
1 Introduction

In the field of pattern recognition, the goal of clustering is to partition an unlabeled dataset into several homogeneous groups. As a classical clustering algorithm, the k-means clustering algorithm (KM) (Hartigan and Wong 1979) identifies homogeneous groups by minimizing the clustering error, which is defined as the sum of the squared Euclidean distances between each data point and its corresponding cluster center.
Y. Li (✉) · P. Li · B. Wu · L. Jiao · R. Shang
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, China
e-mail: [email protected]; [email protected]
KM has two limitations, however. First, the solution depends heavily on the initial positions of the cluster centers, so it easily falls into a local optimum. Second, it cannot cluster linearly non-separable datasets. To deal with the second limitation, the kernel k-means clustering algorithm (K-KM) (Scholkopf et al. 1998) has been widely applied to a variety of learning problems. The data are mapped into a nonlinear high-dimensional feature space by the kernel function, and the distances between data points in the original space are replaced by distances in the feature space; KM is then applied using the feature-space distances. K-KM performs better than KM on nonlinear data because the kernel method simplifies the associated data structure, which also enhances its ability to avoid premature convergence. But K-KM still falls into local optima when it deals with complex datasets. The global kernel k-means clustering algorithm (GK-KM) (Tzortzis and Likas 2009) employs K-KM as a local search procedure to overcome both limitations. It finds the best M centers (M is the number of clusters) by exhaustively searching the dataset. To cluster a dataset with M clusters, it first treats all points as one cluster and finds the best single center by exhaustive search with K-KM; it then treats the data as two clusters, keeps the first center fixed, and finds the second best center; continuing in this way, it eventually finds all M best centers. Because GK-KM is an exhaustive search process, it avoids premature convergence better than K-KM. However, to solve an M-clustering problem over a dataset of size N, GK-KM needs to run K-KM NM times, so its computational complexity is very high. Its other drawback is that it
cannot effectively partition datasets with certain special distributions and still falls into local optima. To address these defects, a novel kernel clustering algorithm based on a memetic algorithm (KCMA) is proposed. Memetic algorithms (MAs) (Moscato 1989), proposed by Pablo Moscato, are based on a simulation of the process of cultural evolution: a marriage between a population-based global search and a heuristic local search applied to each individual. Since we are concerned here with evolutionary algorithms in which local search plays a significant role throughout the search, the term MA (Moscato 1989) is used in this paper. Several publications (Lim et al. 2000; Le et al. 2009) have introduced MAs, in which the local learning mechanism is often regarded as a meme that enhances the capability of local convergence. In the literature (Sheng et al. 2008; Åberg and Wessberg 2010), memetic algorithms have been used for clustering and feature selection, and a hybrid memetic algorithm has been introduced for the timetabling problem (Wang and Liu 2009). Like K-KM and GK-KM, KCMA uses the kernel method to simplify the associated data structure, but KCMA uses an evolutionary method to find the M best centers. The spectral clustering algorithm (SC) proposed in the literature (Ng et al. 2001) is based on spectral graph theory and transforms clustering into the problem of optimally partitioning a graph. It first defines an affinity matrix of the data points and computes the eigenvalues and eigenvectors of this matrix; a clustering algorithm is then chosen to cluster the eigenvectors, which indirectly yields a partition of the original dataset. SC is also able to produce nonlinear separating hypersurfaces between clusters (Filippone et al. 2008; Zhang et al. 2009; Guan et al. 2011; Zhou et al. 2011), and it has been a popular clustering method in recent years, successfully applied in areas such as SAR image segmentation (Zhang et al. 2008). In addition, non-negative matrix factorization (NMF) (Kim and Park 2008) is an effective technique that can be applied to data clustering. It factorizes a matrix into a basis matrix and a coefficient matrix; using the coefficient matrix in place of the original matrix yields a dimensionality-reduced matrix that reflects the characteristics of the raw data. In fact, NMF is equivalent to spectral clustering (Ding et al. 2005). In this paper, we propose the kernel clustering algorithm using a hybrid memetic algorithm (KCMA). The kernel function transforms nonlinear data into a high-dimensional feature space, which increases the probability of linear separability of the patterns in the transformed space and simplifies the associated data structure. According to the characteristics of the datasets, two kinds of local learning operators combined with double mutation are applied in KCMA. The crossover operator provides enough diversity for the population, but it also produces some unexpected solutions; after crossover, a further local learning operator is therefore applied
to the individuals to refine the search. Together, the three local operators give nearly every point an opportunity to be assigned to every cluster center, while keeping the computational complexity low. The rest of the paper is organized as follows. Section 2 gives an outline of KCMA and analyzes the action of all operators in detail. In Sect. 3, the performance of the proposed KCMA is compared with KM (Hartigan and Wong 1979), K-KM (Scholkopf et al. 1998), GK-KM (Tzortzis and Likas 2009) and SC (Ng et al. 2001) over a test suite of several interesting datasets, including artificial datasets and UCI datasets. Finally, a brief conclusion is given in Sect. 4.
2 Description of KCMA

Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in R^d$, $i \in \{1, 2, \ldots, N\}$, where the set of $N$ points is denoted by $P$ and $d$ is the dimension, the purpose of KCMA is to partition this dataset into $k$ disjoint clusters $C = \{C_1, C_2, \ldots, C_k\}$. The best clustering result is one in which patterns from the same cluster are as similar as possible and patterns from different clusters differ as much as possible. The clusters should have the following properties: (1) $C_i \neq \emptyset$ for all $i \in \{1, 2, \ldots, k\}$; (2) $C_i \cap C_j = \emptyset$ for all $i \neq j$, $i, j \in \{1, 2, \ldots, k\}$; (3) $\bigcup_{i=1}^{k} C_i = P$.

A kernel mapping $\varphi$ is applied to map the data into a nonlinear high-dimensional space $H$: $\varphi: R^d \to H$, $x_i \mapsto \varphi(x_i)$, where $x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,d}]^T$ and $\varphi(x_i) = [\varphi_1(x_i), \varphi_2(x_i), \ldots, \varphi_H(x_i)]^T$. The cluster centers $m_i$ $(i = 1, 2, \ldots, k)$ mapped into the feature space are $m_i^{\varphi}$ $(i = 1, 2, \ldots, k)$. Under this mapping, the inner product $x_i^T x_j$ becomes $\varphi(x_i)^T \varphi(x_j)$, and this dot product in the transformed space can be calculated in the input space $R^d$ as $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. In this paper, the Gaussian kernel function is used:

$$K(x_i, x_j) = \exp\left(-\left\|x_i - x_j\right\|^2 / 2\sigma^2\right) \quad (1)$$

where $\sigma > 0$ and $K \in R^{N \times N}$; for the Gaussian kernel, $K(x_i, x_i) = 1$. In the feature space, the kernel k-means objective function is defined as:

$$F\left(m_1^{\varphi}, m_2^{\varphi}, \ldots, m_K^{\varphi}\right) = \sum_{i=1}^{N} \sum_{j=1}^{K} I\left(x_i \in C_j\right) \left\|\varphi(x_i) - m_j^{\varphi}\right\|^2, \quad \text{where } m_j^{\varphi} = \frac{\sum_{i=1}^{N} I\left(x_i \in C_j\right) \varphi(x_i)}{\sum_{i=1}^{N} I\left(x_i \in C_j\right)} \quad (2)$$
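As an illustration, the following is a minimal Python sketch of how the Gaussian kernel matrix of Eq. (1) could be computed; the function and variable names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Compute the N x N Gaussian kernel matrix of Eq. (1).

    X     : (N, d) array of input points.
    sigma : kernel width, sigma > 0.
    """
    # Squared Euclidean distances ||x_i - x_j||^2 via the expansion
    # ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negatives
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```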
The squared distance between a mapped point and a cluster center can be computed entirely from kernel entries:

$$ff_{i,j} = \left\|\varphi(x_i) - m_j^{\varphi}\right\|^2 = K_{ii} - \frac{2 \sum_{l=1}^{N} I\left(x_l \in C_j\right) K_{il}}{\sum_{l=1}^{N} I\left(x_l \in C_j\right)} + \frac{\sum_{l=1}^{N} \sum_{t=1}^{N} I\left(x_l \in C_j\right) I\left(x_t \in C_j\right) K_{lt}}{\left(\sum_{l=1}^{N} I\left(x_l \in C_j\right)\right)^2} \quad (3)$$

If $x_l$ belongs to cluster $C_j$, then $I(x_l \in C_j) = 1$; otherwise $I(x_l \in C_j) = 0$. KCMA minimizes the objective function $F$ through three local learning methods.
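A sketch of Eq. (3) in Python, operating on a kernel matrix such as the one from the previous sketch (names are illustrative, not from the paper); each point is then assigned to the cluster with the smallest $ff_{i,j}$, as in step 2 of Fig. 1 below.

```python
import numpy as np

def kernel_distances(K, labels, k):
    """Squared feature-space distances ff[i, j] = ||phi(x_i) - m_j^phi||^2, Eq. (3).

    K      : (N, N) kernel matrix.
    labels : length-N array, labels[i] in {0, ..., k-1}.
    k      : number of clusters.
    """
    N = K.shape[0]
    # Indicator matrix: Z[l, j] = 1 iff x_l belongs to cluster C_j.
    Z = np.zeros((N, k))
    Z[np.arange(N), labels] = 1.0
    sizes = np.maximum(Z.sum(axis=0), 1.0)     # |C_j|, guarded against empties
    cross = (K @ Z) / sizes                    # second term (per point, per cluster)
    within = np.einsum('lj,lt,tj->j', Z, K, Z) / sizes ** 2  # third term
    return np.diag(K)[:, None] - 2.0 * cross + within[None, :]
```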
2.1 Main loop of KCMA

The kernel clustering procedure using a hybrid memetic algorithm is given in Fig. 1. Note that the learning operators guide the process of evolution: each locally improved individual is placed back into the population to compete for reproductive opportunities, so the refined genotypes influence later generations. The basic steps of KCMA are shown in Fig. 1, and its major elements are presented below.

2.1.1 The double mutation operators

From the literature (Yao and Liu 1996, 1999; Dong et al. 2007; Lee and Yao 2004; Chellapilla 1998; Chen et al. 2009; Das et al. 2009), Cauchy mutation performs better when the current search point is far away from the global minimum, while Gaussian mutation is better at finding a local optimum in a good region. Figure 2 compares the Cauchy and Gaussian density functions. On the interval [0, 2] the Gaussian density assigns a large mutation probability, but on [2, +∞) its probability is almost 0, so Gaussian-mutated individuals all stay near the original individual; Gaussian mutation is therefore suitable for local search. The Cauchy density, in contrast, retains a relatively large probability on [2, +∞), so Cauchy mutation can search with a larger step and reach more distant solutions.

Fig. 2 Cauchy and Gaussian density function values

According to the average objective function value, the global Cauchy mutation operator and the local Gauss mutation operator are applied to the population. Under the double mutation operator, the relationship between a parent $(CC_i(j), \sigma_i(j))$ and its offspring $(CC_i'(j), \sigma_i'(j))$ is as follows:

If $ff_{x_i} < ff_{ave}$ (Gauss mutation):

$$CC_i'(j) = CC_i(j) + \sigma_i'(j) N_j(0, 1), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (4)$$

$$\sigma_i'(j) = \sigma_i(j) \exp\left(\frac{\sqrt{\mu g} + \mu g}{2}\right), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (5)$$

Else (Cauchy mutation):

$$CC_i'(j) = CC_i(j) + \sigma_i'(j) C_j(0, 1), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (6)$$

$$\sigma_i'(j) = \begin{cases} \sigma_i(j) \exp\left(\dfrac{ff_{x_i} - ff_{ave}}{ff_{max} - ff_{min}}\right), & \text{when } ff_{max} \neq ff_{min} \\ \sigma_i(j) \exp(1), & \text{when } ff_{max} = ff_{min} \end{cases} \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (7)$$

End

where $N_j(0, 1)$ denotes a normally distributed one-dimensional random number with mean zero and standard deviation one, $C_j(0, 1)$ denotes a Cauchy random variable with scale parameter $t = 1$, $\mu$ denotes the proportional coefficient $(0 < \mu \leq 0.1)$, and $g$ denotes the generation of evolution. $ff_{x_i}$ is the fitness of the current individual, and $ff_{max}$, $ff_{min}$ and $ff_{ave}$ are the maximum, minimum and average fitness of the current population. From Eqs. 4 and 5, the Gauss mutation operator generates offspring near their parents, which enhances local convergence. From Eq. 7, for minimization problems, the size of the local search should be reduced when $ff_{x_i} < ff_{ave}$ and increased otherwise. Taken together, Eqs. 4–7 guide the individuals in a better direction toward the global optimum, so the double mutation improves both the speed and the accuracy of KCMA. At the initial stage of the search, most individuals are far from the global optimum and Cauchy mutation plays the leading role; it ensures that the search is executed in the global region and does not fall into a local optimum prematurely. At the final stage, Gauss mutation takes over and performs an accurate search, so a better solution can be obtained.
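The following hedged Python sketch applies Eqs. 4–7 to one individual. All names are ours, the Cauchy draw uses NumPy's standard Cauchy generator, and fitness is assumed to be minimized.

```python
import numpy as np

def double_mutation(cc, sigma, ff, ff_ave, ff_max, ff_min, mu, g, rng):
    """One application of the double mutation operator, Eqs. (4)-(7).

    cc, sigma : 1-D arrays, an individual and its strategy parameters.
    ff        : fitness of this individual; ff_ave / ff_max / ff_min are
                population statistics; mu in (0, 0.1]; g = generation index.
    """
    n = cc.shape[0]
    if ff < ff_ave:
        # Gauss mutation (Eqs. 4-5): small, local steps.
        sigma_new = sigma * np.exp((np.sqrt(mu * g) + mu * g) / 2.0)
        cc_new = cc + sigma_new * rng.standard_normal(n)
    else:
        # Cauchy mutation (Eqs. 6-7): heavy-tailed, global steps.
        if ff_max != ff_min:
            sigma_new = sigma * np.exp((ff - ff_ave) / (ff_max - ff_min))
        else:
            sigma_new = sigma * np.exp(1.0)
        cc_new = cc + sigma_new * rng.standard_cauchy(n)
    return cc_new, sigma_new
```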
Fig. 1 The pseudocode of the KCMA algorithm:

Procedure:
1. (Initialization) Kernel matrix K, number of clusters k, initial clusters C = [C_1, C_2, ..., C_k], maximum number of iterations t_max.
2. For each data point x_i, i ∈ {1, ..., N}, and every cluster C_j, j ∈ {1, ..., k}, compute ‖φ(x_i) − m_j^φ‖² using Eq. (3); find C'(x_i) = argmin_j ‖φ(x_i) − m_j^φ‖², and update the new clusters C' = [C'_1, C'_2, ..., C'_k] and the objective function values F and ff_{i,j};
3. While the stopping conditions are not satisfied Do
4. (Fitness calculation and reproduction) Evaluate the fitness F and ff_{i,j} of all individuals C' = [C'_1, C'_2, ..., C'_k]; according to the population scale and fitness, reproduce the selected individuals, i.e., select new clusters CC = [CC_1, CC_2, ..., CC_k] through reproduction of the clusters C';
5. (Double mutation: update the clusters CC to generate the new clusters CC') For each individual: if ff_{x_i} < ff_{ave}, perform Gauss mutation; else, perform Cauchy mutation;
6. According to the distribution of the dataset, update the mutated clusters CC' by performing the MA-Hill-climbing local learning operator with the N1 neighborhood or the MA-Quadratic-interpolation local learning operator with the N2 neighborhood, generating the new clusters CC''_1 and CC''_2;
7. Update the mutated clusters CC' by performing the crossover operator and the MA-D.S.C local learning operator with the N3 neighborhood, generating the new clusters CC''_3;
8. (Elitism selection) Select the new clusters C'_1, C'_2, ..., C'_k from the clusters CC''_1, CC''_2 and CC''_3 obtained by the three local methods;
9. (Termination) End while.
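A compact Python skeleton of this main loop is given below. It is only a sketch under simplifying assumptions of ours, not the paper's implementation: individuals encode the k cluster centers directly, fitness is the kernel objective of Eqs. (2)/(3), the three local learning operators are collapsed into a single k-means-style refinement step, crossover is omitted, and the helpers sketched earlier (gaussian_kernel_matrix, kernel_distances, double_mutation) are assumed to be in scope.

```python
import numpy as np

def kcma_sketch(X, k, sigma_kernel=1.0, pop_size=50, generations=100,
                mu=0.05, seed=0):
    """Heavily simplified sketch of the KCMA main loop of Fig. 1."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = gaussian_kernel_matrix(X, sigma_kernel)

    def fitness(centers):
        # Partition by nearest center in input space, then score the
        # partition with the kernel objective of Eq. (2)/(3).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        ff = kernel_distances(K, labels, k)
        return ff[np.arange(N), labels].sum(), labels

    # Steps 1-2: initialize a population of center sets from random points.
    pop = [X[rng.choice(N, k, replace=False)].copy() for _ in range(pop_size)]
    sig = [np.ones((k, d)) for _ in range(pop_size)]

    best_labels, best_f = None, np.inf
    for g in range(1, generations + 1):                 # step 3
        fs = np.array([fitness(c)[0] for c in pop])     # step 4
        ff_ave, ff_max, ff_min = fs.mean(), fs.max(), fs.min()
        new_pop, new_sig = [], []
        for c, s, f in zip(pop, sig, fs):
            # Step 5: double mutation, applied to the flattened centers.
            c2, s2 = double_mutation(c.ravel(), s.ravel(), f,
                                     ff_ave, ff_max, ff_min, mu, g, rng)
            c2, s2 = c2.reshape(k, d), s2.reshape(k, d)
            # Steps 6-7 (collapsed): one k-means-style local refinement.
            _, labels = fitness(c2)
            for j in range(k):
                if np.any(labels == j):
                    c2[j] = X[labels == j].mean(axis=0)
            new_pop.append(c2)
            new_sig.append(s2)
        # Step 8: elitism -- keep the best pop_size of parents + offspring.
        merged = sorted(zip(pop + new_pop, sig + new_sig),
                        key=lambda cs: fitness(cs[0])[0])
        pop = [c for c, _ in merged[:pop_size]]
        sig = [s for _, s in merged[:pop_size]]
        f0, lab0 = fitness(pop[0])
        if f0 < best_f:
            best_labels, best_f = lab0, f0
    return best_labels, best_f
```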
2.1.2 The three meta-learning schemes

To overcome the weaknesses of traditional local search algorithms, we combine them with multiple-neighborhood searching, which improves the capability of the local learning and helps optimize the objective function. After the double mutation operators are applied, two local methods are performed; according to the distribution of the dataset, different neighborhoods are used. Figure 3 shows the pseudocode of the local methods with their suitable neighborhoods.

MA-Hill-climbing (Liang et al. 2009) is used as a local learning operator; its neighborhood is the traditional Euclidean distance between two individuals.

N1 neighbourhood:

$$N_1(x_i, x_j) = \mathrm{dist}(x_i, x_j) \quad (8)$$
where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between two individuals.

MA-Quadratic interpolation (Li et al. 2009; Liang et al. 1999; Liang and Yao 2000) is used as another local learning operator; its neighborhood is the manifold distance between two individuals.

N2 neighbourhood:

$$L(x_i, x_j) = \rho^{\mathrm{dist}(x_i, x_j)} - 1, \qquad N_2(x_i, x_j) = \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} L\left(p_k, p_{k+1}\right) \quad (9)$$

where $\rho > 1$, $L(x_i, x_j)$ denotes the manifold distance between two individuals, $p$ ranges over the paths $P_{ij}$ connecting $x_i$ and $x_j$, and $N_2(x_i, x_j)$ is the manifold neighborhood.

MA-D.S.C. After the crossover operator, the D.S.C local search operator, proposed by Davies, Swann and Campey (Schwefel 1995), is incorporated into KCMA:

N3 neighborhood:

$$N_3(x_i, x_j) = 2 \, \mathrm{dist}(x_i, x_j) \quad (10)$$

where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between two crossover individuals; the N3 neighborhood expands the search area.

MA-Hill-climbing and MA-D.S.C. are both based on Euclidean distance: MA-Hill-climbing is suited to spherical datasets and MA-D.S.C. to ellipsoidal datasets. MA-Quadratic interpolation is based on the manifold distance and is suited to manifold datasets. The combination of the three operators has wider applicability and better performance than any one of them, as the experiments below show; a sketch of the N2 manifold distance follows.
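A minimal Python sketch of the N2 manifold distance of Eq. (9): it computes the edge lengths L on all pairs and then all-pairs shortest paths with Floyd–Warshall. The function name and the default value of ρ are our assumptions.

```python
import numpy as np

def manifold_distance(X, rho=2.0):
    """All-pairs N2 manifold distance of Eq. (9).

    The edge weight between two points is rho**dist - 1, so long Euclidean
    edges are penalized exponentially; the manifold distance is then the
    cheapest path through the data, favoring paths along dense regions.
    """
    assert rho > 1.0
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # pairwise Euclidean distances
    L = rho ** dist - 1.0                        # edge lengths L(x_i, x_j)
    # Floyd-Warshall all-pairs shortest paths over the complete graph.
    N2 = L.copy()
    for m in range(X.shape[0]):
        N2 = np.minimum(N2, N2[:, m:m + 1] + N2[m:m + 1, :])
    return N2
```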
2.1.3 Crossover operator and elitism selection operator

The crossover operator brings the population better diversity, while elitism selection helps the algorithm select good solutions more quickly and enhances global convergence.

2.1.4 Termination criterion

A proper stopping condition is very important for KCMA. The maximum number of iterations is commonly used as the termination criterion in traditional evolutionary algorithms; since our algorithm is built on an evolutionary framework, we likewise employ the maximum generation as the stopping criterion.
3 Experimental results

To test the effectiveness of the proposed KCMA, we compare its performance with four other clustering algorithms, namely KM (Hartigan and Wong 1979), K-KM (Scholkopf et al. 1998), GK-KM (Tzortzis and Likas 2009) and SC (Ng et al. 2001), on artificial datasets and real-world UCI datasets (http://www.ics.uci.edu/~mlearn/MLRepository.html). The simulations were carried out on a 2.7 GHz AMD Athlon(tm) PC with 2 GB RAM, programmed in Matlab. In the following experiments, we performed 100 independent runs on each test problem.

3.1 Test datasets

Table 1 lists the basic characteristics of the datasets: ten artificial datasets and seven UCI datasets are chosen. The datasets AD_20_2 and Fourty have spherical distributions with a large number of classes. Sizes5 and Square4 have spherical and overlapping distributions, while Synthetic1, Synthetic5, Threecircles and 2moondata have manifold distributions. Taotao has both a mixed and an overlapping distribution. Most of these datasets are linearly non-separable.

Fig. 3 The pseudocode of the local method:

Procedure of MA-LS    // LS: Hill-climbing; quadratic interpolation; D.S.C.
Choose an initial solution;
While the neighborhood of the current solution contains a better solution    // neighborhood: N1, N2, N3
    Do choose the best solution from the neighborhood of the current solution and move to this solution
End
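As an illustration of the MA-LS scheme in Fig. 3, here is a small generic Python sketch (names are ours): it keeps moving to improving neighbors until the supplied neighborhood function offers no improvement. Plugging in N1-, N2- or N3-based candidate generators yields the three local learning operators.

```python
def ma_local_search(solution, fitness, neighbors):
    """Generic MA-LS loop of Fig. 3.

    fitness   : callable scoring a solution; lower is better (minimization).
    neighbors : callable returning candidate solutions near its input.
    """
    best, best_f = solution, fitness(solution)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            f = fitness(cand)
            if f < best_f:          # move to an improving neighbor
                best, best_f, improved = cand, f, True
    return best, best_f
```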
Table 1 Description of the datasets

Datasets | Number of data points (N) | Number of clusters (k) | Dimension of datasets (d)
Synthetic1 | 500 | 2 | 2
Synthetic3 | 400 | 4 | 3
Synthetic5 | 600 | 2 | 2
Threecircles | 299 | 3 | 2
2moondata | 400 | 2 | 2
Taotao | 1,300 | 7 | 2
Sizes5 | 1,000 | 4 | 2
Square4 | 1,000 | 4 | 2
AD_20_2 | 1,000 | 20 | 2
Fourty | 1,000 | 40 | 2
Iris (UCI) | 150 | 3 | 4
Glass (UCI) | 214 | 6 | 9
Wine (UCI) | 178 | 3 | 13
Vowel (UCI) | 528 | 11 | 10
Zoo (UCI) | 101 | 7 | 16
Msegment (UCI) | 2,310 | 7 | 18
Abalone (UCI) | 4,177 | 29 | 7
3.2 Experimental settings

The parameters of KCMA are set as follows: the population size is 50, the mutation rate is 0.3, the crossover rate is 0.8, and the maximum generation is 100. The parameter settings of the other four algorithms are the same as reported in the related references (Hartigan and Wong 1979; Scholkopf et al. 1998; Tzortzis and Likas 2009; Ng et al. 2001).

We first apply Huang's accuracy measure, the clustering accuracy (CA) (Das et al. 2008), to evaluate the clustering results, defined as:

$$CA = \frac{\sum_{i=1}^{k} n_i}{n} \quad (11)$$

where $n$ is the total number of data points in the dataset and $n_i$ is the number of data points occurring both in the $i$-th cluster and in its corresponding true cluster. CA lies in the interval [0, 1], and a bigger value is better.

The other metric used in this study is the adjusted Rand index (ARI) (Handl and Knowles 2007), defined below, where $n_{lk}$ is the number of points that belong to both cluster $l$ and cluster $k$ $(l \in T, k \in S)$, and $T$ and $S$ are the true clustering and the obtained clustering, respectively. The ARI also returns a value in the interval [0, 1], and a bigger value means better clustering performance.

$$ARI = R(T, S) = \frac{\sum_{lk} \binom{n_{lk}}{2} - \left[\sum_{l} \binom{n_l}{2} \sum_{k} \binom{n_k}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{l} \binom{n_l}{2} + \sum_{k} \binom{n_k}{2}\right] - \left[\sum_{l} \binom{n_l}{2} \sum_{k} \binom{n_k}{2}\right] / \binom{n}{2}} \quad (12)$$
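A hedged Python sketch of both metrics: CA (Eq. 11) matches predicted clusters to true classes with the Hungarian method before counting agreements, and the ARI (Eq. 12) defers to scikit-learn. The helper names and the use of SciPy/scikit-learn are our choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """CA of Eq. (11): sum of n_i under the best cluster-to-class matching."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: counts[i, j] = points in true class i and cluster j.
    counts = np.array([[np.sum((y_true == c) & (y_pred == g))
                        for g in clusters] for c in classes])
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / y_true.size

def ari(y_true, y_pred):
    # ARI of Eq. (12); note scikit-learn's version can be slightly negative
    # for worse-than-chance partitions, whereas the paper treats it as [0, 1].
    return adjusted_rand_score(y_true, y_pred)
```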
3.3 The comparison experiment: with or without meta-learning operators

To show the contribution of the three meta-learning operators, we report two sets of experimental results: one for the original KCMA, and one for the algorithm without the three meta-learning operators. For this comparison we chose five artificial datasets and three UCI datasets from those listed above. Table 2 shows the mean CA over 100 independent runs. From Table 2 it is easy to see that every local learning operator contributes to the overall performance of KCMA: local learning searches for the best solution more effectively, and without it the algorithm still falls into local optima.

Table 2 The comparison experiment: CA with different meta-learning operators

Datasets | Without any operator | With operator: a | With operator: a, b | With operator: a, b and c
Synthetic5 | 0.8367 | 0.8367 | 0.9983 | 0.9983
Threecircle | 0.3462 | 0.3462 | 0.9928 | 0.9928
Sizes5 | 0.6380 | 0.9932 | 0.9932 | 0.9932
AD_20_2 | 0.3625 | 0.9981 | 0.9981 | 0.9981
Taotao | 0.5538 | 0.5538 | 0.5538 | 0.9924
Glass (UCI) | 0.4818 | 0.4818 | 0.4818 | 0.9179
Vowel (UCI) | 0.3424 | 0.3424 | 0.3424 | 0.8888
Zoo (UCI) | 0.7129 | 0.9901 | 0.9901 | 0.9901

'a' stands for the local learning operator MA-Hill-climbing, 'b' for MA-Quadratic interpolation, and 'c' for MA-D.S.C

3.4 General comparison with other clustering algorithms

Table 3 shows the mean ARI, mean CA and mean running time (in seconds) of the five algorithms over 100 independent runs.
Table 3 Mean clustering accuracy, mean adjusted Rand index and mean run time achieved by each clustering algorithm over 100 independent runs

Methods | Metric | Synthetic1 | Synthetic3 | Synthetic5 | Threecircles | 2moondata | Taotao | Sizes5 | Square4 | AD_20_2
K-means | CA | 0.3052 | 0.8027 | 0.8345 | 0.3572 | 0.7170 | 0.5625 | 0.7590 | 0.9240 | 0.8187
K-means | ARI | 8.5783e-004 | 0.7141 | 0.4468 | 0.0738 | 0.1866 | 0.4850 | 0.6055 | 0.8091 | 0.8110
K-means | Time (s) | 0.2339 | 0.8094 | 0.3225 | 0.1881 | 0.3345 | 2.2890 | 1.7735 | 1.1912 | 2.3295
K-KM | CA | 0.8173 | 0.6075 | 0.7154 | 0.3461 | 0.7057 | 0.6087 | 0.2435 | 0.9328 | 0.4724
K-KM | ARI | 0.6190 | 0.5940 | 0.2457 | 0.0542 | 0.1672 | 0.5211 | 0.0886 | 0.8294 | 0.5370
K-KM | Time (s) | 1.6365 | 0.9032 | 2.6101 | 0.5198 | 0.9576 | 17.578 | 8.7892 | 9.1210 | 8.5515
GK-KM | CA | 1 | 1 | 0.8367 | 0.7011 | 0.7050 | 0.6423 | 0.2400 | 0.9320 | 1
GK-KM | ARI | 1 | 1 | 0.4525 | 0.6460 | 0.1660 | 0.5612 | 0.0977 | 0.8276 | 1
GK-KM | Time (s) | 54.016 | 27.737 | 55.928 | 12.484 | 16.386 | 4828.9 | 1455.4 | 785.76 | 291.37
SC | CA | 1 | 0.8675 | 1 | 0.7191 | 1 | 0.7732 | 0.9684 | 0.8618 | 1
SC | ARI | 1 | 0.8555 | 1 | 0.6088 | 1 | 0.7789 | 0.9431 | 0.7599 | 1
SC | Time (s) | 0.9585 | 0.8761 | 2.1132 | 2.7634 | 0.7569 | 2.2013 | 1.4378 | 2.9647 | 1.6535
KCMA | CA | 0.997 | 0.9980 | 0.9983 | 0.9928 | 0.9989 | 0.9924 | 0.9932 | 0.9930 | 0.9981
KCMA | ARI | 0.9962 | 0.9947 | 0.9932 | 0.9815 | 0.9954 | 0.9899 | 0.9761 | 0.9814 | 0.9959
KCMA | Time (s) | 11.586 | 7.5957 | 10.306 | 4.2161 | 5.2114 | 135.53 | 55.196 | 47.560 | 173.19

Methods | Metric | Fourty | Iris (UCI) | Glass (UCI) | Wine (UCI) | Vowel (UCI) | Zoo (UCI) | Msegment (UCI) | Abalone (UCI)
K-means | CA | 0.8452 | 0.8977 | 0.4504 | 0.7138 | 0.3249 | 0.7401 | 0.5132 | 0.1168
K-means | ARI | 0.8327 | 0.7391 | 0.2237 | 0.3795 | 0.2070 | 0.6258 | 0.3511 | 0.0426
K-means | Time (s) | 4.9180 | 0.1176 | 0.2282 | 0.2527 | 0.8004 | 1.0702 | 3.2537 | 20.594
K-KM | CA | 0.8143 | 0.8315 | 0.4916 | 0.4222 | 0.3414 | 0.7351 | 0.1351 | 0.1966
K-KM | ARI | 0.7897 | 0.6729 | 0.2513 | 0.0797 | 0.2257 | 0.6423 | 8.9214e-005 | 0.0617
K-KM | Time (s) | 8.5467 | 0.1367 | 0.2429 | 0.1803 | 1.7739 | 1.2537 | 74.183 | 477.23
GK-KM | CA | 1 | 0.8933 | 0.5421 | 0.7191 | 0.5290 | 0.7921 | 0.1476 | \
GK-KM | ARI | 1 | 0.7302 | 0.2543 | 0.3941 | 0.4110 | 0.7265 | 6.9338e-005 | \
GK-KM | Time (s) | 2690.9 | 3.5718 | 27.347 | 6.5318 | 113.21 | 8.3440 | 4398.8 | \
SC | CA | 1 | 0.8867 | 0.5019 | 0.6292 | 0.3494 | 0.5853 | – | –
SC | ARI | 1 | 0.7163 | 0.2491 | 0.3822 | 0.2026 | 0.5054 | – | –
SC | Time (s) | 3.9568 | 0.7783 | 0.8210 | 0.9013 | 0.0398 | 0.1058 | – | –
KCMA | CA | 0.9910 | 0.9681 | 0.9179 | 0.9484 | 0.8888 | 0.9901 | 0.9919 | 0.9014
KCMA | ARI | 0.9827 | 0.9070 | 0.8183 | 0.8553 | 0.7674 | 0.9722 | 0.9811 | 0.8762
KCMA | Time (s) | 290.77 | 3.3538 | 6.0692 | 3.0107 | 28.687 | 3.2267 | 444.01 | 11336

"\" means the running time is too long to be acceptable. "–" means out of memory, so the result could not be obtained.
Table 3 shows that KCMA obtains better results than the other four algorithms on most datasets, the exceptions being Synthetic1, Synthetic3, AD_20_2 and Fourty, where KCMA achieves results nearly identical to GK-KM. However, the running time of KCMA is far lower than that of GK-KM, which means that KCMA effectively reduces the running time while improving clustering accuracy. SC performs relatively well on most datasets, but it does not work well on overlapping and real datasets, and it needs much more memory when clustering large-scale datasets.
Figure 4 shows scatter plots of selected datasets as clustered by the five algorithms KM, K-KM, GK-KM, SC and KCMA. It is clearly seen that K-KM performs better than KM on the datasets Synthetic1, Taotao and Sizes5, which confirms that the kernel function transforms nonlinear data into a high-dimensional feature space, increases the probability of linear separability of the patterns in the transformed space, and simplifies the associated data structure. But K-KM still depends heavily on the initial positions of the cluster centers.
Fig. 4 The clustering results of the five algorithms KM, K-KM, GK-KM, SC and KCMA (one algorithm per column) on the datasets Synthetic1, Synthetic3, Synthetic5, Threecircles, 2moondata, Taotao, Sizes5, Square4, AD_20_2 and Fourty
Consequently, the clustering results of GK-KM are better than those of KM and K-KM on most datasets. As Fig. 4 shows, KCMA performs better than KM and K-KM on all datasets, and on Synthetic5, Threecircles and Taotao the results obtained by KCMA are better than those obtained by GK-KM. On Synthetic1 and AD_20_2, the results of both KCMA and GK-KM are close to the globally optimal classification. From the dataset Taotao, it can be inferred that SC is not suited to every dataset. To analyze the stability of KCMA, one artificial dataset and four UCI datasets were chosen. Figure 5 shows the statistical CA values of the five algorithms over 100 independent runs, using box plots to illustrate the distributions of the samples. In a notched box plot, the notches give a robust estimate of the uncertainty about the medians for box-to-box comparison; the boxes have lines at the lower quartile, median, and upper quartile values; the whiskers extend from each end of the boxes to show the extent of the rest of the data; and outliers, marked '+', are values beyond the ends of the whiskers. The five boxes in each plot correspond to the five clustering algorithms. As Fig. 5 shows, the statistical CA values on Synthetic1 obtained by GK-KM and KCMA are clearly better than those of the other three algorithms, and on the four UCI datasets the results obtained by KCMA are clearly better than those of the other four algorithms. The larger the CA, the better the performance: the boxes (lower quartile, median and upper quartile of the CA values) for Iris, Glass, Vowel and Zoo obtained by KCMA are higher than those of the other four algorithms, so KCMA is more stable than the compared algorithms.

3.5 Analysis of robustness

To compare the robustness of these algorithms, we follow the formula used by Geng et al. (2005), defined as follows:
$$b_m = \sum_{i} \frac{R_{mi}}{\max_{m} R_{mi}} \quad (13)$$

where $R_{mi}$ is the mean ARI value of the $m$-th method on the $i$-th dataset. When a method has the best ARI on a dataset, its ratio for that dataset is 1; $b_m$ sums these ratios for one method over all studied datasets, so it provides a good measurement of the robustness of a method, and a larger value indicates better robustness. Figure 6 shows the distribution of $b_m$ for the five methods on the fifteen datasets (all datasets except Msegment and Abalone). The values are 10.0529, 9.0855, 11.5327, 12.1425 and 14.666 for KM, K-KM, GK-KM, SC and KCMA, respectively. The proposed KCMA shows the best performance; in fact, the $b_m$ of KCMA is very close to 15 on the fifteen datasets, which indicates that KCMA performs very well in different situations. So KCMA is more robust than the other four methods.
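Eq. (13) reduces to a few lines of NumPy; this sketch (names ours) takes a methods-by-datasets matrix of mean ARI values:

```python
import numpy as np

def robustness(R):
    """b_m of Eq. (13) for a (methods, datasets) matrix of mean ARI values."""
    # Divide each column by its best (largest) entry, then sum per method.
    return (R / R.max(axis=0, keepdims=True)).sum(axis=1)
```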
4 Conclusions

This paper has presented a novel kernel clustering algorithm using a hybrid memetic algorithm (KCMA) for complex and linearly non-separable datasets. KCMA uses a kernel function to map data points from the input space to a higher-dimensional feature space, optimizes the objective function in that feature space, and finds near-optimal minima through three local learning methods with different neighborhoods. The main advantage of this method is that the different local neighborhoods adapt to the characteristics of different data distributions. These local methods, combined with the double mutation operator and the crossover operator, give every data point a certain probability of being assigned to every cluster; meanwhile, they guide all individuals in a better direction and reduce blind searching. Thus the method effectively reduces the computational complexity and saves time.
Fig. 5 Box plots of the CA values of the five algorithms on the datasets (a) Synthetic1, (b) Iris, (c) Glass, (d) Vowel and (e) Zoo
From the experimental results, it can be seen that KCMA obtains better solutions than KM, K-KM, GK-KM and SC on most of the test datasets. KCMA also accelerates the search: it needs far less time than GK-KM, and it performs more stably than the other clustering algorithms. But KCMA has some shortcomings: it occasionally produces abnormal points and leads some individuals far from their cluster centers. In future work we will therefore develop a novel local neighborhood for the individuals that lets them learn from one another, so as to reduce abnormal points. Our future research will also focus on improving the performance of automatic clustering on high-dimensional datasets by incorporating feature selection mechanisms.
Fig. 6 Comparison of the robustness $b_m$ of the five methods (KM 10.0529, K-KM 9.0855, GK-KM 11.5327, SC 12.1425, KCMA 14.666) over the fifteen datasets
Acknowledgments This work was supported by the Program for New Century Excellent Talents in University (No. NCET-12-0920), the National Natural Science Foundation of China (Nos. 61272279, 61001202 and 61203303), the China Postdoctoral Science Foundation Funded Project (Nos. 20080431228, 20090461283 and 20090451369), the China Postdoctoral Science Foundation Special Funded Project (Nos. 200801426 and 201104618), the Fundamental Research Funds for the Central Universities (Nos. K50511020014, K50511020011 and K50510020011), the Provincial Natural Science Foundation of Shaanxi of China (No. 2011JQ8020) and the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048).
References

Åberg MB, Wessberg J (2010) A memetic algorithm for selection of 3D clustered features with applications in neuroscience. Paper presented at the International Conference on Pattern Recognition, Istanbul, 2010, pp 1076–1079
Chellapilla K (1998) Combining mutation operators in evolutionary programming. IEEE Trans Evol Comput 2(3):91–96
Chen C, Low CP, Yang Z (2009) Preserving and exploiting genetic diversity in evolutionary programming algorithms. IEEE Trans Evol Comput 13(3):661–673
Das S, Abraham A, Konar A (2008) Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29(5):688–699
Das S, Abraham A, Chakraborty UK, Konar A (2009) Differential evolution using a neighborhood based mutation operator. IEEE Trans Evol Comput 13(3):526–553
Ding C, He XF, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. Paper presented at the SIAM International Conference on Data Mining, Newport Beach, 2005, pp 606–610
Dong H, He J, Huang H, Hou W (2007) Evolutionary programming using a mixed mutation strategy. Inf Sci 177(1):312–327
Filippone M, Camastra F, Masulli F, Rovetta S (2008) A survey of kernel and spectral methods for clustering. Pattern Recogn 41:176–190
Geng X, Zhan DC, Zhou ZH (2005) Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Trans Syst Man Cybern B 35(6):1098–1107
Guan NY, Tao DC, Luo ZG, Yuan B (2011) Non-negative patch alignment framework. IEEE Trans Neural Netw 22(8):1218–1230
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Applied Statistics 28:100–108
Kim J, Park H (2008) Toward faster nonnegative matrix factorization: a new algorithm and comparisons. Paper presented at the 8th IEEE International Conference on Data Mining, New York, 2008
Le MN, Ong YS, Jin Y, Sendhoff B (2009) Lamarckian memetic algorithms: local optimum and connectivity structure analysis. Memetic Comput J 1(3):175–190
Lee CY, Yao X (2004) Evolutionary programming using mutations based on the Lévy probability distribution. IEEE Trans Evol Comput 8(1):1–13
Li B, Zhao SL, Fang L (2009) Optimized GM (1, 1) based on Romber algorithm and quadratic interpolation method. Paper presented at the International Conference on Apperceiving Computing and Intelligence Analysis, Oct 2009, pp 128–131
Liang KH, Yao X (2000) Evolutionary search of approximated N-dimensional landscapes. Int J Knowl Int Eng Syst 4(3):172–183
Liang KH, Yao X, Newton C (1999) Combining landscape approximation and local search in global optimization. In: Proceedings of the 1999 Congress on Evolutionary Computation, 6–9 July 1999, vol 2. IEEE Press, pp 1514–1520
Liang RS, Jiang YF, Bian R (2009) Ordered hill climbing search for heuristic planning. Paper presented at the International Conference on Information Engineering and Computer Science, Tokyo, Dec 2009, pp 1–4
Lim M, Yuan Y, Omatu S (2000) Efficient genetic algorithms using simple genes exchange local search policy for the quadratic assignment problem. Comput Optimization Appl 15(3):249–268
Moscato PA (1989) On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms. Caltech Concurrent Computation Program Report 826, Caltech, Pasadena
Ng AY, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, pp 849–856
Scholkopf B, Smola A, Muller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Schwefel HP (1995) Evolution and optimum seeking. Wiley, New York
Sheng WG, Liu XH, Fairhurst MC (2008) A niching memetic algorithm for simultaneous clustering and feature selection. IEEE Trans Knowl Data Eng 20(7):868–879
Tzortzis GF, Likas AC (2009) The global kernel k-means algorithm for clustering in feature space. IEEE Trans Neural Netw 20(7):1181–1194
Wang Z, Liu JL (2009) Hybrid memetic algorithm for uniting classes of university timetabling problem. Paper presented at the International Conference on Computational Intelligence and Security, Beijing, 11–14 Dec 2009
Yao X, Liu Y (1996) Fast evolutionary programming. In: Proceedings of the Fifth Annual Conference on Evolutionary Programming (EP'96). MIT Press, San Diego, pp 451–460
Yao X, Liu Y (1999) Evolutionary programming made faster. IEEE Trans Evol Comput 3(2):82–102
Zhang XR, Jiao LC, Liu F, Bo LF, Gong MG (2008) Spectral clustering ensemble applied to texture features for SAR image segmentation. IEEE Trans Geosci Remote Sens 46(7):2126–2136
Zhang TH, Tao DC, Li XL, Yang J (2009) Patch alignment for dimensionality reduction. IEEE Trans Knowl Data Eng 21(9):1299–1313
Zhou TY, Tao DC, Wu XD (2011) Manifold elastic net: a unified framework for sparse dimension reduction. Data Min Knowl Disc 22(3):340–371