Nat Comput DOI 10.1007/s11047-013-9365-x
Kernel clustering using a hybrid memetic algorithm

Yangyang Li · Peidao Li · Bo Wu · Licheng Jiao · Ronghua Shang
© Springer Science+Business Media Dordrecht 2013
Abstract This paper proposes a novel kernel clustering algorithm that uses a hybrid memetic algorithm to cluster complex, unlabeled, and linearly non-separable datasets. The kernel function transforms nonlinear data into a high-dimensional feature space; this increases the probability that the patterns become linearly separable in the transformed space and simplifies the associated data structure. According to the distributions of various datasets, three local learning operators are designed; meanwhile, double mutation operators are incorporated into the local learning operators to further enhance global exploration and to effectively overcome premature convergence. Performance comparisons of the proposed method with the k-means, kernel k-means, global kernel k-means and spectral clustering algorithms on artificial datasets and UCI datasets indicate that the proposed clustering algorithm outperforms the compared algorithms.

Keywords Kernel cluster · Memetic algorithm · Local learning · UCI
1 Introduction

In the field of pattern recognition, the goal of clustering is to partition an unlabeled dataset into several homogeneous groups. As a classical clustering algorithm, the k-means clustering algorithm (KM) (Hartigan and Wong 1979) identifies homogeneous groups by minimizing the clustering error, which is defined as the sum of the squared Euclidean distances between each data point and its corresponding cluster center.
Y. Li (✉) · P. Li · B. Wu · L. Jiao · R. Shang
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, China
e-mail: [email protected]; [email protected]
KM has two limitations, however. First, the solution depends heavily on the initial positions of the cluster centers, so it easily falls into a local optimum. Second, it cannot cluster linearly non-separable datasets. To deal with the second limitation, the kernel k-means clustering algorithm (K-KM) (Scholkopf et al. 1998) has been widely applied to a variety of learning problems. The data are mapped into a nonlinear high-dimensional feature space by the kernel function, and the distances between data points in the original space are replaced by distances in the feature space; KM is then applied using the feature-space distances. K-KM performs better than KM on nonlinear data because the kernel method simplifies the associated data structure, which also enhances its ability to avoid premature convergence. But K-KM still falls into local optima when it deals with complex datasets. The global kernel k-means clustering algorithm (GK-KM) (Tzortzis and Likas 2009) employs K-KM as a local search procedure to overcome both limitations. It finds the best M centers (M is the number of clusters) by exhaustively searching the dataset. To cluster a dataset with M clusters, it first treats all points as one cluster and finds the best single center by exhaustive search with K-KM; it then treats the data as two clusters, keeps the first center fixed, and finds the second best center; continuing in this way, it eventually finds all M best centers. Because GK-KM is an exhaustive search process, it avoids premature convergence better than K-KM. However, to solve an M-clustering problem over a dataset of size N, GK-KM needs to run K-KM NM times, so its computational complexity is very high. Its other drawback is that it
cannot effectively partition datasets with certain special distributions and still falls into local optima. To address these defects, a novel kernel clustering algorithm based on a memetic algorithm (KCMA) is proposed. Memetic algorithms (MAs) (Moscato 1989), proposed by Pablo Moscato, are based on a simulation of the process of cultural evolution: a marriage between a population-based global search and a heuristic local search applied to each individual. Since we are concerned here with evolutionary algorithms in which local search plays a significant role throughout the search, the term MA (Moscato 1989) is used in this paper. Several publications (Lim et al. 2000; Le et al. 2009) have introduced MAs, in which the local learning mechanism is often regarded as a meme that enhances the capability of local convergence. In the literature (Sheng et al. 2008; Åberg and Wessberg 2010), memetic algorithms have been used for clustering and feature selection, and a hybrid memetic algorithm has been introduced for the timetabling problem (Wang and Liu 2009). Like K-KM and GK-KM, KCMA uses the kernel method to simplify the associated data structure, but KCMA uses an evolutionary method to find the M best centers. The spectral clustering algorithm (SC) proposed in the literature (Ng et al. 2001) is based on spectral graph theory and transforms clustering into the problem of optimally partitioning a graph. It first defines an affinity matrix of the data points and computes the eigenvalues and eigenvectors of this matrix; a clustering algorithm is then chosen to cluster the eigenvectors, which indirectly yields a partition of the original dataset. SC is also able to produce nonlinear separating hypersurfaces between clusters (Filippone et al. 2008; Zhang et al. 2009; Guan et al. 2011; Zhou et al. 2011), and it has been a popular clustering method in recent years, successfully applied in areas such as SAR image segmentation (Zhang et al. 2008). In addition, non-negative matrix factorization (NMF) (Kim and Park 2008) is an effective technique that can be applied to data clustering. It factorizes a matrix into a basis matrix and a coefficient matrix; using the coefficient matrix in place of the original matrix yields a dimensionality-reduced matrix that reflects the characteristics of the raw data. In fact, NMF is equivalent to spectral clustering (Ding et al. 2005). In this paper, we propose the kernel clustering algorithm using a hybrid memetic algorithm (KCMA). The kernel function transforms nonlinear data into a high-dimensional feature space, which increases the probability of linear separability of the patterns in the transformed space and simplifies the associated data structure. According to the characteristics of the datasets, two kinds of local learning operators combined with double mutation are applied in KCMA. The crossover operator provides enough diversity for the population, but it also produces some unexpected solutions; after crossover, a further local learning operator is therefore applied
to the individuals to refine the search. Together, the three local operators give nearly every point an opportunity to be assigned to every cluster center, while keeping the computational complexity low. The rest of the paper is organized as follows. Section 2 gives an outline of KCMA and analyzes the action of all operators in detail. In Sect. 3, the performance of the proposed KCMA is compared with KM (Hartigan and Wong 1979), K-KM (Scholkopf et al. 1998), GK-KM (Tzortzis and Likas 2009) and SC (Ng et al. 2001) over a test suite of several interesting datasets, including artificial datasets and UCI datasets. Finally, a brief conclusion is given in Sect. 4.
2 Description of KCMA

Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in R^d$, $i \in \{1, 2, \ldots, N\}$, where the set of $N$ points is denoted by $P$ and $d$ is the dimension, the purpose of KCMA is to partition this dataset into $k$ disjoint clusters $C = \{C_1, C_2, \ldots, C_k\}$. The best clustering result is one in which patterns from the same cluster are as similar as possible and patterns from different clusters differ as much as possible. The clusters should have the following properties: (1) $C_i \neq \emptyset$ for all $i \in \{1, 2, \ldots, k\}$; (2) $C_i \cap C_j = \emptyset$ for all $i \neq j$, $i, j \in \{1, 2, \ldots, k\}$; (3) $\bigcup_{i=1}^{k} C_i = P$.

A kernel mapping $\varphi$ is applied to map the data into a nonlinear high-dimensional space $H$: $\varphi: R^d \to H$, $x_i \mapsto \varphi(x_i)$, where $x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,d}]^T$ and $\varphi(x_i) = [\varphi_1(x_i), \varphi_2(x_i), \ldots, \varphi_H(x_i)]^T$. The cluster centers $m_i$ $(i = 1, 2, \ldots, k)$ mapped into the feature space are $m_i^{\varphi}$ $(i = 1, 2, \ldots, k)$. Under this mapping, the inner product $x_i^T x_j$ becomes $\varphi(x_i)^T \varphi(x_j)$, and this dot product in the transformed space can be calculated in the input space $R^d$ as $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. In this paper, the Gaussian kernel function is used:

$$K(x_i, x_j) = \exp\left(-\left\|x_i - x_j\right\|^2 / 2\sigma^2\right) \quad (1)$$

where $\sigma > 0$ and $K \in R^{N \times N}$; for the Gaussian kernel, $K(x_i, x_i) = 1$. In the feature space, the kernel k-means objective function is defined as:

$$F\left(m_1^{\varphi}, m_2^{\varphi}, \ldots, m_K^{\varphi}\right) = \sum_{i=1}^{N} \sum_{j=1}^{K} I\left(x_i \in C_j\right) \left\|\varphi(x_i) - m_j^{\varphi}\right\|^2, \quad \text{where } m_j^{\varphi} = \frac{\sum_{i=1}^{N} I\left(x_i \in C_j\right) \varphi(x_i)}{\sum_{i=1}^{N} I\left(x_i \in C_j\right)} \quad (2)$$
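As an illustration, the following is a minimal Python sketch of how the Gaussian kernel matrix of Eq. (1) could be computed; the function and variable names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Compute the N x N Gaussian kernel matrix of Eq. (1).

    X     : (N, d) array of input points.
    sigma : kernel width, sigma > 0.
    """
    # Squared Euclidean distances ||x_i - x_j||^2 via the expansion
    # ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negatives
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```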
The squared distance between a mapped point and a cluster center can be computed entirely from kernel entries:

$$ff_{i,j} = \left\|\varphi(x_i) - m_j^{\varphi}\right\|^2 = K_{ii} - \frac{2 \sum_{l=1}^{N} I\left(x_l \in C_j\right) K_{il}}{\sum_{l=1}^{N} I\left(x_l \in C_j\right)} + \frac{\sum_{l=1}^{N} \sum_{t=1}^{N} I\left(x_l \in C_j\right) I\left(x_t \in C_j\right) K_{lt}}{\left(\sum_{l=1}^{N} I\left(x_l \in C_j\right)\right)^2} \quad (3)$$

If $x_l$ belongs to cluster $C_j$, then $I(x_l \in C_j) = 1$; otherwise $I(x_l \in C_j) = 0$. KCMA minimizes the objective function $F$ through three local learning methods.
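A sketch of Eq. (3) in Python, operating on a kernel matrix such as the one from the previous sketch (names are illustrative, not from the paper); each point is then assigned to the cluster with the smallest $ff_{i,j}$, as in step 2 of Fig. 1 below.

```python
import numpy as np

def kernel_distances(K, labels, k):
    """Squared feature-space distances ff[i, j] = ||phi(x_i) - m_j^phi||^2, Eq. (3).

    K      : (N, N) kernel matrix.
    labels : length-N array, labels[i] in {0, ..., k-1}.
    k      : number of clusters.
    """
    N = K.shape[0]
    # Indicator matrix: Z[l, j] = 1 iff x_l belongs to cluster C_j.
    Z = np.zeros((N, k))
    Z[np.arange(N), labels] = 1.0
    sizes = np.maximum(Z.sum(axis=0), 1.0)     # |C_j|, guarded against empties
    cross = (K @ Z) / sizes                    # second term (per point, per cluster)
    within = np.einsum('lj,lt,tj->j', Z, K, Z) / sizes ** 2  # third term
    return np.diag(K)[:, None] - 2.0 * cross + within[None, :]
```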
2.1 Main loop of KCMA

The kernel clustering procedure using a hybrid memetic algorithm is given in Fig. 1. Note that the learning operators guide the process of evolution: each locally improved individual is placed back into the population to compete for reproductive opportunities, so the refined genotypes influence later generations. The basic steps of KCMA are shown in Fig. 1, and its major elements are presented below.

2.1.1 The double mutation operators

From the literature (Yao and Liu 1996, 1999; Dong et al. 2007; Lee and Yao 2004; Chellapilla 1998; Chen et al. 2009; Das et al. 2009), Cauchy mutation performs better when the current search point is far away from the global minimum, while Gaussian mutation is better at finding a local optimum in a good region. Figure 2 compares the Cauchy and Gaussian density functions. On the interval [0, 2] the Gaussian density assigns a large mutation probability, but on [2, +∞) its probability is almost 0, so Gaussian-mutated individuals all stay near the original individual; Gaussian mutation is therefore suitable for local search. The Cauchy density, in contrast, retains a relatively large probability on [2, +∞), so Cauchy mutation can search with a larger step and reach more distant solutions.

Fig. 2 Cauchy and Gaussian density function values

According to the average objective function value, the global Cauchy mutation operator and the local Gauss mutation operator are applied to the population. Under the double mutation operator, the relationship between a parent $(CC_i(j), \sigma_i(j))$ and its offspring $(CC_i'(j), \sigma_i'(j))$ is as follows:

If $ff_{x_i} < ff_{ave}$ (Gauss mutation):

$$CC_i'(j) = CC_i(j) + \sigma_i'(j) N_j(0, 1), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (4)$$

$$\sigma_i'(j) = \sigma_i(j) \exp\left(\frac{\sqrt{\mu g} + \mu g}{2}\right), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (5)$$

Else (Cauchy mutation):

$$CC_i'(j) = CC_i(j) + \sigma_i'(j) C_j(0, 1), \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (6)$$

$$\sigma_i'(j) = \begin{cases} \sigma_i(j) \exp\left(\dfrac{ff_{x_i} - ff_{ave}}{ff_{max} - ff_{min}}\right), & \text{when } ff_{max} \neq ff_{min} \\ \sigma_i(j) \exp(1), & \text{when } ff_{max} = ff_{min} \end{cases} \quad i = 1, \ldots, N, \; j = 1, \ldots, d \quad (7)$$

End

where $N_j(0, 1)$ denotes a normally distributed one-dimensional random number with mean zero and standard deviation one, $C_j(0, 1)$ denotes a Cauchy random variable with scale parameter $t = 1$, $\mu$ denotes the proportional coefficient $(0 < \mu \leq 0.1)$, and $g$ denotes the generation of evolution. $ff_{x_i}$ is the fitness of the current individual, and $ff_{max}$, $ff_{min}$ and $ff_{ave}$ are the maximum, minimum and average fitness of the current population. From Eqs. 4 and 5, the Gauss mutation operator generates offspring near their parents, which enhances local convergence. From Eq. 7, for minimization problems, the size of the local search should be reduced when $ff_{x_i} < ff_{ave}$ and increased otherwise. Taken together, Eqs. 4–7 guide the individuals in a better direction toward the global optimum, so the double mutation improves both the speed and the accuracy of KCMA. At the initial stage of the search, most individuals are far from the global optimum and Cauchy mutation plays the leading role; it ensures that the search is executed in the global region and does not fall into a local optimum prematurely. At the final stage, Gauss mutation takes over and performs an accurate search, so a better solution can be obtained.
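The following hedged Python sketch applies Eqs. 4–7 to one individual. All names are ours, the Cauchy draw uses NumPy's standard Cauchy generator, and fitness is assumed to be minimized.

```python
import numpy as np

def double_mutation(cc, sigma, ff, ff_ave, ff_max, ff_min, mu, g, rng):
    """One application of the double mutation operator, Eqs. (4)-(7).

    cc, sigma : 1-D arrays, an individual and its strategy parameters.
    ff        : fitness of this individual; ff_ave / ff_max / ff_min are
                population statistics; mu in (0, 0.1]; g = generation index.
    """
    n = cc.shape[0]
    if ff < ff_ave:
        # Gauss mutation (Eqs. 4-5): small, local steps.
        sigma_new = sigma * np.exp((np.sqrt(mu * g) + mu * g) / 2.0)
        cc_new = cc + sigma_new * rng.standard_normal(n)
    else:
        # Cauchy mutation (Eqs. 6-7): heavy-tailed, global steps.
        if ff_max != ff_min:
            sigma_new = sigma * np.exp((ff - ff_ave) / (ff_max - ff_min))
        else:
            sigma_new = sigma * np.exp(1.0)
        cc_new = cc + sigma_new * rng.standard_cauchy(n)
    return cc_new, sigma_new
```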
Fig. 1 The pseudocode of the KCMA algorithm:

Procedure:
1. (Initialization) Kernel matrix K, number of clusters k, initial clusters C = [C_1, C_2, ..., C_k], maximum number of iterations t_max.
2. For each data point x_i, i ∈ {1, ..., N}, and every cluster C_j, j ∈ {1, ..., k}, compute ‖φ(x_i) − m_j^φ‖² using Eq. (3); find C'(x_i) = argmin_j ‖φ(x_i) − m_j^φ‖², and update the new clusters C' = [C'_1, C'_2, ..., C'_k] and the objective function values F and ff_{i,j};
3. While the stopping conditions are not satisfied Do
4. (Fitness calculation and reproduction) Evaluate the fitness F and ff_{i,j} of all individuals C' = [C'_1, C'_2, ..., C'_k]; according to the population scale and fitness, reproduce the selected individuals, i.e., select new clusters CC = [CC_1, CC_2, ..., CC_k] through reproduction of the clusters C';
5. (Double mutation: update the clusters CC to generate the new clusters CC') For each individual: if ff_{x_i} < ff_{ave}, perform Gauss mutation; else, perform Cauchy mutation;
6. According to the distribution of the dataset, update the mutated clusters CC' by performing the MA-Hill-climbing local learning operator with the N1 neighborhood or the MA-Quadratic-interpolation local learning operator with the N2 neighborhood, generating the new clusters CC''_1 and CC''_2;
7. Update the mutated clusters CC' by performing the crossover operator and the MA-D.S.C local learning operator with the N3 neighborhood, generating the new clusters CC''_3;
8. (Elitism selection) Select the new clusters C'_1, C'_2, ..., C'_k from the clusters CC''_1, CC''_2 and CC''_3 obtained by the three local methods;
9. (Termination) End while.
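A compact Python skeleton of this main loop is given below. It is only a sketch under simplifying assumptions of ours, not the paper's implementation: individuals encode the k cluster centers directly, fitness is the kernel objective of Eqs. (2)/(3), the three local learning operators are collapsed into a single k-means-style refinement step, crossover is omitted, and the helpers sketched earlier (gaussian_kernel_matrix, kernel_distances, double_mutation) are assumed to be in scope.

```python
import numpy as np

def kcma_sketch(X, k, sigma_kernel=1.0, pop_size=50, generations=100,
                mu=0.05, seed=0):
    """Heavily simplified sketch of the KCMA main loop of Fig. 1."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = gaussian_kernel_matrix(X, sigma_kernel)

    def fitness(centers):
        # Partition by nearest center in input space, then score the
        # partition with the kernel objective of Eq. (2)/(3).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        ff = kernel_distances(K, labels, k)
        return ff[np.arange(N), labels].sum(), labels

    # Steps 1-2: initialize a population of center sets from random points.
    pop = [X[rng.choice(N, k, replace=False)].copy() for _ in range(pop_size)]
    sig = [np.ones((k, d)) for _ in range(pop_size)]

    best_labels, best_f = None, np.inf
    for g in range(1, generations + 1):                 # step 3
        fs = np.array([fitness(c)[0] for c in pop])     # step 4
        ff_ave, ff_max, ff_min = fs.mean(), fs.max(), fs.min()
        new_pop, new_sig = [], []
        for c, s, f in zip(pop, sig, fs):
            # Step 5: double mutation, applied to the flattened centers.
            c2, s2 = double_mutation(c.ravel(), s.ravel(), f,
                                     ff_ave, ff_max, ff_min, mu, g, rng)
            c2, s2 = c2.reshape(k, d), s2.reshape(k, d)
            # Steps 6-7 (collapsed): one k-means-style local refinement.
            _, labels = fitness(c2)
            for j in range(k):
                if np.any(labels == j):
                    c2[j] = X[labels == j].mean(axis=0)
            new_pop.append(c2)
            new_sig.append(s2)
        # Step 8: elitism -- keep the best pop_size of parents + offspring.
        merged = sorted(zip(pop + new_pop, sig + new_sig),
                        key=lambda cs: fitness(cs[0])[0])
        pop = [c for c, _ in merged[:pop_size]]
        sig = [s for _, s in merged[:pop_size]]
        f0, lab0 = fitness(pop[0])
        if f0 < best_f:
            best_labels, best_f = lab0, f0
    return best_labels, best_f
```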
2.1.2 The three meta-learning schemes

To overcome the weaknesses of traditional local search algorithms, we combine them with multiple-neighborhood searching, which improves the capability of the local learning and helps optimize the objective function. After the double mutation operators are applied, two local methods are performed; according to the distribution of the dataset, different neighborhoods are used. Figure 3 shows the pseudocode of the local methods with their suitable neighborhoods.

MA-Hill-climbing (Liang et al. 2009) is used as a local learning operator; its neighborhood is the traditional Euclidean distance between two individuals.

N1 neighbourhood:

$$N_1(x_i, x_j) = \mathrm{dist}(x_i, x_j) \quad (8)$$
where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between two individuals.

MA-Quadratic interpolation (Li et al. 2009; Liang et al. 1999; Liang and Yao 2000) is used as another local learning operator; its neighborhood is the manifold distance between two individuals.

N2 neighbourhood:

$$L(x_i, x_j) = \rho^{\mathrm{dist}(x_i, x_j)} - 1, \qquad N_2(x_i, x_j) = \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} L\left(p_k, p_{k+1}\right) \quad (9)$$

where $\rho > 1$, $L(x_i, x_j)$ denotes the manifold distance between two individuals, $p$ ranges over the paths $P_{ij}$ connecting $x_i$ and $x_j$, and $N_2(x_i, x_j)$ is the manifold neighborhood.

MA-D.S.C. After the crossover operator, the D.S.C local search operator, proposed by Davies, Swann and Campey (Schwefel 1995), is incorporated into KCMA:

N3 neighborhood:

$$N_3(x_i, x_j) = 2 \, \mathrm{dist}(x_i, x_j) \quad (10)$$

where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between two crossover individuals; the N3 neighborhood expands the search area.

MA-Hill-climbing and MA-D.S.C. are both based on Euclidean distance: MA-Hill-climbing is suited to spherical datasets and MA-D.S.C. to ellipsoidal datasets. MA-Quadratic interpolation is based on the manifold distance and is suited to manifold datasets. The combination of the three operators has wider applicability and better performance than any one of them, as the experiments below show; a sketch of the N2 manifold distance follows.
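A minimal Python sketch of the N2 manifold distance of Eq. (9): it computes the edge lengths L on all pairs and then all-pairs shortest paths with Floyd–Warshall. The function name and the default value of ρ are our assumptions.

```python
import numpy as np

def manifold_distance(X, rho=2.0):
    """All-pairs N2 manifold distance of Eq. (9).

    The edge weight between two points is rho**dist - 1, so long Euclidean
    edges are penalized exponentially; the manifold distance is then the
    cheapest path through the data, favoring paths along dense regions.
    """
    assert rho > 1.0
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # pairwise Euclidean distances
    L = rho ** dist - 1.0                        # edge lengths L(x_i, x_j)
    # Floyd-Warshall all-pairs shortest paths over the complete graph.
    N2 = L.copy()
    for m in range(X.shape[0]):
        N2 = np.minimum(N2, N2[:, m:m + 1] + N2[m:m + 1, :])
    return N2
```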
2.1.3 Crossover operator and elitism selection operator

The crossover operator brings the population better diversity, while elitism selection helps the algorithm select good solutions more quickly and enhances global convergence.

2.1.4 Termination criterion

A proper stopping condition is very important for KCMA. The maximum number of iterations is commonly used as the termination criterion in traditional evolutionary algorithms; since our algorithm is built on an evolutionary framework, we likewise employ the maximum generation as the stopping criterion.
3 Experimental results

To test the effectiveness of the proposed KCMA, we compare its performance with four other clustering algorithms, namely KM (Hartigan and Wong 1979), K-KM (Scholkopf et al. 1998), GK-KM (Tzortzis and Likas 2009) and SC (Ng et al. 2001), on artificial datasets and real-world UCI datasets (http://www.ics.uci.edu/~mlearn/MLRepository.html). The simulations were carried out on a 2.7 GHz AMD Athlon(tm) PC with 2 GB RAM, programmed in Matlab. In the following experiments, we performed 100 independent runs on each test problem.

3.1 Test datasets

Table 1 lists the basic characteristics of the datasets: ten artificial datasets and seven UCI datasets are chosen. The datasets AD_20_2 and Fourty have spherical distributions with a large number of classes. Sizes5 and Square4 have spherical and overlapping distributions, while Synthetic1, Synthetic5, Threecircles and 2moondata have manifold distributions. Taotao has both a mixed and an overlapping distribution. Most of these datasets are linearly non-separable.

Fig. 3 The pseudocode of the local method:

Procedure of MA-LS    // LS: Hill-climbing; quadratic interpolation; D.S.C.
Choose an initial solution;
While the neighborhood of the current solution contains a better solution    // neighborhood: N1, N2, N3
    Do choose the best solution from the neighborhood of the current solution and move to this solution
End
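As an illustration of the MA-LS scheme in Fig. 3, here is a small generic Python sketch (names are ours): it keeps moving to improving neighbors until the supplied neighborhood function offers no improvement. Plugging in N1-, N2- or N3-based candidate generators yields the three local learning operators.

```python
def ma_local_search(solution, fitness, neighbors):
    """Generic MA-LS loop of Fig. 3.

    fitness   : callable scoring a solution; lower is better (minimization).
    neighbors : callable returning candidate solutions near its input.
    """
    best, best_f = solution, fitness(solution)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            f = fitness(cand)
            if f < best_f:          # move to an improving neighbor
                best, best_f, improved = cand, f, True
    return best, best_f
```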
Table 1 Description of the datasets

Datasets | Number of data points (N) | Number of clusters (k) | Dimension of datasets (d)
Synthetic1 | 500 | 2 | 2
Synthetic3 | 400 | 4 | 3
Synthetic5 | 600 | 2 | 2
Threecircles | 299 | 3 | 2
2moondata | 400 | 2 | 2
Taotao | 1,300 | 7 | 2
Sizes5 | 1,000 | 4 | 2
Square4 | 1,000 | 4 | 2
AD_20_2 | 1,000 | 20 | 2
Fourty | 1,000 | 40 | 2
Iris (UCI) | 150 | 3 | 4
Glass (UCI) | 214 | 6 | 9
Wine (UCI) | 178 | 3 | 13
Vowel (UCI) | 528 | 11 | 10
Zoo (UCI) | 101 | 7 | 16
Msegment (UCI) | 2,310 | 7 | 18
Abalone (UCI) | 4,177 | 29 | 7
3.2 Experimental settings

The parameters of KCMA are set as follows: the population size is 50, the mutation rate is 0.3, the crossover rate is 0.8, and the maximum generation is 100. The parameter settings of the other four algorithms are the same as reported in the related references (Hartigan and Wong 1979; Scholkopf et al. 1998; Tzortzis and Likas 2009; Ng et al. 2001).

We first apply Huang's accuracy measure, the clustering accuracy (CA) (Das et al. 2008), to evaluate the clustering results, defined as:

$$CA = \frac{\sum_{i=1}^{k} n_i}{n} \quad (11)$$

where $n$ is the total number of data points in the dataset and $n_i$ is the number of data points occurring both in the $i$-th cluster and in its corresponding true cluster. CA lies in the interval [0, 1], and a bigger value is better.

The other metric used in this study is the adjusted Rand index (ARI) (Handl and Knowles 2007), defined below, where $n_{lk}$ is the number of points that belong to both cluster $l$ and cluster $k$ $(l \in T, k \in S)$, and $T$ and $S$ are the true clustering and the obtained clustering, respectively. The ARI also returns a value in the interval [0, 1], and a bigger value means better clustering performance.

$$ARI = R(T, S) = \frac{\sum_{lk} \binom{n_{lk}}{2} - \left[\sum_{l} \binom{n_l}{2} \sum_{k} \binom{n_k}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{l} \binom{n_l}{2} + \sum_{k} \binom{n_k}{2}\right] - \left[\sum_{l} \binom{n_l}{2} \sum_{k} \binom{n_k}{2}\right] / \binom{n}{2}} \quad (12)$$
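A hedged Python sketch of both metrics: CA (Eq. 11) matches predicted clusters to true classes with the Hungarian method before counting agreements, and the ARI (Eq. 12) defers to scikit-learn. The helper names and the use of SciPy/scikit-learn are our choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """CA of Eq. (11): sum of n_i under the best cluster-to-class matching."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: counts[i, j] = points in true class i and cluster j.
    counts = np.array([[np.sum((y_true == c) & (y_pred == g))
                        for g in clusters] for c in classes])
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / y_true.size

def ari(y_true, y_pred):
    # ARI of Eq. (12); note scikit-learn's version can be slightly negative
    # for worse-than-chance partitions, whereas the paper treats it as [0, 1].
    return adjusted_rand_score(y_true, y_pred)
```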
3.3 The comparison experiment: with or without meta-learning operators

To show the contribution of the three meta-learning operators, we report two sets of experimental results: one for the original KCMA, and one for the algorithm without the three meta-learning operators. For this comparison we chose five artificial datasets and three UCI datasets from those listed above. Table 2 shows the mean CA over 100 independent runs. From Table 2 it is easy to see that every local learning operator contributes to the overall performance of KCMA: local learning searches for the best solution more effectively, and without it the algorithm still falls into local optima.

Table 2 The comparison experiment: CA with different meta-learning operators

Datasets | Without any operator | With operator: a | With operator: a, b | With operator: a, b and c
Synthetic5 | 0.8367 | 0.8367 | 0.9983 | 0.9983
Threecircle | 0.3462 | 0.3462 | 0.9928 | 0.9928
Sizes5 | 0.6380 | 0.9932 | 0.9932 | 0.9932
AD_20_2 | 0.3625 | 0.9981 | 0.9981 | 0.9981
Taotao | 0.5538 | 0.5538 | 0.5538 | 0.9924
Glass (UCI) | 0.4818 | 0.4818 | 0.4818 | 0.9179
Vowel (UCI) | 0.3424 | 0.3424 | 0.3424 | 0.8888
Zoo (UCI) | 0.7129 | 0.9901 | 0.9901 | 0.9901

'a' stands for the local learning operator MA-Hill-climbing, 'b' for MA-Quadratic interpolation, and 'c' for MA-D.S.C

3.4 General comparison with other clustering algorithms

Table 3 shows the mean ARI, mean CA and mean running time (in seconds) of the five algorithms over 100 independent runs.
Table 3 Mean clustering accuracy, mean adjusted Rand index and mean run time achieved by each clustering algorithm over 100 independent runs

Methods | Metric | Synthetic1 | Synthetic3 | Synthetic5 | Threecircles | 2moondata | Taotao | Sizes5 | Square4 | AD_20_2
K-means | CA | 0.3052 | 0.8027 | 0.8345 | 0.3572 | 0.7170 | 0.5625 | 0.7590 | 0.9240 | 0.8187
K-means | ARI | 8.5783e-004 | 0.7141 | 0.4468 | 0.0738 | 0.1866 | 0.4850 | 0.6055 | 0.8091 | 0.8110
K-means | Time (s) | 0.2339 | 0.8094 | 0.3225 | 0.1881 | 0.3345 | 2.2890 | 1.7735 | 1.1912 | 2.3295
K-KM | CA | 0.8173 | 0.6075 | 0.7154 | 0.3461 | 0.7057 | 0.6087 | 0.2435 | 0.9328 | 0.4724
K-KM | ARI | 0.6190 | 0.5940 | 0.2457 | 0.0542 | 0.1672 | 0.5211 | 0.0886 | 0.8294 | 0.5370
K-KM | Time (s) | 1.6365 | 0.9032 | 2.6101 | 0.5198 | 0.9576 | 17.578 | 8.7892 | 9.1210 | 8.5515
GK-KM | CA | 1 | 1 | 0.8367 | 0.7011 | 0.7050 | 0.6423 | 0.2400 | 0.9320 | 1
GK-KM | ARI | 1 | 1 | 0.4525 | 0.6460 | 0.1660 | 0.5612 | 0.0977 | 0.8276 | 1
GK-KM | Time (s) | 54.016 | 27.737 | 55.928 | 12.484 | 16.386 | 4828.9 | 1455.4 | 785.76 | 291.37
SC | CA | 1 | 0.8675 | 1 | 0.7191 | 1 | 0.7732 | 0.9684 | 0.8618 | 1
SC | ARI | 1 | 0.8555 | 1 | 0.6088 | 1 | 0.7789 | 0.9431 | 0.7599 | 1
SC | Time (s) | 0.9585 | 0.8761 | 2.1132 | 2.7634 | 0.7569 | 2.2013 | 1.4378 | 2.9647 | 1.6535
KCMA | CA | 0.997 | 0.9980 | 0.9983 | 0.9928 | 0.9989 | 0.9924 | 0.9932 | 0.9930 | 0.9981
KCMA | ARI | 0.9962 | 0.9947 | 0.9932 | 0.9815 | 0.9954 | 0.9899 | 0.9761 | 0.9814 | 0.9959
KCMA | Time (s) | 11.586 | 7.5957 | 10.306 | 4.2161 | 5.2114 | 135.53 | 55.196 | 47.560 | 173.19

Methods | Metric | Fourty | Iris (UCI) | Glass (UCI) | Wine (UCI) | Vowel (UCI) | Zoo (UCI) | Msegment (UCI) | Abalone (UCI)
K-means | CA | 0.8452 | 0.8977 | 0.4504 | 0.7138 | 0.3249 | 0.7401 | 0.5132 | 0.1168
K-means | ARI | 0.8327 | 0.7391 | 0.2237 | 0.3795 | 0.2070 | 0.6258 | 0.3511 | 0.0426
K-means | Time (s) | 4.9180 | 0.1176 | 0.2282 | 0.2527 | 0.8004 | 1.0702 | 3.2537 | 20.594
K-KM | CA | 0.8143 | 0.8315 | 0.4916 | 0.4222 | 0.3414 | 0.7351 | 0.1351 | 0.1966
K-KM | ARI | 0.7897 | 0.6729 | 0.2513 | 0.0797 | 0.2257 | 0.6423 | 8.9214e-005 | 0.0617
K-KM | Time (s) | 8.5467 | 0.1367 | 0.2429 | 0.1803 | 1.7739 | 1.2537 | 74.183 | 477.23
GK-KM | CA | 1 | 0.8933 | 0.5421 | 0.7191 | 0.5290 | 0.7921 | 0.1476 | \
GK-KM | ARI | 1 | 0.7302 | 0.2543 | 0.3941 | 0.4110 | 0.7265 | 6.9338e-005 | \
GK-KM | Time (s) | 2690.9 | 3.5718 | 27.347 | 6.5318 | 113.21 | 8.3440 | 4398.8 | \
SC | CA | 1 | 0.8867 | 0.5019 | 0.6292 | 0.3494 | 0.5853 | – | –
SC | ARI | 1 | 0.7163 | 0.2491 | 0.3822 | 0.2026 | 0.5054 | – | –
SC | Time (s) | 3.9568 | 0.7783 | 0.8210 | 0.9013 | 0.0398 | 0.1058 | – | –
KCMA | CA | 0.9910 | 0.9681 | 0.9179 | 0.9484 | 0.8888 | 0.9901 | 0.9919 | 0.9014
KCMA | ARI | 0.9827 | 0.9070 | 0.8183 | 0.8553 | 0.7674 | 0.9722 | 0.9811 | 0.8762
KCMA | Time (s) | 290.77 | 3.3538 | 6.0692 | 3.0107 | 28.687 | 3.2267 | 444.01 | 11336

"\" means the running time is too long to be acceptable. "–" means out of memory, so the result could not be obtained.
Table 3 shows that KCMA obtains better results than the other four algorithms on most datasets, the exceptions being Synthetic1, Synthetic3, AD_20_2 and Fourty, where KCMA achieves results nearly identical to GK-KM. However, the running time of KCMA is far lower than that of GK-KM, which means that KCMA effectively reduces the running time while improving clustering accuracy. SC performs relatively well on most datasets, but it does not work well on overlapping and real datasets, and it needs much more memory when clustering large-scale datasets.
Figure 4 shows scatter plots of selected datasets as clustered by the five algorithms KM, K-KM, GK-KM, SC and KCMA. It is clearly seen that K-KM performs better than KM on the datasets Synthetic1, Taotao and Sizes5, which confirms that the kernel function transforms nonlinear data into a high-dimensional feature space, increases the probability of linear separability of the patterns in the transformed space, and simplifies the associated data structure. But K-KM still depends heavily on the initial positions of the cluster centers.
Fig. 4 The clustering results of the five algorithms KM, K-KM, GK-KM, SC and KCMA (one algorithm per column) on the datasets Synthetic1, Synthetic3, Synthetic5, Threecircles, 2moondata, Taotao, Sizes5, Square4, AD_20_2 and Fourty
Consequently, the clustering results of GK-KM are better than those of KM and K-KM on most datasets. As Fig. 4 shows, KCMA performs better than KM and K-KM on all datasets, and on Synthetic5, Threecircles and Taotao the results obtained by KCMA are better than those obtained by GK-KM. On Synthetic1 and AD_20_2, the results of both KCMA and GK-KM are close to the globally optimal classification. From the dataset Taotao, it can be inferred that SC is not suited to every dataset. To analyze the stability of KCMA, one artificial dataset and four UCI datasets were chosen. Figure 5 shows the statistical CA values of the five algorithms over 100 independent runs, using box plots to illustrate the distributions of the samples. In a notched box plot, the notches give a robust estimate of the uncertainty about the medians for box-to-box comparison; the boxes have lines at the lower quartile, median, and upper quartile values; the whiskers extend from each end of the boxes to show the extent of the rest of the data; and outliers, marked '+', are values beyond the ends of the whiskers. The five boxes in each plot correspond to the five clustering algorithms. As Fig. 5 shows, the statistical CA values on Synthetic1 obtained by GK-KM and KCMA are clearly better than those of the other three algorithms, and on the four UCI datasets the results obtained by KCMA are clearly better than those of the other four algorithms. The larger the CA, the better the performance: the boxes (lower quartile, median and upper quartile of the CA values) for Iris, Glass, Vowel and Zoo obtained by KCMA are higher than those of the other four algorithms, so KCMA is more stable than the compared algorithms.

3.5 Analysis of robustness

To compare the robustness of these algorithms, we follow the formula used by Geng et al. (2005), defined as follows:
$$b_m = \sum_{i} \frac{R_{mi}}{\max_{m} R_{mi}} \quad (13)$$

where $R_{mi}$ is the mean ARI value of the $m$-th method on the $i$-th dataset. When a method has the best ARI on a dataset, its ratio for that dataset is 1; $b_m$ sums these ratios for one method over all studied datasets, so it provides a good measurement of the robustness of a method, and a larger value indicates better robustness. Figure 6 shows the distribution of $b_m$ for the five methods on the fifteen datasets (all datasets except Msegment and Abalone). The values are 10.0529, 9.0855, 11.5327, 12.1425 and 14.666 for KM, K-KM, GK-KM, SC and KCMA, respectively. The proposed KCMA shows the best performance; in fact, the $b_m$ of KCMA is very close to 15 on the fifteen datasets, which indicates that KCMA performs very well in different situations. So KCMA is more robust than the other four methods.
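Eq. (13) reduces to a few lines of NumPy; this sketch (names ours) takes a methods-by-datasets matrix of mean ARI values:

```python
import numpy as np

def robustness(R):
    """b_m of Eq. (13) for a (methods, datasets) matrix of mean ARI values."""
    # Divide each column by its best (largest) entry, then sum per method.
    return (R / R.max(axis=0, keepdims=True)).sum(axis=1)
```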
4 Conclusions

This paper has presented a novel kernel clustering algorithm using a hybrid memetic algorithm (KCMA) for complex and linearly non-separable datasets. KCMA uses a kernel function to map data points from the input space to a higher-dimensional feature space, optimizes the objective function in that feature space, and finds near-optimal minima through three local learning methods with different neighborhoods. The main advantage of this method is that the different local neighborhoods adapt to the characteristics of different data distributions. These local methods, combined with the double mutation operator and the crossover operator, give every data point a certain probability of being assigned to every cluster; meanwhile, they guide all individuals in a better direction and reduce blind searching. Thus the method effectively reduces the computational complexity and saves time.
Fig. 5 Box plots of the CA values of the five algorithms on the datasets (a) Synthetic1, (b) Iris, (c) Glass, (d) Vowel and (e) Zoo
From the experimental results, it can be seen that KCMA obtains better solutions than KM, K-KM, GK-KM and SC on most of the test datasets. KCMA also accelerates the search: it needs far less time than GK-KM, and it performs more stably than the other clustering algorithms. But KCMA has some shortcomings: it occasionally produces abnormal points and leads some individuals far from their cluster centers. In future work we will therefore develop a novel local neighborhood for the individuals that lets them learn from one another, so as to reduce abnormal points. Our future research will also focus on improving the performance of automatic clustering on high-dimensional datasets by incorporating feature selection mechanisms.
Fig. 6 Comparison of the robustness $b_m$ of the five methods (KM 10.0529, K-KM 9.0855, GK-KM 11.5327, SC 12.1425, KCMA 14.666) over the fifteen datasets
Acknowledgments This work was supported by the Program for New Century Excellent Talents in University (No. NCET-12-0920), the National Natural Science Foundation of China (Nos. 61272279, 61001202 and 61203303), the China Postdoctoral Science Foundation Funded Project (Nos. 20080431228, 20090461283 and 20090451369), the China Postdoctoral Science Foundation Special Funded Project (Nos. 200801426 and 201104618), the Fundamental Research Funds for the Central Universities (Nos. K50511020014, K50511020011 and K50510020011), the Provincial Natural Science Foundation of Shaanxi of China (No. 2011JQ8020) and the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048).
References

Åberg MB, Wessberg J (2010) A memetic algorithm for selection of 3D clustered features with applications in neuroscience. Paper presented at the International Conference on Pattern Recognition, Istanbul, 2010, pp 1076–1079
Chellapilla K (1998) Combining mutation operators in evolutionary programming. IEEE Trans Evol Comput 2(3):91–96
Chen C, Low CP, Yang Z (2009) Preserving and exploiting genetic diversity in evolutionary programming algorithms. IEEE Trans Evol Comput 13(3):661–673
Das S, Abraham A, Konar A (2008) Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29(5):688–699
Das S, Abraham A, Chakraborty UK, Konar A (2009) Differential evolution using a neighborhood based mutation operator. IEEE Trans Evol Comput 13(3):526–553
Ding C, He XF, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. Paper presented at the SIAM International Conference on Data Mining, Newport Beach, 2005, pp 606–610
Dong H, He J, Huang H, Hou W (2007) Evolutionary programming using a mixed mutation strategy. Inf Sci 177(1):312–327
Filippone M, Camastra F, Masulli F, Rovetta S (2008) A survey of kernel and spectral methods for clustering. Pattern Recogn 41:176–190
Geng X, Zhan DC, Zhou ZH (2005) Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Trans Syst Man Cybern B 35(6):1098–1107
Guan NY, Tao DC, Luo ZG, Yuan B (2011) Non-negative patch alignment framework. IEEE Trans Neural Netw 22(8):1218–1230
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Applied Statistics 28:100–108
Kim J, Park H (2008) Toward faster nonnegative matrix factorization: a new algorithm and comparisons. Paper presented at the 8th IEEE International Conference on Data Mining, New York, 2008
Le MN, Ong YS, Jin Y, Sendhoff B (2009) Lamarckian memetic algorithms: local optimum and connectivity structure analysis. Memetic Comput J 1(3):175–190
Lee CY, Yao X (2004) Evolutionary programming using mutations based on the Lévy probability distribution. IEEE Trans Evol Comput 8(1):1–13
Li B, Zhao SL, Fang L (2009) Optimized GM (1, 1) based on Romber algorithm and quadratic interpolation method. Paper presented at the International Conference on Apperceiving Computing and Intelligence Analysis, Oct 2009, pp 128–131
Liang KH, Yao X (2000) Evolutionary search of approximated N-dimensional landscapes. Int J Knowl Int Eng Syst 4(3):172–183
Liang KH, Yao X, Newton C (1999) Combining landscape approximation and local search in global optimization. In: Proceedings of the 1999 Congress on Evolutionary Computation, 6–9 July 1999, vol 2. IEEE Press, pp 1514–1520
Liang RS, Jiang YF, Bian R (2009) Ordered hill climbing search for heuristic planning. Paper presented at the International Conference on Information Engineering and Computer Science, Tokyo, Dec 2009, pp 1–4
Lim M, Yuan Y, Omatu S (2000) Efficient genetic algorithms using simple genes exchange local search policy for the quadratic assignment problem. Comput Optimization Appl 15(3):249–268
Moscato PA (1989) On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms. Caltech Concurrent Computation Program Report 826, Caltech, Pasadena
Ng AY, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, pp 849–856
Scholkopf B, Smola A, Muller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Schwefel HP (1995) Evolution and optimum seeking. Wiley, New York
Sheng WG, Liu XH, Fairhurst MC (2008) A niching memetic algorithm for simultaneous clustering and feature selection. IEEE Trans Knowl Data Eng 20(7):868–879
Tzortzis GF, Likas AC (2009) The global kernel k-means algorithm for clustering in feature space. IEEE Trans Neural Netw 20(7):1181–1194
Wang Z, Liu JL (2009) Hybrid memetic algorithm for uniting classes of university timetabling problem. Paper presented at the International Conference on Computational Intelligence and Security, Beijing, 11–14 Dec 2009
Yao X, Liu Y (1996) Fast evolutionary programming. In: Proceedings of the Fifth Annual Conference on Evolutionary Programming (EP'96). MIT Press, San Diego, pp 451–460
Yao X, Liu Y (1999) Evolutionary programming made faster. IEEE Trans Evol Comput 3(2):82–102
Zhang XR, Jiao LC, Liu F, Bo LF, Gong MG (2008) Spectral clustering ensemble applied to texture features for SAR image segmentation. IEEE Trans Geosci Remote Sens 46(7):2126–2136
Zhang TH, Tao DC, Li XL, Yang J (2009) Patch alignment for dimensionality reduction. IEEE Trans Knowl Data Eng 21(9):1299–1313
Zhou TY, Tao DC, Wu XD (2011) Manifold elastic net: a unified framework for sparse dimension reduction. Data Min Knowl Disc 22(3):340–371