Neural Comput & Applic (2009) 18:759–768 DOI 10.1007/s00521-009-0281-z
ORIGINAL ARTICLE
A novel pruning approach for robust data clustering

Xu-Lei Yang · Qing Song · Yi-Lei Wu · Ai-Ze Cao
Received: 17 October 2007 / Accepted: 22 April 2009 / Published online: 8 May 2009
© Springer-Verlag London Limited 2009
Abstract In this paper, we make an effort to overcome the sensitivity of traditional clustering algorithms to noisy data points (noise and outliers). A novel pruning method, in terms of information theory, is therefore proposed to phase out noisy points for robust data clustering. This approach identifies and prunes the noisy points based on the maximization of mutual information against input data distributions such that the resulting clusters are least affected by noise and outliers, where the degree of robustness is controlled through a separate parameter to make a trade-off between rejection of noisy points and optimal clustered data. The pruning approach is general, and it can improve the robustness of many existing traditional clustering methods. In particular, we apply the pruning approach to improve the robustness of fuzzy c-means clustering and its extensions, e.g., fuzzy c-spherical shells clustering and kernel-based fuzzy c-means clustering. As a result, we obtain three clustering algorithms that are the robust versions of the existing ones. The effectiveness of the proposed pruning approach is supported by experimental results.

Keywords Robust clustering · Mutual information · Fuzzy c-means · Fuzzy c-spherical shells · Kernel methods
X.-L. Yang (✉) · Q. Song · Y.-L. Wu
Nanyang Technological University, Singapore, Singapore
e-mail: [email protected]

A.-Z. Cao
Vanderbilt University, Nashville, USA
1 Introduction

Interest in clustering has increased recently because of new areas of application, such as data mining, image and speech processing, and bio-informatics [23]. A very rich literature on clustering has developed over the past three decades. Prototype-based clustering algorithms are by far the most popular clustering technique. In prototype-based clustering algorithms, each cluster is represented by a prototype, and the sum of the distances from the data points to the prototypes is usually used as the objective function. The distance measure chosen and the objective function being optimized depend on the geometric structure of the clusters, and different distances have been proposed to search for clusters of specific shapes in feature space. For example, the K-means [14] and fuzzy c-means [1] algorithms, using the Euclidean distance, look for clusters that are hyper-spherical in shape; the Gustafson–Kessel [10] and Gath–Geva [7] algorithms, based on a fuzzy covariance matrix, can deal with hyper-elliptical data structures; the shell-based clustering algorithms, as in [4, 5, 11], using a shell-induced distance measure, work well for circular and/or elliptical shapes; and the kernel-based clustering algorithms, as in [8, 20], using a kernel-induced distance measure, provide a possible solution for linearly nonseparable data structures.

Most traditional clustering algorithms, such as K-means and fuzzy c-means,¹ however, suffer from several major drawbacks. First, traditional clustering algorithms are essentially descent algorithms, because the cost is reduced at each iteration. They therefore tend to get trapped in local
¹ In this paper, the traditional clustering algorithms specifically refer to the basic K-means and fuzzy c-means algorithms, not their improved versions, such as the enhanced LBG (less sensitive to data initialization) [18] or the possibilistic c-means (less sensitive to noisy data points) [12].
minima that riddle the cost surface [21], which makes them sensitive to data initialization. Second, traditional clustering algorithms tend to assign any data point (even if it is a noisy one) to one (hard clustering) or more than one (soft clustering) cluster. When the given data are contaminated by noisy points (noise and/or outliers), the computed cluster prototypes can be placed far from the true values; that is, these algorithms are sensitive to noisy points. Last, the cluster number is assumed to be known a priori in most traditional clustering algorithms; however, the optimal number for a given data set is generally unknown in real applications.

Many solutions have been proposed to overcome the above drawbacks of traditional clustering algorithms. Although the first disadvantage is usually ignored² in most research papers, several approaches, such as the enhanced LBG [18] (based on the concept of the utility of a codeword) and deterministic annealing [19] (through an annealing process with phase transitions), have been proposed to handle the data initialization problem. Popular approaches to the last disadvantage are based either on a framework in which clusters of a particular shape are assumed as a model of the system, or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters [23].

This paper focuses on overcoming the second disadvantage of traditional clustering algorithms, i.e., the sensitivity to noisy points (noise and outliers). Many researchers have attempted to solve this problem by modifying the probability constraint on the memberships of fuzzy c-means. In particular, the concept of possibilistic clustering [12] and its variants have been shown to have a satisfactory ability to handle noisy data [9, 12, 13, 17, 25, 26]. We compare the possibilistic clustering approach to the proposed pruning method later in this paper.

A novel pruning method, in terms of information theory, is proposed in this paper to phase out noisy points for robust data clustering. This approach identifies and prunes the noise and outliers based on the maximization of mutual information (MI) against the input data distribution, such that the resulting clusters are least affected by the noise and outliers, where the degree of robustness is controlled through a separate parameter to make a trade-off between rejection of noisy points and optimal clustered data. The pruning approach is general, and it can improve the robustness of many existing traditional clustering methods. In particular, we apply the pruning approach to
² Most researchers assume that they can provide their clustering algorithms with a suitable initialization. Others use multiple (random) initializations that guarantee (with a given probability) that at least one initialization is good.
improve the robustness of the fuzzy c-means (FCM), fuzzy c-spherical shells (FCSS), and kernel-based fuzzy c-means (KFCM) clustering algorithms. As a result, we obtain three clustering algorithms, i.e., robust fuzzy c-means (RFCM), robust fuzzy c-spherical shells (RFCSS), and robust kernel-based fuzzy c-means (RKFCM), that are the robust versions of the existing ones. The robustness of each algorithm is verified through experimental results.

The rest of this paper is organized as follows. A brief review of fuzzy c-means clustering and two of its extensions is given in Sect. 2. Section 3 contains the derivation and pseudo-code of the proposed pruning approach. The experimental results of the proposed method are reported in Sect. 4. Finally, conclusions and discussion are presented in Sect. 5.
2 Review of traditional clustering algorithms

In the following discussion, let the input data set be X = \{x_1, x_2, \ldots, x_l\} \subset R^n, where l is the number of input data points and n is the dimension of the input space. Based on a measure of similarity, the data set is partitioned into c (known a priori) clusters.

2.1 Fuzzy c-means clustering

Let the prototypes (centers) be denoted by V = \{v_1, v_2, \ldots, v_c\} \subset R^n. Mathematically, the fuzzy c-means (FCM) algorithm is derived to minimize the following objective function with respect to the membership u_{k|j} and the cluster center v_k:

J_{FCM} = \sum_{j=1}^{l} \sum_{k=1}^{c} (u_{k|j})^m d_{kj}    (1)

where m \in (1, \infty) is the weighting exponent on the fuzzy memberships, u_{k|j} is the membership of point x_j in the kth cluster, and d_{kj} is the squared Euclidean distance between data point x_j and cluster center v_k:

d_{kj} = \|x_j - v_k\|^2    (2)

By using the squared Euclidean distance measure, the FCM is limited to partitioning clusters that are hyper-spherical in shape. To take more complicated shapes into account, we need to incorporate an appropriate distance measure into the FCM algorithm for different data structures, e.g., a shell-induced distance measure for circular and/or elliptical shapes, or a kernel-induced distance measure for linearly nonseparable data structures, as discussed later in this section. The minimization of (1) gives the updating equations for the membership u_{k|j} and the cluster center v_k:
u_{k|j} = \frac{d_{kj}^{-1/(m-1)}}{\sum_{i=1}^{c} d_{ij}^{-1/(m-1)}}    (3)

and

v_k = \frac{\sum_{j=1}^{l} (u_{k|j})^m x_j}{\sum_{j=1}^{l} (u_{k|j})^m}    (4)

The FCM algorithm is summarized as follows:

• Fix the cluster number c, initialize the memberships \{u_{k|j}\}_{k,j=1}^{c,l}, and set the threshold ε to a small positive value.
• Alternately update the cluster centers and membership functions by using (4) and (3) until \|u_{new} - u_{old}\| < ε.
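To make the above updates concrete, here is a minimal sketch of the FCM iteration in NumPy; the function name, the random initialization, and the convergence test on the memberships (rather than on the prototypes) are illustrative choices of ours, not prescriptions from the paper.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """A minimal fuzzy c-means: alternate the updates (4) and (3)."""
    rng = np.random.default_rng(seed)
    l = X.shape[0]
    U = rng.random((c, l))
    U /= U.sum(axis=0)                    # memberships sum to 1 over the c clusters
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)            # centers, Eq. (4)
        d = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances, Eq. (2)
        d = np.fmax(d, 1e-12)             # guard against division by zero
        U_new = d ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)        # membership update, Eq. (3)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, V
```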
2.2 Fuzzy c-spherical shells clustering

In recent years, several FCM-based algorithms, e.g., fuzzy c-shells (FCS) [4], the fuzzy c-spherical shells algorithm (FCSS) [11], and fuzzy c-rings (FCR) [15], have been developed to detect and separate spherical shells (ring-shaped data) in the clustering literature. Here we give a brief review of the FCSS algorithm. In the FCSS algorithm, the cluster prototype w_k is an n-dimensional (hyper-)spherical shell, which can be characterized by its center v_k and radius r_k, where k = 1, 2, ..., c. The distance from a point x_j to a prototype w_k := (v_k, r_k) is defined as in [11] as

(D_{kj})^2 = \left( \|x_j - v_k\|^2 - r_k^2 \right)^2 = p_k^T M_j p_k + q_j^T p_k + b_j    (5)

where b_j = (x_j^T x_j)^2, q_j = 2 (x_j^T x_j) y_j, y_j = [x_j^T, 1]^T, M_j = y_j y_j^T, and p_k = [-2 v_k^T, v_k^T v_k - r_k^2]^T. Note that all the variables (vectors or matrices) in (5) are related only to the input point x_j, except the parameter p_k, which is determined by a pair of parameters, i.e., the center v_k and the radius r_k. The objective function of FCSS to be minimized is then defined as

J_{FCSS} = \sum_{k=1}^{c} \sum_{j=1}^{l} (u_{k|j})^m (D_{kj})^2 = \sum_{k=1}^{c} \sum_{j=1}^{l} (u_{k|j})^m \left( p_k^T M_j p_k + q_j^T p_k + b_j \right)    (6)

where m \in (1, \infty) is the weighting exponent on the fuzzy memberships and u_{k|j} is the membership of point x_j in the kth cluster. It is then easy to show that the vectors p_k that minimize (6) are given by [11]

p_k = -\frac{1}{2} H_k^{-1} Q_k    (7)

where

H_k = \sum_{j=1}^{l} (u_{k|j})^m M_j    (8)

and

Q_k = \sum_{j=1}^{l} (u_{k|j})^m q_j    (9)

It can also be shown that the updating equation for the membership of FCSS is given by

u_{k|j} = \frac{(D_{kj})^{-2/(m-1)}}{\sum_{i=1}^{c} (D_{ij})^{-2/(m-1)}}    (10)

Unlike FCS in [4], which needs to solve coupled nonlinear equations, FCSS calculates the prototypes directly, which reduces the computational cost compared to the implementation of FCS. And unlike FCR in [15], which is limited to two-dimensional ring-shaped data, FCSS is general enough to cluster (hyper-)spherical shells of any dimension.
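As a concrete illustration of the closed-form prototype update (7)–(9), the sketch below computes p_k for a single cluster and recovers the center and radius from it; the function name and the use of a pseudo-inverse for a possibly ill-conditioned H_k are our own assumptions.

```python
import numpy as np

def fcss_prototype(X, u_k, m=2.0):
    """One FCSS prototype: p_k = -(1/2) H_k^{-1} Q_k, Eqs. (7)-(9)."""
    l, n = X.shape
    Y = np.hstack([X, np.ones((l, 1))])   # y_j = [x_j^T, 1]^T
    s = (X ** 2).sum(axis=1)              # x_j^T x_j
    w = u_k ** m                          # (u_{k|j})^m
    H = np.einsum("j,ja,jb->ab", w, Y, Y)                  # H_k, Eq. (8)
    Q = (w[:, None] * (2.0 * s[:, None] * Y)).sum(axis=0)  # Q_k, Eq. (9)
    p = -0.5 * np.linalg.pinv(H) @ Q                       # Eq. (7)
    v = -0.5 * p[:n]                      # since p = [-2 v^T, v^T v - r^2]^T
    r = np.sqrt(max(float(v @ v - p[n]), 0.0))
    return v, r
```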
2.3 Kernel-based fuzzy c-means clustering

By nonlinearly mapping the observed data from a low-dimensional input space into a high-dimensional kernel space by means of a given kernel function, which aims to increase the probability of linear separability of the given data, kernel-based clustering algorithms can reveal structure in the data that may go unnoticed by traditional clustering algorithms in the standard Euclidean space [16]. The key issue in extending traditional fuzzy c-means (FCM) to kernel-based fuzzy c-means (KFCM) is the computation of the distance measure in the kernel space. Let φ(x_j) denote x_j's transformation, and let φ(v_k), k = 1, ..., c, be the cluster centers in the kernel space. Similar to K-means in kernel space [20], we have

\phi(v_k) = \sum_{j=1}^{l} c_{kj} \phi(x_j), \quad k = 1, 2, \ldots, c    (11)

where the c_{kj} are parameters to be determined by the KFCM algorithm in the kernel space, as shown later. Using (11), the squared Euclidean distance between a cluster center φ(v_k) and a mapped point φ(x) can be stated as

D_k(x) = \|\phi(x) - \phi(v_k)\|^2 = \left\| \phi(x) - \sum_{i=1}^{l} c_{ki} \phi(x_i) \right\|^2 = \langle \phi(x) \cdot \phi(x) \rangle - 2 \sum_{i=1}^{l} c_{ki} \langle \phi(x) \cdot \phi(x_i) \rangle + \sum_{i,j=1}^{l} c_{ki} c_{kj} \langle \phi(x_i) \cdot \phi(x_j) \rangle    (12)

Using the kernel trick k(x, y) = \langle \phi(x) \cdot \phi(y) \rangle, the kernel-induced distance (12) can be rewritten as

D_k(x) = k(x, x) - 2 \sum_{i=1}^{l} c_{ki} k(x, x_i) + \sum_{i,j=1}^{l} c_{ki} c_{kj} k(x_i, x_j)    (13)

where k(x, y) is a Mercer kernel function [16]. The most popular kernel function in the literature is the Gaussian kernel, defined by

k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)    (14)

where σ is the Gaussian kernel parameter. Based on the squared Euclidean distance in the kernel space, the objective function of KFCM is formulated as

J = \sum_{k=1}^{c} \sum_{j=1}^{l} (u_{k|j})^m D_k(x_j)    (15)

where D_k(x_j) is the squared Euclidean distance between the transformed pattern φ(x_j) and the cluster prototype φ(v_k) in the kernel space, m (1 < m < ∞) determines the fuzziness, and u_{k|j} is the membership of the transformed pattern φ(x_j) in the kth cluster. The minimization of (15) gives the updating equations for the membership u_{k|j} and the cluster prototype φ(v_k) as follows:

u_{k|j} = \frac{(D_k(x_j))^{-1/(m-1)}}{\sum_{i=1}^{c} (D_i(x_j))^{-1/(m-1)}}    (16)

\phi(v_k) = \frac{\sum_{j=1}^{l} (u_{k|j})^m \phi(x_j)}{\sum_{j=1}^{l} (u_{k|j})^m}    (17)

Equations (16) and (17) constitute the KFCM clustering algorithm. From (17), we also obtain the explicit expression of c_{kj}, which has been used to calculate the squared Euclidean distance in the kernel space as in (12) and (13), that is

c_{kj} = \frac{(u_{k|j})^m}{\sum_{j=1}^{l} (u_{k|j})^m}    (18)
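To illustrate Eqs. (13), (14), and (18), the following sketch evaluates the Gaussian kernel matrix and the kernel-space distances for all points at once, assuming the coefficient matrix C of Eq. (18) (one row per cluster, each row summing to one) is available; all names here are our own.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), Eq. (14)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_distances(K, C):
    """D[k, j]: squared kernel-space distance of x_j to prototype k, Eq. (13)."""
    self_term = np.diag(K)[None, :]            # k(x_j, x_j)
    cross = C @ K                              # sum_i c_{ki} k(x_i, x_j)
    proto = np.einsum("ki,kj,ij->k", C, C, K)  # sum_{i,j} c_{ki} c_{kj} k(x_i, x_j)
    return self_term - 2.0 * cross + proto[:, None]
```

Given these distances, the KFCM membership update (16) has exactly the same form as Eq. (3), with D_k(x_j) in place of d_{kj}.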
3 The proposed pruning approach

We first derive the proposed pruning approach based on the maximization of mutual information (MI). Next, we apply it to the above three traditional clustering algorithms; as a result, we get three robust clustering algorithms, i.e., robust fuzzy c-means (RFCM), robust fuzzy c-spherical shells (RFCSS), and robust kernel-based fuzzy c-means (RKFCM). Finally, we give the pseudo-code of the robust clustering algorithms.

3.1 Derivation

Assume that, after the clustering phase by the algorithms in the last section, we have obtained a set of memberships u_{k|j} that minimizes the related distortion function for a given data set. As discussed in [6], the fuzzy membership tends to associate any data point (even if it is a noisy one) with all clusters, so the traditional fuzzy clustering algorithms (FCM, FCSS, and KFCM) are not robust against outliers and noise or disturbances of the input data. In information theory, the maximization of mutual information against the constrained input distribution can phase out unreliable input data points to achieve the capacity of a noisy channel [2, 22]. In this paper, we use the maximization of mutual information to provide a pruning approach for robust fuzzy clustering algorithms. Using the notation of the last section, the MI³ can be defined as [2]

I = \sum_{j=1}^{l} \sum_{k=1}^{c} u_j u_{k|j} \log \frac{u_{j|k}}{u_j}    (19)

where u_j denotes the input data distribution,⁴ u_{k|j} is the membership of x_j in the kth cluster obtained by the clustering algorithms discussed in the last section, and the posterior u_{j|k} is obtained through the Bayes formula

u_{j|k} = \frac{u_j u_{k|j}}{\sum_{j=1}^{l} u_j u_{k|j}}    (20)

³ The average mutual information can be written in either of the two following forms [2]: I = \sum_{j=1}^{l} \sum_{k=1}^{c} u_j u_{k|j} \log (u_{j|k}/u_j) or I = \sum_{j=1}^{l} \sum_{k=1}^{c} u_j u_{k|j} \log (u_{k|j}/u_k), where u_k = \sum_{j=1}^{l} u_j u_{k|j}.

⁴ Note that the prior distribution u_j is equal to 1/l in the clustering procedure of the traditional clustering algorithms, while it is used to phase out the noisy points in the proposed pruning method, as discussed later in this section.

According to information theory, the mutual information can be maximized against the prior u_j to gain the maximum capacity of a noisy channel [2, 3], i.e., to maximize the following objective function

F(u_j, u_{k|j}) = \max_{u_j} \left( \sum_{j=1}^{l} \sum_{k=1}^{c} u_j u_{k|j} \log \frac{u_{j|k}}{u_j} - \lambda \sum_{j=1}^{l} u_j e_j \right)    (21)

where λ \in [0, +\infty) is a separate parameter that controls the degree of robustness against noisy points, and e_j is the expense of using the jth input point, given by

e_j = \sum_{k=1}^{c} u_{k|j} d_{kj}    (22)

where d_{kj} denotes the distance measure, i.e., (2) for FCM, (5) for FCSS, and (13) for KFCM. It turns out [2, 3] that the prior u_j that maximizes F(u_j, u_{k|j}) is given by

u_j = \frac{\exp\left( \sum_{k=1}^{c} u_{k|j} \log u_{j|k} - \lambda e_j \right)}{\sum_{j=1}^{l} \exp\left( \sum_{k=1}^{c} u_{k|j} \log u_{j|k} - \lambda e_j \right)}    (23)

Combining Eqs. (23) and (20), we get
u_j = \frac{u_j \exp\left( \sum_{k=1}^{c} u_{k|j} \log \frac{u_{k|j}}{\sum_{j=1}^{l} u_j u_{k|j}} - \lambda e_j \right)}{\sum_{j=1}^{l} u_j \exp\left( \sum_{k=1}^{c} u_{k|j} \log \frac{u_{k|j}}{\sum_{j=1}^{l} u_j u_{k|j}} - \lambda e_j \right)}    (24)
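Because u_j appears on both sides of (24), it is natural to iterate it as a fixed point, as the pseudo-code in Sect. 3.2 does. Below is a hedged sketch of that iteration, assuming the memberships U (a c-by-l array) and the per-point expenses e of Eq. (22) have already been computed; the small floor inside the logarithm and the max-subtraction for numerical stability are our additions.

```python
import numpy as np

def update_prior(U, e, lam, n_iter=100, tol=1e-9):
    """Fixed-point iteration of Eq. (24) for the prior u_j."""
    c, l = U.shape
    u = np.full(l, 1.0 / l)                     # uniform start, as in the pseudo-code
    for _ in range(n_iter):
        marg = U @ u                            # sum_j u_j u_{k|j}, one value per cluster
        ratio = U / np.fmax(marg[:, None], 1e-300)
        expo = (U * np.log(np.fmax(ratio, 1e-300))).sum(axis=0) - lam * e
        u_new = u * np.exp(expo - expo.max())   # shifting the exponent cancels on normalization
        u_new /= u_new.sum()
        if np.abs(u_new - u).max() < tol:
            return u_new
        u = u_new
    return u
```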
The average mutual information (19) is non-negative. However, the individual term

I(j, k) = \log \frac{u_{k|j}}{\sum_{q=1}^{l} u_q u_{k|q}}

can be negative [3]. If the jth point x_j is taken into account and u_{k|j} < \sum_{q=1}^{l} u_q u_{k|q}, then the probability of the kth prototype is decreased by the observed point, which gives negative information about x_j. Such an input point may be considered an unreliable (noisy) point, and its negative effect must be offset by other input points. Therefore, the maximization of the constrained mutual information (21) provides a good robust estimate of the noisy points (outliers or noise), in the sense that the information is averaged over all clusters and input points. The robust estimation and optimization maximize the mutual information against the prior u_j and the posterior u_{j|k}; for any value of j, if \sum_k u_{j|k} = 0, then u_j should be set equal to zero in order to attain the maximum, so that the corresponding point x_j can be deleted and dropped from further consideration in the optimization procedure as an outlier or noise [2, 22]. In a practical implementation, we treat a point as an outlier or noise if its corresponding u_j is less than or equal to a small positive value. The degree of robustness of the pruning process against outliers (noisy input points) with maximum capacity is controlled by selecting a proper value of the parameter λ to prune each input point according to the following Kuhn–Tucker conditions (the proof can be found in [2]):

\sum_{k=1}^{c} u_{k|j} \log \frac{u_{k|j}}{\sum_{q=1}^{l} u_q u_{k|q}} - \lambda e_j = V, \quad \forall u_j \neq 0    (25)

\sum_{k=1}^{c} u_{k|j} \log \frac{u_{k|j}}{\sum_{q=1}^{l} u_q u_{k|q}} - \lambda e_j < V, \quad \forall u_j = 0    (26)

where V is a constant. The above equations tell us that the number of noisy points can be controlled through the parameter λ. If a bigger value of λ > 0 is selected, then more points are forced to possess u_j = 0 to satisfy the Kuhn–Tucker conditions, which implies that more noisy points are detected in the robust estimation procedure, i.e., the robustness of the pruning method is increased. The higher the value of λ, the more data points are pruned.

3.2 Pseudo-code

By applying the above pruning approach to the FCM, FCSS, and KFCM clustering algorithms, we get three robust versions, i.e., the RFCM, RFCSS, and RKFCM clustering algorithms. In the following, we give the pseudo-code of the proposed robust clustering algorithms, taking RFCM as an example.
• Step (1): Set the number of clusters c, the convergence parameter ε = 0.001, the fuzziness exponent m (1 < m < ∞), and the robustness control parameter λ (0 ≤ λ < ∞).

• Step (2): Fixed-point iterations.
Perform the clustering step: initialize the memberships u_{k|j} (k = 1, 2, ..., c; j = 1, 2, ..., l), then alternately update the cluster prototypes and fuzzy memberships by using (4) and (3) until the maximum change in the cluster prototypes between consecutive iterations is less than the given threshold ε.
Perform the pruning step: initialize the distribution u_j = 1/l (j = 1, 2, ..., l), then iteratively update the prior distribution by using (24) until convergence is reached.

• Step (3): Delete the noisy points, i.e., those whose corresponding u_j is less than or equal to a small positive value, and recalculate the memberships and cluster prototypes using the l − l_n remaining data points (l_n denotes the number of noisy points). If more noisy data points are to be removed, go back to Step (2) with u_j = 1/(l − l_n) (j = 1, 2, ..., l − l_n) for the l − l_n remaining data points.
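Putting the steps above together, a minimal RFCM driver under the stated assumptions might look as follows; it reuses the fcm and update_prior sketches given earlier, and the pruning threshold delta is an illustrative choice, since the paper only asks for "a small positive value".

```python
import numpy as np

def rfcm(X, c, lam=0.1, m=2.0, delta=1e-4, rounds=2):
    """Robust FCM: alternate a clustering step and a pruning step (Sect. 3.2)."""
    keep = np.arange(X.shape[0])                # indices of points still in play
    for _ in range(rounds):
        U, V = fcm(X[keep], c, m=m)             # clustering step, Eqs. (3)-(4)
        d = ((X[keep][None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        e = (U * d).sum(axis=0)                 # expenses, Eq. (22)
        u = update_prior(U, e, lam)             # pruning step, Eq. (24)
        survivors = u > delta                   # points with negligible prior are noisy
        if survivors.all():
            break
        keep = keep[survivors]
    U, V = fcm(X[keep], c, m=m)                 # recluster the retained points
    return U, V, keep
```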
4 Experimental results

This section presents a few simulation examples to show the effectiveness of the proposed pruning approach for robust data clustering. We first illustrate the ability of the proposed RFCM, RFCSS, and RKFCM to resist noise and outliers using three noisy data sets, and then compare the performance of RFCM with several existing robust fuzzy clustering algorithms on the widely tested X12 data set. All the data sets tested in this section are shown in Fig. 1. In all simulation results, the original data sets are marked by "o" and the partitioned clusters are displayed using different marks. The determination of the optimal cluster number is not the focus of this paper; we assume it is known a priori and denote it by c.

4.1 Illustrations

In this subsection, we illustrate the effectiveness of the three proposed robust clustering algorithms on noisy data sets.
Fig. 1 The four data sets tested in the examples: (a) data for RFCM, (b) data for RFCSS, (c) data for RKFCM, (d) the X12 data set. The first three data sets are used for illustration purposes; the X12 data set is used for comparison purposes
4.1.1 Test on RFCM

The first example aims to demonstrate the robustness of RFCM against noisy points by using the pruning approach, where the degree of robustness is controlled by the separate parameter λ. The data set, as shown in Fig. 1a, consists of two assumed clusters and some noisy points. The clustering result of the traditional FCM on this noisy data set is shown in Fig. 2a. It can be seen that the partitioned cluster centers of FCM are obviously distorted by the noise. In contrast, the proposed RFCM performs much better: the cluster centers are hardly distorted by the noise. The clustering results of RFCM with λ = 0.00, λ = 0.05, and λ = 0.10 are shown in Fig. 2b–d, respectively. It can be seen that the higher the value of λ, the more noisy points are pruned and the more robust RFCM becomes. Note that in this example the partitioned cluster centers are denoted by the symbol "*", and the pruned data points are eliminated from the figures.

4.1.2 Test on RFCSS

The data set shown in Fig. 1b is used to verify the robustness of the proposed RFCSS algorithm. The given data consist of two apparent rings (spherical shells) and some outliers. The partitioned results of the traditional FCSS and the proposed RFCSS are shown in Fig. 3. It can be seen that the result of FCSS is distorted by the outliers (see the plotted circumference), as shown in Fig. 3a. In contrast, RFCSS achieves robust performance: when λ is set to 0.00, the outliers between the shells are eliminated by the pruning method, as shown in Fig. 3b; when λ is increased, more and more outliers are pruned, as shown in Fig. 3c (λ = 0.05) and Fig. 3d (λ = 0.10). Note that in this example the partitioned spherical shells (determined by centers and radii) are plotted with dotted lines, and the pruned data points are eliminated from the figures.

4.1.3 Test on RKFCM

Before applying a kernel-based clustering algorithm, we need to select a suitable kernel function and its parameter(s). In our example, we use the Gaussian kernel function (14), and its parameter is selected as follows [24]:

\sigma_{opt} = \left( \frac{\sum_{j=1}^{l} \|x_j - \bar{x}\|^2}{2nl} \right)^{1/2}    (27)

where \bar{x} = \frac{1}{l} \sum_{j=1}^{l} x_j is the mean of the input data points and n is the dimensionality of the input data. The data covariance is divided by n so that the dot products in the Gaussian kernel function are scaled with the dimensionality of the input space, making σ independent of it.
Fig. 2 Demonstration of RFCM against noisy points. a Clustering result of FCM, in which the partitioned cluster centers are obviously distorted by noisy points. b, c, d Clustering results of RFCM with different values of λ, which are robust against noisy points, with the degree of robustness controlled by λ
Fig. 3 Demonstration of RFCSS against noisy points. a Clustering result of FCSS, in which the partitioned shells are distorted by noisy points. b, c, d Clustering results of RFCSS with different values of λ, which are robust against noisy points, with the degree of robustness controlled by λ
Fig. 4 Demonstration of RKFCM against noisy points. a Clustering result of KFCM; the noisy points are classified into the two clusters. b, c, d Clustering results of RKFCM with different values of λ, which are robust against noisy points, with the degree of robustness controlled by λ
The given linearly nonseparable data, as shown in Fig. 1c, consist of two concentric rings (containing 50 and 30 data points, respectively) contaminated with some outliers. The Gaussian kernel parameter calculated by (27) for these data is σ_opt = 0.8383, which is used for both KFCM and RKFCM in this example. The traditional KFCM can reveal the linearly nonseparable data structure; however, it cannot discard the outliers, as shown in Fig. 4a. The proposed RKFCM can both reveal the linearly nonseparable data structure and achieve robustness against noisy points by using the kernel method together with the pruning approach. The clustering results of RKFCM with different values of λ are shown in Fig. 4b–d: when λ = 0.00, only the outliers between the two rings are eliminated by the pruning approach, as shown in Fig. 4b; when λ is gradually increased, the outliers located outside the outer ring are gradually eliminated as well, as shown in Fig. 4c (λ = 0.05) and Fig. 4d (λ = 0.10). Note that in this example the partitioned cluster centers are not plotted, because they live in the high-dimensional kernel space.
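For reference, the rule-of-thumb kernel width of Eq. (27) is a one-liner; the helper below is simply our own phrasing of it.

```python
import numpy as np

def sigma_opt(X):
    """Gaussian kernel width of Eq. (27): sqrt(sum_j ||x_j - mean||^2 / (2 n l))."""
    l, n = X.shape
    return np.sqrt(((X - X.mean(axis=0)) ** 2).sum() / (2.0 * n * l))
```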
4.2 Comparisons

In this subsection, we compare the proposed RFCM with the standard fuzzy c-means (FCM) and two of its robust versions, i.e., the possibilistic c-means (PCM) algorithm [12] and the possibilistic fuzzy c-means (PFCM) algorithm [17], on the X12 data set [17]. The comparison does not aim to show the superiority of the proposed method over other existing robust clustering methods; we merely want to show that the proposed method provides a new approach (maximization of mutual information against the prior distribution) to phase out noisy points, which can improve the robustness of many traditional clustering algorithms, e.g., FCM. Before giving the comparison, we briefly review PCM and PFCM, so that the comparisons are more understandable to the reader. The brief review of the standard FCM can be found in Sect. 2.1.

The robust approach called the possibilistic c-means (PCM) algorithm was introduced in [12] to overcome the relative membership problem of the FCM. The primary objective of the possibilistic approach is to achieve membership values that are possibilistic, i.e., the value of a point in a
cluster represents the typicality of the point in the cluster, or the possibility of the point belonging to the cluster. This was done by modifying the objective function of FCM as follows:

J_{PCM} = \sum_{j=1}^{l} \sum_{k=1}^{c} (t_{k|j})^\eta d_{kj} + \sum_{k=1}^{c} \gamma_k \sum_{j=1}^{l} (1 - t_{k|j})^\eta    (28)
In the above equation, the γ_k are suitable positive numbers, which are chosen in [12] as

\gamma_k = K \frac{\sum_{j=1}^{l} (t_{k|j})^\eta d_{kj}}{\sum_{j=1}^{l} (t_{k|j})^\eta}    (29)

where K is a user-defined constant, the most common choice being K = 1. There are no constraints on the typicality t_{k|j} other than the requirement that it lie in [0, 1]. Assuming that the γ_k are specified, the update equation for the typicality t_{k|j} is given by

t_{k|j} = \frac{1}{1 + \left( d_{kj}/\gamma_k \right)^{1/(\eta - 1)}}    (30)
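A compact sketch of the PCM updates (29)–(30), assuming the typicalities T (a c-by-l array) and the squared distances d from a previous pass are available; the helper name and inputs are illustrative.

```python
import numpy as np

def pcm_update(T, d, eta=2.0, K=1.0):
    """PCM: gamma_k from Eq. (29), then the typicality update of Eq. (30)."""
    W = T ** eta
    gamma = K * (W * d).sum(axis=1) / W.sum(axis=1)   # Eq. (29), one value per cluster
    return 1.0 / (1.0 + (d / gamma[:, None]) ** (1.0 / (eta - 1.0)))  # Eq. (30)
```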
Another robust approach, the possibilistic fuzzy c-means (PFCM) algorithm [17], combines the objective functions of the FCM-type and PCM-type algorithms with extra hyper-parameters, so as to include both the fuzzy membership and the typicality values for further robustness improvements, i.e.,

E_{PFCM} = \sum_{k=1}^{c} \sum_{j=1}^{l} \left( a (u_{k|j})^m + b (t_{k|j})^\eta \right) d_{kj} + \sum_{k=1}^{c} \gamma_k \sum_{j=1}^{l} (1 - t_{k|j})^\eta    (31)
where a and b are extra hyper-parameters that control the trade-off between the membership u_{k|j} and the typicality t_{k|j}. The selection of γ_k is the same as in PCM. The update equations for the membership u_{k|j} and the typicality t_{k|j} are given by

u_{k|j} = \frac{d_{kj}^{-1/(m-1)}}{\sum_{i=1}^{c} d_{ij}^{-1/(m-1)}}    (32)

and

t_{k|j} = \frac{1}{1 + \left( b\,d_{kj}/\gamma_k \right)^{1/(\eta - 1)}}    (33)
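Under the same assumptions, the PFCM updates (32)–(33) can be sketched as follows; again the function name and inputs are ours.

```python
import numpy as np

def pfcm_update(d, gamma, m=2.0, eta=2.0, b=1.0):
    """PFCM: membership of Eq. (32) and typicality of Eq. (33) from squared distances d."""
    d = np.fmax(d, 1e-12)
    U = d ** (-1.0 / (m - 1.0))
    U /= U.sum(axis=0)                                                 # Eq. (32)
    T = 1.0 / (1.0 + (b * d / gamma[:, None]) ** (1.0 / (eta - 1.0)))  # Eq. (33)
    return U, T
```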
The data set X12 is widely used to illustrate the effect of outliers on robust fuzzy clustering, as in [12, 17]. The data set, as shown in Fig. 1d, consists of 12 points whose coordinates are given in column 2 of Table 1. The data set was constructed such that the first ten points (points 1 to 10) form two good clusters and the last two points (points 11 and 12) are considered outliers. The parameters of the four clustering algorithms are set as
follows: m = 2 for FCM; K = 1 and η = 2 for PCM; m = 1, K = 1, η = 2, a = 1, and b = 1 for PFCM; and m = 2, λ = 0 for the proposed RFCM. Table 1⁵ shows the values of the membership (U1, U2) of FCM, the typicality (T1, T2) of PCM, the typicality (T1, T2) of PFCM, and the robust density estimate (U) of the proposed approach. From the table, it can be seen that the standard FCM algorithm cannot detect the outliers in the last two rows, since the membership values U1 and U2 in the last two rows both take the value 0.5. In contrast, the proposed RFCM can easily identify the outliers, since the estimated values of U in the last two rows are equal to zero (i.e., 0.0000). PCM and PFCM are also capable of detecting one outlier (i.e., point 12), since its typicality values T1 and T2 are very small (0.070 for PCM and 0.072 for PFCM), as shown in the last row; however, they may have difficulty identifying whether or not the second-to-last point (i.e., point 11) is an outlier, since the values of T1 and T2 in the second-to-last row are close to those of the good points (points 1–10). From the viewpoint of identifying outliers, we can say that the proposed RFCM has the best performance.
⁵ We ran the four clustering algorithms on X12 with the same initial cluster centers ([−3.34, 1.67], [1.67, 0.00]) as in [17]. It is observed that the result of FCM is nearly identical to that in [17]; however, the results of PCM and PFCM are a bit worse than those in [17]. For the purpose of fair comparison, the numerical results of PCM and PFCM in Table 1 are taken directly from [17], while the numerical results of FCM and RFCM are generated by our program.
Table 1 Numerical results of the different fuzzy clustering algorithms on the X12 data

Pt.   [x, y]           FCM U1   FCM U2   PCM T1   PCM T2   PFCM T1   PFCM T2   RFCM U
1     [-5.00, -3.34]   0.9364   0.0636   0.492    0.134    0.621     0.113     0.0256
2     [-3.34,  1.67]   0.9673   0.0327   0.655    0.194    0.801     0.165     0.1834
3     [-3.34,  0.00]   0.9897   0.0103   0.847    0.208    0.953     0.171     0.2803
4     [-3.34, -1.67]   0.8994   0.1006   0.648    0.193    0.642     0.157     0.0096
5     [-1.67,  0.00]   0.9156   0.0844   0.972    0.351    0.840     0.278     0.0110
6     [ 1.67,  0.00]   0.0844   0.9156   0.351    0.972    0.278     0.840     0.0110
7     [ 3.34,  1.67]   0.0327   0.9673   0.194    0.655    0.165     0.801     0.1834
8     [ 3.34,  0.00]   0.0103   0.9897   0.208    0.847    0.171     0.953     0.2803
9     [ 3.34, -1.67]   0.1006   0.8994   0.193    0.648    0.157     0.642     0.0096
10    [ 5.00,  0.00]   0.0636   0.9364   0.134    0.492    0.113     0.621     0.0256
11    [ 0.00,  0.00]   0.5000   0.5000   0.631    0.631    0.490     0.490     0.0000
12    [ 0.00, 10.0]    0.5000   0.5000   0.070    0.070    0.072     0.072     0.0000
5 Conclusion and discussion

A novel pruning approach has been proposed for traditional clustering algorithms to phase out noisy points (noise and outliers). The maximization of mutual information (MI) against the input distribution provides a robust estimate of a probability soft margin with which to phase out noise and outliers. The degree of robustness is controlled through a separate parameter that makes a trade-off between the rejection of noisy points and optimally clustered data. Three robust versions of traditional fuzzy clustering algorithms, i.e., robust fuzzy c-means (RFCM), robust fuzzy c-spherical shells (RFCSS), and robust kernel-based fuzzy c-means (RKFCM), have been specifically developed. The experimental results have shown that, compared to the traditional clustering algorithms, the proposed pruning method effectively phases out noisy points and thereby reduces their effect on the final partitioning results.

The proposed pruning method, on the one hand, achieves robustness against noisy points; on the other hand, it introduces an additional parameter λ to control the degree of robustness. Although the value λ = 0.10 works well for most data sets in the experiments, the selection of λ needs further investigation in our future research. We offer several insights into this issue: (1) The degree of robustness is controlled by the value of the parameter λ, which prunes each input point through the Kuhn–Tucker conditions (Eqs. 25, 26). If a bigger value of λ > 0 is selected, then more points will possess u_j = 0 to satisfy the Kuhn–Tucker conditions, which implies that more noisy points are detected in the robust estimation procedure, i.e., the degree of robustness is increased. (2) As a general rule of thumb, for a given data set, we set λ = 0 to phase out the most unreliable noisy points; if eliminating more noisy points is desired, we can gradually increase λ and redo the maximization optimization to phase out more of them.

Acknowledgments The authors sincerely thank the anonymous reviewers for their insightful comments and valuable suggestions on an earlier version of this paper.
References

1. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
2. Blahut RE (1972) Computation of channel capacity and rate-distortion functions. IEEE Trans Inform Theory 18(4):460–473
3. Blahut RE (1987) Principles and practice of information theory. Addison-Wesley, Massachusetts
4. Dave RN (1990) Fuzzy shell-clustering and applications to circle detection in digital images. Int J Gen Syst 16(4):343–355
5. Dave RN, Bhaswan K (1992) Adaptive fuzzy c-shells clustering and detection of ellipses. IEEE Trans Neural Netw 3(5):643–662
6. Dave RN, Krishnapuram R (1997) Robust clustering methods: a unified view. IEEE Trans Fuzzy Syst 5(2):270–293
7. Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11(7):773–781
8. Girolami M (2002) Mercer kernel-based clustering in feature space. IEEE Trans Neural Netw 13(3):780–784
9. Guillén A, Pomares H, Rojas I, González J, Herrera LJ, Rojas F, Valenzuela O (2007) Studying possibility in a clustering algorithm for RBFNN design for function approximation. Neural Comput Appl 17(1):75–89
10. Gustafson DE, Kessel WC (1979) Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings of the IEEE conference on decision and control, San Diego, CA, pp 761–766
11. Krishnapuram R, Nasraoui O, Frigui H (1992) The fuzzy c-spherical shells algorithm: a new approach. IEEE Trans Neural Netw 3(5):663–671
12. Krishnapuram R, Keller JM (1993) A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1(2):98–110
13. Krishnapuram R, Keller JM (1996) The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 4(3):385–393
14. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 281–297
15. Man Y, Gath I (1994) Detection and separation of ring-shaped clusters using fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 16(8):855–861
16. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw 12(2):181–201
17. Pal NR, Pal K, Keller JM, Bezdek JC (2005) A possibilistic fuzzy c-means clustering algorithm. IEEE Trans Fuzzy Syst 13(4):517–530
18. Patané G, Russo M (2001) The enhanced LBG algorithm. Neural Netw 14(9):1219–1237
19. Rose K (1998) Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86(11):2210–2239
20. Schölkopf B, Smola AJ, Müller KR (1996) Nonlinear component analysis as a kernel eigenvalue problem. Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
21. Selim SZ, Ismail MA (1984) K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6(1):81–86
22. Song Q (2005) A robust information clustering algorithm. Neural Comput 17(12):2672–2698
23. Still S, Bialek W (2004) How many clusters? An information-theoretic perspective. Neural Comput 16:2483–2506
24. Yang XL, Song Q, Zhang WB (2006) A kernel-based deterministic annealing algorithm for data clustering. IEE Vis Image Signal Process 153(5):557–568
25. Yang XL, Song Q, Wang Y (2007) A weighted support vector machine for data classification. Int J Pattern Recogn Artif Intell 21(5):961–976
26. Zhang JS, Leung YW (2004) Improved possibilistic c-means clustering algorithms. IEEE Trans Fuzzy Syst 12(2):209–217