Journal of the Operational Research Society (2009) 60, 1085–1095. © 2009 Operational Research Society Ltd.
CL.E.KMODES: a modified k-modes clustering algorithm

N Mastrogiannis1, I Giannikos1*, B Boutsinas1,2 and G Antzoulatos1

1University of Patras, Patras, Greece; and 2University of Patras Artificial Intelligence Research Center (UPAIRC), Patras, Greece

*Correspondence: I Giannikos, Department of Business Administration, University of Patras, GR-26500 Patras, Greece.
In this paper we present a new method for clustering categorical data sets, named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode, for every attribute that they have in common, and then chooses the most appropriate mode and its corresponding cluster for that object. Seven widely used data sets are tested to verify the robustness of the proposed method according to six clustering evaluation measures.

Journal of the Operational Research Society (2009) 60, 1085–1095. doi:10.1057/jors.2009.25. Published online 29 April 2009.

Keywords: computers; artificial intelligence; clustering; dissimilarity measure; ELECTRE I; k-modes
1. Introduction

The essence of clustering is to partition a set of objects into disjoint and homogeneous clusters, such that objects belonging to the same cluster are more similar to each other than objects belonging to different clusters (Jain et al, 1999). The objects to be clustered are represented by a set of attributes; thus an object is considered as a conjunction of attribute values. Furthermore, every attribute is assigned a weight, determined during the clustering process, which reflects its strength in the clustering process as a whole.

Clustering and other data mining applications frequently involve categorical data. The traditional approach of converting categorical data into numerical data does not necessarily produce meaningful results in cases where the categorical domains are not ordered. Thus, handling such data is a very important research topic in data mining. Recently, many clustering algorithms that handle categorical data have been proposed in the literature. In the following review, though, we focus only on the k-modes algorithm and its variants.

The k-modes algorithm (Huang, 1998) extends the k-means paradigm (MacQueen, 1967) to the categorical domain. In particular, a new dissimilarity measure is employed to deal with categorical data, means are replaced with modes, and a frequency-based method is used to update modes during the clustering process in order to minimize the clustering cost function. Based on k-modes, Jollois and Nadif (2002) propose an adaptive mixture model for categorical
data, which gives a probabilistic interpretation of the criterion optimized by the k-modes algorithm. Furthermore, a fuzzy k-modes algorithm is presented in Huang and Ng (1999), and a tabu search technique is applied in Ng and Wong (2002) to improve the fuzzy k-modes algorithm. Moreover, an iterative initial-points refinement algorithm for categorical data is presented in Sun et al (2002), while Gan et al (2005) have introduced a genetic clustering algorithm (called GKMODE) that incorporates the k-modes algorithm.

Additionally, San et al (2004) and He et al (2005) independently introduced a new dissimilarity measure into the k-modes clustering process, in order to improve the accuracy of the clustering results. Their main idea is to use the relative attribute frequencies of the cluster modes in the similarity measure of the k-modes objective function. This modification allows the algorithm to recognize a cluster with weak intrasimilarity and, therefore, assign less similar objects to such a cluster, so that the generated clusters have strong intrasimilarities. Ng et al (2007) enhance the above procedure.

The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure. It aims at introducing an alternative assignment of a categorical object to a cluster and, thus, at accomplishing a significant improvement in the performance of the k-modes algorithm. In order to make the proper assignment, the proposed method uses several concepts from the theoretical framework of the ELECTRE methods (Roy, 1968, 1991, 1996; Pirlot, 1997; Bouyssou, 2001), and in particular the ELECTRE I method (Roy, 1968) in its ELECTRE Iv version (Figueira et al, 2005), chosen for its simplicity and its suitability for choice problems, that is, for selecting the most appropriate object. Concepts like the indifference threshold, the concordance and veto thresholds
and the concordance index are employed and adjusted to the rationale of the method.

The essence of the new dissimilarity measure is to compare the value of each attribute of a categorical object with the corresponding attribute value of every mode (centroid). Through these comparisons, a suitable similarity relation is developed between the object and each of the modes. This similarity relation is based on the following three principles:

1. The degree of resemblance between the object and the mode, identified according to a difference that is based on the number of matches/mismatches in their common attributes.
2. The strength of the mode in terms of the attributes that positively influence the above resemblance.
3. The existence of attributes in which the difference between the object and the mode is considered so significant that the object cannot be assigned to the cluster signified by the mode.

The first two principles, representing the strength of the majority of the attributes, result in choosing the best of the modes, enabling us to assign the object to the cluster of the chosen mode. On the other hand, the third principle confirms or rejects the above assignment.

The study of similarity/dissimilarity between categorical data objects is not new. According to Boriah et al (2008), Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. This was later modified and extended to several other measures (Pearson, 1916; Maung, 1941; Cramer, 1946). More recently, the overlap measure (Stanfill and Waltz, 1986) was proposed, which defines similarity/dissimilarity based on the number of matches/mismatches of attributes. Due to its simplicity and ease of use, it has become the most commonly used similarity measure for categorical data (Boriah et al, 2008). It should be noted that the notion of similarity/dissimilarity as defined by the overlap measure was used in k-modes (Huang, 1998), by incorporating the similarity computation into the algorithm, and also in some of its variants. Furthermore, some neighbourhood-based approaches for categorical data (eg Guha et al, 2000; Le and Ho, 2005; Ahmad and Dey, 2007) use notions of similarity (usually the overlap measure) to define the neighbourhood of a data object. Other similarity measures (in particular data-driven ones) include those proposed by Gambaryan (1964), Goodall (1966), Smirnov (1968), Burnaby (1970), Anderberg (1973), Lin (1998) and Eskin et al (2002).

The rest of the paper is organized as follows. In Section 2 we describe the proposed method in greater detail. In Section 3 we present experimental results on the method's efficiency and accuracy. Finally, in Section 4 we draw a series of conclusions and suggest issues for further research.
2. Methodology

Let us suppose that a finite data set Y = {y_1, y_2, ..., y_n} containing n categorical objects is evaluated over a family F = {a_1, a_2, ..., a_d} of attributes. Each attribute is different from the others, describing only a part of the clustering problem. Thus, every attribute is assigned a weight that is calculated within the main steps of the proposed algorithm. This weight reflects the strength and importance of the attribute in the clustering process as a whole.

CLustering through ELECTRE and K-MODES (CL.E.KMODES), like k-modes, is an iterative process realized in four phases:

I. Selection of the k initial modes, one for each cluster.
II. Assignment of each object to the proper cluster according to the new four-step dissimilarity measure. After each assignment, the mode of the cluster is updated.
III. After all objects have been assigned to clusters, the dissimilarity of objects against the current modes is re-tested. If an object is found such that its most suitable mode belongs to a cluster other than its current one, the object must be re-assigned to that cluster and the modes of both clusters must be updated.
IV. Phase III is repeated until no object has changed clusters after a full-cycle test of the whole data set.

Let KEN = {x_1, x_2, ..., x_k} be the set of the randomly selected initial modes, also evaluated over the family F = {a_1, a_2, ..., a_d} of attributes. These initial modes are k distinct objects from the data set. Since the proposed method is an iterative process, the clusters' modes will be re-estimated and thus KEN will be updated as many times as required for the algorithm to converge according to the above four phases. The process of updating the mode of a cluster is based on Huang (1998) and is described in the Appendix. Note that, according to this process, a mode is not necessarily an element of Y.

The new dissimilarity measure, which is applied in phases II and III of the modified k-modes, consists of four steps and is described in the flow diagram of Figure 1. In Step 1, we identify the resemblance of the object to each of the k modes, according to a difference that is based on the number of matches/mismatches in their common attributes. In Step 2, we calculate the concordance indices and the corresponding thresholds. Furthermore, through the concordance test, which is based on the weights of the attributes, we initially assign the object to the proper cluster. In Step 3, the discordance test confirms or rejects the above assignment, considering the possible objections of some of the attributes. A final Step 4 deals with the possible default cases. Note that Steps 1–3 of the new dissimilarity measure correspond to the three principles described in the Introduction.
2.1. Description of the new dissimilarity measure

2.1.1. Step 1: Concordance–discordance coalition check

The purpose of Step 1 is to determine the resemblance
[Figure 1: The flow diagram of the new dissimilarity measure. Step 1 (concordance–discordance coalition check) feeds Step 2 (concordance test and initial clustering of the object); Step 3 (discordance test, ie veto check) either definitely clusters the object or rejects the initial clustering and returns to Step 2 for the next mode; Step 4 handles clustering in default cases, when Steps 2 and 3 have been completed without the clustering of the object.]
between each object y_i, i = 1, 2, ..., n and each of the modes x_j, j = 1, 2, ..., k through a series of pairwise comparisons on every attribute a_f ∈ F, f = 1, 2, ..., d that evaluates both y_i and x_j. In order to determine the above resemblance, a proper difference based on the concept of pairwise comparisons must be defined. This difference is based on the dissimilarity measure of Huang (1998), which is an overlap measure (Stanfill and Waltz, 1986). An overlap measure scores the similarity of two categorical attribute values as 1 if the values are identical and 0 otherwise. For two multivariate categorical objects, the similarity between them is directly proportional to the number of attributes in which they match, and their dissimilarity to the number of attributes in which they mismatch. The overlap measure is currently the most commonly used similarity measure for categorical data, due to its simplicity and ease of use (Boriah et al, 2008). Since our proposed algorithm is a modified k-modes algorithm (Huang, 1998), it makes sense to use Huang's dissimilarity measure. It must be emphasized, though, that while Huang's measure examines the similarity/dissimilarity of an object against a mode as a whole, we take every attribute into account separately. In other words, for each attribute common to object y_i and mode x_j, we calculate the difference:
$$D(x_{j,f}, y_{i,f}) = \frac{m_{x_{j,f}} + m_{y_{i,f}}}{m_{x_{j,f}} \cdot m_{y_{i,f}}} \times \delta(x_{j,f}, y_{i,f}) \qquad (1)$$

where x_{j,f} is the value of mode x_j on attribute a_f, y_{i,f} is the value of object y_i on attribute a_f, m_{x_{j,f}} is the number of times x_{j,f} appears in the set of the modes on attribute a_f, m_{y_{i,f}} is the number of times y_{i,f} appears in the set of the modes on attribute a_f, and

$$\delta(x_{j,f}, y_{i,f}) = \begin{cases} 0 & \text{if } x_{j,f} = y_{i,f} \\ 1 & \text{if } x_{j,f} \neq y_{i,f} \end{cases}$$
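To make relation (1) concrete, the following sketch computes the per-attribute difference for one object–mode pair. The data layout (modes as lists of attribute values) and the zero-frequency fallback are our own assumptions for the illustration, not part of the original paper.

```python
from collections import Counter

def attribute_difference(modes, x_jf, y_if, f):
    """Relation (1): difference between mode value x_jf and object value
    y_if on attribute f, scaled by how often each value occurs among the
    current modes on that attribute."""
    counts = Counter(mode[f] for mode in modes)
    m_x = counts[x_jf]   # occurrences of the mode's value among the modes
    m_y = counts[y_if]   # occurrences of the object's value among the modes
    delta = 0 if x_jf == y_if else 1
    if m_x == 0 or m_y == 0:
        # Assumption: if a value never occurs among the modes, the frequency
        # ratio is undefined, so we fall back to the plain mismatch indicator.
        return float(delta)
    return (m_x + m_y) / (m_x * m_y) * delta

# Example: three modes over two attributes; compare mode value 'a' with
# object value 'c' on attribute 0.
modes = [['a', 'b'], ['c', 'b'], ['a', 'd']]
print(attribute_difference(modes, 'a', 'c', 0))  # (2 + 1) / (2 * 1) * 1 = 1.5
```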
According to Huang (1998), the smaller the number of mismatches between the corresponding attribute values of an object y_i and a mode x_j, the more similar the two are. Even though this general principle is valid for relation (1), it does not consider the size of the difference itself in the similarity process. This means that if D(x_{j,f}, y_{i,f}) ≠ 0 on attribute a_f, this attribute is automatically considered a drawback in terms of how similar the object and the mode are, whether the difference is large or small. This approach considers a small difference equally inadequate, in similarity terms, to a larger one. In the proposed method we aim to expand the number of attributes that can contribute to the development of the similarity relation between an object y_i and a mode x_j. Thus, in addition to the attributes with zero value differences, some of the attributes with small non-zero value differences may also positively influence the similarity process. In order to achieve that, we check whether

D(x_{j,f}, y_{i,f}) ≤ q_{i,f}   (2)
Parameter q_{i,f} is an indifference threshold associated with object y_i and each attribute a_f ∈ F, namely an upper limit on the difference between object y_i and any of the k modes on attribute a_f ∈ F. The attributes that satisfy relation (2) belong to a concordance coalition Con(i, j) = {a_f ∈ F : D(x_{j,f}, y_{i,f}) ≤ q_{i,f}}, that is, the set of attributes that support a strong resemblance between the object and the mode according to relation (1). All the other attributes belong to a discordance coalition Dis(i, j) = {a_f ∈ F : D(x_{j,f}, y_{i,f}) > q_{i,f}} that opposes this resemblance. It should be noted that the attributes of the concordance coalition with zero value differences only belong to ConZero(i, j), which in general is a subset of Con(i, j).

Note 1—Details on the indifference threshold: The indifference threshold q_{i,f} is a very important parameter. The proper choice of q_{i,f} will positively influence the results of the clustering process. We assign the indifference threshold using a fixed rule for every object that needs to be clustered, on any data set. The rationale of the rule is to allow only small differences to satisfy relation (2). Note that the smaller the difference is for an attribute, the more similar the object and the mode are for this attribute. Taking this into consideration, the indifference threshold is defined per attribute as the largest of the previously calculated differences above which no similarity is valid between the object and the mode. Therefore, through the pairwise comparisons of object y_i with each mode x_j on every attribute a_f ∈ F that evaluates both y_i and x_j, the discrete non-zero value differences (denoted by N_{i,f}) are identified and then sorted in descending order. The indifference threshold is defined as follows (see also the sketch after Figure 2):

• If |N_{i,f}| = 0 then q_{i,f} = 0.
• If |N_{i,f}| = 1 then q_{i,f} is equal to the discrete non-zero value difference in the first position.
• If |N_{i,f}| ≥ 2 then q_{i,f} is equal to the discrete non-zero value difference in the (|N_{i,f}| − 1) position.

In Figure 2 we show a graphic representation of the above definition. Each box of the six columns represents a discrete non-zero value difference of attribute a_f, when |N_{i,f}| is valued from 1 to 6, respectively. These differences are sorted in descending order according to their values. The black boxes correspond to the indifference thresholds defined in the above six distinct cases. From the above definition, it is clear that as the number of discrete non-zero value differences increases, the number of excluded differences increases as well. Thus, regardless of the number of discrete non-zero value differences, only the smallest of them are able to satisfy relation (2). This enables the concordance coalition to include attributes that support a strong resemblance between the object and the mode according to relation (1), and the discordance coalition to include enough attributes to be able to strongly judge the clustering results of the object.
[Figure 2: Definition of the indifference threshold.]
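A small sketch of Note 1's rule, under the assumption that the k differences of object y_i against the modes on attribute a_f have already been computed; the function name is ours.

```python
def indifference_threshold(differences):
    """Note 1: derive q_{i,f} from the discrete (ie distinct) non-zero
    value differences of object y_i against all k modes on attribute a_f."""
    distinct = sorted({d for d in differences if d != 0}, reverse=True)
    if not distinct:
        return 0.0                      # |N| = 0: all differences are zero
    if len(distinct) == 1:
        return distinct[0]              # |N| = 1: the single non-zero difference
    return distinct[len(distinct) - 2]  # |N| >= 2: position |N| - 1 of the descending order

print(indifference_threshold([0, 1.5, 0.5, 3.0, 0.5]))  # distinct: [3.0, 1.5, 0.5] -> 1.5
```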
Note 2—Assessment of weights: Variable selection and weighting have been important research topics in cluster analysis (eg Makarenkov and Legendre, 2001; Modha and Spangler, 2003; Huang et al, 2005). In the proposed method, the weights of the attributes are automatically calculated within the clustering process. In particular, the process of calculating the weights follows the pattern of Huang et al (2005). According to this pattern, every weight is assigned taking into consideration the differences of every object with every mode available according to relation (1). This means that the weights are valued over the entire data space. Even though the computational complexity increases, this valuation process also increases the generality, and thus the effectiveness, of the weights. The weights are calculated as follows:

$$w_f = \begin{cases} \dfrac{n \cdot k}{\sum_{i=1}^{n}\sum_{j=1}^{k} D(x_{j,f}, y_{i,f})} & \text{if } \sum_{i=1}^{n}\sum_{j=1}^{k} D(x_{j,f}, y_{i,f}) \neq 0 \\[2ex] 0 & \text{if } \sum_{i=1}^{n}\sum_{j=1}^{k} D(x_{j,f}, y_{i,f}) = 0 \end{cases} \qquad (3)$$

Note that A_f = (Σ_{i=1}^{n} Σ_{j=1}^{k} D(x_{j,f}, y_{i,f}))/(n·k) is the overall average of the differences on attribute a_f ∈ F over every mode and every object, so the first branch of relation (3) is the inverse of A_f. According to relation (3), the smaller Σ_{i=1}^{n} Σ_{j=1}^{k} D(x_{j,f}, y_{i,f}) is, the larger the weight of the attribute should be set. When Σ_{i=1}^{n} Σ_{j=1}^{k} D(x_{j,f}, y_{i,f}) ≠ 0, some of the differences on attribute a_f ∈ F over the modes and objects are non-zero. The smaller the differences, the larger the possibility for attribute a_f to belong to the concordance coalition, that is, to the set of attributes that positively influence the similarity process. The use of the inverse of A_f ensures that the weight of such an attribute will receive a large value. On the other hand, if D(x_{j,f}, y_{i,f}) = 0 for an attribute a_f ∈ F, the object and the mode match each other on this attribute. Consequently, if Σ_{i=1}^{n} Σ_{j=1}^{k} D(x_{j,f}, y_{i,f}) = 0 over the entire data space, then attribute a_f ∈ F has the same value for every object and every mode. Thus, such an attribute is redundant and is not taken into consideration in the clustering process. It should also be noted that the weights of the attributes are re-calculated every time the set of modes changes, that is, whenever an object is assigned to a cluster.
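Relation (3) in code, valued over the entire data space as the text prescribes; `diff` is assumed to implement relation (1) (eg the `attribute_difference` sketch above).

```python
def attribute_weights(objects, modes, diff):
    """Relation (3): one weight per attribute. A zero column sum means
    the attribute has the same value everywhere, hence weight 0
    (the attribute is redundant)."""
    n, k, d = len(objects), len(modes), len(objects[0])
    weights = []
    for f in range(d):
        total = sum(diff(modes, x[f], y[f], f) for y in objects for x in modes)
        # Inverse of the overall average A_f = total / (n * k).
        weights.append((n * k) / total if total != 0 else 0.0)
    return weights
```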
2.1.2. Step 2: Concordance test and initial clustering of the object

In order to identify the most suitable mode to properly cluster the object, two issues must be considered with
respect to the attributes of the concordance coalition. First, the normalized sum of the weights of the attributes that belong to the concordance coalition must exceed a certain limit, in order to ensure that the mode is strong enough to cluster the object properly. Second, some of the concordance coalition attributes, namely the ones with zero value differences, have stronger similarity capabilities than the rest. Thus, it is important to strengthen these attributes in order to ensure their superiority in the process of identifying the mode that will properly cluster the object. The consolidation of the above two issues results in the definition of the concordance index and the concordance threshold. The concordance index for each mode x_j, j = 1, 2, ..., k and each object y_i, i = 1, 2, ..., n is calculated as follows:

$$CI(x_j, y_i) = \frac{\sum_{z \in Con(i,j)} w_z}{\sum_{h \in S} w_h} + bonus \times \frac{\sum_{c \in ConZero(i,j)} w_c}{\sum_{z \in Con(i,j)} w_z} \qquad (4)$$

where S = Con(i, j) ∪ Dis(i, j). The first fraction of relation (4) is the normalized sum of weights of the concordance coalition attributes. The second fraction is the normalized sum of weights of the concordance coalition attributes that have zero value differences. Its multiplication by bonus increases the value of the concordance index when there are attributes with zero value differences. A large value of the concordance index implies that there are enough concordance coalition attributes with a strong total weight, some of them with zero differences in their pairwise comparisons. Hence, the corresponding mode is more similar to the object and more appropriate to cluster it. Following the same logic, the concordance threshold CT(x_j, y_i) corresponding to the concordance index CI(x_j, y_i) is calculated as follows:

$$CT(x_j, y_i) = m + bonus \times \frac{\sum_{c \in ConZero(i,j)} w_c}{\sum_{z \in Con(i,j)} w_z} \qquad (5)$$

where parameter m is set according to a specific rationale (see Note 4). The concordance threshold represents the limit the concordance index must exceed in order to ensure a valid assignment of the object to the cluster of the mode that corresponds to this index. Thus, in order to identify the best mode for the clustering of the object, we first sort the concordance indices in descending order, which means that the clustering strength of the corresponding modes decreases as the value of the indices decreases. Then, the concordance index with the largest value is compared to its corresponding concordance threshold. If this concordance index is at least equal to the compared threshold, the mode that corresponds
to this index will assign the object to its cluster. If not, the process continues with the other indices of the descending order. This comparison represents the concordance test, given by the following relation:

CI_t(x_j, y_i) ≥ CT_t(x_j, y_i)   (6)

Parameter t = 1, ..., k denotes the position in the descending order of the concordance indices and the corresponding thresholds, and consequently the descending order of the modes in terms of their clustering capabilities. Clearly, inequality (6) is satisfied if the normalized sum of weights of the concordance coalition attributes exceeds parameter m. The process of the concordance test is outlined below:

t = 1
DO WHILE (t ≤ k) AND object y_i is not clustered
    IF CI_t(x_j, y_i) ≥ CT_t(x_j, y_i) THEN
        Object y_i is initially assigned to the cluster of the mode in the t position
    ELSE
        t = t + 1
    END IF
LOOP

It should be noted that as the process of finding the appropriate concordance index continues, the strength of the indices decreases. Consequently, the smaller the value of a concordance index, the larger the possibility of a false clustering for the object.

Note 3—Parameter bonus: Parameter bonus in relations (4) and (5) is an essential factor for properly sorting the concordance indices in descending order. In particular, bonus aims at increasing the value of the concordance indices according to the number of concordance coalition attributes with zero value differences, and its proportion with respect to the number of concordance coalition attributes as a whole. Based on these guidelines, the value of bonus is determined as follows:

• If ConZero(i, j) = Con(i, j), which means that all the attributes that belong to the concordance coalition have zero value differences, then bonus = p · b, where p is the number of attributes that belong to ConZero(i, j) and b is the total number of attributes taking part in the clustering process (including the attributes that belong to the discordance coalition).
• If ConZero(i, j) ⊂ Con(i, j), then bonus = p, where p is the number of attributes that belong to ConZero(i, j).

According to the above definition, bonus is always larger in the first case. This is based on the matching/mismatching rationale of the dissimilarity measure of Huang (1998), according to which the similarity between an object and a mode is directly proportional to the number of attributes in which they match, and their dissimilarity to the number of attributes in which they mismatch.
Thus, when an object and a mode are compared, the fact that all the concordance coalition attributes have zero value differences constitutes a stronger indication of similarity than when the same object is compared with another mode and the attributes with zero value differences represent only a part of the concordance coalition as a whole. The use of b always ensures the numerical superiority of bonus in the first case. It should also be noted that, in the second case of the above definition, the value of bonus increases as the number of concordance coalition attributes with zero value differences increases.

Note 4—Estimation of parameter m: This parameter represents a lower limit on the normalized sum of the weights of the attributes that belong to the concordance coalition. It is an important factor in calculating the concordance thresholds in Step 2, and thus an important factor in the implementation of the concordance test, which designates the most appropriate mode and its corresponding cluster for the object to be assigned to. Under these assumptions, the value of parameter m is valid as long as it exceeds 50%. However, we have excluded values smaller than 60% and larger than 80%. In the first case, a small m would allow most of the modes to pass the concordance test, which might lead to bad clustering results. On the other hand, if m were valued high, some of the modes that could positively influence the results would be excluded from the clustering process by the concordance test. Hence, in practice m should be valued in the interval [60%, 80%]. After extensive experimentation, involving sensitivity analysis in this interval, we have concluded that 70% is an adequate value for all our experiments. This means that, for each object, the best of the modes must incorporate at least 70% of the strength of the weights of the attributes that belong to the concordance and the discordance coalitions. The whole of Step 2 is illustrated in the sketch below.
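The sketch restates relations (4)–(6) and Note 3; the set-based representation of the coalitions and the (CI, CT, mode) triples are our own choices for the illustration, not the paper's notation.

```python
def bonus_value(con, con_zero, b):
    """Note 3: bonus = p * b when every concordant attribute has a zero
    difference, otherwise bonus = p; b is the total number of attributes."""
    p = len(con_zero)
    return p * b if con_zero == con else p

def concordance_index(weights, con, con_zero, dis, bonus):
    """Relation (4). con, con_zero and dis are sets of attribute indices;
    weights is the list produced by relation (3)."""
    w_con = sum(weights[f] for f in con)
    w_all = sum(weights[f] for f in con | dis)   # S = Con U Dis
    w_zero = sum(weights[f] for f in con_zero)
    zero_part = w_zero / w_con if w_con > 0 else 0.0
    return w_con / w_all + bonus * zero_part

def concordance_threshold(weights, con, con_zero, bonus, m=0.70):
    """Relation (5), with m = 70% as used in the paper's experiments."""
    w_con = sum(weights[f] for f in con)
    w_zero = sum(weights[f] for f in con_zero)
    zero_part = w_zero / w_con if w_con > 0 else 0.0
    return m + bonus * zero_part

def concordance_test(candidates):
    """Relation (6): scan (CI, CT, mode_id) triples in descending CI order
    and return the first mode whose index reaches its threshold."""
    for ci, ct, mode_id in sorted(candidates, key=lambda t: -t[0]):
        if ci >= ct:
            return mode_id
    return None  # no mode passes; Step 3 vetoes or Step 4 takes over
```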
2.1.3. Step 3: Discordance test (veto check)

After the clustering of the object is completed, we must examine the strength of the indications against the similarity of object y_i to the selected mode x*_j through the discordance test. The discordance test examines the ability of the minority of the attributes (the ones that belong to the discordance coalition) to veto the decision of the majority. Thus, if

D(x*_{j,f}, y_{i,f}) < U_{i,f}   (7)
for each attribute a_f that belongs to the discordance coalition, where U_{i,f} is the veto threshold of that attribute, then the initial clustering that took place in Step 2 is confirmed and the object is formally assigned to the proper cluster. It is quite possible, though, that at least one of the attributes does not satisfy the discordance test. In such a case the initial clustering is not valid. This means that the concordance index, and consequently the mode that lies behind it, chosen by the concordance test, is excluded from the clustering process. The method then starts again from Step 2. The process, combined with the concordance test, is outlined below:

t = 1
DO WHILE (t ≤ k) AND object y_i is not clustered
    IF CI_t(x_j, y_i) ≥ CT_t(x_j, y_i) THEN
        IF D(x*_{j,f}, y_{i,f}) < U_{i,f} for each attribute a_f that belongs to the discordance coalition THEN
            Object y_i is assigned to the cluster of the mode in the t position
        ELSE
            t = t + 1
        END IF
    ELSE
        t = t + 1
    END IF
LOOP

In other words, until a concordance index and its corresponding concordance threshold that satisfy both Steps 2 and 3 are found, the process continues by choosing and checking the next best concordance index and concordance threshold.

Note 5—Details on the veto threshold: The veto threshold U_{i,f} designates the minimum difference D(x*_{j,f}, y_{i,f}) above which the dissimilarity of the selected mode x*_j from object y_i is so substantial that the similarity relation between them cannot be justified. The veto threshold is very closely related to the indifference threshold q_{i,f}. It is important to ensure that q_{i,f} ≤ U_{i,f} for each attribute a_f that belongs to the discordance coalition; if the veto threshold were smaller than the indifference threshold for such an attribute, the differences of that attribute would always exceed the veto threshold, and the whole veto check would become unnecessary. In order to estimate the veto threshold, we use the discrete non-zero value differences of the discordance coalition attributes identified in the process of calculating q_{i,f} (denoted by N_{i,f}). Having sorted these differences in descending order, the veto threshold is estimated as follows:

• If |N_{i,f}| ≤ 3, then the discordance test is not applied to the discordance coalition attribute.
• If |N_{i,f}| ≥ 4, then U_{i,f} is equal to the discrete non-zero value difference in the second position (the second largest discrete non-zero difference).

[Figure 3: Definition of the veto threshold.]

In Figure 3 we show a graphic representation of the above definition. Each box of the six columns represents a discrete
non-zero value difference of discordance coalition attribute a_f, when |N_{i,f}| is valued from 1 to 6, respectively. These differences are sorted in descending order according to their values. The black boxes correspond to the indifference thresholds defined in the above six distinct cases, and the grey ones to the veto thresholds. There is a twofold rationale in the above definition. First, it is not necessary to define a veto threshold, and consequently apply the discordance test, for each and every attribute of the discordance coalition. The necessary condition for doing so is the existence of a significant number of non-zero value differences in the discordance coalition attributes. In other words, the veto threshold is defined and the discordance test is applied only in those cases where there exist strong indications for rejecting the chosen mode. Secondly, when the number of non-zero value differences of the discordance coalition attributes is significant enough, we must judge, and exclude from the clustering process, a selected mode that includes discordance coalition attributes with the largest non-zero value differences.
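Note 5 and relation (7) as a sketch; `None` marks attributes where the test is skipped (|N| ≤ 3), and the dictionary-based interface is our own assumption.

```python
def veto_threshold(differences):
    """Note 5: U_{i,f} from the distinct non-zero value differences of a
    discordance-coalition attribute, sorted in descending order."""
    distinct = sorted({d for d in differences if d != 0}, reverse=True)
    if len(distinct) <= 3:
        return None      # discordance test not applied for this attribute
    return distinct[1]   # second largest distinct non-zero difference

def passes_discordance_test(dis, diff_of_chosen_mode, veto):
    """Relation (7): the initially chosen mode survives only if, for every
    discordance attribute with a defined threshold, D(x*_{j,f}, y_{i,f})
    stays strictly below U_{i,f}."""
    for f in dis:
        u = veto.get(f)
        if u is not None and diff_of_chosen_mode[f] >= u:
            return False  # veto: the initial clustering is rejected
    return True
```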
2.1.4. Step 4: Clustering in default cases

It is possible that in some iterations no clustering can be made, in spite of the continuous efforts through the concordance and discordance tests. This means either that the modes are weak (failure of the concordance test) or that there are modes that raise strong opposition to assigning the object to a cluster (failure of the concordance test combined with the discordance test). This possible deficiency is overcome by introducing an alternative process of assigning an object to a cluster. This process is based solely on the absolute values of the differences between the concordance indices and the concordance thresholds. The object is assigned to the cluster of the mode that corresponds to the concordance index–threshold pair with the largest of the above absolute differences. Thus, this alternative process ignores the possible objections of some of the attributes (discordance test) and takes into consideration the concordance test under a new perspective that still incorporates the dynamics of the strength of the mode in terms of its attributes' weights. It should be noted that the default cases are rather rare and did not appear at all in our empirical results.
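Step 4's fallback in one line, reusing the (CI, CT, mode_id) triples of the earlier sketch:

```python
def default_clustering(candidates):
    """Step 4: ignore the vetoes and pick the mode with the largest
    absolute gap |CI - CT| between index and threshold."""
    return max(candidates, key=lambda t: abs(t[0] - t[1]))[2]
```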
2.2. Summary of CL.E.KMODES

The main features of CL.E.KMODES are outlined below:

INPUT set Y = {y_1, y_2, ..., y_n} containing n categorical objects, evaluated in a set F = {a_1, a_2, ..., a_d} of attributes
DEFINE the set of the initial modes KEN = {x_1, x_2, ..., x_k}, evaluated also in F
DO UNTIL the algorithm converges, that is, until no object y_i ∈ Y, i = 1, 2, ..., n has changed clusters after a full cycle test of the whole data set
    FOR every object y_i ∈ Y, i = 1, 2, ..., n
        FOR every mode x_j ∈ KEN, j = 1, 2, ..., k
            FOR every attribute a_f, f = 1, 2, ..., d
                Calculate the differences D(x_{j,f}, y_{i,f})
                IF D(x_{j,f}, y_{i,f}) ≤ q_{i,f}, where q_{i,f} is an indifference threshold THEN
                    Attribute a_f belongs to the concordance coalition
                ELSE
                    Attribute a_f belongs to the discordance coalition
                END IF
            NEXT FOR
            Calculate the weights of the attributes w_f, f = 1, 2, ..., d
            Calculate the concordance index CI(x_j, y_i) and the corresponding threshold CT(x_j, y_i)
        NEXT FOR
        SORT the concordance indices CI(x_j, y_i), for every mode x_j, in descending order
        t = 1
        DO WHILE (t ≤ k) AND object y_i is not clustered
            IF CI_t(x_j, y_i) ≥ CT_t(x_j, y_i) THEN
                IF D(x*_{j,f}, y_{i,f}) < U_{i,f} for each attribute a_f that belongs to the discordance coalition THEN
                    Object y_i is assigned to the cluster of the mode in the t position of the descending order.
                    Update the mode of the above cluster and the mode of the previous cluster (after the first iteration).
                    Go to the next object.
                ELSE
                    t = t + 1
                END IF
            ELSE
                t = t + 1
            END IF
        LOOP
        IF no assignment to a cluster can be made THEN
            FOR every mode x_j, j = 1, ..., k
                Calculate the absolute differences |CI(x_j, y_i) − CT(x_j, y_i)|
            NEXT FOR
            SORT the absolute differences |CI(x_j, y_i) − CT(x_j, y_i)| in descending order and assign the object to the mode that corresponds to the largest absolute difference
        END IF
    NEXT FOR
    After all objects have been assigned to clusters, re-test the dissimilarity of objects against the current modes. If necessary, re-assign an object to a more suitable cluster and update the modes of both clusters.
LOOP
End of the clustering process when the algorithm converges
2.3. Complexity of CL.E.KMODES

According to the outline presented in Section 2.2, the computational complexity of the proposed method is
O(n · k² · d · log(k) · u), where n is the number of objects to be clustered, k is the number of clusters, d is the number of attributes over which the objects and the modes are evaluated, v is the maximum number of discrete non-zero value differences D(x_{j,f}, y_{i,f}) and u is the total number of iterations required for convergence. More specifically, for each of the objects to be clustered, we need to perform the following operations:

• The calculation of the differences D(x_{j,f}, y_{i,f}), the indifference threshold q_{i,f}, the concordance indices and the corresponding concordance thresholds, with a complexity of O(k · d · v · log(v)).
• The calculation of the weights of the attributes, with a complexity of O(n · k · d).
• The sorting of the k concordance indices and the corresponding thresholds in descending order, with a complexity of O(k · log(k)).
• The assignment of the object to the proper cluster, after implementing the concordance and discordance tests or the clustering process of the default cases, with a complexity of O(k).
• The update of the modes, with a complexity of O(n · k · d).
3. Experimental results

3.1. Description of the data sets

In order to test its validity and robustness, we have applied the proposed clustering method to seven well-known categorical data sets supplied by the UCI Machine Learning Repository (Aha and Murphy, 1994). These are the Small Soybean, Zoo, Audiology (standardized) (compiled by Professor Jergen at Baylor College of Medicine), Congressional Voting, Balance Scale (Siegler, 1976), Tic Tac Toe and Car Evaluation (Bohanec and Rajkovic, 1988) data sets. They describe a variety of diverse problems, such as plant disease (Small Soybean), medical issues (Audiology (standardized), Balance Scale), social issues (Congressional Voting, Zoo, Car Evaluation) and games (Tic Tac Toe). In addition, some of them are very small, such as the Small Soybean data set, others are quite large, such as the Car Evaluation data set, with the others ranging between them as far as their size is concerned. Finally, the selected data sets vary in both the numbers of
attributes and clusters. Table 1 shows the details of these data sets. It should be noted that in this paper, every missing attribute value of the tested data sets is considered as another distinct attribute value (Shyu et al, 2005).
3.2. Validation of clustering results

A series of well-known validity indices have been used to evaluate the clustering results of the proposed algorithm (Dubes and Jain, 1979; Dubes, 1998; Theodoridis and Koutroumbas, 1999; Halkidi et al, 2001; Rosenberg and Hirschberg, 2007). We define as target clustering the partition of a data set used for testing purposes, while the partition of the same data set produced by the clustering process of CL.E.KMODES is defined as the hypothesized clustering. In both the target and the hypothesized clustering, a 'clustering solution' is a mapping from each object to its cluster assignment. In this context, we refer to the target partitions, or clusters, as 'classes' (see Table 1), and only to hypothesized clusters as 'clusters'. It should be noted that the target clustering for each of the seven data sets is defined by the UCI Machine Learning Repository.

The first category of clustering evaluation techniques is based on a combinatorial approach that examines the number of pairs of objects that are clustered similarly in the target and the hypothesized clustering. That is, each pair of points can be (1) clustered together in both clusterings (SS), (2) clustered together in the hypothesized but not the target clustering (SD), (3) clustered together in the target but not in the hypothesized clustering (DS) or (4) clustered separately in both clusterings (DD). Based on these four values, a number of measures have been proposed, including the Rand Index (Rand, 1971), Jaccard (Milligan et al, 1983) and Fowlkes–Mallows (Fowlkes and Mallows, 1983). High values of these three measures indicate strong similarity between the target and hypothesized clustering. Assuming that a, b, c and d are the numbers of SS, SD, DS and DD pairs, respectively, and N is the number of objects to be clustered, then

$$\text{Rand} = \frac{2 \cdot (a + d)}{N \cdot (N - 1)} \qquad (8)$$
Table 1  Details on the tested data sets

Data sets                   No. of objects   No. of attributes   No. of classes   Missing values
Small Soybean               47               35                  4                No
Zoo                         101              16                  7                No
Audiology (standardized)    226              69                  24               Yes
Congressional Voting        435              16                  2                Yes
Balance Scale               625              4                   3                No
Tic Tac Toe                 958              9                   2                No
Car Evaluation              1728             6                   4                No
Table 2  Clustering evaluation results

Data sets                     Rand     Jaccard   Fowlkes–Mallows   Purity   Entropy   F-measure
Small Soybean
  CL.E.KMODES                 0.9167   0.7794    0.8609            0.6960   0.1308    0.1800
  k-modes                     0.8132   0.5259    0.6920            0.5497   0.2774    0.1130
Zoo
  CL.E.KMODES                 0.9261   0.7218    0.8348            0.4001   0.1495    0.3329
  k-modes                     0.8784   0.5860    0.7344            0.3410   0.2168    0.1589
Audiology (standardized)
  CL.E.KMODES                 0.8586   0.1638    0.3057            0.0639   0.2887    0.0416
  k-modes                     0.7995   0.1256    0.2258            0.0176   0.4294    0.0163
Congressional Voting
  CL.E.KMODES                 0.9069   0.8382    0.9104            0.9502   0.2610    0.5911
  k-modes                     0.5685   0.4947    0.6768            0.6652   0.8624    0.3435
Balance Scale
  CL.E.KMODES                 0.5345   0.2850    0.4438            0.4538   0.7972    0.3876
  k-modes                     0.5227   0.2441    0.3943            0.2091   0.8242    0.2974
Tic Tac Toe
  CL.E.KMODES                 0.5338   0.3974    0.5683            0.5570   0.8922    0.5245
  k-modes                     0.5009   0.3749    0.5454            0.5056   0.9274    0.3665
Car Evaluation
  CL.E.KMODES                 0.5118   0.2592    0.4321            0.2978   0.4877    0.2711
  k-modes                     0.4905   0.2233    0.3900            0.2438   0.5470    0.2100
$$\text{Jaccard} = \frac{a}{a + b + c} \qquad (9)$$

$$\text{Fowlkes–Mallows} = \frac{a}{\sqrt{(a + b) \cdot (a + c)}} \qquad (10)$$

Two other very commonly used external measures for assessing clustering success are Purity and Entropy, defined as

$$\text{Purity} = \frac{1}{N} \sum_{r=1}^{k} \max_i (n_r^i) \qquad (11)$$

$$\text{Entropy} = \sum_{r=1}^{k} \frac{n_r}{N} \times \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right) \qquad (12)$$

where N is the number of objects to be clustered, q is the number of classes, k is the number of clusters, n_r is the size of cluster r and n_r^i is the number of objects of class i clustered in cluster r.

Another frequently used external clustering evaluation measure is commonly referred to as 'clustering accuracy'. The calculation of this accuracy is inspired by the information retrieval metric of F-measure, which was first introduced by Van Rijsbergen (1979). Let N be the number of objects, C the set of classes, K the set of clusters and n_{ij} the number of members of class c_i ∈ C that are elements of cluster k_j ∈ K. The F-measure is described below:

$$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{N} \max_{k_j \in K} \{F(c_i, k_j)\} \qquad (13)$$

where

$$F(c_i, k_j) = \frac{2 \cdot R(c_i, k_j) \cdot P(c_i, k_j)}{R(c_i, k_j) + P(c_i, k_j)}, \quad R(c_i, k_j) = \frac{n_{ij}}{|c_i|}, \quad P(c_i, k_j) = \frac{n_{ij}}{|k_j|}$$
The F-measure has a significant advantage over Purity and Entropy, in that it measures both the homogeneity and the completeness of a clustering solution. However, the F-measure is sometimes problematic: if a cluster contains the majority (or even all) of the objects, more than one class may be matched with that single cluster, while some clusters are not matched with any class (meaning that those clusters are not evaluated in the F-measure). Thus, we exclude already matched clusters when taking the max in relation (13), so that a class is matched only with a cluster that yields the maximum F-measure. The same logic is followed in the Purity measure as well. It should be noted that a smaller Entropy implies better clustering quality, while the bigger the F-measure and Purity are, the better the clustering quality is.
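For reference, the sketch below computes relations (8)–(12) from two label vectors. It implements the plain textbook versions; the paper's exclusion of already matched clusters in Purity and the F-measure is omitted here, and the entropy normalization assumes q ≥ 2 classes.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(target, hypo):
    """Count SS (a), SD (b), DS (c) and DD (d) pairs of objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(target)), 2):
        same_t, same_h = target[i] == target[j], hypo[i] == hypo[j]
        if same_t and same_h:        a += 1
        elif same_h and not same_t:  b += 1
        elif same_t and not same_h:  c += 1
        else:                        d += 1
    return a, b, c, d

def pair_measures(target, hypo):
    """Relations (8)-(10): Rand, Jaccard and Fowlkes-Mallows."""
    a, b, c, d = pair_counts(target, hypo)
    n = len(target)
    rand = 2 * (a + d) / (n * (n - 1))
    jaccard = a / (a + b + c)
    fowlkes_mallows = a / math.sqrt((a + b) * (a + c))
    return rand, jaccard, fowlkes_mallows

def purity_entropy(target, hypo):
    """Relations (11)-(12), with entropy normalized by log q."""
    n, q = len(target), len(set(target))
    clusters = {}
    for t, h in zip(target, hypo):
        clusters.setdefault(h, []).append(t)
    purity = sum(max(Counter(m).values()) for m in clusters.values()) / n
    entropy = 0.0
    for members in clusters.values():
        n_r = len(members)
        inner = -sum((v / n_r) * math.log(v / n_r)
                     for v in Counter(members).values())
        entropy += (n_r / n) * inner / math.log(q)  # assumes q >= 2
    return purity, entropy

target = [0, 0, 0, 1, 1, 1]
hypo   = [0, 0, 1, 1, 1, 1]
print(pair_measures(target, hypo), purity_entropy(target, hypo))
```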
3.3. Clustering evaluation results

In order to minimize possible deficiencies caused by the effect of record order, we have run each of the data
sets 200 times, using 200 randomly selected distinct sets of initial modes. It should be noted that in every run, both the proposed algorithm and k-modes start their iterative clustering process from the same initial modes. Furthermore, in all the runs of CL.E.KMODES, parameter m was set at 70%. Table 2 shows the weighted average of the clustering results of the proposed method compared with the corresponding results of the k-modes algorithm itself, according to the six evaluation measures previously described. Note that for the Entropy measure, the smaller the value, the better.

As we can see from Table 2, the modified k-modes algorithm performs decisively better than the k-modes algorithm itself on every evaluation measure used. This observation holds for all data sets used in the experiments, regardless of their size or nature. These results constitute a significant advantage of the proposed method over k-modes, because even the smallest improvement in the clustering results may be extremely important due to the scientific, economic or other implications of the results, according to the nature of each data set. The proposed modified k-modes algorithm has expanded the number and the quality of the necessary conditions for assigning an object to a cluster, and thus each object is allocated to a specific cluster more efficiently than before.
4. Conclusions and future perspectives

In this paper we have proposed a modified k-modes algorithm that incorporates a new four-step dissimilarity measure introducing an alternative assignment of a categorical object to a cluster. In order to make the proper assignment, the new dissimilarity measure is based on elements of the theoretical framework of the ELECTRE methods, and especially the ELECTRE I method. These elements, adjusted to the rationale of the k-modes algorithm, expand the number and the quality of the necessary conditions for assigning an object to a cluster, resulting in improved clustering accuracy.

We have tested the performance of the proposed algorithm by applying it to seven well-known data sets and comparing it to the corresponding results of the k-modes algorithm. On every evaluation measure used, the clustering results are better than those of the k-modes algorithm itself. Thus, we may conclude that the proposed method, through its new dissimilarity measure, introduces a new clustering process based on elements of multicriteria analysis that outperforms one of the most important partitioning clustering algorithms.

The proposed dissimilarity measure may also be incorporated into nearly every k-modes-type algorithm. This is a very challenging task for the future. Furthermore, a new form of the dissimilarity measure could be created to enable the decision maker to use purely numerical or mixed data.
References

Aha D and Murphy P (1994). UCI Repository of Machine Learning Datasets. University of California, Department of Information and Computer Science: Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Ahmad A and Dey L (2007). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn Lett 28(1): 110–118.
Anderberg MR (1973). Cluster Analysis for Applications. Academic Press: New York.
Bohanec M and Rajkovic V (1988). Knowledge acquisition and explanation for multi-attribute decision making. In: Proceedings of the 8th International Workshop on Expert Systems and their Applications. Avignon, France, pp 59–78.
Boriah S, Chandola V and Kumar V (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the Eighth SIAM International Conference on Data Mining. Atlanta, GA, pp 243–254.
Bouyssou D (2001). Outranking methods. In: Floudas CA and Pardalos PM (eds). Encyclopedia of Optimization, Vol. 4. Kluwer, pp 255–289.
Burnaby T (1970). On a method for character weighting a similarity coefficient, employing the concept of information. Math Geol 2(1): 25–28.
Cramer H (1946). The Elements of Probability Theory and Some of its Applications. John Wiley & Sons: New York.
Dubes CR (1998). Cluster analysis and related issues. Handbook Pattern Recogn Comput Vis 1: 3–32.
Dubes CR and Jain AK (1979). Validity studies in clustering methodologies. Pattern Recogn 11: 235–254.
Eskin E, Arnold A, Prerau M, Portnoy L and Stolfo S (2002). A geometric framework for unsupervised anomaly detection. In: Barbara D and Jajodia S (eds). Applications of Data Mining in Computer Security. Kluwer Academic Publishers: Dordrecht (Hingham, MA), pp 78–100.
Figueira J, Mousseau V and Roy B (2005). ELECTRE methods. In: Figueira J, Greco S and Ehrgott M (eds). Multiple Criteria Decision Analysis: State of the Art Surveys. Springer: New York, pp 133–153.
Fowlkes EB and Mallows CL (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569.
Gambaryan P (1964). A mathematical model of taxonomy. Izvest Akad Nauk Armen SSR 17(12): 47–53.
Gan G, Yang Z and Wu J (2005). A genetic k-modes algorithm for clustering categorical data. In: Proceedings of the 1st International Conference, ADMA'05. Wuhan, China, pp 195–202.
Goodall DW (1966). A new similarity index based on probability. Biometrics 22(4): 882–907.
Guha S, Rastogi R and Shim K (2000). ROCK: A robust clustering algorithm for categorical attributes. Inform Syst 25(5): 345–366.
Halkidi M, Batistakis Y and Vazirgiannis M (2001). On clustering validation techniques. J Intell Inf Syst 17(2–3): 107–145.
He Z, Deng S and Xu X (2005). Improving k-modes algorithm considering frequencies of attribute values in mode. In: Proceedings of the International Conference on Computational Intelligence and Security, CIS 2005. Xi'an, China, pp 157–162.
Huang Z (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Disc 2(3): 283–304.
Huang Z and Ng MK (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE T Fuzzy Syst 7(4): 446–452.
Huang Z, Ng MK, Rong H and Li Z (2005). Automated variable weighting in k-means type clustering. IEEE T Pattern Anal 27(5): 657–668.
Jain AK, Murty NM and Flynn JP (1999). Data clustering: A review. ACM Comput Surv 31(3): 264–323.
Jollois F and Nadif M (2002). Clustering large categorical data. In: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD'02. Taipei, Taiwan, pp 257–263.
Le SQ and Ho TB (2005). An association-based dissimilarity measure for categorical data. Pattern Recogn Lett 26(16): 2549–2557.
Lin D (1998). An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, ICML'98. Madison, WI, pp 296–304.
MacQueen JB (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Vol. 1. University of California Press, pp 281–297.
Makarenkov V and Legendre P (2001). Optimal variable weighting for ultrametric and additive trees and k-means partitioning: Methods and software. J Classif 18: 245–271.
Maung K (1941). Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Ann Eugen 11: 189–223.
Milligan GW, Soon SC and Sokol LM (1983). The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure. IEEE T Pattern Anal 5: 40–47.
Modha DS and Spangler WS (2003). Feature weighting in k-means clustering. Mach Learn 52: 217–237.
Ng MK and Wong JC (2002). Clustering categorical data sets using tabu search techniques. Pattern Recogn 35(12): 2783–2790.
Ng MK, Li MJ, Huang JZ and He Z (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE T Pattern Anal 29(3): 503–506.
Pearson K (1916). On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11(3): 145–158.
Pirlot M (1997). A common framework for describing some outranking methods. J Multi-Criteria Decis Anal 6: 86–92.
Rand WM (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850.
Rosenberg A and Hirschberg J (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Prague, pp 410–420.
Roy B (1968). Classement et choix en présence de points de vue multiples: La méthode ELECTRE. RIRO 8: 57–75.
Roy B (1991). The outranking approach and the foundations of ELECTRE methods. Theor Decis 31: 49–73.
Roy B (1996). Multicriteria Methodology for Decision Aiding. Kluwer Academic Publishers: Dordrecht.
San O, Huynh V and Nakamori Y (2004). An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2): 241–247.
Shyu ML, Kuruppu-Appuhamilage IP, Chen SC and Chang L (2005). Handling missing values via decomposition of the conditioned set. In: Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI'05. Las Vegas, NV, pp 199–204.
Siegler RS (1976). Three aspects of cognitive development. Cognitive Psychol 8: 481–520.
Smirnov ES (1968). On exact methods in systematics. Syst Zool 17(1): 1–13.
Stanfill C and Waltz D (1986). Toward memory-based reasoning. Commun ACM 29(12): 1213–1228.
Sun Y, Zhu Q and Chen Z (2002). An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recogn Lett 23(7): 875–884.
Theodoridis S and Koutroumbas K (1999). Pattern Recognition. Academic Press: Orlando, FL.
Van Rijsbergen CJ (1979). Information Retrieval, 2nd edn. Butterworths: London.
Appendix

The estimation of a new mode is based on the process of finding the mode of a set in the k-modes algorithm (Huang, 1998). In more detail, if Y = {y_1, y_2, ..., y_n} is a set of n objects evaluated over d attributes with categorical values, then the mode of Y is a vector H = {h_1, h_2, ..., h_d} that minimizes

$$DIS(H, Y) = \sum_{j=1}^{n} \sum_{f=1}^{d} D(y_{j,f}, h_f) \qquad (14)$$

where D(y_{j,f}, h_f) is defined according to relation (1). It should be noted that H is not necessarily an element of Y. Let n_{c_{k,f}} be the number of objects having category c_{k,f} on attribute a_f, and fr(a_f = c_{k,f} | Y) = n_{c_{k,f}}/n the relative frequency of category c_{k,f} in Y. Then DIS(H, Y) is minimized if fr(a_f = h_f | Y) ≥ fr(a_f = c_{k,f} | Y) for h_f ≠ c_{k,f}, for every attribute a_f. Note that a mode of a data set Y is not unique.
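A minimal sketch of this update rule, with ties broken arbitrarily (consistent with the mode not being unique):

```python
from collections import Counter

def update_mode(cluster_objects):
    """Huang (1998): on each attribute, the new mode takes the category
    with the highest relative frequency among the cluster's objects."""
    d = len(cluster_objects[0])
    return [Counter(obj[f] for obj in cluster_objects).most_common(1)[0][0]
            for f in range(d)]

# Example: a three-object cluster over two categorical attributes.
print(update_mode([['a', 'x'], ['a', 'y'], ['b', 'y']]))  # ['a', 'y']
```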
Received November 2007; accepted January 2009 after one revision