Knowl Inf Syst (2006) 9(1): 109–129 DOI 10.1007/s10115-005-0213-x
Knowledge and Information Systems
R E G U L A R PA P E R
Fadi Abdeljaber Thabtah · Peter Cowling · Yonghong Peng
Multiple labels associative classification
Received: 1 June 2004 / Revised: 26 November 2004 / Accepted: 30 January 2005 / Published online: 20 April 2005 / © Springer-Verlag London Ltd. 2005
Abstract Building fast and accurate classifiers for large-scale databases is an important task in data mining. There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. In this paper, the problem of producing rules with multiple labels is investigated, and we propose a multi-class, multi-label associative classification approach (MMAC). In addition, four measures are presented for evaluating the accuracy of classification approaches to a wide range of traditional and multi-label classification problems. Results for 19 different data sets from the UCI data collection and nine hyperheuristic scheduling runs show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable when compared with other traditional and associative classification approaches.

Keywords Data mining · Association rule · Classification · Multi-label classification · Frequent itemset · Hyperheuristic · Scheduling

F. A. Thabtah (B) · P. Cowling
Department of Computing, MOSAIC Research Centre, School of Informatics, University of Bradford, Bradford, BD7 1DP, UK
E-mail: [email protected]

Y. Peng
Department of Computing, School of Informatics, University of Bradford, Bradford, BD7 1DP, UK

1 Introduction

Data mining is applicable to a wide variety of areas, such as finance, marketing and retail, and it has attracted much attention in the knowledge discovery and machine learning research communities [32]. Classification is a well-known task in data mining and is used to predict the class of an unseen instance as accurately as
possible. Whilst single-label classification, which assigns each rule in the classifier the most obvious class label, has been widely studied [13, 20, 25, 26, 34], little work has been conducted on multi-label classification. Furthermore, the majority of the research carried out to date on multi-label classification relates mainly to text categorisation [15, 28].

Currently, one of the most common multi-label classification approaches is one versus the rest (OvR) [33], which constructs a set of binary classifiers obtained by training on each possible class versus all the rest. The OvR approach uses a winner-takes-all strategy that assigns a real value to each class to indicate class membership. Another approach to multi-label classification is one versus one (OvO) [28], which constructs a classifier trained on each possible pair of classes. For K classes, this results in K(K − 1)/2 binary classifiers, which may be problematic if K is large. The OvR approach, on the other hand, has been criticised for training on several separate classification problems, since each class can easily be separated from the rest, giving contradictory decisions (whenever two or more rules predict a test instance) or no decision (whenever none of the resulting rules can predict a test instance) [9].

Most current traditional classification algorithms use heuristic-based strategies for building a classification system. During the construction of each classifier, the heuristics look for single-label rules with high accuracy. In cases where a rule is associated with more than one class label, they usually select the rule associated with the label that leads to the lower error rate and simply ignore all other possibilities. However, one or more of the ignored rules may also produce error rates competitive with that of the selected rule.
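To make the combinatorics of the one-versus-one construction concrete, the following minimal Python sketch enumerates the binary sub-problems OvO would train on; the class names are illustrative, not taken from any data set in the paper.

```python
from itertools import combinations

def ovo_problems(classes):
    """Enumerate the binary sub-problems one-versus-one (OvO) trains on:
    one classifier per unordered pair of classes, i.e. K(K - 1)/2 in total."""
    return list(combinations(classes, 2))

pairs = ovo_problems(["cold", "flu", "tuberculosis", "healthy"])
# For K = 4 classes, OvO needs 4 * 3 / 2 = 6 binary classifiers.
assert len(pairs) == 6
```

The quadratic growth in the number of classifiers is exactly why the text notes OvO may be problematic for large K.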
Moreover, in some cases the ignored rules could have a larger representation in the training data, possibly making them very useful in decision making. The majority of current traditional classification approaches, such as decision trees [11, 25] and covering algorithms [20], do not take the creation of possibly useful rules with multiple labels into consideration.

Since the introduction of association rule discovery [1], it has been extensively studied [2, 13, 21, 24] and remains one of the most active research areas in data mining [14]. Association rule mining originally aimed to investigate the shopping behaviour of customers in the hope of finding regularities. Classification aims to predict a single class label, while association rule mining can discover relationships among any attributes in the data. In recent years, a new approach that integrates association rules with classification, named associative classification, has been proposed [17, 22, 30]. Accurate classifiers that use associative classification include CBA [22], CMAR [17], CPAR [34] and MCAR [30].

As in the majority of traditional classification techniques, most existing associative classification techniques keep only the most obvious class label correlated with a rule and simply ignore all other labels. However, multi-label classification rules may often be useful in practice. For the growing quantity of data in fields such as bioinformatics, scene classification and text categorisation, multi-label classification is more appropriate. It allows human refinement of classes, or allows further information (such as dictionaries in recognition) to be used in classification; hence there is great potential to develop new algorithms to treat such problems.
Consider for example medical diagnosis, where symptoms such as fever, blocked sinus and coughing could relate to three potential class labels of illness, "cold", "flu" or "tuberculosis", which are stored in a database. Assume that these symptoms are associated 40, 34 and 26 times with the "cold", "flu" and "tuberculosis" labels, respectively, and that the symptoms appear 100 times in the database altogether. A traditional associative classification algorithm extracts only the rule associated with the most obvious label, i.e. "cold", since it has the largest occurrence, and ignores the other potential rules. However, in this case it is useful to extract the other rules, since they carry useful information with a large representation in the database; the two ignored rules may take a role in prediction and may be very useful to the decision maker. There are many more application areas similar to medical diagnosis where multi-label rules are more appropriate than single-label rules, such as text categorisation and scene classification.

This paper introduces a novel approach for multi-label classification, named multi-class, multi-label associative classification (MMAC). The presented technique assumes that for each training instance that passes certain thresholds, there is a rule associated with not only the most obvious class label, but with the second, third, . . . , kth possible class labels. In order to evaluate classifiers derived by MMAC on different application themes, and compare them to other approaches, four evaluation methods are presented.

Related research on the associative classification approach is surveyed in Sect. 2. The multi-label classification problem is introduced in Sect. 3. Basic concepts of association rule and associative classification are discussed in Sect. 4. The MMAC approach and our methods for evaluation of traditional and multi-label classifiers are presented in Sects.
5 and 6, respectively. Experimental results are given in Sect. 7 and, finally, conclusions are presented in Sect. 8.

2 Related work on associative classification

Ali et al. [3] used association rule mining to calculate correlations between attributes and class labels for two applications, telecommunications and medical diagnosis, using the a priori approach of Agrawal and Srikant [2]. Their aim was to extract overlapping rules that are individually accurate; they did not consider prediction of future class labels. The results of the two case studies indicate that association rules can be used to generate many rules useful to medical practitioners. The authors speculated on whether association rule mining could be used for prediction.

Liu et al. [22] proposed an algorithm called CBA that integrates association rule mining with classification. CBA operates in three main steps. First, it discretises real/integer attributes; second, it uses the a priori approach of Agrawal and Srikant [2] to discover frequent itemsets and generate the rules. Finally, a subset of the rules produced is selected to represent the classification system. The discovery of frequent itemsets is the most resource- and time-consuming step in CBA, since it requires multiple passes over the training data. In each pass, the rule items found in the previous pass are used as seeds to generate potential rule items in the current pass. Experimental results show that CBA performs well with regard to error rate when compared with decision trees [25].
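The level-wise, multiple-pass candidate generation that CBA inherits from the a priori approach can be sketched as follows. This is a minimal illustration of the general technique, not CBA's actual implementation; transactions are modelled simply as sets of items.

```python
def apriori_frequent(transactions, min_supp):
    """Level-wise (a priori style) frequent-itemset discovery: each pass
    joins itemsets found frequent in the previous pass, then re-scans the
    data to count support of the new candidates."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    singles = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in singles if support(s) >= min_supp}
    frequent = set(level)
    while level:
        # Candidate generation: join itemsets from the previous level.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = {c for c in candidates if support(c) >= min_supp}
        frequent |= level
    return frequent

data = [{"x1", "y1"}, {"x1", "y2"}, {"x1", "y1"}, {"x2", "y1"}]
freq = apriori_frequent(data, min_supp=2)
```

Note the repeated calls to `support`, i.e. repeated data scans per pass; this is the cost MMAC's one-scan intersection technique (Sect. 5.1.1) is designed to avoid.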
Li et al. [17] developed the CMAR algorithm, which uses the FP-growth approach [13] to find frequent itemsets. CMAR differs from other associative methods in that it uses more than one rule to assign a class to a test object and stores the classification rules in a prefix-tree data structure known as a CR-tree. When classifying a new object, CMAR accumulates the subset of classification rules matching the new object and looks at their class labels. Where all rules share a common class, CMAR simply assigns that class to the test object. Where the classes of the accumulated rules are not identical, CMAR divides the rules into groups based on their class values and compares the effects of the groups to identify the strongest one. The strength of a group is measured by the weighted χ2 adopted from [16].

The effectiveness of a rule-based classifier on test data sets that contain missing values has been studied by Li et al. [19]. The authors introduced two simple methods: one extends decision trees [25] and the other extends optimal class association rules [18]. They compare the predictive power of the two methods with popular classification methods, such as CBA and decision trees, on test data that contain missing values. They claim that their association rule method generally derives a smaller set of rules that is able to predict test data objects with missing values. The k-optimal rule set has been introduced to obtain still more compact, effective rule-based classifiers. Results reported on four data sets [23] showed that the optimal association rule method is competitive with traditional classification methods based on decision trees.

A greedy associative classification algorithm called CPAR, which adopts the FOIL strategy [26] to generate rules, was proposed by Yin and Han [34]. CPAR seeks the rule condition that brings the most FOIL gain among those available in the data set.
FOIL gain measures the information gained from adding a condition to the current rule. Once the condition is identified, the weights of the positive examples associated with it are reduced by a multiplying factor, and the process repeats until all positive examples in the training data set are covered. The search for the best rule condition is the most time-consuming process of CPAR, since the gain of every possible item must be calculated to determine the best overall gain. In the rule-generation process, CPAR derives not only the best condition but also all similar ones, since there is often more than one attribute item with similar gain. It is claimed that CPAR improves the efficiency of the rule-generation process when compared with popular associative classification methods such as CBA.

An approach for building classification systems based on both positive and negative rules has recently been introduced by Antonie and Zaiane [4]. The "interestingness" of the rules in the proposed algorithm is based on the correlation coefficient, which measures the strength of the linear relationship between pairs of variables [29]. In addition to the confidence and support thresholds, the correlation coefficient is used to prune the final rules in the classifier, giving a much reduced rule set. The way in which the algorithm generates the rules is similar to an a priori method, where multiple scans are required to discover the rules. Ranking occurs in a similar way to the CBA rule-ranking method, i.e. by confidence and support. Experimental tests on six data sets from the UCI data collection [23] showed that negative association rules are useful when used with positive rules for producing competitive classification systems.
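The FOIL gain criterion that CPAR borrows from FOIL can be written down in a few lines. The sketch below uses the standard FOIL gain formula (which we believe matches what CPAR uses, but the function name and interface are ours): for a rule covering p positive and n negative examples, a candidate condition narrowing coverage to (p', n') scores p' · (log2(p'/(p'+n')) − log2(p/(p+n))).

```python
import math

def foil_gain(p, n, p_new, n_new):
    """Standard FOIL gain for adding a condition that narrows a rule's
    coverage from (p, n) positives/negatives to (p_new, n_new)."""
    if p_new == 0:
        return 0.0  # a condition covering no positives gains nothing
    return p_new * (math.log2(p_new / (p_new + n_new))
                    - math.log2(p / (p + n)))

# Adding a condition that keeps 4 of 5 positives and drops all 5 negatives:
gain = foil_gain(p=5, n=5, p_new=4, n_new=0)
# gain = 4 * (log2(1) - log2(0.5)) = 4.0
```

CPAR's expensive step is evaluating this gain for every candidate item to find the best one, as described above.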
Our proposed algorithm can be considered an associative classification technique, since it uses the core concepts of association rule mining (support and confidence) in a classification framework. However, MMAC introduces the idea of extracting rules with multiple labels and has many distinguishing features compared with current associative classification algorithms, which we discuss in more detail in Sect. 5.3.
3 Multi-label classification problems

Most of the research conducted on classification in data mining has been devoted to single-label problems. A traditional classification problem can be defined as follows: let D denote the domain of possible training instances, let Y be a list of class labels, and let H denote the set of classifiers h : D → Y. Each instance d ∈ D is assigned a single class y that belongs to Y. The goal is to find a classifier h ∈ H that maximises the probability that h(d) = y for each test case (d, y). In multi-label problems, however, each instance d ∈ D can be assigned multiple labels y1, y2, . . . , yk for yi ∈ Y, and is represented as a pair (d, (y1, y2, . . . , yk)), where (y1, y2, . . . , yk) is a list of ranked class labels from Y associated with the instance d in the training data.
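The multi-label training pair (d, (y1, y2, . . . , yk)) can be represented directly in code. This is a minimal sketch of the data representation only; the type aliases and the sample values (echoing Table 1's attribute/label names) are ours.

```python
from typing import Dict, List, Tuple

# An instance d is a mapping from attribute names to values; a multi-label
# training example pairs it with a ranked list (y1, y2, ..., yk) of labels.
Instance = Dict[str, str]
MultiLabelExample = Tuple[Instance, List[str]]

example: MultiLabelExample = (
    {"A1": "x1", "A2": "y1"},  # attribute values of instance d
    ["c1", "c2"],              # ranked labels: c1 precedes c2
)
assert example[1][0] == "c1"   # the top-ranked label comes first
```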
4 Classification based on association rules

4.1 Frequent items, support and confidence

Classification is a special case of association rule mining in which the consequent of the rule is the class label attribute. We aim to derive a complete set of class association rules of the form X ⇒ C, where X is an itemset and C is a ranked list of labels. Let us define the classification problem in an association rule framework. Let T be the training data with n attributes A1, A2, . . . , An, and let C be a list of class labels. A particular value for attribute Ai will be denoted ai, and the class labels of C are denoted cj.

Definition 1 An item is the association of an attribute with one of its values, (Ai, ai), or a combination of between 1 and n different attribute values: ⟨(A1, a1)⟩, ⟨(A1, a1), (A2, a2)⟩, . . . , ⟨(A1, a1), (A2, a2), . . . , (An, an)⟩.

Definition 2 A rule r for multi-label classification is represented in the form (Ai1, ai1) ∧ (Ai2, ai2) ∧ · · · ∧ (Aik, aik) → ci1 ∨ ci2 ∨ · · · ∨ cil, where the antecedent of the rule is an item and the consequent is a ranked list of class labels, i.e. ci1 has a higher rank than ci2, ci2 than ci3, and so on.

Definition 3 The actual occurrence (ActOcc) of a rule r in T is the number of cases in T that match r's antecedent.

Definition 4 The support count (SuppCount) of r is the number of cases in T that match r's condition and belong to a class ci of r. When an item is associated with multiple labels, there is a separate SuppCount for each label.
Table 1 Training data

RowId  A1  A2  Class
1      x1  y1  c1
2      x1  y2  c2
3      x1  y1  c2
4      x1  y2  c1
5      x2  y1  c2
6      x2  y1  c1
7      x2  y3  c2
8      x1  y3  c1
9      x3  y1  c1
Definition 5 A rule r passes the minimum support threshold (MinSupp) if SuppCount(r)/|T| ≥ MinSupp, where |T| is the number of instances in T.

Definition 6 A rule r passes the minimum confidence threshold (MinConf) if SuppCount(r)/ActOcc(r) ≥ MinConf.

Definition 7 Any item in T that passes the minimum support threshold is said to be a frequent item.

4.2 Associative classification

The majority of current associative techniques use the step-wise a priori approach of Agrawal and Srikant [2] to find frequent items in a data set. If a frequent item consists of only a single value, it is said to be a frequent single item. Assuming that the integer MinSupp is set to 3, the frequent single items in Table 1 are (A1, x1), (A1, x2) and (A2, y1). The frequent single items are inputs to the process of finding possible frequent pairs of items, the frequent pairs are inputs to the discovery of frequent triples, and so on. Associative classification techniques generate frequent items by making multiple passes over the training data. In the first pass, they count the support of single items and determine whether they are frequent; in each subsequent pass, they start with the items found to be frequent in the previous pass in order to produce new candidate frequent items. After frequent items have been discovered, associative classification methods derive a complete set of class-association rules (CARs) from those frequent items that pass MinConf. These techniques are often called confidence-based methods, since they generate only the class with the highest confidence for each frequent item.

5 MMAC

Our proposed method consists of three phases: rule generation, recursive learning and classification. In the first phase, the method scans the training data to discover and generate a complete set of CARs. In the second phase, MMAC proceeds to discover more rules that pass the MinSupp and MinConf thresholds from the
remaining unclassified instances, until no further frequent items can be found. In the third phase, the rule sets derived during each iteration are merged to form a global multi-label classifier, which is then tested against test data. Algorithm 1 gives a general description of our proposed method, which we explain in more detail below. Training attributes can be categorical, i.e. attributes with a limited number of distinct values, or continuous, i.e. real and integer attributes. For categorical attributes, we assume that all possible values are mapped to a set of positive integers. In order to handle continuous attributes, we must first use a discretisation approach such as that described in [10].

Algorithm 1
Input: Training data, confidence and support (σ) thresholds.
Output: A set of multi-label rules and the classification accuracy
– Phase 1:
  1. Scan the training data T with n columns to discover frequent items.
  2. Produce a rule set by converting every frequent item that passes MinConf into a rule.
  3. Rank the rule set according to (confidence, support, . . . , etc.).
  4. Prune redundant rules from the rule set.
– Phase 2:
  1. Discard the instances Pi covered by the rule set just generated in phase 1.
  2. Generate new training data T ← T − Pi.
  3. Repeat phase 1 on T until no further frequent item is found.
– Phase 3:
  1. Merge the rule sets generated at each iteration to produce a multi-label classifier.
  2. Classify test objects and calculate the error rate using an accuracy measure.

5.1 Building the classifier

5.1.1 Frequent items discovery and rules generation

To increase the efficiency of frequent-items discovery and rule generation, MMAC employs an efficient technique based on an intersection method [31]. The method scans the training data once to count the occurrences of single items, from which it determines those that pass the MinSupp and MinConf thresholds. Frequent single items and their occurrences in the training data (rowIds) are stored in arrays. Then, by intersecting the rowIds of the frequent single items discovered so far, it is easy to obtain the remaining possible frequent items that involve more than one attribute. The rowIds of frequent single items can be used to locate items quickly in the training data in order to obtain support and confidence values for rules involving more than one item. For example, consider frequent single items A and B. If we intersect the sets of rowIds of A and B, the resulting set represents the tuples where A and B occur together in the training data. The classes associated with AB can therefore be easily located, and the support and confidence calculated, to
decide whether AB is a frequent item and a candidate rule for the classifier. Since the training data is scanned only once to discover and generate the rules, this approach is highly effective in terms of runtime and storage, because it does not rely on the traditional a priori approach [2] to discovering frequent items, which requires multiple scans. Once an item has been identified as frequent, MMAC checks whether it passes the MinConf threshold. If the item's confidence is larger than MinConf, it is generated as a candidate rule in the classifier; otherwise, the item is discarded. Thus, all items that survive MinConf become candidate rules in the classifier.

5.1.2 Ranking of rules and pruning

Most current associative classification techniques rank rules mainly by confidence. When several rules have identical confidences or supports, CBA and CMAR randomly choose one of the rules, which may degrade accuracy. In order to ensure that a subset of effective rules forms the classifier, a detailed ranking technique, shown in Definition 8, is presented. The presented ranking method differs from the CBA and CMAR ranking techniques in that it adds more detailed conditions for breaking ties. Pruning takes place by discarding any item that has a support value less than MinSupp or a confidence value less than the MinConf threshold. Further pruning of the rules occurs during rule evaluation, which we discuss in Sect. 5.1.4.

Definition 8 Given two rules, ra and rb, ra precedes rb if:
1. The confidence of ra is greater than that of rb.
2. The confidence values of ra and rb are the same, but the support of ra is greater than that of rb.
3. The confidence, support and ActOcc values of ra and rb are the same, but ra has fewer conditions in its left-hand side (LHS) than rb.
4. All the above criteria are identical for ra and rb, but ra was generated before rb.
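The rowId-intersection technique of Sect. 5.1.1 can be sketched as follows. This is a minimal illustration under our own data layout (dictionaries of attribute values and a class label per row); the helper names are ours, not MMAC's.

```python
from collections import defaultdict

def tid_lists(rows):
    """One scan of the training data: record, for every single item
    (attribute, value), the set of rowIds where it occurs, and for every
    class label the set of rowIds it labels."""
    items, classes = defaultdict(set), defaultdict(set)
    for rid, (attrs, label) in enumerate(rows):
        for attr, val in attrs.items():
            items[(attr, val)].add(rid)
        classes[label].add(rid)
    return items, classes

def rule_stats(item_rows, class_rows):
    """Support count and confidence of item -> class via set intersection,
    with no further scan of the data."""
    supp_count = len(item_rows & class_rows)
    return supp_count, supp_count / len(item_rows)

# Rows in the style of Table 1: ({attribute: value}, class)
rows = [({"A1": "x1", "A2": "y1"}, "c1"),
        ({"A1": "x1", "A2": "y2"}, "c2"),
        ({"A1": "x1", "A2": "y1"}, "c2"),
        ({"A1": "x1", "A2": "y2"}, "c1")]
items, classes = tid_lists(rows)
# rowIds of the pair item <(A1, x1), (A2, y1)> obtained by intersection:
ab = items[("A1", "x1")] & items[("A2", "y1")]
supp, conf = rule_stats(ab, classes["c1"])
```

Intersecting stored rowId sets replaces the repeated data scans an a priori style method would need for each candidate item.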
5.1.3 Recursive learning

For given training instances D, associative classification algorithms such as CBA and CPAR derive a single-label rule set and form a default class for the remaining unclassified instances in D. MMAC, on the other hand, derives more than one rule set and merges them to form a multi-label classifier. For D, the proposed method produces a first-pass rule set in which each rule is associated with the most obvious class label. Once this initial set is generated, all training instances covered by it are discarded and the remaining unclassified instances become a new training data set, D′. The MMAC algorithm checks whether there are still frequent items remaining undiscovered in D′ (rules derived from D′ may be associated with more than one class label). If so, a new set of rules is generated from D′, and the remaining unclassified instances in D′ form new
training data, and so on. The algorithm proceeds with learning until no more frequent items are discovered in a pass. At that stage, any remaining unclassified instances form a default class. This process results in learning from several subsets of the original training data and generating several rule sets.

Consider, for example, the training data shown in Table 1. Assume that the integer MinSupp and MinConf have been set to 2 and 0.40, respectively. At the first iteration, MMAC derives a set of rules, [(A1, x2) → c2, (A1, x1) → c1, (A2, y1) → c1], that covers the instances not shown in bold in Table 1, which are discarded at the end of the iteration. The remaining unclassified instances, shown in bold, represent the new training data for iteration two, in which one more rule, [(A1, x1) → c2], is learned to form the second rule set. When the learning process is finished, the rule sets produced at iterations 1 and 2 are merged to obtain a global multi-label classifier. In many cases, a rule will be present in more than one rule set, associated with different class labels, like item (A1, x1), which has two representations in Table 1: one with class label "c1" in rule set 1, and one with class label "c2" in rule set 2. A good question is how one can rank the class labels in such a rule.
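The control flow of the recursive phase, learn a rule set, discard the instances it covers, repeat on the remainder, can be sketched generically. This is a structural sketch only: the `learn_rules` and `covered_by` callables stand in for MMAC's actual rule discovery and coverage test, and the toy usage below is ours.

```python
def recursive_learn(train, learn_rules, covered_by):
    """MMAC-style recursive phase (a sketch): learn a rule set, drop the
    instances it covers, and repeat on the remainder until a pass yields
    no rules; any leftover instances would form a default class."""
    rule_sets, remaining = [], list(train)
    while remaining:
        rules = learn_rules(remaining)
        if not rules:
            break
        rule_sets.append(rules)
        remaining = [t for t in remaining if not covered_by(rules, t)]
    return rule_sets, remaining

# Toy usage: a "rule" is just the class of the first remaining instance.
toy = [("a", "c1"), ("b", "c2"), ("c", "c1")]
rule_sets, leftover = recursive_learn(
    toy,
    learn_rules=lambda data: [data[0][1]] if data else [],
    covered_by=lambda rules, t: t[1] in rules,
)
```

Merging `rule_sets` afterwards is what produces the multi-label classifier described above.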
5.1.4 Rules evaluation

During each iteration, a rule r is said to be significant if and only if it covers at least one training instance. After a set of rules is generated and ranked, an evaluation step takes place to test each rule in order to remove redundant ones. If a rule correctly classifies at least a single instance, it is marked as a survivor and a good candidate rule, and all instances correctly classified by it are deleted from the training data. If a rule does not classify any training instance, it is removed from the rule set. This process of pruning redundant and noisy rules from the set of produced rules minimises the chance that low-confidence rules will play any role in the later classification process. Furthermore, rules that have large frequencies and high confidences are preferred by the MMAC rule-ranking method.
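This coverage-based evaluation step can be sketched as follows. It is a minimal illustration of the general technique (keep a ranked rule only if it correctly classifies some still-uncovered instance, then remove the instances it classified); the interfaces are ours.

```python
def prune_rules(ranked_rules, train, matches, correct):
    """Coverage-based rule evaluation (a sketch): a ranked rule survives
    only if it correctly classifies at least one not-yet-covered training
    instance; the instances it classifies are then deleted."""
    survivors, remaining = [], list(train)
    for rule in ranked_rules:
        hits = [t for t in remaining if matches(rule, t) and correct(rule, t)]
        if hits:
            survivors.append(rule)
            remaining = [t for t in remaining if t not in hits]
    return survivors

# Rules and instances as (attribute value, class) pairs; the middle rule
# correctly classifies nothing left after the first, so it is pruned.
rules = [("x1", "c1"), ("x1", "c2"), ("y1", "c1")]
train = [("x1", "c1"), ("y1", "c1")]
kept = prune_rules(rules, train,
                   matches=lambda r, t: r[0] == t[0],
                   correct=lambda r, t: r[1] == t[1])
```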
5.1.5 Ranking of class labels

A class label l1 precedes l2 in a multi-label rule r, written l1 ≺ l2, if l1 has a larger representation than l2 in the training data. Consider, for example, an item (A, a), (B, b) that is associated with three labels ("c1", "c2", "c3"). Assume that it appears 100 times in the training data, associated with labels "c1", "c2" and "c3" 50, 30 and 20 times, respectively. Moreover, assume that this item passes the MinSupp and MinConf thresholds when associated with each of "c1", "c2" and "c3". MMAC ranks these labels based on their numbers of occurrences ("c1" ≺ "c2" ≺ "c3"), and a rule is produced for this item in the following form: (A, a), (B, b) → c1 ∨ c2 ∨ c3.
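The label-ranking rule above is just a sort by occurrence count over the labels that survive the support threshold. A minimal sketch (function name and threshold interface are ours):

```python
def rank_labels(label_counts, min_supp_count):
    """Order the labels of one item by their occurrence counts in the
    training data, keeping only labels that reach the support threshold."""
    kept = {l: c for l, c in label_counts.items() if c >= min_supp_count}
    return sorted(kept, key=kept.get, reverse=True)

# The item <(A, a), (B, b)> from the example above: 50, 30 and 20
# occurrences with c1, c2 and c3, out of 100 in total.
ranked = rank_labels({"c1": 50, "c2": 30, "c3": 20}, min_supp_count=10)
# The rule then reads: (A, a), (B, b) -> c1 v c2 v c3
```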
Table 2 CBA classifier

RowId  FrequentItem  Supp  Conf  Class
1      Default       –     –     c1

Table 3 MMAC classifier

RowId  FrequentItem  Supp  Conf  Class
1      x2            2/10  2/3   c2
2a     x1            3/10  3/5   c1
2b     x1            2/10  2/5   c2
3      y1            3/10  3/5   c1
5.2 Classification

In classification, let R be the set of generated rules and T the training data. The basic idea of the proposed method is to choose a set of high-confidence rules in R to cover T. When classifying a test object, the first rule in the set of rules that matches the test object's condition classifies it. This process ensures that only the highest-ranked rules classify test objects.

5.3 Comparison of MMAC and CBA

CBA and MMAC were applied to the training data shown in Table 1, using a MinSupp integer of 2 and MinConf of 0.40, to illustrate the effectiveness of the rule sets derived by both algorithms. Table 2 shows the classifier generated by CBA, which consists of one rule and covers five training instances: 1, 4, 6, 8 and 9. The remaining four instances, which make up 0.36 of the entire training data, are left unclassified. The columns "Supp" and "Conf" in Tables 2 and 3 stand for support and confidence values. Table 3 shows the classifier produced by MMAC on the same training data, in which more rules have been discovered, i.e. three more than in the CBA classifier. In this particular example, only one rule derived by our proposed algorithm from Table 1 has multiple labels, namely (A1, x1) → c1 ∨ c2. The MMAC classifier covers all training instances and leaves none unclassified. Unlike the CBA algorithm, which was unable to produce rules with multiple labels, our proposed method generates rules that can predict multiple labels. Moreover, the CBA classifier contains only a default rule, which therefore dominates the classification of unseen data and could lead to deterioration in the overall classification accuracy.
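The classification step of Sect. 5.2, predict with the first (highest-ranked) matching rule, can be sketched as follows; the rule representation and the fall-back to a default label are our illustrative choices.

```python
def classify(ranked_rules, instance, matches, default_label=None):
    """Predict with the first rule in the ranked list whose antecedent
    matches the test instance; otherwise fall back to the default label."""
    for rule in ranked_rules:
        if matches(rule, instance):
            return rule["labels"]  # ranked list: top label first
    return [default_label]

# Two rules in ranked order; a rule matches if all its item's
# attribute/value pairs appear in the instance.
rules = [{"item": {"A1": "x1"}, "labels": ["c1", "c2"]},
         {"item": {"A2": "y1"}, "labels": ["c1"]}]
match = lambda r, inst: all(inst.get(a) == v for a, v in r["item"].items())
prediction = classify(rules, {"A1": "x1", "A2": "y3"}, match)
```

Because the list is ranked by Definition 8, scanning top-down guarantees the highest-ranked matching rule wins.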
Generally, the main differences between MMAC and other associative algorithms are the following:
– MMAC produces not only a single-class classifier but also a multi-label one, where each instance is associated with its ranked list of classes.
– Other associative classification techniques often use multiple passes to discover frequent items. MMAC uses an efficient rule-discovery technique that requires only one scan of the data.
– MMAC introduces a detailed rule-ranking technique that minimises randomisation when a choice point among two or more rules occurs in the rule-ranking process.
– The proposed method has a recursive learning phase that discovers more rules, aiming to minimise the role of the default class in classifying test objects.

6 Evaluation measures

Multi-label classification has been investigated mostly in text categorisation [15, 28], and very little work has been conducted on developing evaluation measures for its classifiers. There are no standard evaluation techniques applicable to multi-label classification problems. Moreover, the right measure is often problematic and depends heavily on the features of the considered problem, such as those used in [5]. In this section, we introduce four evaluation measures suitable for the majority of single- and multi-label classification problems.

First we define the mathematical notation we will use in our definitions. Let D denote the test data set with m rows d1, d2, . . . , dm, and let C denote the set of classes. Row d has class c(d) in the test data set. A (possibly multi-label) classifier is a multifunction h : D → 2^C, where for d ∈ D, h(d) = (h1(d), h2(d), . . . , hk(d)(d)). The frequencies (SuppCount) of h1(d), h2(d), . . . , hk(d)(d) in the training data are f1(d), f2(d), . . . , fk(d)(d), respectively.

6.1 Top-label

This evaluation measure takes into consideration only the top-ranked class label and ignores any other labels associated with an instance. For traditional classification tasks, where there is only one class label to assign to the test object, and given an instance and its associated class, the top-label measure estimates how often the top-ranked class label is the correct class label. The top-label measure is

    |{d ∈ D : h1(d) = c(d)}| / m.
6.2 Any-label

This evaluation technique measures how often any of the predicted labels of an instance matches the actual class label of that instance in the test data. If any of the predicted class labels of an instance d matches the true class label, the classification is correct. The any-label measure is

    |{d ∈ D : ∃i such that hi(d) = c(d)}| / m.

6.3 Label-weight

This technique enables each predicted label of an instance to play a role in classifying a test case according to its ranking, and therefore it can be considered a multi-label evaluation measure. An instance may belong to several class
labels, each associated with it by a number of occurrences in the training data. Each class label is assigned a weight according to how many times that label has been associated with the instance. The label-weight measure is

    Σ_{d ∈ D} Σ_{i : hi(d) = c(d)} ( fi(d) / Σ_{j=1}^{k(d)} fj(d) ).

For example, suppose an item (A, a) is associated with class labels "c1", "c2" and "c3" seven, five and three times, respectively, in the training data. Each class label is assigned a weight: 7/15, 5/15 and 3/15, respectively, for labels "c1", "c2" and "c3". This technique assigns the predicted class label's weight to the case if the predicted class label matches the case's class label. For instance, if label "c2" of item (A, a) matches a case in the test data that has "c2" as its class, the case is considered a hit and is credited 5/15.
6.4 Support-weight

This evaluation technique gives the top-ranked label the maximum weight, and gives each of the remaining labels a weight equal to the number of times that label is associated with the instance divided by the number of times the top-ranked label is associated with it. This means that the top-ranked label is considered the fittest, and each of the rest is assigned a weight less than the top-ranked class weight. Formally, the support-weight measure is $\sum_{d \in D} \sum_{\{i : h_i(d) = c(d)\}} \bigl( f_i(d) / f_1(d) \bigr) / m$. For the example presented in the last section, the support-weight technique assigns weights 7/7, 5/7 and 3/7, respectively, to “c1”, “c2” and “c3”.
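The support-weight measure, under the same illustrative layout, divides each matching label's frequency by the top-ranked label's frequency:

```python
def support_weight_accuracy(cases):
    """Each matching label contributes f_i / f_1, its frequency
    relative to the top-ranked label's frequency (so a match on the
    top-ranked label itself weighs exactly 1)."""
    total = 0.0
    for true, preds in cases:
        top_freq = preds[0][1]  # frequency of the top-ranked label
        total += sum(freq / top_freq for label, freq in preds if label == true)
    return total / len(cases)

# Same example: the weights are 7/7, 5/7 and 3/7 for c1, c2 and c3,
# so a test case with true class c3 scores 3/7.
preds = [("c1", 7), ("c2", 5), ("c3", 3)]
print(support_weight_accuracy([("c3", preds)]))  # 3/7 ≈ 0.4286
```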
7 Experimental results

We evaluated our approach on 19 different data sets from the UCI data collection [23], as well as on different data sets for forecasting the behaviour of an optimisation heuristic within a hyperheuristic framework [8, 31]. Tenfold cross-validation was used to derive the classifiers and error rates in the experiments [32]. Cross-validation is a standard technique in machine learning for estimating the error rate of a learning method. Three popular classification techniques were compared to MMAC in terms of classification accuracy (PART [11], RIPPER [7], CBA [22]) in order to evaluate the predictive power of the proposed method. These learning methods were chosen because of the different strategies they use to generate rules. Since CBA, PART and RIPPER are suitable only for traditional classification problems, the classification accuracy derived by the top-label evaluation measure alone was used for a fair comparison. All experiments were conducted on a Pentium IV 1.6 GHz PC under Windows XP. The experiments for PART and RIPPER were conducted using the Weka software system [36]. Weka (Waikato Environment for Knowledge Analysis) is an open-source Java toolkit for the machine learning community that includes implementations of several data-mining tasks such as classification, clustering, association rule mining and regression. CBA experiments were conducted using the VC++ implementation provided with CBA [35], and MMAC was implemented in Java.
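The tenfold protocol can be sketched as follows (an illustrative partitioning only; the paper does not specify shuffling or stratification, and the function name is ours):

```python
def tenfold_indices(n, folds=10):
    """Yield (train, test) index lists for k-fold cross-validation:
    each instance appears in exactly one test fold."""
    idx = list(range(n))
    for k in range(folds):
        test = idx[k::folds]                      # every folds-th instance, offset k
        train = [i for i in idx if i % folds != k]
        yield train, test

# 100 instances -> 10 disjoint test folds of 10 instances each.
sizes = [len(test) for _, test in tenfold_indices(100)]
print(sizes)  # ten folds of 10 test instances each
```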
Many studies have shown that the support threshold plays a major role in the overall classification accuracy of the set of rules produced by existing associative classification techniques [17, 21, 22]. Moreover, the support value has a large impact on the number of rules in the classifier, as well as on the processing time and storage needed during rule discovery. From our experiments, we noticed that support thresholds ranging between 0.02 and 0.05 usually achieve the best balance between accuracy and the size of the resulting classifiers. Moreover, the classifiers derived when the support was set to 0.02 or 0.03 achieved high accuracy, often better than those of the decision-tree rule learner (PART), RIPPER and CBA. For this reason, the MinSupp value was set to 0.03 in the experiments. The confidence threshold, on the other hand, is less critical: unlike the support value, it does not have a large effect on the behaviour of associative classification methods, and it was therefore set to 0.30.

7.1 Binary and traditional data results

We evaluated 19 selected data sets from [23], six of which were reduced by ignoring their integer and/or real attributes. Several tests using tenfold cross-validation were performed to ensure that the removal of real/integer attributes from these data sets does not significantly affect classification accuracy. As a result, we only considered data sets where the error rate was not more than six times worse than the error rate obtained on the same data set before the removal of any real/integer attributes. Table 4 shows the classification accuracy of the classifiers extracted by PART, RIPPER, CBA and MMAC on the 19 benchmark problems. The accuracy of MMAC has been derived using the top-label evaluation measure. Our algorithm outperforms the rule-learning methods in terms of accuracy rate.

Table 4 Classification accuracy of PART, RIPPER, CBA, and MMAC

Data set         PART    RIPPER  CBA     MMAC
Tictac           92.58   97.54   98.60   99.29
Contact-lenses   83.33   75.00   66.67   79.69
Led7             73.56   69.34   72.39   73.20
Breast-cancer    71.32   70.97   68.18   72.10
Weather          57.14   64.28   85.00   71.66
Heart-c          81.18   79.53   78.54   81.51
Heart-s          78.57   78.23   71.20   82.45
Lymph            76.35   77.70   74.43   82.20
Mushroom         99.81   99.90   98.92   99.78
Primary-tumor    39.52   36.26   36.49   43.92
Vote             87.81   87.35   87.39   89.21
Crx              84.92   84.92   86.75   86.47
Sick             93.90   93.84   93.88   93.78
Balance-scale    77.28   71.68   74.58   86.10
Autos            61.64   56.09   35.79   67.47
Breast-w         93.84   95.42   94.68   97.26
Hypothyroid      92.28   92.28   92.29   92.23
Zoo              91.08   85.14   83.18   96.15
Kr-vs-kp         71.93   70.24   42.95   68.75
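As a sanity check, the won–loss–tied records reported in the text can be recomputed directly from the accuracy columns of Table 4 (a small sketch; the helper name is ours):

```python
# Accuracy columns of Table 4, in the row order shown above.
part   = [92.58, 83.33, 73.56, 71.32, 57.14, 81.18, 78.57, 76.35, 99.81,
          39.52, 87.81, 84.92, 93.90, 77.28, 61.64, 93.84, 92.28, 91.08, 71.93]
ripper = [97.54, 75.00, 69.34, 70.97, 64.28, 79.53, 78.23, 77.70, 99.90,
          36.26, 87.35, 84.92, 93.84, 71.68, 56.09, 95.42, 92.28, 85.14, 70.24]
cba    = [98.60, 66.67, 72.39, 68.18, 85.00, 78.54, 71.20, 74.43, 98.92,
          36.49, 87.39, 86.75, 93.88, 74.58, 35.79, 94.68, 92.29, 83.18, 42.95]
mmac   = [99.29, 79.69, 73.20, 72.10, 71.66, 81.51, 82.45, 82.20, 99.78,
          43.92, 89.21, 86.47, 93.78, 86.10, 67.47, 97.26, 92.23, 96.15, 68.75]

def won_loss_tied(a, b):
    """Count data sets where classifier a beats, loses to, or ties b."""
    won  = sum(1 for x, y in zip(a, b) if x > y)
    lost = sum(1 for x, y in zip(a, b) if x < y)
    return won, lost, len(a) - won - lost

for name, rival in [("PART", part), ("RIPPER", ripper), ("CBA", cba)]:
    print(name, won_loss_tied(mmac, rival))
# PART (13, 6, 0), RIPPER (15, 4, 0), CBA (15, 4, 0)
```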
Fig. 1 Runtime of MMAC, CBA, PART and RIPPER on the 19 UCI data sets
The won–loss–tied records of MMAC against PART, RIPPER and CBA are 13–6–0, 15–4–0 and 15–4–0, respectively. The results shown in Table 4 indicate that our proposed algorithm outperforms CBA in terms of error rate. One of the principal reasons for this appears to be that MMAC often generates a few more rules than CBA. The increase in accuracy suggests that this is not simply overfitting, and it would likely justify the slightly larger classifier produced by MMAC over CBA in applications. However, in some cases, such as the “Crx” data set, the number of rules is large, even though every rule represents at least one training object. Thus, a post-pruning method such as pessimistic error pruning [27] may be useful in such cases. Figure 1 shows the runtime of MMAC, CBA, PART and RIPPER for the 19 UCI data sets. It indicates that the associative techniques MMAC and CBA are slower than the traditional techniques PART and RIPPER. In particular, CBA adopts the Apriori approach, which is resource- and time-consuming, making CBA the slowest algorithm for building classifiers, while PART is the fastest at constructing a classification system due to its simple rule-discovery strategy. The figures also suggest that our proposed algorithm is typically slower than traditional classification techniques such as RIPPER and PART; however, MMAC compares well in terms of processing time with the CBA algorithm.

7.2 Multi-label scheduling data results

Details of several solution runs generated by a hybrid hyperheuristic, named Peckish, for solving a complex scheduling problem are provided by Cowling and Chakhlevitch [8] and Thabtah et al. [31]. The hyperheuristic is a robust, general approach that tends to find good, rather than optimal, solutions for large and complex scheduling and optimisation problems. One can consider a hyperheuristic
Table 5 Features of the hyperheuristic solutions

Data set num.   Num. of instances   Num. of classes
1               698                 10
2               2257                10
3               182                 10
4               152                 10
5               1424                10
6               1374                10
7               133                 10
8               131                 10
9               130                 10
10              127                 10
approach as a supervisor that manages the choice of which low-level heuristic to select from those available for the problem during the process of building a schedule. The Peckish hyperheuristic often selects the low-level heuristic that leads to the maximum improvement in the objective function (if one exists). If none of the existing low-level heuristics improves the objective function, the Peckish hyperheuristic selects one at random. A low-level heuristic is a simple rule or method that yields a small change in the schedule. Often these low-level heuristics are based on human methods of constructing the schedule, such as adding an event, deleting an event or swapping two events. We used nine solution runs, each containing 500 iterations of applied low-level heuristics. In addition, ten different low-level heuristics (LLH 1, LLH 2, LLH 20, LLH 27, LLH 37, LLH 43, LLH 47, LLH 58, LLH 68 and LLH 74) were used to generate each solution. Each solution consists of six different attributes and a number of instances that improve the objective function of the optimisation problem. Table 5 lists the features of the data sets used in the experiments. Table 6 shows part of a solution run generated by the Peckish hyperheuristic, where columns LLH2 and LLH1 give the low-level heuristics applied at the previous two iterations. Column LLH gives the current low-level heuristic that improved the objective function, and column Imp gives the improvement in the objective function value. Finally, column Apply indicates whether or not the selected low-level heuristic was applied by the hyperheuristic. The data generated by the hyperheuristic is multi-label, since at each
Table 6 Sample of a scheduling solution run

Iteration   LLH2   LLH1   LLH   Imp   Apply
1           1      1      2     987   1
1           1      1      43    2     0
1           1      1      74    2     0
2           1      2      1     92    1
3           2      1      2     982   1
3           2      1      58    8     0
iteration there could be more than one low-level heuristic that improves the objective function. For example, at the first iteration in Table 6, there are three low-level heuristics (LLH 2, LLH 43 and LLH 74) that improve the objective function. Thus, three class labels are associated with the instance (1, 1). Generally, each training instance in the scheduling solutions may be associated with at least one class label. The evaluation measures presented with MMAC have been compared on the nine solution runs produced by the Peckish hyperheuristic with regard to accuracy and rule features. Figures 2 and 3 show the relative prediction accuracy, i.e. the difference in classification accuracy of the MMAC evaluation measures with respect to those derived by CBA and PART. The
Fig. 2 Difference in accuracy between MMAC evaluation measures and CBA algorithm
Fig. 3 Difference in accuracy between MMAC evaluation measures and PART algorithm
relative prediction accuracy numbers shown are calculated using the formula $(\mathrm{Accuracy}_{\mathrm{MMAC}} - \mathrm{Accuracy}_{\mathrm{PART}}) / \mathrm{Accuracy}_{\mathrm{PART}}$ for PART and $(\mathrm{Accuracy}_{\mathrm{MMAC}} - \mathrm{Accuracy}_{\mathrm{CBA}}) / \mathrm{Accuracy}_{\mathrm{CBA}}$ for CBA. After analysing the charts, we found that there is a consistency between the top-label and label-weight measures, since in most cases both of them consider only one class in the prediction: the top-label takes into account the top-ranked class, and the label-weight considers only the weight of the predicted class that matches the test case. Thus, both of these evaluation measures are applicable to traditional single-class classification problems. On the other hand, the any-label measure considers any class in the set of predicted classes a hit whenever it matches the actual class, regardless of its weight or rank. Furthermore, the support-weight measure balances between the any-label and top-label measures, because it considers the top class the fittest on one hand and assigns each of the remaining classes a weight on the other. It should be noted that the relative accuracy of the MMAC evaluation methods on data set number 8 in Figs. 2 and 3 is negative, since CBA and PART achieved a higher classification rate on this particular data set. An investigation of the features of the rules derived from the scheduling data runs has been carried out. Figure 4 shows the number of rules extracted from each data set, categorised by the number of class labels. Unlike most associative techniques, our proposed algorithm is able to extract rules that are associated with up to four labels. This is one of the principal reasons for the improved classification accuracy within applications. Figure 4 also demonstrates that the majority of the rules created from each scheduling run are associated with one or two labels. It turns out that this is due to the fact that, during each iteration, often only one or two low-level heuristics improve upon the objective function in the scheduling problem.
Thus, each training instance often corresponds to just one or two class labels.

Fig. 4 Distribution of the rules with regard to their labels

Fig. 5 Distribution of the training instances for one scheduling run with regard to the number of labels

Figure 5 shows the distribution of the training instances with respect to the number of class labels for one scheduling data set, where instances correlated with one class label dominate the rest. Finally, analysis of the rule sets also indicates that MMAC derives more rules than PART and CBA for the majority of the data sets. One principal reason for extracting more rules is the recursive learning phase that MMAC employs, which discovers hidden information that most associative classification techniques discard, as they extract only the highest-confidence rule for each frequent item.

8 Conclusions

A new approach for discovering multi-class, multi-label classification rules has been proposed that has many distinguishing features over traditional and associative classification methods: (1) it produces classifiers that contain rules with multiple labels; (2) it presents four evaluation measures for determining accuracy that are applicable to a wide range of applications; (3) it employs an efficient method for discovering rules that requires only one scan over the training data; (4) it employs a detailed ranking method, which prunes redundant rules and ensures that only effective ones are used for classification. Performance studies on 19 data sets from the UCI data collection and nine hyperheuristic scheduling runs indicate that our proposed approach is effective, consistent and achieves a higher classification rate than the PART, CBA and RIPPER algorithms. In addition, the proposed algorithm is able to extract rules with up to four labels from the scheduling data sets, which results in a higher classification accuracy for test instances. In further work, we anticipate extending the method to treat continuous data and creating a hyperheuristic approach that learns “on-the-fly” which low-level heuristic method is the most effective. Acknowledgements We wish to acknowledge Andrew Barnes for his help in cleaning up the data sets from noise. Moreover, special thanks to Mr.
Suhail Hamoud for his help in preparing the scheduling data sets and K. Chakhlevitch for supplying us with the original versions of the scheduling data sets.
References

1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds) Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216. Washington, DC (1993)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499 (1994)
3. Ali, K., Manganaris, S., Srikant, R.: Partial classification using association rules. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 115–118. AAAI (1997)
4. Antonie, M., Zaiane, O.: An associative classifier based on positive and negative rules. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 64–69. Paris, France (2004)
5. Boutell, M., Shen, X., Luo, J., Brown, C.: Multi-label semantic scene classification. Technical Report 813, Department of Computer Science, University of Rochester, Rochester, NY 14627 & Electronic Imaging Products R&D, Eastman Kodak Company (2003)
6. Clare, A., King, R.: Knowledge discovery in multi-label phenotype data. In: De Raedt, L., Siebes, A. (eds) Proceedings of PKDD 2001, Lecture Notes in Artificial Intelligence, vol. 2168, pp. 42–53. Springer-Verlag, Berlin Heidelberg New York (2001)
7. Cohen, W.: Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning, pp. 115–123. Morgan Kaufmann, San Mateo (1995)
8. Cowling, P., Chakhlevitch, K.: Hyperheuristics for managing a large collection of low level heuristics to schedule personnel. In: Proceedings of the 2003 IEEE Congress on Evolutionary Computation, pp. 1214–1221. Canberra, Australia (2003)
9. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience, New York (2001)
10. Fayyad, U., Irani, K.: Multi-interval discretisation of continuous-valued attributes for classification learning. In: Proceedings of IJCAI-93, pp. 1022–1027 (1993)
11. Frank, E., Witten, I.: Generating accurate rule sets without global optimisation. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 144–151. Morgan Kaufmann, Madison, Wisconsin (1998)
12. Furnkranz, J.: Separate-and-conquer rule learning. Technical Report TR-96-25, Austrian Research Institute for Artificial Intelligence, Vienna (1996)
13. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12. Dallas, TX (2000)
14. Hipp, J., Guntzer, U., Nakhaeizadeh, G.: Algorithms for association rule mining – a general survey and comparison. ACM SIGKDD Explorations, pp. 58–64 (2000)
15. Joachims, T.: Text categorisation with support vector machines: learning with many relevant features. In: Proceedings of the Tenth European Conference on Machine Learning, pp. 137–142 (1998)
16. Li, W.: Classification based on multiple association rules. M.Sc. Thesis, Simon Fraser University (2001)
17. Li, W., Han, J., Pei, J.: CMAR: accurate and efficient classification based on multiple class-association rules. In: Proceedings of ICDM 2001, pp. 369–376. San Jose, CA (2001)
18. Li, J., Shen, H., Topor, R.: Mining optimal class association rule set. In: Proceedings of the 5th Pacific–Asia Conference on Methodologies for Knowledge Discovery and Data Mining (PAKDD 2001), pp. 364–375 (2001)
19. Li, J., Topor, R., Shen, H.: Construct robust rule sets for classification. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 564–569. Edmonton, Alberta, Canada (2002)
20. Lim, T., Loh, W., Shih, Y.: A comparison of prediction accuracy, complexity and training time of thirty-three old and new classification algorithms. Machine Learn. 40, 203–228 (2000)
21. Liu, B., Hsu, W., Ma, Y.: Mining association rules with multiple minimum supports. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341 (1999)
22. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proceedings of KDD ’98, pp. 80–86. New York, NY (1998)
23. Merz, C.J., Murphy, P.M.: UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA (1996)
24. Park, J., Chen, M., Yu, P.: An effective hash-based algorithm for mining association rules. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175–186. San Jose, CA (1995)
25. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
26. Quinlan, J., Cameron-Jones, R.: FOIL: a midterm report. In: Proceedings of the 1993 European Conference on Machine Learning, pp. 3–20. Vienna, Austria (1993)
27. Quinlan, J.: Generating production rules from decision trees. In: Proceedings of the 10th International Joint Conference on Artificial Intelligence, pp. 304–307. Morgan Kaufmann, San Francisco (1987)
28. Schapire, R., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Machine Learn. 39(2/3), 135–168 (2000)
29. Tan, P., Kumar, V.: Interestingness measures for association patterns: a perspective. In: Proceedings of the Workshop on Postprocessing in Machine Learning and Data Mining, in conjunction with the Seventh ACM SIGKDD (2000)
30. Thabtah, F., Cowling, P., Peng, Y.: MCAR: multi-class classification based on association rule. In: Proceedings of the 3rd ACS/IEEE International Conference on Computer Systems and Applications, pp. 1–8. Cairo, Egypt (2005)
31. Thabtah, F., Cowling, P., Peng, Y.: Comparison of classification techniques for a personnel scheduling problem. In: Proceedings of the 2004 International Business Information Management Conference, pp. 207–213. Amman, Jordan (2004)
32. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
33. Yang, Y.: An evaluation of statistical approaches to text categorisation. Technical Report CMU-CS-97-127, Carnegie Mellon University (1997)
34. Yin, X., Han, J.: CPAR: classification based on predictive association rules. In: Proceedings of SDM 2003, pp. 369–376. San Francisco, CA (2003)
35. CBA: http://www.comp.nus.edu.sg/dm2 (1998)
36. Weka: data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka (2001)
Fadi Abdeljaber Thabtah received a B.S. degree in Computer Science from Philadelphia University, Jordan, in 1997 and an M.S. degree in Computer Science from California State University, USA, in 2001. From 1996 to 2001, he worked as a professional in database programming and administration at United Insurance Ltd. in Amman. In 2002, he started his academic career and joined Philadelphia University as a lecturer. He is currently a final-year graduate student at the Department of Computer Science, Bradford University, UK. He has published about seven scientific papers in the areas of data mining and machine learning. His research interests include machine learning, data mining, artificial intelligence and object-oriented databases.
Peter Cowling is a Professor of Computing at the University of Bradford. He obtained M.A. and D.Phil. degrees from the University of Oxford. He leads the Modelling Optimisation Scheduling And Intelligent Control (MOSAIC) research centre (http://mosaic.ac), whose main research interests lie in the investigation and development of new modelling, optimisation, control and decision support technologies, which bridge the gap between theory and practice. Applications include production and personnel scheduling, intelligent game agents and data mining. He has published over 40 scientific papers in these areas and is active as a consultant to industry.
Yonghong Peng’s research areas include machine learning and data mining, and bioinformatics. He has published more than 35 scientific papers in related areas. Dr. Peng is a member of the IEEE and Computer Society, and has been a member of the programme committee of several conferences and workshops. Dr. Peng referees papers for several journals including the IEEE Trans. on Systems, Man and Cybernetics (part C), IEEE Trans. on Evolutionary Computation, Journal of Fuzzy Sets and Systems, Journal of Bioinformatics, and Journal of Data Mining and Knowledge Discovery, and is refereeing papers for several conferences.