Journal of the Operational Research Society (2003) 54, 790–797
Exploration of a hybrid feature selection algorithm

K-M Osei-Bryson*, K Giles and B Kositanurit

Department of Information Systems, Virginia Commonwealth University, Richmond, VA, USA

*Correspondence: K-M Osei-Bryson, Department of Information Systems, The Information Systems Research Institute, Virginia Commonwealth University, Richmond, VA 23284, USA. E-mail: [email protected]

In the knowledge discovery process, classification algorithms are often used to build models from training data that can be used to predict the classes of untested data instances. While there are several factors involved with classification algorithms that can influence classification results, such as the node-splitting measures used in making decision trees, feature selection is often used as a pre-classification step when using large data sets to help eliminate irrelevant or redundant attributes, in order to increase computational efficiency and possibly classification accuracy. One important factor common to both feature selection and classification using decision trees is attribute discretization, which is the process of dividing attribute values into a smaller number of discrete values. In this paper, we present and explore a new hybrid approach, ChiBlur, which involves the use of concepts from both the blurring and χ²-based approaches to feature selection, as well as concepts from multi-objective optimization. We compare this new algorithm with algorithms based on the blurring and χ²-based approaches.

Journal of the Operational Research Society (2003) 54, 790–797. doi:10.1057/palgrave.jors.2601565

Keywords: feature selection; discretization; classification; multi-objective optimization; decision tree
Introduction

Decision tree induction is one of the most commonly used data mining techniques for classification purposes. The induction of the decision tree is carried out through a supervised knowledge discovery process in which prior knowledge regarding classes in the database is used to guide the discovery. The organizational data mining process can be considered to have five major phases: (1) Determination of Business Objectives; (2) Data Preparation; (3) Data Mining; (4) Analysis of Results; and (5) Assimilation of Knowledge. Within this context, two important components of the Data Preparation phase are: (1) feature (or attribute) selection, which aims to identify the subset of attributes that is necessary and sufficient to identify the target class; and (2) attribute discretization, which aims to partition the given predictor attribute's values into a number of intervals in a manner that minimizes the loss of interdependency between the given predictor attribute and the class attribute. The two components are closely related, and many feature selection algorithms assume that the given predictor attribute has already been discretized.

In this paper, we focus on feature selection, an important aspect of data preparation because the presence of irrelevant attributes can make it impossible to identify significant relationships in the data, can compromise the accuracy of the decision tree, and can also increase the processing time for decision tree induction. Feature selection is an important component of procedures for addressing various real-world problems. For example, feature selection has been used in statistical process control (SPC) procedures for addressing quality problems,1 an automatic content-based trademark retrieval method,2 data mining approaches to predict corporate failure,3 and web document categorization.4 For these applications,1–4 the major benefits of feature selection were the reduction in the dimensionality of the problem and the identification of the most relevant features, but feature selection also contributed to improved classification performance of the classifier for the corporate failure application.3

Various approaches have been presented by different researchers for selecting the most appropriate subset of attributes. One such approach, called blurring, was recently presented by Piramuthu,5 while another approach is the χ² family.6,7 In this paper, we will present and explore a new hybrid approach, ChiBlur, which involves the use of concepts from both the blurring and χ²-based approaches to feature selection, as well as concepts from multi-objective optimization. It should be noted that this approach is applicable to data sets that do not include categorical predictor attributes. It should also be noted that the hybrid approach presented here could be generalized to accommodate attribute discretization methods other than those from the χ² family, and feature selection methods other than blurring and those from the χ² family. For example, an LP-based attribute discretization method (eg Bryson and Joseph8) could be used for discretization. Similarly, a rough set heuristic (eg Zhong et al9) could be used for feature selection. Such an approach would also fit into the multi-objective framework of our hybrid approach that involves
the exploration of multiple non-dominated combinations of attribute discretizations and subsets of the predictor attributes. In this paper, we will compare this new hybrid algorithm with the blurring and χ²-based approaches, primarily because they provided a large part of the motivation for the development of this hybrid algorithm.
Overview on feature selection

The primary objective of feature selection is the selection of a subset of relevant features (ie predictor attributes) that would give the highest possible class-prediction accuracy when an induction algorithm is applied to the data set containing only the selected features.10 Feature selection algorithms are generally categorized into two types: (a) the wrapper model, which involves applying the induction learning algorithm to different subsets of the features and selecting the subset that provides the best predictive accuracy; and (b) the filter model, which first removes the most irrelevant features, using indirect performance measures such as distance and information measures, and then applies the induction learning algorithm.11 A unified model, composed of the four components of Feature Generation, Feature Evaluation, Stopping Criteria, and Testing, proposed by Liu and Motoda,11 is another way to visualize feature selection algorithms. The Generation Procedure generates the candidate attribute subset to be evaluated. The Evaluation Function is used to evaluate the candidate subset, and in some cases the decision tree induction algorithm is used for this purpose. The Stopping Criterion (minimum number of attributes, minimum classification rate, maximum number of iterations) is essential to prevent exhaustive searches of the attribute space. According to Dash and Liu,12 a fifth component, the Validation Procedure, is not part of the feature selection process itself but is needed to make sure the results are valid. This procedure is sometimes known as the 'Sanity Test'.

Many feature selection methods have been proposed by various researchers, and it is generally recognized that there is no single best feature selection method. For example, Almuallim and Dietterich13 presented a quasi-polynomial-time algorithm that performs a thorough search in order to find a subset of features that satisfies the consistency measure, while Lovell et al14 proposed a forward selection approach for selecting discrete-valued features. Other researchers have presented approaches based on genetic algorithms,15 neural networks,10,16 statistics,6,17–19 rough sets,9,20–24 and mathematical programming.25–29 An example of a statistics-based approach is the χ² feature selection algorithm,6 which is based on χ² statistics and which is described in more detail later in this paper. Rough set approaches typically involve heuristic strategies for selecting attributes into reducts (ie minimal sets that can be used to classify objects). For example, Kohavi and Frasca20 applied a best-first search in the space of feature subsets and used the Bootstrap estimation method to estimate the accuracy of the induced classifier at each state. Recently, Zhong et al9 proposed a feature selection algorithm using rough set theory with greedy heuristics. Mathematical programming approaches25–29 examine the relevancy of the features in the objective function by forcing the weights of the features in the constraint functions to be zero. The minimal feature subset that yields the best classification performance is then selected.28 Bradley and Mangasarian25 proposed an approach that involved the use of Support Vector Machines (SVM), while Bredensteiner and Bennett26 proposed a Frank–Wolfe-type method.
Description of three feature selection methods

Definition of terms

Entropy measures: The Information Gain measure (Quinlan30,31) is based on maximizing the 'gain' obtained by selecting a particular predictor attribute for branching when creating a decision tree for classification. For a discretization G_g of an attribute into g intervals, the Information Gain (IG) is defined as

$$\mathrm{IG}(g) = -\sum_{s \in S} p_s \log_2(p_s) - \left(-\sum_{j \in J_{G_g}} p_j \sum_{s \in S} p_{s|j} \log_2(p_{s|j})\right)$$

where J_{G_g} is the index set of the intervals that are included in a particular discretization G_g that consists of g = |G_g| intervals. Given that IG(g) is a non-decreasing function of g, it 'tends to prefer attributes with large numbers of possible values'.32 An alternate approach involves the maximization of the Gain Ratio GR(g) = IG(g)/SI(g), where

$$\mathrm{SI}(g) = -\sum_{j \in J_{G_g}} p_j \log_2(p_j)$$

is called the Split Information for the partition G_g with g intervals, from Quinlan.33 However, for some data sets the Gain Ratio measure 'overcompensates'32 and so must be moderated by choosing the predictor attribute with the maximum Gain Ratio and an above-average Information Gain. An examination of the Information Gain formula shows a component (ie the Unconditional or A Priori Entropy) that is the same for all predictor attributes and all values of g, and a second major component that depends on the relevant attribute and also on the value of g. This second component is called the conditional entropy H(Class|a_i) of predictor attribute a_i, and is defined as follows:

$$H(\mathrm{Class}\,|\,a_i) = -\sum_{j \in J_{G_{g_i}}} p_j \sum_{s \in S} p_{s|j} \log_2(p_{s|j})$$
With this measure, the attribute that gives the lowest conditional entropy is considered to be the best attribute to split on.
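To make the entropy measures above concrete, the following is a minimal sketch (not the authors' implementation) of how the conditional entropy and Information Gain of a discretized attribute could be computed from raw interval and class labels; all function and variable names are illustrative.

```python
import math
from collections import Counter

def conditional_entropy(intervals, classes):
    """H(Class | a_i) for a discretized attribute: intervals[r] is the interval label
    of record r and classes[r] is its class label."""
    n = len(classes)
    h = 0.0
    for interval in set(intervals):
        members = [c for i, c in zip(intervals, classes) if i == interval]
        p_j = len(members) / n                      # p_j: weight of interval j
        for count in Counter(members).values():
            p_sj = count / len(members)             # p_{s|j}: class share within interval j
            h -= p_j * p_sj * math.log2(p_sj)
    return h

def information_gain(intervals, classes):
    """IG = H(Class) - H(Class | a_i)."""
    n = len(classes)
    h_class = -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())
    return h_class - conditional_entropy(intervals, classes)
```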
χ² discretization measure: For classification purposes, a predictor attribute should partition the observations so that the class distribution differs between each pair of adjacent discretized levels (ie intervals) of the given predictor attribute. Given this desired property for a discretization, for each pair of adjacent levels it may be useful to test the null hypothesis H0 that the given pair of adjacent discretized levels is independent of the classes at a specified level of significance α. Let o_j(is) and e_j(is) be the actual number of examples and the expected number under H0, respectively, in interval V_j of attribute a_i that have attribute value v_i and are associated with class c_s. The χ² statistic for interval V_j is defined as

$$\chi^2 = \sum_i \sum_s \frac{\left(o_j(is) - e_j(is)\right)^2}{e_j(is)}$$

Under H0 the χ² statistic follows the χ² probability distribution. Thus, the decision rule for independence is that the χ² statistic does not exceed the critical value of the appropriate χ² distribution at the specified level of significance α.

Inconsistency rate discretization measures: Ideally, we would like each discretization interval V_j to be associated with a single class c_s. In practice, this is unlikely to be true, so an important measure of the quality of the discretization is the total relative deviation from this ideal relationship between the interval and the class. One such measure has been called 'inconsistency', which is based on the number of false class predictions associated with the interval. Let n be the number of examples in the entire data set, n_j be the number of examples in the data set that belong to interval V_j, and n_js be the number of examples in interval V_j that are associated with class s. The Inconsistency Count for V_j is defined as q_j = n_j − Max{n_js : s ∈ S}. The Inconsistency Rate (IR) of a given discretization G_g with g intervals is defined as

$$\mathrm{IR}(g) = \sum_{j \in J_{G_g}} \delta_j, \quad \text{where } \delta_j = q_j / n.$$
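A similarly hedged sketch of the two discretization measures just defined: the χ² statistic for a pair of adjacent intervals and the inconsistency rate of a discretization. The counts are passed in as per-interval dictionaries of class counts; the names are illustrative, not taken from the original implementations.

```python
def chi2_statistic(counts_a, counts_b):
    """Chi-squared statistic for two adjacent intervals, where counts_a[s] and
    counts_b[s] are the observed class counts in each interval."""
    classes = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    chi2 = 0.0
    for s in classes:
        col = counts_a.get(s, 0) + counts_b.get(s, 0)
        for observed, row in ((counts_a.get(s, 0), n_a), (counts_b.get(s, 0), n_b)):
            expected = row * col / total            # expected count under independence (H0)
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def inconsistency_rate(interval_class_counts, n):
    """IR = sum over intervals of (n_j - max_s n_js) / n, for a data set of n examples."""
    return sum(sum(c.values()) - max(c.values()) for c in interval_class_counts) / n
```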
Blurring-based feature selection

Blurring-based feature selection involves the use of the blurring function, which is the average conditional entropy of the given subset of attributes. The FSB algorithm5 uses the blurring measure to iteratively remove features from the data set until a stopping criterion is reached, removing one at a time the feature with the smallest blurring value. This algorithm and measure are claimed to achieve great success in the domain of credit-risk evaluation decisions. However, the algorithm's asymptotic performance is O(n²) in the number of attributes,34 so data sets with large numbers of attributes would have an impact on the algorithm's running time. In Giles et al,35 we developed the Simple Blurring Algorithm. Similar to FSB, the description of the Simple Blurring Algorithm below assumes that the following actions have occurred before the invocation of the blurring algorithm: the Conditional Entropy has been calculated for each attribute, and the Stopping Rule has been specified. Types of stopping rules include: (a) minimum number of attributes (say hMin); (b) minimum acceptable Classification Rate (CRMin).

Simple Blurring Algorithm (SBA)

Step 1 (Determine initial set of attributes):
(a) Calculate the average Conditional Entropy for each attribute from the original data set.
(b) Exclude those attributes whose Conditional Entropy is greater than the Average Conditional Entropy. Let there be h attributes remaining.
(c) Order these h attributes in descending order of their Conditional Entropy values.

Step 2 (Invoke Learning Algorithm):
(a) Invoke the Induction Learning algorithm (eg C4.5) using the first h attributes only, and record the results.

Step 3 (Termination test):
(a) If the Stopping Rule is satisfied then:
    (i) Based on the recorded results from all iterations, return the set of attributes that gives the best result. If two sets of attributes give results that are approximately the same (eg within 95%), pick the smaller set.
    (ii) Stop.
(b) Else:
    (i) Set h = h − 1.
    (ii) Go to Step 2.

Step 1 determines the initial reduced set of attributes to work with. Similar to the way attribute nodes are chosen in the C4.5 algorithm, where a candidate node is chosen only if it has the maximum Gain Ratio and an Information Gain greater than the average Information Gain, this algorithm removes attributes whose Conditional Entropy values are greater than the average Conditional Entropy. The remaining attributes are then ordered by their Conditional Entropy values. This approach to eliminating attributes can thus be considered a case of aggressive feature selection, in that an attribute may be eliminated even before its effect on the classification rate is known. Step 2 takes as input the reduced attribute data set and runs the normal induced decision tree algorithm, such as C4.5. The results of this step are fed into Step 3, which determines whether the Stopping Rule has been met. If it has, then the minimal set of attributes with the best results (eg Classification Accuracy) is returned. For results that are similar in value (within 95%), the set with the fewer attributes is returned.
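The SBA steps above can be summarized in a short sketch. It assumes the caller supplies a conditional-entropy function and an evaluation function that trains and scores the classifier (eg a call into a C4.5 implementation) on a given attribute subset; the minimum-number-of-attributes stopping rule is used, and the 'within 95%' tie-breaking is one possible reading of the rule in Step 3. None of the names below come from the authors' code.

```python
def simple_blurring(attributes, cond_entropy, evaluate, h_min=1):
    """Sketch of SBA: aggressive pre-filtering by conditional entropy (Step 1),
    then repeated evaluation of ever-smaller attribute subsets (Steps 2-3)."""
    ce = {a: cond_entropy(a) for a in attributes}
    average_ce = sum(ce.values()) / len(ce)
    # Step 1: keep attributes with below-average conditional entropy, sorted descending
    kept = sorted((a for a in attributes if ce[a] <= average_ce), key=lambda a: -ce[a])
    results = []                                     # (classification rate, attribute subset)
    h = len(kept)
    while h >= h_min:
        subset = kept[:h]
        results.append((evaluate(subset), subset))   # Step 2: invoke the learning algorithm
        h -= 1                                       # Step 3b: drop the last attribute
    # Step 3a: among near-best subsets (within 95% of the best rate), prefer the smallest
    best_rate = max(rate for rate, _ in results)
    candidates = [(rate, s) for rate, s in results if rate >= 0.95 * best_rate]
    return min(candidates, key=lambda item: len(item[1]))[1]
```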
In terms of the Dash and Liu12 framework, Step 1 and Substep 3b map to the Generation Procedure, Step 2 maps to the Evaluation Function, and Step 3a incorporates the Stopping Criterion (minimum number of attributes, minimum classification rate, maximum number of iterations). As mentioned previously, according to Dash and Liu,12 the Validation Procedure is not part of the feature selection process itself but is needed to make sure the results are valid. It should be noted that the runtime performance of SBA is limited only by the sorting procedure in Step 1c. Several sorting routines, such as Heapsort or Mergesort, have asymptotic runtime performance of O(n log₂ n) or better.34 Thus, SBA has better runtime performance than FSB, which, as previously noted, runs in O(n²) time in the number of attributes.
χ²-Based feature selection

The χ² statistic is the basis of the popular Chi2 feature selection algorithm.6 The algorithm uses the χ² statistic to determine whether the class attribute values are independent of two adjacent intervals of an attribute. If two intervals of a predictor attribute have sufficiently similar relative class frequencies, as determined by the χ² statistic, then the two intervals are combined into one. The Chi2 algorithm is divided into two phases. The first phase, which is a generalized version of Kerber's ChiMerge algorithm,7 sets an initially high global significance level. This level is used as a stopping criterion for interval merging: merging continues until all interval pairs have exceeded the χ² value determined by the significance level. Each iteration through the loop decreases the significance level until a preset inconsistency rate is reached. The second phase is a refinement of the first, and proceeds by merging intervals in any of the attributes even further, as long as the inconsistency of the training data is not increased above a given limit for each attribute (the first phase uses a global significance level as the stopping criterion, while the second phase uses a local one). If an attribute is merged into only one interval, then that attribute is eliminated (feature selected).
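The core merging step shared by ChiMerge and phase 1 of Chi2 can be sketched as follows. It reuses the chi2_statistic helper sketched earlier and looks up the critical value with scipy.stats.chi2.ppf; the choice of degrees of freedom, tie handling, and the outer loop that lowers the significance level are simplified relative to the published algorithms.

```python
from scipy.stats import chi2 as chi2_dist

def chimerge_pass(intervals, alpha, n_classes):
    """Merge adjacent intervals whose class distributions look independent of the class.

    intervals is a list of per-interval class-count dicts, ordered by attribute value;
    merging stops once every adjacent pair exceeds the chi-squared critical value."""
    threshold = chi2_dist.ppf(1.0 - alpha, df=n_classes - 1)
    while len(intervals) > 1:
        stats = [chi2_statistic(a, b) for a, b in zip(intervals, intervals[1:])]
        j = min(range(len(stats)), key=stats.__getitem__)   # most similar adjacent pair
        if stats[j] >= threshold:
            break                                           # all pairs now differ significantly
        left, right = intervals[j], intervals[j + 1]
        merged = {s: left.get(s, 0) + right.get(s, 0) for s in set(left) | set(right)}
        intervals[j:j + 2] = [merged]
    return intervals
```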
Hybrid approach: ChiBlur

This hybrid approach involves the use of concepts from both the blurring and χ²-based approaches to feature selection, as well as concepts from multi-objective optimization. We first describe the ChiMerge algorithm of Kerber,7 which was the motivation for phase 1 of the Chi2 algorithm. The ChiMerge algorithm produces discretizations that vary with the value of α: if we apply ChiMerge to all attributes for a given value of α, then each attribute will be discretized. Let n_i(α) be the number of intervals in the discretized version G_i(α) of attribute i for χ² significance level α. Let I_F be a subset of the attributes. If the discretized versions of all attributes in I_F are to be included in the given DT induction algorithm, then the dimension of the given problem (excluding the class attribute) is Dim(α, I_F) = Σ_{i∈I_F} n_i(α), and the corresponding inconsistency rate is IR(α, I_F). Some combinations of I_F and α will be said to be dominated, and will not be considered worthwhile for DT induction. A particular (α_a, I_F^a) combination is said to be dominated if there is at least one other (α_b, I_F^b) such that (Dim(α_b, I_F^b), IR(α_b, I_F^b)) ≤ (Dim(α_a, I_F^a), IR(α_a, I_F^a)), with strict inequality holding for at least one of the components. The basic idea is that if the dimension (Dim) is inferior, then the corresponding inconsistency rate (IR) should be superior, or vice versa.

ChiBlur(γ, k) Algorithm

Step 1:
(a) Specify a starting value α = α₁ (eg 0.05), a stopping value α_Threshold (eg 0.50), and a step size Δ (eg 0.05).

Step 2:
(a) Discretize each attribute i using ChiMerge(α), and compute the conditional entropy value CE(G_i(α)).
(b) Let I_F(α) be the subset of attributes that are to be selected for the current α. If Minimum(CE(G_i(α)))/Maximum(CE(G_i(α))) ≥ γ, then include all attributes in I_F(α); else include in I_F(α) only those attributes whose CE(G_i(α)) ≤ Average(CE) + k × StandardDeviation(CE).

Step 3:
(a) Compute the dimension Dim(α, I_F(α)) and inconsistency rate IR(α, I_F(α)) of the data set using the current discretized versions of the attributes in I_F(α).
(b) If (Dim(α, I_F(α)), IR(α, I_F(α))) is not dominated, then record the results.

Step 4:
(a) Set α = α + Δ.
(b) If α > α_Threshold then go to Step 5; otherwise repeat Steps 2–4.

Step 5:
(a) Invoke the Induction Learning algorithm using the relevant discretized version of the attributes in each non-dominated I_F(α), and record the results.
(b) Pick the I_F(α) that provides the best classification rate from among those explored in Substep 5a.
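The multi-objective part of ChiBlur—keeping only non-dominated (Dim, IR) combinations while sweeping α—can be sketched as a simple Pareto filter around the routines sketched earlier. The discretize, cond_entropy, dim and ir arguments stand in for the ChiMerge discretization, conditional-entropy, dimension and inconsistency-rate computations; whether the standard deviation in Step 2(b) is the population or sample version is an assumption made here.

```python
import statistics

def dominates(a, b):
    """(Dim, IR) pair a dominates b if it is no worse in both components and differs in at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def chiblur_candidates(attributes, alphas, k, gamma, discretize, cond_entropy, dim, ir):
    """Steps 2-4 of ChiBlur: for each alpha, discretize, select attributes by conditional
    entropy, and keep only the non-dominated (alpha, subset, Dim, IR) combinations."""
    candidates = []
    for alpha in alphas:
        discretized = {a: discretize(a, alpha) for a in attributes}
        ce = {a: cond_entropy(d) for a, d in discretized.items()}
        if min(ce.values()) / max(ce.values()) >= gamma:
            selected = list(attributes)              # entropies are all similar: keep everything
        else:
            cutoff = statistics.mean(ce.values()) + k * statistics.pstdev(ce.values())
            selected = [a for a in attributes if ce[a] <= cutoff]
        candidates.append((alpha, selected, dim(discretized, selected), ir(discretized, selected)))
    # Pareto filter over (Dim, IR); Step 5 would then run DT induction on each survivor
    return [c for c in candidates
            if not any(dominates((o[2], o[3]), (c[2], c[3])) for o in candidates)]
```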
Experimental results and analysis

Experimental setup

To compare the various χ²-based and blurring-based algorithms, we performed nine test paths, each path being a unique, yet interesting, combination of feature selection, discretization, and classification algorithms. These paths are given in Table 1. The first test path, labeled 'D-C4.5', was our baseline measure of performance, as the C4.5 algorithm is widely cited in the literature and seems to be the algorithm against which all new algorithms are compared. Since it is a classification algorithm, it also allows us to directly compare the performance of the feature selection algorithms we discussed in the previous section.
Table 1  Test paths of the feature selection, discretization, and classification algorithms

Path label               Test path
D-C4.5                   Dynamic C4.5
Chi2,M+I,D-C4.5          Chi2-Mapping+Intervals-Dynamic C4.5
Chi2,M+I,S-C4.5          Chi2-Mapping+Intervals-Static C4.5
ChiMrg,M+I,Blr,D-C4.5    ChiMerge-Mapping+Intervals-Blur-Dynamic C4.5
ChiMrg,M+I,Blr,S-C4.5    ChiMerge-Mapping+Intervals-Blur-Static C4.5
ChiMrg,M+I,S-C4.5        ChiMerge-Mapping+Intervals-Static C4.5
ChiMrg,M+I,D-C4.5        ChiMerge-Mapping+Intervals-Dynamic C4.5
ChiBlur, 0 StdDev        ChiBlur with k = 0
ChiBlur, 1 StdDev        ChiBlur with k = 1
Table 2  Data set characteristics

Data set        Number of classes   Number of predictor attributes   Data type              No. of distinct values (Min / Avg / Max)
IRIS            3                   4                                Continuous             22 / 31 / 43
Breast Cancer   2                   9                                Integer                10 / 10 / 10
Car             4                   6                                Non-numeric, ordered   3 / 4 / 4
Wave            3                   21                               Continuous             519 / 714 / 885
Glass           6                   9                                Continuous             32 / 104 / 178
Page blocks     5                   10                               Integer, continuous    104 / 909 / 1718
Yeast           10                  8                                Continuous             2 / 52 / 81
This is interesting because it is not just the reduction in attributes by a feature selection algorithm that is important, but the reduction to the appropriate attributes that is desired. The C4.5 classification algorithm allows us to measure how well the reduced data sets perform. In addition to using the regular version of C4.5, which we call Dynamic C4.5, we also modified C4.5 to keep static predefined discretized intervals throughout the C4.5 process, which in effect treats the discretized continuous attribute as if it were a categorical variable. The Dynamic C4.5 approach treats the discretized continuous attribute as if it were an ordinal variable. Thus, with the Static C4.5 approach there is no further dynamic merging of adjacent intervals by the splitting method, while in the Dynamic C4.5 approach there could be additional dynamic merging of adjacent intervals. The test paths labeled 'Chi2,M+I,D-C4.5' and 'Chi2,M+I,S-C4.5' take the reduced data sets from the Chi2 algorithm and use the appropriate C4.5 variant (Dynamic C4.5 or Static C4.5, respectively) to classify the data. The outputs of the Chi2 algorithm are the resulting attribute values and a set of discretized intervals. The data values are then 'remapped' into the suggested discretized intervals. For example, if Chi2 says that, for an attribute, data values 4.3 ≤ x < 4.9 should form one interval, then all values in that interval are set to 4.3. This process is represented as 'Mapping+Intervals' in Table 1. It is this remapped data file that is then input into the classification algorithms. Dynamic C4.5 then takes the remapped data and can
perform even more interval merging. In Static C4.5, however, the intervals are frozen (made static) and no further merging of intervals is permitted. The test paths labeled 'ChiMrg,M+I,Blr,D-C4.5' and 'ChiMrg,M+I,Blr,S-C4.5' used the discretization intervals and data generated by the ChiMerge algorithm to feed into the Simple Blurring Algorithm in order to do feature selection, the results of which are then fed into the classification algorithms. The test paths labeled 'ChiMrg,M+I,S-C4.5' and 'ChiMrg,M+I,D-C4.5' also use the discretization intervals and data generated by the ChiMerge algorithm, but feed these results directly (after being remapped) into the classification algorithms without performing explicit feature selection. These two paths allow us to see how discretization with classification compares to feature selection and classification algorithm combinations. The test paths labeled 'ChiBlur, 0 StdDev' and 'ChiBlur, 1 StdDev' show the results of the ChiBlur algorithm with the k input set to 0 and 1, respectively.
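The interval 'remapping' step described above—replacing each raw value by the lower bound of the discretized interval that contains it—amounts to a one-line lookup. A small illustrative sketch (values below the first cut point are clamped to it):

```python
import bisect

def remap(values, lower_bounds):
    """Map each value to the lower bound of its interval, eg with lower_bounds [4.3, 4.9]
    every value with 4.3 <= x < 4.9 becomes 4.3."""
    return [lower_bounds[max(bisect.bisect_right(lower_bounds, x) - 1, 0)] for x in values]

# remap([4.1, 4.5, 5.2], [4.3, 4.9]) -> [4.3, 4.3, 4.9]
```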
Data sets

We used seven publicly available benchmark data mining data sets obtained from the UC Irvine machine learning repository: IRIS, Breast Cancer, Car, Glass, Yeast, Page Blocks, and Wave.36 We chose data sets from a variety of problem domains, but limited our choice to data sets whose attributes have numeric interval domains, because of the requirements of the ChiMerge and χ²-based algorithms. These data sets are summarized in Table 2.
Table 3  Classification accuracies (%) of test data for test paths

Test path label          IRIS    Breast Cancer   Car     Glass   Yeast   Page blocks   Wave
D-C4.5                   95.33   72.49           92.48   65.89   54.78   96.95         77.02
Chi2,M+I,D-C4.5          95.33   74.50           96.18   76.17   59.03   97.11         77.02
Chi2,M+I,S-C4.5          96.00   73.64           96.18   74.30   54.25   95.43         70.42
ChiMrg,M+I,Blr,D-C4.5    96.00   69.05           77.78   71.03   49.66   95.63         75.54
ChiMrg,M+I,Blr,S-C4.5    96.00   71.06           77.78   70.79   47.30   94.67         70.00
ChiMrg,M+I,S-C4.5        97.33   73.64           94.62   75.70   54.25   95.12         70.52
ChiMrg,M+I,D-C4.5        95.33   74.50           94.62   78.04   59.03   97.06         77.20
ChiBlur, 0 StdDev        96.67   73.07           77.78   74.77   48.92   97.26         73.90
ChiBlur, 1 StdDev        96.00   75.64           96.18   79.91   59.70   97.11         73.28
Test programs

For the classification algorithms, we used as a basis the well-known Weka Java library of data mining algorithms (www.cs.waikato.ac.nz/~ml/weka). The Weka library contains the Dynamic C4.5 algorithm, and we modified its code in order to realize our Static C4.5 algorithm. For the ChiMerge and Chi2 algorithms, we created our own implementations of the algorithms as listed in Dash and Liu.12 For the ChiBlur algorithms, we used a combination of our ChiMerge, Simple Blurring Algorithm, and C4.5 implementations.
Results

Table 3 shows the classification accuracies of the different paths on the test data. These test data classification accuracies are based on 10-fold cross-validation performed with C4.5 on each data set. Table 4 shows the number of predictor attributes in the original data set (before feature selection) and the resulting number of predictor attributes left in the data set after feature selection for each algorithm (Simple Blurring Algorithm—labeled 'Blur', Chi2, ChiBlur). More detailed results of the ChiBlur test paths are shown in Table 5. The first column shows the significance levels used for each data set that resulted in non-dominated classification accuracies. The significance level is the α level used in the ChiMerge algorithm for the χ² lookup value in the stopping criterion. The second and third columns show the classification accuracies using Dynamic C4.5 for the two ChiBlur test paths (ChiBlur with k = 0 and k = 1). It should be noted that solutions corresponding to different significance levels may be non-dominated for different values of k. For example, for the Iris data set at α = 0.50, the k = 0 test path of ChiBlur was non-dominated but the k = 1 test path was not (dominated solutions are designated 'N/A' in Table 5). For α = 0.01, however, the corresponding solution was non-dominated for both ChiBlur runs.
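For readers who want to reproduce accuracy figures of this kind, the following minimal sketch estimates 10-fold cross-validation accuracy with scikit-learn's CART decision tree as a stand-in for the authors' (modified) C4.5; the numbers it produces will therefore differ from those in Table 3.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # eg the IRIS data set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold CV accuracy: {100 * scores.mean():.2f}%")
```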
Analysis

For each data set, the best classification accuracy across all test paths in Table 3 was identified. A tie was determined to exist if the absolute difference in the classification accuracies of the top competitors was no more than 0.5%.
Table 4  Number of predictor attributes before and after feature selection

Data set       Before   Blur   Chi2   ChiBlur
IRIS           4        2      4      2
Breast Cancer  9        2      9      2–3
Car            6        2      5      2
Glass          9        5      9      5
Yeast          8        3      8      3
Page Blocks    10       4      10     4
Wave           21       13     21     13
For example, for the IRIS data set, test path 'ChiMrg,M+I,S-C4.5' had the highest classification accuracy across all test paths, at 97.33%. For the Car data set, with a 96.18% classification accuracy, 'ChiBlur, 1 StdDev' tied with test paths 'Chi2,M+I,D-C4.5' and 'Chi2,M+I,S-C4.5'. The summary of classification accuracy wins/ties per test path is shown in Table 6. As can be seen, for these seven data sets the ChiBlur algorithm does quite well compared to the classification algorithm alone, to the Simple Blurring feature selection algorithm plus C4.5, and to the ChiMerge discretization algorithm combined with C4.5. The ChiBlur algorithms had the best classification accuracy on three of the seven data sets and tied for the best on another two. The ChiBlur algorithm tended to do better with k = 1 than with k = 0 because, with k = 0, the Simple Blurring Algorithm tends to feature select too harshly, eliminating too many attributes; important attributes can then be removed from the data set, which can impair classification accuracy. But being more conservative in feature selection is not always the best strategy for all data sets, as the Page Blocks data set shows. The ChiBlur algorithm with k = 0 and α = 0.25 and 0.15 retained four Page Blocks predictor attributes: height, p_black, mean_tr, and wb_trans; ChiBlur with k = 1 and α = 0.01 retained nine Page Blocks predictor attributes: height, length, area, eccen, p_black, p_and, mean_tr, blackand, and wb_trans. The more aggressive k = 0 input not only reduced the data set size by eliminating five more attributes, but also slightly improved the classification accuracy.
Table 5  Classification accuracies of test data for ChiBlur algorithms

Data set       Significance level (α)   Accuracy, k=0   Accuracy, k=1
IRIS           0.50                     96.67           N/A
               0.35                     N/A             95.33
               0.10                     N/A             95.33
               0.01                     96              96

Breast Cancer  0.45                     N/A             73.07
               0.40                     71.06           75.07
               0.35                     N/A             74.79
               0.30                     N/A             75.64
               0.25                     73.07           73.64
               0.20                     67.62           74.79
               0.15                     68.77           N/A
               0.10                     N/A             73.35
               0.05                     69.05           73.93
               0.01                     N/A             75.07

Car            0.50                     N/A             94.33
               0.45                     N/A             94.33
               0.40                     N/A             94.33
               0.35                     N/A             94.33
               0.30                     N/A             94.33
               0.25                     N/A             94.33
               0.20                     N/A             96.18
               0.15                     N/A             96.18
               0.10                     N/A             96.18
               0.05                     N/A             96.18
               0.01                     77.78           92.71

Glass          0.50                     73.36           N/A
               0.45                     73.83           N/A
               0.35                     73.36           N/A
               0.30                     71.96           74.3
               0.25                     74.77           N/A
               0.20                     73.83           78.97
               0.15                     72.43           78.50
               0.10                     71.03           78.97
               0.05                     70.09           79.91
               0.01                     70.56           74.77

Yeast          0.50                     N/A             57.61
               0.45                     N/A             57.21
               0.40                     N/A             59.64
               0.35                     N/A             59.10
               0.25                     N/A             58.56
               0.20                     N/A             58.02
               0.15                     N/A             58.56
               0.10                     N/A             58.69
               0.05                     N/A             59.70
               0.01                     48.92           57.68

Page Blocks    0.30                     96.99           N/A
               0.25                     97.26           N/A
               0.15                     97.22           N/A
               0.10                     97.06           N/A
               0.05                     96.77           97.08
               0.01                     96.91           97.11

Wave           0.01                     73.9            73.28

Table 6  Summary of classification accuracy wins/ties vs test path

Algorithm            Wins   Ties
ChiBlur, 1 StdDev    3      2
ChiBlur, 0 StdDev    0      1
All other paths      2      2
It was also noted, as summarized in Table 4, that in general the Chi2 algorithm did not tend to perform feature selection and was the most conservative algorithm tested. Only on the Car data set did Chi2 eliminate any attribute. While previous literature found that Chi2 performed feature selection well, our results, obtained with the published Chi2 and ChiMerge algorithms (ie http://www.comp.nus.edu.sg/~liuh/Fsbook/), were more conservative in terms of the amount of feature selection performed by Chi2. On the other hand, the Simple Blurring Algorithm was the most aggressive in feature selection. The ChiBlur algorithm also feature selected aggressively but, as seen for the Breast Cancer data set, most of the ChiMerge α levels feature selected down to two predictor attributes, while one of the α levels feature selected down to three predictor attributes. Discretization can therefore affect the algorithm's decision as to whether or not to select an attribute.
Conclusion

In this paper, we have concentrated on comparing and discussing feature selection algorithms, notably a χ²-based feature selection algorithm (Chi2) and a blurring-based feature selection algorithm (the Simple Blurring Algorithm), both of which have been discussed in the current literature. We also introduced a hybrid feature selection algorithm called ChiBlur, which is based on concepts from the ChiMerge and Simple Blurring Algorithms as well as from multi-objective optimization. While this algorithm includes the aggressive feature selection of the Simple Blurring Algorithm (aggressive when k is small), classification accuracies were, in general, better for ChiBlur than for the other tested feature selection algorithms. We believe these improved results are due to the benefits of discretization, as provided by the ChiMerge algorithm. In fact, some test paths using ChiMerge benefited from discretization, and most ChiMerge-only test path results were comparable to the other test path results using feature selection or classification alone. The ChiBlur algorithm combined the discretization benefits of ChiMerge with the attribute elimination benefits of the Simple Blurring Algorithm to produce an algorithm that performed classification well, along with good reductions in data dimensionality. Thus, we have demonstrated that good feature selection results can be realized from aggressive attribute feature selection combined with attribute discretization, a demonstration
which ties together previously separate literature claims, and we have contributed a practical algorithm for improved feature selection and classification for actual use. We have also demonstrated that concepts from data mining can be combined with concepts from multi-objective optimization to produce an improved data mining algorithm. Although the results of our computational study are encouraging, at this point no claim is made that these results are necessarily generalizable to all data sets. Further, our algorithm is only applicable to data sets whose predictor attributes take their values from ordered domains (eg integer, continuous). Future research could involve generalizing the algorithm so that approaches such as linear programming could be used for the discretization in Substep 2a, and other approaches could be used for feature selection in Substep 2b.
References

1 Kang B and Park S (2000). Integrated machine learning approaches for complementing statistical process control procedures. Decis Support Syst 29: 59–72.
2 Yin P and Yeh C (2000). Content-based retrieval from trademark databases. Pattern Recognition Lett 23: 113–126.
3 Lin FY and McClean S (2001). A data mining approach to the prediction of corporate failure. Knowl-Based Syst 14: 189–195.
4 Boley D et al (1999). Partitioning-based clustering for Web document categorization. Decis Support Syst 27: 329–341.
5 Piramuthu S (1999). Feature selection for financial credit-risk evaluation decisions. INFORMS J Comput 11: 258–266.
6 Liu H and Setiono R (1997). Feature selection by discretization. IEEE Trans Knowl Data Eng 9: 642–645.
7 Kerber R (1992). ChiMerge: discretization of numeric attributes. In: Swartout WR (ed). Proceedings of the 10th National Conference on Artificial Intelligence. AAAI Press: USA, pp 123–138.
8 Bryson K-M and Joseph A (2001). Optimal techniques for class-dependent attribute discretization. J Opl Res Soc 52: 1130–1143.
9 Zhong N and Dong J (2001). Using rough sets with heuristics for feature selection. J Intell Inf Syst 16: 199–214.
10 Kang B and Park S (2000). Integrated machine learning approaches for complementing statistical process control procedures. Decis Support Syst 29: 59–72.
11 Liu H and Motoda H (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers: Boston.
12 Dash M and Liu H (1997). Feature selection for classification. Intell Data Anal 1: 131–156.
13 Almuallim H and Dietterich T (1991). Learning with many irrelevant features. Proceedings of the Ninth National Conference on Artificial Intelligence. AAAI Press: Anaheim, CA, USA, pp 547–552.
14 Lovell DR et al (1998). Feature selection using expected attainable discrimination. Pattern Recognition 19: 393–402.
15 Yang J and Honavar V (1998). Feature subset selection using a genetic algorithm. IEEE Intell Syst 13: 44–49.
16 Last M, Kandel A and Maimon O (2001). Information-theoretic algorithm for feature selection. Pattern Recognition 22: 799–811.
17 Kira K and Rendell LA (1992). The feature selection problem: traditional methods and a new algorithm. In: Swartout WR (ed). Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI Press: San Jose, CA, USA, pp 129–134.
18 Kant S and Verma N (2000). An effective source recognition algorithm: extraction of significant binary words. Pattern Recognition 21: 981–988.
19 Nock R and Sebban M (2001). A Bayesian boosting theorem. Pattern Recognition Lett 22: 413–419.
20 Kohavi R and Frasca B (1994). Useful feature subsets and rough set reducts. In: Lin TY and Wildberger AM (eds). Proceedings of the Third International Workshop on Rough Sets and Soft Computing (RSSC'94), San Jose, CA, USA, pp 320–323.
21 Choubey SK, Deogun JS, Raghavan VV and Sever H (1996). A comparison of feature selection algorithms in the context of rough classifiers. Proceedings of the Fifth IEEE International Conference on Fuzzy Systems. IEEE: New Orleans, LA, USA, pp 1122–1128.
22 Modrzejewski M (1993). Feature selection using rough sets theory. In: Brazdil PB (ed). Proceedings of the 1993 European Conference on Machine Learning. Springer: Vienna, Austria, pp 213–225.
23 Pawlak Z (1995). Rough set approach to knowledge-based decision support. Commun ACM 38: 88–95.
24 Slowinski K and Stefanowski J (1996). On limitation of using rough set approach to analyse non-trivial medical information systems. http://citeseer.nj.nec.com/27669.html (accessed 5 July 2002).
25 Bradley PS and Mangasarian OL (1998). Feature selection via concave minimization and support vector machines. In: Shavlik J (ed). Proceedings of the 15th International Conference on Machine Learning. Madison, WI, USA, pp 82–90.
26 Bredensteiner EJ and Bennett KP (1998). Feature minimization within decision trees. Comput Optim Appl 10: 111–126.
27 Mangasarian OL (1997). Mathematical programming in data mining. Data Min Knowl Disc 1: 183–201.
28 Street WN, Mangasarian OL and Wolberg WH (1995). An inductive learning approach to prognostic prediction. In: Prieditis A and Russell S (eds). Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann: Tahoe City, CA, USA, pp 522–530.
29 Fung G and Mangasarian OL (2000). Data selection for support vector machine classifiers. In: Ramakrishnan R (ed). Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: Boston, MA, USA, pp 64–70.
30 Quinlan J (1986). Induction of decision trees. Mach Learn 1: 81–106.
31 Quinlan J (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo.
32 Witten I and Frank E (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann: San Francisco.
33 Quinlan J (1996). Improved use of continuous attributes in C4.5. J Artif Intell Res 4: 77–90.
34 Cormen T, Leiserson C and Rivest R (1990). Introduction to Algorithms. MIT Press: Cambridge.
35 Giles K, Bryson NKM and Weng Q (2001). Comparison of two families of entropy-based classification measures with and without feature selection. Proceedings of the Hawaii International Conference on System Sciences (HICSS-34), Maui, HI, USA, pp 1–10.
36 Murphy PM and Aha DW (1994). UCI Repository of machine learning databases. Department of Information and Computer Science, University of California at Irvine.
Received January 2002; accepted January 2003 after two revisions