Journal of Applied Intelligence 4, 283-296 (1994) © 1994 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Classification Trees with Bivariate Splits

DAVID LUBINSKY*
Department of Computer Science, University of The Witwatersrand, Johannesburg, South Africa
[email protected]

*Much of this work was completed while the author was an employee of AT&T Bell Laboratories.

Abstract. We extend the recursive partitioning approach to classifier learning to use more complex types of split at each decision node. The new split types we permit are bivariate and can thus be interpreted visually in plots and tables. In order to find optimal splits of these new types, a new split criterion is introduced that allows the development of divide-and-conquer type algorithms. Two experiments are presented in which the bivariate trees--both with the Gini split criterion and with the new split criterion--are compared to a traditional tree-growing procedure. With the Gini criterion, the bivariate trees show a slight improvement in predictive accuracy and a considerable improvement in tree size over univariate trees. Under the new split criterion, accuracy is also improved, but there is no consistent improvement in tree size.

Key words: Recursive partitioning, bivariate split types, Gini, divide-and-conquer algorithms
1. Introduction
One of the most powerful paradigms in inductive learning is the use of recursive partitioning to build classification trees from data (see [1-3] for general information and overviews of the field). The data consist of a set of n cases where each case has an associated vector of attributes and a class value. The attributes may either be numeric (taking values in ℝ) or categorical (associated with a finite set of unordered values). In general, the class variable can take any of a number of values, but in this article we restrict our attention to datasets that have only two classes. The reason for this restriction is made clear in section 4.2. The goal of inductive-learning procedures is to produce rules based on the attribute values to
predict the class value for new cases. Recursive partitioning does this by dividing up the attribute space into mutually disjoint regions, each assigned a single class. The division is based on the values of the attributes and is usually found by a greedy algorithm that maximizes a split criterion at each stage. Each division splits the cases into two sets, and the algorithm proceeds recursively. The leaves of the resulting tree are labeled with classes, and the tree can be used to classify new cases by following a path from the root to a leaf according to the values of the attributes of the new case. The standard univariate tree-growing algorithm has the following steps:
1. Set the current group to the full data set.
2. If the current group contains only cases from one class, or contains fewer than a specified number of cases, label this node with the class that has most representatives at the node and return.
3. For each variable, find the best split, using an appropriate split criterion. For numeric variables, the split is of the form x > c, and for categorical variables x ∈ S.
4. Find the best split s, where s is the split that minimizes the criterion over all splits considered.
5. Use s to split the current group into two subgroups, and recursively apply the algorithm from step 2 to each of the subgroups.
6. Prune the tree so it does not overfit the data.
The simplicity of this algorithm also leads to its two main weaknesses:
• The algorithm may overlook important structures in the data because of the greedy nature of the search procedure. The symptom of this problem is a loss of predictive accuracy.
• Structures in the data that are not orthogonal to the axes cannot be directly represented in the resulting trees. The symptom of this problem is an increase in tree size and a corresponding decrease in interpretability, since many splits are generated to accommodate the nonorthogonal structures.
Both these weaknesses are related to steps 3 and 4, in which we select the next split. There are two components to selecting the next split: the set of splits considered, and the split criterion used to pick the best among these possible splits. In the next section, we discuss extending the considered splits to include splits based on two variables. This allows for nonorthogonal splits and also overcomes, to some extent, the greedy nature of the algorithm without sacrificing interpretability. This new strategy is then compared to the standard tree-growing strategy in a large study on a number of datasets. It is shown that we get a small increase in predictive accuracy and a substantial decrease in tree size by using bivariate splits. The brute-force algorithms for minimizing the various bivariate split types are very expensive, but it is not possible to minimize the standard split criteria in any more efficient way than total enumeration of all possible splits. To overcome this problem, we introduce the new class of additive split criteria and show how these can be used to develop more efficient divide-and-conquer algorithms to minimize the
new split criterion for bivariate splits. This new strategy is tested in another empirical comparison, and is also shown to improve accuracy, but it may lead to larger trees.
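Before turning to bivariate splits, the following minimal Python sketch illustrates the standard univariate procedure of steps 1-6 above (pruning omitted). The data representation, the helper names, and the use of the Gini criterion here are assumptions for illustration, not the implementation used in the experiments.

    from collections import Counter

    def gini_value(left, right):
        # Gini criterion of a candidate split: sum over the two children of
        # (count of class a) * (count of class b) / (child size).
        def g(group):
            c = Counter(cls for _, cls in group)
            n = len(group)
            return 0.0 if n == 0 else c["a"] * c["b"] / n
        return g(left) + g(right)

    def best_univariate_split(cases, attr_count):
        # Step 3/4: scan every threshold on every numeric attribute and keep
        # the split that minimizes the criterion.
        best = None
        for j in range(attr_count):
            for threshold in sorted({x[j] for x, _ in cases}):
                left = [c for c in cases if c[0][j] <= threshold]
                right = [c for c in cases if c[0][j] > threshold]
                if left and right:
                    score = gini_value(left, right)
                    if best is None or score < best[0]:
                        best = (score, j, threshold, left, right)
        return best

    def grow_tree(cases, attr_count, min_cases=5):
        # cases: list of (attribute_vector, class) with class in {"a", "b"}.
        classes = {cls for _, cls in cases}
        # Step 2: stop on pure or small nodes and label with the majority class.
        if len(classes) == 1 or len(cases) < min_cases:
            return Counter(cls for _, cls in cases).most_common(1)[0][0]
        best = best_univariate_split(cases, attr_count)
        if best is None:
            return Counter(cls for _, cls in cases).most_common(1)[0][0]
        _, j, threshold, left, right = best
        # Step 5: recurse on the two subgroups.
        return {"attr": j, "threshold": threshold,
                "left": grow_tree(left, attr_count, min_cases),
                "right": grow_tree(right, attr_count, min_cases)}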
2. Extending Tree-Growing Methodology to Use More Complex Splits

Each internal node of a classification tree represents a decision, and the complexity of the decision functions at each node affects both the power of the method and the interpretability of the results. We discuss extending the family of split types by considering splits involving at most two attributes. We defend this choice in greater detail in the next two sections, but to give a feel for the power and interpretability of such trees, we first present an example in figure 1. This figure shows an example of a tree with bivariate splits grown on data from a heart-disease study [4]. There are 303 cases with 13 attributes (6 numeric and 7 categorical), and two classes, healthy and sick, denoted as a and b. Each node is represented by a plot or a table within a box; the values above the boxes summarize the split, giving the number of cases in each class going left and right in the order La, Lb/Ra, Rb (these split parameters are defined in table 1). Three new types of split are used in this tree:
• The root split in the tree is a split on one numeric and one categorical variable and can be read as
(number of vessels colored > 0) & (chest pain type ∈ {3}). The values within the box are the counts for the two classes of each combination of the two variables. The column labels are reordered so that the best split can be indicated by partitioning off a rectangle including the lower-left corner of the box. The right child of the root is also a split on one numeric and one categorical variable. Since the numeric value has many values, the values are plotted rather than listed, and the split is oriented to cut the upper-left corner.
[Figure 1 appears here: a three-level tree in which each internal node is drawn as a two-way table or scatter plot; the variables appearing in the node plots include number of vessels colored, chest pain type, sex, thal, age, resting blood pressure, and maximum heart rate. The root split is summarized as 72,5/66,160.]
Fig. 1. Example of a bivariate tree for data from a heart-disease domain. Leaf nodes are not shown explicitly, but are implied whenever a node has fewer than two children.
• The split below the root to the left is a split on two categorical variables, and the optimal split is again drawn to include the lower-left corner.
• All the splits in the third level of the tree are linear splits on two numeric variables.
Table 1. Parameters of a split in terms of counts

Class   Number of cases going left   Number of cases going right   Total number of cases
a       La                           Ra = Na - La                  Na
b       Lb                           Rb = Nb - Lb                  Nb
The root split is a particularly good example of the power of bivariate splits. Neither variable is a good discriminator on its own, and even though the single bivariate split corresponds to two univariate splits, the univariate procedure would have missed the first split, and this very important structure in the data would never have been uncovered. Another example of the power of bivariate splits is provided by the center split in the bottom row. In this split, a decreasing line divides healthy and sick cases based on age and maximum heart rate. A quick examination of the plot reveals that, within the group of cases at the node, the maximum heart rate needed to be healthy decreases with age. In the next section we elaborate on the problems with simple univariate splits and then discuss the reasons for not using splits of dimension higher than two.

2.1. Critique of Univariate Splits
One of the key motivating ideas behind tree-based methods is that in real data a single global
model will not always be adequate. It is often the case that different regimes (models) apply in different areas of the space of predictors. For example, the process leading to heart disease in men may be different from that in women, so different models are needed for each group. Splitting the data and treating each subset independently is one cure for this problem--the cure prescribed by recursive partitioning techniques. However, this is a drastic cure--each split halves (on average) the amount of data available to the next level. Even starting with 10,000 cases, a 10-level tree will be training its deepest nodes on about ten cases. Since the language for expressing structure with univariate splits is so limited, trees of depth 10 or more are common. Univariate splits have to account not only for different regimes in the space of predictors but also for any structures within the regimes that are not orthogonal to the axes and thus take many splits to represent. When over-deep trees are produced, not only does the predictive accuracy suffer, but the interpretability also decreases with tree size. It can be very difficult for the analyst to deduce the underlying structure from the tree. It therefore makes sense to find the most powerful splits possible at each node--without sacrificing interpretability. There is a three-way trade-off between interpretability, the richness of the set of permissible splits, and the computational complexity of finding the best among the considered splits. Using bivariate splits is one good compromise point among these three criteria, since bivariate splits maintain interpretability, are much richer than univariate splits, represent many common real-world situations and, at least for small datasets, are computationally tractable. The issue of computational complexity is taken up further in the fourth section, where we discuss how more efficient algorithms can be derived for bivariate splits. Some of the situations in which univariate splits are insufficient while bivariate splits are all that is necessary are the following:
• Presence of a scale variable. One very commonly occurring case is that there is a single distinguished attribute in the dataset that scales some of the other attributes. Examples include age in measurements of ability, and weight in medical studies of drug efficacy. In each of these examples, the effect of the scale variable cannot be removed simply by dividing the other variables by the scale variable. Scaling may affect different variables in different ways. By allowing bivariate splits, the scale variable can be included as one of the variables whenever it is optimal to do so. Since we allow a very broad class of splitting rules, the scaling will be implicit and determined by the data.
• Thresholds. Another common case is that a variable only becomes active once a threshold value has been reached on another variable. For example, in predicting heart disease, weight may only be a factor if age is over 50 years. Again, the predicting variable (in this case weight) and its thresholding variable (in this case age) can appear as the two variables in a rule.
• Difference relationships. Only the difference of two variables may be useful for discrimination, even though neither is useful on its own. To pick another example from the heart-disease domain, let us say we measure resting and stressed pulse rates. The difference of the two might be much more useful than either one alone. In this case, the two pulse measurements would appear in the same rule. Though the difference would be a good single variable, bivariate splits can discover this fact automatically.
• Conjunctions. Two variables may only be important if they appear within certain ranges of values together. Again, from the heart-disease domain, high cholesterol and high weight might individually not predict heart disease, but together they may be dangerous.
Many other possible configurations of two variables exist. We will be advocating very broad classes of discrimination functions to accommodate them.

2.2. Other Approaches to Extending the Power of Splits
There have been a number of other approaches to extending the power of individual splits in re-
cursive partitioning schemes. Some of the proposed methods are as follows:
• Linear combinations. The Classification and Regression Trees (CART) book [3] suggests a
heuristic search over a number of candidate directions for linear combinations of variables that result in good splits.
• Discriminant analysis. In [5], the suggested method is to apply a modified discriminant analysis at each node.
• Neural nets. A similar approach is suggested by Utgoff [6], who applies a neural-net classification procedure at each node.
• Fringe. The Fringe system [7] combines splits near the fringe into new variables, adds these to the set of attributes considered, and then regrows a new tree.
The first three approaches mentioned above attempt to solve the problem of finding structure in data that will not be discovered by simple univariate splits. While their performance is, in certain cases, better than that achieved by univariate splits, each of these methods sacrifices the interpretability of the resulting classifier. On the other hand, the Fringe system will yield more interpretable rules, but does not extend the types of structures that can be discovered by trees.
3. Growing Bivariate Trees

To specify a strategy for growing bivariate trees, we must pick the set of splits to be considered, the split criterion, and the pruning methodology. The exact set of bivariate split types to be considered is discussed below. The split criterion we use initially is the Gini criterion [3], defined as

gini = (La Lb)/(La + Lb) + (Ra Rb)/(Ra + Rb).
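As an illustration, here is a small Python sketch of this criterion computed directly from the split parameters of table 1; the function name and the use of plain integer counts are assumptions made for the example, not part of the system described here.

    def gini(la, lb, ra, rb):
        """Gini criterion of a split, given the class counts going left and right."""
        left = la * lb / (la + lb) if (la + lb) > 0 else 0.0
        right = ra * rb / (ra + rb) if (ra + rb) > 0 else 0.0
        return left + right

    # For the root split of figure 1 (La, Lb / Ra, Rb = 72, 5 / 66, 160):
    print(gini(72, 5, 66, 160))  # roughly 4.68 + 46.73 = 51.4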
It is also necessary to use a more sophisticated strategy for choosing splits than simply picking the one among all of the considered splits--univariate and bivariate--that minimizes the split criterion. The reason for this is closely related to the need for pruning trees; overfitting degrades predictive accuracy, and just as there is a danger of overfitting with a tree that is too large, there
is a danger of fitting local structure at a particular node with a split that is too complex. This is discussed in more detail in the next section. The pruning algorithm used for both univariate and bivariate rules is unmodified pessimistic pruning [8].

3.1. Setting the Parameter to Choose Between Univariate and Bivariate Splits
To avoid the problem of local overfitting induced by always picking a bivariate split, we add a parameter to the tree-growing procedure that sets how freely the system can pick bivariate splits. A bivariate split is only picked if the improvement in the split criterion due to the bivariate split is at least f times the improvement due to the best univariate split. That is, only pick the bivariate split if

(d - b) > f(d - u)

where d is the value of the split criterion if no split is made, u is the minimum value of the split criterion associated with a univariate split, and b is the minimum value associated with a bivariate split. Since we want to pick bivariate splits only if they improve the split criterion more than univariate splits, only values of f greater than 1 make sense. Also, since each bivariate split has two comparisons, it would be too conservative to allow values of f greater than 2, so we restrict f ∈ [1, 2]. A series of experiments was run to evaluate different values of f. For the Gini split criterion, the best value of f for a set of datasets used in the design of the system was found to be f = 2. Though this value of f was found to work well, the value selected for f could also be set by the user. In some applications it might be desirable to pick a lower value of f to allow more
bivariate splits in order to get a shallower tree, possibly at the expense of some reclassification accuracy. Higher values of f could be used to prevent the choice of bivariate splits unless they are clearly necessary. The use of cross-validation to select good values of f for new problems allows users to find a balance between tree size, cross-validated accuracy, and interpretability of the resulting trees.
3.2. Evaluation of the Contribution of Each Type of Bivariate Split

We began by considering a large number of bivariate split families as candidates for inclusion in the final system for building bivariate trees. These candidates were then tested to see whether in fact their inclusion improved the predictive accuracy of the resulting trees. The initial set of split classes used is listed below. Each class definition is followed by an example of the type of rule it might generate.
• The four directions of corner split, e.g., ((x < a) ∧ (y < b)). A plant is healthy if it gets at least 10 ml of water and three hours of sunshine per day.
• Lines (y ≤ (ax + b)). Patients are not likely to get heart disease if (weight < 3*height - 200).
• Bivariate category ((x ∈ X) ∧ (y ∈ Y)). People who live in cities or large towns and have Japanese or European cars are more likely to vote Democrat.
• Categorical with numeric ((x > c) ∧ (y ∈ Y)). A mushroom may be poisonous if it is older than three months and belongs to certain families of mushroom species.
• Rectangles (note 1) ((a < x ≤ b) ∧ (c < y ≤ d)). The camshaft only works if (0.3 ≤ length ≤ 0.4) ∧ (1.2 ≤ width ≤ 1.3).
• Intervals (a < x ≤ b). A person has a healthy pancreas only if their blood-sugar level is within a certain range.
• Density-estimate based classifiers. The classes are defined by the decision rule
D(x, y, a) > D(x, y, b) where D(x, y, c) is a density-estimation function of cases of class c at point (x, y). Since density
estimates are very flexible, they can model decision regions not covered by the other types. Though interval splits are not bivariate, they were still considered for inclusion, since they are more complex than the traditional forms of splits but can be easily interpreted. Moreover, intervals require two splits from the traditional catalog of univariate splits. Each of these split types can be displayed in a plot or a table and can thus be easily interpreted. A series of experiments was run to evaluate the contribution of each class of bivariate splits to predictive accuracy. In each experiment, the traditional univariate algorithm was compared to an augmented algorithm in which only one of the classes of bivariate splits was included. The predictive accuracies of the two runs were compared on a number of datasets by cross-validation, and a simple t-test was used to determine whether the contribution due to the included bivariate split class was significant. Only corner, line, bivariate category, and categorical with numeric splits were found to improve the predictive accuracy. The rejected split types were interval, rectangle, and density-based splits. It seems that these splits allow too many parameters to be estimated at once, and overfitting cannot be avoided when these split types are allowed.
3.3. Bivariate Tree-Growing Procedure

In the empirical comparison described in the next section, we used the two univariate split families--univariate threshold (x > c) and univariate category (x ∈ A)--as well as the bivariate split families mentioned above, which were found to improve accuracy. At each node, the best split of each type was found for each pair of applicable variables. Thus the complexity of the computation of finding a new split is O(m²f(n)), where m is the number of attributes and f(n) is the complexity of minimizing all split types for a single pair of variables. For the split classes used, f(n) is O(n²), which is dominated by the cost of minimizing the corner and line splits. Since more cases are needed to fit a bivariate split than a univariate split, we also enforce the condition that a bivariate split can only be chosen if there are at least 50 cases at the node. A sketch of this per-node search appears below.
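The following Python sketch shows the shape of this per-node search under the stated assumptions. The helpers no_split_value, best_split_on_attribute, and best_split_on_pair are hypothetical stand-ins for the per-family optimizers (each assumed to return a (score, split) pair); d, u, and b follow the notation of section 3.1.

    from itertools import combinations

    def choose_split(cases, attributes, criterion, f=2.0, min_bivariate_cases=50):
        # d: value of the split criterion if no split is made.
        d = no_split_value(cases, criterion)
        # u: minimum criterion value over all univariate splits (one per attribute).
        u, best_uni = min((best_split_on_attribute(cases, a, criterion)
                           for a in attributes), key=lambda t: t[0])
        if len(cases) < min_bivariate_cases:
            return best_uni
        # b: minimum criterion value over all bivariate splits; the pairwise loop
        # is the source of the O(m^2 f(n)) cost quoted above.
        b, best_biv = min((best_split_on_pair(cases, a1, a2, criterion)
                           for a1, a2 in combinations(attributes, 2)),
                          key=lambda t: t[0])
        # Prefer the bivariate split only if its improvement is at least f times
        # the improvement of the best univariate split (section 3.1, f in [1, 2]).
        return best_biv if (d - b) > f * (d - u) else best_uni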
3.4. Description of Datasets
Table 2 describes the datasets used in the experiments. Most were chosen from the UCI (University of California at Irvine) machine learning repository (note 2), with the only requirement being that they have two classes, or two dominant classes. If there were more than two classes, the cases associated with the small classes were deleted; also, any cases containing missing values were deleted. The datasets are split into two groups: those used in the design of the system and methodology, and a separate group of datasets used only later in testing the system once all design decisions had been made and parameters had been chosen.
3.5. Results of Empirical Comparison

The results are presented in table 3. Each value in the table is the mean of four cross-validation runs, each with a different random partition of the data. The datasets were partitioned into ten parts for each run, except for the diabetes dataset, which was run with fourfold cross-validation due to its size.
Table 2. Description of datasets

Dataset            Num. attrs  Symb. attrs  Cases in class a  Cases in class b  Domain                                           Source
Datasets used in design phase
bupa               6           0            145               200               Effect of alcohol consumption on liver disease   [9]
glass              9           0            87                76                Identification of glass manufacturing process    [10]
biomed             6           7            161               137               Liver disease diagnosis                          [11]
cleveland          6           7            161               137               Heart disease diagnosis                          [4]
lymphography       0           18           81                61                Lymphography diagnosis                           [12]
diabetes           8           0            500               268               Diagnosis of diabetes in Pima Indians            [13]
hepatitis          6           13           123               32                Hepatitis diagnosis                              [14]
Datasets used in test phase
credit-screening   6           10           296               357               Screening of credit applicants                   [15]
echocardiogram     5           2            44                17                Echocardiogram recognition                       [16]
horse-colic        5           9            61                90                Horse colic diagnosis                            [17]
multiplex          12          1            189               811               Artificial multiplexor data                      [18]
tic-tac-toe        0           10           332               626               Tic-Tac-Toe end games                            [19]
wine               13          1            71                59                Wine recognition data                            [20]

Note: Most datasets were taken from the UCI repository and are available by anonymous FTP from the directory pub/machine-learning-databases at ics.uci.edu.
Table 3. Results of cross-validation runs where both traditional and BITS trees were optimized under the Gini function. The probabilities are the results of a Wilcoxon signed-rank test for difference in means.

(a) Predictive accuracy

Dataset            BITS Gini  Trad Gini  Difference  Prob. Eq. Means
Datasets used in design phase
biomed             86.2       83.3       2.928       0.0581*
bupa               69.7       67.7       2.040       0.3112
lymphography       82.4       80.8       1.583       0.6433
cleve              80.0       79.5       0.431       0.8125
glass              81.4       81.1       0.267       0.9806
hepatitis          78.4       77.7       0.677       0.8163
diabetes           74.5       74.3       0.260       0.8351
Datasets used in test phase (CBP = 0.1875)
credit-screening   85.4       84.6       0.842       0.355
echocardiogram     73.7       77.8       -4.10       0.229
horse-colic        81.8       80.5       1.30        0.570
multiplex          97.5       95.6       1.94        0.011*
tic-tac-toe        93.5       93.5       0.026       0.953
wine               96.5       96.5       0.000       1.00

(b) Weighted size

Dataset            BITS Gini  Trad Gini  Difference  Prob. Eq. Means
Datasets used in design phase
biomed             3.95       5.63       -1.67       1.02e-004*
bupa               15.07      18.10      -3.03       8.76e-005*
lymphography       6.22       8.63       -2.40       3.27e-007*
cleve              7.78       10.45      -2.67       3.78e-005*
glass              5.08       6.90       -1.83       4.39e-007*
hepatitis          4.15       6.17       -2.02       1.13e-005*
diabetes           22.69      26.81      -4.13       1.55e-002*
Datasets used in test phase (CBP = 0.031*)
credit-screening   8.41       10.5       -2.08       0.078*
echocardiogram     2.07       2.60       -0.525      0.000*
horse-colic        2.80       3.62       -0.825      0.002*
multiplex          13.6       19.4       -5.75       0.000*
tic-tac-toe        24.7       26.9       -2.16       0.057*
wine               1.97       1.97       0.000       1.00

*Values below 0.1 are considered significant.
The trees grown with bivariate splits using the procedure described above are compared to a reimplementation of a traditional tree-growing procedure with univariate splits. The table is divided into two sections. The top section contains results for the seven datasets that were used in the design of the BITS (BIvariate TreeS) methodology. The bottom section contains results for a set of six test datasets that were run after the methodology was fixed.
The values for BITS and Trad. (traditional univariate algorithm) in table 3a are the means of the predictive accuracy for the cross-validation runs, for bivariate and traditional trees, respectively. In table 3b, the values are the means of the weighted sizes of the pruned trees. The weighted size counts the number of comparisons in the internal nodes of the trees. A univariate split counts one, and a bivariate split counts two. The values from the runs are compared with a nonparametric Wilcoxon signed-rank test [21]. The probabilities that the two means are the same for the given data are shown in the last column of the table. A value below 0.1 is considered significant and is indicated by a trailing asterisk in the tables. There is also a summary statistic for each set of values for the test datasets. This statistic treats each dataset as a binomial trial, with a positive outcome if the sign of the difference is positive, and likewise a negative outcome if the sign is negative. Datasets for which the difference is zero are not included in the comparison. The probability quoted in parentheses for each set is the probability of getting at least that many positive outcomes under the null hypothesis of P(positive outcome) = 0.5. This statistic combines the results over multiple datasets and gives a way of discussing the overall effect of introducing bivariate splits. We refer to this statistic as the Combined Binomial Probability or CBP and consider values of the CBP below 0.1 as significant.
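The CBP is simply a one-sided binomial tail probability. The following Python sketch (the function name is ours, not the paper's) reproduces the values quoted in table 3: 0.1875 for the test-set accuracy comparison, where four of the five nonzero differences are positive, and 0.031 for the size comparison, where all five nonzero differences have the same sign (improvements for BITS counted as positive).

    from math import comb

    def cbp(differences):
        """Combined Binomial Probability: P(at least the observed number of
        positive signs) under the null P(positive) = 0.5, ignoring zeros."""
        nonzero = [d for d in differences if d != 0]
        n = len(nonzero)
        k = sum(1 for d in nonzero if d > 0)
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    # Accuracy differences for the six test datasets under Gini (table 3a):
    print(cbp([0.842, -4.10, 1.30, 1.94, 0.026, 0.000]))   # 0.1875
    # Weighted-size differences, with improvements for BITS counted as positive:
    print(cbp([2.08, 0.525, 0.825, 5.75, 2.16, 0.000]))    # 0.03125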
3.6. Summary of Results under Gini

For each dataset except for echocardiogram, a small to moderate improvement in predictive accuracy is achieved by BITS. This improvement is statistically significant for the biomed and multiplex datasets (one is from the design set and one from the test set). The echocardiogram dataset has only 61 cases, so it seems that it may be dangerous to use the BITS methodology on small datasets. The CBP for predictive accuracy is 0.1875, which is not significant. On the other hand, the improvement in the size of the trees due to BITS is quite dramatic. For all datasets--except the wine dataset where
there is no change--the improvement in the size of the pruned trees is significant at the 1% level. The trees often have 30% fewer comparisons. The CBP is 0.031 and is thus also significant. Since the pruned trees are significantly smaller, and since except for the echocardiogram data there is always an improvement in accuracy, at least for the datasets used, it appears that bivariate trees dominate univariate trees under Gini and should thus be the method of choice in growing classification trees for two-class problems. Under Gini, only the brute-force algorithms can be used to minimize the split criterion; therefore, depending on the size of the dataset and the power of the available hardware, this minimization may not be feasible. The rest of this article discusses how, by using a new split criterion, we can derive more efficient algorithms to overcome this difficulty.
4. The Need for an Additive Selection Criterion
In traditional tree-growing procedures, finding the best split at a node involves finding a single optimal split point on each variable and selecting the best one. If there are n cases, then at most n - 1 split points need be considered for each variable. This can be done in O(n log n) time by sorting and scanning, as is shown in figure 2. However, when more complex split families are considered, exhaustively testing each possible split may not be a viable approach with large datasets. For example, if a numeric variable has n values, then the best method we know of to find the interval that minimizes the Gini criterion is exhaustive consideration of all possible splits. This is an O(n²) operation, since there are n(n - 1)/2 possible intervals. One approach to finding the optimal split of this and other complex split families more efficiently is to use divide-and-conquer algorithmic techniques.
    na is the number of cases of class a
    nb is the number of cases of class b
    nal, nbl are the numbers of cases of each class to the left of the current cut
    c is the vector of classes of the n cases

    sort the classes on the values of the attribute
    nal = 0
    nbl = 0
    Best = infinity
    for i from 1 to n
        if c[i] = 'a' then nal = nal + 1
        else nbl = nbl + 1
        CurrentVal = splitfun(nal, nbl, na - nal, nb - nbl)
        if CurrentVal < Best then Best = CurrentVal

Fig. 2. A simple sorting and scanning algorithm for finding the best cut on a single numeric attribute.
To do this, the split criterion we use must be additive in the following sense: given two disjoint sets of cases, C1 and C2, with associated splits S1 and S2, where S1 splits the cases of C1 into the two sets L1 and R1 and S2 splits the cases of C2 into L2 and R2, the split criterion f is additive if f(L1 ∪ L2, R1 ∪ R2) = f(L1, R1) + f(L2, R2). An example of an additive split criterion is a modified version of inaccuracy (the number of cases misclassified), which we call signed inaccuracy. There are two, individually additive, versions of signed inaccuracy, defined as follows: inacc1 = La + Rb, and inacc2 = Ra + Lb. Minimizing inacc1 assumes the majority of cases that go left are of class b, and the majority of cases going right are of class a. The number of misclassified cases is therefore La + Rb. The converse case is taken care of by inacc2. Inaccuracy, which is defined as inacc = min(La, Lb) + min(Ra, Rb), also allows for the two other cases, when one class dominates on both sides of the split. We can therefore rewrite inaccuracy as inacc = min(inacc1, inacc2, Na, Nb), and minimize inacc by first minimizing inacc1, then minimizing inacc2, and then comparing these values to the total number of cases in each class.
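A minimal Python sketch of these quantities, under the assumption that a split is represented simply by its four class counts, is shown below; it also illustrates the additivity property on two arbitrary disjoint groups.

    def inacc1(la, lb, ra, rb):
        # Errors if the left child is labeled b and the right child is labeled a.
        return la + rb

    def inacc2(la, lb, ra, rb):
        # Errors if the left child is labeled a and the right child is labeled b.
        return ra + lb

    def inacc(la, lb, ra, rb):
        # Plain inaccuracy: each child is labeled with its own majority class.
        return min(la, lb) + min(ra, rb)

    # Additivity: combining splits of two disjoint groups adds their scores.
    g1 = (3, 1, 2, 4)   # La, Lb, Ra, Rb for group C1 under split S1
    g2 = (5, 0, 1, 2)   # La, Lb, Ra, Rb for group C2 under split S2
    combined = tuple(a + b for a, b in zip(g1, g2))
    assert inacc1(*combined) == inacc1(*g1) + inacc1(*g2)
    assert inacc2(*combined) == inacc2(*g1) + inacc2(*g2)
    # Plain inaccuracy is not additive in general, which is why the two signed
    # versions are minimized separately and then compared with Na and Nb.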
4.1. Finding the Optimal Interval under Inaccuracy

As an example of how an additive split criterion can be exploited in speeding up an algorithm, we show how inacc1 and inacc2 can be used to find the split interval that minimizes inaccuracy in O(n log n) time. While interval splits were rejected in section 3.2 as not being sufficiently selective, this algorithm is simpler than others that take advantage of the additive split criterion and illustrates why the additive criterion is useful. Details of other more complex algorithms for lines, corners, and rectangles are contained in [22]. The method for finding the optimal interval is based on the algorithm for finding the maximal consecutive sum of a sequence, i.e., given a sequence of values v1 ... vn, find l, r such that the sum v_l + ... + v_r is maximized. This problem is discussed in [23] and has an O(n) solution. A modified version of this algorithm is reproduced in figure 3. To map the problem of finding the optimal interval of a numeric attribute x to maximal consecutive sum, we first order all cases by the value of x, which is an O(n log n) operation, and then replace cases of class a by 1 and those of class b by -1, giving a vector y. The sequence of maximal sum of y corresponds to the interval that minimizes inacc2. If s = y_l + ... + y_r, then s = La - Lb, and La = Na - Ra, so s = -(Ra + Lb) + Na; since Na is a constant at the node, maximizing s is equivalent to minimizing inacc2. By substituting -1 for a's and 1 for b's, we can minimize inacc1. Table 4 summarizes other algorithmic results described in [22] and gives indications of where there is room for further research.

    MaxSoFar = 0
    MaxEndingHere = 0
    CurrentLeft = 1
    for i from 1 to n
        MaxEndingHere = MaxEndingHere + y[i]
        if MaxEndingHere < 0 then
            CurrentLeft = i + 1
            MaxEndingHere = 0
        if MaxSoFar < MaxEndingHere then
            Left = CurrentLeft
            Right = i
            MaxSoFar = MaxEndingHere
Fig. 3. Algorithm for interval of greatest sum.
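As a sketch of how this maps onto the interval-split problem (the data layout and function name are assumptions for illustration), the following Python code sorts the cases by the attribute, applies the maximal-consecutive-sum scan of figure 3, and returns the interval of attribute values that minimizes inacc2.

    def best_interval(cases):
        """cases: list of (x_value, cls) with cls in {'a', 'b'}.
        Returns (left_x, right_x, inacc2) for the interval minimizing inacc2."""
        cases = sorted(cases)                       # O(n log n)
        y = [1 if cls == 'a' else -1 for _, cls in cases]
        na = sum(1 for v in y if v == 1)

        max_so_far, max_ending_here, current_left = 0, 0, 0
        left, right = 0, -1
        for i, v in enumerate(y):                   # O(n) scan of figure 3
            max_ending_here += v
            if max_ending_here < 0:
                current_left, max_ending_here = i + 1, 0
            if max_so_far < max_ending_here:
                left, right, max_so_far = current_left, i, max_ending_here

        if right < 0:
            return None  # no interval improves on the empty interval
        # s = La - Lb = Na - (Ra + Lb), so inacc2 = Na - max_so_far.
        return cases[left][0], cases[right][0], na - max_so_far

    # Example: a dataset perfectly separable by one interval on x.
    print(best_interval([(1, 'b'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'b')]))
    # -> (2, 4, 0): the interval 2 <= x <= 4 contains all a's and no b's.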
4.2. Extending to Weighted Inaccuracy

In the previous section, we showed how finding the optimal interval could be reduced from an O(n²) to an O(n log n) operation by using the inaccuracy criterion rather than one of the traditional nonadditive criteria. While minimizing inaccuracy is locally the best criterion for improving accuracy, it is well known that inaccuracy is not a good selection criterion for use in generating splits for growing classification trees. This limitation was first mentioned in [24]. The problem with inaccuracy arises when one class is much more prevalent than the other. For example, in figure 4 there are eight x's and three o's indicating the points from the two classes. Any cut has inaccuracy of three, misclassifying all the o's. However, one of the cuts at S1 or S2 would be the best, since this could then be followed by the other, leading to perfect classification. The cut at S1 does in fact minimize both the Gini and entropy (note 3) measures. What we would like to find is an additive criterion that still selects cuts such as S1.

[Figure 4 appears here: eight x's and three o's arranged along a line, with candidate cuts S1 and S2 on either side of the group of o's.]
Fig. 4. A case in which the value of the inaccuracy criterion is constant for each cut.

Assume we have a split in which La > Lb and Rb > Ra. We can get some insight into why the Gini function selects cuts such as S1 by arranging the terms of the function as follows:

gini = (La/(La + Lb)) Lb + (Rb/(Ra + Rb)) Ra.

Writing the function this way shows that the Gini function is a type of weighted inaccuracy function, since Lb and Ra are the two inaccuracies for this split, and the weights are equal to the proportion of the dominant class in the set. If there is a large proportion of the dominant class, then misclassifications tend to get weighted more heavily. The smallest weight of 1/2 is achieved when each class is equally represented. This accounts for Gini's propensity for choosing large pure splits (note 4), which makes it such a successful split criterion.
Table 4. Summary of results in deriving efficient split algorithms

Split type                 Split expression             Brute-force complexity  Complexity of new algorithm  Lower bound
Interval split             l < x ≤ r                    O(n²)                   O(n log n)                   Ω(n log n)
Corner split               (x ≤ a) ∧ (y ≤ b)            O(n²)                   O(n log n)                   Ω(n log n)
Line split                 y > ax + b                   O(n³)                   O(n²)                        Ω(n log n)
Rectangle split            (a < x ≤ b) ∧ (c < y ≤ d)    O(n³)                   O(n²)                        Ω(n log n)
Categorical conjunction    (x ∈ C) ∧ (y ∈ R)            O(2^(2m))               O(2^m)                       Ω(2^m), assuming P ≠ NP
Categorical with numeric   (x ⊙ c) ∧ (y ∈ S)            O(n 2^c)                O(n log n + cn)              Ω(n log n)

Note: The value of m in the categorical conjunction split is the minimum of the number of categories in x and y. The value c in the categorical with numeric split is the number of categories in y. The ⊙ operator can be either ≤ or >.
The reason that Gini is not additive is that with each considered split the proportions La/(La + Lb) and Ra/(Ra + Rb) change. However, we can define a variation of inaccuracy--weighted inaccuracy or wacc--as

wacc = min(kLa, Lb) + min(kRa, Rb),

for any constant k > 0, where k is the ratio of the cost of misclassification of type a to type b. The two corresponding additive subfunctions, wacc1 and wacc2, are defined as

wacc1 = kLa + Rb

and

wacc2 = kRa + Lb.

Wacc can also be written as wacc = min(wacc1, wacc2, kNa, Nb). In order to maintain additivity, k must be fixed when selecting among splits at a node. Therefore, we cannot vary k so that this function exactly matches Gini, but at each node we can choose k, based only on Na and Nb, to match Gini as closely as possible. In [25] we derive an optimal value of k under certain conditions, and we give a theoretical justification for this choice. We also show that this split criterion works surprisingly well in a large empirical comparison with the traditional split criteria. The weight selected in [25] is

k = Nb/Na.

This is a simple and natural choice, weighting the a's by their proportion. Using the values from figure 4, k = 3/8, since there are 8 a's and 3 b's. The weighted inaccuracy measure in this case is

wacc = min(3/8 La, Lb) + min(3/8 Ra, Rb).

The wacc score for split S1 in figure 4 is min(15/8, 0) + min(9/8, 3) = 9/8, and for split S2 it is min(15/8, 3) + min(9/8, 0) = 15/8, so split S1 is preferred to S2, and is in fact the optimal split under wacc. Note that wacc is only defined for two-class datasets, which is the reason for originally restricting our attention to such problems.
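A small Python sketch of wacc with the weight k = Nb/Na, reproducing the figure 4 example (the function name is ours, not the paper's):

    def wacc(la, lb, ra, rb, k):
        # Weighted inaccuracy: class-a errors are weighted by k.
        return min(k * la, lb) + min(k * ra, rb)

    # Figure 4: eight a's (x's) and three b's (o's), so k = Nb/Na = 3/8.
    k = 3 / 8
    s1 = wacc(5, 0, 3, 3, k)   # cut S1: (La, Lb) = (5, 0), (Ra, Rb) = (3, 3)
    s2 = wacc(5, 3, 3, 0, k)   # cut S2: (La, Lb) = (5, 3), (Ra, Rb) = (3, 0)
    print(s1, s2)              # 1.125 (= 9/8) and 1.875 (= 15/8): S1 is preferred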
4.3. Split Criterion for Bivariate Trees

In growing trees with bivariate splits, we use the wacc.gini split criterion

wacc.gini = max(1/k, 1) wacc + gini/n.

This criterion has better performance on the design datasets than plain wacc, since it is possible
for many different splits to have the same minimum value under wacc, and using Gini as the tie-breaker (since gini/n is less than one) asserts a preference for large pure splits. Though wacc.gini is not an additive split criterion, it can still be minimized by the efficient algorithms for minimizing wacc presented in [22]. This is due to the nature of the algorithms, in that they keep a current best split that is updated as the algorithm proceeds. Whenever a split is considered that has a wacc value equal to the current best, its Gini score can be evaluated, and the new split only replaces the current best if its wacc.gini score is lower than that of the current best. Since the algorithms presented are guaranteed to visit all splits that minimize wacc, the result will be the split that minimizes wacc.gini.
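The update rule can be sketched as follows in Python; wacc_of and gini_of are hypothetical helpers returning the two scores of a candidate split.

    def update_best(best, candidate):
        # Keep a running best split; break ties in wacc with the Gini score.
        # By the argument above, this yields the split minimizing wacc.gini.
        if best is None or wacc_of(candidate) < wacc_of(best):
            return candidate
        if wacc_of(candidate) == wacc_of(best) and gini_of(candidate) < gini_of(best):
            return candidate
        return best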
4.4. Summary of Results where BITS Uses wacc.gini and the Traditional Method Uses Gini

In this final comparison, we compare the BITS methodology under wacc.gini with the traditional tree-growing results as presented in table 3. The results of this comparison (table 5) are presented in the same format as table 3.
Table 5. Results of cross-validation runs where BITS was optimized under wacc.gini and the traditional trees were optimized under Gini

(a) Predictive accuracy

Dataset            BITS wacc.gini  Trad Gini  Difference  Prob. Eq. Means
Datasets used in design phase
biomed             84.1            83.3       0.875       0.6026
bupa               68.4            67.7       0.664       0.6783
lymphography       80.5            80.8       -0.357      0.8722
cleve              77.3            79.5       -2.172      0.2297
glass              86.4            81.1       5.331       0.0205*
hepatitis          81.3            77.7       3.531       0.1735
diabetes           73.3            74.3       -0.977      0.5448
Datasets used in test phase (CBP = 0.031*)
credit-screening   84.7            84.6       0.115       0.884
echocardiogram     78.2            77.8       0.416       0.901
horse-colic        82.7            80.5       2.28        0.261
multiplex          95.6            95.6       0.024       0.953
tic-tac-toe        93.6            93.5       0.156       0.794
wine               96.5            96.5       0.000       1.00

(b) Weighted size

Dataset            BITS wacc.gini  Trad Gini  Difference  Prob. Eq. Means
Datasets used in design phase
biomed             6.33            5.63       0.70        0.09640*
bupa               19.18           18.10      1.07        0.16672
lymphography       8.57            8.63       -0.05       0.97282
cleve              9.85            10.45      -0.60       0.40525
glass              7.80            6.90       0.90        0.00355*
hepatitis          4.95            6.17       -1.22       0.00127*
diabetes           28.00           26.81      1.19        0.47206
Datasets used in test phase (CBP = 0.5)
credit-screening   11.4            10.5       0.916       0.747
echocardiogram     2.55            2.60       -0.050      0.768
horse-colic        4.87            3.62       1.25        0.000*
multiplex          17.8            19.4       -1.58       0.308
tic-tac-toe        26.5            26.9       -0.416      0.636
wine               1.97            1.97       0.000       1.00
The predictive accuracies for the design datasets are mixed, with four favoring BITS and three favoring the traditional method; the only significant difference is for the glass dataset and favors BITS. For the test datasets, however, all five for which there was a difference favored BITS, giving a significant CBP of 0.031. The tree size comparisons for the test datasets are mixed: the means for three datasets are larger with univariate splits, two are larger under BITS, and one is identical. The comparison for the design datasets is similar.
5. Summary of Empirical Results

In evaluating the overall results, we must put them in the perspective of general research in machine learning. While many new methods are proposed for learning tasks, none can consistently outperform all other methods, since each method has its own bias and set of datasets for which it will be optimal. In the same way, using bivariate splits only improves performance where these splits are most clearly needed. The empirical results in the previous sections show that bivariate splits do often improve both the accuracy and size of the resulting trees, but there are cases in which performance is worsened by introducing the bivariate splits. Under Gini, the results are clear: except for the one very small dataset, accuracy is somewhat improved and tree size is much decreased by bivariate splitting. Under wacc.gini, however, there are three large improvements in predictive accuracy with bivariate splitting, even larger than were achieved under Gini. The combined trend in favor of BITS in the test data is significant, so this is another clear positive result. However, tree size was not improved in general and was significantly worse in some cases. It is interesting that the datasets for which the greatest improvements were achieved under Gini and wacc.gini are different, so it may be necessary to try both methods. Fortunately, cross-validation provides us with an unbiased method of choosing among the various tree-growing methodologies. So, the best advice would be to first test the cross-validated accuracy with bivariate splits under wacc.gini in
the hope of getting a big improvement such as those we achieved in the experiment. Then, compare these results to the cross-validated accuracy of a tree grown with bivariate splits under Gini if this is feasible, or alternatively to the accuracy of a univariate tree if the dataset is too large.
Notes

1. While corners are a subclass of rectangles, we include them as well to evaluate how they contribute to predictive accuracy, independently from rectangles.
2. The multiplex and bupa datasets are from other sources (see references for details).
3. The entropy measure is another popular split criterion, used in ID3 [2] and its descendants.
4. A split is pure if one of its branches is composed of cases from only one class.
References

1. Sholom Weiss and Casimir Kulikowski, Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann: San Mateo, CA, 1990.
2. J.R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
3. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone, Classification and Regression Trees, Wadsworth: Belmont, CA, 1984.
4. Robert Detrano, Cleveland heart disease data, Cardiology 111-C, V.A. Medical Center, 5901 E. 7th Street, Long Beach, CA 90028. From the UCI Machine Learning repository.
5. W.Y. Loh and N. Vanichsetakul, "Tree-structured classification via generalized discriminant analysis," J. Am. Statist. Assoc., vol. 83, no. 403, pp. 715-725, 1988.
6. Paul E. Utgoff, "Perceptron trees: A case study in hybrid concept representation," in Proc. AAAI, 1988, pp. 601-606.
7. Giulia Pagallo, "Learning DNF by decision trees," in Eleventh Int. Joint Conf. Artif. Intell., vol. 1, pp. 639-644, 1989.
8. J.R. Quinlan, "Simplifying decision trees," in Proceedings of Knowledge Acquisition for Knowledge Based Systems Workshop, Banff, Canada, 1986.
9. Richard S. Forsyth, Bupa liver disorders, 8 Grosvenor Avenue, Mapperley Park, Nottingham NG3 5DX, 1990. From the UCI Machine Learning repository.
10. B. German, Glass data, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN. From the UCI Machine Learning repository.
11. Statlib, Liver disease diagnoses. From the Carnegie Mellon University Statistics Library.
12. M. Zwitter and M. Soklic, Lymphography data, University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. From the UCI Machine Learning repository.
13. National Institute of Diabetes and Digestive and Kidney Diseases, Pima Indians diabetes data. From the UCI Machine Learning repository, 1990.
14. Bojan Cestnik, Hepatitis data, Jozef Stefan Institute, Jamova 39, 61000 Ljubljana, Yugoslavia. From the UCI Machine Learning repository.
15. Chiharu Sano, Japanese credit screening (examples and domain theory). From the UCI Machine Learning repository.
16. Evlin Kinney, Echocardiogram data, The Reed Institute, P.O. Box 402603, Miami, FL 33140-0603. From the UCI Machine Learning repository.
17. Mary McLeish and Matt Cecile, Horse colic database, Department of Computer Science, University of Guelph, Guelph, Ontario, Canada N1G 2W1. From the UCI Machine Learning repository.
18. Jason Catlett, Real-valued version of the multiplexor function. Private correspondence.
19. David W. Aha, Tic-tac-toe endgame database. From the UCI Machine Learning repository, 1991.
20. M. Forina, Wine recognition data. From the UCI Machine Learning repository.
21. M. Hollander and D. Wolfe, Non-parametric Statistical Methods, Wiley: New York, 1973.
22. David J. Lubinsky, "Bivariate splits and consistent split criteria in dichotomous classification trees," Ph.D. thesis, Department of Computer Science, Rutgers University, New Brunswick, NJ, 1994.
23. John Bentley, Programming Pearls, ACM: New York, 1980.
24. Robert Messenger and Lewis Mandell, "A model search technique for predictive nominal scale multivariate analysis," J. Am. Statist. Assoc., vol. 67, pp. 768-772, 1972.
25. David J. Lubinsky, "The use of additive split criteria in speeding up classification trees," in Fourth Int. Workshop Artif. Intell. Statist., Fort Lauderdale, FL, 1993.

David Lubinsky is a Senior Lecturer in the Department of Computer Science at the University of The Witwatersrand in Johannesburg. He holds a doctorate from Rutgers University (Computer Science) and an MSc degree from the Witwatersrand (Statistics). He was previously a member of technical staff at AT&T Bell Laboratories. His current research interests are in the areas of computational geometry, machine learning, and statistical computing.