Statistics and Computing (1997) 7, 209–216
A fast splitting procedure for classification trees

FRANCESCO MOLA and ROBERTA SICILIANO

Dipartimento di Matematica e Statistica, Università degli Studi di Napoli Federico II, Monte S. Angelo, via Cintia, 80126 Napoli, Italy
[email protected] [email protected]
Received August 1995 and accepted April 1997
This paper provides a faster method to find the best split at each node when using the CART methodology. The predictability index τ is proposed as a splitting rule for growing the same classification tree as CART does when using the Gini index of heterogeneity as an impurity measure. A theorem is introduced to show a new property of the index τ: the τ for a given predictor has a value not lower than the τ for any split generated by the predictor. This property is used to make a substantial saving in the time required to generate a classification tree. Three simulation studies are presented in order to show the computational gain in terms of both the number of splits analysed at each node and the CPU time. The proposed splitting algorithm also proves computationally efficient on real data sets, as shown in an example.

Keywords: CART, predictability index τ, Gini index of heterogeneity, maximal binary tree
1. Introduction

The problem of constructing a classification rule has been addressed with various approaches (Hand, 1981), namely linear discriminant analysis, quadratic discriminant analysis, the kernel density method and the k-nearest neighbour method. We follow the approach of tree-structured classification; see, for instance, Clark and Pregibon (1992). This consists of a recursive sequence of binary divisions of N cases belonging to J groups into two subgroups. The goal is to grow a binary tree which defines a classification rule for new cases. In this paper, we focus our attention on the splitting algorithm and the time required to build and validate the tree. Our framework is the well-known CART methodology (Classification and Regression Trees) of Breiman et al. (1984). The CART splitting procedure is based on the concept of node impurity and uses the decrease in impurity as a splitting criterion to divide, at each node, a group of cases into two subgroups which are internally most homogeneous and externally most heterogeneous. A commonly used impurity measure is the Gini index of heterogeneity (Section 2). Alternative splitting criteria have recently been proposed; see, for instance, Quinlan (1986); Kwok and Carter (1990); Buntine (1992); Taylor and Silverman (1993);
Ciampi (1994); Mola et al. (1996); Siciliano and Mola (1997a, 1997b); Mola and Siciliano (1997). Some methods to accelerate the splitting procedure have also been proposed (Loh and Vanichsetakul, 1988; Mola and Siciliano, 1992, 1994), although such methods do not grow the same classification tree as CART does.

The original contribution of this paper is to provide a faster method to find the best split at each node when using the CART methodology. To this purpose, we propose the predictability index τ of Goodman and Kruskal (1954) as a splitting rule. As a first result, we show that maximizing the index τ at each node is equivalent to maximizing the decrease in impurity as measured with the Gini index (Section 3). In this way, we give an alternative interpretation of the CART splitting procedure in terms of the predictability index τ. As a more relevant result, we introduce a new property of the index τ that allows us to find the best split at each node without necessarily trying out all possible splits (Section 4). Indeed, this property is used to derive a splitting algorithm that finds the maximal binary tree more quickly than CART does (Appendix). We present three simulation studies to show the computational gain in terms of both the number of splits analysed at each node and the CPU time (Section 5). Through an example we discuss how the proposed splitting algorithm proves to be computationally very efficient in real data sets; in fact, the maximal binary tree is grown by inspecting overall a very low number of splits and thus using less computing time (Section 6).
2. The CART splitting procedure

Let (Y, X) be a multivariate random variable, where X is the matrix of M predictors taking values in a measurement space contained in R^M and Y is the criterion variable taking values in the set of prior classes C = {1, ..., j, ..., J}. We denote by X a general predictor. The construction of a classification rule is based on a learning sample L of N cases, L = {(y_n, x_n); n = 1, ..., N}, taken from the distribution of (Y, X). A classifier is a function d(x) defined on the measurement space so that, for every x, d(x) is equal to one of the J classes. Binary tree classifiers are grown by repeated splits of subsets of the measurement space into descendant subsets, beginning with the whole space itself. Because of the evident analogy with graph theory, a subset of the measurement space is called a node t. CART first finds the maximal binary tree (the tree with very few cases in the most homogeneous terminal nodes) by a splitting procedure and then an honest-size tree by a pruning procedure; see also Hastie and Pregibon (1990); Oliver and Hand (1995). A test sample or an m-fold cross-validation is used to validate the classification rule.

The CART splitting criterion is based on the concept of node impurity, and a commonly used impurity measure is the Gini index of heterogeneity

    i(t) = 1 − Σ_{j=1}^{J} p_t²(j)                                   (1)

where p_t(j) is the node proportion of cases belonging to class j. A split s of a node t sends a proportion p_{tr} of the data cases at node t to the right node t_r and a proportion p_{tl} to the left node t_l. The best split at each node maximizes the decrease in impurity when passing from one group to two subgroups:

    Δi(s, t) = i(t) − p_{tl} i(t_l) − p_{tr} i(t_r)                   (2)

where i(t_l) and i(t_r) are the impurity measures at the sub-nodes t_l and t_r, respectively. The idea in CART is to calculate Δi(s, t) for each split s ∈ Q_t, where Q_t is the set of all possible splits at node t, and then to select the best split s* such that

    Δi(s*, t) = max_{s ∈ Q_t} Δi(s, t)                                (3)

As a matter of fact, a drawback of tree methods is the effort required to build the maximal tree, which depends on the cardinality of Q_t. A categorical predictor X having I categories can generate 2^{I−1} − 1 splits, associated with questions of the form 'Is X in B, or in the complement of B?', where B is a subset of the I categories. An ordered or numerical predictor X with N distinct values generates N − 1 splits, associated with questions of the form 'Is X below, or above, x?', where x is a cutpoint. In CART all possible splits are necessarily inspected to find the best split at each node. In the following sections, we propose a faster method to find the best split using a subset Q′_t ⊂ Q_t.
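To illustrate the computational burden just described, the following Python sketch (our own illustration, not code from the paper, whose procedures were implemented in MATLAB) enumerates all 2^{I−1} − 1 dichotomies of one categorical predictor and evaluates the decrease in impurity (2) for each; the function and variable names are ours.

    # A minimal sketch of the exhaustive CART-style search described above:
    # for one categorical predictor, every dichotomy of its I categories is
    # tried and the decrease in impurity (2) is computed.
    from itertools import combinations
    from collections import Counter

    def gini(labels):
        """Gini index of heterogeneity i(t) = 1 - sum_j p_t(j)^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split_exhaustive(x, y):
        """Return (best_subset, best_decrease) over all 2^(I-1) - 1 dichotomies."""
        cats = sorted(set(x))
        n, i_t = len(y), gini(y)
        best, best_dec = None, -1.0
        # fix the first category on the left so each dichotomy is enumerated once
        for size in range(1, len(cats)):
            for left in combinations(cats[1:], size - 1):
                B = {cats[0], *left}
                yl = [yi for xi, yi in zip(x, y) if xi in B]
                yr = [yi for xi, yi in zip(x, y) if xi not in B]
                dec = i_t - len(yl) / n * gini(yl) - len(yr) / n * gini(yr)
                if dec > best_dec:
                    best, best_dec = B, dec
        return best, best_dec

    # toy usage with hypothetical data
    x = list("aabbccabca")
    y = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
    print(best_split_exhaustive(x, y))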
3. The predictability index τ as a splitting rule in CART

Originally, Goodman and Kruskal (1954) introduced the index τ to measure the relative decrease in the proportion of incorrect predictions, or equivalently the relative increase in the proportion of correct predictions, as we pass from a case of no information to a case of information about the I categories of a given predictor X. Another interpretation of the index τ is in terms of analysis of variation for categorical data (Light and Margolin, 1971; Margolin and Light, 1974). We denote by p_t(i), for i = 1, ..., I, the proportion of cases in node t that have category i of X, and by p_t(j|i), for j = 1, ..., J, the proportion of cases in node t that belong to class j of Y given that they have category i of X. The index τ at node t is defined as

    τ_t(Y|X) = [Σ_i Σ_j p_t²(j|i) p_t(i) − Σ_j p_t²(j)] / [1 − Σ_j p_t²(j)]          (4)

where the denominator is the Gini index of heterogeneity. Equation (4) gives the relative reduction in the total heterogeneity of the J classes due to the I categories of the predictor X. It is clear that 0 ≤ τ_t(Y|X) ≤ 1, with τ_t(Y|X) = 0 in the case of independence and τ_t(Y|X) = 1 in the case of perfect prediction.

At each node t of the splitting procedure, a split s of the I categories of X into two subgroups, i.e. i ∈ l or i ∈ r, leads to defining a splitting variable X_s with two categories denoted by l and r. We can define the index τ to evaluate the relative decrease in the proportion of incorrect predictions due to the splitting variable X_s induced by the split s as follows:

    τ_t(Y|s) = [Σ_j p_t²(j|l) p_{tl} + Σ_j p_t²(j|r) p_{tr} − Σ_j p_t²(j)] / [1 − Σ_j p_t²(j)]          (5)

where τ_t(Y|s) is shorthand for τ_t(Y|X_s). We now show that maximizing the predictability index τ_t(Y|s) as a splitting rule is equivalent to maximizing the decrease in impurity (2). Indeed, when we consider the Gini index (1) for the nodes t, t_l and t_r, the decrease in impurity (2) can be written, after some simplifications, as

    Δi(s, t) = Σ_j p_{tl}²(j) p_{tl} + Σ_j p_{tr}²(j) p_{tr} − Σ_j p_t²(j)          (6)

where p_{tl} + p_{tr} = 1.
It can be easily verified that this quantity is related to the index τ in (5) in the following way:

    Δi(s, t) = [1 − Σ_j p_t²(j)] τ_t(Y|s)          (7)

As a result, since the term 1 − Σ_j p_t²(j) is constant over all splits of node t, maximizing the decrease in impurity is equivalent to maximizing the relative increase in the proportion of correct predictions due to each split. We can calculate the index τ_t(Y|s) for any split s of each predictor X, i.e. s ∈ Q_t, in order to select the best split s* such that

    τ_t(Y|s*) = max_{s ∈ Q_t} τ_t(Y|s)          (8)

In other words, the best split s* at each node is provided by the splitting variable X_{s*} that maximizes the explained prediction of the J classes within the two subgroups. The splitting criterion (8) yields the same result as the CART splitting criterion (3), but it provides a new interpretation of the best split in terms of prediction as opposed to the concept of impurity.
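The equivalence (7) is easy to check numerically. The short Python sketch below (our own illustration; the function and variable names are hypothetical) computes τ_t(Y|s) directly from its definition (5) and compares Δi(s, t) with [1 − Σ_j p_t²(j)] τ_t(Y|s) for one split of a toy data set.

    # A small numerical check of the equivalence (7).
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def tau_split(x, y, left_categories):
        """Predictability index tau_t(Y|s) for the dichotomy given by left_categories."""
        n = len(y)
        yl = [yi for xi, yi in zip(x, y) if xi in left_categories]
        yr = [yi for xi, yi in zip(x, y) if xi not in left_categories]
        pl, pr = len(yl) / n, len(yr) / n
        num = (pl * sum((c / len(yl)) ** 2 for c in Counter(yl).values())
               + pr * sum((c / len(yr)) ** 2 for c in Counter(yr).values())
               - sum((c / n) ** 2 for c in Counter(y).values()))
        return num / gini(y)

    x = list("aabbccabca")
    y = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
    B = {"a"}
    yl = [yi for xi, yi in zip(x, y) if xi in B]
    yr = [yi for xi, yi in zip(x, y) if xi not in B]
    delta_i = gini(y) - len(yl) / len(y) * gini(yl) - len(yr) / len(y) * gini(yr)
    print(delta_i, gini(y) * tau_split(x, y, B))   # the two numbers coincide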
4. A fundamental property of the index τ

In the following, we introduce a theorem that proves a new property of the index τ. A corollary derived from this theorem is particularly useful for searching for the best split in classification trees. (In order to simplify the notation we omit the indication of the node.)

Theorem. Consider the predictor X with categories i = 1, ..., I and the response variable Y with categories j = 1, ..., J. Denote by S the set of all possible splits of the I categories into two subsets. With each split s is associated a splitting variable X_s. Define the index τ(Y|X) and the set of indices τ(Y|s) for each s ∈ S. It can be proved that

    τ(Y|X) ≥ τ(Y|s)          (9)

for each split s ∈ S.

Proof. Consider any split s of the I categories of X into two subsets l and r, i.e. i ∈ l or i ∈ r. Then l and r denote the categories of the splitting variable X_s induced by the split s. From (4) and (5) we need to show that

    Σ_i Σ_j p²(j, i)/p(i) ≥ Σ_j p²(j, l)/p(l) + Σ_j p²(j, r)/p(r)          (10)

where p(j, l) = Σ_{i∈l} p(j, i) and p(j, r) = Σ_{i∈r} p(j, i), so that p(l) = Σ_{i∈l} p(i) and p(r) = Σ_{i∈r} p(i). It is sufficient to prove that, for every j = 1, ..., J, the inequality

    Σ_{i∈l} p²(j, i)/p(i) ≥ p²(j, l)/p(l)          (11)

holds for l of size 2; by induction it then follows for any l. Let a and b denote the two categories that are collapsed in l. We need to show that

    p²(j, a)/p(a) + p²(j, b)/p(b) ≥ [p(j, a) + p(j, b)]² / [p(a) + p(b)]          (12)

holds for each j. Multiplying both members by p(a) + p(b) yields, after some simplifications, the following inequality:

    [p(b)/p(a)] p²(j, a) − 2 p(j, a) p(j, b) + [p(a)/p(b)] p²(j, b) ≥ 0          (13)

Viewed as a second-order inequality with respect to p(j, a), with p(j, a) ≤ p(a) and p(j, b) ≤ p(b), (13) has discriminant equal to zero and a positive first coefficient, so that it is always satisfied. Note that (13) becomes an identity when p(j, a) = p(a) and p(j, b) = p(b), i.e. when there is perfect predictability for categories a and b. Likewise, (13) becomes an identity also when p(j, a) = p(a) p(j) and p(j, b) = p(b) p(j), i.e. when there is independence in the cells (j, a) and (j, b). In both cases the index τ(Y|X) and the index τ(Y|s) attain the same value.

Corollary. Let s* be the best split of the predictor X, such that

    τ(Y|s*) = max_{s ∈ S} τ(Y|s)          (14)

From the theorem it follows that τ(Y|X) ≥ τ(Y|s*).

Using these results, we present in the Appendix an algorithm such that, at each node, the splitting criterion based on the index τ does not need to be calculated for all splits in the set Q in order to find the best split. In this way we obtain a splitting procedure that is faster than the CART splitting procedure while growing the same binary tree. Hereafter, we call this splitting algorithm FAST.
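To make the theorem concrete, here is a small numerical experiment (our own Python sketch, not part of the paper; all function names are ours) that computes τ(Y|X) for a randomly generated categorical predictor and compares it with τ(Y|s) over every dichotomy s of its categories.

    # Empirical check of the theorem: tau(Y|X) is never lower than tau(Y|s)
    # for any dichotomy s of the categories of X.
    from itertools import combinations
    from collections import Counter
    import random

    def tau(x, y):
        """Goodman-Kruskal predictability index tau(Y|X) from two lists."""
        n = len(y)
        gini_y = 1.0 - sum((c / n) ** 2 for c in Counter(y).values())
        num = -sum((c / n) ** 2 for c in Counter(y).values())
        for cat, cnt in Counter(x).items():
            y_in = [yi for xi, yi in zip(x, y) if xi == cat]
            num += (cnt / n) * sum((c / cnt) ** 2 for c in Counter(y_in).values())
        return num / gini_y

    random.seed(0)
    x = [random.choice("abcde") for _ in range(200)]
    y = [random.choice([0, 1, 2]) for _ in range(200)]
    cats = sorted(set(x))
    splits = [{cats[0], *rest}
              for size in range(1, len(cats))
              for rest in combinations(cats[1:], size - 1)]
    tau_x = tau(x, y)
    tau_best = max(tau([xi in B for xi in x], y) for B in splits)
    print(tau_x >= tau_best, tau_x, tau_best)   # True: tau(Y|X) bounds every split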
5. Simulation studies

In this section we present three simulation studies to show the performance of FAST with respect to CART: (A) using only categorical predictors; (B) using both numerical and categorical predictors; and (C) under a specific dependency structure in the data set. We apply the FAST and CART algorithms to find the best split at the top node and we record both the number of splits analysed and the CPU time (in seconds) spent to find the best split. (Computations were carried out on a PC with a Pentium 133 MHz processor. Both the FAST and the CART splitting algorithms were implemented in the MATLAB environment. The CPU time refers to the splitting algorithm and also includes the time for data handling in memory.)
We expect a computational advantage from FAST in terms of the reduced number of splits analysed to find the best split. Whereas in CART all possible splits are investigated in any case, in FAST the best split can be found after examining a smaller number of splits. This also reduces the CPU time for splitting. In simulation study (A), where only categorical predictors are considered, the number of splits analysed by CART depends exclusively on the number of predictor categories, whatever the sample size. For instance, when the number of predictor categories I is equal to 3, the number of splits analysed by CART is equal to 30 (2² − 1 = 3 being the number of possible dichotomies for each of the 10 predictors); the number of splits analysed rapidly increases to 70, 150, 310 and 630 as I increases to 4, 5, 6 and 7, respectively. Instead, the number of splits analysed by FAST can vary with both the number of predictor categories and the sample size. Therefore, we have considered a 5 × 5 factorial design, with N (the sample size) taking values 25, 50, 100, 200, 300 and I (the number of categories of each categorical predictor) taking values 3, 4, 5, 6, 7; the response variable is dichotomous. For each combination of the factorial design we have generated 100 data matrices where all variables are randomly generated from a uniform distribution.

Table 1a–e shows the performance of FAST compared to CART in simulation study (A) for I taking values 3, 4, 5, 6, 7, respectively. At the top of each sub-table we report the number I of predictor categories and the number Q of all possible splits, which are necessarily analysed by CART. In the columns of the table we give, for each N, the average number of splits analysed by FAST, i.e. μ_FAST, together with the coefficient of variation, CV_FAST. We have also calculated the index

    RR = 1 − μ_FAST / Q

to evaluate the relative reduction in the number of splits analysed by FAST with respect to CART. We can easily check that the average number of splits analysed by FAST is always lower than the fixed number of splits investigated by CART, i.e. μ_FAST < Q. For instance, for N = 25, increasing I from 3 to 4 raises the number of splits from 30 to 70 with CART and from 3.6 to 9.7 on average with FAST. The index RR, as a percentage, always achieves very high values, varying from a minimum of 82% to a maximum of 90%.
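For concreteness, one way to generate a single data matrix of this design is sketched below. This is our own, hedged reconstruction: the paper states only that all variables are drawn from a uniform distribution, and the function name and defaults are hypothetical.

    # Hypothetical reconstruction of one simulated data matrix of design (A):
    # categorical predictors uniform over I categories, dichotomous response.
    import random

    def simulate_matrix(n_cases, n_predictors=10, n_categories=3, seed=None):
        rng = random.Random(seed)
        predictors = [[rng.randrange(n_categories) for _ in range(n_cases)]
                      for _ in range(n_predictors)]
        response = [rng.randrange(2) for _ in range(n_cases)]   # dichotomous Y
        return predictors, response

    predictors, y = simulate_matrix(n_cases=25, n_categories=3, seed=1)
    print(len(predictors), len(predictors[0]), sum(y))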
Table 1. FAST performance in terms of number of splits and CPU time (in seconds) when using categorical predictors in simulation study (A). Q is the number of splits analysed by CART; μ_FAST and CV_FAST are the mean and coefficient of variation of the number of splits analysed by FAST; RR × 100 = (1 − μ_FAST/Q) × 100; CPU times are averages in seconds.

(a) I = 3, Q = 30

  N     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25      3.6      0.4       87.9       0.50        0.75
  50      3.9      0.4       87.1       0.61        1.11
  100     3.7      0.4       87.6       0.85        1.86
  200     3.6      0.4       88.1       1.27        4.60
  300     5.3      0.6       82.5       3.68       17.50

(b) I = 4, Q = 70

  N     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25      9.7      0.6       86.2       0.84        1.34
  50     10.4      0.7       85.2       0.91        2.10
  100    10.1      0.6       85.6       1.18        4.93
  200    10.7      0.6       84.7       1.27        4.60
  300    10.3      0.7       85.3       1.14        3.23

(c) I = 5, Q = 150

  N     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25     16.0      0.8       89.4       0.75        1.31
  50     14.1      0.7       90.6       0.71        1.80
  100    15.8      0.8       89.5       0.95        2.76
  200    15.8      0.7       89.5       1.42        4.67
  300    14.0      0.7       90.7       2.00        6.69

(d) I = 6, Q = 310

  N     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25     36.0      3.0       88.4       0.89        2.36
  50     29.5      2.9       90.5       1.21        3.83
  100    30.8      2.7       90.1       1.81        5.84
  200    31.2      2.6       89.9       3.64       10.11
  300    30.2      2.5       90.3       5.06       14.46

(e) I = 7, Q = 630

  N     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25     32.3      2.7       94.9       1.73        4.53
  50     40.3      2.8       93.6       3.03        8.05
  100    39.7      2.9       93.7       3.48       12.56
  200    31.5      2.6       95.0       7.37       21.55
  300   151.2      0.7       76.0      10.61       30.67
In all tables, we notice that the coefficient of variation shows a certain stability of the FAST splitting procedure with respect to the sample size. The computational gain of FAST over CART is also confirmed when we take the CPU time into account. In fact, for each combination of the factorial design we also give the average CPU time (in seconds) required to find the best split with FAST as well as with CART. Although the number of splits analysed by CART is fixed, the CPU time varies with the sample size owing to the data handling. Most of the time taken by CART in trying out all predictors is substantially reduced by FAST, where only some of the predictors are considered. Reducing the number of splits analysed therefore also saves CPU time in finding the best split at each node.

In simulation study (B), we have used 5 numerical predictors, each with N distinct values, and 5 categorical predictors, each with I = 6 categories; the response variable is binary. For the sample size N taking values 25, 50, 100, 200, 300 we have generated 100 data matrices where all variables are randomly generated from a uniform distribution. Due to the presence of numerical predictors, the number of splits analysed by CART also depends on the sample size; indeed, if a numerical predictor has N distinct values then there are N − 1 possible dichotomies. Table 2 shows the performance of FAST compared to CART in simulation study (B). As in Table 1, we consider both the number of splits and the CPU time (in seconds). We notice that for both CART and FAST the number of splits analysed increases with the sample size, but in any case the index RR always achieves percentages higher than 40%. In terms of the average CPU time, FAST again uses less computing time than CART.
Table 2. FAST performance in terms of number of splits and CPU time (in seconds) when using categorical and numerical predictors in simulation study (B)

  N       Q     μ_FAST   CV_FAST   RR × 100   CPU FAST   CPU CART
  25     275     127       0.2       53.8        1.82       2.24
  50     400     204       0.2       49.0        3.54       5.56
  100    650     359       0.1       44.8        7.96      15.92
  200   1150     673       0.1       41.5       22.59      57.91
  300   1650     978       0.1       40.7       45.05     125.72
In simulation study (C), we have generated a single data matrix with sample size N = 300, a response variable having six categories (J = 6) and M = 10 categorical predictors, each having six categories (I = 6). We have imposed a dependency structure between the response variable and one predictor, with a relatively high value of the index τ. Such a predictor is the first-best predictor, which generates the best split at the top node. To find this best split, CART has tried out 310 splits (2⁵ − 1 = 31 being the number of possible dichotomies for each of the 10 predictors), whereas FAST has tried out only 31 splits, namely those generated by the first-best predictor with the highest τ value. The CPU time used by FAST is only 28.94 seconds, against 178.29 seconds for CART. As a result, the performance of FAST greatly improves when there is a dependency structure of the response variable on some predictors: in this case the best split can be found in a few iterations, since the splitting algorithm is driven by the predictability index τ.
6. An example: graduates of the University of Naples

We present an example using a data set consisting of 286 graduates of the Faculty of Economics at the University of Naples over the period 1986–1989. Table 3 describes the categories of the observed variables: the response variable is the Final Score at the degree and the explanatory variables are the remaining variables, namely Sex, Origin, Age, Diploma, Study Plan, Time Needed to Graduate and Thesis Subject. In Fig. 1 we show the pruned binary tree obtained with CART; we refer to Breiman et al. (1984) for the methodological details of the pruning procedure. The pruned binary tree, with only 5 terminal nodes, is obtained by pruning the maximal binary tree with 43 terminal nodes. Figure 2 shows the number of splits investigated at each node by FAST and by CART in order to grow the maximal binary tree. The overall number of splits inspected to find all the best splits in the maximal binary tree is 978 with the standard CART procedure, whereas it is 302 with the proposed FAST algorithm. The average number of splits analysed at each node is 22.74 for CART and 7.02 for FAST, and the relative reduction in the overall number of splits analysed is 69%. In conclusion, FAST proves to be computationally very efficient in real data sets, where the gain in the reduced number of splits analysed increases with the degree of dependency among the variables.
Fig. 1. Pruned binary tree for graduates data
Fig. 2. The number of splits tried out at each node by FAST (solid line) and CART (dashed line) to grow the maximal binary tree for graduates data
Table 3. List of variables for graduates data

  Variable                    Categories
                              1           2            3                4             5              6
  Sex                         Male        Female
  Origin                      Naples      County       Other Counties
  Age                         ≤ 25        26–30        31–35            > 35
  Diploma                     Classical   Scientific   Technical        Magistral     Professional
  Study Plan                  Official    Managerial   Economics        Quantitative  Public         Profes.
  Time Needed to Graduate     4 years     5–6 years    > 6 years
  Final Score                 Low         High
  Thesis Subject              Economy     Law          Quantitative     History       Management

7. Conclusion

In this paper we have introduced a fast splitting procedure (FAST) that uses the predictability index τ as a splitting rule. As opposed to the impurity function based on the Gini index of heterogeneity, we evaluate the predictability power of the splitting variable with respect to the response variable. In this way, when splitting a node sample into two subgroups we maximize the explained prediction of the response variable within the two subgroups. We have proved two methodological results:

· maximizing the index τ is equivalent to maximizing the decrease in impurity as measured with the Gini index; and
· the index τ for a given predictor has a value not lower than the index τ for any split of the predictor categories into two subgroups.

These methodological results have allowed us to define a splitting procedure that is faster than the CART splitting procedure. We have shown in three simulation studies the computational gain in terms of the reduced number of splits inspected to find the best split at each node, as well as the reduced average CPU time. This gain becomes even more relevant when growing the maximal binary tree. Indeed, we have shown through an example on a real data set how the FAST algorithm investigates at each node a lower number of splits than the CART algorithm, so that the overall number of splits is much lower with FAST than with CART. In this respect, the FAST algorithm is computationally more efficient. The proposed FAST splitting algorithm also greatly reduces the computing time needed to estimate the misclassification rate by cross-validation: in CART an m-fold cross-validation has a high computational cost, since the maximal binary tree is grown m times. As a final remark, the methodological results proved in this paper can be extended in a straightforward way to the case where the partitioning is into three rather than two subgroups. In this way, the FAST algorithm also makes feasible the extension of the CART methodology to ternary trees.
Acknowledgements

The authors wish to thank the anonymous referees for helpful comments on previous versions of this paper, as well as Carlo Lauro and Jaromir Antoch for stimulating this work. This research was supported by CNR research funds n. 92.1872.P and n. 95.02041.CT10. The authors have contributed equally to this work and are therefore listed in alphabetical order.
References

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees. Wadsworth, Belmont, CA.
Buntine, W.L. (1992) Learning classification trees. Statistics and Computing, 2, 63–73.
Ciampi, A. (1994) Classification and discrimination: the RECPAM approach. In COMPSTAT '94 (eds Dutter, R. and Grossmann, W.), pp. 129–47. Physica Verlag, Heidelberg.
Clark, L.A. and Pregibon, D. (1992) Tree-based models. In J.M. Chambers and T.J. Hastie (eds) Statistical Models in S, pp. 377–420. Wadsworth and Brooks, Belmont, CA.
Goodman, L.A. and Kruskal, W.H. (1954) Measures of association for cross classification. Journal of the American Statistical Association, 48, 732–62.
Hand, D.J. (1981) Discrimination and Classification. Wiley, Chichester, UK.
Hastie, T. and Pregibon, D. (1990) Shrinking trees. Technical report, AT&T Bell Laboratories, Murray Hill, NJ.
Kwok, S.W. and Carter, C. (1990) Multiple decision trees. In R.D. Schachter, T.S. Levitt, L.N. Kanal and J.F. Lemmer (eds) Uncertainty in Artificial Intelligence 4, pp. 327–35. Elsevier Science, Amsterdam.
Light, R.J. and Margolin, B.H. (1971) An analysis of variance for categorical data. Journal of the American Statistical Association, 66, 534–44.
Loh, W. and Vanichsetakul, N. (1988) Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association, 83, 715–28.
Margolin, B.H. and Light, R.J. (1974) An analysis of variance for categorical data, II: small sample comparisons with chi square and other competitors. Journal of the American Statistical Association, 69, 755–64.
Mola, F. and Siciliano, R. (1992) A two-stage predictive splitting algorithm in binary segmentation. In Y. Dodge and J. Whittaker (eds) Computational Statistics 1, pp. 316–23. Physica Verlag, Heidelberg.
Mola, F. and Siciliano, R. (1994) Alternative strategies and CATANOVA testing in two-stage binary segmentation. In E. Diday et al. (eds) New Approaches in Classification and Data Analysis. Springer Verlag, NY.
Mola, F. and Siciliano, R. (1997) Visualizing data in tree-structured classification. In Proceedings of IFCS-96: Data Science, Classification and Related Methods (eds Hayashi, C. et al.). Springer Verlag, Tokyo.
Mola, F., Klaschka, J. and Siciliano, R. (1996) Logistic classification trees. In COMPSTAT '96 Proceedings (ed. Prat, A.). Physica Verlag, Heidelberg.
Oliver, J.J. and Hand, D.J. (1995) On pruning and averaging decision trees. In Machine Learning: Proceedings of the 12th International Workshop, pp. 430–37.
Quinlan, J.R. (1986) Induction of decision trees. Machine Learning, 1, 81–106.
Siciliano, R. and Mola, F. (1997a) On the behaviour of splitting criteria for classification trees. In Proceedings of IFCS-96: Data Science, Classification and Related Methods (eds Hayashi, C. et al.). Springer Verlag, Tokyo.
Siciliano, R. and Mola, F. (1997b) Ternary classification trees: a factorial approach. In M. Greenacre and J. Blasius (eds) Visualization of Categorical Data. Academic Press, CA.
Taylor, P.C. and Silverman, B.W. (1993) Block diagrams and splitting criteria for classification trees. Statistics and Computing, 3, 147–61.
Appendix: A fast splitting algorithm based on the index τ

Consider the response variable Y and the current set of predictors X = (X_1, ..., X_m, ..., X_M). In a binary segmentation procedure the number of predictors available at a given node t can be lower than the number of predictors at the top node, and the number of non-empty classes can be lower than J. The FAST splitting algorithm to find the best split s* at node t consists of the following main steps.

1. Calculate the value of τ(Y|X_m) for each m = 1, ..., M and order the predictors with respect to the value of the index τ; denote the ordered predictors by X_(1), ..., X_(m), ..., X_(M), so that τ(Y|X_(m)) is the m-th highest τ value.
2. Initialize k = 1.
3. Define the set S_(k) of all possible splits of the categories of the predictor X_(k). Find the best split s*_(k) of the predictor X_(k), such that max_{s ∈ S_(k)} τ(Y|s) = τ(Y|s*_(k)).
4. If k = 1, set s* = s*_(1). If k > 1 and s*_(k) is a better split than s*, set s* = s*_(k). Otherwise, leave s* unchanged.
5. If k = M then stop the procedure, s* being the best split. Otherwise, continue.
6. If τ(Y|s*_(k)) < τ(Y|X_(k+1)) then set k = k + 1 and go back to Step 3. Otherwise, stop the procedure, s* being the best split.
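For readers who prefer the steps above in executable form, the sketch below is our own Python rendering of Steps 1–6 for categorical predictors; the paper's implementation was in MATLAB, and the function names, data layout and toy data here are our assumptions, not the authors'.

    # A sketch of the FAST search: rank predictors by tau, scan them in order,
    # and stop as soon as the current best split already beats the tau upper
    # bound of the next predictor (Step 6).
    from itertools import combinations
    from collections import Counter

    def tau(x, y):
        """Goodman-Kruskal tau(Y|X) for two equal-length lists of categories."""
        n = len(y)
        gini_y = 1.0 - sum((c / n) ** 2 for c in Counter(y).values())
        num = -(1.0 - gini_y)
        for cat, cnt in Counter(x).items():
            y_in = [yi for xi, yi in zip(x, y) if xi == cat]
            num += (cnt / n) * sum((c / cnt) ** 2 for c in Counter(y_in).values())
        return num / gini_y

    def dichotomies(categories):
        """All 2^(I-1) - 1 splits of a category set into two non-empty subsets."""
        cats = sorted(categories)
        for size in range(1, len(cats)):
            for rest in combinations(cats[1:], size - 1):
                yield {cats[0], *rest}

    def fast_best_split(predictors, y):
        """Steps 1-6: returns (predictor index, left category subset, tau of split)."""
        order = sorted(range(len(predictors)),               # Step 1: rank predictors by tau
                       key=lambda m: tau(predictors[m], y), reverse=True)
        best = None                                          # current s*
        for pos, m in enumerate(order):                      # Steps 2-3: scan ranked predictors
            x = predictors[m]
            tau_k, split_k = max(
                ((tau([xi in B for xi in x], y), B) for B in dichotomies(set(x))),
                key=lambda t: t[0])
            if best is None or tau_k > best[2]:              # Step 4: update s*
                best = (m, split_k, tau_k)
            if pos + 1 == len(order):                        # Step 5: all predictors examined
                break
            if tau_k >= tau(predictors[order[pos + 1]], y):  # Step 6: stopping rule
                break
        return best

    # toy usage with hypothetical data
    predictors = [list("aabbccabca"), list("xxyyxyxyyx")]
    y = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
    print(fast_best_split(predictors, y))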