Statistics and Computing (1996) 6, 231-243
Non-binary classification trees

STANISLAV KEPRTA

Department of Probability and Statistics, Charles University, Sokolovská 83, 186 00 Praha, Czech Republic

Received May 1994 and accepted January 1996
The main objective of this paper is to implement non-binary splits into Gelfand et al.'s modification of the CART algorithm. Multi-way splits can sometimes better reflect the structure of data than binary splits. Methods for the construction and comparison of such splits are described.
Keywords: Classification tree, tree growing, tree pruning, multi-way splits
1. Introduction

This paper deals with classification trees. A classification tree is a tree that, given an input $X$, produces an output $\hat Y$ that approximates some random variable $Y$, stochastically related to $X$, which shows to which class the object belongs. It is assumed that a data set consisting of input vectors $x$ and their corresponding class labels is available. The goal of the construction of a classification tree from a data sample is to investigate the data structure and the dependence of the response variable on the set of predictors, and to learn how to predict the response variable from known values of the predictors.

A large variety of methods has been proposed for the construction of classification trees: examples are FIRM suggested by Hawkins (1990), Ciampi's algorithm (Ciampi, 1993) and especially CART described in Breiman et al. (1984). For more references see Safavian and Landgrebe (1991). This paper is based on the iterative growing and pruning algorithm suggested by Gelfand et al. (1991). The methodology of Gelfand et al.'s algorithm is based on the CART algorithm, but a new method for using all of the data both to grow and to prune the tree was suggested. Both CART and Gelfand et al.'s algorithm construct strictly binary trees; nevertheless, multi-way splits can sometimes better reflect the structure of data (see, for example, the analysis of the Iris data in Section 5). That is why we suggest methods for the construction of non-binary splits and for the comparison of splits with different numbers of groups.

Several other authors have considered non-binary splits. Kass (1980) proposed to choose the best partition on the basis of statistical significance and used the Bonferroni
inequality to calculate the significance. An improved method of calculating the significance is suggested in Biggs et al. (1991). Quinlan (1993) described a method of pairwise merging of predictor values to produce multi-way splits. Oliver et al. (1992) suggested using decision graphs, a generalization of classification trees, to overcome shortcomings of binary classification trees. Bahl et al. (1989) used pylons to represent some composite questions.

Section 2 contains basic definitions. Section 3 describes methods for the construction of a tree, its growing and pruning. Section 4 describes methods for the construction of multi-way splits and for the comparison of splits with different numbers of groups. Section 5 illustrates the behaviour of the suggested algorithm on real data.
2. Definitions

Let $(X, Y)$ be a random vector with $X$ taking values in $\mathcal X \subset \mathbb R^M$ and $Y$ taking values in $\mathcal C = \{1, 2, \ldots, J\}$, where $X$ is a vector of $M$ predictors, $Y$ is the corresponding class label, and $J$ is the number of classes. Our goal is to estimate (predict) $Y$ from an observation of $X$. Consider a random sample $\mathcal L$ with $N$ items, $(x_n, y_n)$, $n = 1, \ldots, N$.
Definition 2.1. A classifier is a function $d(x)$ defined on $\mathcal X$ so that $\forall x \in \mathcal X$, $d(x)$ is equal to one of the numbers from $\mathcal C$.

Remark 2.1. This means that $d$ defines a partition of the space $\mathcal X$ into $J$ disjoint subsets $\mathcal A_1, \mathcal A_2, \ldots, \mathcal A_J$ such that

• $\bigcup_{j=1}^{J} \mathcal A_j = \mathcal X$;
• $\mathcal A_i \cap \mathcal A_j = \emptyset$, $1 \le i \ne j \le J$;
• $\forall x \in \mathcal X$: $d(x) = j \Leftrightarrow x \in \mathcal A_j$.
Definition 2.1 is based on definitions in Breiman et al. (1984), p. 4, and in Gelfand et al. (1991), p. 164, and will be sufficient for our needs. Some other authors (cf. Buntine (1993)) use a more general definition allowing each leaf to be assigned probabilities of an object belonging to each class. A classification tree is a classifier constructed by recursive partitioning of subsets of $\mathcal X$ into two or more subsets, which correspond to the 'nodes' of the tree. The partitioning starts with the whole space $\mathcal X$, which corresponds to the 'root' of the tree. Let us recall basic tree terminology (cf. Breiman et al. (1984), pp. 279-84, Gelfand et al. (1991), p. 165).
Definition 2.2. A tree consists of a finite non-empty set $T$ of positive integers and a function sons that assigns to each $t \in T$ a subset of $T$. Denote by $|T|$ the number of elements in $T$. The function sons must satisfy the following requirements:

• $\forall t \in T$: $\mathrm{sons}(t) = \emptyset$ or $|\mathrm{sons}(t)| \ge 2$;
• $\forall t \in T$: $s \in \mathrm{sons}(t) \Rightarrow s > t$;
• $\forall s \in T$, other than the smallest integer in $T$, $\exists!\, t \in T$ such that $s \in \mathrm{sons}(t)$.
Terminology

• Each $t \in T$ is called a node of the tree.
• The minimum element of a tree $T$ is called the root of the tree and is denoted by $\mathrm{root}(T)$.
• A node $t \in T$ such that $\mathrm{sons}(t) = \emptyset$ is called a terminal node or a leaf.
• A node $s$ is called a son of node $t$, and node $t$ is called the father of node $s$, if $s \in \mathrm{sons}(t)$.
• According to Definition 2.2, for each node except the root there exists just one father. Therefore, we can correctly define the function $\mathrm{father}(t)$ which assigns to every $t \in T - \{\mathrm{root}(T)\}$ its father.
• If $t = \mathrm{father}(s)$, or $t = \mathrm{father}(\mathrm{father}(s))$, or ..., $t$ is called an ancestor of $s$ and $s$ is called a descendant of $t$.

Given a non-empty subset $T_1 \subset T$ we define the function $\mathrm{sons}_1(\cdot)$ by
$$\forall t \in T_1: \quad \mathrm{sons}_1(t) = \mathrm{sons}(t) \cap T_1.$$

Definition 2.3. If both $T_1$ and $\mathrm{sons}_1$ satisfy the definition of a tree, $T_1$ is called a subtree of $T$. $T_1$ is called a pruned subtree of $T$, denoted by $T_1 \preceq T$ (or by $T_1 \prec T$ instead of $T_1 \preceq T$ when moreover $T_1 \ne T$), if $T_1$ is a subtree of $T$ and $\mathrm{root}(T) = \mathrm{root}(T_1)$.

Terminology

• The branch $T_t$ of $T$ with root node $t \in T$ consists of the node $t$ and all descendants of $t$.
• Pruning a branch $T_t$ from a tree $T$ consists of deleting all descendants of $t$ from $T$, that is, cutting off all of $T_t$ except the root node. The tree pruned in this way will be denoted by $T - T_t$.

We will denote by $\tilde T$ the set of leaves of the tree $T$. Now we are ready to define a classification tree (cf. Gelfand et al. (1991), p. 165).
Definition 2.4. A classification tree is a tree $T$ where each $t \in T$ is assigned a region $\mathcal A(t) \subset \mathcal X$ and a class $j(t) \in \mathcal C$ which satisfy the following requirements:

• $\{\mathcal A(t);\ t \in \tilde T\}$ forms a disjoint partition of $\mathcal X$;
• $\forall t \in T - \tilde T$: $\{\mathcal A(s);\ s \in \mathrm{sons}(t)\}$ forms a disjoint partition of $\mathcal A(t)$.

The classification rule corresponding to this classification tree reads as follows:

$$\forall x \in \mathcal X,\ \forall t \in \tilde T: \quad d(x) = j(t) \Leftrightarrow x \in \mathcal A(t).$$
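For readers who prefer code, the definitions above translate directly into a dictionary-based representation. The following is a minimal sketch (the names sons, father and classify are ours, not from the paper), with regions and class labels assumed to be supplied as mappings:

```python
# A tree in the sense of Definition 2.2: nodes are positive integers,
# sons(t) is either empty or has at least two elements, and every son
# has a larger number than its father.
sons = {1: [2, 3, 4], 2: [], 3: [5, 6], 4: [], 5: [], 6: []}
leaves = [t for t, s in sons.items() if not s]   # terminal nodes: 2, 4, 5, 6

def father(s, sons=sons):
    """The unique node t with s in sons(t); undefined for the root."""
    return next(t for t, ss in sons.items() if s in ss)

def classify(x, region, label, sons=sons, root=1):
    """Classification rule of Definition 2.4: descend from the root to the
    leaf whose region contains x and return its class label j(t).
    `region[t]` is a predicate for A(t) and `label[t]` is j(t)."""
    t = root
    while sons[t]:
        # the regions of the sons partition A(t), so exactly one matches
        t = next(s for s in sons[t] if region[s](x))
    return label[t]
```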
3. Iterative growing and pruning of a tree

3.1. Construction methodology

Several different methodologies for constructing classification trees have been suggested in the literature. We will briefly describe the one based on Breiman et al.'s monograph. The tree is constructed by recursive partitioning of the space $\mathcal X$. This means that we assign the whole space to the root, 1, i.e. $\mathcal A(1) = \mathcal X$. Then we must find a suitable split. A split consists of conditions on the coordinates $x_1, \ldots, x_M$. These conditions define a partition of $\mathcal X = \mathcal A(1) = \bigcup_{s \in \mathrm{sons}(1)} \mathcal A(s)$. Then the same method is applied recursively on each $\mathcal A(s)$, $s \in \mathrm{sons}(1)$, and so on.

The construction of a classification tree requires three steps:
1. The rule for assigning class labels.
2. The selection of a split.
3. The rule for determining terminal nodes.

Suppose that the tree $T$ is constructed using the data in $\mathcal L$. In each node $t \in T$ denote $p(t) = N(t)/N$, where $N(t)$ is the number of data elements in $\mathcal L$ such that $x_n \in \mathcal A(t)$; $p(t)$ is the estimate of the probability $P(X \in \mathcal A(t))$ based on $\mathcal L$. Further denote $p(j\,|\,t) = N_j(t)/N(t)$, $j = 1, \ldots, J$, where $N_j(t)$ is the number of data elements in $\mathcal L$ such that $x_n \in \mathcal A(t)$ and $y_n = j$; $p(j\,|\,t)$ is the estimate of $P(Y = j\,|\,X \in \mathcal A(t))$ based on $\mathcal L$.
Step 1

The first step, the assignment of the class label $j(t)$ to $t$, is the easiest. We get the estimate of the misclassification error in $t$ as $1 - p(j(t)\,|\,t)$. It is minimal when

$$p(j(t)\,|\,t) = \max_j p(j\,|\,t) = \max_j \frac{N_j(t)}{N(t)}.$$

Such $j(t)$ is called the major class. For the growing of the tree we will use the resubstitution estimate of misclassification error defined as

$$R(T) = \sum_{t \in \tilde T} R(t),$$

where $R(t) = M(t)/N$ and $M(t)$ is the number of data elements in $\mathcal L$ for which $x_n \in \mathcal A(t)$ and $y_n \ne j(t)$. We get

$$R(t) = [1 - p(j(t)\,|\,t)]\, p(t) = \Big[1 - \max_j p(j\,|\,t)\Big] p(t).$$

Thus, we estimated the prior class probabilities $\pi(j) = P(Y = j)$, $j = 1, \ldots, J$, from the learning sample and assumed that the cost $C(i\,|\,j)$ of misclassifying a class $j$ case as a class $i$ case was equal to 1 for all $i \ne j$. It is possible to modify the described method to handle known priors and variable misclassification costs as described in Breiman et al. (1984), pp. 112-15.
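As a sketch of how these node quantities are computed from a learning sample, the following hypothetical helper (the name is ours; class labels are assumed to be coded $0, \ldots, J-1$) returns the major class and the resubstitution error contribution of a node:

```python
import numpy as np

def node_stats(y_node, n_total):
    """Resubstitution quantities of a node t (a sketch; names are ours).

    y_node  : class labels (coded 0, ..., J-1) of the cases with x_n in A(t)
    n_total : N, the size of the whole learning sample
    Returns the major class j(t), p(t) and R(t)."""
    counts = np.bincount(y_node)            # N_j(t)
    n_t = counts.sum()                      # N(t)
    p_t = n_t / n_total                     # p(t) = N(t)/N
    j_t = int(np.argmax(counts))            # major class j(t)
    r_t = (1.0 - counts[j_t] / n_t) * p_t   # R(t) = [1 - p(j(t)|t)] p(t)
    return j_t, p_t, r_t
```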
Step 2

Now we will describe the splitting procedure. We define the node impurity function $i(t)$ by

$$i(t) = \Phi(p(1\,|\,t), p(2\,|\,t), \ldots, p(J\,|\,t)),$$

where $\Phi$ is a function with the following properties (see Breiman et al. (1984), p. 32):

1. $\Phi = \Phi(p_1, \ldots, p_J)$ is defined on the set of all $J$-tuples of numbers $(p_1, \ldots, p_J)$ satisfying $\sum_{j=1}^{J} p_j = 1$ and $0 \le p_j \le 1$, $j = 1, \ldots, J$.
2. $\Phi \ge 0$.
3. $\Phi$ has a maximum only at the point $(1/J, \ldots, 1/J)$.
4. $\Phi$ achieves its minimum only at the points $(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)$.
5. $\Phi$ is strictly concave.
6. $\Phi$ is a symmetric function of $p_1, \ldots, p_J$.

The node impurity is largest when all classes are equally mixed together in the node, and smallest when the node contains only cases belonging to one class. The decrease of impurity when the node $t$ is split by $s$ is defined by

$$\Delta i(s, t) = i(t) - \sum_{s' \in \mathrm{sons}(t)} i(s')\, \frac{p(s')}{p(t)},$$

where $\mathrm{sons}(t)$ correspond to $s$. Denote the set of considered splits with $k$ groups of predictor values by $\mathcal S_k$. A split $s_k(t) \in \mathcal S_k$ is considered to be the best split in $\mathcal S_k$ if

$$\Delta i(s_k(t), t) = \max_{s \in \mathcal S_k} \Delta i(s, t).$$
Splits traditionally examine the value of one predictor. Predictors can be of several types (cf. Breiman et al. (1984), p. 5):

• A predictor is called numerical if its measured values are real numbers.
• A predictor is called categorical if it takes values in a finite set; the elements of this set are usually the positive integers 1, 2, .... We distinguish two types of categorical predictors:
  - a nominal predictor is a categorical predictor with no ordering of categories;
  - an ordinal predictor is a categorical predictor with a natural ordering of categories.
A split of the node $t$ according to the continuous predictor $X_1$, which can take values in the interval $(a, b)$, looks as follows:

$$\mathrm{sons}(t) = \{s_1, \ldots, s_K\}, \qquad \mathcal A(s_k) = \mathcal A(t) \cap \{x;\ h_{k-1} < x_1 \le h_k\},$$

where $a = h_0 < h_1 < \cdots < h_K = b$ is a division of $(a, b)$; according to the nominal predictor $X_2$ as

$$\mathcal A(s_k) = \mathcal A(t) \cap \{x;\ x_2 \in A_k\},$$

where $A_k$, $k = 1, \ldots, K$, is a disjoint partition of the categories $\mathcal X_2(t)$ of $X_2$. In the case of an ordinal predictor only neighbouring categories can be joined.

Several impurity functions were suggested (see Breiman et al. (1984), pp. 103-4). We will use the Gini index of diversity, which is defined as follows. Suppose that objects in a node belong to class $i$ with probability $p_i$. If we use for classification the rule which assigns an object selected at random from the node to class $i$ with probability $p_i$, then the estimated probability that the object actually belongs to class $j$ is $p_j$. Therefore, the estimated probability of misclassification error under this rule is

$$\sum_{i \ne j} p_i p_j = \sum_j p_j - \sum_j p_j^2 = 1 - \sum_j p_j^2.$$
Other authors (Quinlan and Rivest, 1989; Wallace and Patrick, 1993; Buntine, 1993) use splitting criteria which do not satisfy Breiman et al.'s definition mentioned above.
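A minimal sketch of the Gini index and of the impurity decrease $\Delta i(s, t)$, assuming each node is summarized by its vector of class counts (function names are ours):

```python
import numpy as np

def gini(counts):
    """Gini index 1 - sum_j p_j^2 of a node given its class counts N_j(t)."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - float(np.sum(p ** 2))

def impurity_decrease(parent, children):
    """Delta i(s, t) = i(t) - sum_{s'} i(s') p(s')/p(t); since the data in
    t are divided among the sons, p(s')/p(t) = N(s')/N(t)."""
    n_t = parent.sum()
    return gini(parent) - sum(gini(c) * c.sum() / n_t for c in children)
```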
Step 3

The decision whether the node is terminal is made by pruning. At first, a large tree $T_{\max}$ is grown by splitting nodes until all leaves contain only objects from one class, or only a few objects. Then the sophisticated misclassification error estimate $\hat R(T)$ of the true misclassification error is calculated for each pruned subtree of $T_{\max}$. The pruned subtree with the minimal estimate is selected. A different approach is described in Hawkins (1990), where the decision whether a node is declared to be terminal is based on hypothesis testing.
3.2. Iterative algorithm

An efficient algorithm for iterative growing and pruning of a binary tree was suggested in Gelfand et al. (1991). This algorithm divides the data sample into two subsets and iteratively grows a binary tree with one subset and prunes it with the other subset, successively interchanging the roles of the two subsets. We modified Gelfand et al.'s algorithm for the construction of non-binary trees. A description of Gelfand et al.'s algorithm follows. Methods for the construction and selection of splits and for the pruning of the tree are described in the following sections.

1. Randomly divide the data in $\mathcal L$ into two subsamples $\mathcal L^{(1)}$ and $\mathcal L^{(2)}$ with the same number of data from each class of the dependent variable $Y$ (as much as possible). We will use symbols with the superscript $(\beta)$ to indicate on which subset they are based.
2. Using $\mathcal L^{(1)}$ construct a tree $T_0$ by recursive splitting of nodes until every leaf $t \in \tilde T_0$ is class pure (i.e. $N_j^{(1)}(t) = N^{(1)}(t)$ for some $j$), or has few data (i.e. $N^{(1)}(t) \le N_{\min}$), or all predictors have the same value in spite of the fact that the data belong to different classes of $Y$. Assign to every node $t \in T_0$ the major class $j(t)$, i.e. $N_{j(t)}^{(1)}(t) = \max_{j \in \mathcal C} N_j^{(1)}(t)$.
3. Select the smallest optimally pruned subtree $T_0^* \preceq T_0$ such that $R^{(2)}(T_0^*) = \min_{T' \preceq T_0} R^{(2)}(T')$, and put $k = 1$.
4. Let $\alpha = 1$ and $\beta = 2$ if $k$ is even, or $\alpha = 2$ and $\beta = 1$ if $k$ is odd.
5. Using $\mathcal L^{(\alpha)}$ construct a tree $T_k$ by recursive splitting of the leaves of $T_{k-1}^*$ until every leaf $t \in \tilde T_k$ is class pure (i.e. $N_j^{(\alpha)}(t) = N^{(\alpha)}(t)$ for some $j$), or has few data (i.e. $N^{(\alpha)}(t) \le N_{\min}$), or all predictors have the same value in spite of the fact that the data belong to different classes of $Y$. Assign to every node $t \in T_k - T_{k-1}^*$ the major class $j(t)$, i.e. $N_{j(t)}^{(\alpha)}(t) = \max_{j \in \mathcal C} N_j^{(\alpha)}(t)$. Numbers, regions and class labels of nodes in $T_{k-1}^*$ are unchanged.
6. Select the smallest optimally pruned subtree $T_k^* \preceq T_k$ such that $R^{(\beta)}(T_k^*) = \min_{T' \preceq T_k} R^{(\beta)}(T')$.
7. If $|T_k^*| = |T_{k-1}^*|$ then put $T^* = T_k^*$ and stop the procedure. Otherwise let $k = k + 1$ and continue by returning to step 4.
8. Estimate the misclassification error of $T^*$ by

$$\hat R(T^*) = \sum_{t \in U^{(2)}} R^{(2)}(t) + \sum_{t \in U^{(1)}} R^{(1)}(t), \tag{1}$$

where $U^{(\beta)} = \{t \in \tilde T^*;\ t$ was assigned a class label based on $\mathcal L^{(\beta)}\}$, $\beta = 1, 2$.
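The iteration of steps 2-7 can be sketched schematically as follows; grow, prune and size are assumed helpers (not from the paper) implementing the splitting, optimal pruning and node-count operations:

```python
def iterative_grow_prune(L1, L2, grow, prune, size):
    """Schematic rendering of steps 2-7 above (a sketch, not the paper's code).

    grow(subtree, sample) -- split the leaves of `subtree` (or start a new
                             tree if subtree is None) using `sample`
    prune(tree, sample)   -- smallest optimally pruned subtree of `tree`
                             with respect to the error on `sample`
    size(tree)            -- number of nodes |T|."""
    samples = {1: L1, 2: L2}
    T0 = grow(None, samples[1])        # step 2: T_0 grown from L^(1)
    T_star = prune(T0, samples[2])     # step 3: T_0^* pruned with L^(2)
    k = 1
    while True:
        a, b = (1, 2) if k % 2 == 0 else (2, 1)   # step 4
        Tk = grow(T_star, samples[a])             # step 5
        T_next = prune(Tk, samples[b])            # step 6
        if size(T_next) == size(T_star):          # step 7: sizes agree, stop
            return T_next
        T_star, k = T_next, k + 1
```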
Let $s(k, t)$, $\mathcal A(k, t)$, and $j(k, t)$ denote the split, the region of $\mathcal X$, and the class assigned to node $t \in T_k$. We will assume that if $t \in T_k - \tilde T_k$, $t' \in T_l - \tilde T_l$, $\mathcal A(k, t) = \mathcal A(l, t')$, $j(k, t) = j(l, t')$, and $t$ and $t'$ are split based on the same data subset, we choose $s(k, t) = s(l, t')$ and $j(k, s) = j(l, s')$, where $s$ and $s'$ are corresponding sons of $t$ and $t'$. Furthermore, we choose $j(k, s) = j(k, t)$ for $s \in \mathrm{sons}(t)$, whenever possible.
Theorem 3.2.1. $T_{k-1}^* \preceq T_k^*$ for $k = 1, 2, \ldots$ Hence it follows that

$$\mathcal A(k - 1, t) = \mathcal A(k, t) \quad \text{and} \quad j(k - 1, t) = j(k, t) \quad \forall t \in T_{k-1}^*,\ k = 1, 2, \ldots$$

Theorem 3.2.2.

1. Let $K(t) = \inf\{k \ge 1;\ t \in \tilde T_{k-1}^* \cap \tilde T_k^*\}$ for $t = 1, 2, \ldots$ If $K(t) < \infty$ then $t \in \tilde T_k^*$ $\forall k \ge K(t)$. This implies, for example, $\mathcal A(k, t) = \mathcal A(K(t), t)$ and $j(k, t) = j(K(t), t)$ $\forall k \ge K(t)$.
2. Let $K = \inf\{k \ge 1;\ |T_{k-1}^*| = |T_k^*|\}$. Then $K < \infty$ and $T_k^* = T_K^*$ $\forall k \ge K$; in particular $\mathcal A(k, t) = \mathcal A(K, t)$ and $j(k, t) = j(K, t)$ $\forall k \ge K$.

The theorems and proofs for strictly binary trees can be found in Gelfand et al. (1991). Their generalization for non-binary trees can be found either in Keprta (1993) or in the Appendix.

Remark 3.2.1. The theorems assert that:

• The sequence of optimally pruned trees is nested and increasing in size.
• If a node is a terminal node in two successive optimally pruned trees $T_{k-1}^*$ and $T_k^*$, then it is a terminal node for ever.
• The algorithm terminates as soon as the number of nodes is the same for $T_{K-1}^*$ and for $T_K^*$.

3.3. Pruning
The algorithm for pruning a tree is very simple. Assign numbers to each node in $T$ so that $s > t$ $\forall s \in \mathrm{sons}(t)$, $\forall t \in T$. The basic principle of the algorithm is the comparison of the misclassification errors calculated from $\mathcal L^{(\beta)}$ in the node $t$ and in the branch $T_t$. If $R^{(\beta)}(t) \le R^{(\beta)}(T_t)$, the branch $T_t$ is removed, with the exception of the root $t$.
Denote $S(t) = R(T_t)$ (the index $\beta$ is now omitted). One pass through the tree from the bottom to the top is sufficient for pruning the tree. Thus, this algorithm for pruning is quick (cf. Gelfand et al. (1991), p. 169):

For each t ∈ T (from the largest node number to the smallest) Do
{ If t ∈ T̃ Then { S(t) := R(t) }
  If t ∈ T − T̃ Then
  { S(t) := Σ_{s ∈ sons(t)} S(s);
    If R(t) ≤ S(t) Then
    { T := T − ⋃_{s ∈ sons(t)} T_s;
      sons(t) := {};
      S(t) := R(t)
    }
  }
}
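A Python rendering of this one-pass pruning rule, under the convention of Definition 2.2 that every son has a larger number than its father (the function name and data layout are our assumptions):

```python
def prune_pass(tree, R, sons):
    """One bottom-up pruning pass (a sketch of the rule above).

    tree : set of node numbers; sons : dict node -> list of sons
           (s > t for every s in sons[t], as in Definition 2.2);
    R    : dict node -> misclassification error R(t) of the node itself.
    Returns S with S[t] = error of the pruned branch rooted at t;
    tree and sons are modified in place."""
    S = {}
    for t in sorted(tree, reverse=True):   # sons are visited before fathers
        if not sons[t]:                    # leaf: S(t) = R(t)
            S[t] = R[t]
        else:
            S[t] = sum(S[s] for s in sons[t])
            if R[t] <= S[t]:               # branch no better than the node
                stack = list(sons[t])      # cut off T_t except its root
                while stack:
                    s = stack.pop()
                    stack.extend(sons[s])
                    tree.discard(s)
                sons[t] = []
                S[t] = R[t]
    return S
```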
4. Non-binary splitting

In this section, we will focus on splitting. We will describe methods for the construction and selection of splits. We will restrict the type of predictors to the categorical ones.

4.1. Construction of splits

For splitting the node $t$ we can use the $N^{(\alpha)}(t)$ data elements from $\mathcal L^{(\alpha)}$. Further suppose that $X_m$ has values in the set $\mathcal X_m(t) = \{1, 2, \ldots, L_m\}$. Let

$$n_{lj}^{(m)}(t) = |\{x_n \in \mathcal A(t) \cap \mathcal L^{(\alpha)};\ (x_n)_m = l,\ y_n = j\}|$$

be the number of data from $\mathcal L^{(\alpha)}$ in the node $t$ for which the $m$th predictor has value equal to $l \in \mathcal X_m(t)$ and which belong to the $j$th class, $j \in \mathcal C$. Denote

$$n_{l\cdot}^{(m)}(t) = \sum_{j=1}^{J} n_{lj}^{(m)}(t), \qquad \sum_{l \in \mathcal X_m(t)} n_{lj}^{(m)}(t) = N_j^{(\alpha)}(t), \qquad \sum_{l \in \mathcal X_m(t)} n_{l\cdot}^{(m)}(t) = \sum_{j=1}^{J} N_j^{(\alpha)}(t) = N^{(\alpha)}(t).$$

In this way we will obtain for each predictor $X_m$, $m = 1, \ldots, M$, a contingency table (Table 1). It is obvious that the total number of data in the node $t$ and the columns' frequencies are the same for all predictors, but the numbers of rows and the rows' frequencies can be different.

Table 1. Contingency table $X_m \times Y$ (writing $n_{lj}$ for $n_{lj}^{(m)}(t)$)

  X_m \ Y   1               2               ...   J               Σ
  1         n_{11}          n_{12}          ...   n_{1J}          n_{1.}
  2         n_{21}          n_{22}          ...   n_{2J}          n_{2.}
  ...       ...             ...             ...   ...             ...
  L_m       n_{L_m 1}       n_{L_m 2}       ...   n_{L_m J}       n_{L_m .}
  Σ         N_1^{(α)}(t)    N_2^{(α)}(t)    ...   N_J^{(α)}(t)    N^{(α)}(t)

The number of possible partitions $W_{\mathrm{ord}}(L, K)$ of $L$ categories into $K$ non-empty groups for an ordinal predictor is given by the formula

$$W_{\mathrm{ord}}(L, K) = \binom{L - 1}{K - 1},$$

which gives, for example, the results shown in Table 2.

Table 2. $W_{\mathrm{ord}}(L, K)$ for selected $L$ and $K$

  K        2     3     4      5
  L = 8    7    21    35     35
  L = 20  19   171   969   3876

In the case of a nominal predictor the number of possible partitions $W_{\mathrm{nom}}(L, K)$ is given by the formula

$$W_{\mathrm{nom}}(L, K) = \frac{1}{K!} \sum_{i=0}^{K} \binom{K}{i} (-1)^i (K - i)^L,$$

which results in much higher values than the number of partitions for an ordinal predictor (Table 3).

Table 3. $W_{\mathrm{nom}}(L, K)$ for selected $L$ and $K$

  K                  2      3       4        5
  L = 8            127    966    1701     1050
  L = 20 (×10⁶)  0.524    580   45232   749206
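Both counting formulas are easy to check numerically; a small sketch (w_nom is the Stirling number of the second kind):

```python
from math import comb, factorial

def w_ord(L, K):
    """Partitions of L ordered categories into K non-empty groups of
    neighbouring categories: C(L-1, K-1)."""
    return comb(L - 1, K - 1)

def w_nom(L, K):
    """Partitions of L unordered categories into K non-empty groups
    (Stirling number of the second kind)."""
    return sum((-1) ** i * comb(K, i) * (K - i) ** L
               for i in range(K + 1)) // factorial(K)

# Reproduces Tables 2 and 3, e.g.:
assert w_ord(8, 4) == 35 and w_ord(20, 5) == 3876
assert w_nom(8, 3) == 966 and w_nom(20, 2) == 524287   # 0.524 * 10^6
```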
For an ordinal predictor $X_m$ and for small $L_m$ we can examine all possible partitions, calculate the corresponding impurities and select that partition $A_1, A_2, \ldots, A_K$, $\bigcup_{k=1}^{K} A_k = \mathcal X_m(t)$, $A_k \cap A_{k'} = \emptyset$, $k \ne k'$, which minimizes the Gini index

$$\sum_{k=1}^{K} \frac{\sum_{l \in A_k} n_{l\cdot}^{(m)}(t)}{N^{(\alpha)}(t)} \left[ 1 - \sum_{j=1}^{J} \left( \frac{\sum_{l \in A_k} n_{lj}^{(m)}(t)}{\sum_{l \in A_k} n_{l\cdot}^{(m)}(t)} \right)^{\!2} \right].$$

For a nominal predictor $X_m$ it is infeasible to try all possible partitions (see Table 3). Therefore, another algorithm must be used. We decided to modify the algorithm described by Chou (1991). Suppose $n_{l\cdot}^{(m)}(t) \ne 0$ for all $l \in \mathcal X_m(t)$. Our goal is to find the disjoint partition of the set $\mathcal X_m(t)$ of the possible values of $X_m$ into $A_1, \ldots, A_K$ and corresponding nodes $t_1, \ldots, t_K$ which minimizes the average impurity

$$I(A_1, \ldots, A_K\,|\,t) = \sum_{k=1}^{K} p(t_k\,|\,t)\, i(t_k),$$

where $p(t_k\,|\,t)$ denotes the probability that an object in $t$ has the value of the $m$th predictor in $A_k$. This probability will be estimated as

$$\hat p(t_k\,|\,t) = \frac{\sum_{l \in A_k} n_{l\cdot}^{(m)}(t)}{N^{(\alpha)}(t)}.$$

Estimate the probability that an object in the node $t$ has the value of the $m$th predictor equal to $l$ by

$$\hat p(s_l\,|\,t) = \frac{n_{l\cdot}^{(m)}(t)}{N^{(\alpha)}(t)}.$$

This means that the $s_l$, $l \in \mathcal X_m(t)$, correspond to the partition where a son $s_l$ is assigned to each $l \in \mathcal X_m(t)$. Denote the vector of relative frequencies of objects from the classes $j = 1, \ldots, J$ in the node $\tau$ by $\mu(\tau) = (p(1\,|\,\tau), p(2\,|\,\tau), \ldots, p(J\,|\,\tau))'$; $\mu(\tau)$ will be called a centre of the node. Further denote the distance between centres of nodes $\tau_1$ and $\tau_2$ by

$$d(\tau_1, \tau_2) = \delta(\mu(\tau_1), \mu(\tau_2)) = \sum_{j=1}^{J} (\mu_j(\tau_1) - \mu_j(\tau_2))^2.$$

Theorem 4.1.1. A necessary condition on any partition $A_1, \ldots, A_K$ minimizing the average impurity $I(A_1, \ldots, A_K\,|\,t) = \sum_{k=1}^{K} p(t_k\,|\,t)\, i(t_k)$ is that $l \in A_k$ only if $k = \mathrm{argmin}_{\kappa}\, d(s_l, t_\kappa)$, or if $p(s_l\,|\,t) = 0$, where the $t_k$ correspond to the $A_k$.

This theorem was proved by Chou (1991) for $K = 2$ and the proof for $K > 2$ is its straightforward extension. The algorithm for constructing the partition into $K$ groups follows.

1. Select the initial grouping $A_1, A_2, \ldots, A_K$ of $\mathcal X_m(t)$.
2. Compute the centres of the groups as vectors of relative frequencies of the $j$th class, i.e. $\hat\mu(k) = (\hat\mu_1(k), \hat\mu_2(k), \ldots, \hat\mu_J(k))'$, where

$$\hat\mu_j(k) = \frac{\sum_{l \in A_k} n_{lj}^{(m)}(t)}{\sum_{l \in A_k} n_{l\cdot}^{(m)}(t)}, \qquad j = 1, \ldots, J.$$

3. Repeat the following steps iteratively:
(a) Update the partitions so that every row of relative frequencies in the contingency table is assigned to the group whose centre is nearest, i.e.

$$A_k := \left\{ l;\ k = \mathop{\mathrm{argmin}}_{\kappa = 1, \ldots, K} \delta\!\left( \left( \frac{n_{l1}^{(m)}(t)}{n_{l\cdot}^{(m)}(t)}, \ldots, \frac{n_{lJ}^{(m)}(t)}{n_{l\cdot}^{(m)}(t)} \right)'\!,\ \hat\mu(\kappa) \right) \right\}, \qquad k = 1, 2, \ldots, K.$$

If the distance is the same for several centres, we can choose randomly among them.
(b) Compute the centres of the new groups.
4. Stop the iterative process as soon as no reduction of impurity is reached.

The impurity of the split is evaluated via the Gini index of impurity as

$$\sum_{k=1}^{K} \hat p(t_k\,|\,t) \left[ 1 - \sum_{j=1}^{J} (\hat\mu_j(k))^2 \right].$$
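A sketch of this modified Chou iteration on the contingency-table rows, with the weighted Gini index as the objective (the function name, argument layout and iteration cap are our assumptions):

```python
import numpy as np

def chou_partition(counts, K, groups, n_iter=100):
    """Iterative grouping of predictor values (a sketch after Chou, 1991).

    counts : (L, J) array, counts[l, j] = n_{lj}^{(m)}(t) from Table 1
             (every row is assumed non-empty);
    K      : required number of groups;
    groups : initial assignment, array of length L with values in 0..K-1.
    Returns the final assignment and the Gini impurity of the split."""
    counts = np.asarray(counts, dtype=float)
    rows = counts / counts.sum(axis=1, keepdims=True)   # row profiles
    weights = counts.sum(axis=1)                        # n_{l.}^{(m)}(t)
    groups = np.asarray(groups).copy()

    def impurity(g):
        tot = 0.0
        for k in range(K):
            nk = weights[g == k].sum()
            if nk > 0:
                mu = counts[g == k].sum(axis=0) / nk    # centre mu^(k)
                tot += nk * (1.0 - np.sum(mu ** 2))     # Gini of group k
        return tot / weights.sum()

    best = impurity(groups)
    for _ in range(n_iter):
        # step (b)/2: centres of the current groups
        centres = np.array([counts[groups == k].sum(axis=0)
                            / max(weights[groups == k].sum(), 1.0)
                            for k in range(K)])
        # step (a): reassign every row to the nearest centre
        d = ((rows[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_groups = d.argmin(axis=1)
        new = impurity(new_groups)
        if new >= best:          # step 4: stop when no reduction is reached
            break
        groups, best = new_groups, new
    return groups, best
```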
4.2. Selection of split

We are ready to construct splits with an arbitrary number of groups for both types of predictors. At first we can try all binary splits and choose the split $s_2$ with the minimum impurity, say $I_2$. Similarly we can choose the purest 3-, 4-, ..., $MaxK$-way splits $s_3, s_4, \ldots, s_{MaxK}$ with impurities $I_3, I_4, \ldots, I_{MaxK}$. Using the six properties imposed on the impurity function in Section 3.1, it is easily verified that $I_2 \ge I_3 \ge \cdots \ge I_{MaxK}$.

Should we use $s_{MaxK}$ for splitting $t$? We can expect a big decrease of impurity at the beginning of the sequence $I_2, I_3, \ldots, I_{MaxK}$ and only small changes at the end. Moreover, if we decide on too many groups in the split, we will get much smaller nodes than $t$ and it will be difficult to split them appropriately further. On the other hand, three-way or four-way splits can sometimes better reflect the structure of data than the binary ones. The crucial question is: what is the 'best' number of groups in a split?

We suggest the following method for comparing splits with different numbers of groups. Let $MaxK$ denote the maximal number of groups for which we construct splits. Let $s_k$ be a split with $k$ value groups. We will create sons of $t$ according to $s_k$ and continue their splitting until we get a tree with root $t$ and $MaxK$ leaves. Then we will select the split which has grown into the purest tree. We must decide which splits we will consider for the initial splitting of $t$ and for the completion of these splits.

Basic possibilities for initial splits:
1. Try all splits.
2. For each predictor $X_m$, $m = 1, \ldots, M$, and for $K = 2, 3, \ldots, MaxK$, try the split according to $X_m$ with $K$ groups and minimal impurity.
3. For $K = 2, 3, \ldots, MaxK$, try the split with $K$ groups and minimal impurity.

Basic possibilities for the completion of a split:
1. All possible partitions.
2. Gradually split temporary leaves until their number is $MaxK$. For splitting use: (a) binary splits only; (b) both binary and non-binary splits.

Of course, we are limited by the time and the memory of the computer, because the number of splits grows exponentially. Let us now look more closely at the sequence of purest splits $s_2, \ldots, s_{MaxK}$ with $2, \ldots, MaxK$ groups according to the $m$th predictor. It is quite possible that $s_k$
is only a refinement of $s_l$, $l < k$. For example, when the categories of the $m$th predictor are divided by $s_2$ into groups $\{1, 2, 3\}$ and $\{4, 5, 6, 7, 8\}$, by $s_3$ into groups $\{1, 2, 3\}$, $\{4, 5\}$ and $\{6, 7, 8\}$, and by $s_3'$ into groups $\{1, 2\}$, $\{3, 4, 5\}$ and $\{6, 7, 8\}$, then $s_3$ is a refinement of $s_2$ but $s_3'$ is not. When $s_k$ is a refinement of some split $s_l$ among $s_2, \ldots, s_{k-1}$, it can scarcely be completed in any better way than $s_l$, and we can save time by excluding such a split from the process of completion.

When we consider non-binary splits in the context of Gelfand et al.'s algorithm, a new problem emerges. Assume that the node $t$ is gradually split by binary splits into the tree $T_t$. When we prune it, we will choose a subtree $T_t' \preceq T_t$. The number of leaves in $T_t'$ can range from 1 to $|\tilde T_t|$. Suppose that several of the first splits used for the construction of $T_t$ are based on the same predictor. Then we can express them alternatively as one multi-way split. However, we lose this variety during pruning: we can only accept or refuse the multi-way split as a whole. From this it follows that we must handle the construction of multi-way splits very deliberately.

When we have a sequence of suggested splits $s_2, \ldots, s_{MaxK}$ according to the $m$th predictor, we will divide the data in $t$ according to each split and in each group select the major category of $Y$. Then we will compute the frequencies $r_2, \ldots, r_{MaxK}$ of objects from $\mathcal L^{(\alpha)}$ misclassified by $s_2, \ldots, s_{MaxK}$. We will find the greatest $K_0 \in \{2, \ldots, MaxK\}$ for which $r_{K_0} = \min_{k = 2, \ldots, MaxK} r_k$ and exclude the splits $s_{K_0 + 1}, \ldots, s_{MaxK}$ from further consideration.

We implemented the method described with option 2 for initial splits and 2(a) for their completion. A detailed description of our implementation follows. A node will be called degenerate if it contains data that have the same values of all predictors but belong to different classes.
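The refinement test mentioned above (excluding a $k$-group split that merely refines an already considered split) can be sketched as follows, with partitions encoded as group labels per category (the function name is ours):

```python
def is_refinement(fine, coarse):
    """True if partition `fine` refines `coarse`, i.e. each group of `fine`
    lies inside a single group of `coarse`. Partitions are group labels per
    category: s2 = [0,0,0,1,1,1,1,1], s3 = [0,0,0,1,1,2,2,2] gives
    is_refinement(s3, s2) == True, as in the example above."""
    blocks = {}
    for f, c in zip(fine, coarse):
        if blocks.setdefault(f, c) != c:   # group f straddles two groups of coarse
            return False
    return True
```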
The algorithm for completion of the split $s_K$ is now presented. It constructs the tree $T_t$ with root $t$ and, if it is possible, with $MaxK$ leaves, calculates its impurity and then deletes $T_t$ with the exception of the root $t$.

Split the node t according to sK into sons t1, ..., tK;
ToSplitting := {t1, ..., tK};
Exclude from ToSplitting pure and degenerate nodes;
ConsideredSplits := {};
For each node τ in ToSplitting Do
{ Find the purest binary split of τ;
  Add it to ConsideredSplits;
  Calculate its impurity
}
i := K + 1;
While (ConsideredSplits ≠ {}) and (i ≤ MaxK) Do
{ Select from ConsideredSplits the split s with the maximal decrease of impurity;
  Carry out s and denote the two new nodes τ1, τ2;
  For τ := τ1, τ2 Do
    If τ is not degenerate or pure Then
    { Find the purest binary split of τ;
      Add it to ConsideredSplits;
      Calculate its impurity
    }
  i := i + 1;
  Delete s from ConsideredSplits
}
Calculate the impurity of the constructed subtree as a sum of the impurities of the leaves with weights equal to the number of data in them;
Delete all descendants of t.

Now follows the algorithm for selecting the split of the node $t$. The algorithm finds the purest 2-way, 3-way, ..., $MaxK$-way splits according to each predictor, completes them using the above-described algorithm and selects the one which leads to the least impurity. The selected split is stored in a variable SelectedSplit and the corresponding impurity in LeastImpurity.

LeastImpurity := 1;
For each predictor Do
{ For K := 2 To MaxK Do
  { Find the purest split sK with K groups;
    Calculate the number of misclassified data rK in L^(α) using split sK and the major class criterion
  }
  Find maximal K0 such that rK > rK0 for all K < K0;
  Impurity := 1;
  For K := 2 To K0 Do
  { Complete sK and put the calculated value of impurity into NewImpurity;
    If NewImpurity ≤ Impurity Then
    { Impurity := NewImpurity;
      Split := sK
    }
  }
  If Impurity < LeastImpurity Then
  { LeastImpurity := Impurity;
    SelectedSplit := Split
  }
}

5. Empirical testing
To illustrate the behaviour of the suggested algorithm, several examples will be presented. Assume that the node t can be split according to ordinal predictor X1 with nine categories or according to ordinal predictor X2 with four
categories. The corresponding contingency tables are given in Tables 4.4-4.6.

Table 4.4. Contingency table $X_1 \times X_2 \times Y$

  X_2  Y    X1=1   2   3   4   5   6   7   8   9
  1    1      1    1   0   0   0   0   0   0   0
  1    2      0    0   2   0   1   0   1   0   0
  1    3      0    0   0   1   2   2   3   3   5
  1    4      0    0   0   0   0   1   2   3   6
  2    1      1    1   2   0   0   0   0   0   0
  2    2      0    0   0   1   0   1   1   2   1
  2    3      0    0   0   2   2   3   3   3   3
  2    4      0    0   0   1   2   2   2   3   7
  3    1      0    1   2   1   0   0   1   0   0
  3    2      0    0   2   5   4   4   1   1   0
  3    3      0    0   0   0   1   3   2   3   2
  3    4      0    0   0   0   0   0   1   2   2
  4    1      0    0   1   2   1   1   0   0   0
  4    2      0    0   0   1   0   0   0   0   0
  4    3      0    0   1   0   0   0   0   0   0
  4    4      0    0   0   0   0   0   0   0   0

Table 4.5. Contingency table $X_1 \times Y$

  X_1 \ Y    1    2    3    4     Σ
  1          2    0    0    0     2
  2          3    0    0    0     3
  3          5    4    1    0    10
  4          3    7    3    1    14
  5          1    5    5    2    13
  6          1    5    8    3    17
  7          1    3    8    5    17
  8          0    3    9    8    20
  9          0    1   10   15    26
  Σ         16   28   44   34   122

Table 4.6. Contingency table $X_2 \times Y$

  X_2 \ Y    1    2    3    4     Σ
  1          2    4   16   12    34
  2          4    6   16   17    43
  3          5   17   11    5    38
  4          5    1    1    0     7
  Σ         16   28   44   34   122

Now we will consider the behaviour of the algorithm for $MaxK = 2$ and $MaxK = 3$.

MaxK = 2

In this case we must find the purest binary split according to $X_1$ and the purest binary split according to $X_2$. Different groups are expressed by different symbols; the symbol in the $i$th row and the $j$th column of a picture indicates the group to which the $i$th category of $X_2$ and the $j$th category of $X_1$ belongs.

[Pictures 1 and 2: the group-membership grids are not recoverable from the extraction. The numbers give the corresponding impurities: 1) 0.649963, the purest binary split according to $X_1$ (grouping $X_1 \in \{1, 2, 3, 4\}$ against $X_1 \in \{5, \ldots, 9\}$, cf. the discussion below); 2) 0.678977, the purest binary split according to $X_2$.]

It is obvious that the first split according to $X_1$ is selected.

MaxK = 3

In this case the algorithm compares four partitions: the purest three-way split according to $X_1$, the purest three-way split according to $X_2$ and the best completions of the binary splits when $MaxK = 2$. We present the following pictures: 3) the purest three-way split according to $X_1$; 4) the purest three-way split according to $X_2$; 5) the best completion of the purest binary split according to $X_1$ by a split according to $X_2$; 6) the best completion of the purest binary split according to $X_2$ by a split according to $X_1$; 7) the best completion of the best binary split according to $X_1$ by a split according to $X_1$, which cannot be better than the purest three-way split according to $X_1$, but is better than the best completion by a split according to $X_2$; 8) the best partition with three groups, but this partition is missed by the algorithm, and could be found only by an exhaustive search through all possibilities.

[Pictures 3-8: the group-membership grids are not recoverable from the extraction. Impurities: 3) 0.619819; 4) 0.656152; 5) 0.626758; 6) 0.621850; 7) 0.626396; 8) 0.612436.]
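As a numerical check, the impurity 0.649963 of picture 1 can be reproduced from Table 4.5 as the weighted Gini index of the grouping $\{1, 2, 3, 4\}$ versus $\{5, \ldots, 9\}$ (a small sketch; names are ours):

```python
import numpy as np

# Rows of Table 4.5: class counts (Y = 1..4) for categories X1 = 1..9.
t45 = np.array([[2, 0, 0, 0], [3, 0, 0, 0], [5, 4, 1, 0],
                [3, 7, 3, 1], [1, 5, 5, 2], [1, 5, 8, 3],
                [1, 3, 8, 5], [0, 3, 9, 8], [0, 1, 10, 15]])

def split_impurity(groups):
    """Weighted Gini impurity of a grouping of the X1 categories;
    groups is a list of (0-based) index lists into t45."""
    n = t45.sum()
    total = 0.0
    for g in groups:
        c = t45[g].sum(axis=0)
        total += c.sum() / n * (1.0 - np.sum((c / c.sum()) ** 2))
    return total

# Binary split {1,2,3,4} | {5,...,9} of picture 1:
print(split_impurity([[0, 1, 2, 3], [4, 5, 6, 7, 8]]))   # -> 0.649963...
```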
If $MaxK = 3$, the algorithm selects the three-way split according to $X_1$.

With $MaxK = 2$ the node $t$ is at first split into nodes $t_1$ and $t_2$. Data with $X_1 \in \{1, 2, 3, 4\}$ are assigned to $t_1$ and data with $X_1 \in \{5, 6, 7, 8, 9\}$ are assigned to $t_2$. In the next step the node $t_2$ is split into nodes $t_3$ and $t_4$. Data with $X_1 \in \{5, 6, 7\}$ are assigned to $t_3$ and data with $X_1 \in \{8, 9\}$ are assigned to $t_4$. This grouping corresponds to picture 7 with impurity 0.626396. It is clear that, in general, running the algorithm with a greater $MaxK$ enables us to select a $MaxK$-way split, or to prefer a $K$-way split, $K < MaxK$, different from that obtained in running the algorithm with a smaller $MaxK$.

To try a more complex example we ran the algorithm on the Iris data from Andrews and Herzberg (1985) with $MaxK = 3$. The data consist of measurements of sepal length, sepal width, petal length, and petal width in millimetres on 50 plants of each type: Iris setosa, Iris versicolor and Iris virginica. The measurements were used as ordinal predictors (one millimetre = one category). Closer examination of the data reveals that Iris setosa is 100% distinguishable from the other two types by both petal length and petal width. Despite the fact that both petal length and petal width are also quite good at distinguishing Iris versicolor and Iris virginica, these two types cannot be distinguished absolutely. It seems that petal length is a bit better than petal width.

[Fig. 1. Iteration process for Iris data: trees $T_0$, $T_0^*$, $T_1$ and $T_1^* = T_0^* = T^*$; the tree diagrams are not recoverable from the extraction.]

The final tree $T^*$ is very simple, containing only one three-way split (Fig. 1). Petal length is the splitting variable. The estimated misclassification error is 0.053. The CART algorithm with the data divided into two halves (one half learning sample and one half test sample) and the 1 SE rule also gave a tree with three terminal nodes (now it means two binary splits), and the estimated misclassification error 0.052. The first splitting variable is petal width
and the second splitting variable is petal length. Of course, both procedures are unstable in the sense that we could get different trees depending on the random division of the data into groups in other runs. We notice several advantages of considering multi-way splits. The three-way split is more appropriate than the combination of two binary splits in this situation. Usually the three-way split uses petal length as the splitting variable, whereas the binary tree uses petal length or petal width as the first splitting variable (to separate Iris setosa from the other two types) and petal length as the second splitting variable (to separate Iris versicolor and Iris virginica). The possibility of using different splitting variables in an arbitrary way seems to be a bit confusing.

The third example was designed to give an empirical comparison of the error rates and tree sizes of the described algorithm with CART, and to see the effect of adjusting $MaxK$. The task was to determine one of the five functions plotted in Fig. 2, sampled at points $0, 1, \ldots, 9$ (10 predictors) with $N(0, 4)$ noise added.

[Fig. 2. Functions: the plot of the five functions is not recoverable from the extraction.]

The results averaged over 10 runs for $N = 250$ (approximately 50 samples of each function) and $N = 500$ (approximately 100 samples of each function) are in Table 7. $\hat R$ is calculated according to (1) and $R$ is calculated by classifying an independent data set with $N = 5000$. The same data sets were analysed by the CART algorithm using both 10-fold cross-validation with the 0 SE option and the test estimate with the 0 SE option. Table 7 suggests that CART with cross-validation produces much larger and slightly more accurate trees than the described algorithm, and CART with the test estimate produces slightly larger and less accurate trees. Moreover, there is a greater variability in the size of the CART trees. The effect of adjusting $MaxK$ is not entirely clear. In the $N = 250$ case the greater the $MaxK$ the smaller and more precise the trees (with the exception of $MaxK = 3$). A similar effect can be observed in the case $N = 500$: $MaxK = 2$ trees are the largest, $MaxK = 5$ trees are the smallest, but $MaxK = 2$ trees are the most precise.
Table 7. Empirical comparison with the CART algorithm (entries are mean / standard deviation)

Functions data, $N = 250$:
  CART cv:        |T̃| = 33.2 / 12.339;    R̂ = 0.2348 / 0.02603;    R = 0.24662 / 0.017
  CART ts:        |T̃| = 16.6 / 7.98088;   R̂ = 0.2563 / 0.02513;    R = 0.28162 / 0.01546
  T*, MaxK = 2:   |T̃| = 15.9 / 1.80278;   R̂ = 0.22182 / 0.02983;   R = 0.27486 / 0.02019
  T*, MaxK = 3:   |T̃| = 14.6 / 2.06828;   R̂ = 0.23244 / 0.05321;   R = 0.27808 / 0.02458
  T*, MaxK = 4:   |T̃| = 14.4 / 3.91933;   R̂ = 0.22381 / 0.03998;   R = 0.26536 / 0.02292
  T*, MaxK = 5:   |T̃| = 14.1 / 3.42783;   R̂ = 0.22311 / 0.03897;   R = 0.26514 / 0.02779

Functions data, $N = 500$:
  CART cv:        |T̃| = 40.7 / 21.5718;   R̂ = 0.1918 / 0.01682;    R = 0.21778 / 0.01218
  CART ts:        |T̃| = 25.7 / 10.2637;   R̂ = 0.21855 / 0.03655;   R = 0.23734 / 0.01198
  T*, MaxK = 2:   |T̃| = 25.7 / 3.9454;    R̂ = 0.17224 / 0.01707;   R = 0.2276 / 0.01783
  T*, MaxK = 3:   |T̃| = 23.5 / 2.3214;    R̂ = 0.18517 / 0.02564;   R = 0.23249 / 0.02098
  T*, MaxK = 4:   |T̃| = 24.1 / 3.3483;    R̂ = 0.19318 / 0.01835;   R = 0.23662 / 0.01316
  T*, MaxK = 5:   |T̃| = 22.7 / 5.75519;   R̂ = 0.20165 / 0.01901;   R = 0.23288 / 0.01693

Waveform data, $N = 300$:
  CART cv:        |T̃| = 17.0 / 9.7980;    R̂ = 0.2775 / 0.0330;     R = 0.2930 / 0.0197
  CART ts:        |T̃| = 11.5 / 6.2823;    R̂ = 0.2903 / 0.0326;     R = 0.3151 / 0.0218
  T₀*, MaxK = 2:  |T̃| = 9.4375 / 3.1616;  R̂ = 0.2685 / 0.0334;     R = 0.3069 / 0.0263
  T*, MaxK = 2:   |T̃| = 11.75 / 2.1134;   R̂ = 0.2432 / 0.0329;     R = 0.2973 / 0.0192
  T*, MaxK = 3:   |T̃| = 12.0 / 1.9322;    R̂ = 0.2556 / 0.0367;     R = 0.3101 / 0.0174
  T*, MaxK = 4:   |T̃| = 15.125 / 3.4034;  R̂ = 0.2553 / 0.0455;     R = 0.3093 / 0.0160

The algorithm described with the $MaxK = 2$ option is almost identical to Gelfand et al.'s algorithm. Gelfand et al. (1991) state that their algorithm is superior to CART with cross-validation. This result is not supported by our empirical testing. The last example uses the waveform recognition data which are used by both Breiman et al. and Gelfand et al. to test their algorithms. Gelfand et al. obtained (for data size $N = 300$) misclassification errors for CART with cross-validation between 0.29 and 0.32, and for their algorithm between 0.26 and 0.27. The detailed results show that the major improvement was achieved already in iteration step 0. The difference between CART with the test estimate and iteration step 0 of the iterative algorithm lies only in the set of subtrees of $T_{\max}$ which are considered during pruning (CART creates a parametrized family of subtrees, the iterative algorithm uses all subtrees). To get better insight into the behaviour of the algorithms we added the code for CART into our program. Then we used the maximal tree $T_0$ obtained in iteration step 0 (with $MaxK = 2$) also for CART pruning, and we calculated the misclassification error of $T_0^*$ as well. This enables us to compare the misclassification errors of the CART-with-test-estimate tree, of the tree after iteration step 0 (a different method of pruning) and of the tree at the end of the iteration process (the enlarged tree $T^*$). The waveform results in Table 7 are averages of 16 runs.
The results are similar to the results obtained for the functions data. The misclassification errors differ only slightly and any of the algorithms can be the best or the worst for a particular data set generated from the same distribution. On average, CART with cross-validation gives the most precise, but the largest, trees.
6. Software implementation

The methods described were implemented in the programming language Mathematica. The program has about 1800 lines and includes procedures for constructing classification trees, graphical output, categorizing continuous predictors, classifying independent data samples, estimating the misclassification error, calculating the importance of predictors, etc.
7. Conclusions

In this paper we have tried to add non-binary splits and a look-ahead strategy to Gelfand et al.'s algorithm. We have restricted the types of predictors to categorical ones only. If a variable is continuous it must be categorized. This is not a big disadvantage, because the goal of constructing the classification tree is also a grouping of the data, and we can create as many categories as we want.
If $MaxK$, the maximal number of sons of nodes, is increased, a larger number of splits is considered and the algorithm works more slowly. However, in most cases the time necessary for the construction of the tree is not so important; on the other hand, we can benefit from a slightly more precise and smaller tree during the classification of new observations and during the investigation of the structure of the data. A greater $MaxK$ does not necessarily give better trees: a greater decrease of impurity in one stage of construction, calculated from a sample data set, does not imply a smaller true misclassification error of the tree. We used the Gini index of diversity as the impurity function; some other impurity functions might be more appropriate for multi-way splits. Trees constructed by the suggested algorithm are comparable with CART trees. However, our testing does not confirm the superiority of Gelfand et al.'s algorithm over CART with cross-validation.
References

Andrews, D. F. and Herzberg, A. M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag, New York.
Bahl, L. R., Brown, P. F., deSouza, P. V. and Mercer, R. L. (1989) A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 1001-8.
Biggs, D., de Ville, B. and Suen, E. (1991) A method for choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18, 49-62.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth, Belmont, CA.
Buntine, W. L. (1993) Learning classification trees. Artificial Intelligence Frontiers in Statistics: AI and Statistics III, 182-201.
Chou, P. A. (1991) Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 340-53.
Ciampi, A. (1993) Constructing prediction trees from data, the RECPAM approach. In J. Antoch (ed.) Computational Aspects of Model Choice. Physica-Verlag, Heidelberg.
Duran, B. S. and Odell, P. L. (1974) Cluster Analysis. Springer-Verlag, Berlin.
Gelfand, S. B., Ravishankar, C. S. and Delp, E. J. (1991) An iterative growing and pruning algorithm for classification tree design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 163-74.
Hawkins, D. M. (1990) FIRM (Formal Inference-based Recursive Modeling). Technical Report Number 546, University of Minnesota, School of Statistics.
Kass, G. V. (1980) An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119-27.
Keprta, S. (1993) Klasifikátory konstruované pomocí rekurzivních postupů (Classifiers constructed by recursive procedures). Master's thesis, Charles University, Prague.
Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3, 319-42.
Oliver, J. J., Dowe, D. L. and Wallace, C. S. (1992) Inferring decision graphs using the minimum message length principle. In A. Adams and L. Sterling (eds) Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, 361-7. World Scientific, Singapore.
Quinlan, J. R. (1986) Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Quinlan, J. R. and Rivest, R. L. (1989) Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227-48.
Safavian, S. R. and Landgrebe, D. (1991) A survey of decision tree classifiers. IEEE Transactions on Systems, Man, and Cybernetics, 21, 660-74.
Wallace, C. S. and Patrick, J. D. (1993) Coding decision trees. Machine Learning, 11, 7-22.
Wolfram, S. (1991) Mathematica: A System for Doing Mathematics by Computer. Addison-Wesley, Reading, MA.
Appendix

In this Appendix we prove Theorems 3.2.1 and 3.2.2. We will need several lemmas. Recall that $T_k^*$ denotes the smallest optimally pruned subtree of $T_k$ with respect to $R^{(\beta)}$, where $\beta = 2$ if $k$ is even and $\beta = 1$ if $k$ is odd. Denote $T_{kt} = (T_k)_t$ and $T_{kt}^* = (T_k^*)_t$. Let $(T_{kt})^*$ be the smallest optimally pruned subtree of the branch $T_{kt}$ with respect to $R^{(\beta)}$. It is obvious that

$$T_{kt}^* = (T_k^*)_t = (T_{kt})^* \quad \text{for } t \in T_k^*. \tag{A.1}$$

Notice that the node $t$ can be pruned off the tree $T_{k-1}$ and the same number of the node can be used in $T_k$. To avoid ambiguity we will replace $R(t)$ by $R(k, t)$ for $t \in T_k$, and $R(T)$ by $R(k, T)$ for $T$ a subtree of $T_k$, when necessary. Note that $R(T_k) = R(k, T_k)$ and $R(T_k^*) = R(k, T_k^*)$. It is obvious that

$$R^{(\beta)}(k, T_{kt}^*) < R^{(\beta)}(k, t) \quad \text{for } t \in T_k^* - \tilde T_k^*. \tag{A.2}$$

Otherwise the branch $T_{kt}^*$ would be replaced by $t$.

Lemma A.1. For every $t \in (T_k - \tilde T_k) - (T_{k-1}^* - \tilde T_{k-1}^*)$ the following holds:

(a) $R^{(\alpha)}(k, t) \ge \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, s)$.
(b) $R^{(\alpha)}(k, t) = \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, s) \Rightarrow R^{(\beta)}(k, t) = \sum_{s \in \mathrm{sons}(t)} R^{(\beta)}(k, s)$,

where $\alpha = 1$ and $\beta = 2$ if $k$ is even and $\alpha = 2$ and $\beta = 1$ if $k$ is odd.

Proof. Let $t \in (T_k - \tilde T_k) - (T_{k-1}^* - \tilde T_{k-1}^*)$. Obviously each $s \in \mathrm{sons}(t)$ belongs to $T_k - T_{k-1}^*$. Hence, the class labels $j(k, s)$ for $s \in \mathrm{sons}(t)$ are assigned based on $\mathcal L^{(\alpha)}$, while $j(k, t)$ might be assigned based on either $\mathcal L^{(\alpha)}$ or $\mathcal L^{(\beta)}$. We get

$$R^{(\alpha)}(k, t) = [1 - p^{(\alpha)}(j(t)\,|\,t)]\, p^{(\alpha)}(t) \ge \Big[1 - \max_j p^{(\alpha)}(j\,|\,t)\Big] p^{(\alpha)}(t) = \Big[1 - \max_j \sum_{s \in \mathrm{sons}(t)} p^{(\alpha)}(j\,|\,s)\, \frac{p^{(\alpha)}(s)}{p^{(\alpha)}(t)}\Big] p^{(\alpha)}(t)$$
$$\ge \Big[1 - \sum_{s \in \mathrm{sons}(t)} \max_j p^{(\alpha)}(j\,|\,s)\, \frac{p^{(\alpha)}(s)}{p^{(\alpha)}(t)}\Big] p^{(\alpha)}(t) = \sum_{s \in \mathrm{sons}(t)} \Big[1 - \max_j p^{(\alpha)}(j\,|\,s)\Big] p^{(\alpha)}(s) = \sum_{s \in \mathrm{sons}(t)} \big[1 - p^{(\alpha)}(j(s)\,|\,s)\big] p^{(\alpha)}(s) = \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, s), \tag{A.3}$$

which proves part (a). The proof of part (b): assume that $R^{(\alpha)}(k, t) = \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, s)$. Then both of the inequalities in (A.3) must be satisfied with equality, i.e.

$$p^{(\alpha)}(j(t)\,|\,t) = \max_j p^{(\alpha)}(j\,|\,t) = \sum_{s \in \mathrm{sons}(t)} p^{(\alpha)}(j(t)\,|\,s)\, \frac{p^{(\alpha)}(s)}{p^{(\alpha)}(t)} = \sum_{s \in \mathrm{sons}(t)} \max_j p^{(\alpha)}(j\,|\,s)\, \frac{p^{(\alpha)}(s)}{p^{(\alpha)}(t)},$$

which implies that $p^{(\alpha)}(j(t)\,|\,s) = \max_j p^{(\alpha)}(j\,|\,s)$ $\forall s \in \mathrm{sons}(t)$. Hence, we may choose $j(s) = j(t)$ $\forall s \in \mathrm{sons}(t)$. But we do choose $j(s) = j(t)$ whenever possible. Hence, we do choose $j(s) = j(t)$ $\forall s \in \mathrm{sons}(t)$, which implies $R^{(\beta)}(k, t) = \sum_{s \in \mathrm{sons}(t)} R^{(\beta)}(k, s)$.

Lemma A.2. If $T_{k-1}^* \preceq T_k^*$ then

$$R^{(\gamma)}(k, T_{kt}^*) < R^{(\gamma)}(k, t) \quad \forall t \in T_k^* - \tilde T_k^*,\ \gamma = 1, 2,$$

for $k = 0, 1, \ldots$

Proof. Fix $k$ and let $\alpha = 1$ and $\beta = 2$ if $k$ is even and $\alpha = 2$ and $\beta = 1$ if $k$ is odd. By (A.2) the lemma holds for $\gamma = \beta$ (without the assumption $T_{k-1}^* \preceq T_k^*$). We must prove the assertion for $\gamma = \alpha$. Let $T_{k-1}^* \preceq T_k^*$. We first show

$$R^{(\alpha)}(k, T_{kt}^*) < R^{(\alpha)}(k, t) \quad \forall t \in (T_k^* - \tilde T_k^*) - (T_{k-1}^* - \tilde T_{k-1}^*).$$

Let $t \in (T_k^* - \tilde T_k^*) - (T_{k-1}^* - \tilde T_{k-1}^*)$. The proof is by induction on the number of nodes in $T_{kt}^* - \tilde T_{kt}^*$. If $|T_{kt}^*| - |\tilde T_{kt}^*| = 1$ (i.e. the node $t$ is split but all its sons are terminal) then by Lemma A.1(a)

$$R^{(\alpha)}(k, T_{kt}^*) = \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, s) \le R^{(\alpha)}(k, t).$$

If equality holds then by Lemma A.1(b)

$$R^{(\beta)}(k, T_{kt}^*) = \sum_{s \in \mathrm{sons}(t)} R^{(\beta)}(k, s) = R^{(\beta)}(k, t),$$

which contradicts (A.2). Hence, $R^{(\alpha)}(k, T_{kt}^*) < R^{(\alpha)}(k, t)$. Next let $n = |T_{kt}^*| - |\tilde T_{kt}^*| > 1$ and assume that $R^{(\alpha)}(k, T_{ks}^*) < R^{(\alpha)}(k, s)$ $\forall s \in (T_k^* - \tilde T_k^*) - (T_{k-1}^* - \tilde T_{k-1}^*)$ whenever $|T_{ks}^*| - |\tilde T_{ks}^*| < n$. Since $|T_{kt}^*| - |\tilde T_{kt}^*| > 1$, at least one son of the node $t$, say $s_0$, must be split. Then by hypothesis $R^{(\alpha)}(k, T_{k s_0}^*) < R^{(\alpha)}(k, s_0)$. Hence

$$R^{(\alpha)}(k, T_{kt}^*) = \sum_{s \in \mathrm{sons}(t)} R^{(\alpha)}(k, T_{ks}^*) = R^{(\alpha)}(k, T_{k s_0}^*) + \sum_{s \in \mathrm{sons}(t) - \{s_0\}} R^{(\alpha)}(k, T_{ks}^*) < R^{(\alpha)}(k, s_0) + \sum_{s \in \mathrm{sons}(t) - \{s_0\}} R^{(\alpha)}(k, s) \le R^{(\alpha)}(k, t)$$

by Lemma A.1(a). By induction $R^{(\alpha)}(k, T_{kt}^*) < R^{(\alpha)}(k, t)$ is proved for $t \in (T_k^* - \tilde T_k^*) - (T_{k-1}^* - \tilde T_{k-1}^*)$. Now the case $t \in T_{k-1}^* - \tilde T_{k-1}^*$ remains. Owing to $T_{k-1}^* \preceq T_k^*$ we have $T_{k-1, t}^* \preceq T_{kt}^*$. Denote $A = (T_{kt}^* - \tilde T_{kt}^*) \cap \tilde T_{k-1, t}^*$. Then

$$R^{(\alpha)}(k, T_{kt}^*) = R^{(\alpha)}(k, T_{k-1, t}^*) + \sum_{s \in A} \big( R^{(\alpha)}(k, T_{ks}^*) - R^{(\alpha)}(k, s) \big) = R^{(\alpha)}(k - 1, T_{k-1, t}^*) + \sum_{s \in A} \big( R^{(\alpha)}(k, T_{ks}^*) - R^{(\alpha)}(k, s) \big) \le R^{(\alpha)}(k - 1, T_{k-1, t}^*) < R^{(\alpha)}(k - 1, t) = R^{(\alpha)}(k, t).$$

Here the second and last equalities follow from $\mathcal A(k, s) = \mathcal A(k - 1, s)$, $j(k, s) = j(k - 1, s)$ $\forall s \in T_{k-1}^*$, which imply $R^{(\gamma)}(k, s) = R^{(\gamma)}(k - 1, s)$ $\forall s \in T_{k-1}^*$. The first inequality is trivial if $A = \emptyset$ and follows from the first part of the proof if $A \ne \emptyset$. The last inequality follows from (A.2).

Lemma A.3. If $R^{(\gamma)}(k, T_{kt}^*) < R^{(\gamma)}(k, t)$ $\forall t \in T_k^* - \tilde T_k^*$, $\gamma = 1, 2$, then $T_k^* \preceq T_{k+1}^*$, for $k = 1, 2, \ldots$

Proof. We shall show that under the assumption we must have $A = (T_k^* - \tilde T_k^*) \cap \tilde T_{k+1}^* = \emptyset$. Since $T_k^* \preceq T_{k+1}$ and $T_{k+1}^* \preceq T_{k+1}$, we have $T_k^* \cup T_{k+1}^* \preceq T_{k+1}$. Let $B = (T_{k+1}^* - \tilde T_{k+1}^*) \cap \tilde T_k^*$. Suppose $A \ne \emptyset$. Then

$$R^{(\gamma)}(k + 1, T_{k+1}^*) = R^{(\gamma)}(k + 1, T_k^*) - \sum_{t \in A} \big( R^{(\gamma)}(k + 1, T_{kt}^*) - R^{(\gamma)}(k + 1, t) \big) + \sum_{t \in B} \big( R^{(\gamma)}(k + 1, T_{k+1, t}^*) - R^{(\gamma)}(k + 1, t) \big)$$
$$= R^{(\gamma)}(k + 1, T_k^*) - \sum_{t \in A} \big( R^{(\gamma)}(k, T_{kt}^*) - R^{(\gamma)}(k, t) \big) + \sum_{t \in B} \big( R^{(\gamma)}(k + 1, T_{k+1, t}^*) - R^{(\gamma)}(k + 1, t) \big)$$
$$> R^{(\gamma)}(k + 1, T_k^*) + \sum_{t \in B} \big( R^{(\gamma)}(k + 1, T_{k+1, t}^*) - R^{(\gamma)}(k + 1, t) \big) = R^{(\gamma)}(k + 1, T_k^* \cup T_{k+1}^*).$$

Here the second equality follows from $\mathcal A(k + 1, t) = \mathcal A(k, t)$, $j(k + 1, t) = j(k, t)$ $\forall t \in T_k^*$, which implies $R^{(\gamma)}(k + 1, t) = R^{(\gamma)}(k, t)$ $\forall t \in T_k^*$. The inequality follows from the assumption. Denote $T' = T_k^* \cup T_{k+1}^*$. Now $T' \preceq T_{k+1}$ and $R^{(\gamma)}(k + 1, T') < R^{(\gamma)}(k + 1, T_{k+1}^*)$, which contradicts the fact that $T_{k+1}^*$ is an optimally pruned subtree of $T_{k+1}$ with respect to $R^{(1)}$ or $R^{(2)}$. Hence, $A = \emptyset$ and $T_k^* \preceq T_{k+1}^*$.

Now we are ready to prove Theorem 3.2.1.

Proof of Theorem 3.2.1. Denote $T_{-1}^* = \mathrm{root}(T_0)$. Since $T_{-1}^* = \mathrm{root}(T_0) \preceq T_0^*$, we have, using Lemmas A.2 and A.3, that $T_{k-1}^* \preceq T_k^*$ and also $R^{(\gamma)}(k, T_{kt}^*) < R^{(\gamma)}(k, t)$ $\forall t \in T_k^* - \tilde T_k^*$ and $\gamma = 1, 2$, for $k = 0, 1, \ldots$ Clearly, we also have $\mathcal A(k - 1, t) = \mathcal A(k, t)$ and $j(k - 1, t) = j(k, t)$ $\forall t \in T_{k-1}^*$, for $k = 1, 2, \ldots$

To prove Theorem 3.2.2 we will need the following lemma.

Lemma A.4. If $t \in \tilde T_{k-1}^* \cap \tilde T_k^*$ then $T_{k+1, t} = T_{k-1, t}$, and $\mathcal A(k + 1, s) = \mathcal A(k - 1, s)$, $j(k + 1, s) = j(k - 1, s)$ $\forall s \in T_{k+1, t} = T_{k-1, t}$, for $k = 1, 2, \ldots$

Proof. Let $t \in \tilde T_{k-1}^* \cap \tilde T_k^*$. Since $t \in \tilde T_k^*$, $T_{k+1, t} - \{t\}$ is generated by splitting $t$ when $T_{k+1}$ is generated from $T_k^*$. Since $t \in \tilde T_{k-1}^*$ and $T_{k-2}^* \preceq T_{k-1}^*$ by Theorem 3.2.1, $T_{k-1, t} - \{t\}$ is generated by splitting $t$ when $T_{k-1}$ is generated from $T_{k-2}^*$. But $t \in T_k^*$ implies that $\mathcal A(k + 1, t) = \mathcal A(k, t)$ and $j(k + 1, t) = j(k, t)$, and $t \in T_{k-1}^*$ implies that $\mathcal A(k, t) = \mathcal A(k - 1, t)$ and $j(k, t) = j(k - 1, t)$, and consequently $\mathcal A(k + 1, t) = \mathcal A(k - 1, t)$ and $j(k + 1, t) = j(k - 1, t)$. Since the same data subset is used to generate $T_{k+1}$ from $T_k^*$ and $T_{k-1}$ from $T_{k-2}^*$, it follows that $T_{k+1, t} = T_{k-1, t}$, and $\mathcal A(k + 1, s) = \mathcal A(k - 1, s)$, $j(k + 1, s) = j(k - 1, s)$ $\forall s \in T_{k+1, t} = T_{k-1, t}$.

Proof of Theorem 3.2.2. Part (a). Assume $K(t) < \infty$. Then $t \in \tilde T_{K(t)-1}^* \cap \tilde T_{K(t)}^*$. At first we will show that if $t \in \tilde T_{k-1}^* \cap \tilde T_k^*$ then $t \in \tilde T_{k+1}^*$, for $k = 1, 2, \ldots$ So suppose $t \in \tilde T_{k-1}^* \cap \tilde T_k^*$. Then by Lemma A.4 $T_{k+1, t} = T_{k-1, t}$, and $\mathcal A(k + 1, s) = \mathcal A(k - 1, s)$, $j(k + 1, s) = j(k - 1, s)$ $\forall s \in T_{k+1, t} = T_{k-1, t}$. Hence $R^{(\gamma)}(k + 1, s) = R^{(\gamma)}(k - 1, s)$ $\forall s \in T_{k+1, t} = T_{k-1, t}$ and $\gamma = 1, 2$. Since $t \in \tilde T_{k-1}^*$, by (A.1) we get $(T_{k-1, t})^* = T_{k-1, t}^* = \{t\}$; and since the same subsample is used for pruning in steps $k + 1$ and $k - 1$, also $T_{k+1, t}^* = (T_{k+1, t})^* = (T_{k-1, t})^* = \{t\}$. Hence $t \in \tilde T_{k+1}^*$, as required. By induction we have $t \in \tilde T_k^*$ $\forall k \ge K(t)$. Obviously we also have $\mathcal A(k, t) = \mathcal A(K(t), t)$ and $j(k, t) = j(K(t), t)$ $\forall k \ge K(t)$.

Part (b). We first show that $K < \infty$. Every $T_k$ has at least one sample in every terminal node. Since the data set is finite, it follows that $\sup_k |T_k| < \infty$. The sequence $|T_{k-1}^*| \le |T_k^*|$ is nondecreasing and bounded, so $K = \inf\{k \ge 1;\ |T_{k-1}^*| = |T_k^*|\} < \infty$. We have $|T_{K-1}^*| = |T_K^*|$, which together with $T_{K-1}^* \preceq T_K^*$ implies that $T_{K-1}^* = T_K^*$. We next show that if $T_{k-1}^* = T_k^*$ then $T_k^* = T_{k+1}^*$, for $k = 1, 2, \ldots$ So assume $T_{k-1}^* = T_k^*$. From part (a) it follows that $K(t) \le k$ $\forall t \in \tilde T_k^*$. Hence $\tilde T_k^* \subset \tilde T_{k+1}^*$, which together with $T_k^* \preceq T_{k+1}^*$ implies $T_k^* = T_{k+1}^*$. By induction we have $T_k^* = T_K^*$ $\forall k \ge K$. Clearly, we also have $\mathcal A(k, t) = \mathcal A(K, t)$ and $j(k, t) = j(K, t)$ $\forall t \in T_k^*$ and $k \ge K$.
Acknowledgment

I am very much indebted to Dr Jaromír Antoch for reading the manuscript and for his helpful comments. I would also like to express my gratitude to the referee for the valuable suggestions on improving the manuscript and for sending me several relevant papers. This research was partially supported by grant GAČR 0472.