Journal of Classification 32 (2015) DOI: 10.1007/s00357-015-9172-4
Influence Measures for CART Classification Trees

Avner Bar-Hen, Laboratoire MAP5, Université Paris Descartes, France
Servane Gey, Laboratoire MAP5, Université Paris Descartes, France
Jean-Michel Poggi, Laboratoire de Mathématiques, Université Paris Sud and Université Paris Descartes, France
Abstract: This paper deals with measuring the influence of observations on the results obtained with CART classification trees. To define the influence of individuals on the analysis, we use influence measures to propose criteria quantifying the sensitivity of the CART classification tree analysis. The proposals are based on predictions and use jackknife trees. The analysis is extended to the pruned sequences of CART trees to produce CART-specific notions of influence. Using the framework of influence functions, distributional results are derived. A numerical example, the well-known spam dataset, is presented to illustrate the notions developed throughout the paper. A real dataset relating the administrative classification of cities surrounding Paris, France, to the characteristics of their tax revenue distributions is finally analyzed using the new influence-based tools.

Keywords: Influential individuals; Influence functions; Decision trees; CART.
Authors’ Addresses: A. Bar-Hen, email: [email protected]; S. Gey, email: [email protected]; J.-M. Poggi, email: [email protected].
1. Introduction
Classification And Regression Trees (CART; Breiman, Friedman, Olshen, and Stone 1993) have proven to be very useful in various applied contexts, mainly because the models can include numerical as well as nominal explanatory variables and because they can be easily represented (see for example Zhang and Singer 2010, or Bel, Allard, Laurent, Cheddadi, and Bar-Hen 2009). Because CART is a nonparametric method that also provides a partitioning of the data into distinct groups, such tree models have several additional advantages over other techniques: for example, input data do not need to be normally distributed, predictor variables are not assumed to be independent, and non-linear relationships between predictor variables and observed data can be handled.

It is well known that CART is sensitive to perturbations of the learning set. This drawback is even a key property making resampling and ensemble-based methods (such as bagging and boosting) effective (see Gey and Poggi 2006). To preserve the interpretability of the obtained model, it is important in many practical situations to restrict the analysis to a single tree. The stability of decision trees is then clearly an important issue, and it becomes important to be able to evaluate the sensitivity of the results to the data. Briand, Ducharme, Parache, and Mercat-Rommens (2009) proposed a similarity measure between trees to quantify this sensitivity and used it, from an optimization perspective, to build a less sensitive variant of CART. This view of instability, related to bootstrap ideas, can also be examined from a local perspective. Following this line, Bousquet and Elisseeff (2002) studied the stability of a given method by replacing one observation in the learning sample with another one coming from the same model. Many authors derived asymptotic normality of influence functions under weak assumptions (see for example Campbell 1978; Critchley and Vitiello 1991; Croux and Joossens 2005; Croux, Filzmoser, and Joossens 2008; and Croux, Haesbroeck, and Joossens 2008). These results can be used to obtain a threshold to decide whether an observation is influential or not.

The aim of this paper is to focus on diagnosis issues for individual observations rather than on model properties or variable selection problems. The use of an influence measure is a classical diagnostic method to measure the perturbation induced by a single element; in other words, we examine the stability issue through the jackknife. We use decision trees to perform diagnosis on observations. We do not propose a solution to the eternal accuracy/interpretability dilemma, which is typical within the tree-based methods framework, since our goal is the analysis of influential observations after the tree structure is built.

The outline is the following. Section 2 first recalls some general background on the CART method. Then it introduces some influence
measures for CART based on predictions and an influence measure more deeply related to the CART method, involving sequences of pruned trees. Distributional results are derived in Section 3. Section 4 contains an illustrative numerical application on the spam dataset (see Hastie, Tibshirani and Friedman 2009). Section 5 explores an original dataset relating the administrative classification of cities surrounding Paris, France, to the characteristics of their tax revenue distributions, by using the new influence-based tools. Finally, Section 6 opens some perspectives.

2. Methods and Notations

2.1 CART Classification Trees
Let us briefly recall, following Bel et al. (2009), some general background on Classification And Regression Trees (CART). For a more detailed presentation see Breiman et al. (1993) or, for a simple introduction, see Venables and Ripley (2002). The data are considered as an independent sample of the random variables $(X^1, \dots, X^p, Y)$, where the $X^k$'s are the explanatory variables and $Y$ is the dependent variable to be explained. CART is a rule-based method that generates a binary tree through recursive partitioning: it splits a subset (called a node) of the data set into two subsets (called sub-nodes) according to the minimization of a heterogeneity criterion computed on the resulting sub-nodes. Each split is based on a single variable. Note that even if, from a theoretical point of view, the CART methodology also allows for more general splits, most of the packages that implement CART only consider univariate splits; we adopt this restriction here. Some variables may be used several times while others may not be used at all.

Let us consider a decision tree $T$. When $Y$ is a categorical variable, a class label is assigned to each terminal node (or leaf) of $T$. Hence $T$ can be viewed as a mapping assigning a value $Y_i = T(X_i^1, \dots, X_i^p)$ to each observation. The growing step leading to a deep maximal tree is obtained by recursive partitioning of the training sample, selecting the best split at each node according to some heterogeneity index, chosen such that it is equal to 0 when only one class is represented in the node to be split, and is maximum when all classes are equally frequent. The two most popular heterogeneity criteria are the Shannon entropy and the Gini index. Among all binary partitions of the set of values of the explanatory variables at a node $t$, the principle of CART is to split $t$ into two sub-nodes $t_-$ and $t_+$ according to a threshold on one of the variables (or a subset of the labels for categorical variables), such that the reduction of heterogeneity between the node and the two sub-nodes is maximized. The growing procedure is stopped when there is no more admissible splitting. Each leaf is assigned to the most frequent class of its observations.
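To make the split-selection step concrete, the following short R sketch (ours, not the authors' code; function names are illustrative) computes the Gini index of a node, the reduction of heterogeneity obtained by splitting a numerical variable at a given threshold, and the best univariate split on that variable:

    # Gini index of a vector of class labels: 0 for a pure node,
    # maximal when all classes are equally frequent.
    gini <- function(y) {
      p <- table(y) / length(y)
      1 - sum(p^2)
    }

    # Reduction of heterogeneity when node (x, y) is split at threshold s
    # into t- = {x <= s} and t+ = {x > s}, weighted by the sub-node sizes.
    gini_gain <- function(x, y, s) {
      left <- x <= s
      n <- length(y)
      gini(y) - (sum(left) / n) * gini(y[left]) - (sum(!left) / n) * gini(y[!left])
    }

    # Best split on a single numerical variable: try the midpoints
    # between consecutive distinct values of x.
    best_split <- function(x, y) {
      v <- sort(unique(x))
      cuts <- (head(v, -1) + tail(v, -1)) / 2
      gains <- sapply(cuts, gini_gain, x = x, y = y)
      list(threshold = cuts[which.max(gains)], gain = max(gains))
    }

The same scheme applies with the Shannon entropy in place of the Gini index; CART then retains, over all candidate variables, the split maximizing this gain.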
Of course, such a maximal tree (denoted by $T_{max}$) generally overfits the training data: the training error $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{T(X_i^1,\dots,X_i^p) \neq Y_i}$ (where $\mathbb{1}$ is the indicator function and $n$ the total number of observations) keeps decreasing as the number of terminal nodes increases, while the corresponding prediction error $R(T_{max})$, where $R(T)$ is defined by

$$R(T) = P\big(T(X^1, \dots, X^p) \neq Y\big), \qquad (1)$$

is typically large. Since the goal is to build from the available data a tree whose prediction error is as small as possible, in a second stage the tree $T_{max}$ is pruned to produce a subtree whose expected performance is close to the minimum of $R(T)$ over all binary subtrees $T$ of $T_{max}$. Since the joint distribution $P$ of $(X^1, \dots, X^p, Y)$ is unknown, the pruning is based on the penalized empirical risk $\hat R_{pen}(T)$, which balances the optimistic estimate given by the empirical risk by adding a complexity term that penalizes larger subtrees. More precisely, the empirical risk is penalized by a complexity term which is linear in the number of leaves of the tree:

$$\hat R_{pen}(T) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{T(X_i^1,\dots,X_i^p) \neq Y_i} + \alpha\,|T|, \qquad (2)$$

where $\alpha$ is a positive penalty constant, $|T|$ denotes the number of leaves of the tree $T$, and $Y_i$ is the $i$th random realization of $Y$. Although the penalty constant $\alpha$ ranges over the positive reals, the solution of the minimization of the penalized criterion belongs to a finite sequence $\{T_1; \dots; T_K\}$ of nested subtrees (where $T_K$ is reduced to the root of the tree and $T_1$ is a deep maximal tree), associated with a nondecreasing sequence of temperatures $\{cp_1; \dots; cp_K\}$; as is often done, we denote by $cp$ the cost complexity parameter, corresponding to the temperature $\alpha$ used in the original criterion (2) divided by the misclassification rate of the tree restricted to its root. The pruning step provides a finite sequence of subtrees pruned from the maximal tree. Then, at the end of the algorithm, a tree is selected from the sequence by using a test sample or a cross-validated cost complexity value (see Breiman et al. 1993 for more details). These two selection procedures are related to the minimization of two different estimators of $R(T)$.
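As an illustration of the growing and pruning steps, here is a minimal R sketch using the rpart package (the iris data and the parameter values are ours, chosen only to keep the example self-contained):

    library(rpart)

    # Grow a deliberately deep maximal tree (no complexity penalty, no CV here).
    fit_max <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2, xval = 0))

    # The cp table describes the nested sequence T_1, ..., T_K of pruned
    # subtrees: one row per value of the cost complexity parameter cp.
    print(fit_max$cptable)

    # Pruning at a given cp value yields the corresponding subtree.
    cp_values <- fit_max$cptable[, "CP"]
    subtrees  <- lapply(cp_values, function(cp) prune(fit_max, cp = cp))
    sapply(subtrees, function(t) sum(t$frame$var == "<leaf>"))  # numbers of leaves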
2.2 Influence Measures for CART
Let $X = (X^1, \dots, X^p) \in \mathcal{X}$ be the vector of the explanatory variables, and consider that the data are independent realizations $\mathcal{L} = \{(x_1, y_1); \dots; (x_n, y_n)\}$ of $(X, Y) \in \mathcal{X} \times \{1; \dots; J\}$. The dependent variable $Y$ is assumed to be a categorical variable with $J$ unordered categories. Influence measures quantify the discrepancy between $T$, the tree computed with the
complete sample $\mathcal{L}$, and $T^{(-i)}$, the jackknife tree computed with $\mathcal{L}^{-i}$, the whole sample minus the observation $(x_i, y_i)$. It is important to note that an observation is influential or not with respect to a given tree. There exist various estimators of $R(T)$ which can lead to different optimal trees. The same estimator has to be used to construct the reference tree and each of the jackknife trees. Let $T(x_k)$ (resp. $T^{(-i)}(x_k)$) be the class prediction of $x_k$ based on $T$ (resp. $T^{(-i)}$). Two situations have to be distinguished: (i) if $k = i$, the class prediction $T^{(-i)}(x_i)$ is performed on a tree constructed without $x_i$; (ii) if $k \neq i$, the class prediction $T^{(-i)}(x_k)$ is performed on a tree constructed with $x_k$. The first case is thus closely related to the leave-one-out estimate of the prediction error, while the second case is closely related to the resubstitution estimate of the prediction error.

2.2.1 Influence on Predictions

The first natural idea is to focus on class predictions. In this case, the influence measure is directly related to $\mathbb{1}_{T(x_k) \neq T^{(-i)}(x_k)}$:

$$I_1(x_i) = \frac{1}{n-1} \sum_{k=1;\,k \neq i}^{n} \mathbb{1}_{T(x_k) \neq T^{(-i)}(x_k)}, \qquad (3)$$

$$I_2(x_i) = \mathbb{1}_{T(x_i) \neq T^{(-i)}(x_i)}. \qquad (4)$$
$I_1(x_i)$ is the proportion of observations for which the predicted label changes when using the jackknife tree $T^{(-i)}$ instead of the reference tree $T$. It is closely related to the resubstitution estimate of the prediction error. $I_2(x_i)$ indicates whether $x_i$ is classified in a different way by $T$ and $T^{(-i)}$ and is closely related to the leave-one-out estimate of the prediction error.
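A minimal R sketch of how $I_1$ and $I_2$ could be computed (the data frame d with a factor response y is a placeholder, and for simplicity the trees are grown with the rpart defaults rather than with the exact selection rule used for the reference tree):

    library(rpart)

    ref_tree <- rpart(y ~ ., data = d, method = "class")
    ref_pred <- as.character(predict(ref_tree, d, type = "class"))

    n  <- nrow(d)
    I1 <- I2 <- numeric(n)
    for (i in seq_len(n)) {
      jack_tree <- rpart(y ~ ., data = d[-i, ], method = "class")
      jack_pred <- as.character(predict(jack_tree, d, type = "class"))
      # I1: proportion of the other observations whose predicted label changes.
      I1[i] <- mean(ref_pred[-i] != jack_pred[-i])
      # I2: does the prediction for the removed observation itself change?
      I2[i] <- as.numeric(ref_pred[i] != jack_pred[i])
    }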
Remark 1. $I_1(x_i)$ and $I_2(x_i)$ are based on class predictions. Of course, it is possible to define distances between the distributions of the response variable in the nodes where $x_i$ falls. Any distance (or divergence) between probability distributions, such as the total variation distance, the Kullback-Leibler divergence (related to the Shannon entropy index) or the Hellinger distance (related to the Gini index), can be used. Since the distributional results would strongly depend on the chosen distance between probability distributions, we do not consider this kind of measure and we focus on influence measures based on class predictions.

2.2.2 Influence Based on Subtree Sequences

Another way to look at the data is to consider the complexity cost constant ($\alpha$ in equation (2)), which penalizes deeper trees in the pruning
step of the CART tree design, as a tuning parameter. It allows us to scan the data and sort the observations with respect to their influence on the CART tree. Let us consider on the one hand the sequence of subtrees based on the complete dataset, denoted by $T_{cp_j}$, and on the other hand the $n$ jackknife sequences of subtrees based on the jackknife subsamples $\mathcal{L}^{-i}$, denoted by $T^{(-i)}_{cp_j}$. Suppose that the sequence $T_{cp_j}$ contains $K_T$ elements, and that each sequence $T^{(-i)}_{cp_j}$ contains $K_{T^{(-i)}}$ elements ($i = 1, \dots, n$). This leads to a total of $N_{cp} \leq K_T + \sum_{1 \leq i \leq n} K_{T^{(-i)}}$ distinct values $\{cp_1; \dots; cp_{N_{cp}}\}$ of the cost complexity parameter, in increasing order from $cp_1$ to $cp_{N_{cp}} = \max_{1 \leq j \leq N_{cp}} cp_j$. Then, for each value $cp_j$ of the complexity and each observation $x_i$, we compute the binary variable $\mathbb{1}_{T_{cp_j}(x_i) \neq T^{(-i)}_{cp_j}(x_i)}$ that tells us whether the reference and jackknife subtrees corresponding to the same complexity $cp_j$ provide different predicted labels for the removed observation $x_i$. Thus we define the influence measures $I_3$ and $I_4$ as the proportions of complexities for which these predicted labels differ: for $i = 1, \dots, n$

$$I_3(x_i) = \frac{1}{(n-1)\,N_{cp}} \sum_{k=1;\,k \neq i}^{n} \sum_{j=1}^{N_{cp}} \mathbb{1}_{T_{cp_j}(x_k) \neq T^{(-i)}_{cp_j}(x_k)}, \qquad (5)$$

$$I_4(x_i) = \frac{1}{N_{cp}} \sum_{j=1}^{N_{cp}} \mathbb{1}_{T_{cp_j}(x_i) \neq T^{(-i)}_{cp_j}(x_i)}. \qquad (6)$$
Remark 2. Since the jackknife sequences of subtrees do not change for many observations, we usually obtain that $N_{cp} \ll K_T + \sum_{1 \leq i \leq n} K_{T^{(-i)}}$.

Remark 3. $I_3$ (resp. $I_4$) is a weighted version of $I_1$ (resp. $I_2$), since $I_3$ (resp. $I_4$) is a sum of the $I_1$ (resp. $I_2$) over the various values of $cp$.

Remark 4. In this section we focus on influence functions based on predictions. It is also possible to derive influence measures on partitions for CART based on jackknife trees. This point will be developed in the discussion.

These various indices are complementary. They give different kinds of information, and the detection of outliers should be based on the study of the various indices, even if, from a practical point of view, information on predictions is generally more useful than information on the partition of the CART classification tree.
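Building on the previous sketch, $I_3$ and $I_4$ could be computed as follows (ref_tree and the list jack_trees of jackknife trees are assumed to be grown without early stopping, so that pruning at every cp value makes sense; names are again placeholders):

    # Collect all distinct cp values from the reference and jackknife sequences.
    all_cp <- sort(unique(c(ref_tree$cptable[, "CP"],
                            unlist(lapply(jack_trees, function(t) t$cptable[, "CP"])))))
    n   <- nrow(d)
    Ncp <- length(all_cp)

    I3 <- I4 <- numeric(n)
    for (i in seq_len(n)) {
      diff_other <- diff_self <- 0
      for (cp in all_cp) {
        p_ref  <- as.character(predict(prune(ref_tree, cp = cp), d, type = "class"))
        p_jack <- as.character(predict(prune(jack_trees[[i]], cp = cp), d, type = "class"))
        diff_other <- diff_other + sum(p_ref[-i] != p_jack[-i])       # as in equation (5)
        diff_self  <- diff_self + as.numeric(p_ref[i] != p_jack[i])   # as in equation (6)
      }
      I3[i] <- diff_other / ((n - 1) * Ncp)
      I4[i] <- diff_self / Ncp
    }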
3. Distributional Results

Let us first recall the framework of influence functions and then derive distributional results for the indices defined in the previous section.

3.1 Definition and Connection Between Influence Functions and Jackknife
Let $Z = (Z_k)$, with $k = 1, \dots, d = |T|$, be a random vector such that $Z_i = 1$ when a randomly chosen subject is classified in leaf $i$ of the tree $T$ (and $Z_j = 0$ for $j \neq i$). $|T|$ denotes the number of leaves of the tree. Let $F$ be the distribution function of $Z$: $F$ is a multinomial distribution $\mathcal{M}(1, P)$, and $P = \int z\, F(dz)$. The empirical estimator of $F$, $F_n$, is defined as

$$F_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{Z_i},$$

where $Z_i$ is a random sample of $Z$ and $\delta_{Z_i}$ is the point mass 1 at $Z_i$. The empirical mean is the observed probability vector: $P_n = \int z\, F_n(dz)$. To evaluate the importance of an additional observation $z \in \mathbb{R}^d$, we can define, under a condition of existence, the quantity

$$IC_{I,F}(z) = \lim_{\epsilon \to 0} \frac{I\big((1-\epsilon)F + \epsilon \delta_z\big) - I(F)}{\epsilon}, \qquad (7)$$

which measures the influence of an infinitesimal perturbation along the direction $\delta_z$ (see Hampel 1988). $I(F)$ can be expressed, as often in statistics, as a functional of the generating distribution function. The influence function (or influence curve) $IC_{I,F}(\cdot)$ is defined pointwise by (7), if the limit exists for every $z$. There is a strong connection between the influence function and the jackknife (Miller 1974). Let $F^{(-i)}_{n-1} = \frac{1}{n-1} \sum_{j,\,j \neq i} \delta_{z_j}$; then $F_n = \frac{n-1}{n} F^{(-i)}_{n-1} + \frac{1}{n} \delta_{z_i}$. If $\epsilon = -\frac{1}{n-1}$,
we have:

$$IC_{I,F_n}(z_i) \approx \frac{I\big((1-\epsilon)F_n + \epsilon \delta_{z_i}\big) - I(F_n)}{\epsilon} = (n-1)\big(I(F_n) - I(F^{(-i)}_{n-1})\big) \qquad (8)$$

$$= I^{*}_{n,i} - I(F_n), \qquad (9)$$

where the $I^{*}_{n,i}$ are the pseudo-values of the jackknife (i.e., values computed on $n-1$ observations).
3.2 Tools for Distributional Results
If $I$ is “regular” enough and $G$ is “close” to $F$, then it is possible to establish an analogue of a Taylor expansion:

$$I(G) \simeq I(F) + \int IC_{I,F}(z)\,(G - F)(dz) \qquad (10)$$
$$= I(F) + \int IC_{I,F}(z)\, G(dz).$$

This aspect, developed by Hampel (1988), has been used especially in the case of univariate distributions to prove asymptotic results for statistical functionals (Gill 1989; Cuevas and Romo 1995) or in robust estimation (see e.g. Huber 1981). Intuitively, if we take $G = F_n$ as in Croux, Filzmoser, and Joossens (2008) and if the remainder term is asymptotically negligible, we obtain that the asymptotic distribution of $\sqrt{n}\,\big(I(F_n) - I(F)\big)$ is the same as that of

$$\sqrt{n} \int IC_{I,F}(z)\, F_n(dz) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IC_{I,F}(Z_i),$$

which is, by the Central Limit Theorem (CLT), asymptotically distributed as a normal law with zero mean and variance

$$\sigma^2 = \int IC^2_{I,F}(z)\, F(dz).$$

To get this, we need to impose on $I$ more than the mere existence of $IC_{I,F}(z)$. Several notions of differentiability have been used. In general, if $B_1$ and $B_2$ are normed vector spaces, we say that an operator $I : B_1 \to B_2$ is differentiable at $F \in B_1$, with respect to a collection $\mathcal{S}$ of subsets of $B_1$, if there exists a continuous linear function $DI(F, \cdot) : B_1 \to B_2$ (called the differential of $I$ at $F$) such that, for $\Delta$ in some neighborhood of zero,

$$I(F + \Delta) = I(F) + DI(F, \Delta) + \mathcal{R}(F + \Delta),$$
where the remainder $\mathcal{R}$ satisfies, for every $S \in \mathcal{S}$:

$$\frac{\mathcal{R}(F + \epsilon\Delta)}{\epsilon} \xrightarrow[\epsilon \to 0]{} 0 \quad \text{uniformly in } \Delta \in S.$$

The following choices of $\mathcal{S}$: $\mathcal{S} =$ all singletons of $B_1$, $\mathcal{S} =$ all compact subsets of $B_1$, and $\mathcal{S} =$ all bounded subsets of $B_1$, respectively lead to Gâteaux (or directional), Hadamard (or compact), and Fréchet (or bounded) differentiability. Note that in $\mathbb{R}^d$, $d \geq 1$, Hadamard and Fréchet differentiability are equivalent to ordinary differentiability, and that Hadamard differentiability is the weakest form of differentiation that satisfies the chain rule. So we will work with the Hadamard differential.

3.3 Influence Functions for CART
Let $I$ be any of the resubstitution indices ($I_1$, $I_3$). The functional $I$ that assigns an index to $F$ is a priori defined on $\mathcal{F}$:

$$I : \mathcal{F} \to \mathbb{R}^d \to [0, 1], \qquad F \mapsto \int z\, F(dz) \mapsto I(F),$$

but we will need $I$ to be defined on a normed vector space. Let $S$ be the support of the probability measure $F$:

$$S = \{e^{(1)}, \dots, e^{(d)}\},$$

where $e^{(i)}_j = \delta_{ij}$ ($j = 1, \dots, d$). Then $I$ can be extended to the space $B_1$ of finite signed measures whose support is $S$, endowed with the total variation norm. Since we assume that the observations are identically distributed (with finite variance), we can apply the Lindeberg theorem to obtain $I(z)$ as the expectation with respect to $F_n$ of a function of $T(z)$ and $T^{-i}(z_j)$ ($j = 1, \dots, n$ and $j \neq i$) when $n$ tends to infinity. Therefore, under mild conditions, $I(z)$ is Hadamard differentiable.

If $\sqrt{n}\,(F_n - F) \xrightarrow{\mathcal{L}} Z$ in $B_1$, then the delta-method (see Gill 1989) gives that

$$\sqrt{n}\,\big(I(F_n) - I(F)\big) \xrightarrow{\mathcal{L}} DT(F, Z).$$

We can remark that $F_n$ and $F$ are characterized by $P_n$ and $P$ respectively. The multidimensional CLT gives that

$$\sqrt{n}\,(P_n - P) \xrightarrow{\mathcal{L}} \mathcal{N}_d(0, \Sigma),$$
where $\Sigma$ is a known matrix. This leads to

$$\sqrt{n}\,(F_n - F) \xrightarrow{\mathcal{L}} Z,$$

where $Z$ is the (random) measure concentrated on $S$ with (random) probabilities $(\pi_1, \dots, \pi_d) \sim \mathcal{N}_d(0, \Sigma)$. In fact, $DT(F, Z)$ is a normally distributed random variable with zero mean and variance $\int IC^2_{I,F}(z)\, F(dz)$.

Proposition 1. If $\hat\sigma^2$ is an estimate of $\int IC^2_{I,F}(z)\, F(dz)$, we obtain the following bilateral confidence interval for $I > 0$:

$$\big[\,\hat I - u_\alpha\, \hat\sigma\;;\; \hat I + u_\alpha\, \hat\sigma\,\big], \qquad (11)$$

where $u_\alpha$ is the value of a standard Gaussian variable that is exceeded with probability $\alpha/2$. The proof is direct. A unilateral confidence interval can be derived in the same way. This proposition is the main result of this section since it allows us to assess the stability of CART based on the resubstitution indices. Observations are, as always, subject to technical and sampling errors, and thus the results of CART may be quite different from one sample to another. The term “stability” here means stability of the conclusions issued from CART.

Using the properties of jackknife estimation (see Miller 1974), an unbiased estimate $\hat\sigma^2$ is given by the variance of the pseudo-values $I^{*}_{n,i}$ defined by equation (9):

$$\sigma^2 \approx \int IC^2_{I,F}(z)\, F_n(dz) = \mathrm{Var}\big(I^{*}_{n,i} - I(F_n)\big) = \mathrm{Var}\big(I^{*}_{n,i}\big).$$
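In practice, one plausible way to implement this result, say for index $I_1$, is sketched below in R: the pseudo-values are formed from the index computed on the full sample and on the leave-one-out samples, their variance estimates $\hat\sigma^2$, and a Gaussian quantile gives the unilateral cut-off used later in Figures 2 and 6 (all object names are placeholders, and the exact definition of the full-sample and leave-one-out versions of the index depends on the index chosen):

    # I_full: index computed on the complete sample (a scalar);
    # I_jack: vector of the n leave-one-out values of the same index.
    pseudo    <- n * I_full - (n - 1) * I_jack    # pseudo-values, as in equation (9)
    sigma_hat <- sqrt(var(pseudo))                # estimate of sigma in Proposition 1

    threshold   <- qnorm(0.95) * sigma_hat        # unilateral 95% threshold
    influential <- which(I1 > threshold)          # observations flagged as influential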
4. Illustration on Spam Dataset

4.1 Spam Dataset
The spam data, publicly available at https://archive.ics.uci.edu/ml/datasets/Spambase, consist of information from 4601 email messages, collected in a study to screen email for “spam” (i.e. junk email). The data are presented in detail in Hastie, Tibshirani, and Friedman (2009, p. 301). The response
variable is binary, with values nonspam or spam, and there are 57 predictors: 54 give the percentage of words in the email that match a given word or a given character, and three relate to uninterrupted sequences of capital letters: the average length, the length of the longest one, and the sum of the lengths of such sequences. The objective was to design an automatic spam detector that could filter out spam before clogging the users’ mailboxes. This is a supervised learning problem.

4.2 Reference Tree and Jackknife Trees
There are various programs, such as IBM SPSS, the MATLAB classregtree function, or STATSOFT STATISTICA, to obtain classification trees. In this article, the reference tree is obtained using the R package rpart (see R Development Core Team 2009; Venables and Ripley 2002), accepting the default values for all the parameters (mainly the Gini heterogeneity function to grow the maximal tree, and pruning thanks to 10-fold cross-validation and the 1-SE rule). Programs can be obtained on request from the authors.

The reference tree based on all observations, namely $T$, is given in Figure 1. This tree has 7 leaves, obtained from splits involving the variables charDollar, remove, hp, charExclamation, capitalTotal, and free (in order of appearance in the tree). Each leaf is labeled by the prediction of $Y$ (spam or nonspam) and the distribution of $Y$ inside the node (for example, the third leaf is almost pure: it contains 1 nonspam and 20 spams). This makes the tree easy to interpret and to describe: indeed, from the direct interpretation of the path from the root to the third leaf, an email containing many occurrences of “!” and “free” is almost always a spam.

To compute the influence of one observation we compute $T^{(-i)}$, the tree based on all observations except observation $i$ ($i = 1, \dots, 4601$). We then obtain a collection of 4601 trees, which can be described with respect to the reference tree $T$ in many different ways. This description is carried out according to various filters, namely: the retained variables, the number of leaves of the final tree, and the observations involved in the differences. The variables charDollar, charExclamation, hp and remove are always present. The variable capitalLong is present in 88 trees, the variable capitalTotal is present in 4513 trees, and the variable free is present in 4441 trees. This indicates unstable clades within $T$. In addition, the variables free and capitalLong, as well as capitalLong and capitalTotal, never appear simultaneously, while the 4441 trees containing free also contain capitalTotal. This highlights the variability among the 4601 trees. All the differences are explained by the presence (or not) of the variable free. Indeed, removing one observation from the node generated by the variable free is enough to remove the split and to merge the two nodes, leading to misclassifications.
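For the record, a sketch of the commands used to obtain such a reference tree with the rpart defaults and the 1-SE rule (the data frame spam and its response column type are placeholders; the selected tree may of course differ slightly from run to run because of the cross-validation):

    library(rpart)

    # spam: data frame with the binary response `type` (spam / nonspam)
    # and the 57 predictors, e.g. built from the UCI Spambase file.
    fit <- rpart(type ~ ., data = spam, method = "class")  # Gini + 10-fold CV by default

    # 1-SE rule: smallest subtree whose cross-validated error is within
    # one standard error of the minimum.
    cpt      <- fit$cptable
    best     <- which.min(cpt[, "xerror"])
    cp_1se   <- cpt[which(cpt[, "xerror"] <= cpt[best, "xerror"] + cpt[best, "xstd"])[1], "CP"]
    ref_tree <- prune(fit, cp = cp_1se)

    # The 4601 jackknife trees are obtained by the same construction on
    # each leave-one-out sample (see the sketches of Section 2.2).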
Figure 1. CART reference tree for the spam dataset.
There are 77 observations $x_i$ classified differently by $T$ and the corresponding jackknife tree $T^{(-i)}$, and 160 jackknife trees with one leaf fewer than $T$. All the other jackknife trees have the same number of leaves as $T$. A careful examination of the highlighted observations leads to a spam and a nonspam mail that define the second split of the reference tree: the threshold on the variable remove is the midpoint of their corresponding values.

4.3 Influence Measures
The influence measures $I_1$, $I_2$, $I_3$, and $I_4$, respectively defined by (3), (4), (5), and (6), are computed on the spam dataset by using the jackknife trees computed in Section 4.2.

4.3.1 Exploratory Analysis

Indices $I_1$ and $I_2$ are given in Figure 2. There are 77 observations for which $I_2 = 1$, that is, for which $x_i$ is classified differently by $T$ and the corresponding jackknife tree $T^{(-i)}$, while 161 jackknife trees lead to a nonzero index $I_1$, that is, to a nonzero average number of observations for which the predicted labels change.
Figure 2. Influence indices I1 and I2 for the spam dataset. Line: threshold of the 95% level unilateral confidence interval for index I1. Point symbol is a cross when I2 = 1 and a square when I2 = 0.
The pruned subtree sequences contain around six elements each, and they represent $N_{cp} = 27$ distinct values $\{cp_1; \dots; cp_{27}\}$ of the cost complexity parameter (from 0.01 to 0.48). There are 2139 observations for which index $I_3$ is nonzero. The distribution of influence index $I_4$ is given in Table 1. The number of actual values of $I_4$ is small. These values organize the data as nested level sets in decreasing order of influence. For 60% of the observations, the predictions are the same all along the pruned subtree sequences, making these observations not influential for $I_4$. There are 123 observations leading to different predictions for at least half of the pruned subtrees, and 2 observations change prediction labels for 74% of the complexities. Let us remark that these two most influential mails for $I_4$ are the spam and the nonspam mails that define the second split, on the variable remove, of the reference tree.
Table 1. Frequency distribution of the influence index I4, expressed in percentage, over the 4601 emails.

I4 (%)      0     4    7   11   15   22   44   52   59   67   74
Nb. Obs. 2768  1467  116   70   55    1    1   96    2   23    2
4.3.2 Distributional Results

Unilateral confidence intervals on $I_1$ and $I_3$ with confidence levels between 95% and 97.5% highlight the same 139 influential observations. Unilateral confidence intervals with confidence levels between 90% and 97.5% highlight 125 influential observations for $I_4$. The threshold value of $I_4$ used to decide whether an observation is influential or not is 22% for all confidence levels between 90% and 97.5%.

Among the 77 observations for which $I_2 = 1$, two have a null index $I_1$, one has a null index $I_3$, and 54 are influential for $I_1$ and $I_3$. The two observations classified differently by $T$ and the corresponding jackknife tree $T^{(-i)}$, but for which the corresponding jackknife trees do not change the classification of the other observations, are the same mails for which $I_4$ is maximal. The mail for which $I_2 = 1$, but whose jackknife subtree sequence classifies the other observations in the same way, is again the nonspam mail defining the split on the variable remove in the reference tree.

There are 6 observations detected as influential for $I_4$ but not influential for $I_3$. These are the following, in descending order of the values of $I_4$:
• $I_4$ = 74%: the nonspam mail and the spam mail defining the split on the variable remove in the reference tree.
• $I_4$ = 59%: the two nonspam mails on the border of the split on the variable capitalTotal in the reference tree.
• $I_4$ = 44%: the spam mail defining the split on the variable capitalTotal in the reference tree.
• $I_4$ = 22%: the nonspam mail classified as a spam by the reference tree in the leaf issued from the split on the variable free.
Index $I_4$ can also be examined in a structural way by locating the observations on the topology of the reference tree. Of course, graphical software tools based on such an idea can be useful to screen a given data set. To conclude, the cross analysis of the influence measures $I_1$, $I_2$, $I_3$ and $I_4$ on the spam data set allows us to highlight the two most influential observations for the CART model. These observations are influential because of
their prevailing participation in the construction of the CART predictor. $I_3$ and $I_4$ also make it possible to catch more observations involved in the construction of the predictor. Moreover, the confidence intervals obtained for $I_4$ seem to be well defined, since they allow us to catch the isolated observation misclassified by the reference tree.

Remark 5. Let us recall that the influence is measured with respect to a reference model, and of course a more unstable reference tree automatically increases the number of individuals that can be categorized as influential. Choosing a deeper tree in the pruned subtree sequence automatically promotes new observations as influential.

5. Exploring the Paris Tax Revenues Dataset
5.1 Dataset and Reference Tree
5.1.1 Dataset

We apply the tools presented in the previous section to the tax revenues of households in 2007 from the 143 cities surrounding Paris. Cities are grouped into four counties (corresponding to the French administrative “département”). The PATARE data (PAris TAx REvenues) are freely available at http://www.data-publica.com/. For confidentiality reasons we do not have access to the tax revenues of the individual households, but we have characteristics of the distribution of the tax revenues per city. Paris has 20 “arrondissements” (corresponding to districts), Seine-Saint-Denis is located to the north of Paris and has 40 cities, Hauts-de-Seine is located to the west of Paris and has 36 cities, and Val-de-Marne is located to the south of Paris and has 48 cities. For each city, we have the first and ninth deciles (named D1 and D9 respectively), the quartiles (named Q1, Q2 and Q3 respectively), the mean, and the percentage of the tax revenues coming from salaries and wages (named PtSal). Figure 3 gives the summary statistics for each variable per county.

Basically, we try to predict the county of a city from the characteristics of its tax revenue distribution. This is a supervised classification problem where the explained variable is quaternary. We emphasize that this information cannot be easily retrieved from the explanatory variables considered without the county information. Indeed, the map (see Figure 4) of the cities drawn according to a k-means (k=4) clustering (each symbol is associated with a cluster), superimposed with the borders of the counties, exhibits a poor recovery of the counties through the clusters.
Figure 3. PATARE dataset: boxplots of the variables (Q1, Q2, Q3, D1, D9, Mean, PtSal) per county (“département” in France); county zip codes: 75 Paris, 92 Hauts-de-Seine, 93 Seine-Saint-Denis, 94 Val-de-Marne.
Figure 4. Spatial representation of k-means (k=4) clustering of the PATARE dataset cities (each symbol is associated with a cluster).
5.1.2 Reference Tree

Figure 5 shows the reference tree. Each leaf is labelled with two pieces of information: the predicted county and the distribution of the cities over the 4 counties. For example, the second leaf gives 0/17/1/3, meaning that it contains 0 districts from Paris, 17 cities from Hauts-de-Seine, 1 from Seine-Saint-Denis and 3 from Val-de-Marne. All 5 terminal nodes located in the left subtree below the root are homogeneous, since their distributions are almost pure, while half the nodes of the right subtree are highly heterogeneous.

The first split involves the last decile of the income distribution and thus discriminates the counties with rich cities from the counties mainly constituted of poorer cities. More precisely, the labels mainly distinguish Paris and Hauts-de-Seine on the left from Seine-Saint-Denis on the right, while Val-de-Marne appears on both sides. The extreme quantiles are sufficient to separate the richest from the poorest counties, while more global predictors are useful to further discriminate between intermediate cities. Indeed, the splits in the left part are mainly based on the deciles D1 and D9, and PtSal is only used to separate Hauts-de-Seine from Val-de-Marne. The splits in the right part are based on all the explanatory variables, but involve the PtSal and mean variables to separate Seine-Saint-Denis from Val-de-Marne.
Figure 5. CART reference tree for the PATARE dataset.
Let us notice that the predictors Q1 and Q2 do not appear. Surprisingly, the predictions given by the reference tree are generally correct (the resubstitution misclassification rate, calculated from the confusion matrix given in Table 2, is equal to 24.3%). Since the cities within each county are very heterogeneous, we look for the cities which perturb the reference tree. After this quick inspection of the reference tree, which avoids a careful inspection of the cities inside the leaves, let us focus on the influential cities highlighted by the previously defined indices. In the sequel the cities (which are the individuals) are written in italics, to be clearly distinguished from the counties, which are written between quotation marks.

5.2 Influential Observations

5.2.1 Presentation

There are 45 observations classified differently by $T$ and $T^{(-i)}$, while 75 jackknife trees lead to observations for which predicted labels change.
Table 2. Confusion matrix: actual versus predicted county, using the CART reference tree.

Actual \ Predicted     Paris   Hauts-de-Seine   Seine-Saint-Denis   Val-de-Marne
Paris                     20                0                   0              0
Hauts-de-Seine             0               30                   1              5
Seine-Saint-Denis          1                4                  28              7
Val-de-Marne               3                9                   5             30
Table 3. Frequency distribution of influence index I4, expressed in percentage, over the 143 cities.

I4 (%)     3   7  10  13  20  23  33  37  40  43  47  53  57  70  87  90  93
Nb. Obs.  30  31   6  14   5  14   1   1   7   5  11   7   6   2   1   1   1
Influence indices $I_1$ and $I_2$ of the observations of the PATARE dataset are given in Figure 6. For each city, the graph gives the value of index $I_1$, and the corresponding point is represented by a cross if $I_2$ is equal to 1 and by a square if it is 0. The horizontal line represents the threshold given by the critical value of the unilateral test at level 95%. According to this rule, ten cities are selected. The separation is clear and the same selection occurs at level 97.5%. The influential cities are, in descending order of influence: Villemomble, Neuilly-Plaisance, Chevilly-Larue, Vitry-sur-Seine, Villejuif, Creteil, Choisy-le-Roi, Champigny-sur-Marne, Arcueil and Noisy-le-Grand.

There are 29 different values of complexities in the reference and jackknife tree sequences. The frequency distribution of influence index $I_4$, expressed in percentage of the number of different complexities, over the 143 cities of the PATARE dataset is given in Table 3.

Index $I_3$ selects nine cities using the threshold given by the critical value of the unilateral test at level 97.5%: roughly the same as for $I_1$, except that Neuilly-Plaisance and Noisy-le-Grand are not selected while Bonneuil-sur-Marne is selected. It should be noted that this index is more sensitive, leading to the selection of 18 cities (instead of 13) for the critical value of the unilateral test at level 95%.

The influential cities according to index $I_4$, at levels 95% and 97.5%, are, in descending order of influence, the 5 following cities: Paris 13eme, Villemomble, Asnieres-sur-Seine, Rueil-Malmaison and Bry-sur-Marne. Since these cities are different from those highlighted by the two previous indices, which follow the same line, it is interesting to see the list of the additional cities selected by decreasing the level to 90%: Vanves, Livry-Gargan, Gagny, Fontenay-aux-Roses, Clamart, Chatenay-Malabry, Vincennes,
Figure 6. Influence indices I1 and I2 for the PATARE dataset cities. Point symbol is a cross when I2 = 1 and a square when I2 = 0.
Villeneuve-le-Roi, Maisons-Alfort, Le Pre-Saint-Gervais, Le Perreux-sur-Marne, Gentilly and Chevilly-Larue. So, influence indices $I_1$ and $I_3$ deliver similar conclusions, while $I_4$ highlights a different subset of observations.

5.2.2 Interpretation

One can find in Figure 7 the influential cities, with respect to $I_4$, located in the reference tree. Let us emphasize that all of the 18 influential cities quoted in Figure 7 are well classified when using the reference tree. Index $I_4$ highlights cities for which two parts of the city can be distinguished: a popular one with a low social level and a rich one with a high social level. They are located in the right part of the reference tree (for the higher values of $I_4$: Asnieres-sur-Seine, Villemomble, Paris 13eme, Bry-sur-Marne and Rueil-Malmaison) as well as in the left part (for moderate values of $I_4$: Chatenay-Malabry, Clamart, Fontenay-aux-Roses, Gagny, Livry-Gargan, Vanves, Chevilly-Larue, Gentilly, Le Perreux-sur-Marne, Le Pre-Saint-Gervais, Maisons-Alfort, Villeneuve-le-Roi, Vincennes).
Figure 7. Influential cities located on the CART reference tree.
To explore the converse, we now inspect the list of the 51 cities associated with the lowest values of $I_4$, which can be considered as the least influential, i.e., the most stable. It can easily be seen that this list corresponds to the 16 rich districts of Paris downtown (Paris 1er to 12eme and Paris 14eme to Paris 16eme) and mainly to cities near Paris or directly connected to it by the urban express train line (named RER in French). It should be noted that the influence indices cannot be easily explained either by central descriptors, like the mean or the median Q2, or by dispersion descriptors, such as Q3−Q1 and D9−D1. Bimodality seems to be the key property explaining high values of the influence indices.

In addition, coming back to the unsupervised analysis, one may notice that the observations influential for PCA (Principal Component Analysis) are not related to the influential cities detected using the $I_4$ index. Indeed, Figure 8 shows the first two principal components, capturing more than 95% of the total variance. PCA has been performed on the correlation matrix of all the predictors. Each city is represented in this plane by a symbol of size proportional to the $I_4$ index. Hence one can see that the points influential for PCA (those far from the origin) are generally of small influence for influence index $I_4$.
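For completeness, the kind of display used in Figure 8 can be produced along the following lines (X, the matrix of the seven predictors, and the vector I4 are placeholders):

    pca <- prcomp(X, scale. = TRUE)     # PCA on the correlation matrix
    plot(pca$x[, 1], pca$x[, 2],
         cex  = 0.5 + 3 * I4,           # symbol size proportional to I4
         xlab = "1st Principal Component",
         ylab = "2nd Principal Component")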
Figure 8. Plane of the two first principal components: Cities are represented by symbols proportional to influence index I4 .
5.2.3 Spatial Interpretation

To end this study, a map is useful to capture the spatial interpretation and to complement the previous comments, which require some prior knowledge about the sociology of the Paris area. In Figure 9, the 143 cities are represented by a circle proportional to the influence index $I_4$, and a spatial interpolation is performed using 4 grey levels. This map shows that Paris is stable, and that each surrounding county contains a stable area: the richest or the poorest cities. What is remarkable is that the white as well as the grey areas are clustered.
Figure 9. The 143 cities are represented by a circle proportional to the influence index I4 and a spatial interpolation is performed using 4 grey levels.
6. Perspectives
First, the tools developed in this paper for the classification case can be generalized to the regression case. The instability is smoother in the regression case, since the data are fitted by a surface rather than separated by a frontier. The number of misclassifications is then typically replaced by the residual sum of squares, which is more sensitive to perturbations, but the differences between the full tree and the jackknife trees are more pronounced in the classification case. Some classical problems, like outlier detection, have been intensively studied in the regression case, and many solutions have been developed around the ideas of robust regression (see Rousseeuw 1984). A complete scheme for comparison can be retrieved from Chèze and Poggi (2006), where a tree-based algorithm has been developed and compared intensively to well-known competing methods, including robust regression.

Another direction is to focus on model stability and robustness rather than centering the analysis around individuals. A first idea could be,
following Bar-Hen, Mariadassou, Poursat, and Vandenkoornhuyse (2008), to build the most robust tree by iteratively removing the most influential observation until stabilization between the reference and jackknife trees. A second one is to consider, starting from the $I_4$ index but summing over the observations instead of the complexities, the percentage of observations classified differently by the reference and jackknife subtrees at fixed complexity. This is out of the scope of this article.

Finally, one may notice that, for a given tree $T$, two main aspects are of interest: the predictions delivered by the tree and the partition associated with the tree; the second highlights the tree structure while the first focuses on its predictive performance. This distinction is classical and was already examined, for example, by Miglio and Soffritti (2004), who recall some proximity measures between classification trees and promote the use of a new one mixing the two aspects. Even if we focused on influence measures based on prediction errors, we can propose two influence measures on partitions for CART based on jackknife trees: $I_5$, measuring the variation of the number of leaves in each partition, and $I_6$, based on the quantification of the difference between the two partitions. Denoting by $\mathcal{T}$ the partition induced by the tree $T$, these indices are computed in the following way: for $i = 1, \dots, n$

$$I_5(x_i) = |\mathcal{T}^{(-i)}| - |\mathcal{T}|, \qquad (12)$$

$$I_6(x_i) = 1 - J\big(\mathcal{T}, \mathcal{T}^{(-i)}\big), \qquad (13)$$

where $J\big(\mathcal{T}, \mathcal{T}^{(-i)}\big)$ is any similarity index between the partitions of $\mathcal{L}$ respectively defined by $T^{(-i)}$ and $T$ (for a detailed analysis and comparisons, see Youness and Saporta 2009). Note that we proposed influence measures for CART classification trees, but this work could easily be extended to resampled versions of CART, for example bagged trees.
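As an indication, $I_5$ and $I_6$ could be computed from rpart objects along the following lines; the leaf reached by each observation of $\mathcal{L}$ is recovered through the partykit package, and the Rand index is used here as one possible choice for $J$ (object names are placeholders and follow the earlier sketches):

    library(partykit)  # as.party() converts an rpart tree; type = "node" gives leaf ids

    n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
    leaf_of  <- function(tree, newdata) predict(as.party(tree), newdata = newdata, type = "node")

    # Rand index between two partitions given as vectors of leaf labels.
    rand_index <- function(a, b) {
      tab <- table(a, b)
      n   <- length(a)
      same_both <- sum(choose(tab, 2))
      diff_both <- choose(n, 2) + same_both -
                   sum(choose(rowSums(tab), 2)) - sum(choose(colSums(tab), 2))
      (same_both + diff_both) / choose(n, 2)
    }

    part_ref <- leaf_of(ref_tree, d)                          # partition of L induced by T
    I5 <- sapply(jack_trees, n_leaves) - n_leaves(ref_tree)   # equation (12)
    I6 <- sapply(jack_trees,                                  # equation (13)
                 function(t) 1 - rand_index(part_ref, leaf_of(t, d)))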
References

BAR-HEN, A., MARIADASSOU, M., POURSAT, M.-A., and VANDENKOORNHUYSE, P.H. (2008), “Influence Function for Robust Phylogenetic Reconstructions”, Molecular Biology and Evolution, 25(5), 869–873.

BEL, L., ALLARD, D., LAURENT, J.M., CHEDDADI, R., and BAR-HEN, A. (2009), “CART Algorithm for Spatial Data: Application to Environmental and Ecological Data”, Computational Statistics and Data Analysis, 53(8), 3082–3093.

BOUSQUET, O., and ELISSEEFF, A. (2002), “Stability and Generalization”, Journal of Machine Learning Research, 2, 499–526.

BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A., and STONE, C.J. (1993), Classification and Regression Trees, Boca Raton, FL: Chapman and Hall.

BRIAND, B., DUCHARME, G.R., PARACHE, V., and MERCAT-ROMMENS, C. (2009), “A Similarity Measure to Assess the Stability of Classification Trees”, Computational Statistics and Data Analysis, 53(4), 1208–1217.

CAMPBELL, N.A. (1978), “The Influence Function as an Aid in Outlier Detection in Discriminant Analysis”, Applied Statistics, 27, 251–258.

CHÈZE, N., and POGGI, J.M. (2006), “Outlier Detection by Boosting Regression Trees”, Journal of Statistical Research of Iran (JSRI), 3, 1–21.

CRITCHLEY, F., and VITIELLO, C. (1991), “The Influence of Observations on Misclassification Probability Estimates in Linear Discriminant Analysis”, Biometrika, 78, 677–690.

CROUX, C., and JOOSSENS, K. (2005), “Influence of Observations on the Misclassification Probability in Quadratic Discriminant Analysis”, Journal of Multivariate Analysis, 96(2), 384–403.

CROUX, C., FILZMOSER, P., and JOOSSENS, K. (2008), “Classification Efficiencies for Robust Linear Discriminant Analysis”, Statistica Sinica, 18(2), 581–599.

CROUX, C., HAESBROECK, G., and JOOSSENS, K. (2008), “Logistic Discrimination Using Robust Estimators: An Influence Function Approach”, The Canadian Journal of Statistics, 36(1), 157–174.

CUEVAS, A., and ROMO, J. (1995), “On the Estimation of the Influence Curve”, The Canadian Journal of Statistics, 23, 1–9.

GEY, S., and POGGI, J.M. (2006), “Boosting and Instability for Regression Trees”, Computational Statistics and Data Analysis, 50(2), 533–550.

GILL, R.D. (1989), “Non- and Semi-Parametric Maximum Likelihood Estimators and the Von Mises Method (Part 1)”, Scandinavian Journal of Statistics, 16, 97–128.

HAMPEL, F.R. (1988), “The Influence Curve and Its Role in Robust Estimation”, Journal of the American Statistical Association, 69, 383–393.

HASTIE, T.J., TIBSHIRANI, R.J., and FRIEDMAN, J.H. (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction (3rd ed.), New York: Springer.

HUBER, P.J. (1981), Robust Statistics, New York: Wiley and Sons.

MIGLIO, R., and SOFFRITTI, G. (2004), “The Comparison Between Classification Trees Through Proximity Measures”, Computational Statistics and Data Analysis, 45(3), 577–593.

MILLER, R.G. (1974), “The Jackknife - A Review”, Biometrika, 61, 1–15.

R DEVELOPMENT CORE TEAM (2009), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, ISBN 3-900051-07-0, http://www.R-project.org/.

ROUSSEEUW, P. (1984), “Least Median of Squares Regression”, Journal of the American Statistical Association, 79, 871–880.

VENABLES, W.N., and RIPLEY, B.D. (2002), Modern Applied Statistics with S (4th ed.), New York: Springer.

YOUNESS, G., and SAPORTA, G. (2009), “Comparing Partitions of Two Sets of Units Based on the Same Variables”, Advances in Data Analysis and Classification, 4(1), 53–64.

ZHANG, H., and SINGER, B.H. (2010), Recursive Partitioning and Applications (2nd ed.), New York: Springer.