Qual Quant (2017) 51:641–655 DOI 10.1007/s11135-016-0430-2
Classification trees for multivariate ordinal response: an application to Student Evaluation Teaching

Mariangela Sciandra · Antonella Plaia · Vincenza Capursi

Published online: 6 October 2016
© Springer Science+Business Media Dordrecht 2016
Abstract Data from multiple items on an ordinal scale are commonly collected when qualitative variables, such as feelings, attitudes and many other behavioral and health-related variables, are observed. In this paper we introduce a method to derive a distance-based tree for multivariate ordinal responses that, when subject-specific characteristics are available, allows common profiles to be derived for respondents giving the same or similar multivariate ratings. Special attention is paid to the performance comparison, in terms of AUC, of three different distances used as splitting criteria. Simulated data and a dataset from a Student Evaluation of Teaching survey are used as illustrative examples. The latter shows the performance of the procedure in profiling students by identifying which features of their experience are most closely related to their expressed satisfaction.

Keywords Decision tree · Ordinal response · Student Evaluation of Teaching · Distances
1 Introduction

In many fields, such as Education, Health and Quality of Life, when opinion surveys are carried out, researchers collect data from multiple items on an ordinal scale. When subject-specific characteristics are available, an important issue is the identification of the profiles of respondents giving the same or similar multivariate responses. Random or mixed
Corresponding author: Antonella Plaia ([email protected])
Mariangela Sciandra ([email protected])
Vincenza Capursi ([email protected])
Department of Scienze Economiche, Aziendali e Statistiche, University of Palermo, Palermo, Italy
effect models are the basis for a parametric modelling of ordinal responses; nevertheless, for the analysis of complex data, which may include lack of balance, missing values, non-linear relationships and/or the presence of significant high-order interactions, non-parametric techniques could be preferable, especially for dealing with correlated ordinal responses (Zhang and Ye 2008). Within this framework, a tree-based approach to the analysis of multivariate ordinal data is a promising and little explored field. For a univariate ordinal response variable, Piccarreta (2008) extends Breiman's twoing (Breiman et al. 1984) and the Gini-Simpson criteria to the ordinal case. Several attempts have been made to extend tree-based approaches to multi-response data: Zhang (1998) extends the CART (Breiman et al. 1984) procedure to multiple binary response variables; Siciliano and Mola (2000) propose a multivariate approach to binary segmentation through the definition of a multivariate classification/prediction rule that allows studying how more than one response variable depends on a given set of predictors; De'ath (2002) proposes an algorithm for multivariate regression trees (MRT) in the case of numeric responses; finally, Zhang and Ye (2008) extend Zhang's idea to ordinal responses in a semiparametric way. In this paper we introduce a method to derive a distance-based tree for multivariate ordinal responses, focusing on the performance comparison of three different distances used as splitting criteria: the Spearman footrule, Kendall tau and Kemeny distances. The comparison is carried out in terms of the Area Under the receiver operating characteristic (ROC) curve (Fawcett 2006), as it provides a common scale for the different distances. Moreover, in the tree pruning phase, the AUC can also be used to establish a stopping criterion. The paper is organized as follows: Sect. 2 reviews classical decision tree methodology; Sect. 3 introduces some distance measures proposed in the literature as distances between rankings and used here to compare multivariate ordinal vectors; Sect. 4 is devoted to the description of the proposed extension of the CART methodology to multivariate ordinal responses and to the implementation of the proposed procedure in the R statistical software environment. Finally, in Sect. 5, the proposed procedure is applied to a simulated and a real data set concerning Student Evaluation Teaching, while conclusions and further developments are set out in Sect. 6.
2 Decision trees: an overview

Classification and Regression Tree (CART) analysis, or more generally decision trees, is a simple non-parametric regression approach; its main characteristic is that the space spanned by the predictors is recursively partitioned into a set of rectangular areas such that observations with similar response values are grouped together, each area being labeled at the end of the procedure with a constant predicted value for the response variable. The CART algorithm was described for the first time by Breiman et al. (1984); it is the most popular decision tree methodology because it is particularly valuable in multidisciplinary fields. It is mainly characterized by two stages: a growing phase and a pruning one. The CART procedure, starting from the root node that contains the whole learning sample, uses a recursive partitioning algorithm to create a tree where each node t represents a subset of the partition. In a binary tree structure, all internal nodes have two child nodes, whereas leaf nodes are the ones with no descendants. At each step of the partitioning process, a splitting rule is used to split the N(t) objects in node t into a left and a right node containing N_L(t) and N_R(t) objects respectively. Decision tree node splitting is an important step, the core
issue being how to choose the splitting rule that adequately partitions the learning sample into smaller parts. Generally, the splitting rule is chosen on the basis of a quality-of-split criterion, which is equivalent to choosing, among all possible splits at each node, the split for which the resulting child nodes are the "purest", where pureness is meant in terms of homogeneity between observations in the same node. The degree of similarity of units in the same child node is defined in terms of an impurity function, i(t), a generic function satisfying the following three properties: (a) it is minimum when the node is pure; (b) it is maximum when the node is the most impure; (c) its value does not change if items are renamed. Specifically, a splitting criterion is based on the reduction in impurity resulting from the split s of node t (s belonging to Ω_t, the space of possible splits in node t), with the best split chosen as the one maximizing the reduction in impurity (Shih 2001). Maximum trees resulting from the growing phase may turn out to be very complex and consist of too many levels. Therefore, they have to be optimized before being used in practice. Tree optimization implies cutting off insignificant nodes through a pruning process that will lead to the final size of the tree. Once the tree has been built, the terminal nodes must be associated with a response label: for classification trees, the assignment is based on a simple majority rule, while for regression trees a simple mean of the response variable of the objects in the node can be considered. Further details on decision tree theory can be found in Rokach and Maimon (2014).
3 Distance measures between multivariate ordinal data

In order to show the analogy between multivariate ordinal data and ranking data (necessary to introduce the appropriate distances), let us introduce the latter. Ranking data (Espejo 2015) arise when a group of individuals is asked to rank a fixed set of objects according to their preferences. When K items, labeled 1, …, K, are ranked, the resulting ranking π represents a mapping function from {1, …, K} to {1, …, K}, with π(i) representing the rank given by the judge to item i. Once all the K items are ranked in K distinct ranks, a complete ranking or linear ordering follows. Yet, it is also possible that a subject fails to distinguish between two or more objects and rates them equally; this results in a tied ranking or weak ordering. Besides complete and tied rankings, partial and incomplete rankings exist: the first occurs when only a specific subset of q < K objects is ranked by the judges, while an incomplete ranking occurs when judges are free to rank different subsets of the K objects. Obviously, different types of ordering generate different sample spaces for ranking data. With K objects, there are K! possible complete rankings; this number gets even larger when ties are allowed. In order to classify subjects into C homogeneous clusters according to their expressed preferences, a first possibility is to consider a dissimilarity or a distance measure properly defined on all rankings (Kumar and Vassilvitskii 2010); such a measure d should meet the usual properties of a distance function:

– Reflexivity: d(π, π) = 0,
– Positivity: d(π, σ) > 0 if π ≠ σ,
– Symmetry: d(π, σ) = d(σ, π).

Moreover, a desirable property of any distance is its invariance under a renumbering of the elements (the so-called label invariance or right invariance (Cheng et al. 2009)).
Thus, a distance between two rankings will be a non-negative value, ranging in [0, D_max], where 0 is the distance between a ranking and itself and D_max is the maximum
distance according to the considered metric. Another possible way of measuring agreement between rankings is in terms of correlation coefficients: rankings in full agreement are assigned a correlation of +1, those in full disagreement a correlation of −1, and all others lie somewhere in between. However, it is possible to show that a generic distance measure d can be transformed into a correlation coefficient c using the linear transformation c = 1 − 2d/D_max (Emond and Mason 2002). For the distances considered in this paper, (1), (2) and (3), and their corresponding "correlation coefficients", refer to (Alvo and Yu 2014, Sect. 3.1) and (Tarsitano 2009, Table 2). Data consisting of the opinions, expressed on an ordinal scale, of a group of respondents for a set of items present a structure that is very close to a weak ordering. For example, if we have a set of K items evaluated on a 1 to h scale, the response given by the i-th respondent will be a sequence of K numbers, each between 1 and h, that can be considered as a weak ordering. Such a correspondence between ranked preferences and data on an ordinal scale allows extending the distance measures proposed for ranking data to the case of multivariate ordinal data. One of the best known metrics able to evaluate the distance between two subjects, i and j, according to the value of a K-dimensional ordinal variable y (treated as numerical values) is Spearman's Footrule distance (Genest et al. 2010), defined as

    F_ij = Σ_{k=1}^{K} |y_ik − y_jk|    (1)
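As a quick sanity check of the transformation quoted above, consider the following tiny sketch (our own illustration, not from the paper; the bound D_max = 12 is an arbitrary assumed value):

```python
# Transformation c = 1 - 2d/D_max (Emond and Mason 2002): a distance of 0
# maps to correlation +1, the maximum distance to -1, half of it to 0.
def to_correlation(d, d_max):
    return 1 - 2 * d / d_max

print(to_correlation(0, 12), to_correlation(12, 12), to_correlation(6, 12))
# prints: 1.0 -1.0 0.0
```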
Alternatively, Kendall's tau distance, defined as the number of discordant pairs between rankings, is easily applicable to a set of ordinal variables:

    T_ij = Σ_{k<l} I[(y_ik − y_il)(y_jk − y_jl) < 0],  k, l = 1, …, K    (2)

where I is the indicator function. Less known, but surely suitable, is the Kemeny distance (Kemeny and Snell 1962), defined as:

    K_ij = Σ_{k<l} |sign(y_ik − y_il) − sign(y_jk − y_jl)|,  k, l = 1, …, K    (3)
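To make the three metrics concrete, here is a minimal Python sketch (our own illustration, not the authors' R code; function names are ours) computing Eqs. (1)-(3) for two multivariate ordinal responses:

```python
# Three distances between two ordinal response vectors y_i and y_j.
def footrule(yi, yj):
    # Spearman's footrule, Eq. (1): sum of absolute rating differences
    return sum(abs(a - b) for a, b in zip(yi, yj))

def kendall(yi, yj):
    # Kendall's tau distance, Eq. (2): number of discordant item pairs
    K = len(yi)
    return sum(1 for k in range(K) for l in range(k + 1, K)
               if (yi[k] - yi[l]) * (yj[k] - yj[l]) < 0)

def sign(x):
    return (x > 0) - (x < 0)

def kemeny(yi, yj):
    # Kemeny distance, Eq. (3): also penalises ties broken by one rater only
    K = len(yi)
    return sum(abs(sign(yi[k] - yi[l]) - sign(yj[k] - yj[l]))
               for k in range(K) for l in range(k + 1, K))

yi, yj = [1, 2, 1, 2], [2, 1, 1, 2]
print(footrule(yi, yj), kendall(yi, yj), kemeny(yi, yj))
# prints: 2 1 6
```

Note how the Kemeny distance is more sensitive than Kendall's: the pairs where one rater ties the items and the other does not are discordant for Kemeny but not for Kendall.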
These three distances constitute the core of the procedure proposed in the next section as they will be used to define an impurity measure that could be applied to multivariate ordinal data.
4 Decision trees for multivariate ordinal responses

The aim of this paper is to propose an extension of the standard CART procedure that can be used to deal with a multivariate ordinal response. The main problem in extending classical univariate regression trees to the multivariate response case lies in generalizing the definition of the partitioning metric. In order to avoid the problems related to the
multidimensional nature of the response, a way to univocally link each value of the multivariate ordinal variable to a univariate one has to be established. An easy solution consists in assigning a label to each specific combination of the multivariate response, i.e. y_i = {y_i1, y_i2, …, y_iK}; in this way, each multivariate vector can be identified through a label that, used as response variable, allows the application of classical univariate classification tree methodologies to multivariate data. As is known, the recursive binary partitioning process used by the CART methodology, starting from the root node, creates a nested sequence of subtrees T_m = {root node} ⊆ … ⊆ T_0 = {full tree}, derived by using, at each step, a splitting criterion given by the maximization of Δi(s, t), the reduction in the impurity i(t) resulting from the split s in node t:

    Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R)

where p_L and p_R are the proportions of units in node t assigned to the left child node t_L and to the right child node t_R respectively at the s-th split. When impurity is evaluated on multivariate ordinal data, it can be properly modified as:

    i(t) = Σ_{i,j∈t} d(y_i, y_j)    (4)
where d is a distance measure between two generic units i and j in the same node t and the sum is extended over all the possible couples of units in the node. The decrease in node impurity at each step will be evaluated over all covariates and their respective split points. In order to make the impurity function able to take into account the ordinal nature of the data, we propose to use, as the distance d in Eq. (4), one of the three distances (F_ij, T_ij, K_ij) defined in Sect. 3. It is possible to show that the impurity functions thus defined satisfy the properties discussed in Sect. 2. According to the CART methodology, tree optimization implies cutting off insignificant nodes through a pruning process based on some cost-complexity criterion that will lead to the definition of the optimum tree size. In this paper, following the approach proposed by Yu et al. (2011), the choice of an "optimal tree" will be carried out through the inspection of the area under the ROC curve. As will be shown in the next section, the area under the ROC curve will be used both as a measure of performance, to compare the decision trees derived by using different distances, and in the pruning phase, as a criterion for finding the optimum tree size. Finally, once the tree has grown, a class label has to be associated with each node, i.e. a vector ŷ in the best agreement with all the units in the node. It will be a K-vector representing the predicted value for all the units (respondents) falling within the same node. Among the several measures proposed in the literature (Kemeny and Snell 1962), here we consider a multivariate median, i.e. the point with minimal distance to all other units in the set. The class label choice can be considered arbitrary, because the performance comparison between different class assignment approaches is beyond the scope of this work.
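A minimal sketch (our own Python illustration, not the paper's rpart implementation; names are invented) of the impurity of Eq. (4) and of the split criterion Δi(s, t), with the Spearman footrule as the plug-in distance:

```python
# Node impurity as the sum of pairwise distances, Eq. (4), and the
# impurity reduction Delta_i(s,t) = i(t) - p_L i(t_L) - p_R i(t_R).
def footrule(yi, yj):
    return sum(abs(a - b) for a, b in zip(yi, yj))

def impurity(Y, dist=footrule):
    # sum over all pairs of units (multivariate ordinal responses) in the node
    n = len(Y)
    return sum(dist(Y[i], Y[j]) for i in range(n) for j in range(i + 1, n))

def impurity_reduction(Y, go_left, dist=footrule):
    # go_left[i] is True when unit i falls in the left child under split s
    n = len(Y)
    YL = [y for y, g in zip(Y, go_left) if g]
    YR = [y for y, g in zip(Y, go_left) if not g]
    pL, pR = len(YL) / n, len(YR) / n
    return impurity(Y, dist) - pL * impurity(YL, dist) - pR * impurity(YR, dist)

Y = [[1, 1], [1, 2], [4, 4], [4, 3]]
# separating the two "low" raters from the two "high" raters removes
# almost all of the node impurity
print(impurity(Y), impurity_reduction(Y, [True, True, False, False]))
# prints: 22 21.0
```

The best split is then simply the candidate split s maximising this reduction, exactly as in univariate CART.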
4.1 Decision trees: the AUC as a measure of performance

As emphasized before, the comparison between trees generated by using different splitting criteria plays an important role in selecting the "best" distance measure to use. A classical measure of tree model assessment is the misclassification rate, defined as the ratio between
misclassified observations and the total number of objects. The misclassification rate cannot be considered a good measure of performance when multivariate ordinal response decision trees have to be compared: it fails to consider cases of partial agreement with the predicted values. Thus, the misclassification rate does not ensure unbiased comparisons. As discussed in Ripley (1996), measures of global goodness-of-fit represent possible alternatives to the classical misclassification rate. A drawback of such measures is their strong dependence on the chosen splitting criterion. Therefore, in order to avoid this problem, following the idea of Yu et al. (2011), the Area Under the ROC Curve (AUC) will be used as a measure for performance comparisons of the trees using the three distances defined in Sect. 3. In particular, since the decision tree for multivariate ordinal data will result in trees with more than two terminal nodes, the extension proposed by Hand and Till (2001) in the context of multiclass classification problems will be adopted. A ROC curve is a graphical tool that is useful for selecting classifiers based on their overall performance (Fawcett 2006). In medical research it is extensively applied to diagnostic testing (Kumar and Indrayan 2011). The ROC curve is conceptually simple; for each classifier (in our case, for each splitting criterion) it provides a visualization of its performance by plotting sensitivity (true positive rate) on the y-axis versus 1 − specificity (false positive rate) on the x-axis for varying cut-off points. This is generally depicted in a square box for convenience, with both axes running from 0 to 1. As one moves along a ROC curve from the lower left corner (0, 0) to the upper right corner (1, 1), several trade-offs between false positives and false negatives are reached. The point (0, 1) represents perfect classification.
When classifiers assign subjects to a class randomly, the ROC curve coincides with the diagonal line between (0, 0) and (1, 1). Starting from this extreme case, the performance of a classifier improves as the ROC curve approaches the top-left corner of the plot. The main disadvantage of using ROC curves for performance comparisons is that, for some threshold values, the curves could cross; that means that a classifier does not outperform another one over the entire threshold support. In terms of the proposed splitting criteria, this means that there might not be a generally preferred distance and that the preference could depend on tree complexity. A solution for overcoming the crossing problem consists in approaching performance comparison using the Area Under the ROC Curve. The AUC represents a global measure of performance based on the ROC curve graphical representation. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1. Moreover, as random guessing produces the diagonal line, which has an area of 0.5, it will assume values in [0.5, 1]. More properties of this performance measure and its relation with other statistical quantities can be found in Hand and Till (2001). In order to compute the AUC for multiple class classification problems, Yu et al. (2011) compute the AUC by averaging pairwise comparisons. A detailed description of the steps used for computing the AUC is shown below.

If we assume that a decision tree T with |T̃| leaves has grown, then, using a testing dataset of size M, with M < N (N is the number of respondents), the AUC of a generic pair of items (l, k) is calculated as follows:

1. Calculate the proportion of respondents who evaluate item l higher than item k as p̂(l, k|t) = N_lkt / N(t), t = 1, …, |T̃|, for every leaf node (where N(t) is the total number of vectors in node t and N_lkt is the number of respondents who evaluate item l higher than item k in node t).
2. Assign p̂(l, k|t) to the respondents who fall in terminal node t.
3. Rank the respondents in the testing dataset in increasing order according to p̂(l, k|t) and assign rank p_m to the m-th respondent who evaluates item l higher than item k, for m = 1, …, R. Note that equal rank is assumed when tied.
4. Let c_t (c_t = N_lkt) be the number of respondents who evaluate item l higher than item k, and d_t (d_t = N_klt) the number of respondents who evaluate item k higher than item l, for t = 1, …, |T̃|.
5. Compute the sum of the ranks as S = Σ_{t=1}^{|T̃|} c_t p_t.
6. Let c_0 = Σ_{t=1}^{|T̃|} c_t be the number of respondents who evaluate item l higher than item k, and d_0 = Σ_{t=1}^{|T̃|} d_t the number of respondents who evaluate item k higher than item l.
7. Evaluate the AUC for each item pair (l, k) as A_lk = [S − c_0(c_0 + 1)/2] / (c_0 d_0).
The overall measure for tree T is defined as the average of the AUC over all item pairs:

    AUC = [2 / (K(K − 1))] Σ_{l<k} A_lk

The plot of the AUC against tree size |T̃| can have a double interpretation: if attention is focused on the curves, it can be used for comparing the predictive ability of trees generated with different split functions; on the other hand, once the curve with the largest AUC has been selected, the plot can be used as a pruning tool that helps to identify the best tree size. The whole procedure, from the computation of the distances [(1), (2), (3)] to the final tree growing and AUC estimation, has been implemented in the R statistical environment. In particular, the distances were used to implement three "user written split functions" (for the rpart package, Therneau and Atkinson 2015) capable of dealing with both linear and weak orderings (authors' submitted paper), or multivariate ordinal data. The rpart package gives users the possibility to add further splitting functions sharing the same structure as the split functions already available in the package. The implementation of user-defined splitting rules requires the definition of a set of R functions that will be used in the various steps of the tree-growing process.
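Steps 1-7 above can be sketched as follows for a single item pair; this is our own toy rank-sum implementation of A_lk, with invented leaf counts (c_t, d_t), not the authors' code:

```python
# AUC for one item pair (l, k) from per-leaf counts, Hand-Till style.
# Each leaf t carries c_t (respondents rating l above k) and d_t (the
# opposite); p_hat(l,k|t) = c_t/(c_t + d_t) orders the leaves (steps 1-3).
def auc_pair(leaves):
    leaves = sorted(leaves, key=lambda cd: cd[0] / (cd[0] + cd[1]))
    c0 = sum(c for c, _ in leaves)   # step 6: total "l above k" respondents
    d0 = sum(d for _, d in leaves)   # step 6: total "k above l" respondents
    S, rank = 0.0, 0
    for c, d in leaves:
        n = c + d
        avg_rank = rank + (n + 1) / 2.0  # tied respondents share the mean rank
        S += c * avg_rank                # step 5: rank sum of "l above k" raters
        rank += n
    return (S - c0 * (c0 + 1) / 2.0) / (c0 * d0)  # step 7

print(auc_pair([(0, 5), (5, 0)]))  # perfectly separating leaves -> 1.0
print(auc_pair([(2, 2), (2, 2)]))  # uninformative leaves -> 0.5
```

Averaging `auc_pair` over all K(K − 1)/2 item pairs then yields the overall AUC of the tree.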
5 Application

5.1 A simulation study

In order to appreciate the advantages of the proposed criteria, a simulation study was carried out. The structure of the simulated dataset (size, number of covariates, output dimension) is very close to that of the SET dataset analyzed in the next subsection but, since the structure is known a priori, the performance of the approach can be evaluated. We generated a dataset of 600 units, with 7 explanatory variables x (3 numerical and 4 dichotomous) and 1 multivariate (7-component) ordinal response (each component y_k in the range 1-4). The dataset was generated as follows:

– We started by generating 10 "centroids" (Table 1);
– We considered 10 subgroups of different size n_i, with Σ n_i = 600; each subgroup initially contains n_i identical units, equal to one "centroid";
– We introduced "noise" in each subgroup by perturbing half of the n_i units, randomly changing the values of two randomly chosen x and two randomly chosen y_k;
– We randomly partitioned the whole dataset into a training set (500 units) and a test set (the remaining 100 units).
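The generation scheme above can be sketched as follows for a single subgroup (a simplified, hypothetical Python version, not the authors' script; in particular, the perturbation of dichotomous covariates is crudely approximated by a ±1 shift):

```python
# One subgroup: n_i copies of a centroid, half of which get perturbed
# in two random covariates and two random response components.
import random
random.seed(1)

centroid_x = [18, 1, 1, 1, 1, 30, 1]   # 7 covariates (cf. Table 1)
centroid_y = [1, 1, 1, 1, 1, 1, 1]     # 7 ordinal components, levels 1-4
n_i = 20

units = [(centroid_x[:], centroid_y[:]) for _ in range(n_i)]
for x, y in units[: n_i // 2]:             # perturb half of the subgroup
    for k in random.sample(range(7), 2):   # two randomly chosen covariates
        x[k] += random.choice([-1, 1])
    for k in random.sample(range(7), 2):   # two randomly chosen responses
        y[k] = random.randint(1, 4)

perturbed = sum(1 for x, y in units if x != centroid_x or y != centroid_y)
print(len(units), perturbed)
# prints: 20 10
```

Repeating this for the 10 centroids with sizes summing to 600 and splitting 500/100 reproduces the overall design described above.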
Table 1 Simulated "centroids"

x1  x2  x3  x4  x5  x6   x7 | y1  y2  y3  y4  y5  y6  y7 | y (label)
18   1   1   1   1   30   1 |  1   1   1   1   1   1   1 |    1
30   0   0   0   0  270   5 |  4   4   4   4   4   4   4 |   34
18   1   1   1   1  270   1 |  1   2   1   2   1   2   1 |   58
30   0   1   0   1  270   5 |  4   4   1   4   1   4   4 |   78
22   1   0   1   1  180   2 |  4   1   2   1   1   3   1 |  105
18   1   1   1   1  150   3 |  4   1   1   1   1   3   1 |  139
30   0   0   0   0  150   4 |  4   3   4   4   2   4   4 |  155
25   1   0   1   1   60   1 |  1   2   1   1   1   4   1 |  206
20   0   0   1   0  200   3 |  4   2   4   4   1   4   4 |  161
23   0   0   1   0  170   2 |  4   2   3   4   2   4   4 |  231
The final training set presents 250 different y values. For each distance measure, the recursive partitioning process generated a nested sequence of subtrees: the AUC was computed for each tree in the sequence, using the test sample. Figure 1 shows the AUC values computed for the test sample against tree size, corresponding to the tree sequences for the considered distances [for details about the implementation of the procedure in R see
[Fig. 1 AUC plot (AUC vs tree size, curves for the Kemeny, Kendall and Spearman distances)—simulated dataset]
(authors' submitted paper)]. Kendall's distance outperforms the other two distances, but the gain with respect to the maximum tree obtained with the Spearman Footrule measure is very low; moreover, Kendall's tree sequence presents the disadvantage of a sudden change in the tree size, from 6 to 30 leaves, making the pruning phase impossible. For this reason we prefer to plot in Fig. 2 the maximum tree obtained using the Spearman Footrule distance as impurity measure, which presents 31 leaves (the vector label and the number of units are reported for each leaf). The vectors corresponding to each leaf are reported in Table 2. Examining the tree plot or, similarly, looking at Table 2, it is clear that the 10 initial "centroids" (marked with an asterisk in the table) correspond to leaves in the tree. The maximum tree presents 16 other leaves, containing 106 out of the 500 initial units (remember that 250 of the initial 500 units have been "perturbed"). The leaves are not pure (the minimum tree size should be 250 in order to get pure leaves) but, dealing with ordinal variables, the percentage of "false positives/negatives" is not the best way to evaluate the tree. The last column in Table 2 reports, for each leaf, the reduction in "deviance" (a Spearman Footrule distance in our case) between the tree root and the leaf: the "deviance" is calculated as the sum of the distances between each y value and the multivariate median (defined in Sect. 4), while the reduction is computed as (root.deviance − leaf.deviance)/root.deviance × 100. As can be seen, even the most "impure" leaf allows for a reduction of 95.2 %. If we want to use the plot in Fig. 1 to prune the tree, we can consider a pruned tree with 15 leaves (after this value the red line keeps growing but with a lower slope): this pruned tree (not shown) presents leaves corresponding to 9 of the initial 10 "centroids", with a minimum reduction in "deviance" of 90.2 %.
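The "deviance" reduction just described can be sketched as follows (our own illustration; variable names and the toy node contents are invented, not taken from the paper):

```python
# Leaf "deviance" = summed footrule distance of each response to the
# multivariate median of the node; reduction as
# (root.deviance - leaf.deviance) / root.deviance * 100.
def footrule(yi, yj):
    return sum(abs(a - b) for a, b in zip(yi, yj))

def deviance(Y):
    # multivariate median: the unit minimising total distance to the others
    med = min(Y, key=lambda m: sum(footrule(m, y) for y in Y))
    return sum(footrule(med, y) for y in Y)

root = [[1, 1], [1, 2], [4, 4], [4, 4], [4, 3]]
leaf = [[4, 4], [4, 4], [4, 3]]
reduction = (deviance(root) - deviance(leaf)) / deviance(root) * 100
print(round(reduction, 1))
# prints: 90.9
```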
[Fig. 2 Maximum tree—simulated dataset: binary tree with 31 leaves, split on the covariates x1-x7; each leaf reports the vector label y and the number of units n]
Table 2 Multivariate responses (and Y labels), sizes and "deviance" reduction corresponding to tree leaves—simulated data

y1 y2 y3 y4 y5 y6 y7 | y (label) | Size | Dev Reduc % | Centroid
 1  1  1  1  1  1  1 |     1     |  20  |   99.4     |    *
 4  4  4  4  4  4  4 |    34     |  31  |   98.8     |    *
 1  2  1  2  3  3  1 |    47     |   8  |   98.5     |
 1  2  1  2  2  2  4 |    54     |   7  |   98.5     |
 1  2  1  2  1  2  1 |    58     |   7  |   99.0     |    *
 1  2  1  2  1  2  1 |    58     |  51  |   99.5     |    *
 1  2  1  2  1  3  1 |    59     |   8  |   98.9     |
 1  2  4  2  1  2  1 |    60     |  11  |   98.1     |
 1  2  1  2  1  4  3 |    67     |   4  |   99.3     |
 4  2  3  2  1  2  1 |    73     |   7  |   98.8     |
 4  4  1  4  1  4  4 |    78     |  32  |   99.7     |    *
 4  4  1  4  1  4  4 |    78     |   7  |   99.0     |    *
 4  4  1  4  1  4  4 |    78     |   7  |   99.4     |    *
 4  4  1  4  1  2  4 |    96     |   7  |   98.9     |
 4  1  4  4  1  4  4 |    99     |   2  |   99.7     |
 4  1  2  1  1  3  1 |   105     |  11  |   98.9     |
 4  1  1  1  1  2  3 |   123     |   3  |   99.9     |
 4  1  1  1  1  4  3 |   138     |   8  |   98.6     |
 4  1  1  1  1  3  1 |   139     |  36  |  100.0     |    *
 4  3  4  4  2  4  4 |   155     |  80  |   95.2     |    *
 4  3  4  4  2  4  4 |   155     |  19  |   96.7     |    *
 4  2  4  4  1  4  4 |   161     |  15  |   99.6     |    *
 4  1  2  4  2  4  4 |   181     |  10  |   98.4     |
 2  3  4  4  2  4  2 |   183     |   7  |   98.6     |
 2  2  1  3  1  4  1 |   192     |   7  |   98.9     |
 1  2  1  1  1  4  1 |   206     |  45  |   98.4     |    *
 4  2  3  4  2  4  4 |   231     |  32  |   99.2     |    *
 4  2  4  4  2  4  4 |   233     |   7  |   98.8     |
 4  2  4  1  2  4  4 |   234     |   7  |   98.8     |
 4  2  3  4  4  4  4 |   238     |   3  |   99.9     |
5.2 An illustrative example: the Student Evaluation Teaching survey

The results of the 2004 Student Evaluation Teaching (SET) survey, carried out at the University of Palermo, are considered as an illustrative example. SET plays an important role in the context of university rankings, because it offers evaluators valuable insights into identifying areas of strength and weakness within courses. The evaluation process gives students the opportunity to express an opinion about their experiences and impressions, rating their satisfaction with different aspects of their training, grouped under three main themes: teaching, service and logistics assessment, and generic skills and learning experiences. Here we try to identify aspects of the educational
Table 3 Subset of questionnaire items used for the analysis; item responses are on a four-point Likert scale: 1 (yes), 2 (more yes than no), 3 (more no than yes), 4 (no)

(C) Interest and general satisfaction
    C2 Total teaching satisfaction
(F) Teacher professional skills
    F2 Teacher—respect for teaching timetable
    F3 Teacher punctuality
    F4 Teacher—traceableness for explanations
    F5 Teacher willingness for clarifications
    F6 Teacher—interest motivation
    F7 Teacher—lecture clarity
[Fig. 3 AUC plot (AUC vs tree size, curves for the Kemeny, Kendall and Spearman distances)—SET dataset]
[Fig. 4 Maximum tree—SET dataset: binary tree with 32 leaves, split on the covariates A2-A7; each leaf reports the vector label y and the number of respondents n]
experience that are associated with the students' overall expression of satisfaction and, at the same time, we determine student profiles that could be used to identify which features of the student experience are most closely related to their expressed satisfaction. The data analyzed in this paper (2004 SET survey) consider only the 1765 questionnaires of the Economics and Statistics programmes of the University of Palermo. Table 3 shows the questionnaire section containing the items used for the analysis (y), while the following student characteristics have been included as covariates: (A1) age, (A2) gender (M, F), (A3) school type (Lycée, other), (A4) indoor or outdoor status, (A5) ongoing status (ongoing, off-course), (A6) number of credits, (A7) number of attended courses. From the whole dataset, we sampled 500 respondents, to be considered as the training set, and 100 respondents, to be considered as the validation set. This procedure could be repeated more than once in order to obtain stable results by averaging over the performance. For each distance measure, the recursive partitioning process generated a nested sequence of subtrees: the AUC was computed for each tree in the sequence, using the test sample. The most used covariates (the most discriminant ones) were found to be (A6) number of credits (46 %), (A7) number of attended courses (39 %), and (A5) ongoing status (11 %). Figure 3 shows the AUC values computed for the test sample against tree size, corresponding to the tree sequences for the considered distances. As can be seen, the Spearman Footrule distance outperforms the other two distances, while the Kemeny distance performs worst [for details about the implementation of the procedure in R see (authors' submitted paper)]. Figure 4 shows the maximum tree obtained using the Spearman Footrule distance as impurity measure, which produces 32 leaves (the vector label and the number of judges are reported for each leaf).
The vectors corresponding to each leaf are reported in Table 4.
Table 4 Multivariate responses (and Y labels), sizes and "deviance" reduction corresponding to tree leaves—SET dataset

C2 F2 F3 F4 F5 F6 F7 | Y (label) | Size | Dev Reduc %
 1  1  1  1  1  1  1 |     3     |  20  |   96.9
 2  2  2  2  2  2  2 |    15     |  37  |   92.5
 2  2  2  1  1  2  2 |    18     |  18  |   96.3
 2  1  1  1  1  1  1 |    21     |  34  |   94.7
 1  1  1  1  1  2  1 |    27     |  36  |   94.4
 2  1  1  1  1  2  2 |    30     |  89  |   83.9
 2  1  1  1  1  2  1 |    32     |  30  |   93.8
 2  1  1  2  1  1  1 |    41     |   9  |   98.7
 3  1  1  1  1  2  2 |    44     |  28  |   93.5
 2  1  1  2  1  2  2 |    45     |  30  |   94.4
 2  2  2  2  1  2  2 |    54     |  23  |   95.9
 2  1  1  2  1  1  2 |    71     |  12  |   98.2
 2  3  1  2  1  2  2 |    85     |  13  |   97.6
 2  1  1  2  1  2  3 |    89     |  26  |   94.6
 2  1  1  1  1  1  2 |   138     |  17  |   97.1
 2  1  2  2  2  2  2 |   282     |  18  |   96.8
 2  1  1  1  1  2  3 |   313     |  11  |   97.9
 3  2  2  2  2  2  3 |   369     |  17  |   95.6
 3  2  1  2  2  2  2 |   419     |  12  |   97.9
 2  2  2  2  2  2  3 |   434     |  14  |   97.0
 2  2  2  2  2  2  1 |   617     |   6  |   99.1
Table 5 Student profiles for Y = 3 and Y = 27

Y = 3    3 ≤ A7 ≤ 4 & A6 > 120 & A5 = ONGOING
         OR 3 ≤ A7 ≤ 4 & 90 ≤ A6 ≤ 120 & A5 = OFF-COURSE
         OR 3 ≤ A7 ≤ 4 & A6 = 60 & A5 = ONGOING & A4 = INDOOR
Y = 27   3 ≤ A7 ≤ 4 & A6 ≤ 30 & A5 = ONGOING & A4 = OUTDOOR & A3 = L & A2 = M
         OR A7 = 2 & A6 ≤ 30 & A5 = ONGOING & A3 = L

Table 6 Student profiles for Y = 44, Y = 369 and Y = 419

Y = 44    3 ≤ A7 ≤ 4 & A6 ≤ 30 & A5 = ONGOING & A4 = OUTDOOR & A3 = L & A2 = F
Y = 369   A7 = 1 & A6 ≤ 90 & A5 = ONGOING
Y = 419   2 ≤ A7 ≤ 4 & A6 = 90 & A4 = OUTDOOR & A3 = L
Examining the tree plot or, equivalently, looking at Table 4, it is clear that the opinion of students, i.e. the degree of students' satisfaction, can be summarized through 21 y values only (the training set of 500 students presents 268 different y values). The leaves are not pure, but the last column in Table 4 reports the reduction in ''deviance'', computed as described in Sect. 5.1. As can be seen, even the most impure leaf allows for a reduction of 83.9 %. The tree plot allows identifying the student profiles corresponding to each of the y values. For example, focusing on the two y values that present maximum satisfaction according to item C2 (overall satisfaction), say Y = 3 and Y = 27 in Table 4, the profiles of the students are reported in Table 5. It is important to note that highly satisfied profiles are generally associated with strongly positive opinions on items F5 and F7, concerning the qualities and abilities of the teacher. Similarly, focusing on the three y values that present minimum (observed) satisfaction according to item C2 (overall satisfaction), say Y = 44, Y = 369 and Y = 419 in Table 4, the profiles of the students are reported in Table 6. Moreover, low levels of satisfaction appear to be highly related to bad opinions about teacher features (Y = 369). A pruning procedure can be performed considering AUC vs tree size (Fig. 3) or, alternatively, looking at the distance matrix (according to the best performing distance measure) of only the responses corresponding to a tree leaf. Examining this distance matrix (not reported for lack of space) we can observe, for example, that, referring to the maximum tree plotted in Fig. 4, each couple of colored leaves could be pruned to a single leaf: the leaves labeled y = 3 and y = 138 (red leaves) both present a minimum distance to the profile labeled y = 21 and, therefore, could be joined into a single leaf labeled 21.
This is not simply equivalent to cutting the split on A5 above the two leaves, because in that case the predicted value (class label, in red) would be y = 138. Similarly, the yellow leaves could be joined into a leaf labeled 434 and the green leaves into a leaf labeled 32. Also in these cases, labels 434 and 32 are not the predicted values (class labels, colored text) of the ''father'' nodes.
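This distance-matrix check can be sketched in a few lines of Python (an illustration, not the authors' R code): for each leaf we look up the nearest other leaf under the footrule distance, and leaves sharing the same nearest neighbour are candidates for merging into a single leaf. The helper name `nearest_leaf` is illustrative; the three median vectors are taken from Table 4 (Y = 3, 21 and 138).

```python
def footrule(u, v):
    """Sum of absolute coordinate-wise differences between responses."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_leaf(leaf_medians):
    """For each leaf label, find the closest other leaf under the footrule
    distance; leaves pointing at the same neighbour are merge candidates."""
    nearest = {}
    for a, ya in leaf_medians.items():
        dists = {b: footrule(ya, yb)
                 for b, yb in leaf_medians.items() if b != a}
        nearest[a] = min(dists, key=dists.get)
    return nearest

# Median vectors of three leaves, from Table 4 (items C2, F2-F7)
leaves = {3:   [1, 1, 1, 1, 1, 1, 1],
          21:  [2, 1, 1, 1, 1, 1, 1],
          138: [2, 1, 1, 1, 1, 1, 2]}
print(nearest_leaf(leaves))  # leaves 3 and 138 are both closest to leaf 21
```

Here both y = 3 and y = 138 are at distance 1 from y = 21, which is why they can be joined into a single leaf labeled 21 rather than inheriting the label of the parent node.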
6 Conclusions and future work

In this paper we have proposed a procedure for obtaining a decision tree when the data present a multivariate ordinal response. A different way to define the impurity function has been proposed, in accordance with the distance measure used in the impurity evaluation. In particular, three distance measures have been considered for use with multivariate ordinal data. Finally, AUC has been introduced as a tool to assess the predictive performance of the final tree and to select the best tree according to the desired tree size and the best impurity measure. Both a simulated dataset and a SET dataset have been used to test the performance of the procedure. In the latter case, we have shown how the final tree representation facilitates the understanding and explanation of student satisfaction, especially when no standard conditions are involved. We have shown how the application of the proposed methodology can be considered a useful tool for deriving profiles of students having similar levels of satisfaction. The possibility of obtaining student profiles is an important tool for policy makers working on the identification of aspects of the teaching process requiring intervention strategies. Still, the proposed methodology can be improved in several respects. A pruning procedure based on the distances between the median vectors in the leaves is touched upon in Sect. 5, but the idea deserves in-depth analysis. Nevertheless, as has been
shown in Sect. 5.2, this is not simply a matter of cutting the split above two leaves, because the new predicted value (class label) could be (and, in our case, is) different from the one in the node above the two leaves. Moreover, different weights for the variables constituting the response vector y can be considered using weighted versions of the distances (authors' submitted paper). Actually, the distances introduced in Sect. 3 fail to take into account two important aspects: element and positional information (Kumar and Vassilvitskii 2010). In some practical applications, one or several of the k elements could be more important than others, or, similarly, the top of the ordered items in a ranking could deserve more attention than the bottom. In these situations, changing the position of a very important element, or a change at the top of the ranking, needs to be ''weighted'' differently. Finally, the performance of different class assignment rules (besides the multivariate median considered here) deserves further study.
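The element-weighting idea can be made concrete with a small sketch. This is a hypothetical illustration, not the weighted distances of the authors' submitted paper: each item k contributes w[k] times its absolute difference, so disagreement on an important item costs more. The weight vector up-weighting the teacher items F5 and F7 is invented for the example.

```python
def weighted_footrule(u, v, w):
    """Element-weighted footrule: item k contributes w[k] times its
    absolute difference, so important items dominate the distance."""
    return sum(wk * abs(a - b) for a, b, wk in zip(u, v, w))

u = [2, 1, 1, 1, 1, 2, 2]            # responses on items C2, F2-F7
v = [2, 1, 1, 1, 1, 1, 1]
uniform = [1] * 7                    # reduces to the plain footrule
teacher = [1, 1, 1, 1, 2, 1, 2]      # hypothetical: up-weight F5 and F7
print(weighted_footrule(u, v, uniform))  # → 2
print(weighted_footrule(u, v, teacher))  # → 3
```

With uniform weights the two vectors differ by 2; doubling the weight of F7, where they disagree, raises the distance to 3.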
References

Alvo, M., Yu, P.: Statistical Methods for Ranking Data. Springer, New York (2014)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks, Monterey (1984)
Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning for label ranking. In: Bottou, L., Littman, M. (eds.) Proceedings of the 26th International Conference on Machine Learning, Montreal, pp. 161–168. Omnipress, Madison (2009)
De'ath, G.: Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83(4), 1105–1117 (2002)
Emond, E., Mason, D.: A new rank correlation coefficient with application to the consensus ranking problem. J. Multi-Criteria Decis. Anal. 11, 17–28 (2002)
Espejo, M.R.: Statistical methods for ranking data. Int. Stat. Rev. 83(1), 172–173 (2015)
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
Genest, C., Nešlehová, J., Ghorbal, N.B.: Spearman's footrule and Gini's gamma: a review with complements. J. Nonparametr. Stat. 22(8), 937–954 (2010)
Hand, D.J., Till, R.J.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)
Kemeny, J.G., Snell, J.L.: Preference Rankings: An Axiomatic Approach. MIT Press, Cambridge (1962)
Kumar, R., Indrayan, A.: Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatr. 48(4), 277–287 (2011)
Kumar, R., Vassilvitskii, S.: Generalized distances between rankings. In: Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 571–580. ACM, New York (2010)
Piccarreta, R.: Classification trees for ordinal variables. Comput. Stat. 23(3), 407–427 (2008)
Ripley, B.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)
Rokach, L.: Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing Company Inc., New York (2014)
Shih, Y.: Selecting the best splits for classification trees with categorical variables. Stat. Prob. Lett. 54(4), 341–345 (2001)
Siciliano, R., Mola, F.: Multivariate data analysis and modeling through classification and regression trees. Comput. Stat. Data Anal. 32(3–4), 285–301 (2000)
Tarsitano, A.: Comparing the effectiveness of rank correlation statistics. Working Papers 200906, Università della Calabria, Dipartimento di Economia e Statistica (2009)
Therneau, T., Clinic, M.: User written splitting functions for RPART (2015)
Yu, P., Wan, W., Lee, P.: Decision tree modeling for ranking data. In: Preference Learning, pp. 83–106. Springer, Berlin
Zhang, H.: Classification trees for multiple binary responses. J. Am. Stat. Assoc. 93(441), 180–193 (1998)
Zhang, H., Ye, Y.: A tree-based method for modeling a multivariate ordinal response. Stat. Interface 1(1), 169–178 (2008)