Knowl Inf Syst https://doi.org/10.1007/s10115-018-1229-3 REGULAR PAPER
Two improved attribute weighting schemes for value difference metric

Liangxiao Jiang (1) · Chaoqun Li (2)

(1) Department of Computer Science, China University of Geosciences, Wuhan 430074, China
(2) Department of Mathematics, China University of Geosciences, Wuhan 430074, China
Corresponding author: Liangxiao Jiang, [email protected]

Received: 11 November 2016 / Revised: 11 May 2018 / Accepted: 26 May 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Abstract Due to its simplicity, efficiency and efficacy, the value difference metric (VDM) has continued to perform well against more sophisticated newcomers and thus has remained of great interest to the distance metric learning community. Of the numerous approaches to improving VDM by weakening its attribute independence assumption, attribute weighting has received less attention (only two attribute weighting schemes exist) but has demonstrated remarkable class probability estimation performance. Of the two existing attribute weighting schemes, one is non-symmetric and the other is symmetric. In this paper, we propose two simple improvements for setting attribute weights for use with VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM). We performed extensive evaluations on a large number of datasets and found that KLD-VDM and GR-VDM significantly outperform the two existing attribute weighting schemes in terms of negative conditional log likelihood and root relative squared error, while at the same time maintaining the computational simplicity and robustness that characterize VDM.

Keywords Distance metric learning · Value difference metric · Attribute weighting · Gain ratio · Kullback–Leibler divergence
1 Introduction

Distance metric learning aims to learn a distance metric for the input space of data from a given collection of pairs of similar/dissimilar points that preserves the distance relations among the training data [45]. Many machine learning algorithms, such as the k-nearest-neighbor algorithm (KNN) and its variants [7,8,17,42], heavily rely on learning good
distance metrics from training data. For example, the large margin nearest-neighbor algorithm (LMNN) [42] attempts to learn a Mahalanobis distance metric for KNN classification from labeled instances. For another example, the adaptive metric nearest-neighbor algorithm (ADAMENN) [8] tries to learn a flexible Chi-squared distance metric for producing neighborhoods that are highly adaptive to query locations. In short, learning distance metrics from data has been actively studied in the machine learning community and has been successfully applied to many real-world application domains, including information retrieval, image processing, face recognition, cognitive psychology and bioinformatics [2,18,20,39].

Although many distance metrics have been proposed to decide which instance is closest to a given test instance, most of them work well for numerical attributes rather than nominal attributes [5,43]. To appropriately handle nominal attributes, the value difference metric (VDM) [41] was proposed. It is noteworthy that the original definition of VDM actually has an attribute weighting scheme. However, this attribute weighting scheme makes VDM non-symmetric [4]. In order to make VDM symmetric, a simplified version of VDM (without the attribute weighting scheme) is widely used, and numerous enhancements start from it [4,19,27,28,30,43]. In the simplified definition of VDM, the distance between two instances is the straightforward superposition of the total differences between each pair of attribute values, and thus it requires a strong assumption, namely that all attributes are totally independent of each other. This assumption was explained by Kasif et al. [21] from the viewpoint of the probability memory-based reasoning (MBR) transform, which transforms a nominal attribute space into a probability distribution.

Thanks to the attribute independence assumption, VDM is very easy to learn. However, recent research in distance metric learning has shown that, even with such a strong assumption, VDM is surprisingly competitive with state-of-the-art metrics such as the Short and Fukunaga metric (SFM) [29,32,40] and the minimum risk metric (MRM) [3]. This fact raises the question of whether VDM with a less restrictive assumption can perform even better. To answer this question, learning improved VDM by weakening its attribute independence assumption has attracted much attention from researchers. Related work can be broadly divided into five main categories: (1) structure extension [19,28]; (2) instance selection, also called local learning [25]; (3) instance weighting [4,26]; (4) attribute selection [30]; (5) attribute weighting [27].

Of these approaches to improving VDM by weakening its attribute independence assumption, attribute weighting has received less attention from researchers. To the best of our knowledge, there exist only two attribute weighting schemes in the literature. One is the non-symmetric attribute weighting scheme introduced by the original VDM [41], and the other is the symmetric mutual information-based attribute weighting scheme [27]. However, both have demonstrated remarkable class probability estimation performance. To further improve their performance, in this paper we propose two new attribute weighting schemes for setting attribute weights for use with VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM).
Classification is a fundamental issue in data mining and machine learning. In classification, the goal of learning (constructing) a classifier is to correctly classify an input test instance. So the performance of a classifier is of course measured by its classification accuracy (the percentage of test instances correctly classified). However, in many real-world applications, classification alone is not always adequate [34]. For example, in target marketing, we may hope to evaluate various offer propositions by the class probability that a customer will respond to an offer [38]. In cost-sensitive decision-making, the class probability estimation is used to minimize the conditional risk [10]. Thus, accurate class probability estimation is
often required instead of classification in many real-world applications. For the task of class probability estimation, the negative conditional log likelihood (−CLL) [13,14], rather than the classification accuracy, is a more reasonable performance measure for the learned (constructed) classifiers. Besides, the root relative squared error (RRSE) or the root-mean-square error (RMSE) of the class probability estimates is widely used as a performance measure to evaluate probability-based classifiers [15,27,46]. In this paper, we also use −CLL and RRSE to evaluate the distance-related classifiers and therefore the defined distance metrics. The experimental results on 56 widely used datasets from the University of California at Irvine (UCI) machine learning repository [11] validate the effectiveness of the proposed KLD-VDM and GR-VDM.

The rest of the paper is organized as follows. First, we revisit the attribute independence assumption in VDM and the two existing attribute weighting schemes for VDM. Then, we propose the Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the gain ratio weighted value difference metric (GR-VDM). After a detailed description of our experiments and results, we conclude the paper and outline the main directions for future research.
2 Related work

When all attributes are nominal, VDM [41] is a well-known distance metric. The original definition of VDM has an attribute weighting form, which makes VDM non-symmetric [4]. In order to make VDM symmetric, a simplified version of VDM (without the attribute weighting scheme) [43] is widely used, and numerous enhancements start from it.

Given a training data set, assume that A_1, A_2, ..., A_m are m attribute variables and C is the class variable. A training instance x can be represented by an attribute value vector \langle a_{x1}, a_{x2}, \ldots, a_{xm} \rangle, where a_{xi} is the ith attribute value of x. The VDM without the attribute weighting scheme [43] defines the distance from a training instance x to a given test instance y as:

d(x, y) = \sum_{i=1}^{m} \sum_{c} \left| P(c \mid a_{xi}) - P(c \mid a_{yi}) \right|,    (1)

where m is the number of attributes, c ranges over the possible class labels of C, and P(c \mid a_{xi}) and P(c \mid a_{yi}) are the conditional probabilities that the output class is c given that the attribute A_i has the values a_{xi} and a_{yi}, respectively, which can be easily estimated using the Laplace estimate as follows:

P(c \mid a_{xi}) = \frac{1 + \sum_{j=1}^{n} \delta(a_{ji}, a_{xi})\,\delta(c_j, c)}{t + \sum_{j=1}^{n} \delta(a_{ji}, a_{xi})},    (2)

P(c \mid a_{yi}) = \frac{1 + \sum_{j=1}^{n} \delta(a_{ji}, a_{yi})\,\delta(c_j, c)}{t + \sum_{j=1}^{n} \delta(a_{ji}, a_{yi})},    (3)

where t is the number of classes, n is the number of training instances, a_{ji} is the ith attribute value of the jth training instance, and the indicator function \delta(\alpha, \beta) is one if \alpha = \beta and zero otherwise.

After VDM was proposed, Wilson and Martinez [43] proposed three new heterogeneous distance metrics, the heterogeneous value difference metric (HVDM), the interpolated value difference metric (IVDM) and the windowed value difference metric (WVDM), to extend VDM to handle both nominal and numeric attributes. Their experimental results show that their new distance metrics achieve higher classification accuracy on average than the Euclidean distance metric, the heterogeneous Euclidean-overlap metric (HEOM), and the discretized value difference metric (DVDM).

The Short and Fukunaga metric [40], simply SFM, is another distance metric for nominal attributes only, which was originally presented by Short and Fukunaga [40] and then extended by Myles and Hand [32]. SFM addresses the issue of selecting the best distance measure by minimizing the difference between the finite sample nearest-neighbor classification error and the asymptotic nearest-neighbor error based on probabilities. SFM with multiple classes is defined as:

d(x, y) = \sum_{c} \left| P(c \mid x) - P(c \mid y) \right|.    (4)
The minimum risk metric [3], simply MRM, is a simple distance function that directly minimizes the risk of misclassification. Assume that an instance x has the instance y as its nearest neighbor. Given a class label c, the finite risk of misclassifying x is P(c \mid x)(1 - P(c \mid y)). In the multiple-class case, the total finite risk is the sum of the risks over all the different classes. Therefore, MRM is defined as:

d(x, y) = \sum_{c} P(c \mid x)\,(1 - P(c \mid y)).    (5)
In Eqs. 4 and 5, a major challenge is how to estimate the class membership probabilities P(c|x) and P(c|y). A practical way is to use naive Bayes (NB) [9].
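To make the definitions above concrete, here is a minimal Python sketch that estimates the Laplace-smoothed conditional probabilities of Eqs. 2 and 3 and computes the simplified VDM of Eq. 1. The toy data, the function names and the overall structure are our own illustration, not the implementation used in this paper.

```python
from collections import defaultdict

def conditional_probs(X, y):
    """Build Laplace-smoothed estimates of P(c | A_i = a), as in Eqs. 2-3."""
    classes = sorted(set(y))
    t = len(classes)                                  # number of classes
    m = len(X[0])                                     # number of attributes
    # counts[i][a][c] = number of training instances with A_i = a and class c
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(m)]
    for row, c in zip(X, y):
        for i, a in enumerate(row):
            counts[i][a][c] += 1

    def p(i, a, c):
        n_a = sum(counts[i][a].values())              # instances with A_i = a
        return (1 + counts[i][a][c]) / (t + n_a)      # Laplace estimate
    return classes, p

def vdm_distance(x, z, classes, p):
    """Simplified VDM of Eq. 1: superposition of per-attribute differences."""
    return sum(abs(p(i, x[i], c) - p(i, z[i], c))
               for i in range(len(x)) for c in classes)

if __name__ == "__main__":
    # toy nominal data: (outlook, windy) -> play
    X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
    y = ["yes", "no", "yes", "no"]
    classes, p = conditional_probs(X, y)
    print(vdm_distance(("sunny", "no"), ("rain", "yes"), classes, p))
```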
2.1 The attribute independence assumption in VDM

According to the memory-based reasoning (MBR) transform [21], when the number of classes is t, an m-dimensional input space can be transformed into a (t \times m)-dimensional space. Therefore, an m-dimensional input vector \langle a_{x1}, a_{x2}, \ldots, a_{xm} \rangle can be transformed into a (t \times m)-dimensional vector \langle P(c_1 \mid a_{x1}), \ldots, P(c_t \mid a_{x1}), \ldots, P(c_1 \mid a_{xm}), \ldots, P(c_t \mid a_{xm}) \rangle. Then, according to the Manhattan distance, the distance between x and y in the MBR transform space is just the simplified version of VDM defined by Eq. 1.

From the MBR transform perspective, in VDM two values of an attribute are considered closer if they have more similar correlations with the output classes c_k (k = 1, 2, \ldots, t). VDM then defines the distance between two instances x and y as the straightforward superposition of the value differences across all attributes. Kasif et al. [21] call the transformation of nominal attributes into probability distributions the probability MBR transform. According to their observations, the basis of the transformation is just the attribute independence assumption on the joint probability distribution, i.e.

P(a_{x1}, a_{x2}, \ldots, a_{xm}, c) = P(c) \prod_{i=1}^{m} P(a_{xi} \mid c).    (6)

Thus, VDM makes a strong assumption that all attributes are fully independent, simply called the attribute independence assumption, and seeks the nearest neighbors in this transformed space.
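The following small sketch illustrates the probability MBR transform just described: with t classes, each of the m nominal values of an instance is mapped to its t class-conditional probabilities, and the Manhattan distance in the resulting (t × m)-dimensional space coincides with the simplified VDM of Eq. 1. The probability table below is made up purely for illustration.

```python
classes = ["yes", "no"]                  # t = 2 classes

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},   # A_1
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},   # A_2
]

def mbr_transform(x):
    """Map <a_x1, ..., a_xm> to <P(c_1|a_x1), ..., P(c_t|a_xm)>."""
    return [table[i][a][c] for i, a in enumerate(x) for c in classes]

def manhattan(u, v):
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

x, z = ("sunny", "no"), ("rain", "yes")
# The Manhattan distance between the transformed vectors equals Eq. 1.
print(manhattan(mbr_transform(x), mbr_transform(z)))
```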
2.2 The existing attribute weighting schemes

Recent work in distance metric learning has shown that the simplified VDM is competitive with state-of-the-art distance metrics such as SFM and MRM. However, it is obvious that the attribute independence assumption in VDM is rarely true in reality, which can harm its performance in applications with complex attribute dependencies. To weaken the attribute independence assumption in VDM, a large number of improved approaches have been proposed. In this paper, we focus on the attribute weighting approach. Different from the attribute selection approach, which completely eliminates irrelevant or redundant attributes from the training data, the attribute weighting approach aims to weight each attribute according to its importance. It defines the distance from a training instance x to a given test instance y as:

d(x, y) = \sum_{i=1}^{m} w_i \sum_{c} \left| P(c \mid a_{xi}) - P(c \mid a_{yi}) \right|,    (7)

where w_i is the weight of attribute A_i, which indicates the importance of A_i. Now, the only question left to answer is how to define w_i (i = 1, 2, \ldots, m), which is crucial in learning an attribute weighted value difference metric but has surprisingly received less attention than it warrants. To the best of our knowledge, there exist only two attribute weighting schemes in the literature. One is the attribute weighting scheme introduced by the original VDM [41], and the other is the mutual information-based attribute weighting scheme [27].

As already pointed out above, the original VDM [41] actually has an attribute weighting scheme. Since we focus on attribute weighted value difference metrics in this paper, we first revisit the attribute weighting scheme in the original VDM. Stanfill and Waltz [41] argued that attributes differ in importance because they differ in the strength with which they constrain the class values, and that different values of a single attribute can also differ in how strongly they constrain the class values. Therefore, they define the attribute weight w_i by measuring the strength of the constraint imposed by the ith attribute value a_{yi} of the given test instance y on the class. The detailed method is to restrict the values of the ith attribute of all training instances to a_{yi}, find the frequencies of the various class values, square them, sum them, and finally take the square root of the result. The resulting weight w_i is:

w_i = \sqrt{\sum_{c} P(c \mid a_{yi})^2}.    (8)
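As a concrete illustration of Eqs. 7 and 8, the sketch below combines the attribute-weighted VDM with the Stanfill–Waltz weight computed from the test instance's attribute values. The probability table and names are made-up illustrations under our own assumptions, not the authors' code.

```python
import math

classes = ["yes", "no"]

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]

def sw_weight(i, a):
    """Stanfill-Waltz weight of Eq. 8 for the value a of attribute A_i."""
    return math.sqrt(sum(table[i][a][c] ** 2 for c in classes))

def weighted_vdm(x, y_inst):
    """Eq. 7 with w_i computed from the test instance's value a_yi; since the
    weight depends on y, the distance is not symmetric in general."""
    return sum(sw_weight(i, y_inst[i]) *
               abs(table[i][x[i]][c] - table[i][y_inst[i]][c])
               for i in range(len(x)) for c in classes)

x, y_inst = ("sunny", "no"), ("rain", "yes")
print(weighted_vdm(x, y_inst), weighted_vdm(y_inst, x))   # generally differ
```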
It can be seen that this attribute weighting scheme makes the original VDM non-symmetric, because the distance from x to y is in general not equal to the distance from y to x, i.e. d(x, y) \neq d(y, x). It is noteworthy that, when we define d(y, x), the weight w_i is \sqrt{\sum_{c} P(c \mid a_{xi})^2}, which is in general not equal to \sqrt{\sum_{c} P(c \mid a_{yi})^2}. Besides, this scheme uses different weights for different attribute values and therefore strictly performs attribute value weighting rather than attribute weighting.

To make the original VDM symmetric and strictly perform attribute weighting, Li et al. [27] proposed a mutual information-based attribute weighting scheme and called the resulting metric the mutual information weighted value difference metric (MI-VDM). They argued that the importance of an attribute can be roughly measured by the correlation between it and the class variable, and therefore they use I_P(A_i; C), the mutual information between attribute A_i and the class variable C, to define the weight w_i. The detailed formula is:

w_i = I_P(A_i; C) = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)},    (9)

where a_i is each possible attribute value of A_i and c is each possible class label of C. Roughly speaking, I_P(A_i; C) measures how much information A_i provides about C, and therefore this is a symmetric attribute weighting scheme.
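The following is a minimal sketch of the mutual-information weight of Eq. 9, computed from empirical joint frequencies of one attribute and the class; the toy data and names are purely illustrative.

```python
import math
from collections import Counter

def mutual_information(values, labels):
    """w_i = I_P(A_i; C) = sum_{a,c} P(a,c) * log( P(a,c) / (P(a) P(c)) )."""
    n = len(values)
    p_ac = Counter(zip(values, labels))
    p_a, p_c = Counter(values), Counter(labels)
    return sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
               for (a, c), nac in p_ac.items())

# one nominal attribute column and the corresponding class column
A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(mutual_information(A_i, C))    # the MI-VDM weight w_i for this attribute
```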
3 The proposed attribute weighting schemes

In attribute weighting schemes, the difference between each pair of attribute values (a_{xi} and a_{yi}) is weighted by the importance w_i of the attribute A_i. This kind of attribute weighting corresponds to stretching the axes in the attribute space, and the amount by which each axis is stretched is determined by the importance of each attribute. This process of stretching the axes in order to optimize the performance of distance metrics provides a mechanism for emphasizing more relevant attributes and suppressing less relevant ones.

Appropriate attribute weights can reduce the error that results from violations of the attribute independence assumption in VDM. Obviously, if a training data set includes a set of attributes that are identical to one another, the error due to the violation of the attribute independence assumption can be removed by assigning weights that sum to 1.0 to this set of attributes. For example, the weight of one of the attributes, A_i, could be set to 1.0, and the weights of the remaining attributes that are identical to A_i set to 0.0. This is equivalent to removing the remaining attributes from the training data. Attribute weighting is strictly more powerful than attribute selection, as it is possible to obtain results identical to attribute selection by setting the weights of selected attributes to 1.0 and of unselected attributes to 0.0, while other weight assignments can result in distance metrics that cannot be expressed using attribute selection.

To further improve performance, we propose two new attribute weighting schemes for setting attribute weights for use with the simplified VDM in this section. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM).
3.1 KLD-VDM

Our first proposed attribute weighting scheme is inspired by the Kullback–Leibler divergence [23,24], which is a measure of the difference between two probability distributions. Let P and Q be two discrete probability distributions; then the Kullback–Leibler divergence from Q to P is \sum_{i} P(i) \log \frac{P(i)}{Q(i)}. Specifically, the Kullback–Leibler divergence from Q to P is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. The Kullback–Leibler divergence appears in the information theoretic literature under various guises. For instance, it can be viewed as a special case of the cross-entropy or the discrimination, a measure which defines the information theoretic similarity between two probability distributions.

Here we take these two probability distributions to be the prior probability distribution of C and the posterior probability distribution of C after observing A_i = a_{yi}, respectively. Then the Kullback–Leibler divergence, denoted KLD(C \mid a_{yi}), from the prior probability distribution of C to the posterior probability distribution of C after observing A_i = a_{yi} is:

KLD(C \mid a_{yi}) = \sum_{c} P(c \mid a_{yi}) \log \frac{P(c \mid a_{yi})}{P(c)}.    (10)
Now, our first proposed attribute weighting scheme is the Kullback–Leibler divergence-based attribute weighting scheme, which defines the weight w_i of attribute A_i as the Kullback–Leibler divergence for the attribute value a_{yi}. The detailed formula is:

w_i = KLD(C \mid a_{yi}) = \sum_{c} P(c \mid a_{yi}) \log \frac{P(c \mid a_{yi})}{P(c)}.    (11)
Similar to the original VDM, our Kullback–Leibler divergence weighted VDM also performs non-symmetric attribute value weighting. First, when we define d(y, x), the weight w_i is KLD(C \mid a_{xi}), which is in general not equal to KLD(C \mid a_{yi}). Second, this scheme also uses different weights for different attribute values and is therefore an attribute value weighting scheme.
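The following sketch shows the Kullback–Leibler divergence weight of Eq. 11 and the resulting KLD-VDM distance; as in the sketches above, the probability tables and names are illustrative assumptions rather than the authors' code.

```python
import math

classes = ["yes", "no"]
prior = {"yes": 0.5, "no": 0.5}                       # P(c)

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]

def kld_weight(i, a):
    """KLD(C | A_i = a): information gained about C by observing A_i = a (Eq. 11)."""
    return sum(table[i][a][c] * math.log(table[i][a][c] / prior[c])
               for c in classes)

def kld_vdm(x, y_inst):
    """KLD-VDM: Eq. 7 with w_i = KLD(C | a_yi); non-symmetric like the original VDM."""
    return sum(kld_weight(i, y_inst[i]) *
               abs(table[i][x[i]][c] - table[i][y_inst[i]][c])
               for i in range(len(x)) for c in classes)

x, y_inst = ("sunny", "no"), ("rain", "yes")
print(kld_vdm(x, y_inst), kld_vdm(y_inst, x))         # generally not equal
```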
3.2 GR-VDM

Our research on the second proposed attribute weighting scheme starts from our comments on the existing mutual information-based attribute weighting scheme [27]. First, the mutual information between attribute A_i and the class variable C is actually the information gain about C after observing A_i.

Proof The information gain IG(A_i) about C after observing A_i is simply the expected reduction in the entropy of C before and after observing A_i:

IG(A_i) = H(C) - H(C \mid A_i)
        = \sum_{a_i} P(a_i) \sum_{c} P(c \mid a_i) \log P(c \mid a_i) - \sum_{c} P(c) \log P(c)
        = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)} - \sum_{a_i, c} P(a_i, c) \log P(c)
        = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)}
        = I_P(A_i; C).    (12)
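As a quick numerical sanity check of Eq. 12, the sketch below computes the information gain H(C) − H(C|A_i) and the mutual information I_P(A_i; C) from the same empirical frequencies and verifies that they coincide; the toy data are our own.

```python
import math
from collections import Counter

A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
n = len(C)

def entropy(labels):
    return -sum((k / len(labels)) * math.log(k / len(labels))
                for k in Counter(labels).values())

# information gain: H(C) - H(C|A_i), grouping class labels by the value of A_i
h_c_given_a = sum((cnt / n) * entropy([c for a2, c in zip(A_i, C) if a2 == a])
                  for a, cnt in Counter(A_i).items())
info_gain = entropy(C) - h_c_given_a

# mutual information: sum_{a,c} P(a,c) log( P(a,c) / (P(a) P(c)) )
p_ac, p_a, p_c = Counter(zip(A_i, C)), Counter(A_i), Counter(C)
mi = sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
         for (a, c), nac in p_ac.items())

print(abs(info_gain - mi) < 1e-9)    # True: IG(A_i) equals I_P(A_i; C)
```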
Unfortunately, there is a natural bias in the mutual information-based attribute weighting scheme that favors attributes with more values; that is, attributes with greater numbers of values will appear to gain more information than those with fewer values, even if they are actually no more informative [16,31,35,36]. As an extreme example, when an attribute happens to have almost a different value for each training instance, it has a very high information gain relative to the training instances, despite being a very poor predictor on unseen test instances. To overcome the bias of the information gain measure, the gain ratio measure is widely used in decision tree learning [31,35,36]. Based on this premise, in this paper we propose a gain ratio-based attribute weighting scheme GR(A_i) that penalizes attributes with many values by incorporating a term called the split information (the entropy of A_i) into the following formula:

w_i = GR(A_i) = \frac{IG(A_i)}{H(A_i)} = \frac{I_P(A_i; C)}{H(A_i)} = \frac{\sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)}}{-\sum_{a_i} P(a_i) \log P(a_i)}.    (13)
The split information of attribute A_i measures how broadly and uniformly it splits the given training data. However, when an attribute happens to have the same value for nearly all training instances, its split information will be zero or very small, which makes GR(A_i) either undefined or very large. To address this practical issue, we can adopt a heuristic such as first computing IG(A_i) and then applying GR(A_i) only when H(A_i) is above a small threshold ε. In our current implementation and experiments, we simply set ε to 0.000001.
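A minimal sketch of the gain-ratio weight of Eq. 13 with the ε guard described above: the weight is I_P(A_i; C)/H(A_i), and when the split information H(A_i) is not above ε the sketch falls back to the plain information gain. The data, the names and the exact fallback choice are illustrative assumptions.

```python
import math
from collections import Counter

EPSILON = 1e-6   # the small threshold used for the split information

def mutual_information(values, labels):
    n = len(values)
    p_ac, p_a, p_c = Counter(zip(values, labels)), Counter(values), Counter(labels)
    return sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
               for (a, c), nac in p_ac.items())

def entropy(values):
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in Counter(values).values())

def gain_ratio_weight(values, labels):
    """Eq. 13, guarded against (near-)zero split information H(A_i)."""
    ig = mutual_information(values, labels)     # equals IG(A_i) by Eq. 12
    h_a = entropy(values)                       # split information of A_i
    return ig / h_a if h_a > EPSILON else ig    # illustrative fallback

A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(gain_ratio_weight(A_i, C))    # the symmetric GR-VDM weight w_i
```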
4 Experiments and results

The purpose of this section is to validate the effectiveness of our proposed KLD-VDM and GR-VDM. We therefore compare them with the two existing attribute weighting schemes, namely the attribute weighting scheme in the original VDM (VDM for short) [41] and the mutual information-based attribute weighting scheme (MI-VDM for short) [27], in terms of the negative conditional log likelihood (−CLL) and root relative squared error (RRSE) of the probability estimates produced by the distance-weighted k-nearest-neighbor algorithm (DWKNN) [31]. DWKNN is a natural first test bed for demonstrating the effectiveness of a distance metric, because the distance metric is used twice: DWKNN first uses the distance metric to find the k nearest neighbors of the test instance, and then each of the k nearest neighbors is weighted in terms of its distance to the test instance (a sketch of this evaluation procedure is given after the summary of findings below). According to [31], once distance weighting is used, DWKNN is relatively insensitive to the choice of k. Thus, in our implementation, we choose a relatively large k value, namely we set the value of k in DWKNN to 10. In fact, we tested several other k values, such as 9, 15, 20 and 30, in our experiments and obtained almost the same conclusions. To save space, we do not present these detailed experimental results in this paper.

We employ the Waikato Environment for Knowledge Analysis (WEKA) [44] to implement DWKNN using the four distance metrics VDM, MI-VDM, KLD-VDM and GR-VDM, respectively. Note that, in our implementations, we smooth all of the estimated probabilities using the Laplace estimate. Our experiments are conducted on a collection of 56 benchmark datasets from the University of California at Irvine (UCI) machine learning repository [11], which represent a wide range of domains and data characteristics. The properties of these datasets are shown in Table 1. We replace all missing values with the means or modes from the available data and then use the unsupervised Discretize filter in WEKA [44] to discretize all numeric attributes into ten equal-width bins. Besides, to save running time, we use the unsupervised Resample filter with a size of 20% in WEKA to resample the datasets with more than 5000 instances.

Tables 2 and 3 show the detailed comparison results. The detailed negative conditional log likelihood (−CLL) and root relative squared error (RRSE) scores are obtained by averaging the results from 10 separate runs of stratified 10-fold cross-validation. The • and ◦ in Tables 2 and 3 denote significant improvement or degradation over VDM based on two-tailed t-tests at a confidence level of p = 0.05 [33], respectively. It is worth noting that, for −CLL and RRSE, a smaller number is better, and thus • denotes significantly better than VDM and ◦ denotes significantly worse. The Win/Tie/Lose (W/T/L) values and averages are summarized at the bottom of the tables. Each W/T/L entry indicates that, compared to VDM, the corresponding metric (MI-VDM, KLD-VDM or GR-VDM) wins on W datasets, ties on T datasets, and loses on L datasets. The average across all datasets provides a gross indication of relative performance in addition to the other statistics. From these comparison results, we can see that:

1. In terms of −CLL, KLD-VDM is notably better than VDM on 28 datasets and notably worse on 4 datasets. Further, the average −CLL value (71.25) of KLD-VDM is notably lower than that of VDM (78.02).
Table 1  Datasets used in the experiments

Dataset                #Instances   #Attributes   #Classes   Missing   Numerical
anneal                 898          38            6          y         y
anneal.ORIG            898          38            6          y         y
audiology              226          69            24         y         n
autos                  205          25            7          y         y
balance-scale          625          4             3          n         y
breast-cancer          286          9             2          y         n
breast-w               699          9             2          y         n
cardiotocography-2     2126         21            3          n         y
colic                  368          22            2          y         y
colic.ORIG             368          27            2          y         y
connectionist-vowel    528          10            11         n         y
climate                540          20            2          n         y
credit-a               690          15            2          y         y
credit-g               1000         20            2          n         y
cylinder-bands         540          40            2          y         y
diabetes               768          8             2          n         y
ecoli                  336          7             8          n         y
energy-y1              768          8             37         n         y
energy-y2              768          8             38         n         y
glass                  214          9             7          n         y
heart-c                303          13            5          y         y
heart-h                294          13            5          y         y
heart-statlog          270          13            2          n         y
hepatitis              155          19            2          y         y
hypothyroid            3772         29            4          y         y
ionosphere             351          34            2          n         y
iris                   150          4             3          n         y
kr-vs-kp               3196         36            2          n         n
labor                  57           16            2          y         y
letter                 20,000       16            26         n         y
lymph                  148          18            4          n         y
mfeat-f                2000         76            10         n         y
monks                  432          6             2          n         n
movement-libras        360          90            15         n         y
mushroom               8124         22            2          y         n
newthyroid             215          5             3          n         y
optdigits              5620         62            10         n         y
page-blocks            5473         10            5          n         y
parkinsons             195          22            2          y         y
pendigits              10,992       16            10         n         y
primary-tumor          339          17            21         y         n
qar-biodegradation     1055         41            2          n         y
robot-24               5456         24            4          n         y
segment                2310         19            7          n         y
sick                   3772         29            2          y         y
sonar                  208          60            2          n         y
soybean                683          35            19         y         n
spect                  267          22            2          n         n
splice                 3190         61            3          n         n
steel-plates-faults    1941         33            2          n         y
vehicle                846          18            4          n         y
vote                   435          16            2          y         n
vowel                  990          13            11         n         y
waveform-5000          5000         40            3          n         y
yeast                  1484         8             10         n         y
zoo                    101          17            7          y         y
2. In terms of −CLL, GR-VDM is significantly better than VDM on 34 datasets and significantly worse on none. Furthermore, the average −CLL value (67.39) of GR-VDM is significantly lower than that of VDM.
3. In terms of RRSE, KLD-VDM is markedly better than VDM on 27 datasets and markedly worse on 6 datasets. In addition, the average RRSE value (72.39) of KLD-VDM is markedly lower than that of VDM (75.68).
4. In terms of RRSE, GR-VDM is remarkably better than VDM on 33 datasets and remarkably worse on none. Additionally, the average RRSE value (70.70) of GR-VDM is remarkably lower than that of VDM.
5. For additional insight into the results, we examined the performance on the datasets "colic.ORIG", "splice" and "zoo". The −CLL and RRSE values of KLD-VDM and GR-VDM are much lower than those of MI-VDM. In our observation, each of these three datasets contains a useless attribute that varies so much that almost every instance has its own value: the attribute "Hospital Number" in "colic.ORIG", the attribute "instance name" in "splice" and the attribute "animal" in "zoo". Although this kind of attribute gains more mutual information, it is actually no more informative. These results confirm that our proposed attribute weighting schemes can overcome the natural bias of the previous MI-VDM.
6. From these comparison results, we can also see that our proposed attribute weighting schemes are overall better than the two existing attribute weighting schemes, not only in terms of −CLL but also in terms of RRSE. Note that KLD-VDM is a non-symmetric distance metric while GR-VDM is a symmetric distance metric. Thus, when a non-symmetric distance is needed, KLD-VDM is an appropriate alternative distance metric; when a symmetric distance is needed, GR-VDM should be considered first.

Then, we take advantage of the KEEL software [1] to conduct the Friedman test with the Bergmann post hoc test for statistical comparisons of multiple algorithms over multiple datasets [6,12]. The average rankings of the algorithms obtained by applying the Friedman test are also summarized at the bottom of Tables 2 and 3.
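Before turning to the detailed results in Table 2, here is an illustrative sketch of the evaluation vehicle described at the beginning of this section: distance-weighted kNN class probability estimation on top of a VDM-style distance. The inverse-distance weighting, the toy distance and all names are our own simplification, not the WEKA implementation used in the experiments.

```python
from collections import defaultdict

def dwknn_probs(test, train_X, train_y, distance, k=10, classes=("yes", "no")):
    """Estimate class probabilities from the k nearest neighbors, each neighbor
    weighted by the inverse of its distance to the test instance."""
    neighbors = sorted((distance(x, test), c) for x, c in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for d, c in neighbors:
        votes[c] += 1.0 / (d + 1e-6)              # distance weighting
    total = sum(votes.values())
    return {c: votes[c] / total for c in classes}

# toy nominal data and a toy (unweighted) VDM-style distance for illustration
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]
def toy_vdm(x, z):
    return sum(abs(table[i][x[i]][c] - table[i][z[i]][c])
               for i in range(len(x)) for c in ("yes", "no"))

train_X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
train_y = ["yes", "no", "yes", "no"]
print(dwknn_probs(("rain", "no"), train_X, train_y, toy_vdm, k=3))
```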
Table 2  −CLL comparisons for VDM versus MI-VDM, KLD-VDM and GR-VDM

Dataset                VDM      MI-VDM     KLD-VDM    GR-VDM
anneal                 44.17    37.98 •    40.00 •    36.73 •
anneal.ORIG            51.07    49.98 •    51.38      48.20 •
audiology              44.11    43.60      43.61      43.43 •
autos                  28.20    27.16 •    24.45 •    23.97 •
balance-scale          30.36    30.39      30.92      30.38
breast-cancer          16.55    16.64      17.31      16.69
breast-w               12.26    11.05 •    9.89 •     10.30 •
cardiotocography-2     73.94    65.27 •    69.53 •    65.04 •
climate                16.23    14.27 •    16.76      14.28 •
colic                  16.54    15.52      15.85      15.36 •
colic.ORIG             18.09    24.70 ◦    19.69      18.61
connectionist-vowel    64.79    68.69 ◦    65.88      64.67
credit-a               25.33    25.61      25.73      25.21
credit-g               52.26    51.93      53.21      52.15
cylinder-bands         27.74    32.32 ◦    27.07      27.78
diabetes               37.93    38.38      37.97      38.39
ecoli                  30.65    32.37      30.35      29.18 •
energy-y1              155.74   155.38     154.67     155.08
energy-y2              180.05   180.10     178.89 •   179.55
glass                  25.78    25.13 •    25.47      24.44 •
heart-c                24.59    19.35 •    20.79 •    19.04 •
heart-h                20.56    18.91 •    19.56      18.67 •
heart-statlog          11.49    11.32      11.79      11.31
hepatitis              6.12     5.98       6.51       6.14
hypothyroid            136.43   136.70     140.90 ◦   136.22
ionosphere             18.14    10.94 •    11.49 •    9.86 •
iris                   4.11     4.19       3.83 •     3.91 •
kr-vs-kp               49.59    47.97 •    49.36      48.55
labor                  1.99     1.85       1.77       1.86
letter                 726.43   815.27 ◦   740.53 ◦   706.43 •
lymph                  11.03    8.95 •     9.33 •     8.68 •
mfeat-f                436.52   422.50 •   374.93 •   318.59 •
monks                  3.83     3.79       3.84       3.80
movement-libras        91.99    93.54 ◦    90.00 •    91.25 •
mushroom               16.82    15.26 •    14.87 •    15.13 •
newthyroid             6.09     5.98       5.22 •     5.64 •
optdigits              222.54   228.30 ◦   196.11 •   177.59 •
page-blocks            40.75    41.32      41.95 ◦    40.48
parkinsons             5.87     5.55       5.82       5.53
pendigits              255.69   314.49 ◦   239.15 •   200.21 •
primary-tumor          79.11    78.67      79.35      78.67
qar-biodegradation     45.47    41.42 •    40.31 •    40.80 •
robot-24               79.40    57.36 •    56.80 •    50.21 •
segment                148.67   158.85 ◦   142.73 •   136.54 •
sick                   43.02    42.05      42.85      42.36
sonar                  13.55    9.74 •     10.17 •    9.69 •
soybean                97.34    101.56 ◦   90.35 •    96.17 •
spect                  12.43    11.68 •    10.73 •    11.66 •
splice                 258.19   105.96 •   86.55 •    87.15 •
steel-plates-faults    33.86    24.97 •    38.09 ◦    16.79 •
vehicle                72.61    65.71 •    64.93 •    63.35 •
vote                   9.40     7.63 •     7.77 •     7.65 •
vowel                  126.83   134.45 ◦   120.84 •   118.74 •
waveform-5000          87.57    55.85 •    51.52 •    48.80 •
yeast                  212.08   214.00     214.27     210.09
zoo                    6.98     8.04 ◦     6.50 •     6.87
W/T/L                  −        23/21/12   28/24/4    34/22/0
Average                78.02    76.37      71.25      67.39
Average rank           3.1964   2.6875     2.5536     1.5625
With four algorithms and 56 datasets, F_F is distributed according to the F distribution with 3 and 165 degrees of freedom. The F_F values calculated from the average ranks are 21.430186 and 15.726207, respectively, which are both greater than the critical value of F(3, 165) for α = 0.05. We therefore reject the null hypothesis and proceed with the Bergmann post hoc test to find out exactly which differences among these algorithms are significant. Tables 4 and 5 report the post hoc comparisons on −CLL and RRSE, respectively. From these comparison results, we can see that:

– In terms of −CLL, the average rankings of VDM, MI-VDM, KLD-VDM and GR-VDM are 3.2143, 2.6696, 2.5714 and 1.5446, respectively. KLD-VDM is significantly better than VDM, and GR-VDM is significantly better than VDM, MI-VDM and KLD-VDM.
– In terms of RRSE, the average rankings of VDM, MI-VDM, KLD-VDM and GR-VDM are 3.1786, 2.6518, 2.4911 and 1.6786, respectively. KLD-VDM is significantly better than VDM, and GR-VDM is significantly better than VDM, MI-VDM and KLD-VDM.
– Both in terms of −CLL and in terms of RRSE, the original VDM is in general the worst and GR-VDM is the best. Besides, KLD-VDM is a little better than MI-VDM.
– Both in terms of −CLL and in terms of RRSE, GR-VDM significantly outperforms MI-VDM, which indicates that GR-VDM can compensate for MI-VDM's bias toward attributes with more values.
Table 3  RRSE comparisons for VDM versus MI-VDM, KLD-VDM and GR-VDM

Dataset                VDM      MI-VDM     KLD-VDM    GR-VDM
anneal                 67.71    60.87 •    63.29 •    58.88 •
anneal.ORIG            78.14    77.52      78.62      75.67 •
audiology              92.04    91.83      91.44      91.46 •
autos                  91.02    89.32 •    85.15 •    84.57 •
balance-scale          64.59    64.76      65.21      64.75
breast-cancer          96.55    96.77      99.11      96.85
breast-w               41.00    38.10 •    35.29 •    36.47 •
cardiotocography-2     63.96    60.04 •    63.52      60.04 •
climate                102.04   95.65 •    107.21 ◦   95.74 •
colic                  76.76    74.01      74.89      73.40 •
colic.ORIG             84.12    102.00 ◦   88.93 ◦    85.56
connectionist-vowel    78.01    80.05 ◦    78.24      78.07
credit-a               66.10    66.81      67.06      65.99
credit-g               91.09    90.47      92.06      90.71
cylinder-bands         82.17    89.46 ◦    81.28      82.30
diabetes               84.38    85.10      84.49      85.11
ecoli                  74.17    76.35 ◦    73.82      72.28 •
energy-y1              89.16    89.11      89.00 •    89.05
energy-y2              92.00    91.99      91.83 •    91.91
glass                  88.67    87.84      87.90      86.85 •
heart-c                88.53    78.24 •    81.25 •    77.57 •
heart-h                84.42    82.27 •    83.60      81.21 •
heart-statlog          72.38    72.02      73.89      71.86
hepatitis              85.66    84.55      88.81      86.35
hypothyroid            101.35   101.50     102.76 ◦   101.28
ionosphere             84.97    60.04 •    63.54 •    56.81 •
iris                   38.30    38.91      36.49 •    36.97
kr-vs-kp               35.31    35.21      35.78      35.60
labor                  64.67    60.82      55.89      61.60
letter                 85.29    88.58 ◦    85.61 ◦    84.18 •
lymph                  83.81    75.16 •    76.74 •    74.05 •
mfeat-f                98.54    97.58 •    93.65 •    87.84 •
monks                  17.20    16.95      17.29      16.99
movement-libras        98.57    98.92 ◦    98.07 •    98.39 •
mushroom               20.40    19.27 •    18.56 •    19.03 •
newthyroid             51.03    50.38      44.37 •    48.48 •
optdigits              95.58    96.34 ◦    91.41 •    87.93 •
page-blocks            87.06    88.35      88.61 ◦    86.79
parkinsons             66.28    64.77      66.92      64.94
pendigits              75.64    83.69 ◦    72.89 •    65.96 •
primary-tumor          96.21    96.17      96.11      96.11
qar-biodegradation     76.45    73.37 •    72.05 •    72.71 •
robot-24               73.74    60.05 •    60.40 •    55.75 •
segment                55.62    57.99 ◦    54.25 •    52.63 •
sick                   61.35    60.49      59.46      60.64
sonar                  95.96    77.63 •    79.51 •    77.72 •
soybean                80.23    81.63 ◦    77.35 •    79.43 •
spect                  94.15    91.50      86.48 •    91.61
splice                 86.64    46.72 •    41.42 •    41.78 •
steel-plates-faults    36.57    30.85 •    45.25 ◦    17.43 •
vehicle                77.35    73.96 •    73.89 •    72.87 •
vote                   46.47    40.22 •    40.85 •    40.28 •
vowel                  79.22    81.28 ◦    77.44 •    77.07 •
waveform-5000          87.54    67.12 •    65.35 •    63.10 •
yeast                  90.42    90.96 ◦    90.89      90.19
zoo                    61.23    66.80 ◦    58.51 •    60.49 •
W/T/L                  –        19/24/13   27/23/6    33/23/0
Average                75.68    73.18      72.39      70.70
Average rank           3.1607   2.6696     2.4732     1.6964
Table 4  Post hoc comparisons for VDM, MI-VDM, KLD-VDM and GR-VDM: −CLL

i    Algorithms                  z = (R0 − Ri)/SE    p
6    VDM versus GR-VDM           6.697114            0
5    MI-VDM versus GR-VDM        4.611128            0.000004
4    KLD-VDM versus GR-VDM       4.062184            0.000049
3    VDM versus KLD-VDM          2.63493             0.008415
2    VDM versus MI-VDM           2.085986            0.03698
1    MI-VDM versus KLD-VDM       0.548944            0.583044

Bergmann's procedure rejects these hypotheses: VDM versus MI-VDM; VDM versus KLD-VDM; VDM versus GR-VDM; MI-VDM versus GR-VDM; KLD-VDM versus GR-VDM

Table 5  Post hoc comparisons for VDM, MI-VDM, KLD-VDM and GR-VDM: RRSE

i    Algorithms                  z = (R0 − Ri)/SE    p
6    VDM versus GR-VDM           6.001785            0
5    MI-VDM versus GR-VDM        3.988992            0.000066
4    KLD-VDM versus GR-VDM       3.183874            0.001453
3    VDM versus KLD-VDM          2.817911            0.004834
2    VDM versus MI-VDM           2.012794            0.044136
1    MI-VDM versus KLD-VDM       0.805118            0.420752

Bergmann's procedure rejects these hypotheses: VDM versus MI-VDM; VDM versus KLD-VDM; VDM versus GR-VDM; MI-VDM versus GR-VDM; KLD-VDM versus GR-VDM
Table 6  −CLL comparisons for GR-VDM versus KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM

Dataset                GR-VDM   KLD-VDM    SFM        MRM        RELIEF     IWVDM      LVDM
anneal                 36.73    40.00 ◦    39.15 ◦    45.88 ◦    48.55 ◦    45.54 ◦    47.56 ◦
anneal.ORIG            48.20    51.38 ◦    49.47      58.81 ◦    63.91 ◦    51.43 ◦    54.17 ◦
audiology              43.43    43.61      40.55 •    41.77      58.84 ◦    55.88 ◦    54.03 ◦
autos                  23.97    24.45      22.84      25.89      32.15 ◦    31.28 ◦    31.97 ◦
balance-scale          30.38    30.92      16.26 •    23.06 •    31.35 ◦    30.57      35.49 ◦
breast-cancer          16.69    17.31      16.42      20.43 ◦    17.07      16.83      16.67
breast-w               10.30    9.89       10.22      8.61 •     12.92 ◦    14.65 ◦    12.17 ◦
cardiotocography-2     65.04    69.53 ◦    78.12 ◦    100.76 ◦   84.11 ◦    67.79 ◦    72.50 ◦
climate                14.28    16.76 ◦    16.56 ◦    20.03 ◦    15.58 ◦    19.89 ◦    20.72 ◦
colic                  15.36    15.85      17.65 ◦    21.62 ◦    15.58      16.04      15.38
colic.ORIG             16.97    18.49 ◦    19.69 ◦    23.77 ◦    18.11 ◦    17.23      18.55 ◦
connectionist-vowel    64.67    65.88      58.72 •    69.69 ◦    74.80 ◦    73.59 ◦    70.51 ◦
credit-a               25.21    25.73      27.06 ◦    30.42 ◦    25.44      25.30      25.93
credit-g               52.15    53.21      52.71      59.84 ◦    52.39      52.58      53.99
cylinder-bands         27.78    27.07      27.67      35.83 ◦    29.93      27.03      29.77
diabetes               38.39    37.97      39.97      48.26 ◦    38.41      38.71      40.41 ◦
ecoli                  29.18    30.35 ◦    29.31      30.73      35.93 ◦    36.00 ◦    35.19 ◦
energy-y1              155.08   154.67     163.53 ◦   202.43 ◦   134.88 •   162.62 ◦   155.63
energy-y2              179.55   178.89     182.31 ◦   205.68 ◦   158.14 •   187.59 ◦   181.39 ◦
glass                  24.44    25.47 ◦    25.90 ◦    28.37 ◦    26.93 ◦    27.67 ◦    28.04 ◦
heart-c                19.04    20.79 ◦    19.27      21.33 ◦    26.12 ◦    22.51 ◦    27.39 ◦
heart-h                18.67    19.56      18.62      20.58      21.59 ◦    19.95 ◦    23.35 ◦
heart-statlog          11.31    11.79      11.37      12.68      11.64      11.21      11.78
hepatitis              6.14     6.51       5.42       7.20       6.41       6.20       6.05
hypothyroid            136.22   140.90 ◦   140.07 ◦   144.26 ◦   139.28 ◦   135.33     150.30 ◦
ionosphere             9.86     11.49      10.43      11.33      19.62 ◦    18.32 ◦    16.13 ◦
iris                   3.91     3.83       4.30       4.48       4.08 ◦     4.46 ◦     4.35 ◦
kr-vs-kp               48.55    49.36      104.59 ◦   118.42 ◦   42.05 •    39.88 •    39.12 •
labor                  1.86     1.77       0.92 •     0.95 •     2.26       2.16       2.12
letter                 706.43   740.53 ◦   744.66 ◦   803.25 ◦   945.80 ◦   955.58 ◦   979.55 ◦
lymph                  8.68     9.33       8.42       8.91       12.44 ◦    10.66 ◦    11.11 ◦
mfeat-f                318.59   374.93 ◦   212.94 •   219.04 •   455.83 ◦   455.32 ◦   454.95 ◦
monks                  3.80     3.84       4.13       6.66 ◦     3.55       3.80       3.56 •
movement-libras        91.25    90.00 •    57.90 •    59.47 •    93.65 ◦    93.67 ◦    93.55 ◦
mushroom               15.13    14.87      24.68 ◦    38.77 ◦    17.26 ◦    17.49 ◦    15.12
newthyroid             5.64     5.22       5.24       7.96 ◦     6.31 ◦     6.05       6.00
optdigits              177.59   196.11 ◦   88.06 •    91.44 •    243.68 ◦   241.64 ◦   239.47 ◦
page-blocks            40.48    41.95 ◦    40.16      46.53 ◦    41.93 ◦    40.66      41.35
parkinsons             5.53     5.82       6.26       9.39 ◦     6.21 ◦     4.99       5.24
pendigits              200.21   239.15 ◦   177.99 •   199.57     358.57 ◦   359.66 ◦   332.88 ◦
primary-tumor          78.67    79.35      77.58      78.46      84.91 ◦    86.53 ◦    84.95 ◦
qar-biodegradation     40.80    40.31      46.05 ◦    59.61 ◦    50.25 ◦    44.66 ◦    46.50 ◦
robot-24               50.21    56.80 ◦    65.98 ◦    81.41 ◦    95.56 ◦    71.31 ◦    90.03 ◦
segment                136.54   142.73 ◦   143.51 ◦   158.73 ◦   164.31 ◦   170.93 ◦   159.31 ◦
sick                   42.36    42.85      51.81 ◦    55.10 ◦    42.55      42.37      43.15
sonar                  9.69     10.17      10.91      13.53 ◦    13.93 ◦    13.76 ◦    13.98 ◦
soybean                96.17    90.35 •    76.39 •    78.75 •    121.53 ◦   126.09 ◦   111.41 ◦
spect                  11.66    10.73      10.84      11.90      12.74 ◦    11.47      11.70
splice                 86.87    86.46      84.75      87.41      281.57 ◦   260.83 ◦   228.70 ◦
steel-plates-faults    16.79    38.09 ◦    36.98 ◦    39.64 ◦    41.73 ◦    24.50 ◦    22.31 ◦
vehicle                63.35    64.93      79.23 ◦    97.59 ◦    80.05 ◦    73.70 ◦    85.07 ◦
vote                   7.65     7.77       11.33 ◦    14.22 ◦    8.94 ◦     7.79       7.62
vowel                  118.74   120.84 ◦   124.41 ◦   143.56 ◦   172.96 ◦   166.01 ◦   170.25 ◦
waveform-5000          48.80    51.52 ◦    47.18      64.78 ◦    97.23 ◦    91.33 ◦    98.70 ◦
yeast                  210.09   214.27 ◦   212.86     221.64 ◦   221.06 ◦   216.39 ◦   225.88 ◦
zoo                    6.73     6.45       5.84 •     6.01 •     7.53 ◦     8.26 ◦     8.11 ◦
W/T/L                  –        20/34/2    21/25/10   35/13/8    42/11/3    36/19/1    38/16/2
Average                67.35    71.23      66.16      74.04      88.04      86.85      86.98
To further validate the effectiveness of our proposed distance metrics, we conducted another group of experiments to compare them with several other state-of-the-art metric learning methods: SFM (Short and Fukunaga metric) [40], MRM (minimum risk metric) [3], RELIEF [22,37], IWVDM (instance weighted VDM) [25] and LVDM (local VDM) [26]. Note that RELIEF denotes the simplified VDM with ReliefF-based attribute weighting, which applies the ReliefF attribute selection algorithm [22,37] and then uses the resulting attribute relevance scores as weights. To save memory when running these experiments, we manually removed three useless attributes ("Hospital Number" in "colic.ORIG", "instance name" in "splice" and "animal" in "zoo"). Tables 6, 7, 8 and 9 show the detailed comparison results. From these comparison results, we can see that our proposed distance metrics, especially GR-VDM, are overall better than their competitors not only in terms of −CLL but also in terms of RRSE.
5 Conclusions and further research

The value difference metric (VDM) has continued to be one of the most widely used distance metrics for nominal attributes. Of the numerous approaches to improving the simplified VDM (without the attribute weighting scheme) by mitigating its attribute independence assumption, attribute weighting has received less attention but has demonstrated remarkable class probability estimation performance. In this paper we propose two simple but effective attribute weighting schemes for the simplified VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM). The extensive experimental results validate the effectiveness of our proposed KLD-VDM and GR-VDM. Besides, KLD-VDM and GR-VDM maintain the computational simplicity and robustness that characterize VDM.
Table 7  RRSE comparisons for GR-VDM versus KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM

Dataset                GR-VDM   KLD-VDM    SFM        MRM        RELIEF     IWVDM      LVDM
anneal                 58.88    63.29 ◦    62.17 ◦    68.47 ◦    72.69 ◦    69.25 ◦    71.55 ◦
anneal.ORIG            75.67    78.62 ◦    76.37      80.94 ◦    89.36 ◦    78.14 ◦    80.79 ◦
audiology              91.46    91.44      88.01 •    87.61 •    100.75 ◦   99.60 ◦    98.10 ◦
autos                  84.57    85.15      80.72 •    84.18      96.60 ◦    95.37 ◦    96.33 ◦
balance-scale          64.75    65.21      41.64 •    53.75 •    65.81 ◦    64.87      71.40 ◦
breast-cancer          96.85    99.11      95.94      103.56     98.14      97.61      96.85
breast-w               36.47    35.29      35.72      32.33 •    42.76 ◦    47.02 ◦    41.17 ◦
cardiotocography-2     60.04    63.52 ◦    69.51 ◦    79.47 ◦    69.40 ◦    60.45      63.64 ◦
climate                95.74    107.21 ◦   105.61 ◦   117.17 ◦   99.63 ◦    115.07 ◦   119.02 ◦
colic                  73.40    74.89      80.47 ◦    87.53 ◦    74.14      75.57      73.57
colic.ORIG             81.33    85.86 ◦    88.83 ◦    96.07 ◦    84.28 ◦    81.49      85.35
connectionist-vowel    78.07    78.24      72.95 •    76.17 •    82.70 ◦    82.25 ◦    80.75 ◦
credit-a               65.99    67.06      69.25 ◦    72.63 ◦    66.34      66.13      67.12
credit-g               90.71    92.06      91.28      95.62 ◦    91.08      91.30      92.85
cylinder-bands         82.30    81.28      83.02      92.15 ◦    86.55      80.67      86.16
diabetes               85.11    84.49      87.12      93.70 ◦    85.22      85.21      87.74 ◦
ecoli                  72.28    73.82 ◦    71.37      71.37      80.65 ◦    81.17 ◦    80.01 ◦
energy-y1              89.05    89.00      90.06 ◦    93.81 ◦    84.83 •    90.46 ◦    88.85 •
energy-y2              91.91    91.83 •    92.12 ◦    93.80 ◦    88.13 •    93.28 ◦    91.94
glass                  86.85    87.90      88.32      90.30      90.42 ◦    91.59 ◦    91.75 ◦
heart-c                77.57    81.25 ◦    78.01      78.06      91.49 ◦    84.51 ◦    94.03 ◦
heart-h                81.21    83.60      81.03      80.75      86.88 ◦    83.12      90.30 ◦
heart-statlog          71.86    73.89      72.41      74.55      73.02      71.61      74.05
hepatitis              86.35    88.81      78.48 •    87.28      88.18      86.03      85.00
hypothyroid            101.28   102.76 ◦   102.48 ◦   101.30     102.70 ◦   100.56     107.43 ◦
ionosphere             56.81    63.54      58.77      58.75      89.71 ◦    85.79 ◦    79.07 ◦
iris                   36.97    36.49      40.59      39.18      38.24 ◦    40.50 ◦    40.09 ◦
kr-vs-kp               35.60    35.78      62.29 ◦    65.37 ◦    32.33 •    28.24 •    28.20 •
labor                  61.60    55.89      30.38 •    29.37 •    70.82      67.99      67.71
letter                 84.18    85.61 ◦    84.33      84.89 ◦    92.57 ◦    93.21 ◦    93.65 ◦
lymph                  74.05    76.74      72.11      70.46      89.86 ◦    82.61 ◦    84.24 ◦
mfeat-f                87.84    93.65 ◦    70.33 •    69.20 •    99.72 ◦    99.70 ◦    99.67 ◦
monks                  16.99    17.29      19.46      32.94 ◦    27.25 ◦    16.96      16.54
movement-libras        98.39    98.07 •    80.64 •    80.80 •    98.95 ◦    98.95 ◦    98.92 ◦
mushroom               19.03    18.56      35.49 ◦    48.61 ◦    20.85 ◦    21.18 ◦    18.75
newthyroid             48.48    44.37      44.40      56.84 ◦    52.10      50.78      50.20
optdigits              87.93    91.41 ◦    58.68 •    59.14 •    98.15 ◦    97.93 ◦    97.68 ◦
page-blocks            86.79    88.61 ◦    86.58      91.02      88.49 ◦    86.48      86.99
parkinsons             64.94    66.92      70.13      83.50 ◦    68.71      59.15 •    61.44
pendigits              65.96    72.89 ◦    60.74 •    62.70 •    88.63 ◦    88.84 ◦    85.43 ◦
primary-tumor          96.11    96.11      94.90 •    94.60 •    98.35 ◦    98.94 ◦    98.36 ◦
qar-biodegradation     72.71    72.05      78.86 ◦    87.74 ◦    81.61 ◦    75.85 ◦    77.93 ◦
robot-24               55.75    60.40 ◦    67.46 ◦    74.05 ◦    82.71 ◦    68.85 ◦    79.68 ◦
segment                52.63    54.25 ◦    54.46 ◦    55.94 ◦    59.49 ◦    60.87 ◦    57.84 ◦
sick                   60.64    59.46      70.23 ◦    72.20 ◦    61.01      59.59      59.92
sonar                  77.72    79.51      81.02      88.23 ◦    97.85 ◦    96.99 ◦    98.08 ◦
soybean                79.43    77.35 •    71.15 •    70.88 •    87.42 ◦    88.83 ◦    84.15 ◦
spect                  91.61    86.48      87.09      93.21      95.88 ◦    90.11      91.44
splice                 41.78    41.44      40.71      41.15      91.47 ◦    87.21 ◦    80.09 ◦
steel-plates-faults    17.43    45.25 ◦    44.37 ◦    45.97 ◦    43.81 ◦    26.17 ◦    26.28 ◦
vehicle                72.87    73.89      81.90 ◦    90.01 ◦    81.66 ◦    78.10 ◦    84.40 ◦
vote                   40.28    40.85      54.99 ◦    59.93 ◦    44.47 ◦    40.70      39.95
vowel                  77.07    77.44      76.70      79.16 ◦    90.15 ◦    88.81 ◦    89.60 ◦
waveform-5000          63.10    65.35 ◦    61.79      71.79 ◦    93.16 ◦    89.79 ◦    93.96 ◦
yeast                  90.19    90.89 ◦    90.26      90.11      92.22 ◦    91.21 ◦    92.90 ◦
zoo                    59.73    58.21      53.96 •    55.40 •    63.95 ◦    68.03 ◦    67.23 ◦
W/T/L                  –        18/35/3    18/25/13   28/16/12   41/12/3    34/20/2    36/18/2
Average                70.61    72.33      70.85      74.92      78.81      76.80      77.61
Table 8  Post hoc comparisons for GR-VDM, KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM: −CLL

i     Algorithms                  z = (R0 − Ri)/SE    p
21    GR-VDM versus RELIEF        7.239117            0
20    GR-VDM versus MRM           7.151635            0
19    GR-VDM versus LVDM          6.670485            0
18    GR-VDM versus IWVDM         5.642575            0
17    SFM versus RELIEF           5.42387             0
16    SFM versus MRM              5.336388            0
15    SFM versus LVDM             4.855239            0.000001
14    KLD-VDM versus RELIEF       4.680275            0.000003
13    KLD-VDM versus MRM          4.592793            0.000004
12    KLD-VDM versus LVDM         4.111643            0.000039
11    SFM versus IWVDM            3.827328            0.00013
10    KLD-VDM versus IWVDM        3.083733            0.002044
9     GR-VDM versus KLD-VDM       2.558842            0.010502
8     GR-VDM versus SFM           1.815247            0.069486
7     RELIEF versus IWVDM         1.596542            0.110368
6     MRM versus IWVDM            1.509061            0.131283
5     IWVDM versus LVDM           1.027911            0.303992
4     KLD-VDM versus SFM          0.743595            0.457121
3     RELIEF versus LVDM          0.568632            0.569606
2     MRM versus LVDM             0.48115             0.63041
1     MRM versus RELIEF           0.087482            0.930289

Bergmann's procedure rejects these hypotheses: GR-VDM versus MRM; GR-VDM versus RELIEF; GR-VDM versus IWVDM; GR-VDM versus LVDM; KLD-VDM versus MRM; KLD-VDM versus RELIEF; KLD-VDM versus IWVDM; KLD-VDM versus LVDM; SFM versus MRM; SFM versus RELIEF; SFM versus IWVDM; SFM versus LVDM

Table 9  Post hoc comparisons for GR-VDM, KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM: RRSE

i     Algorithms                  z = (R0 − Ri)/SE    p
21    GR-VDM versus RELIEF        6.932931            0
20    GR-VDM versus LVDM          5.795668            0
19    SFM versus RELIEF           5.292647            0
18    GR-VDM versus IWVDM         4.789627            0.000002
17    GR-VDM versus MRM           4.789627            0.000002
16    KLD-VDM versus RELIEF       4.549052            0.000005
15    SFM versus LVDM             4.155384            0.000032
14    KLD-VDM versus LVDM         3.411789            0.000645
13    SFM versus IWVDM            3.149344            0.001636
12    SFM versus MRM              3.149344            0.001636
11    KLD-VDM versus IWVDM        2.405749            0.016139
10    KLD-VDM versus MRM          2.405749            0.016139
9     GR-VDM versus KLD-VDM       2.383878            0.017131
8     MRM versus RELIEF           2.143304            0.032089
7     RELIEF versus IWVDM         2.143304            0.032089
6     GR-VDM versus SFM           1.640283            0.100946
5     RELIEF versus LVDM          1.137263            0.255428
4     MRM versus LVDM             1.00604             0.314396
3     IWVDM versus LVDM           1.00604             0.314396
2     KLD-VDM versus SFM          0.743595            0.457121
1     MRM versus IWVDM            0                   1

Bergmann's procedure rejects these hypotheses: GR-VDM versus MRM; GR-VDM versus RELIEF; GR-VDM versus IWVDM; GR-VDM versus LVDM; KLD-VDM versus RELIEF; KLD-VDM versus LVDM; SFM versus MRM; SFM versus RELIEF; SFM versus IWVDM; SFM versus LVDM
As with the original definition of VDM, our current versions of KLD-VDM and GR-VDM cannot handle numerical attributes yet, and thus all numerical attributes need to be discretized using a preprocessing step. However, in many real-world applications numerical attributes are widespread, and therefore taking advantage of the ideas behind previous heterogeneous distance functions [43] to extend our metrics to applications with numerical attributes is a main direction for our future research. Note that a potential problem confronting the proposed schemes is that, when several attributes are redundant, they still tend to get more than their share of the weight, since attributes are weighted independently; addressing this is also a main direction for our future research. Besides, applying the proposed distance metrics to
improve distance-related algorithms, such as the k-nearest-neighbor (KNN) algorithm and its variants [7,8,17,42], is another direction for our future research.

Acknowledgements The work was partially supported by the National Natural Science Foundation of China (U1711267), the Program for New Century Excellent Talents in University (NCET-12-0953), the Chenguang Program of Science and Technology of Wuhan (2015070404010202), and the Open Research Project of Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP201601).
References

1. Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Log Soft Comput 17(2–3):255–287
2. Bian W, Tao D (2012) Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst 23(8):1194–1205
3. Blanzieri E, Ricci F (1999) Probability based metrics for nearest neighbor classification and case-based reasoning. In: Proceedings of the 3rd international conference on case-based reasoning. Springer, pp 14–28
4. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features. Mach Learn 10:57–78
5. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning. ACM Press, Corvalis, pp 209–216
6. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
7. Domeniconi C, Gunopulos D (2001) Adaptive nearest neighbor classification using support vector machines. In: Advances in neural information processing systems 14. MIT Press, Cambridge, pp 665–672
8. Domeniconi C, Peng J, Gunopulos D (2000) Adaptive metric nearest-neighbor classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, Hilton Head, p 1517
9. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29:103–130
10. Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 17th international joint conference on artificial intelligence, pp 973–978
11. Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine
12. Garcia S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694
13. Grossman D, Domingos P (2004) Learning Bayesian network classifiers by maximizing conditional likelihood. In: Proceedings of the 21st international conference on machine learning. ACM, pp 361–368
14. Guo Y, Greiner R (2005) Discriminative model selection for belief net structures. In: Proceedings of the 12th national conference on artificial intelligence. AAAI, pp 770–776
15. Hall M (2007) A decision tree-based attribute weighting filter for naive Bayes. Knowl Based Syst 20:120–126
16. Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th international conference on machine learning. Morgan Kaufmann, Stanford, pp 359–366
17. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
18. Hu JH, Zhan DC, Wu X, Jiang Y, Zhou ZH (2015) Pairwised specific distance learning from physical linkages. ACM Trans Knowl Discov Data 9(3):20
19. Jiang L, Li C (2013) An augmented value difference measure. Pattern Recognit Lett 34(10):1169–1174
20. Jiang L, Li C, Zhang H, Cai Z (2014) A novel distance function: frequency difference metric. Int J Pattern Recognit Artif Intell 28(2):1451002
21. Kasif S, Salzberg S, Waltz D, Rachlin J, Aha D (1998) A probabilistic framework for memory-based reasoning. Artif Intell 104:287–311
22. Kira K, Rendell L (1992) A practical approach to feature selection. In: Proceedings of the 9th international conference on machine learning. Morgan Kaufmann, Aberdeen, pp 249–256
23. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
24. Lee CH, Gutierrez F, Dou D (2011) Calculating feature weights in naive Bayes with Kullback–Leibler measure. In: Proceedings of the 11th IEEE international conference on data mining. IEEE, Vancouver, pp 1146–1151
25. Li C, Jiang L, Li H (2014) Local value difference metric. Pattern Recognit Lett 49:62–68
26. Li C, Jiang L, Li H (2014) Naive Bayes for value difference metric. Front Comput Sci 8(2):255–264
27. Li C, Jiang L, Li H, Wu J, Zhang P (2017) Toward value difference metric with attribute weighting. Knowl Inf Syst 50(3):795–825
28. Li C, Li H (2011) One dependence value difference metric. Knowl Based Syst 24(5):589–594
29. Li C, Li H (2012) A modified Short and Fukunaga metric based on the attribute independence assumption. Pattern Recognit Lett 33(9):1213–1218
30. Li C, Li H (2013) Selective value difference metric. J Comput 8(9):2232–2238
31. Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, New York City
32. Myles JP, Hand DJ (1990) The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognit 23(11):1291–1297
33. Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52(3):239–281
34. Qiu C, Jiang L, Li C (2015) Not always simple classification: learning superparent for class probability estimation. Expert Syst Appl 42(13):5433–5440
35. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
36. Quinlan JR (1993) C4.5: programs for machine learning, 1st edn. Morgan Kaufmann, San Mateo
37. Robnik-Sikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn 53(1–2):23–69
38. Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Mach Learn 54:153–178
39. Sangineto E (2013) Pose and expression independent facial landmark localization using dense-SURF and the Hausdorff distance. IEEE Trans Pattern Anal Mach Intell 35(3):624–638
40. Short RD, Fukunaga K (1981) The optimal distance measure for nearest neighbour classification. IEEE Trans Inf Theory 27:622–627
41. Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228
42. Weinberger KQ, Blitzer JC, Saul LK (2006) Distance metric learning for large margin nearest neighbor classification. Adv Neural Inf Process Syst 18:1473–1480
43. Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Intell Res 6:1–34
44. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco
45. Yang L (2006) Distance metric learning: a comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University
46. Zaidi NA, Cerquides J, Carman MJ, Webb GI (2013) Alleviating naive Bayes attribute independence assumption by attribute weighting. J Mach Learn Res 14:1947–1988
Liangxiao Jiang received his Ph.D. degree from China University of Geosciences, Wuhan, China, in 2009. He is currently a professor with the Department of Computer Science, China University of Geosciences (Wuhan). His current research interests include data mining and machine learning. Since 2005, he has published over 60 refereed journal and conference papers, such as in the IEEE Transactions on Knowledge and Data Engineering, Information Sciences, Knowledge and Information Systems, Engineering Applications of Artificial Intelligence, Knowledge-Based Systems, Expert Systems with Applications, Pattern Recognition Letters, Journal of Intelligent Information Systems, AAAI, ICML, ICDM and DASFAA, in the above areas.

Chaoqun Li received her Ph.D. degree from China University of Geosciences, Wuhan, China, in 2012. She is currently an associate professor with the Department of Mathematics, China University of Geosciences (Wuhan). Her current research interests include data mining and machine learning. Since 2006, she has published over 30 refereed journal and conference papers, such as in Information Sciences, Knowledge and Information Systems, Engineering Applications of Artificial Intelligence, Knowledge-Based Systems, Expert Systems with Applications, Pattern Recognition Letters, International Journal of Pattern Recognition and Artificial Intelligence, ICANN, ICTAI and PRICAI, in the above areas.