Knowl Inf Syst https://doi.org/10.1007/s10115-018-1229-3 REGULAR PAPER
Two improved attribute weighting schemes for value difference metric

Liangxiao Jiang (1) · Chaoqun Li (2)

(1) Department of Computer Science, China University of Geosciences, Wuhan 430074, China
(2) Department of Mathematics, China University of Geosciences, Wuhan 430074, China
Corresponding author: Liangxiao Jiang, [email protected]

Received: 11 November 2016 / Revised: 11 May 2018 / Accepted: 26 May 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Abstract Due to its simplicity, efficiency and efficacy, the value difference metric (VDM) has continued to perform well against more sophisticated newcomers and thus has remained of great interest to the distance metric learning community. Of the numerous approaches to improving VDM by weakening its attribute independence assumption, attribute weighting has received less attention (only two attribute weighting schemes exist) but has demonstrated remarkable class probability estimation performance. Of the two existing attribute weighting schemes, one is non-symmetric and the other is symmetric. In this paper, we propose two simple improvements for setting attribute weights for use with VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM). We performed extensive evaluations on a large number of datasets and found that KLD-VDM and GR-VDM significantly outperform the two existing attribute weighting schemes in terms of negative conditional log likelihood and root relative squared error, while at the same time maintaining the computational simplicity and robustness that characterize VDM.

Keywords Distance metric learning · Value difference metric · Attribute weighting · Gain ratio · Kullback–Leibler divergence
1 Introduction

Distance metric learning aims to learn a distance metric for the input space of data from a given collection of pairs of similar/dissimilar points that preserves the distance relations among the training data [45]. Many machine learning algorithms, such as the k-nearest-neighbor algorithm (KNN) and its variants [7,8,17,42], heavily rely on learning good
distance metrics from training data. For example, the large margin nearest-neighbor algorithm (LMNN) [42] attempts to learn a Mahalanobis distance metric for KNN classification from labeled instances. For another example, the adaptive metric nearest-neighbor algorithm (ADAMENN) [8] tries to learn a flexible Chi-squared distance metric for producing neighborhoods that are highly adaptive to query locations. In short, learning distance metrics from data has been actively studied in the machine learning community and has been successfully applied to many real-world application domains, including information retrieval, image processing, face recognition, cognitive psychology and bioinformatics [2,18,20,39].

Although many distance metrics have been proposed to decide which instance is closest to a given test instance, most of them work well for numerical attributes rather than nominal attributes [5,43]. To appropriately handle nominal attributes, the value difference metric (VDM) [41] was proposed. It is noteworthy that the original definition of VDM actually has an attribute weighting scheme. However, this attribute weighting scheme makes VDM non-symmetric [4]. In order to make VDM symmetric, a simplified version of VDM (without the attribute weighting scheme) is widely used, and numerous enhancements start from it [4,19,27,28,30,43]. In the simplified definition of VDM, the distance between two instances is the straightforward superposition of the total differences between each pair of attribute values, and thus it requires a strong assumption, namely that all attributes are totally independent of each other. This assumption was explained by Kasif et al. [21] from the viewpoint of the probability memory-based reasoning (MBR) transform, which transforms a nominal attribute space into a probability distribution.

Thanks to the attribute independence assumption, VDM is very easy to learn. However, recent research in distance metric learning has shown that, even with such a strong assumption, VDM is surprisingly competitive with state-of-the-art metrics such as the Short and Fukunaga metric (SFM) [29,32,40] and the minimum risk metric (MRM) [3]. This fact raises the question of whether VDM with a less restrictive assumption can perform even better. To answer this question, learning improved VDM by weakening its attribute independence assumption has attracted much attention from researchers. Related work can be broadly divided into five main categories: (1) structure extension [19,28]; (2) instance selection, also called local learning [25]; (3) instance weighting [4,26]; (4) attribute selection [30]; (5) attribute weighting [27].

Of these approaches to improving VDM by weakening its attribute independence assumption, attribute weighting has received less attention from researchers. To the best of our knowledge, there exist only two attribute weighting schemes in the literature. One is the non-symmetric attribute weighting scheme introduced by the original VDM [41], and the other is the symmetric mutual information-based attribute weighting scheme [27]. However, both have demonstrated remarkable class probability estimation performance. To further improve their performance, in this paper we propose two new attribute weighting schemes for setting attribute weights for use with VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM).
Classification is a fundamental issue in data mining and machine learning. In classification, the goal of learning (constructing) a classifier is to correctly classify an input test instance. So the performance of a classifier is of course measured by its classification accuracy (the percentage of test instances correctly classified). However, in many real-world applications, classification alone is not always adequate [34]. For example, in target marketing, we may hope to evaluate various offer propositions by the class probability that a customer will respond to an offer [38]. In cost-sensitive decision-making, the class probability estimation is used to minimize the conditional risk [10]. Thus, accurate class probability estimation is
often required instead of classification in many real-world applications. For the task of class probability estimation, the negative conditional log likelihood (−CLL) [13,14], rather than the classification accuracy, is a more reasonable performance measure for the learned (constructed) classifiers. Besides, the root relative squared error (RRSE) or the root-mean-square error (RMSE) of the class probability estimates is widely used as a performance measure to evaluate probability-based classifiers [15,27,46]. In this paper, we also use −CLL and RRSE to evaluate the distance-related classifiers and therefore the defined distance metrics. The experimental results on 56 widely used datasets from the University of California at Irvine (UCI) machine learning repository [11] validate the effectiveness of the proposed KLD-VDM and GR-VDM.

The rest of the paper is organized as follows. First, we revisit the attribute independence assumption in VDM and the two existing attribute weighting schemes for VDM. Then, we propose the Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the gain ratio weighted value difference metric (GR-VDM). After a detailed description of our experiments and results, we conclude the paper and outline the main directions for future research.
2 Related work

When all attributes are nominal, VDM [41] is a well-known distance metric. The original definition of VDM has an attribute weighting form, which makes VDM non-symmetric [4]. In order to make VDM symmetric, a simplified version of VDM (without the attribute weighting scheme) [43] is widely used, and numerous enhancements start from it.

Given a training data set, assume that A_1, A_2, ..., A_m are m attribute variables and C is the class variable. A training instance x can be represented by an attribute value vector \langle a_{x1}, a_{x2}, \ldots, a_{xm} \rangle, where a_{xi} is the ith attribute value of x. The VDM without the attribute weighting scheme [43] defines the distance from a training instance x to a given test instance y as:

d(x, y) = \sum_{i=1}^{m} \sum_{c} \left| P(c \mid a_{xi}) - P(c \mid a_{yi}) \right|,    (1)

where m is the number of attributes, c ranges over the possible class labels of C, and P(c \mid a_{xi}) and P(c \mid a_{yi}) are the conditional probabilities that the output class is c given that the attribute A_i has the values a_{xi} and a_{yi}, respectively, which can be easily estimated using the Laplace estimate as follows:

P(c \mid a_{xi}) = \frac{1 + \sum_{j=1}^{n} \delta(a_{ji}, a_{xi})\,\delta(c_j, c)}{t + \sum_{j=1}^{n} \delta(a_{ji}, a_{xi})},    (2)

P(c \mid a_{yi}) = \frac{1 + \sum_{j=1}^{n} \delta(a_{ji}, a_{yi})\,\delta(c_j, c)}{t + \sum_{j=1}^{n} \delta(a_{ji}, a_{yi})},    (3)

where t is the number of classes, n is the number of training instances, a_{ji} is the ith attribute value of the jth training instance, and the indicator function \delta(\alpha, \beta) is one if \alpha = \beta and zero otherwise.

After VDM was proposed, Wilson and Martinez [43] proposed three new heterogeneous distance metrics, the heterogeneous value difference metric (HVDM), the interpolated value difference metric (IVDM) and the windowed value difference metric (WVDM), to extend VDM to handle both nominal and numeric attributes. Their experimental results show that their new distance metrics achieve higher classification accuracy on average than the Euclidean distance metric, the heterogeneous Euclidean-overlap metric (HEOM), and the discretized value difference metric (DVDM).

The Short and Fukunaga metric [40], simply SFM, is another distance metric for nominal attributes only, which was originally presented by Short and Fukunaga [40] and then extended by Myles and Hand [32]. SFM addresses the issue of selecting the best distance measure by minimizing the difference between the finite sample nearest-neighbor classification error and the asymptotic nearest-neighbor error based on probabilities. SFM with multiple classes is defined as:

d(x, y) = \sum_{c} \left| P(c \mid x) - P(c \mid y) \right|.    (4)
The minimum risk metric [3], simply MRM, is a simple distance function that directly minimizes the risk of misclassification. Assume that an instance x has the instance y as its nearest neighbor. Given a class label c, the finite risk of misclassifying x is P(c \mid x)(1 - P(c \mid y)). In the multiple-class case, the total finite risk is the sum of the risks over all the different classes. Therefore, MRM is defined as:

d(x, y) = \sum_{c} P(c \mid x)\,(1 - P(c \mid y)).    (5)
In Eqs. 4 and 5, a major challenge is how to estimate the class membership probabilities P(c|x) and P(c|y). A practical way is to use naive Bayes (NB) [9].
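To make the definitions above concrete, here is a minimal Python sketch that estimates the Laplace-smoothed conditional probabilities of Eqs. 2 and 3 and computes the simplified VDM of Eq. 1. The toy data, the function names and the overall structure are our own illustration, not the implementation used in this paper.

```python
from collections import defaultdict

def conditional_probs(X, y):
    """Build Laplace-smoothed estimates of P(c | A_i = a), as in Eqs. 2-3."""
    classes = sorted(set(y))
    t = len(classes)                                  # number of classes
    m = len(X[0])                                     # number of attributes
    # counts[i][a][c] = number of training instances with A_i = a and class c
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(m)]
    for row, c in zip(X, y):
        for i, a in enumerate(row):
            counts[i][a][c] += 1

    def p(i, a, c):
        n_a = sum(counts[i][a].values())              # instances with A_i = a
        return (1 + counts[i][a][c]) / (t + n_a)      # Laplace estimate
    return classes, p

def vdm_distance(x, z, classes, p):
    """Simplified VDM of Eq. 1: superposition of per-attribute differences."""
    return sum(abs(p(i, x[i], c) - p(i, z[i], c))
               for i in range(len(x)) for c in classes)

if __name__ == "__main__":
    # toy nominal data: (outlook, windy) -> play
    X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
    y = ["yes", "no", "yes", "no"]
    classes, p = conditional_probs(X, y)
    print(vdm_distance(("sunny", "no"), ("rain", "yes"), classes, p))
```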
2.1 The attribute independence assumption in VDM

According to the memory-based reasoning (MBR) transform [21], when the number of classes is t, an m-dimensional input space can be transformed into a (t \times m)-dimensional space. Therefore, an m-dimensional input vector \langle a_{x1}, a_{x2}, \ldots, a_{xm} \rangle can be transformed into a (t \times m)-dimensional vector \langle P(c_1 \mid a_{x1}), \ldots, P(c_t \mid a_{x1}), \ldots, P(c_1 \mid a_{xm}), \ldots, P(c_t \mid a_{xm}) \rangle. Then, according to the Manhattan distance, the distance between x and y in the MBR transform space is just the simplified version of VDM defined by Eq. 1.

From the MBR transform perspective, in VDM two values of an attribute are considered closer if they have more similar correlations with the output classes c_k (k = 1, 2, \ldots, t). VDM then defines the distance between two instances x and y as the straightforward superposition of the value differences across all attributes. Kasif et al. [21] call the transformation of nominal attributes into probability distributions the probability MBR transform. According to their observations, the basis of the transformation is just the attribute independence assumption on the joint probability distribution, i.e.

P(a_{x1}, a_{x2}, \ldots, a_{xm}, c) = P(c) \prod_{i=1}^{m} P(a_{xi} \mid c).    (6)

Thus, VDM makes a strong assumption that all attributes are fully independent, simply called the attribute independence assumption, and seeks the nearest neighbors in this transformed space.
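The following small sketch illustrates the probability MBR transform just described: with t classes, each of the m nominal values of an instance is mapped to its t class-conditional probabilities, and the Manhattan distance in the resulting (t × m)-dimensional space coincides with the simplified VDM of Eq. 1. The probability table below is made up purely for illustration.

```python
classes = ["yes", "no"]                  # t = 2 classes

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},   # A_1
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},   # A_2
]

def mbr_transform(x):
    """Map <a_x1, ..., a_xm> to <P(c_1|a_x1), ..., P(c_t|a_xm)>."""
    return [table[i][a][c] for i, a in enumerate(x) for c in classes]

def manhattan(u, v):
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

x, z = ("sunny", "no"), ("rain", "yes")
# The Manhattan distance between the transformed vectors equals Eq. 1.
print(manhattan(mbr_transform(x), mbr_transform(z)))
```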
2.2 The existing attribute weighting schemes

Recent work in distance metric learning has shown that the simplified VDM is competitive with state-of-the-art distance metrics such as SFM and MRM. However, it is obvious that the attribute independence assumption in VDM is rarely true in reality, which can harm its performance in applications with complex attribute dependencies. To weaken the attribute independence assumption in VDM, a large number of improved approaches have been proposed. In this paper, we focus on the attribute weighting approach. Different from the attribute selection approach, which completely eliminates irrelevant or redundant attributes from the training data, the attribute weighting approach aims to weight each attribute according to its importance. It defines the distance from a training instance x to a given test instance y as:

d(x, y) = \sum_{i=1}^{m} w_i \sum_{c} \left| P(c \mid a_{xi}) - P(c \mid a_{yi}) \right|,    (7)

where w_i is the weight of attribute A_i, which indicates the importance of A_i. Now, the only question left to answer is how to define w_i (i = 1, 2, \ldots, m), which is crucial in learning an attribute weighted value difference metric but has surprisingly received less attention than it warrants. To the best of our knowledge, there exist only two attribute weighting schemes in the literature. One is the attribute weighting scheme introduced by the original VDM [41], and the other is the mutual information-based attribute weighting scheme [27].

As already pointed out above, the original VDM [41] actually has an attribute weighting scheme. Since we focus on attribute weighted value difference metrics in this paper, we first revisit the attribute weighting scheme in the original VDM. Stanfill and Waltz [41] argued that attributes differ in importance because they differ in the strength with which they constrain the class values, and that different values of a single attribute can also differ in how strongly they constrain the class values. Therefore, they define the attribute weight w_i by measuring the strength of the constraint imposed by the ith attribute value a_{yi} of the given test instance y on the class. The detailed method is to restrict the values of the ith attribute of all training instances to a_{yi}, find the frequencies of the various class values, square them, sum them, and finally take the square root of the result. The resulting weight w_i is:

w_i = \sqrt{\sum_{c} P(c \mid a_{yi})^2}.    (8)
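As a concrete illustration of Eqs. 7 and 8, the sketch below combines the attribute-weighted VDM with the Stanfill–Waltz weight computed from the test instance's attribute values. The probability table and names are made-up illustrations under our own assumptions, not the authors' code.

```python
import math

classes = ["yes", "no"]

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]

def sw_weight(i, a):
    """Stanfill-Waltz weight of Eq. 8 for the value a of attribute A_i."""
    return math.sqrt(sum(table[i][a][c] ** 2 for c in classes))

def weighted_vdm(x, y_inst):
    """Eq. 7 with w_i computed from the test instance's value a_yi; since the
    weight depends on y, the distance is not symmetric in general."""
    return sum(sw_weight(i, y_inst[i]) *
               abs(table[i][x[i]][c] - table[i][y_inst[i]][c])
               for i in range(len(x)) for c in classes)

x, y_inst = ("sunny", "no"), ("rain", "yes")
print(weighted_vdm(x, y_inst), weighted_vdm(y_inst, x))   # generally differ
```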
It can be seen that this attribute weighting scheme makes the original VDM non-symmetric, because the distance from x to y is in general not equal to the distance from y to x, i.e. d(x, y) \neq d(y, x). It is noteworthy that, when we define d(y, x), the weight w_i is \sqrt{\sum_{c} P(c \mid a_{xi})^2}, which is in general not equal to \sqrt{\sum_{c} P(c \mid a_{yi})^2}. Besides, this scheme uses different weights for different attribute values and therefore strictly performs attribute value weighting rather than attribute weighting.

To make the original VDM symmetric and strictly perform attribute weighting, Li et al. [27] proposed a mutual information-based attribute weighting scheme and called the resulting metric the mutual information weighted value difference metric (MI-VDM). They argued that the importance of an attribute can be roughly measured by the correlation between it and the class variable, and therefore they use I_P(A_i; C), the mutual information between attribute A_i and the class variable C, to define the weight w_i. The detailed formula is:

w_i = I_P(A_i; C) = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)},    (9)

where a_i is each possible attribute value of A_i and c is each possible class label of C. Roughly speaking, I_P(A_i; C) measures how much information A_i provides about C, and therefore this is a symmetric attribute weighting scheme.
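The following is a minimal sketch of the mutual-information weight of Eq. 9, computed from empirical joint frequencies of one attribute and the class; the toy data and names are purely illustrative.

```python
import math
from collections import Counter

def mutual_information(values, labels):
    """w_i = I_P(A_i; C) = sum_{a,c} P(a,c) * log( P(a,c) / (P(a) P(c)) )."""
    n = len(values)
    p_ac = Counter(zip(values, labels))
    p_a, p_c = Counter(values), Counter(labels)
    return sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
               for (a, c), nac in p_ac.items())

# one nominal attribute column and the corresponding class column
A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(mutual_information(A_i, C))    # the MI-VDM weight w_i for this attribute
```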
3 The proposed attribute weighting schemes

In attribute weighting schemes, the difference between each pair of attribute values (a_{xi} and a_{yi}) is weighted by the importance w_i of the attribute A_i. This kind of attribute weighting corresponds to stretching the axes in the attribute space, and the amount by which each axis is stretched is determined by the importance of each attribute. This process of stretching the axes in order to optimize the performance of distance metrics provides a mechanism for emphasizing more relevant attributes and suppressing less relevant ones.

Appropriate attribute weights can reduce the error that results from violations of the attribute independence assumption in VDM. Obviously, if a training data set includes a set of attributes that are identical to one another, the error due to the violation of the attribute independence assumption can be removed by assigning weights that sum to 1.0 to this set of attributes. For example, the weight of one of the attributes, A_i, could be set to 1.0, and the weights of the remaining attributes that are identical to A_i set to 0.0. This is equivalent to removing the remaining attributes from the training data. Attribute weighting is strictly more powerful than attribute selection, as it is possible to obtain results identical to attribute selection by setting the weights of selected attributes to 1.0 and of unselected attributes to 0.0, while other weight assignments can result in distance metrics that cannot be expressed using attribute selection.

To further improve performance, we propose two new attribute weighting schemes for setting attribute weights for use with the simplified VDM in this section. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM).
3.1 KLD-VDM

Our first proposed attribute weighting scheme is inspired by the Kullback–Leibler divergence [23,24], which is a measure of the difference between two probability distributions. Let P and Q be two discrete probability distributions; then the Kullback–Leibler divergence from Q to P is \sum_{i} P(i) \log \frac{P(i)}{Q(i)}. Specifically, the Kullback–Leibler divergence from Q to P is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. The Kullback–Leibler divergence appears in the information theoretic literature under various guises. For instance, it can be viewed as a special case of the cross-entropy or the discrimination, a measure which defines the information theoretic similarity between two probability distributions.

Here we take these two probability distributions to be the prior probability distribution of C and the posterior probability distribution of C after observing A_i = a_{yi}, respectively. Then the Kullback–Leibler divergence, denoted KLD(C \mid a_{yi}), from the prior probability distribution of C to the posterior probability distribution of C after observing A_i = a_{yi} is:

KLD(C \mid a_{yi}) = \sum_{c} P(c \mid a_{yi}) \log \frac{P(c \mid a_{yi})}{P(c)}.    (10)
Now, our first proposed attribute weighting scheme is the Kullback–Leibler divergence-based attribute weighting scheme, which defines the weight w_i of attribute A_i as the Kullback–Leibler divergence for the attribute value a_{yi}. The detailed formula is:

w_i = KLD(C \mid a_{yi}) = \sum_{c} P(c \mid a_{yi}) \log \frac{P(c \mid a_{yi})}{P(c)}.    (11)
Similar to the original VDM, our Kullback–Leibler divergence weighted VDM also performs non-symmetric attribute value weighting. First, when we define d(y, x), the weight w_i is KLD(C \mid a_{xi}), which is in general not equal to KLD(C \mid a_{yi}). Second, this scheme also uses different weights for different attribute values and is therefore an attribute value weighting scheme.
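The following sketch shows the Kullback–Leibler divergence weight of Eq. 11 and the resulting KLD-VDM distance; as in the sketches above, the probability tables and names are illustrative assumptions rather than the authors' code.

```python
import math

classes = ["yes", "no"]
prior = {"yes": 0.5, "no": 0.5}                       # P(c)

# P(c | A_i = a), indexed as table[i][a][c]; the values are illustrative only.
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]

def kld_weight(i, a):
    """KLD(C | A_i = a): information gained about C by observing A_i = a (Eq. 11)."""
    return sum(table[i][a][c] * math.log(table[i][a][c] / prior[c])
               for c in classes)

def kld_vdm(x, y_inst):
    """KLD-VDM: Eq. 7 with w_i = KLD(C | a_yi); non-symmetric like the original VDM."""
    return sum(kld_weight(i, y_inst[i]) *
               abs(table[i][x[i]][c] - table[i][y_inst[i]][c])
               for i in range(len(x)) for c in classes)

x, y_inst = ("sunny", "no"), ("rain", "yes")
print(kld_vdm(x, y_inst), kld_vdm(y_inst, x))         # generally not equal
```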
3.2 GR-VDM

Our research on the second proposed attribute weighting scheme starts from our comments on the existing mutual information-based attribute weighting scheme [27]. First, the mutual information between attribute A_i and the class variable C is actually the information gain about C after observing A_i.

Proof The information gain IG(A_i) about C after observing A_i is simply the expected reduction in the entropy of C before and after observing A_i:

IG(A_i) = H(C) - H(C \mid A_i)
        = \sum_{a_i} P(a_i) \sum_{c} P(c \mid a_i) \log P(c \mid a_i) - \sum_{c} P(c) \log P(c)
        = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)} - \sum_{a_i, c} P(a_i, c) \log P(c)
        = \sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)}
        = I_P(A_i; C).    (12)
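As a quick numerical sanity check of Eq. 12, the sketch below computes the information gain H(C) − H(C|A_i) and the mutual information I_P(A_i; C) from the same empirical frequencies and verifies that they coincide; the toy data are our own.

```python
import math
from collections import Counter

A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
n = len(C)

def entropy(labels):
    return -sum((k / len(labels)) * math.log(k / len(labels))
                for k in Counter(labels).values())

# information gain: H(C) - H(C|A_i), grouping class labels by the value of A_i
h_c_given_a = sum((cnt / n) * entropy([c for a2, c in zip(A_i, C) if a2 == a])
                  for a, cnt in Counter(A_i).items())
info_gain = entropy(C) - h_c_given_a

# mutual information: sum_{a,c} P(a,c) log( P(a,c) / (P(a) P(c)) )
p_ac, p_a, p_c = Counter(zip(A_i, C)), Counter(A_i), Counter(C)
mi = sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
         for (a, c), nac in p_ac.items())

print(abs(info_gain - mi) < 1e-9)    # True: IG(A_i) equals I_P(A_i; C)
```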
Unfortunately, there is a natural bias in the mutual information-based attribute weighting scheme that favors attributes with more values; that is, attributes with greater numbers of values will appear to gain more information than those with fewer values, even if they are actually no more informative [16,31,35,36]. As an extreme example, when an attribute happens to have almost a different value for each training instance, it has a very high information gain relative to the training instances, despite being a very poor predictor on unseen test instances. To overcome the bias of the information gain measure, the gain ratio measure is widely used in decision tree learning [31,35,36]. Based on this premise, in this paper we propose a gain ratio-based attribute weighting scheme GR(A_i) that penalizes attributes with many values by incorporating a term called the split information (the entropy of A_i) into the following formula:

w_i = GR(A_i) = \frac{IG(A_i)}{H(A_i)} = \frac{I_P(A_i; C)}{H(A_i)} = \frac{\sum_{a_i, c} P(a_i, c) \log \frac{P(a_i, c)}{P(a_i)\,P(c)}}{-\sum_{a_i} P(a_i) \log P(a_i)}.    (13)
The split information of attribute A_i measures how broadly and uniformly it splits the given training data. However, when an attribute happens to have the same value for nearly all training instances, its split information will be zero or very small, which makes GR(A_i) either undefined or very large. To address this practical issue, we can adopt a heuristic such as first computing IG(A_i) and then applying GR(A_i) only when H(A_i) is above a small threshold ε. In our current implementation and experiments, we simply set ε to 0.000001.
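A minimal sketch of the gain-ratio weight of Eq. 13 with the ε guard described above: the weight is I_P(A_i; C)/H(A_i), and when the split information H(A_i) is not above ε the sketch falls back to the plain information gain. The data, the names and the exact fallback choice are illustrative assumptions.

```python
import math
from collections import Counter

EPSILON = 1e-6   # the small threshold used for the split information

def mutual_information(values, labels):
    n = len(values)
    p_ac, p_a, p_c = Counter(zip(values, labels)), Counter(values), Counter(labels)
    return sum((nac / n) * math.log((nac / n) / ((p_a[a] / n) * (p_c[c] / n)))
               for (a, c), nac in p_ac.items())

def entropy(values):
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in Counter(values).values())

def gain_ratio_weight(values, labels):
    """Eq. 13, guarded against (near-)zero split information H(A_i)."""
    ig = mutual_information(values, labels)     # equals IG(A_i) by Eq. 12
    h_a = entropy(values)                       # split information of A_i
    return ig / h_a if h_a > EPSILON else ig    # illustrative fallback

A_i = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
C   = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(gain_ratio_weight(A_i, C))    # the symmetric GR-VDM weight w_i
```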
4 Experiments and results

The purpose of this section is to validate the effectiveness of our proposed KLD-VDM and GR-VDM. We therefore compare them with the two existing attribute weighting schemes, namely the attribute weighting scheme in the original VDM (VDM for short) [41] and the mutual information-based attribute weighting scheme (MI-VDM for short) [27], in terms of the negative conditional log likelihood (−CLL) and root relative squared error (RRSE) of the probability estimates produced by the distance-weighted k-nearest-neighbor algorithm (DWKNN) [31]. DWKNN is a natural first test bed for demonstrating the effectiveness of a distance metric, because the distance metric is used twice: DWKNN first uses the distance metric to find the k nearest neighbors of the test instance, and then each of the k nearest neighbors is weighted in terms of its distance to the test instance (a sketch of this evaluation procedure is given after the summary of findings below). According to [31], once distance weighting is used, DWKNN is relatively insensitive to the choice of k. Thus, in our implementation, we choose a relatively large k value, namely we set the value of k in DWKNN to 10. In fact, we tested several other k values, such as 9, 15, 20 and 30, in our experiments and obtained almost the same conclusions. To save space, we do not present these detailed experimental results in this paper.

We employ the Waikato Environment for Knowledge Analysis (WEKA) [44] to implement DWKNN using the four distance metrics VDM, MI-VDM, KLD-VDM and GR-VDM, respectively. Note that, in our implementations, we smooth all of the estimated probabilities using the Laplace estimate. Our experiments are conducted on a collection of 56 benchmark datasets from the University of California at Irvine (UCI) machine learning repository [11], which represent a wide range of domains and data characteristics. The properties of these datasets are shown in Table 1. We replace all missing values with the means or modes from the available data and then use the unsupervised Discretize filter in WEKA [44] to discretize all numeric attributes into ten equal-width bins. Besides, to save running time, we use the unsupervised Resample filter with a size of 20% in WEKA to resample the datasets with more than 5000 instances.

Tables 2 and 3 show the detailed comparison results. The detailed negative conditional log likelihood (−CLL) and root relative squared error (RRSE) scores are obtained by averaging the results from 10 separate runs of stratified 10-fold cross-validation. The • and ◦ in Tables 2 and 3 denote significant improvement or degradation over VDM based on two-tailed t-tests at a confidence level of p = 0.05 [33], respectively. It is worth noting that, for −CLL and RRSE, a smaller number is better, and thus • denotes significantly better than VDM and ◦ denotes significantly worse. The Win/Tie/Lose (W/T/L) values and averages are summarized at the bottom of the tables. Each W/T/L entry indicates that, compared to VDM, the corresponding metric (MI-VDM, KLD-VDM or GR-VDM) wins on W datasets, ties on T datasets, and loses on L datasets. The average across all datasets provides a gross indication of relative performance in addition to the other statistics. From these comparison results, we can see that:

1. In terms of −CLL, KLD-VDM is notably better than VDM on 28 datasets and notably worse on 4 datasets. Further, the average −CLL value (71.25) of KLD-VDM is notably lower than that of VDM (78.02).
Table 1  Datasets used in the experiments

Dataset                #Instances   #Attributes   #Classes   Missing   Numerical
anneal                 898          38            6          y         y
anneal.ORIG            898          38            6          y         y
audiology              226          69            24         y         n
autos                  205          25            7          y         y
balance-scale          625          4             3          n         y
breast-cancer          286          9             2          y         n
breast-w               699          9             2          y         n
cardiotocography-2     2126         21            3          n         y
colic                  368          22            2          y         y
colic.ORIG             368          27            2          y         y
connectionist-vowel    528          10            11         n         y
climate                540          20            2          n         y
credit-a               690          15            2          y         y
credit-g               1000         20            2          n         y
cylinder-bands         540          40            2          y         y
diabetes               768          8             2          n         y
ecoli                  336          7             8          n         y
energy-y1              768          8             37         n         y
energy-y2              768          8             38         n         y
glass                  214          9             7          n         y
heart-c                303          13            5          y         y
heart-h                294          13            5          y         y
heart-statlog          270          13            2          n         y
hepatitis              155          19            2          y         y
hypothyroid            3772         29            4          y         y
ionosphere             351          34            2          n         y
iris                   150          4             3          n         y
kr-vs-kp               3196         36            2          n         n
labor                  57           16            2          y         y
letter                 20,000       16            26         n         y
lymph                  148          18            4          n         y
mfeat-f                2000         76            10         n         y
monks                  432          6             2          n         n
movement-libras        360          90            15         n         y
mushroom               8124         22            2          y         n
newthyroid             215          5             3          n         y
optdigits              5620         62            10         n         y
page-blocks            5473         10            5          n         y
parkinsons             195          22            2          y         y
pendigits              10,992       16            10         n         y
primary-tumor          339          17            21         y         n
qar-biodegradation     1055         41            2          n         y
robot-24               5456         24            4          n         y
segment                2310         19            7          n         y
sick                   3772         29            2          y         y
sonar                  208          60            2          n         y
soybean                683          35            19         y         n
spect                  267          22            2          n         n
splice                 3190         61            3          n         n
steel-plates-faults    1941         33            2          n         y
vehicle                846          18            4          n         y
vote                   435          16            2          y         n
vowel                  990          13            11         n         y
waveform-5000          5000         40            3          n         y
yeast                  1484         8             10         n         y
zoo                    101          17            7          y         y
2. In terms of −CLL, GR-VDM is significantly better than VDM on 34 datasets and significantly worse on none. Furthermore, the average −CLL value (67.39) of GR-VDM is significantly lower than that of VDM.
3. In terms of RRSE, KLD-VDM is markedly better than VDM on 27 datasets and markedly worse on 6 datasets. In addition, the average RRSE value (72.39) of KLD-VDM is markedly lower than that of VDM (75.68).
4. In terms of RRSE, GR-VDM is remarkably better than VDM on 33 datasets and remarkably worse on none. Additionally, the average RRSE value (70.70) of GR-VDM is remarkably lower than that of VDM.
5. For additional insight into the results, we examined the performance on the datasets "colic.ORIG", "splice" and "zoo". The −CLL and RRSE values of KLD-VDM and GR-VDM are much lower than those of MI-VDM. In our observation, each of these three datasets contains a useless attribute that varies so much that almost every instance has its own value: the attribute "Hospital Number" in "colic.ORIG", the attribute "instance name" in "splice" and the attribute "animal" in "zoo". Although this kind of attribute gains more mutual information, it is actually no more informative. These results confirm that our proposed attribute weighting schemes can overcome the natural bias of the previous MI-VDM.
6. From these comparison results, we can also see that our proposed attribute weighting schemes are overall better than the two existing attribute weighting schemes, not only in terms of −CLL but also in terms of RRSE. Note that KLD-VDM is a non-symmetric distance metric while GR-VDM is a symmetric distance metric. Thus, when a non-symmetric distance is needed, KLD-VDM is an appropriate alternative distance metric; when a symmetric distance is needed, GR-VDM should be considered first.

Then, we take advantage of the KEEL software [1] to conduct the Friedman test with the Bergmann post hoc test for statistical comparisons of multiple algorithms over multiple datasets [6,12]. The average rankings of the algorithms obtained by applying the Friedman test are also summarized at the bottom of Tables 2 and 3.
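Before turning to the detailed results in Table 2, here is an illustrative sketch of the evaluation vehicle described at the beginning of this section: distance-weighted kNN class probability estimation on top of a VDM-style distance. The inverse-distance weighting, the toy distance and all names are our own simplification, not the WEKA implementation used in the experiments.

```python
from collections import defaultdict

def dwknn_probs(test, train_X, train_y, distance, k=10, classes=("yes", "no")):
    """Estimate class probabilities from the k nearest neighbors, each neighbor
    weighted by the inverse of its distance to the test instance."""
    neighbors = sorted((distance(x, test), c) for x, c in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for d, c in neighbors:
        votes[c] += 1.0 / (d + 1e-6)              # distance weighting
    total = sum(votes.values())
    return {c: votes[c] / total for c in classes}

# toy nominal data and a toy (unweighted) VDM-style distance for illustration
table = [
    {"sunny": {"yes": 0.4, "no": 0.6}, "rain": {"yes": 0.7, "no": 0.3}},
    {"no":    {"yes": 0.8, "no": 0.2}, "yes":  {"yes": 0.3, "no": 0.7}},
]
def toy_vdm(x, z):
    return sum(abs(table[i][x[i]][c] - table[i][z[i]][c])
               for i in range(len(x)) for c in ("yes", "no"))

train_X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
train_y = ["yes", "no", "yes", "no"]
print(dwknn_probs(("rain", "no"), train_X, train_y, toy_vdm, k=3))
```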
Table 2  −CLL comparisons for VDM versus MI-VDM, KLD-VDM and GR-VDM

Dataset                VDM      MI-VDM     KLD-VDM    GR-VDM
anneal                 44.17    37.98 •    40.00 •    36.73 •
anneal.ORIG            51.07    49.98 •    51.38      48.20 •
audiology              44.11    43.60      43.61      43.43 •
autos                  28.20    27.16 •    24.45 •    23.97 •
balance-scale          30.36    30.39      30.92      30.38
breast-cancer          16.55    16.64      17.31      16.69
breast-w               12.26    11.05 •    9.89 •     10.30 •
cardiotocography-2     73.94    65.27 •    69.53 •    65.04 •
climate                16.23    14.27 •    16.76      14.28 •
colic                  16.54    15.52      15.85      15.36 •
colic.ORIG             18.09    24.70 ◦    19.69      18.61
connectionist-vowel    64.79    68.69 ◦    65.88      64.67
credit-a               25.33    25.61      25.73      25.21
credit-g               52.26    51.93      53.21      52.15
cylinder-bands         27.74    32.32 ◦    27.07      27.78
diabetes               37.93    38.38      37.97      38.39
ecoli                  30.65    32.37      30.35      29.18 •
energy-y1              155.74   155.38     154.67     155.08
energy-y2              180.05   180.10     178.89 •   179.55
glass                  25.78    25.13 •    25.47      24.44 •
heart-c                24.59    19.35 •    20.79 •    19.04 •
heart-h                20.56    18.91 •    19.56      18.67 •
heart-statlog          11.49    11.32      11.79      11.31
hepatitis              6.12     5.98       6.51       6.14
hypothyroid            136.43   136.70     140.90 ◦   136.22
ionosphere             18.14    10.94 •    11.49 •    9.86 •
iris                   4.11     4.19       3.83 •     3.91 •
kr-vs-kp               49.59    47.97 •    49.36      48.55
labor                  1.99     1.85       1.77       1.86
letter                 726.43   815.27 ◦   740.53 ◦   706.43 •
lymph                  11.03    8.95 •     9.33 •     8.68 •
mfeat-f                436.52   422.50 •   374.93 •   318.59 •
monks                  3.83     3.79       3.84       3.80
movement-libras        91.99    93.54 ◦    90.00 •    91.25 •
mushroom               16.82    15.26 •    14.87 •    15.13 •
newthyroid             6.09     5.98       5.22 •     5.64 •
optdigits              222.54   228.30 ◦   196.11 •   177.59 •
page-blocks            40.75    41.32      41.95 ◦    40.48
parkinsons             5.87     5.55       5.82       5.53
pendigits              255.69   314.49 ◦   239.15 •   200.21 •
primary-tumor          79.11    78.67      79.35      78.67
qar-biodegradation     45.47    41.42 •    40.31 •    40.80 •
robot-24               79.40    57.36 •    56.80 •    50.21 •
segment                148.67   158.85 ◦   142.73 •   136.54 •
sick                   43.02    42.05      42.85      42.36
sonar                  13.55    9.74 •     10.17 •    9.69 •
soybean                97.34    101.56 ◦   90.35 •    96.17 •
spect                  12.43    11.68 •    10.73 •    11.66 •
splice                 258.19   105.96 •   86.55 •    87.15 •
steel-plates-faults    33.86    24.97 •    38.09 ◦    16.79 •
vehicle                72.61    65.71 •    64.93 •    63.35 •
vote                   9.40     7.63 •     7.77 •     7.65 •
vowel                  126.83   134.45 ◦   120.84 •   118.74 •
waveform-5000          87.57    55.85 •    51.52 •    48.80 •
yeast                  212.08   214.00     214.27     210.09
zoo                    6.98     8.04 ◦     6.50 •     6.87
W/T/L                  −        23/21/12   28/24/4    34/22/0
Average                78.02    76.37      71.25      67.39
Average rank           3.1964   2.6875     2.5536     1.5625
With four algorithms and 56 datasets, F_F is distributed according to the F distribution with 3 and 165 degrees of freedom. The F_F values calculated from the average ranks are 21.430186 and 15.726207, respectively, which are both greater than the critical value of F(3, 165) for α = 0.05. We therefore reject the null hypothesis and proceed with the Bergmann post hoc test to find out exactly which differences among these algorithms are significant. Tables 4 and 5 report the post hoc comparisons on −CLL and RRSE, respectively. From these comparison results, we can see that:

– In terms of −CLL, the average rankings of VDM, MI-VDM, KLD-VDM and GR-VDM are 3.2143, 2.6696, 2.5714 and 1.5446, respectively. KLD-VDM is significantly better than VDM, and GR-VDM is significantly better than VDM, MI-VDM and KLD-VDM.
– In terms of RRSE, the average rankings of VDM, MI-VDM, KLD-VDM and GR-VDM are 3.1786, 2.6518, 2.4911 and 1.6786, respectively. KLD-VDM is significantly better than VDM, and GR-VDM is significantly better than VDM, MI-VDM and KLD-VDM.
– Both in terms of −CLL and in terms of RRSE, the original VDM is in general the worst and GR-VDM is the best. Besides, KLD-VDM is a little better than MI-VDM.
– Both in terms of −CLL and in terms of RRSE, GR-VDM significantly outperforms MI-VDM, which indicates that GR-VDM can compensate for MI-VDM's bias toward attributes with more values.
Table 3  RRSE comparisons for VDM versus MI-VDM, KLD-VDM and GR-VDM

Dataset                VDM      MI-VDM     KLD-VDM    GR-VDM
anneal                 67.71    60.87 •    63.29 •    58.88 •
anneal.ORIG            78.14    77.52      78.62      75.67 •
audiology              92.04    91.83      91.44      91.46 •
autos                  91.02    89.32 •    85.15 •    84.57 •
balance-scale          64.59    64.76      65.21      64.75
breast-cancer          96.55    96.77      99.11      96.85
breast-w               41.00    38.10 •    35.29 •    36.47 •
cardiotocography-2     63.96    60.04 •    63.52      60.04 •
climate                102.04   95.65 •    107.21 ◦   95.74 •
colic                  76.76    74.01      74.89      73.40 •
colic.ORIG             84.12    102.00 ◦   88.93 ◦    85.56
connectionist-vowel    78.01    80.05 ◦    78.24      78.07
credit-a               66.10    66.81      67.06      65.99
credit-g               91.09    90.47      92.06      90.71
cylinder-bands         82.17    89.46 ◦    81.28      82.30
diabetes               84.38    85.10      84.49      85.11
ecoli                  74.17    76.35 ◦    73.82      72.28 •
energy-y1              89.16    89.11      89.00 •    89.05
energy-y2              92.00    91.99      91.83 •    91.91
glass                  88.67    87.84      87.90      86.85 •
heart-c                88.53    78.24 •    81.25 •    77.57 •
heart-h                84.42    82.27 •    83.60      81.21 •
heart-statlog          72.38    72.02      73.89      71.86
hepatitis              85.66    84.55      88.81      86.35
hypothyroid            101.35   101.50     102.76 ◦   101.28
ionosphere             84.97    60.04 •    63.54 •    56.81 •
iris                   38.30    38.91      36.49 •    36.97
kr-vs-kp               35.31    35.21      35.78      35.60
labor                  64.67    60.82      55.89      61.60
letter                 85.29    88.58 ◦    85.61 ◦    84.18 •
lymph                  83.81    75.16 •    76.74 •    74.05 •
mfeat-f                98.54    97.58 •    93.65 •    87.84 •
monks                  17.20    16.95      17.29      16.99
movement-libras        98.57    98.92 ◦    98.07 •    98.39 •
mushroom               20.40    19.27 •    18.56 •    19.03 •
newthyroid             51.03    50.38      44.37 •    48.48 •
optdigits              95.58    96.34 ◦    91.41 •    87.93 •
page-blocks            87.06    88.35      88.61 ◦    86.79
parkinsons             66.28    64.77      66.92      64.94
pendigits              75.64    83.69 ◦    72.89 •    65.96 •
primary-tumor          96.21    96.17      96.11      96.11
qar-biodegradation     76.45    73.37 •    72.05 •    72.71 •
robot-24               73.74    60.05 •    60.40 •    55.75 •
segment                55.62    57.99 ◦    54.25 •    52.63 •
sick                   61.35    60.49      59.46      60.64
sonar                  95.96    77.63 •    79.51 •    77.72 •
soybean                80.23    81.63 ◦    77.35 •    79.43 •
spect                  94.15    91.50      86.48 •    91.61
splice                 86.64    46.72 •    41.42 •    41.78 •
steel-plates-faults    36.57    30.85 •    45.25 ◦    17.43 •
vehicle                77.35    73.96 •    73.89 •    72.87 •
vote                   46.47    40.22 •    40.85 •    40.28 •
vowel                  79.22    81.28 ◦    77.44 •    77.07 •
waveform-5000          87.54    67.12 •    65.35 •    63.10 •
yeast                  90.42    90.96 ◦    90.89      90.19
zoo                    61.23    66.80 ◦    58.51 •    60.49 •
W/T/L                  –        19/24/13   27/23/6    33/23/0
Average                75.68    73.18      72.39      70.70
Average rank           3.1607   2.6696     2.4732     1.6964
Table 4  Post hoc comparisons for VDM, MI-VDM, KLD-VDM and GR-VDM: −CLL

i    Algorithms                  z = (R0 − Ri)/SE    p
6    VDM versus GR-VDM           6.697114            0
5    MI-VDM versus GR-VDM        4.611128            0.000004
4    KLD-VDM versus GR-VDM       4.062184            0.000049
3    VDM versus KLD-VDM          2.63493             0.008415
2    VDM versus MI-VDM           2.085986            0.03698
1    MI-VDM versus KLD-VDM       0.548944            0.583044

Bergmann's procedure rejects these hypotheses: VDM versus MI-VDM; VDM versus KLD-VDM; VDM versus GR-VDM; MI-VDM versus GR-VDM; KLD-VDM versus GR-VDM

Table 5  Post hoc comparisons for VDM, MI-VDM, KLD-VDM and GR-VDM: RRSE

i    Algorithms                  z = (R0 − Ri)/SE    p
6    VDM versus GR-VDM           6.001785            0
5    MI-VDM versus GR-VDM        3.988992            0.000066
4    KLD-VDM versus GR-VDM       3.183874            0.001453
3    VDM versus KLD-VDM          2.817911            0.004834
2    VDM versus MI-VDM           2.012794            0.044136
1    MI-VDM versus KLD-VDM       0.805118            0.420752

Bergmann's procedure rejects these hypotheses: VDM versus MI-VDM; VDM versus KLD-VDM; VDM versus GR-VDM; MI-VDM versus GR-VDM; KLD-VDM versus GR-VDM
Table 6  −CLL comparisons for GR-VDM versus KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM

Dataset                GR-VDM   KLD-VDM    SFM        MRM        RELIEF     IWVDM      LVDM
anneal                 36.73    40.00 ◦    39.15 ◦    45.88 ◦    48.55 ◦    45.54 ◦    47.56 ◦
anneal.ORIG            48.20    51.38 ◦    49.47      58.81 ◦    63.91 ◦    51.43 ◦    54.17 ◦
audiology              43.43    43.61      40.55 •    41.77      58.84 ◦    55.88 ◦    54.03 ◦
autos                  23.97    24.45      22.84      25.89      32.15 ◦    31.28 ◦    31.97 ◦
balance-scale          30.38    30.92      16.26 •    23.06 •    31.35 ◦    30.57      35.49 ◦
breast-cancer          16.69    17.31      16.42      20.43 ◦    17.07      16.83      16.67
breast-w               10.30    9.89       10.22      8.61 •     12.92 ◦    14.65 ◦    12.17 ◦
cardiotocography-2     65.04    69.53 ◦    78.12 ◦    100.76 ◦   84.11 ◦    67.79 ◦    72.50 ◦
climate                14.28    16.76 ◦    16.56 ◦    20.03 ◦    15.58 ◦    19.89 ◦    20.72 ◦
colic                  15.36    15.85      17.65 ◦    21.62 ◦    15.58      16.04      15.38
colic.ORIG             16.97    18.49 ◦    19.69 ◦    23.77 ◦    18.11 ◦    17.23      18.55 ◦
connectionist-vowel    64.67    65.88      58.72 •    69.69 ◦    74.80 ◦    73.59 ◦    70.51 ◦
credit-a               25.21    25.73      27.06 ◦    30.42 ◦    25.44      25.30      25.93
credit-g               52.15    53.21      52.71      59.84 ◦    52.39      52.58      53.99
cylinder-bands         27.78    27.07      27.67      35.83 ◦    29.93      27.03      29.77
diabetes               38.39    37.97      39.97      48.26 ◦    38.41      38.71      40.41 ◦
ecoli                  29.18    30.35 ◦    29.31      30.73      35.93 ◦    36.00 ◦    35.19 ◦
energy-y1              155.08   154.67     163.53 ◦   202.43 ◦   134.88 •   162.62 ◦   155.63
energy-y2              179.55   178.89     182.31 ◦   205.68 ◦   158.14 •   187.59 ◦   181.39 ◦
glass                  24.44    25.47 ◦    25.90 ◦    28.37 ◦    26.93 ◦    27.67 ◦    28.04 ◦
heart-c                19.04    20.79 ◦    19.27      21.33 ◦    26.12 ◦    22.51 ◦    27.39 ◦
heart-h                18.67    19.56      18.62      20.58      21.59 ◦    19.95 ◦    23.35 ◦
heart-statlog          11.31    11.79      11.37      12.68      11.64      11.21      11.78
hepatitis              6.14     6.51       5.42       7.20       6.41       6.20       6.05
hypothyroid            136.22   140.90 ◦   140.07 ◦   144.26 ◦   139.28 ◦   135.33     150.30 ◦
ionosphere             9.86     11.49      10.43      11.33      19.62 ◦    18.32 ◦    16.13 ◦
iris                   3.91     3.83       4.30       4.48       4.08 ◦     4.46 ◦     4.35 ◦
kr-vs-kp               48.55    49.36      104.59 ◦   118.42 ◦   42.05 •    39.88 •    39.12 •
labor                  1.86     1.77       0.92 •     0.95 •     2.26       2.16       2.12
letter                 706.43   740.53 ◦   744.66 ◦   803.25 ◦   945.80 ◦   955.58 ◦   979.55 ◦
lymph                  8.68     9.33       8.42       8.91       12.44 ◦    10.66 ◦    11.11 ◦
mfeat-f                318.59   374.93 ◦   212.94 •   219.04 •   455.83 ◦   455.32 ◦   454.95 ◦
monks                  3.80     3.84       4.13       6.66 ◦     3.55       3.80       3.56 •
movement-libras        91.25    90.00 •    57.90 •    59.47 •    93.65 ◦    93.67 ◦    93.55 ◦
mushroom               15.13    14.87      24.68 ◦    38.77 ◦    17.26 ◦    17.49 ◦    15.12
newthyroid             5.64     5.22       5.24       7.96 ◦     6.31 ◦     6.05       6.00
optdigits              177.59   196.11 ◦   88.06 •    91.44 •    243.68 ◦   241.64 ◦   239.47 ◦
page-blocks            40.48    41.95 ◦    40.16      46.53 ◦    41.93 ◦    40.66      41.35
parkinsons             5.53     5.82       6.26       9.39 ◦     6.21 ◦     4.99       5.24
pendigits              200.21   239.15 ◦   177.99 •   199.57     358.57 ◦   359.66 ◦   332.88 ◦
primary-tumor          78.67    79.35      77.58      78.46      84.91 ◦    86.53 ◦    84.95 ◦
qar-biodegradation     40.80    40.31      46.05 ◦    59.61 ◦    50.25 ◦    44.66 ◦    46.50 ◦
robot-24               50.21    56.80 ◦    65.98 ◦    81.41 ◦    95.56 ◦    71.31 ◦    90.03 ◦
segment                136.54   142.73 ◦   143.51 ◦   158.73 ◦   164.31 ◦   170.93 ◦   159.31 ◦
sick                   42.36    42.85      51.81 ◦    55.10 ◦    42.55      42.37      43.15
sonar                  9.69     10.17      10.91      13.53 ◦    13.93 ◦    13.76 ◦    13.98 ◦
soybean                96.17    90.35 •    76.39 •    78.75 •    121.53 ◦   126.09 ◦   111.41 ◦
spect                  11.66    10.73      10.84      11.90      12.74 ◦    11.47      11.70
splice                 86.87    86.46      84.75      87.41      281.57 ◦   260.83 ◦   228.70 ◦
steel-plates-faults    16.79    38.09 ◦    36.98 ◦    39.64 ◦    41.73 ◦    24.50 ◦    22.31 ◦
vehicle                63.35    64.93      79.23 ◦    97.59 ◦    80.05 ◦    73.70 ◦    85.07 ◦
vote                   7.65     7.77       11.33 ◦    14.22 ◦    8.94 ◦     7.79       7.62
vowel                  118.74   120.84 ◦   124.41 ◦   143.56 ◦   172.96 ◦   166.01 ◦   170.25 ◦
waveform-5000          48.80    51.52 ◦    47.18      64.78 ◦    97.23 ◦    91.33 ◦    98.70 ◦
yeast                  210.09   214.27 ◦   212.86     221.64 ◦   221.06 ◦   216.39 ◦   225.88 ◦
zoo                    6.73     6.45       5.84 •     6.01 •     7.53 ◦     8.26 ◦     8.11 ◦
W/T/L                  –        20/34/2    21/25/10   35/13/8    42/11/3    36/19/1    38/16/2
Average                67.35    71.23      66.16      74.04      88.04      86.85      86.98
To further validate the effectiveness of our proposed distance metrics, we conducted another group of experiments to compare them with several other state-of-the-art metric learning methods: SFM (Short and Fukunaga metric) [40], MRM (minimum risk metric) [3], RELIEF [22,37], IWVDM (instance weighted VDM) [25] and LVDM (local VDM) [26]. Note that RELIEF denotes the simplified VDM with ReliefF-based attribute weighting, which applies the ReliefF attribute selection algorithm [22,37] and then uses the resulting attribute relevance scores as weights. To save memory when running these experiments, we manually removed three useless attributes ("Hospital Number" in "colic.ORIG", "instance name" in "splice" and "animal" in "zoo"). Tables 6, 7, 8 and 9 show the detailed comparison results. From these comparison results, we can see that our proposed distance metrics, especially GR-VDM, are overall better than their competitors not only in terms of −CLL but also in terms of RRSE.
5 Conclusions and further research

The value difference metric (VDM) has continued to be one of the most widely used distance metrics for nominal attributes. Of the numerous approaches to improving the simplified VDM (without the attribute weighting scheme) by mitigating its attribute independence assumption, attribute weighting has received less attention but has demonstrated remarkable class probability estimation performance. In this paper we propose two simple but effective attribute weighting schemes for the simplified VDM. One is the non-symmetric Kullback–Leibler divergence weighted value difference metric (KLD-VDM) and the other is the symmetric gain ratio weighted value difference metric (GR-VDM). The extensive experimental results validate the effectiveness of our proposed KLD-VDM and GR-VDM. Besides, KLD-VDM and GR-VDM maintain the computational simplicity and robustness that characterize VDM.
Table 7  RRSE comparisons for GR-VDM versus KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM

Dataset                GR-VDM   KLD-VDM    SFM        MRM        RELIEF     IWVDM      LVDM
anneal                 58.88    63.29 ◦    62.17 ◦    68.47 ◦    72.69 ◦    69.25 ◦    71.55 ◦
anneal.ORIG            75.67    78.62 ◦    76.37      80.94 ◦    89.36 ◦    78.14 ◦    80.79 ◦
audiology              91.46    91.44      88.01 •    87.61 •    100.75 ◦   99.60 ◦    98.10 ◦
autos                  84.57    85.15      80.72 •    84.18      96.60 ◦    95.37 ◦    96.33 ◦
balance-scale          64.75    65.21      41.64 •    53.75 •    65.81 ◦    64.87      71.40 ◦
breast-cancer          96.85    99.11      95.94      103.56     98.14      97.61      96.85
breast-w               36.47    35.29      35.72      32.33 •    42.76 ◦    47.02 ◦    41.17 ◦
cardiotocography-2     60.04    63.52 ◦    69.51 ◦    79.47 ◦    69.40 ◦    60.45      63.64 ◦
climate                95.74    107.21 ◦   105.61 ◦   117.17 ◦   99.63 ◦    115.07 ◦   119.02 ◦
colic                  73.40    74.89      80.47 ◦    87.53 ◦    74.14      75.57      73.57
colic.ORIG             81.33    85.86 ◦    88.83 ◦    96.07 ◦    84.28 ◦    81.49      85.35
connectionist-vowel    78.07    78.24      72.95 •    76.17 •    82.70 ◦    82.25 ◦    80.75 ◦
credit-a               65.99    67.06      69.25 ◦    72.63 ◦    66.34      66.13      67.12
credit-g               90.71    92.06      91.28      95.62 ◦    91.08      91.30      92.85
cylinder-bands         82.30    81.28      83.02      92.15 ◦    86.55      80.67      86.16
diabetes               85.11    84.49      87.12      93.70 ◦    85.22      85.21      87.74 ◦
ecoli                  72.28    73.82 ◦    71.37      71.37      80.65 ◦    81.17 ◦    80.01 ◦
energy-y1              89.05    89.00      90.06 ◦    93.81 ◦    84.83 •    90.46 ◦    88.85 •
energy-y2              91.91    91.83 •    92.12 ◦    93.80 ◦    88.13 •    93.28 ◦    91.94
glass                  86.85    87.90      88.32      90.30      90.42 ◦    91.59 ◦    91.75 ◦
heart-c                77.57    81.25 ◦    78.01      78.06      91.49 ◦    84.51 ◦    94.03 ◦
heart-h                81.21    83.60      81.03      80.75      86.88 ◦    83.12      90.30 ◦
heart-statlog          71.86    73.89      72.41      74.55      73.02      71.61      74.05
hepatitis              86.35    88.81      78.48 •    87.28      88.18      86.03      85.00
hypothyroid            101.28   102.76 ◦   102.48 ◦   101.30     102.70 ◦   100.56     107.43 ◦
ionosphere             56.81    63.54      58.77      58.75      89.71 ◦    85.79 ◦    79.07 ◦
iris                   36.97    36.49      40.59      39.18      38.24 ◦    40.50 ◦    40.09 ◦
kr-vs-kp               35.60    35.78      62.29 ◦    65.37 ◦    32.33 •    28.24 •    28.20 •
labor                  61.60    55.89      30.38 •    29.37 •    70.82      67.99      67.71
letter                 84.18    85.61 ◦    84.33      84.89 ◦    92.57 ◦    93.21 ◦    93.65 ◦
lymph                  74.05    76.74      72.11      70.46      89.86 ◦    82.61 ◦    84.24 ◦
mfeat-f                87.84    93.65 ◦    70.33 •    69.20 •    99.72 ◦    99.70 ◦    99.67 ◦
monks                  16.99    17.29      19.46      32.94 ◦    27.25 ◦    16.96      16.54
movement-libras        98.39    98.07 •    80.64 •    80.80 •    98.95 ◦    98.95 ◦    98.92 ◦
mushroom               19.03    18.56      35.49 ◦    48.61 ◦    20.85 ◦    21.18 ◦    18.75
newthyroid             48.48    44.37      44.40      56.84 ◦    52.10      50.78      50.20
optdigits              87.93    91.41 ◦    58.68 •    59.14 •    98.15 ◦    97.93 ◦    97.68 ◦
page-blocks            86.79    88.61 ◦    86.58      91.02      88.49 ◦    86.48      86.99
parkinsons             64.94    66.92      70.13      83.50 ◦    68.71      59.15 •    61.44
pendigits              65.96    72.89 ◦    60.74 •    62.70 •    88.63 ◦    88.84 ◦    85.43 ◦
primary-tumor          96.11    96.11      94.90 •    94.60 •    98.35 ◦    98.94 ◦    98.36 ◦
qar-biodegradation     72.71    72.05      78.86 ◦    87.74 ◦    81.61 ◦    75.85 ◦    77.93 ◦
robot-24               55.75    60.40 ◦    67.46 ◦    74.05 ◦    82.71 ◦    68.85 ◦    79.68 ◦
segment                52.63    54.25 ◦    54.46 ◦    55.94 ◦    59.49 ◦    60.87 ◦    57.84 ◦
sick                   60.64    59.46      70.23 ◦    72.20 ◦    61.01      59.59      59.92
sonar                  77.72    79.51      81.02      88.23 ◦    97.85 ◦    96.99 ◦    98.08 ◦
soybean                79.43    77.35 •    71.15 •    70.88 •    87.42 ◦    88.83 ◦    84.15 ◦
spect                  91.61    86.48      87.09      93.21      95.88 ◦    90.11      91.44
splice                 41.78    41.44      40.71      41.15      91.47 ◦    87.21 ◦    80.09 ◦
steel-plates-faults    17.43    45.25 ◦    44.37 ◦    45.97 ◦    43.81 ◦    26.17 ◦    26.28 ◦
vehicle                72.87    73.89      81.90 ◦    90.01 ◦    81.66 ◦    78.10 ◦    84.40 ◦
vote                   40.28    40.85      54.99 ◦    59.93 ◦    44.47 ◦    40.70      39.95
vowel                  77.07    77.44      76.70      79.16 ◦    90.15 ◦    88.81 ◦    89.60 ◦
waveform-5000          63.10    65.35 ◦    61.79      71.79 ◦    93.16 ◦    89.79 ◦    93.96 ◦
yeast                  90.19    90.89 ◦    90.26      90.11      92.22 ◦    91.21 ◦    92.90 ◦
zoo                    59.73    58.21      53.96 •    55.40 •    63.95 ◦    68.03 ◦    67.23 ◦
W/T/L                  –        18/35/3    18/25/13   28/16/12   41/12/3    34/20/2    36/18/2
Average                70.61    72.33      70.85      74.92      78.81      76.80      77.61
Table 8  Post hoc comparisons for GR-VDM, KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM: −CLL

i     Algorithms                  z = (R0 − Ri)/SE    p
21    GR-VDM versus RELIEF        7.239117            0
20    GR-VDM versus MRM           7.151635            0
19    GR-VDM versus LVDM          6.670485            0
18    GR-VDM versus IWVDM         5.642575            0
17    SFM versus RELIEF           5.42387             0
16    SFM versus MRM              5.336388            0
15    SFM versus LVDM             4.855239            0.000001
14    KLD-VDM versus RELIEF       4.680275            0.000003
13    KLD-VDM versus MRM          4.592793            0.000004
12    KLD-VDM versus LVDM         4.111643            0.000039
11    SFM versus IWVDM            3.827328            0.00013
10    KLD-VDM versus IWVDM        3.083733            0.002044
9     GR-VDM versus KLD-VDM       2.558842            0.010502
8     GR-VDM versus SFM           1.815247            0.069486
7     RELIEF versus IWVDM         1.596542            0.110368
6     MRM versus IWVDM            1.509061            0.131283
5     IWVDM versus LVDM           1.027911            0.303992
4     KLD-VDM versus SFM          0.743595            0.457121
3     RELIEF versus LVDM          0.568632            0.569606
2     MRM versus LVDM             0.48115             0.63041
1     MRM versus RELIEF           0.087482            0.930289

Bergmann's procedure rejects these hypotheses: GR-VDM versus MRM; GR-VDM versus RELIEF; GR-VDM versus IWVDM; GR-VDM versus LVDM; KLD-VDM versus MRM; KLD-VDM versus RELIEF; KLD-VDM versus IWVDM; KLD-VDM versus LVDM; SFM versus MRM; SFM versus RELIEF; SFM versus IWVDM; SFM versus LVDM

Table 9  Post hoc comparisons for GR-VDM, KLD-VDM, SFM, MRM, RELIEF, IWVDM and LVDM: RRSE

i     Algorithms                  z = (R0 − Ri)/SE    p
21    GR-VDM versus RELIEF        6.932931            0
20    GR-VDM versus LVDM          5.795668            0
19    SFM versus RELIEF           5.292647            0
18    GR-VDM versus IWVDM         4.789627            0.000002
17    GR-VDM versus MRM           4.789627            0.000002
16    KLD-VDM versus RELIEF       4.549052            0.000005
15    SFM versus LVDM             4.155384            0.000032
14    KLD-VDM versus LVDM         3.411789            0.000645
13    SFM versus IWVDM            3.149344            0.001636
12    SFM versus MRM              3.149344            0.001636
11    KLD-VDM versus IWVDM        2.405749            0.016139
10    KLD-VDM versus MRM          2.405749            0.016139
9     GR-VDM versus KLD-VDM       2.383878            0.017131
8     MRM versus RELIEF           2.143304            0.032089
7     RELIEF versus IWVDM         2.143304            0.032089
6     GR-VDM versus SFM           1.640283            0.100946
5     RELIEF versus LVDM          1.137263            0.255428
4     MRM versus LVDM             1.00604             0.314396
3     IWVDM versus LVDM           1.00604             0.314396
2     KLD-VDM versus SFM          0.743595            0.457121
1     MRM versus IWVDM            0                   1

Bergmann's procedure rejects these hypotheses: GR-VDM versus MRM; GR-VDM versus RELIEF; GR-VDM versus IWVDM; GR-VDM versus LVDM; KLD-VDM versus RELIEF; KLD-VDM versus LVDM; SFM versus MRM; SFM versus RELIEF; SFM versus IWVDM; SFM versus LVDM
As with the original definition of VDM, our current versions of KLD-VDM and GR-VDM cannot handle numerical attributes yet, and thus all numerical attributes need to be discretized using a preprocessing step. However, in many real-world applications numerical attributes are widespread, and therefore taking advantage of the ideas behind previous heterogeneous distance functions [43] to extend our metrics to applications with numerical attributes is a main direction for our future research. Note that a potential problem confronting the proposed schemes is that, when several attributes are redundant, they still tend to get more than their share of the weight, since attributes are weighted independently; addressing this is also a main direction for our future research. Besides, applying the proposed distance metrics to
improve distance-related algorithms, such as the k-nearest-neighbor (KNN) algorithm and its variants [7,8,17,42], is another direction for our future research.

Acknowledgements The work was partially supported by the National Natural Science Foundation of China (U1711267), the Program for New Century Excellent Talents in University (NCET-12-0953), the Chenguang Program of Science and Technology of Wuhan (2015070404010202), and the Open Research Project of Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP201601).
References

1. Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Log Soft Comput 17(2–3):255–287
2. Bian W, Tao D (2012) Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst 23(8):1194–1205
3. Blanzieri E, Ricci F (1999) Probability based metrics for nearest neighbor classification and case-based reasoning. In: Proceedings of the 3rd international conference on case-based reasoning. Springer, pp 14–28
4. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features. Mach Learn 10:57–78
5. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning. ACM Press, Corvalis, pp 209–216
6. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
7. Domeniconi C, Gunopulos D (2001) Adaptive nearest neighbor classification using support vector machines. In: Advances in neural information processing systems 14. MIT Press, Cambridge, pp 665–672
8. Domeniconi C, Peng J, Gunopulos D (2000) Adaptive metric nearest-neighbor classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, Hilton Head, p 1517
9. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29:103–130
10. Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 17th international joint conference on artificial intelligence, pp 973–978
11. Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine
12. Garcia S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694
13. Grossman D, Domingos P (2004) Learning Bayesian network classifiers by maximizing conditional likelihood. In: Proceedings of the 21st international conference on machine learning. ACM, pp 361–368
14. Guo Y, Greiner R (2005) Discriminative model selection for belief net structures. In: Proceedings of the 12th national conference on artificial intelligence. AAAI, pp 770–776
15. Hall M (2007) A decision tree-based attribute weighting filter for naive Bayes. Knowl Based Syst 20:120–126
16. Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th international conference on machine learning. Morgan Kaufmann, Stanford, pp 359–366
17. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
18. Hu JH, Zhan DC, Wu X, Jiang Y, Zhou ZH (2015) Pairwised specific distance learning from physical linkages. ACM Trans Knowl Discov Data 9(3):20
19. Jiang L, Li C (2013) An augmented value difference measure. Pattern Recognit Lett 34(10):1169–1174
20. Jiang L, Li C, Zhang H, Cai Z (2014) A novel distance function: frequency difference metric. Int J Pattern Recognit Artif Intell 28(2):1451002
21. Kasif S, Salzberg S, Waltz D, Rachlin J, Aha D (1998) A probabilistic framework for memory-based reasoning. Artif Intell 104:287–311
22. Kira K, Rendell L (1992) A practical approach to feature selection. In: Proceedings of the 9th international conference on machine learning. Morgan Kaufmann, Aberdeen, pp 249–256
23. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
24. Lee CH, Gutierrez F, Dou D (2011) Calculating feature weights in naive Bayes with Kullback–Leibler measure. In: Proceedings of the 11th IEEE international conference on data mining. IEEE, Vancouver, pp 1146–1151
25. Li C, Jiang L, Li H (2014) Local value difference metric. Pattern Recognit Lett 49:62–68
26. Li C, Jiang L, Li H (2014) Naive Bayes for value difference metric. Front Comput Sci 8(2):255–264
27. Li C, Jiang L, Li H, Wu J, Zhang P (2017) Toward value difference metric with attribute weighting. Knowl Inf Syst 50(3):795–825
28. Li C, Li H (2011) One dependence value difference metric. Knowl Based Syst 24(5):589–594
29. Li C, Li H (2012) A modified Short and Fukunaga metric based on the attribute independence assumption. Pattern Recognit Lett 33(9):1213–1218
30. Li C, Li H (2013) Selective value difference metric. J Comput 8(9):2232–2238
31. Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, New York City
32. Myles JP, Hand DJ (1990) The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognit 23(11):1291–1297
33. Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52(3):239–281
34. Qiu C, Jiang L, Li C (2015) Not always simple classification: learning superparent for class probability estimation. Expert Syst Appl 42(13):5433–5440
35. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
36. Quinlan JR (1993) C4.5: programs for machine learning, 1st edn. Morgan Kaufmann, San Mateo
37. Robnik-Sikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn 53(1–2):23–69
38. Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Mach Learn 54:153–178
39. Sangineto E (2013) Pose and expression independent facial landmark localization using dense-SURF and the Hausdorff distance. IEEE Trans Pattern Anal Mach Intell 35(3):624–638
40. Short RD, Fukunaga K (1981) The optimal distance measure for nearest neighbour classification. IEEE Trans Inf Theory 27:622–627
41. Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228
42. Weinberger KQ, Blitzer JC, Saul LK (2006) Distance metric learning for large margin nearest neighbor classification. Adv Neural Inf Process Syst 18:1473–1480
43. Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Intell Res 6:1–34
44. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco
45. Yang L (2006) Distance metric learning: a comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University
46. Zaidi NA, Cerquides J, Carman MJ, Webb GI (2013) Alleviating naive Bayes attribute independence assumption by attribute weighting. J Mach Learn Res 14:1947–1988
Liangxiao Jiang received his Ph.D. degree from China University of Geosciences, Wuhan, China, in 2009. He is currently a professor with the Department of Computer Science, China University of Geosciences (Wuhan). His current research interests include data mining and machine learning. Since 2005, he has published over 60 refereed journal and conference papers, such as in the IEEE Transactions on Knowledge and Data Engineering, Information Sciences, Knowledge and Information Systems, Engineering Applications of Artificial Intelligence, Knowledge-Based Systems, Expert Systems with Applications, Pattern Recognition Letters, Journal of Intelligent Information Systems, AAAI, ICML, ICDM and DASFAA, in the above areas.

Chaoqun Li received her Ph.D. degree from China University of Geosciences, Wuhan, China, in 2012. She is currently an associate professor with the Department of Mathematics, China University of Geosciences (Wuhan). Her current research interests include data mining and machine learning. Since 2006, she has published over 30 refereed journal and conference papers, such as in Information Sciences, Knowledge and Information Systems, Engineering Applications of Artificial Intelligence, Knowledge-Based Systems, Expert Systems with Applications, Pattern Recognition Letters, International Journal of Pattern Recognition and Artificial Intelligence, ICANN, ICTAI and PRICAI, in the above areas.