Soft Computing https://doi.org/10.1007/s00500-018-3320-9
A bipolar knowledge representation model to improve supervised fuzzy classification algorithms

Guillermo Villarino 1 · Daniel Gómez 1 · J. Tinguaro Rodríguez 2 · Javier Montero 2

1 Facultad de Estudios Estadísticos, Universidad Complutense, Av. Puerta de Hierro s/n, 28040 Madrid, Spain
2 Facultad de Ciencias Matemáticas, Universidad Complutense, Plaza de las Ciencias 3, 28040 Madrid, Spain
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Abstract
Most supervised classification algorithms produce a soft score (a probability, a fuzzy degree, a possibility, a cost, etc.) assessing the strength of the association between items and classes. After that, each item is assigned to the class with the highest soft score. In this paper, we show that this last step can be improved through alternative procedures that are more sensitive to the available soft information. To this aim, we propose a general fuzzy bipolar approach that enables learning how to take advantage of the soft information provided by many classification algorithms in order to enhance the generalization power and accuracy of the classifiers. To show the suitability of the proposed approach, we also present some computational experiments for binary classification problems, in which its application to well-known classifiers such as random forest, classification trees and neural networks produces a statistically significant improvement in the performance of the classifiers.

Keywords Supervised classification models · Bipolar models · Machine learning · Soft information
1 Introduction

One of the most important topics in data science is classification, and particularly supervised classification tasks. In the literature, there exists a huge diversity of supervised classification algorithms, approaches and applications, depending on the specific tasks, type of data, characteristics or efficiency (Kumar and Verma 2012; Lim et al. 2000). Typically, in a supervised classification context the main aim is to be able to classify a set of items into classes based on a training sample or dataset that provides examples of the association between items and classes and that is used to train the classifiers in
order to adequately generalize the observed associations, that is, to fit the classification models to the observed data. After the training phase, classification algorithms provide a mechanism that makes it possible to classify new items (new queries or items in a test sample) into some of the known classes. In this classification task, it is important to differentiate between the internal learning process, which frequently assigns to each pair (item, class) a soft score (e.g. a probability, a fuzzy degree, a possibility) assessing the strength of the association of each item with each class, and the final decision mechanism or stage that rules how items are assigned to a single class based on these soft scores. In this sense, most classifiers just assign each item to the class that obtained the highest soft score, a decision procedure typically known as the maximum rule. However, despite its wide application, this maximum rule may well not be the optimal decision procedure in all contexts. For instance, this rule may not necessarily take advantage of all the relevant information contained in the soft scores, particularly in contexts in which classes present some kind of correlation or interdependence structure. In this work, we study how to address this issue in order to improve the adaptation of the classifiers to each specific application context and thus also their performance.

In Villarino et al. (2017), classical supervised algorithms such as CART (Breiman 1984), random forest (RF) (Breiman
2001) and neural networks (Ripley 1996; Venables and Ripley 2002) were modelled as probabilistic classifiers, providing soft probabilistic assessments of the association of items with classes. In a second step, a bipolar probabilistic representation framework was developed by allowing some opposition or dissimilarity relationships between the classes to be introduced. In a third step, the most convenient structure of dissimilarity relationships was learned through an evolutionary algorithm. This more expressive representational model and the associated learning process made it possible to improve the classification performance of the original classifiers.

Nevertheless, even assuming that the training sample is the same, the presence of many randomized and/or uncontrolled factors (such as the selection of the training sample, initialization, resampling mechanisms, random seeds, etc.) usually tends to introduce some degree of variability and imprecision in the probability estimations, suggesting that the soft scores these probabilistic classifiers provide are not actually precise, but imprecise, probabilities. Taking this observation into account, and with the aim of improving the robustness of the classification process, in this paper we model the soft information provided by probabilistic classifiers in a fuzzy way, by following an aggregation scheme over the different probability estimations obtained when the training process of the classifiers is replicated, fixing the same training sample but modifying these uncontrolled factors.

Extensive and rigorous computational experiments have been conducted in order to analyse the suitability of the proposed approach, assessing in the framework of standard classification the significance of the improvements it enables when applied on different base classifiers. Specifically, the proposed methodology is tested on 25 datasets selected from the KEEL dataset repository (Alcalá-Fdez et al. 2011), and it is supported by a proper statistical analysis, as suggested in the literature (Demsar 2006; García and Herrera 2008; García et al. 2010).

The remainder of the paper is organized as follows: Section 2 describes the preliminary concepts used throughout the work, including the differences between crisp, probabilistic and fuzzy classifiers, as well as some specific concepts regarding accuracy measures and genetic algorithms (GAs). Then, in Sect. 3, we present the main idea of bipolar knowledge representation and its use in a fuzzy classifier framework. Section 4 shows the complete two-stage (learning and decision-making) process for constructing a fuzzy bipolar classifier from a soft supervised one. Finally, the experimental framework along with the respective analysis of the results is presented in Sects. 5 and 6. We summarize the paper with the main concluding remarks in Sect. 7.
2 Preliminaries

In this preliminary section, we introduce some concepts for a better understanding of the paper. We first introduce the main concepts of crisp, probabilistic and fuzzy classifiers, as well as their differences and relationships, to motivate one of the principal contributions of this paper: the importance of modelling the soft information of a classifier before making the final decision in a classification task. Afterwards, we focus on the accuracy measures used to evaluate the classification performance reached by the algorithms, and finally a short introduction to genetic algorithms (GAs) is given to provide the proper support to our approach.
2.1 Crisp, probabilistic and fuzzy classifiers

Let us denote by {C1, . . . , Ck} the set of classes of a classification problem, and by X = {x1, . . . , xn} the set of items that has to be classified. As we have pointed out in the introduction, many classification users only take into account the final output of the classification task. This is probably because they are only interested in the final solution provided by the classifier. This is the reason why, in a general way, classifiers are usually viewed as functions

C : X −→ {C1, . . . , Ck},   (1)
that is, a procedure to assign one of the available classes to each of the items being classified. Nevertheless, the classification process goes through many steps before arriving at the final assignment, and it is in the intermediate steps that soft information usually appears as a natural way to model the information and the evidence being obtained. Particularly, it is very common that classification algorithms manage soft information for each item x ∈ X about the probability that x belongs to each of the different classes, or, in fuzzy classification models, about the degree of membership of the item x in the set of classes. Taking into account these considerations, in Villarino et al. (2017) we distinguished between crisp (classical) and probabilistic classifiers. A probabilistic classifier can be viewed as a function

C_P : X −→ [0, 1]^k,   (2)
that assigns to each item x its probability of belonging to each of the available classes. Obviously, for any x ∈ X it has to hold that $\sum_{i=1}^{k} (C_P(x))_i = 1$ because of the additivity of probability. We would like to remark that many classification algorithms (as, for example, neural networks, random forest or decision trees) could be viewed as probabilistic classifiers
if we just look at the soft information provided by the algorithms before the final decision or crisp assignment is made. It is possible to define in a similar way a fuzzy classifier as a function

C_F : X −→ [0, 1]^k,   (3)
in which now the constraint $\sum_{i=1}^{k} (C_F(x))_i = 1$ does not necessarily hold, i.e. the membership degrees $\mu_{C_i}(x) = (C_F(x))_i$ of object x in each of the classes C_i do not necessarily sum up to one. It is also possible in this case to remark that some classification techniques, as, for instance, fuzzy rule-based classification systems, may be viewed as fuzzy classifiers in the previous sense attending to the soft information they provide before the final decision step, which in this context is sometimes identified as a kind of defuzzification.
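To make the distinction concrete, the following minimal R sketch (with purely illustrative numbers, not taken from the paper) contrasts the two kinds of soft output for a single item in a three-class problem: probabilistic scores are additive, fuzzy membership degrees need not be, and in both cases the usual final decision is the maximum rule.

# Hypothetical soft outputs for one item x in a 3-class problem
p  <- c(0.45, 0.30, 0.25)   # probabilistic classifier: scores sum to 1
mu <- c(0.80, 0.60, 0.10)   # fuzzy classifier: degrees need not sum to 1
sum(p)          # 1
sum(mu)         # 1.5
# The crisp decision usually applied in both cases is the maximum rule
which.max(p)    # index of the class with the highest score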
2.2 Accuracy measures: Kappa statistic

The kappa statistic (Cohen 1960) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is formally defined as

$\kappa = \dfrac{p_0 - p_c}{1 - p_c}$   (4)
where p_0 represents the observed agreement and p_c the agreement expected by chance. The kappa statistic is used not only to evaluate a single classifier, but also to compare classifiers among themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of 75% than with an Expected Accuracy of 50%). Computing the Observed Accuracy and the Expected Accuracy is integral to the comprehension of the kappa statistic. Classifiers built and evaluated on datasets with different class distributions can be compared more reliably through the kappa statistic (as opposed to merely using accuracy) because of this scaling in relation to expected accuracy. It gives a better indicator of how the classifier performed across all instances, because a simple accuracy rate can be skewed if the class distribution is similarly skewed. Since we are dealing with imbalanced datasets and also comparing among different classifiers, this is in our opinion a more effective way of measuring the performance of the classifiers than the usual accuracy rate.
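As a minimal sketch of how Eq. (4) can be computed in practice, the following R function derives the kappa statistic from a confusion matrix of predicted versus observed classes; the confusion matrix shown is purely hypothetical.

# Cohen's kappa from a confusion matrix (rows = predicted, columns = observed)
kappa_stat <- function(conf) {
  n  <- sum(conf)
  p0 <- sum(diag(conf)) / n                       # observed agreement
  pc <- sum(rowSums(conf) * colSums(conf)) / n^2  # agreement expected by chance
  (p0 - pc) / (1 - pc)
}

# Hypothetical binary confusion matrix
conf <- matrix(c(70, 10,
                 15,  5), nrow = 2, byrow = TRUE)
kappa_stat(conf)   # approx. 0.14, while the raw accuracy is 0.75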
2.3 Genetic algorithms

Genetic algorithms (GAs) are a class of evolutionary algorithms made popular by John Holland and his colleagues
during the 1970s (Holland 1975), which have been successfully applied to find exact or approximate solutions to optimization and search problems (Goldberg 1989; Sivanandam and Deepa 2007). In a few words, GAs are stochastic search algorithms inspired by the basic principles of biological evolution and natural selection. GAs simulate the evolution of micro-organisms, where the fittest individuals dominate over the weaker ones, by mimicking the biological mechanisms of evolution, such as selection, crossover and mutation.
3 Bipolar knowledge representation

Our knowledge about our own methods of reasoning has been a fruitful source of inspiration in the development of artificial intelligence (AI) since its foundation in the middle of the last century. In fact, important areas of this discipline have been, and still are, closely concerned with the replication of some mental capacities of human beings, like playing chess or proving theorems. Similarly, expert systems and decision support systems replicate the human ability of making correct decisions in many fields of activity (as, for example, medical care, advertising, control tasks or disaster management) based on explicit knowledge about these realities. In the same way, machine learning and data mining techniques (see, e.g. Hullermeier 2005) have found wide success by trying to replicate the human ability of discovering patterns and learning from examples or experience. However, it is important to remark that this success would not have been possible without finding some inspiration in our understanding of how humans manage to solve problems in these fields, nor without the development of formal models enabling an effective representation of this knowledge.

In this sense, in this paper we find some inspiration in the idea of bipolarity, originally a psychological notion (see Osgood et al. 1957) referring to the human mind's characteristic way of dealing with information by simultaneously assigning it both a positive and a negative character, as when people make decisions by weighing up the good (pros) and bad (cons) sides of the available alternatives and choosing the one with the best balance between those two sides or poles. (It is due to the presence of such two poles that this feature of human judgement is usually referred to as bipolarity.) This ability to simultaneously deal with both positive and negative information allows our mind to represent and manage complex situations related to interests and emotions, such as ambivalence, indeterminacy or conflict.

Though the idea of bipolarity already has a long history, in the last 25 years it has attracted the attention of researchers in the fields of logics and knowledge representation. Particularly, many authors (see, for instance, Dubois and Prade 2008;
Atanassov 1999; Montero et al. 2007, 2016) have emphasized the relevance of introducing a bipolar approach into soft knowledge representation formalisms such as fuzzy logic (Zadeh 1988) or possibility theory (Dubois and Prade 2006) in order to enhance their expressive and representational power.

In consonance with these ideas, in this paper we propose to introduce a bipolar approach in the framework of classification systems. This proposal is therefore focused on replicating, in a soft classification context, the human ability to learn and reason on the basis of information with positive and negative characters. Our approach is particularly devised to adapt to those classifiers that produce in an intermediate step of their algorithmic process some kind of soft score, applying in a final decision stage the so-called maximum rule, which determines that each item is assigned to the class with the highest soft score.

However, there are some reasons to think such a maximum rule may not always be the best way to make the final assignment, since it does not necessarily take advantage of all the information contained in the soft scores. For instance, let us suppose that in a binary classification context, with just two classes, two items x and y, respectively, obtained the soft scores C(x) = (1, 0) and C(y) = (0.51, 0.49). Obviously, both x and y would be assigned to class 1, since it obtained the highest score in both cases. That is, the maximum rule would provide the same classification output for both items, despite the available evidence pointing to a clear, safe assignment in the first case and to an unclear, rather uncertain assignment in the second case. Depending on the features of the application context, perhaps a no-classification output, in which the object is left unclassified, could have been more appropriate in the second case. Maybe y should even have been assigned to class 2 if we were dealing with a quite unbalanced classification problem, in which class 2 is far less frequent (or much more relevant) than class 1.

Similarly, consider now a 3-class classification context in which an object x obtained the intermediate soft evaluation C(x) = (0.45, 0.3, 0.25). The maximum rule would clearly assign x to class 1. However, what if there are some clues that the presence of class 3 has to be taken as evidence that class 1 should be ruled out? In this case, the positive evidence for class 3 should be somehow considered as negative evidence against class 1, somehow reducing the actual positive evidence for class 1, even to the point of finally assigning x to class 2 instead of class 1.

This example illustrates the main idea of our proposal: the performance of classifiers in some application contexts may benefit from assuming that the available classes are related through a certain opposition or dissimilarity structure, informing of which classes provide negative evidence against the other classes. That is, given a class, let us say class i, some other classes j ≠ i may be opposite or dissimilar to it, in the sense that if there is positive evidence for
these classes j, then the positive evidence for class i should be decreased. In this sense, negative evidence against a given class arises from the positive evidence for the other classes together with the consideration of a particular dissimilarity structure on the set of classes. And the existence of negative evidence against a given class is taken to imply that the actual, finally relevant evidence for such a class should be somehow reduced from the initial positive evidence contained in the soft scores produced by the classifiers.

In this paper, our approach is to learn the most convenient dissimilarity structure for each classification problem in order to improve as much as possible the classifiers' performance, i.e. their predictive accuracy. In this sense, the proposal of this paper differs from that of Rodríguez et al. (2011, 2013), in which dissimilarity structures were predefined, not learned, aiming to adapt the classifiers to some decision-related requirements of the disaster management context (see also Rodríguez et al. 2012). Other differences with respect to those works are that in this paper we develop a logistic aggregation scheme to provide the final evidence scores from the positive and negative evidence degrees for each class, and also that the bipolar approach is applied to a set of well-known classifiers rather than just to a single fuzzy rule-based classifier. The representation formalism for dissimilarity structures and the way to make them operational in order to produce the negative evidence degrees from the positive ones are described in the next section.
3.1 Dissimilarity structures and bipolar knowledge representation for fuzzy classification

Some authors from different areas [such as fuzzy sets theory (Montero et al. 2007), machine learning (Fünkranz et al. 2008), aggregation operators (Amo et al. 2001; Gómez and Montero 2004; Rojas et al. 2014) or knowledge representation (Montero et al. 2016)] have stressed the importance of acknowledging and making explicit the inherent structure that underlies many problems, as it may allow a better understanding of the reality under consideration and a clearer reflection on the adequacy of our models of such realities. In this sense, in the supervised classification context it is usually assumed that classes are somewhat independent entities, and thus, there exist few models based on the assumption that the classes present some relationships based on a certain structure. However, as discussed above, the application of classifiers in various contexts may benefit from introducing some kind of structure in the set of classes, in such a way that the modelled relationships between the classes improve the adaptation of classifiers to the features and requirements of the application context and thus also their performance. This idea was effectively applied in Rodríguez et al. (2011, 2013), where a set of fuzzy classes, representing different linguistic
assessments of the consequences of natural disasters, was endowed with a (predefined) structure intended to model the natural dissimilarity and opposition relationships of such assessments. This resulted in a bipolar fuzzy classification model, in which the association between items (disaster scenarios in this case) and classes is evaluated through pairs of positive and negative membership degrees, and which allowed introducing into the classification model itself some of the decision-related requirements of the disaster management context.

In this paper, we aim to translate this bipolar approach to fuzzy classification into a more general setting, assuming the existence of a dissimilarity structure in the set of classes but allowing it to be not predefined but learnt along the classifier training process. The idea is therefore to uncover the most convenient dissimilarity structure in order to improve the classifier's predictive accuracy. Thus, we depart from a fuzzy classifier C_F in the above sense, which will be extended to allow a bipolar knowledge representation framework, with both positive and negative evidence degrees.

To this aim, we first of all interpret the output membership vector $\mu(x) = (\mu_{C_1}(x), \ldots, \mu_{C_k}(x))$ of a fuzzy classifier as the available positive evidence for the association between each class and the item x. Taking this into account, let us denote this vector in the form $\mu^+(x) = (\mu_1^+(x), \ldots, \mu_k^+(x)) = (\mu_{C_1}(x), \ldots, \mu_{C_k}(x))$. Now, we assume that those classes being dissimilar to a given class C_i jointly define an abstract class denoted by dC_i, which may be read as "dissimilar to C_i". Then, let us represent by $\mu_i^-(x)$ the membership degree of item x to that class dC_i. This degree is obviously interpreted as negative evidence, which pushes against the association of item x with class C_i, as it provides evidence that x is associated with classes dissimilar to C_i.

In order to build this negative membership degree, we need additional information stating which classes, and up to what degree, are dissimilar to each class C_i. We assume this valued dissimilarity relationship to be irreflexive (i.e. no class is dissimilar to itself) and not necessarily symmetric (i.e. a class C_j may be dissimilar to C_i without the latter being dissimilar to the former). We denote this relationship by $D = (d_{ij})$, where $d_{ij} \in [0, 1]$ represents the degree up to which class C_j is dissimilar to C_i. Therefore, any dissimilarity matrix D is in general non-symmetric and has to verify $d_{ii} = 0$ for all $i = 1, \ldots, k$. As it informs of the dissimilarity relationships among all the available classes, we consider that any such matrix D endows the set of classes {C_1, . . . , C_k} with a dissimilarity structure.

As exposed above, a natural way to obtain the negative degrees $\mu_i^-(x)$ is to consider that they arise from the application of the dissimilarity structure to the positive evidence degrees $\mu_j^+(x)$, $j \neq i$. Although this negative degree may be built in different ways, we consider in this work the following definition, where $D_i$ denotes the i-th row of the
considered dissimilarity matrix D:

$\mu_i^-(x) = \sum_{j \neq i} d_{ij}\, \mu_j^+(x) = \sum_{j=1}^{k} d_{ij}\, \mu_j^+(x) = D_i\, \mu^+(x).$   (5)
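As an illustration of Eq. (5), the following base-R sketch applies a purely hypothetical dissimilarity matrix (ours, not taken from the paper) to the 3-class scores C(x) = (0.45, 0.3, 0.25) discussed at the beginning of this section, encoding the idea that class 3 is fully dissimilar to class 1.

# Eq. (5): mu^-(x) = D mu^+(x), with an illustrative dissimilarity matrix D
mu_plus <- c(0.45, 0.30, 0.25)        # positive evidence for classes 1, 2, 3
D <- rbind(c(0, 0, 1),                # class 3 is fully dissimilar to class 1
           c(0, 0, 0),                # no class is dissimilar to class 2
           c(0, 0, 0))                # no class is dissimilar to class 3 (d_ii = 0 always)
mu_minus <- as.vector(D %*% mu_plus)  # negative evidence: (0.25, 0, 0)

Under the additive adjustment introduced in Sect. 3.2, class 1 would then retain 0.45 - 0.25 = 0.20 of evidence, so x would be reassigned to class 2, in line with the intuition discussed above.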
Once the positive and negative evidence degrees for each class and item have been obtained, leading to the respective vectors $\mu^+(x)$ and $\mu^-(x)$, it is possible to define a bipolar (fuzzy) classifier as a mapping

$C_{bip} : X \longrightarrow [0, 1]^k \times [0, 1]^k,$   (6)
assessing the positive and negative evidence, respectively, for and against the association of each item x with each class. Then, similarly to the case of probabilistic and fuzzy classifiers, if a final decision or assignment of item x to a single class has to be produced, the information in the vectors $\mu^+(x)$ and $\mu^-(x)$ has to be somehow further exploited to decide the class with a better balance between positive and negative evidence. In this paper, we propose to use an aggregation scheme to transform back the pairs $(\mu_i^+(x), \mu_i^-(x))$ into a single adjusted fuzzy degree $\mu_i^{adj}(x)$ for each class $i = 1, \ldots, k$, which represents the final soft evidence for the association between item x and class i after the initial fuzzy degrees $\mu(x) = \mu^+(x)$ have been readjusted through the action of the dissimilarity structure D. The final decision is then produced by applying the maximum rule to these adjusted fuzzy degrees, that is, the item x is assigned to the class $C_h$, $h \in \{1, \ldots, k\}$, such that $\mu_h^{adj}(x) = \max_i \mu_i^{adj}(x)$. Therefore, what is left is to establish how to aggregate the positive and negative information to obtain these adjusted degrees. In this paper, we have considered two aggregation models, which we describe in detail in the next subsection.

Before that, let us remark that the whole process we have just described, which departs from a fuzzy classifier, obtains negative evidence degrees through the consideration of a dissimilarity structure, aggregates positive and negative evidence into adjusted degrees and finally applies the maximum rule, can in fact be seen as a refinement of the maximum rule itself when it is combined with a learning process for the dissimilarity matrix D: instead of applying the maximum rule to the initial fuzzy degrees, we adjust these in such a way that hopefully the maximum rule, finally applied to the adjusted degrees, is able to make better decisions (that is, class assignments) than it would have made without the dissimilarity-driven adjustment process.
3.2 Aggregating bipolar evidence: the additive and logistic cases

Let us now address the question of how to aggregate, for a given class C_i and an item x, the pair of positive and negative
evidence degrees $\mu_i^+(x)$ and $\mu_i^-(x)$ in order to obtain a single adjusted fuzzy degree $\mu_i^{adj}(x)$. Obviously, different aggregation choices will lead to different adjusted classifiers. In this work, we have studied two different kinds of aggregation, which are defined below.

Definition 1 Let $\mu_i^+(x), \mu_i^-(x)$ be the positive and negative membership degrees of item x in class C_i. The additive adjusted membership degree of x in class C_i is defined as

$\mu_i^{add}(x) = \max\{0, \mu_i^+(x) - \mu_i^-(x)\}.$   (7)
Notice that the previous definition can be interpreted as the Lukasiewicz t-norm $W(a, b) = \max\{a + b - 1, 0\}$ applied to the positive degree and the negation of the negative degree, that is, $\mu_i^{add}(x) = W(\mu_i^+(x), n(\mu_i^-(x)))$, where n stands for the standard negation $n(a) = 1 - a$; indeed, $W(\mu_i^+(x), 1 - \mu_i^-(x)) = \max\{\mu_i^+(x) - \mu_i^-(x), 0\}$. Therefore, the additive aggregation $\mu_i^{add}(x)$ assesses the degree up to which the association of x with class C_i gathers positive and not negative evidence, and does so by balancing the positive and negative degrees in an additive fashion. In this way, the positive evidence $\mu_i^+(x)$ initially provided by the fuzzy classifier is adjusted by subtracting from it the negative evidence $\mu_i^-(x)$. Particularly, the initial degrees are not modified when no class is dissimilar to C_i, that is, when $D_i = 0$. Thus, an adjusted degree $\mu_i^{add}(x) > 0$ represents the existence of a positive gap between the support for class C_i and the support for class dC_i, that is, for the classes considered dissimilar to C_i. In this situation, the strength of the association of item x with class C_i may have been reduced from its initial assessment, but it is still perfectly possible that item x is finally assigned to C_i. On the other hand, a zero value of $\mu_i^{add}(x)$ represents a situation in which there exists more evidence for the dissimilar class dC_i than for C_i, and thus, the adjusted classifier should not assign the item to class C_i.

In the following definition, we propose an alternative way to aggregate the positive and negative information into a single adjusted degree.

Definition 2 Let $\mu_i^+(x), \mu_i^-(x)$ be the positive and negative membership degrees of item x in class C_i. The logistic adjusted membership degree of x in class C_i is defined as
$\mu_i^{log}(x) = \begin{cases} 1 - e^{-\mu_i^+(x)/\mu_i^-(x)} & \text{if } \mu_i^-(x) > 0 \\ 1 & \text{otherwise} \end{cases}$   (8)
Unlike the additive logic of the previous aggregation, this logistic aggregation focuses on the ratio between positive and negative information, adjusting it to range in the [0,1] interval through a logistic-type transformation. This permits a somewhat more flexible behaviour of the adjusted degrees, in the sense that the choice of the dissimilarity
matrix D may have an even greater influence on the adjustment of the initial positive evidence provided by the base fuzzy classifier, up to the point that $\mu_i^{log}(x) = 1$ whenever no evidence is gathered for the dissimilar class dC_i, that is, when $\mu_i^-(x) = 0$. For this reason, this logistic aggregation is not well suited to work with extreme dissimilarity structures in which some of the $d_{ij}$ are 0. However, if this is not the case (and the dissimilarity learning algorithm will rapidly make sure it is not), this increased flexibility may allow the adjusted classifier to effectively refine the initial maximum rule assignments in a wider range of situations than the previous additive aggregation.

As mentioned above, once one of these two aggregation methods has been applied and the adjusted fuzzy degrees $\mu_i^{adj}(x)$ have been obtained for each class (either $\mu_i^{adj}(x) = \mu_i^{add}(x)$ or $\mu_i^{adj}(x) = \mu_i^{log}(x)$), the final decision on the classification of item x is made by applying the maximum rule to such adjusted degrees. Therefore, the item x is finally assigned to the class $C_h$ with a maximum adjusted degree $\mu_h^{adj}(x)$, that is, $h = \arg\max_{i \in \{1, \ldots, k\}} \mu_i^{adj}(x)$.

Summarizing the above process in the terms stated in Sect. 3.1, let us remark that the proposed approach consists in first transforming a fuzzy classifier into a bipolar one by means of the consideration of a dissimilarity structure, which is later transformed back into an adjusted fuzzy classifier through the aggregation of the bipolar evidence degrees. Finally, this adjusted fuzzy classifier is defuzzified through the maximum rule in order to obtain a usual crisp classifier. Particularly, given a fuzzy classifier, a dissimilarity matrix and a procedure to aggregate the positive and negative information, this final crisp classifier is fully determined.

In order to show how a fuzzy classification output is modelled in a bipolar way through a particular dissimilarity matrix, and how these bipolar scores are aggregated into an adjusted fuzzy score, we present the following example.

Example 1 Given a classification problem with a two-class target $C = (C_1, C_2)$, let us suppose that we have an item x with membership degrees $\mu(x) = (0.6, 0.4)$ after the training process of a fuzzy classifier. In a classical classification scheme, using the maximum rule, this item would be assigned to class $C_1$. After applying the ideas described in this section, it is possible to build the positive ($\mu^+(x) = \mu(x)$) and negative degrees from a dissimilarity matrix. Let us consider a (non-symmetric) situation in which class $C_2$ is highly dissimilar to class $C_1$ but not conversely. This situation can appear in many real settings, such as unbalanced classification problems or algorithms with non-symmetric classification errors, among many others.
If we define the dissimilarity matrix D with $d_{12} = 1$ and $d_{21} = 0.1$, we can calculate the negative evidence as follows:

$\mu^-(x) = D\mu^+(x) = (d_{12}\,\mu_2^+(x),\; d_{21}\,\mu_1^+(x)) = (0.4, 0.06).$
Let us observe that the negative evidence for this item x is greater for class $C_1$ than for $C_2$. This happens because class $C_2$ is strongly dissimilar to class $C_1$, and also because the original fuzzy score of class $C_2$ (0.4) is not much smaller than that of class $C_1$ (0.6). So the bipolar membership degrees of these two classes are

$\mu_1^{bip}(x) = (\mu_1^+(x), \mu_1^-(x)) = (0.6, 0.4)$
$\mu_2^{bip}(x) = (\mu_2^+(x), \mu_2^-(x)) = (0.4, 0.06).$
Once the bipolar membership degrees have been calculated, we can apply the two aggregation operators shown above to get the adjusted membership degrees

$\mu^{add}(x) = (0.2, 0.34)$
$\mu^{log}(x) = \left(1 - \exp\left(-\tfrac{0.6}{0.4}\right),\; 1 - \exp\left(-\tfrac{0.4}{0.06}\right)\right) = (0.77, 0.99)$

and their respective class assignments,

$C_h^{add}(x) = C_h^{log}(x) = C_2.$

Thus, the considered dissimilarity structure between the classes changes the final class assignment from $C_1$ to $C_2$ under both aggregations.
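For readers who want to check the numbers, a minimal base-R sketch reproducing the computations of Example 1 is given below; all values are exactly those of the example.

# Example 1: bipolar adjustment of the fuzzy degrees mu(x) = (0.6, 0.4)
mu_plus <- c(0.6, 0.4)
D <- rbind(c(0,   1),                   # d12 = 1: class C2 is fully dissimilar to C1
           c(0.1, 0))                   # d21 = 0.1
mu_minus <- as.vector(D %*% mu_plus)    # (0.4, 0.06)

mu_add <- pmax(0, mu_plus - mu_minus)   # additive adjustment: (0.2, 0.34)
mu_log <- ifelse(mu_minus > 0,
                 1 - exp(-mu_plus / mu_minus),   # logistic adjustment
                 1)                               # (0.777, 0.999), reported as (0.77, 0.99)
which.max(mu_add)   # 2 -> class C2
which.max(mu_log)   # 2 -> class C2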
4 Fuzzy bipolar classifier from a supervised classifier

Most (if not all) supervised classification algorithms can be viewed as a two-stage process, in which the first stage is aimed at training the classifier and producing some kind of soft scores assessing the strength of association between an item and the available classes. The second stage is reserved for making the final decision or class assignment on the basis of the previous stage's soft scores. From this point of view, in this section we describe the proposed complete two-stage classification process, showing the modifications to be introduced in each of these two stages in order to get a fuzzy bipolar classifier from a soft one.

Figure 1 illustrates the previous idea that a standard soft classifier can be viewed as a two-step procedure (learning and decision). In the first step, the classifier uses the training data to develop a mechanism that permits evaluating each item in a soft way. In the second step (final decision), the final classification is carried out based on these soft scores using the maximum rule. In this paper, we propose to modify this second step, transforming any fuzzy classifier into a bipolar one (see Sect. 4.2) following the ideas described in Sect. 3. In addition, just in case the soft classifier is not already a fuzzy one, we also provide a mechanism to transform any probabilistic classifier into a fuzzy one during the training step (see Sect. 4.1).

[Fig. 1 Flow diagram of a standard soft classifier: a training stage (S1), in which the classifier produces a soft score assessing the association between each item and each class, followed by a decision-making stage (S2), in which the maximum rule assigns the item to the class with the highest score.]
4.1 Robustness-enhancing fuzzification of a probabilistic classifier

As mentioned above, there are many uncontrolled factors (seeds, randomness, initialization, …) that can influence the development and performance of a supervised classification process once the training dataset has been fixed. So, even for the same training dataset, the estimation of the different probabilities $C_P(x) = (C_P(x)_1, \ldots, C_P(x)_k)$ that a probabilistic classifier provides for each object x ∈ X in the training dataset could vary depending on the particular realization of the algorithm in the training step. Taking this consideration into account, when statistical replication is applied to enhance the robustness of the classification process of a probabilistic classifier, it is possible to interpret the imprecision arising in the probability estimation as a fuzzy set. Figure 2 shows a flow diagram of the proposed training stage (S1). Let us note that, in case our classifier is already a fuzzy one, this stage entails no particular changes with respect to the usual training phase.

Here we provide a procedure to accomplish this robustness-enhancing fuzzification of a probabilistic classifier. We assume that the estimation of the probabilities $C_P(x) = (C_P(x)_1, \ldots, C_P(x)_k)$ is somehow replicated a number of times (for instance, using different seeds to get different bootstrap samples of the training dataset, or allowing a different initialization of the algorithm) in order to obtain an assessment of the involved estimation imprecision. Let L denote the number of replications applied, and let $C_P^l(x) = (C_P^l(x)_1, \ldots, C_P^l(x)_k)$ denote the probability distribution on the different classes obtained in the l-th realization of the probabilistic classifier.
[Fig. 2 Flow diagram of the proposed training stage (S1): if the base classifier φ is already a fuzzy classifier, it is trained in the usual way; otherwise, the probabilistic classifier is trained and converted into a fuzzy one. The process then moves on to the decision-making stage (S2).]
The idea is then to aggregate the L different probability distributions obtained after replication in order to obtain the membership degrees of item x in each class. Three aggregation operators have been considered in this work to carry out this step: the maximum (which could be considered as a possibilistic measure in the sense of Dubois and Prade 2002), the minimum and the arithmetic mean (which could be considered as a more precise estimation of the real probability of the different classes). Formally, let φ denote any of these three aggregation operators. Then, the robust membership degrees of item x in class C_i are obtained as

$\mu_{C_i}(x) = \phi(C_P^1(x)_i, \ldots, C_P^L(x)_i).$   (9)
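The following minimal R sketch illustrates this aggregation step (Eq. 9); the matrix of replicated probability estimates is purely hypothetical (L = 5 replications, k = 2 classes for one item x).

# Aggregating L replicated probability estimates per class (Eq. 9)
probs <- rbind(c(0.62, 0.38),
               c(0.58, 0.42),
               c(0.65, 0.35),
               c(0.60, 0.40),
               c(0.55, 0.45))          # row l holds C_P^l(x)

mu_max  <- apply(probs, 2, max)        # possibilistic-style upper envelope
mu_min  <- apply(probs, 2, min)        # conservative lower envelope
mu_mean <- apply(probs, 2, mean)       # averaged probability estimate
sum(mu_max)                            # 1.10: the aggregated degrees need not sum to one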
Thus, a fuzzy classifier C_F in the sense exposed in Sect. 2.1 is obtained after the aggregation of the different replicated probability distributions.

Let us conclude this section with some remarks. First, there exist many ways to obtain fuzzy classifiers from the previous point of view. In this work, we have produced this fuzzy classification from a probabilistic one due to the natural imprecision that many probabilistic algorithms present when they estimate probabilities. Also, it is important to mention that the number of replications L increases the computational complexity of the problem. In this work, we have fixed L = 5. Values greater than L = 5 would increase the robustness of the process, but would make the computational cost harder to afford. Finally, let us observe that when no replication is performed, or a totally deterministic probabilistic algorithm is used, the resulting classifier will be the same regardless of the aggregation operator being used. This case then coincides with the probabilistic bipolar approach developed in Villarino et al. (2017).
[Fig. 3 Flow diagram of the proposed decision-making stage (S2): take a dissimilarity matrix D; apply the bipolar knowledge representation to obtain the pairs (μi+(x), μi−(x)); aggregate the positive and negative evidences into an adjusted membership degree and apply the maximum rule; evaluate accuracy with the kappa metric over the GA iterations; choose the best dissimilarity matrix D*; finally, carry out the test set evaluation and final comparisons.]
4.2 Bipolar decision-making: learning the dissimilarity matrix

It is clear that the dissimilarity matrix D plays a crucial role in this classification scheme, since it determines how the negative evidence is derived from the initial fuzzy evidence for each class. As a consequence, the performance of the resulting crisp classifier, as well as the effect of incorporating the bipolar representation framework and the aggregation method, is absolutely dependent on the choice of the matrix D. Figure 3 shows a flow diagram of the proposed decision-making stage (S2), including the genetic search of the dissimilarity structure and its application to the test set.

Ideally, in real situations the adequate structure of dissimilarity between classes should be proposed by application domain experts based on their knowledge. However, in many cases it may be more practical to obtain the matrix D through a learning process carried out once the base fuzzy classifier has been trained. When this learning process is driven by a measure of performance focused on the generalization accuracy of the adjusted crisp classifier, the resulting matrix will tend to fix some of the misassignments committed by the base classifier on the training sample, hopefully also improving its predictive accuracy on new queries or a test
sample. Therefore, this learning approach allows any fuzzy classifier to benefit from introducing a dissimilarity structure in the set of classes, helping the decision rule of the classifiers to better adapt to the specific features of each dataset or application context. Here we propose that the learning process of the dissimilarity matrix D is performed by means of a genetic algorithm (GA). Regarding the specificities of applying a GA to fine-tune the dissimilarity structure of the set of classes, the initial population of dissimilarity matrices is randomly generated subject to the theoretical constraints exposed in Sect. 3.1, except for two individuals that are, respectively, set to represent the two extreme cases of dissimilarity matrices: a zero matrix ($d_{ij} = 0$ for all $i, j = 1, \ldots, k$) representing a null structure in which no classes are dissimilar, and a full-opposition matrix ($d_{ij} = 1$ for all $i \neq j$) that leads to considering all classes as fully dissimilar. A real-valued coding scheme is used to represent this population, and the individuals' fitness is evaluated through the kappa statistic (Cohen 1960) of the predicted and observed classes. The specific parameters of the applied GA are given in Sect. 5.

Particularly, GAs have been effectively applied in many fuzzy rule-based classification systems (FRBCSs, see Cordon et al. 1999) in order to achieve a fine-tuning of the membership functions of the independent variables (see, e.g. Alcala-Fdez et al. 2011) or the weights of the fuzzy rules (Ishibuchi et al. 2005), as well as to select the rules that form the final rule base (Alcalá et al. 2007). Thus, a first difference of our proposal from these previous usages of GAs in the fuzzy classification context is that the fine-tuning of the dissimilarity matrix rather affects the aggregation and decision-making stages of the classifiers and would not modify at all any membership functions, rule weights or the set of rules a base FRBCS may provide. A second (and in our opinion more important) difference lies in the fact that our proposed approach may be applied to any kind of fuzzy classifier, in the wide sense given in Sect. 2.1 (and not only to FRBCSs, as is the case of the different applications of GAs just referenced), since it only relies on the final fuzzy scores $C_F(x)$ that will always be provided by any fuzzy classifier, regardless of the specific features of the knowledge representation structure or the algorithm used in the intermediate steps of the classifier.

This independence of our dissimilarity-driven approach from the structure and intermediate processes of the base fuzzy classifiers is particularly important since fuzzy classifiers may be obtained in several ways. For instance, as just described in Sect. 4.1, the imprecision in the estimation of probabilities that arises when applying any kind of statistical replication (e.g. bootstrapping) to a probabilistic classifier is usually addressed through aggregation schemes in order to increase the robustness of the classification process and may naturally lead to a
fuzzy modelling of evidence, since particularly the obtained soft scores do not necessarily have to sum up to one.
5 Experimental framework

This section is devoted to presenting a computational experience aimed at assessing the performance of our dissimilarity-based bipolar knowledge representation approaches (additive and logistic) when applied on recognized classifiers such as CART (Breiman 1984), random forest (Breiman 2001) and neural networks (Ripley 1996; Venables and Ripley 2002). Although, as noted in Sect. 2.1, the intermediate soft scores provided by these classifiers are of a probabilistic nature, we make use of the replication-plus-aggregation scheme described in Sect. 4.1 to transform them into robust fuzzy classifiers, upon which the proposed bipolar approach will then be applied. In this work, we restrict ourselves to testing the proposed approach in a binary classification setting, that is, when the set of classes is formed by just k = 2 classes. This setting still allows us to validate the suitability of our proposal while at the same time significantly reducing the computational effort required by the experiments.

This section is organized as follows: we first explain the details of the specific experimental setting followed in this work (Sect. 5.1), and next we present the real-world classification datasets selected for the experimental study (Sect. 5.2). In the third and fourth subsections, we briefly describe the different classifiers (Sect. 5.3) and the configuration of the GA (Sect. 5.4) that will be used in the experimental analysis. Finally, we present the statistical analysis methodology used to evaluate the significance of the results (Sect. 5.5). The results of this computational experience will be described and discussed in Sect. 6.
5.1 Experimental setting details

As just mentioned, the base classifiers used in this experiment are CART, random forest (RF) and neural networks (NNs). This experience is designed to compare the benchmark performance of these classifiers, as well as of the robust fuzzy classifiers obtained through replication and aggregation of those base classifiers, with that of the classifiers obtained from the latter by means of the proposed dissimilarity learning process and the additive and logistic adjustments. We compute different results for each of the three aggregation operators (max, min and mean, as introduced in Sect. 4.1) used on the replicated base classifiers. Thus, we are actually carrying out 9 different experiments (3 base classifiers times 3 aggregation operators), in each of which the performances of 4 classifiers (base classifier, robust fuzzified classifier, additive bipolar classifier and logistic bipolar classifier) are compared using several datasets. The objective is
thus to uncover whether the proposed dissimilarity-driven adjustments enable an improvement of the performance of the base and fuzzified classifiers, under different aggregation conditions and for different base classifiers.

The results for each classifier in each experiment will be obtained following a fivefold cross-validation scheme for each dataset. In each folder, that is, for each training set, the optimal base classifier parametric configuration (complexity parameter cp for CART, number of randomly selected features mtry in the case of RF, and number of neurons size or weight decay parameter decay for NNet) is approximated using a grid P on the space of parameters of the algorithms considered. In order to evaluate the performance of each specific parametric configuration p ∈ P, 25 bootstrap samples of the training set are generated, in such a way that the base classifiers are fit to each of these bootstrap samples and then tested on a hold-out sample (composed of the non-selected instances in the bootstrapping process) using the kappa statistic. The classifiers' performance for each configuration p (in each folder of each dataset) is then computed by averaging the kappa statistics obtained in each hold-out sample, and thus, the parametric configuration p to be finally applied in the current folder is the one with the maximum average kappa. This optimal parametric configuration classifier is then fit to the whole training set and later applied to the test set, providing vectors of probabilities $C_P(x)$ for each item x in the training and test sets of the folder. Notice that these vectors may be exploited through the maximum decision rule to get crisp class assignments, from which train and test accuracy rates of the base probabilistic classifiers for the current folder may be obtained.

We replicate this parametric search training step L = 5 times in each folder using different seeds to draw the bootstrap samples, thus providing replicated probability estimations $C_P^l(x)$, $l = 1, \ldots, L$, for each object x in the training and test sets. We then apply an aggregation operator φ (either maximum, minimum or arithmetic mean) to these replicated probabilities, leading to the degrees $\mu_{C_i}(x) = \phi(C_P^1(x)_i, \ldots, C_P^5(x)_i)$ of the resulting aggregated fuzzy classifier C_F, assessing the membership of each train and test item x to each class $i = 1, \ldots, k$. As before, these aggregated fuzzy degrees may then be exploited through the maximum rule to provide crisp class assignments and compute both train and test accuracy rates of the fuzzy aggregated classifiers.

In each folder, the genetic dissimilarity learning process is carried out departing from the vectors of aggregated degrees μ(x) of the items x in the training sample and then searching for an optimal dissimilarity matrix D so that the resulting adjusted degrees maximize the kappa statistic on the training sample. After such an optimal matrix is learnt, it is applied to the aggregated fuzzy classifiers to obtain adjusted degrees
$\mu_i^{adj}(x)$ for each class i and for each item x in the training and test samples. These adjusted degrees can then be exploited through the maximum rule to provide crisp class assignments and compute both train and test accuracy rates of the bipolar adjusted classifiers. A different genetic search is conducted for each of the adjustment procedures, additive and logistic, and thus, each method provides its own results.

The train and test performance measures of each of the 4 classifiers in each dataset considered in each experiment are finally computed by, respectively, averaging the train and test accuracy rates of the F = 5 different folders. For clarity, the different steps of the described experimental setting for each dataset and probabilistic base classifier are summarized in Algorithm 1. Let us remark that the procedure of fitting the probabilistic classifier to each dataset has been implemented using the caret package (see Kuhn 2008) of the R software. Particularly, S = 25 bootstrap samples and L = 5 replications are the default options of this package. Regarding the space of parameters considered, it depends on the available options for each classifier in the specific implementations used to fit the model to the training set.

Algorithm 1
 1: Select the number of folders F, replications L, bootstrapped samples S and the grid P of parametric configurations for training the base probabilistic classifier C_P
 2: for each folder f ∈ F do
 3:   for each replication l ∈ L do
 4:     procedure Fit the probabilistic classifier to the training set(P, S)
 5:       for each parametric configuration p ∈ P do
 6:         for each sample s ∈ S do
 7:           Hold out specific samples
 8:           Fit the classifier to the remaining samples
 9:           Predict the hold-out samples
10:         end for
11:         Calculate the average performance across the S hold-out samples
12:       end for
13:       Determine the optimal parametric configuration
14:       Fit the classifier to the whole training dataset using this optimal configuration
15:       Obtain the probabilistic scores C_P^l(x)
16:     end procedure
17:   end for
18:   Apply the aggregation operators φ to the L probability vectors C_P^l(x)
19:   Obtain the aggregated fuzzy degrees C_F(x)
20:   for each adjustment method ∈ {additive, logistic} do
21:     Run the GA to learn the dissimilarity structure on the training set
22:     Apply the resulting matrix D to the scores C_F(x) and obtain the bipolar degrees C_bip(x)
23:     Apply the adjustment method to obtain the corresponding fuzzy adjusted degrees μ^adj(x)
24:   end for
25: end for
26: Calculate the train and test average performance across the F folders for the four classifiers using the maximum rule
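As an illustration of how this per-folder tuning step can be set up with caret, the hedged R sketch below trains CART with S = 25 bootstrap resamples and the kappa metric; the data frames train_df and test_df (with a factor column Class) and the cp grid are illustrative assumptions rather than the paper's actual settings.

# Illustrative per-folder parametric search with caret (assumed data: train_df, test_df)
library(caret)

ctrl <- trainControl(method = "boot", number = 25,   # S = 25 bootstrap samples
                     classProbs = TRUE)

fit_cart <- train(Class ~ ., data = train_df,
                  method = "rpart",                  # CART base classifier
                  metric = "Kappa",                  # hold-out performance measure
                  trControl = ctrl,
                  tuneGrid = data.frame(cp = c(0.001, 0.01, 0.05)))  # illustrative grid

# Probabilistic scores C_P(x) for the training and test items
p_train <- predict(fit_cart, newdata = train_df, type = "prob")
p_test  <- predict(fit_cart, newdata = test_df,  type = "prob")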
5.2 Datasets

We have selected a benchmark of 25 datasets from the KEEL dataset repository (Alcalá-Fdez et al. 2011), which are publicly available on the corresponding web page, including general information about them, the folder partitions for the validation of the experimental results and so on. Particularly, we have used the five-folder cross-validation datasets provided by KEEL in the different experiments. Therefore, the considered datasets are already split into 5 random partitions of data, each one with 20% of the patterns; in each folder step, 4 of them are combined to form the training dataset (thus containing 80% of the items), and the remaining partition is used as the test dataset.

Table 1 summarizes the properties of the selected datasets, showing for each dataset the number of examples (#Ex.), the number of attributes (#Atts.), their type (Real/Integer/Nominal) and the imbalance ratio (IR) once the dataset has been transformed into a binary classification problem. In order to convert a multi-class dataset into an imbalanced two-class (C0, C1) dataset, we have taken as class C1 the original class closest to 20% of the instances and as class C0 the union of the remaining classes. We must point out that many of the considered datasets present a high IR because of their original multi-class imbalanced distribution. We consider this an opportunity to assess our bipolar approaches when dealing with imbalanced datasets.
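As a small illustration of this binarization, the following R sketch (ours, using a synthetic class vector) selects as C1 the original class whose relative frequency is closest to 20% and computes the resulting imbalance ratio.

# Binarize a multi-class target and compute the imbalance ratio (IR)
binarize <- function(y) {
  freq <- prop.table(table(y))
  c1   <- names(which.min(abs(freq - 0.20)))   # class closest to 20% of the patterns
  factor(ifelse(y == c1, "C1", "C0"))
}
imbalance_ratio <- function(y2) max(table(y2)) / min(table(y2))

set.seed(1)
y  <- factor(sample(c("a", "b", "c"), 200, replace = TRUE, prob = c(0.55, 0.25, 0.20)))
y2 <- binarize(y)
imbalance_ratio(y2)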
5.3 Algorithms considered as base classifier

We have selected three well-known classifiers for this experiment. On the one hand, we consider the CART (Breiman 1984) algorithm because of its simplicity; on the other hand, we choose a more complex rule-based classifier such as random forest (Breiman 2001), as well as one of the most powerful machine learning parametric algorithms, neural networks (Ripley 1996; Venables and Ripley 2002). Let us briefly describe the main features of these classification methodologies.

A CART tree is a binary decision tree that is constructed by repeatedly splitting a node into two child nodes, beginning with the root node that contains the whole training sample. The basic idea of tree growing is to choose a split among all the possible splits at each node so that the resulting child nodes are the 'purest' in terms of an impurity measure such as the information criteria or the Gini index, which we have used in this work.

Random forest is one of the most competitive tree-based ensemble classifiers. It uses a similar but improved method of bootstrapping as bagging and can be considered an improved version of bagging. In training, the random forest algorithm creates multiple CART-like trees, each trained on a bootstrapped sample of the original training data, and searches only across a randomly selected subset of the input variables to determine a split (for each node).

A neural network is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems and, like people do, they learn by example. These algorithms are widely used in many applications such as pattern recognition or data classification, through a learning process.

5.4 Genetic algorithm details

Finally, regarding the GA used in the evolutionary tuning of the dissimilarity structures, we have used the default GA for real-coded chromosomes implemented in the genalg R package. It is a standard GA, with usual crossover and mutation operators, the details of which can be consulted in Willighagen (2005). The GA has been run with the following configuration, which provided satisfying solutions in a feasible amount of time:

– Population size: 50 individuals.
– Number of iterations: 20.
– Mutation chance: 0.01.
– Elitism: about 20% of the population size.
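To make the search concrete, here is a hedged R sketch using the genalg package with the configuration above and the additive adjustment of Sect. 3.2. The inputs mu_plus (an n × k matrix of aggregated fuzzy degrees on the training items) and y_train (the observed class indices) are assumed, and both the mapping of the chromosome onto the off-diagonal cells of D and the way the best individual is extracted from the rbga result reflect our reading of the package rather than anything prescribed by the paper.

# Genetic search of the dissimilarity matrix D (assumed inputs: mu_plus, y_train)
library(genalg)

k <- ncol(mu_plus)

kappa_stat <- function(pred, obs) {                 # same statistic as in Sect. 2.2
  conf <- table(factor(pred, levels = 1:k), factor(obs, levels = 1:k))
  n  <- sum(conf)
  p0 <- sum(diag(conf)) / n
  pc <- sum(rowSums(conf) * colSums(conf)) / n^2
  (p0 - pc) / (1 - pc)
}

adjust_additive <- function(chrom) {                # chromosome -> adjusted degrees
  D <- matrix(0, k, k)
  D[row(D) != col(D)] <- chrom                      # off-diagonal cells (d_ii = 0), column-major order
  mu_minus <- mu_plus %*% t(D)                      # row-wise mu^-(x) = D mu^+(x)
  pmax(mu_plus - mu_minus, 0)
}

eval_D <- function(chrom) {                         # genalg minimizes, so return -kappa
  pred <- max.col(adjust_additive(chrom))
  -kappa_stat(pred, y_train)
}

ga <- rbga(stringMin = rep(0, k * (k - 1)), stringMax = rep(1, k * (k - 1)),
           popSize = 50, iters = 20, mutationChance = 0.01, elitism = 10,
           evalFunc = eval_D)
best_chrom <- ga$population[which.min(ga$evaluations), ]   # learned dissimilarity entries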
5.5 Statistical test for performance comparison

In this paper, we use some hypothesis validation techniques in order to give statistical support to the analysis of the results. We use nonparametric tests because the initial conditions that guarantee the reliability of the parametric tests cannot be fulfilled, which implies that the statistical analysis may lose credibility when parametric tests are used in this context (Demsar 2006). Specifically, we employ the Wilcoxon rank test (Wilcoxon 1945) as a nonparametric statistical procedure for making pairwise comparisons between two algorithms. For multiple comparisons, we use the Friedman aligned ranks test, which is recommended in the literature (Demsar 2006; García and Herrera 2008; García et al. 2010) to detect statistical differences among a group of results. Finally, the Holm post hoc test (Holm 1979) has been used to find the algorithms that reject the equality hypothesis with respect to a selected control method.

The post hoc procedure allows us to know whether a hypothesis of comparison of means could be rejected at a specified level of significance α. Furthermore, we compute the adjusted p value (APV) in order to take into account that multiple tests are conducted.
Table 1 Summary description for the employed datasets

Id.   Dataset           #Ex.    #Atts.  (R/I/N)     IR
Car   Car               159     25      (15/0/10)   3.5
Con   Contraceptive     1473    9       (6/0/3)     3.43
Eco   ecoli             336     7       (7/0/0)     3.34
Fla   flare             1066    25      (15/0/10)   2.58
Gla   Glass             214     9       (9/0/0)     2.05
Lin   Lymphography      148     18      (3/0/15)    1.43
Shu   Shuttle           2175    9       (0/9/10)    5.44
Thy   Thyroid           720     21      (6/0/15)    18.1
Yea   Yeast             1484    8       (8/0/0)     5.1
Aus   Australian        690     14      (3/5/6)     1.25
Bal   Balance           625     4       (4/0/10)    1.19
Bup   Bupa              345     6       (1/5/0)     1.38
Ger   German            1000    20      (0/7/13)    2.33
Hay   Hayes-Roth        160     4       (0/4/0)     3.76
Pen   Penbased          10992   16      (0/16/0)    8.57
Pag   Page-blocks       5472    10      (4/6/0)     14.57
Sah   Saheart           462     9       (5/3/1)     1.89
Veh   Vehicle           846     18      (0/18/0)    2.93
Wis   Wisconsin         699     9       (0/9/0)     1.83
Wnq   Winequality-red   1599    11      (11/0/0)    6.97
Pim   Pima              768     8       (8/0/0)     1.88
Nty   Newthyroid        215     5       (4/1/0)     5.14
Hab   Haberman          306     3       (0/3/0)     2.81
Der   Dermatology       366     34      (0/34/0)    4.07
App   Appendicitis      106     7       (7/0/0)     4.25
6 Experimental results

This section presents the results of the computational experience described above, carried out to study the extent to which our bipolar adjusted classifiers enhance both the reference base classifiers and the fuzzy classifiers obtained by the aggregation process, to which the proposed final decision tuning method is applied.
Since three different aggregation functions have been considered, the results are shown separately for each of them. Thus, there are three groups of results, depending on the aggregation method used to obtain the fuzzy information from the replicated probabilistic classifiers; for each group we provide a main table of classification kappa metrics together with the corresponding statistical test information. Results are grouped, for each base algorithm, in pairs for training and test, where the best global result for each considered dataset is stressed in bold face; none is stressed in case of ties. The experimental study has been carried out using R. Specifically, we used the caret package (Kuhn 2008) to train the classifiers, fitting them through the underlying packages rpart, randomForest and nnet, and the genalg package (Willighagen 2005) to run the GA. All the analyses presented in this paper were run on an AMD A10-6700 3.94 GHz computer with 8 GB RAM under Windows 8.1.
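The following sketch (again, a simplified illustration rather than the exact experimental pipeline) shows how the three base classifiers can be fitted with caret and how the soft scores and the kappa metric are obtained; a binary version of the built-in iris data stands in for the datasets of Table 1.

```r
library(caret)

# Illustrative binary classification data (stand-in for the benchmark datasets).
df <- iris
df$Class <- factor(ifelse(df$Species == "setosa", "setosa", "other"))
df$Species <- NULL
set.seed(1)
idx <- sample(nrow(df), 100)
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# Base probabilistic classifiers fitted through caret's interfaces to the
# underlying packages rpart, randomForest and nnet.
fit_cart <- train(Class ~ ., data = train_df, method = "rpart")                # CART
fit_rf   <- train(Class ~ ., data = train_df, method = "rf")                   # random forest
fit_nnet <- train(Class ~ ., data = train_df, method = "nnet", trace = FALSE)  # neural network

# Soft scores (class probabilities) on the test items: these are the values that
# are later aggregated into the fuzzy classifier and tuned by the bipolar method.
probs_cart <- predict(fit_cart, newdata = test_df, type = "prob")

# Hard predictions and Cohen's kappa of a base classifier on the test set.
pred_cart <- predict(fit_cart, newdata = test_df)
confusionMatrix(pred_cart, test_df$Class)$overall["Kappa"]
```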
6.1 Maximum aggregation function

We can observe from Tables 2, 3 and 4 the generally good behaviour of the bipolar tuning method, at least for one of the bipolar adjustment methods, since it allows improving the performance of both the reference algorithms and the fuzzy aggregated classifiers. When the bipolar method is applied to the CART classifier, as shown in Table 2, the results can be interpreted as follows:

– There is an improvement in mean kappa of 0.004 when comparing the aggregation against the randomly selected reference, and of 0.01 in the case of both the additive and logistic bipolar approaches.
– Both the additive and logistic bipolar classifiers outperform the remaining approaches in 7 datasets.
– The reference wins in 4 out of 25 datasets and the aggregation does so in 2 of them.
– There are ties in 10 out of 25 datasets.

In summary, we have reached improvements or ties in 21 out of 25 datasets when comparing only against the reference, and in 19 out of 25 when also comparing against the aggregation. Both bipolar approaches behave very similarly when they are applied to the CART algorithm.

Regarding the bipolar method applied to the RF classifier, Table 3 shows the results, whose behaviour can be briefly described as follows:

– There is an improvement in mean kappa of 0.006 when comparing the additive bipolar model against the randomly selected reference and of 0.01 when comparing it against the fuzzy aggregation.
– The additive bipolar classifier outperforms the remaining approaches in 6 datasets and the logistic one does so in 9 of them.
– The reference wins in 8 out of 25 datasets and the aggregation does so in 2 of them.
– There are ties in 3 out of 25 datasets.

Thus, we have reached improvements or ties in 17 out of 25 datasets when comparing only against the reference and in 15 out of 25 when also comparing against the aggregation. It is important to note the variable behaviour of the logistic bipolar method in this case: despite being the winning method in a number of datasets, its mean is not the best because of the lower kappa values obtained in several of the remaining datasets. This could be explained by the extreme character of this kind of aggregation function, as explained in Sect. 3.2.

Considering the NNet classifier, the bipolar method reaches the results shown in Table 4, which can be interpreted as follows:

– There is an improvement in mean kappa of 0.002 when comparing the aggregation against the randomly selected reference, and of 0.034 and 0.038 in the case of the additive and logistic bipolar approaches, respectively.
– The additive and logistic bipolar classifiers outperform the remaining approaches in 7 and 9 datasets, respectively.
– The reference wins in 4 out of 25 datasets and the aggregation does so in 6 of them.
– There is only one tie in these results.

On balance, we have reached improvements or ties in 21 out of 25 datasets when comparing only against the reference and in 15 out of 25 when also comparing against the aggregation.

In order to detect significant differences among the results of the different approaches, we carry out the Friedman aligned rank test. This test obtains a low p value for all three algorithms, which implies that there are significant differences between the results provided by each method. For this reason, we can apply a post hoc test to compare our methodology against the remaining approaches. Specifically, a Holm test is applied using the best approach (the one with the lowest ranking) as control method and computing the adjusted p value (APV) for the remaining methods. The Holm post hoc test, as shown in Table 5, reflects that the bipolar method outperforms both the reference and the aggregated classifiers with a high level of confidence for both the CART algorithm, in which the logistic approach seems to be the best, and the NNet classifier, where the additive one achieves the lowest (i.e. best) ranking. Regarding the base RF classifier, due to the odd behaviour of our logistic bipolar method when applied to this classifier, the Holm post hoc test indicates that there is not enough statistical evidence to assert that our method achieves better results than the reference or aggregated classifiers. However, the additive bipolar model reaches the lowest ranking with an appreciable difference.

The statistical analysis of the pairwise comparisons of methods, carried out by means of a Wilcoxon test and shown in Table 6, reflects the superiority of the proposed methodology when applied to the CART and NNet algorithms with acceptable p values, especially when comparing the logistic bipolar model against the reference for the CART algorithm and the additive one against the aggregated classifier in the case of NNet. Again, the application of the methodology to the RF algorithm does not reach significant improvements.
Table 2 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the CART algorithm and max aggregation

CART Max    Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.796   0.727  0.795   0.733  0.797   0.732  0.797   0.732
Con         0.391   0.247  0.336   0.232  0.409   0.289  0.409   0.305
Eco         0.800   0.642  0.791   0.595  0.797   0.642  0.797   0.642
Fla         0.620   0.560  0.604   0.559  0.627   0.561  0.627   0.561
Gla         0.722   0.488  0.720   0.473  0.726   0.482  0.726   0.482
Lin         0.667   0.620  0.667   0.620  0.667   0.620  0.667   0.620
Shu         0.994   0.995  0.994   0.995  0.994   0.995  0.994   0.995
Thy         0.876   0.819  0.876   0.819  0.879   0.806  0.879   0.806
Yea         0.605   0.474  0.574   0.480  0.614   0.482  0.614   0.482
Aus         0.785   0.673  0.782   0.684  0.795   0.691  0.795   0.691
Bal         0.655   0.518  0.655   0.518  0.655   0.518  0.655   0.518
Bup         0.623   0.346  0.618   0.344  0.622   0.345  0.622   0.345
Ger         0.503   0.301  0.518   0.335  0.538   0.341  0.540   0.338
Hay         0.841   0.738  0.841   0.738  0.841   0.738  0.841   0.738
Pen         0.970   0.932  0.970   0.932  0.970   0.932  0.972   0.939
Pag         0.872   0.887  0.872   0.887  0.872   0.887  0.872   0.887
Sah         0.545   0.220  0.511   0.340  0.582   0.266  0.582   0.266
Veh         0.607   0.309  0.583   0.311  0.663   0.353  0.663   0.349
Wis         0.924   0.880  0.924   0.880  0.924   0.880  0.924   0.880
Wnq         0.644   0.344  0.637   0.352  0.658   0.399  0.658   0.399
Pim         0.654   0.480  0.646   0.477  0.666   0.460  0.666   0.460
Nty         0.864   0.720  0.864   0.720  0.864   0.720  0.864   0.720
Hab         0.430   0.055  0.412   0.053  0.440   0.084  0.440   0.084
Der         0.989   0.953  0.989   0.953  0.989   0.953  0.989   0.953
App         0.636   0.392  0.636   0.392  0.636   0.392  0.636   0.392
Mean        0.721   0.573  0.713   0.577  0.729   0.583  0.729   0.583

The best global result for each considered dataset is stressed in bold face
All in all, considering the maximum as the aggregation operator to obtain the fuzzy classifier, we have reached good results for CART and NNet, but not in the case of RF.
6.2 Minimum aggregation function

When using the minimum operator to build the fuzzy aggregated classifier, Tables 7, 8 and 9 show the generally good behaviour of the bipolar tuning method, at least for one of the bipolar approaches, since it enhances the performance of both the reference algorithm and its fuzzy aggregation. When the bipolar method is applied to the CART classifier, as shown in Table 7, the results can be interpreted as follows:
– There is an improvement in mean kappa of 0.004 when comparing the aggregation against the randomly selected reference and of 0.013 in the case of the logistic bipolar approach.
– The additive bipolar model enhances the results of the remaining classifiers in 7 datasets and the logistic one does so in 9 of them.
– The reference and the fuzzy aggregation win in 2 out of 25 datasets each.
– There are ties in 11 out of 25 datasets.
In summary, we have achieved improvements or ties in 23 out of 25 datasets when comparing only against the reference and in 22 out of 25 when also comparing against the aggregation. The logistic bipolar approach clearly presents a better behaviour pattern than the additive model when they are applied to the CART algorithm.
Table 3 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the RF algorithm and max aggregation

RF Max      Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         1.000   0.795  1.000   0.801  1.000   0.796  1.000   0.815
Con         0.707   0.254  0.664   0.242  0.804   0.287  0.807   0.292
Eco         1.000   0.549  1.000   0.553  1.000   0.571  1.000   0.571
Fla         0.583   0.560  0.597   0.566  0.643   0.622  0.653   0.590
Gla         1.000   0.654  1.000   0.654  1.000   0.645  1.000   0.578
Lin         0.959   0.744  0.975   0.739  0.991   0.737  0.991   0.733
Shu         1.000   0.996  1.000   0.996  1.000   0.998  1.000   0.992
Thy         1.000   0.886  1.000   0.898  1.000   0.911  1.000   0.871
Yea         0.999   0.498  1.000   0.507  1.000   0.509  1.000   0.537
Aus         0.985   0.720  0.991   0.718  0.999   0.717  0.999   0.713
Bal         0.670   0.575  0.669   0.579  0.673   0.577  0.673   0.577
Bup         1.000   0.469  1.000   0.461  1.000   0.426  1.000   0.442
Ger         1.000   0.332  1.000   0.338  1.000   0.369  1.000   0.373
Hay         1.000   1.000  1.000   1.000  1.000   1.000  1.000   0.994
Pen         1.000   0.951  1.000   0.956  1.000   0.957  1.000   0.972
Pag         1.000   0.860  1.000   0.858  1.000   0.853  1.000   0.839
Sah         0.971   0.244  0.971   0.246  1.000   0.316  1.000   0.280
Veh         1.000   0.380  1.000   0.359  1.000   0.365  1.000   0.296
Wis         0.994   0.941  0.995   0.939  0.997   0.925  0.997   0.924
Wnq         1.000   0.477  1.000   0.476  1.000   0.466  1.000   0.491
Pim         1.000   0.506  1.000   0.501  1.000   0.488  1.000   0.493
Nty         1.000   0.988  1.000   0.988  1.000   0.988  1.000   0.974
Hab         0.780   0.162  0.819   0.074  0.878   0.150  0.886   0.174
Der         1.000   0.985  1.000   0.985  1.000   1.000  1.000   1.000
App         1.000   0.551  1.000   0.551  1.000   0.551  1.000   0.512
Mean        0.946   0.643  0.947   0.639  0.959   0.649  0.960   0.641

The best global result for each considered dataset is stressed in bold face
When we consider the RF classifier and apply our bipolar method, whose results are shown in Table 8, its behaviour can be described as follows:

– There is an improvement in mean kappa of 0.002 when comparing the additive bipolar model against the randomly selected reference and of 0.006 when comparing it against the fuzzy aggregation.
– The additive bipolar classifier outperforms the remaining approaches in 7 datasets and the logistic one does so in 6 of them.
– The reference wins in 8 out of 25 datasets and the aggregation does so in 4 of them.
– There is a unique tie in these results.

Therefore, we have reached improvements or ties in 17 out of 25 datasets when comparing only against the reference and in 15 out of 25 when also comparing against the aggregation. Once again we have to note the odd behaviour of the methodology in either the fuzzy aggregation or the logistic bipolar approach.

Considering the NNet classifier, the bipolar method reaches the results shown in Table 9, which can be interpreted as follows:

– There is an improvement in mean kappa of 0.002 when comparing the aggregation against the randomly selected reference, being of 0.024 and 0.043 in the case of the additive and logistic bipolar approaches, respectively.
– The additive and logistic bipolar classifiers outperform the remaining approaches in 8 and 11 datasets, respectively.
– The reference wins in 4 out of 25 datasets and the aggregation does so in 5 of them.
– There is only one tie in these results.
Table 4 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the NNet algorithm and max aggregation

NNet Max    Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.998   0.971  0.999   0.967  1.000   0.966  1.000   0.972
Con         0.209   0.226  0.211   0.218  0.357   0.327  0.361   0.340
Eco         0.725   0.488  0.708   0.569  0.764   0.514  0.766   0.466
Fla         0.587   0.614  0.587   0.610  0.623   0.631  0.632   0.606
Gla         0.554   0.454  0.560   0.454  0.628   0.457  0.628   0.452
Lin         0.884   0.778  0.884   0.788  0.905   0.774  0.905   0.779
Shu         0.993   0.974  0.997   0.987  0.999   0.983  0.999   0.982
Thy         0.694   0.632  0.716   0.594  0.791   0.626  0.832   0.724
Yea         0.498   0.477  0.499   0.477  0.544   0.547  0.547   0.549
Aus         0.371   0.310  0.461   0.455  0.490   0.466  0.489   0.459
Bal         0.653   0.665  0.655   0.665  0.666   0.641  0.666   0.641
Bup         0.555   0.370  0.569   0.310  0.619   0.358  0.617   0.357
Ger         0.475   0.360  0.494   0.414  0.557   0.422  0.554   0.426
Hay         0.870   0.726  1.000   0.694  1.000   0.731  1.000   0.727
Pen         0.988   0.936  0.991   0.938  0.995   0.934  0.996   0.931
Pag         0.927   0.893  0.928   0.901  0.941   0.905  0.945   0.877
Sah         0.381   0.338  0.476   0.310  0.532   0.284  0.532   0.280
Veh         0.546   0.428  0.634   0.486  0.691   0.519  0.691   0.525
Wis         0.946   0.919  0.947   0.928  0.957   0.924  0.958   0.922
Wnq         0.260   0.281  0.309   0.269  0.453   0.407  0.471   0.402
Pim         0.492   0.416  0.499   0.355  0.541   0.405  0.540   0.399
Nty         0.981   0.964  0.994   0.949  1.000   0.973  1.000   0.973
Hab         0.350   0.219  0.362   0.188  0.409   0.192  0.417   0.277
Der         1.000   0.985  1.000   0.985  1.000   0.985  1.000   0.985
App         0.165   0.149  0.037   0.103  0.528   0.450  0.674   0.481
Mean        0.644   0.583  0.661   0.585  0.720   0.617  0.729   0.621

The best global result for each considered dataset is stressed in bold face

Table 5 Average rankings of the algorithms (aligned Friedman), associated p values and Holm test APV for each algorithm with the MAX aggregation

Algorithm     Rank CART   Rank RF     Rank NNet
"Ref"         62.2        52.54       62.14
"Aggr"        58.76       55.3        59.98
"BipAdd"      41.56       42.36       39.42
"BipLog"      39.48       51.8        40.46
p value       0.000163    0.000135    0.000147
APV "Ref"     0.0168*     0.4295      0.0168*
APV "Aggr"    0.0375*     0.3444      0.0244*

Bold values represent adjusted p values

Table 6 Wilcoxon test to compare the bipolar tuning approaches (R+) against the base classifier (R−) with the MAX aggregation

Comparison                     R+       R−       p value
CARTbipAdd versus CARTRef      213.5    86.5     0.0674
CARTbipLog versus CARTRef      236.5    88.5     0.0450
CARTbipAdd versus CARTAggr     224.5    100.5    0.0926
CARTbipLog versus CARTAggr     214.0    86.0     0.0653
NNetbipAdd versus NNetRef      215.0    85.0     0.0612
NNetbipLog versus NNetRef      206.0    94.0     0.1064
NNetbipAdd versus NNetAggr     228.0    72.0     0.0249
NNetbipLog versus NNetAggr     216.0    84.0     0.0574

Bold values represent statistically significant results

Table 7 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the CART algorithm and min aggregation

CART Min    Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.796   0.727  0.795   0.733  0.797   0.734  0.797   0.734
Con         0.391   0.247  0.336   0.232  0.371   0.298  0.401   0.325
Eco         0.800   0.642  0.791   0.595  0.791   0.595  0.793   0.595
Fla         0.620   0.560  0.604   0.559  0.627   0.561  0.627   0.561
Gla         0.722   0.488  0.720   0.473  0.722   0.513  0.723   0.489
Lin         0.667   0.620  0.667   0.620  0.667   0.620  0.667   0.620
Shu         0.994   0.995  0.994   0.995  0.994   0.995  0.994   0.995
Thy         0.876   0.819  0.876   0.819  0.879   0.806  0.879   0.806
Yea         0.605   0.474  0.574   0.480  0.585   0.488  0.594   0.498
Aus         0.785   0.673  0.782   0.684  0.795   0.691  0.795   0.691
Bal         0.655   0.518  0.655   0.518  0.655   0.518  0.655   0.518
Bup         0.623   0.346  0.618   0.344  0.621   0.349  0.621   0.349
Ger         0.503   0.301  0.518   0.335  0.538   0.342  0.538   0.342
Hay         0.841   0.738  0.841   0.738  0.841   0.738  0.841   0.738
Pen         0.970   0.932  0.970   0.932  0.970   0.932  0.970   0.932
Pag         0.872   0.887  0.872   0.887  0.872   0.887  0.872   0.887
Sah         0.545   0.220  0.511   0.340  0.578   0.258  0.582   0.266
Veh         0.607   0.309  0.583   0.311  0.629   0.344  0.651   0.352
Wis         0.924   0.880  0.924   0.880  0.924   0.880  0.924   0.880
Wnq         0.644   0.344  0.637   0.352  0.644   0.408  0.645   0.407
Pim         0.654   0.480  0.646   0.477  0.657   0.456  0.662   0.477
Nty         0.864   0.720  0.864   0.720  0.864   0.720  0.864   0.720
Hab         0.430   0.055  0.412   0.053  0.422   0.082  0.438   0.136
Der         0.989   0.953  0.989   0.953  0.989   0.953  0.989   0.953
App         0.636   0.392  0.636   0.392  0.636   0.392  0.636   0.392
Mean        0.721   0.573  0.713   0.577  0.723   0.582  0.726   0.586

The best global result for each considered dataset is stressed in bold face
Table 8 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the RF algorithm and min aggregation

RF Min      Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         1.000   0.795  1.000   0.801  1.000   0.795  1.000   0.825
Con         0.707   0.254  0.664   0.242  0.787   0.304  0.801   0.297
Eco         1.000   0.549  1.000   0.553  1.000   0.586  1.000   0.571
Fla         0.583   0.560  0.597   0.566  0.630   0.624  0.648   0.615
Gla         1.000   0.654  1.000   0.654  1.000   0.640  1.000   0.614
Lin         0.959   0.744  0.975   0.739  0.991   0.711  0.991   0.703
Shu         1.000   0.996  1.000   0.996  1.000   0.998  1.000   0.996
Thy         1.000   0.886  1.000   0.898  1.000   0.892  1.000   0.843
Yea         0.999   0.498  1.000   0.507  1.000   0.509  1.000   0.516
Aus         0.985   0.720  0.991   0.718  0.999   0.719  0.999   0.718
Bal         0.670   0.575  0.669   0.579  0.673   0.577  0.673   0.577
Bup         1.000   0.469  1.000   0.461  1.000   0.412  1.000   0.423
Ger         1.000   0.332  1.000   0.338  1.000   0.344  1.000   0.401
Hay         1.000   1.000  1.000   1.000  1.000   1.000  1.000   1.000
Pen         1.000   0.951  1.000   0.956  1.000   0.958  1.000   0.962
Pag         1.000   0.860  1.000   0.858  1.000   0.853  1.000   0.835
Sah         0.971   0.244  0.971   0.246  1.000   0.305  1.000   0.279
Veh         1.000   0.380  1.000   0.359  1.000   0.340  1.000   0.389
Wis         0.994   0.941  0.995   0.939  0.997   0.935  0.997   0.931
Wnq         1.000   0.477  1.000   0.476  1.000   0.473  1.000   0.475
Pim         1.000   0.506  1.000   0.501  1.000   0.509  1.000   0.470
Nty         1.000   0.988  1.000   0.988  1.000   0.974  1.000   0.822
Hab         0.780   0.162  0.819   0.074  0.867   0.132  0.884   0.171
Der         1.000   0.985  1.000   0.985  1.000   0.988  1.000   0.985
App         1.000   0.551  1.000   0.551  1.000   0.551  1.000   0.551
Mean        0.946   0.643  0.947   0.639  0.958   0.645  0.960   0.639

The best global result for each considered dataset is stressed in bold face
Overall, we have reached improvements or ties in 21 out of 25 datasets when comparing only against the reference and in 16 out of 25 when also comparing against the aggregation. For the minimum aggregation function, the Friedman aligned rank test obtains a low p value for all three algorithms (see Table 10), which implies that there are significant differences between the results. For this reason, we can apply a post hoc test to compare our methodology against the remaining approaches. This test, as shown in Table 10, reflects that the bipolar method outperforms both the reference and the aggregated classifiers with a high level of confidence in the case of the CART and NNet algorithms, in which the logistic approach seems to be the best.
On the contrary, due to the odd behaviour of our logistic bipolar approach when it is applied to the RF algorithm, the Holm post hoc test indicates that there is not enough statistical evidence to assert that our method achieves better results than the reference or aggregated classifiers. However, the additive bipolar model again reaches the lowest ranking. The Wilcoxon test, as shown in Table 11, reflects the superiority of our new methodology when applied to the CART and NNet algorithms with very low p values, especially when comparing both bipolar models against the reference and the logistic one against the fuzzy aggregation for the CART algorithm, and the two bipolar models against the aggregated classifier in the case of NNet. Again, the application of the methodology to the RF algorithm does not reach significant improvements.
Table 9 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the NNet algorithm and min aggregation

NNet Min    Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.998   0.971  0.999   0.967  1.000   0.968  1.000   0.967
Con         0.209   0.226  0.211   0.218  0.352   0.324  0.361   0.333
Eco         0.725   0.488  0.708   0.569  0.760   0.512  0.762   0.510
Fla         0.587   0.614  0.587   0.610  0.621   0.630  0.632   0.605
Gla         0.554   0.454  0.560   0.454  0.630   0.484  0.627   0.486
Lin         0.884   0.778  0.884   0.788  0.905   0.761  0.905   0.761
Shu         0.993   0.974  0.997   0.987  0.999   0.987  0.999   0.982
Thy         0.694   0.632  0.716   0.594  0.792   0.663  0.890   0.872
Yea         0.498   0.477  0.499   0.477  0.543   0.545  0.547   0.555
Aus         0.371   0.310  0.461   0.455  0.490   0.467  0.490   0.467
Bal         0.653   0.665  0.655   0.665  0.666   0.641  0.666   0.641
Bup         0.555   0.370  0.569   0.310  0.608   0.355  0.611   0.351
Ger         0.475   0.360  0.494   0.414  0.548   0.429  0.548   0.415
Hay         0.870   0.726  1.000   0.694  1.000   0.694  1.000   0.694
Pen         0.988   0.936  0.991   0.938  0.991   0.938  0.994   0.949
Pag         0.927   0.893  0.928   0.901  0.940   0.876  0.944   0.876
Sah         0.381   0.338  0.476   0.310  0.527   0.286  0.531   0.284
Veh         0.546   0.428  0.634   0.486  0.694   0.538  0.705   0.539
Wis         0.946   0.919  0.947   0.928  0.958   0.929  0.958   0.929
Wnq         0.260   0.281  0.309   0.269  0.455   0.395  0.467   0.371
Pim         0.492   0.416  0.499   0.355  0.535   0.439  0.537   0.409
Nty         0.981   0.964  0.994   0.949  1.000   0.973  1.000   0.973
Hab         0.350   0.219  0.362   0.188  0.425   0.210  0.422   0.224
Der         1.000   0.985  1.000   0.985  1.000   0.985  1.000   0.985
App         0.165   0.149  0.037   0.103  0.265   0.156  0.643   0.462
Mean        0.644   0.583  0.661   0.585  0.708   0.607  0.729   0.626

The best global result for each considered dataset is stressed in bold face

Table 10 Average rankings of the algorithms (aligned Friedman), associated p values and Holm test APV for each algorithm with the MIN aggregation

Algorithm     Rank CART   Rank RF     Rank NNet
"Ref"         63.34       49.72       59.98
"Aggr"        57.86       53.88       59.92
"BipAdd"      41.56       46.86       43.58
"BipLog"      39.24       51.54       38.52
p value       0.000156    0.000133    0.000161
Holm "Ref"    0.0099*     1           0.0267*
Holm "Aggr"   0.0465*     1           0.0267*

Bold values represent adjusted p values

Table 11 Wilcoxon test to compare the bipolar tuning approaches (R+) against the base classifier (R−) with the MIN aggregation

Comparison                     R+       R−       p value
CARTbipAdd versus CARTRef      243.5    81.5     0.0283
CARTbipLog versus CARTRef      246.5    78.5     0.0229
CARTbipAdd versus CARTAggr     213.5    86.5     0.0674
CARTbipLog versus CARTAggr     233.5    66.5     0.0163
NNetbipAdd versus NNetRef      217.0    83.0     0.0537
NNetbipLog versus NNetRef      216.0    84.0     0.0574
NNetbipAdd versus NNetAggr     231.5    68.5     0.0191
NNetbipLog versus NNetAggr     241.5    83.5     0.0324

Bold values represent statistically significant results
All in all, considering the minimum as the aggregation operator to obtain the fuzzy classifier, we have reached positive results for CART and NNet, but not in the case of RF.
6.3 Arithmetic mean aggregation function

When the bipolar method is applied to the CART classifier, as shown in Table 12, the results can be interpreted as follows:

– There is an improvement in mean kappa of 0.008 when comparing both the logistic and the additive bipolar models against the randomly selected reference, and of 0.01 when comparing them against the fuzzy aggregated classifier.
– The additive and logistic bipolar classifiers outperform the remaining approaches in 6 and 7 datasets, respectively.
– The reference and the aggregation win in 3 out of 25 datasets each.
– There are ties in 12 out of 25 datasets.

Consequently, we have reached improvements or ties in 22 out of 25 datasets when comparing only against the reference and in 20 out of 25 when also comparing against the aggregation. Both bipolar approaches behave very similarly when they are applied to the CART algorithm.

In the case of the bipolar method applied to the RF classifier, Table 13 shows the results, whose behaviour can be briefly described as follows:

– There is an improvement in mean kappa of 0.004 when comparing the logistic bipolar model against the randomly selected reference and of 0.007 when comparing it against the aggregation.
– The additive bipolar classifier outperforms the remaining approaches in 6 datasets and the logistic one does so in 8 of them.
– The reference wins in 6 out of 25 datasets and the aggregation does so in 4 of the considered datasets.
– There are ties in 3 out of 25 datasets (Table 13).

Therefore, we have reached improvements or ties in 19 out of 25 datasets when comparing only against the reference and in 15 out of 25 when also comparing against the aggregation. The additive bipolar classifier presents the worst mean results because of the low kappa values obtained in datasets such as German and Vehicle (Table 14).

Considering the NNet classifier, the bipolar method reaches the results shown in Table 14, which can be interpreted as follows:

– There is an improvement in mean kappa of 0.041 when comparing the logistic bipolar model against the randomly selected reference, and of 0.047 when comparing it against the fuzzy aggregated classifier.
– The logistic bipolar classifier outperforms the remaining approaches in 14 datasets.
– The reference wins in 4 out of 25 datasets and the aggregation does so in 6 of them.
– There are no ties in these results.

On balance, we have reached improvements or ties in 21 out of 25 datasets when comparing only against the reference and in 16 out of 25 when also comparing against the aggregation. Once again, the Friedman aligned rank test obtains a low p value for all three algorithms, so there are significant differences between the results. Consequently, we apply a Holm post hoc test, as shown in Table 15, which makes clear that the bipolar method outperforms both the reference and the aggregated classifiers with an acceptable level of confidence in the case of both the CART and NNet algorithms, where the logistic bipolar classifier achieves the lowest ranking. In contrast, given the odd behaviour of our additive bipolar approach when it is applied to the RF algorithm with the minimum aggregation, the Holm post hoc test indicates that there is not enough statistical evidence to assert that our method achieves better results than the reference or aggregated classifiers.
Table 12 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the CART algorithm and mean aggregation

CART Mean   Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.796   0.727  0.795   0.733  0.797   0.733  0.797   0.733
Con         0.391   0.247  0.379   0.245  0.409   0.284  0.410   0.301
Eco         0.800   0.642  0.793   0.595  0.797   0.642  0.797   0.642
Fla         0.620   0.560  0.620   0.554  0.627   0.561  0.627   0.561
Gla         0.722   0.488  0.722   0.454  0.723   0.489  0.723   0.489
Lin         0.667   0.620  0.667   0.620  0.667   0.620  0.667   0.620
Shu         0.994   0.995  0.994   0.995  0.994   0.995  0.994   0.995
Thy         0.876   0.819  0.876   0.819  0.879   0.806  0.879   0.806
Yea         0.605   0.474  0.584   0.484  0.614   0.478  0.614   0.478
Aus         0.785   0.673  0.782   0.666  0.790   0.680  0.790   0.680
Bal         0.655   0.518  0.655   0.518  0.655   0.518  0.655   0.518
Bup         0.623   0.346  0.623   0.346  0.623   0.346  0.623   0.346
Ger         0.503   0.301  0.485   0.352  0.511   0.348  0.511   0.348
Hay         0.841   0.738  0.841   0.738  0.841   0.738  0.841   0.738
Pen         0.970   0.932  0.970   0.932  0.970   0.932  0.972   0.939
Pag         0.872   0.887  0.872   0.887  0.872   0.887  0.872   0.887
Sah         0.545   0.220  0.572   0.217  0.584   0.215  0.584   0.215
Veh         0.607   0.309  0.627   0.302  0.656   0.352  0.662   0.346
Wis         0.924   0.880  0.924   0.880  0.924   0.880  0.924   0.880
Wnq         0.644   0.344  0.646   0.337  0.659   0.393  0.659   0.393
Pim         0.654   0.480  0.646   0.477  0.665   0.470  0.667   0.467
Nty         0.864   0.720  0.864   0.720  0.864   0.720  0.864   0.720
Hab         0.430   0.055  0.430   0.055  0.440   0.084  0.440   0.084
Der         0.989   0.953  0.989   0.953  0.989   0.953  0.989   0.953
App         0.636   0.392  0.636   0.392  0.636   0.392  0.636   0.392
Mean        0.721   0.573  0.720   0.571  0.728   0.581  0.728   0.581

The best global result for each considered dataset is stressed in bold face
However, the logistic bipolar model reaches the lowest ranking with an appreciable difference. Pairwise comparisons carried out by means of a Wilcoxon test, as shown in Table 16, reflect the enhancement provided by our new methodology when it is applied to the CART and NNet algorithms with acceptable p values, especially when comparing the logistic bipolar model against the reference for the CART algorithm and for all the pairs in the case of NNet. Again, the application of the methodology to the RF algorithm does not reach significant improvements (Table 16).
6.4 Conclusions of the computational experience

On the whole, comparing the behaviour of the three aggregation operators (maximum, minimum and mean) used to obtain the fuzzy classifiers, we have achieved very positive results for CART and NNet and not so good results in the case of RF. The complete analysis supports the following assertions:

– Considering the CART and NNet base classifiers, the fuzzy aggregated classifier mildly enhances the behaviour of the reference in terms of mean kappa when using the maximum and minimum aggregation operators; this is not the case for the mean aggregation, where the fuzzy classifier shows a slight decrease in mean kappa. Regarding the RF algorithm, no improvement is reached in this comparison for any of the three aggregation operators.
– Both the additive and logistic bipolar classifiers outperform the remaining approaches on average for the three considered aggregation operators when they are applied to the CART and NNet algorithms. In the case of RF, the additive bipolar model is the one with the best mean for the maximum and minimum aggregations, and the logistic one obtains the best result for the aggregation by the mean operator.
Table 13 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the RF algorithm and mean aggregation

RF Mean     Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         1.000   0.795  1.000   0.801  1.000   0.799  1.000   0.810
Con         0.707   0.254  0.663   0.244  0.799   0.299  0.805   0.295
Eco         1.000   0.549  1.000   0.553  1.000   0.570  1.000   0.577
Fla         0.583   0.560  0.584   0.571  0.631   0.617  0.650   0.611
Gla         1.000   0.654  1.000   0.656  1.000   0.611  1.000   0.683
Lin         0.959   0.744  0.975   0.739  0.991   0.703  0.991   0.717
Shu         1.000   0.996  1.000   0.998  1.000   0.998  1.000   0.996
Thy         1.000   0.886  1.000   0.898  1.000   0.863  1.000   0.893
Yea         0.999   0.498  1.000   0.507  1.000   0.519  1.000   0.528
Aus         0.985   0.720  0.989   0.719  0.995   0.731  0.995   0.724
Bal         0.670   0.575  0.671   0.569  0.673   0.580  0.673   0.580
Bup         1.000   0.469  1.000   0.450  1.000   0.446  1.000   0.401
Ger         1.000   0.332  1.000   0.335  1.000   0.275  1.000   0.318
Hay         1.000   1.000  1.000   1.000  1.000   1.000  1.000   1.000
Pen         1.000   0.951  1.000   0.956  1.000   0.959  1.000   0.959
Pag         1.000   0.860  1.000   0.853  1.000   0.862  1.000   0.835
Sah         0.971   0.244  0.974   0.253  1.000   0.291  1.000   0.322
Veh         1.000   0.380  1.000   0.365  1.000   0.314  1.000   0.362
Wis         0.994   0.941  0.995   0.939  0.997   0.934  0.997   0.922
Wnq         1.000   0.477  1.000   0.472  1.000   0.489  1.000   0.476
Pim         1.000   0.506  1.000   0.500  1.000   0.424  1.000   0.460
Nty         1.000   0.988  1.000   0.988  1.000   0.958  1.000   0.988
Hab         0.780   0.162  0.800   0.103  0.870   0.168  0.880   0.177
Der         1.000   0.985  1.000   0.985  1.000   0.985  1.000   1.000
App         1.000   0.551  1.000   0.551  1.000   0.551  1.000   0.551
Mean        0.946   0.643  0.946   0.640  0.958   0.638  0.960   0.647

The best global result for each considered dataset is stressed in bold face
– The Friedman aligned rank test reflects that there are statistical differences between the four approaches in all the multiple comparisons, for all three aggregations applied to all three algorithms. However, the Holm post hoc test applied to the RF algorithm points out that there is not enough evidence to affirm that the control method (the additive bipolar model in the case of the maximum and minimum aggregations, and the logistic bipolar model for the mean aggregation) improves the results reached by either the reference or its fuzzy aggregation.
– Regarding the pairwise comparisons, the Wilcoxon rank test points out that the logistic models applied to the NNet algorithm clearly improve the behaviour of the fuzzy aggregation when it is built with the minimum and mean operators, while for the maximum only the additive model does so. In the case of the CART algorithm, we can see statistically significant improvements for the following pairs: logistic bipolar vs. reference for both the maximum and mean operators, and all pairs but additive bipolar vs. fuzzy aggregation when focusing on the minimum.
– Unfortunately, we have reached no statistically significant enhancements for the RF classifier. However, we can see some interesting improvements in several datasets, especially in the case of the mean aggregation.
Table 14 Results in train and test achieved by the genetic fuzzy bipolar approaches applied to the NNet algorithm and mean aggregation

NNet Mean   Ref            Aggr           bipAdd         bipLog
            Tr.     Tst    Tr.     Tst    Tr.     Tst    Tr.     Tst
Car         0.998   0.971  0.999   0.967  1.000   0.968  1.000   0.971
Con         0.209   0.226  0.210   0.215  0.355   0.316  0.364   0.336
Eco         0.725   0.488  0.718   0.557  0.760   0.514  0.763   0.514
Fla         0.587   0.614  0.589   0.619  0.622   0.627  0.632   0.606
Gla         0.554   0.454  0.561   0.456  0.638   0.519  0.640   0.506
Lin         0.884   0.778  0.884   0.788  0.905   0.761  0.905   0.779
Shu         0.993   0.974  0.999   0.986  0.999   0.981  0.999   0.984
Thy         0.694   0.632  0.667   0.540  0.787   0.675  0.858   0.785
Yea         0.498   0.477  0.498   0.477  0.543   0.545  0.547   0.548
Aus         0.371   0.310  0.441   0.396  0.480   0.419  0.480   0.420
Bal         0.653   0.665  0.654   0.665  0.665   0.641  0.666   0.641
Bup         0.555   0.370  0.580   0.326  0.614   0.382  0.612   0.384
Ger         0.475   0.360  0.502   0.424  0.551   0.438  0.550   0.418
Hay         0.870   0.726  0.953   0.694  0.961   0.720  0.961   0.720
Pen         0.988   0.936  0.991   0.943  0.995   0.937  0.996   0.943
Pag         0.927   0.893  0.922   0.901  0.939   0.868  0.946   0.880
Sah         0.381   0.338  0.462   0.312  0.509   0.288  0.511   0.266
Veh         0.546   0.428  0.629   0.489  0.695   0.555  0.693   0.558
Wis         0.946   0.919  0.948   0.930  0.958   0.918  0.958   0.918
Wnq         0.260   0.281  0.269   0.254  0.440   0.384  0.455   0.410
Pim         0.492   0.416  0.508   0.380  0.544   0.410  0.545   0.425
Nty         0.981   0.964  0.994   0.949  1.000   0.973  1.000   0.973
Hab         0.350   0.219  0.365   0.184  0.422   0.235  0.422   0.216
Der         1.000   0.985  1.000   0.985  1.000   0.985  1.000   1.000
App         0.165   0.149  0.033   0.000  0.253   0.156  0.625   0.389
Mean        0.644   0.583  0.655   0.577  0.705   0.609  0.725   0.624

The best global result for each considered dataset is stressed in bold face

Table 15 Average rankings of the algorithms (aligned Friedman), associated p values and Holm test APV for each algorithm with the MEAN aggregation

Algorithm     Rank CART   Rank RF     Rank NNet
"Ref"         58.2        51.54       63.56
"Aggr"        60.08       54.74       60.88
"BipAdd"      43          53.3        41.32
"BipLog"      40.72       42.42       36.24
p value       0.000156    0.000130    0.000142
Holm "Ref"    0.0663      0.399       0.0026*
Holm "Aggr"   0.0549      0.399       0.0053*

Bold values represent adjusted p values

Table 16 Wilcoxon test to compare the bipolar tuning approaches (R+) against the base classifier (R−) with the MEAN aggregation

Comparison                     R+       R−       p value
CARTbipAdd versus CARTRef      231.0    94.0     0.0633
CARTbipLog versus CARTRef      221.5    78.5     0.0396
CARTbipAdd versus CARTAggr     201.5    98.5     0.137
CARTbipLog versus CARTAggr     223.5    101.5    0.0979
NNetbipAdd versus NNetRef      231.0    69.0     0.0198
NNetbipLog versus NNetRef      262.0    63.0     0.0071
NNetbipAdd versus NNetAggr     231.0    69.0     0.0198
NNetbipLog versus NNetAggr     249.0    76.0     0.0192

Bold values represent statistically significant results
7 Discussion and final remarks

In this paper, we have studied the extension of fuzzy supervised classifiers into a bipolar knowledge representation framework by means of the introduction of a dissimilarity structure on the set of classes. These structures enable considering different opposition or dissimilarity relationships between the available classes, which otherwise are by default treated as independent, unrelated objects. Such relationships provide further information on the underlying structure of the classification problems being addressed, which can be used at the final decision or classification stage to better exploit the soft scores provided by any fuzzy classifier to assess the association between each item and each class. Therefore, the introduction of dissimilarity structures may allow strengthening the adaptation of the classifiers to each specific application context, in which classes acquire a particular semantics, thus also improving classifier performance.

The application of a dissimilarity structure to the soft scores provided by fuzzy classifiers leads to a bipolar knowledge representation framework, in which the association of items with classes is assessed by means of a pair of positive and negative evidence degrees. Here, we have proposed two aggregation procedures, logistic and additive, to transform these bipolar pairs of positive and negative evidence back into single fuzzy adjusted scores. Therefore, dissimilarity structures can be used to perform an adjustment of fuzzy classifiers that takes into account the relationships between the classes being considered.

Following these ideas, in this work we have proposed an evolutionary learning approach to obtain the most convenient dissimilarity structure on the set of classes for each classification problem. In particular, to the best of our knowledge, the application of genetic algorithms to the fine-tuning of dissimilarity structures constitutes a novel approach in the context of evolutionary fuzzy classification, particularly regarding fuzzy rule-based classification systems, since it is applied neither to the membership functions of the independent variables nor to the classifier rule base, but rather to the resulting fuzzy scores after the classifier has been trained.
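Purely as an illustration of this final decision stage (the exact additive and logistic transformations are those defined earlier in the paper, not the simplified stand-ins used below), the adjusted scores can be thought of as a collapse of the positive and negative evidence pairs, after which the usual maximum rule is applied:

```r
# Illustrative only: pos and neg are assumed item-by-class matrices of positive and
# negative evidence degrees in [0, 1], derived from the fuzzy scores and a
# dissimilarity structure (their construction is not reproduced here).
set.seed(1)
pos <- matrix(runif(6), nrow = 3, dimnames = list(NULL, c("classA", "classB")))
neg <- matrix(runif(6), nrow = 3, dimnames = list(NULL, c("classA", "classB")))

# Simplified stand-ins for the two adjustment procedures discussed above:
adjusted_additive <- pos - neg                         # difference of evidences
adjusted_logistic <- 1 / (1 + exp(-5 * (pos - neg)))   # logistic squashing of the difference

# Final decision: maximum rule over the adjusted scores.
classes <- colnames(pos)
pred_additive <- classes[max.col(adjusted_additive)]
pred_logistic <- classes[max.col(adjusted_logistic)]
```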
An important feature of this novel approach is its independence from the knowledge representation framework and structure of the fuzzy classifiers, which enables it to be applied to any fuzzy classifier, regardless of the nature and specific steps of the intermediate algorithm providing the fuzzy scores. In this sense, the proposed approach can be understood as a general postprocessing method to fine-tune the maximum decision rule usually applied to defuzzify fuzzy classifiers and decide the class assignment of each item to be classified.

To study the feasibility of the proposed approach, and particularly to remark that it can be applied to any fuzzy classifier regardless of how it is obtained, we have applied it to a set of robust fuzzy classifiers obtained through the aggregation of the replicated probability distribution estimates provided by different base probabilistic classifiers, specifically CART classification trees, random forests and artificial neural networks. Three different aggregation schemes (min, max and mean) have been applied in this aggregation process. A rigorous and extensive computational experience has been conducted to analyse whether the proposed additive and logistic bipolar approaches enable a statistically significant improvement of the base probabilistic and aggregated fuzzy classifiers. Throughout this experimental study, several lessons have been learned:

– The bipolar framework allowed us to significantly improve the results of the three base machine learning algorithms considered in this work.
– Both the additive and logistic adjustment methods significantly outperform the results of the base classifier in the case of CART and NNet, but only the additive method enhances the behaviour of RF in statistical terms.
– Comparing the additive and the logistic proposed classifiers, there is no clear winner; this question seems to depend on the base algorithm considered as well as on the dataset of application.

These results lead us to conclude that the proposed approach provides a suitable solution to confront binary classification
problems and improve the decision rule that manages how the intermediate soft information gathered by many classifiers is exploited.

Regarding future research on this approach, a main line of work will be devoted to studying further mechanisms, beyond the additive and logistic aggregations, for exploiting the bipolar pairs of positive and negative evidence. A particularly appealing possibility is to use these bipolar pairs as the base information of a multivalued paraconsistent logic, as those proposed in Ozturk and Tsoukiàs (2007), Turunen et al. (2010) and Rodríguez et al. (2014). This would allow an even more expressive representational framework to take advantage of all the information contained in the soft scores provided by classifiers. Some effort will also be devoted to extending the computational experiences beyond the binary classification context, extending the evolutionary learning of the dissimilarity structures to multi-class frameworks. The experimental setting will also be extended to consider other base classifiers, such as fuzzy rule-based classification systems.

Acknowledgements This research has been partially supported by the Government of Spain, Grant TIN2015-66471-P, and the FPU fellowship Grant 2015/06202 from the Ministry of Education of Spain.
Compliance with ethical standards Conflict of interest The authors declare that they have no conflict of interest.
References Alcalá R, Alcalá-Fdez J, Herrera F (2007) A proposal for the genetic lateral tuning of linguistic fuzzy systems and its interaction with rule selection. IEEE Trans Fuzzy Syst 15(4):616–635 Alcalá-Fdez J, Alcala R, Herrera F (2011a) A fuzzy association rulebased classification model for high-dimensional problems with genetic rule selection and lateral tuning. IEEE Trans Fuzzy Syst 19(5):857–872 Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011b) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17(2–3):255–287 Amo A, Montero J, Molina E (2001) Representation of consistent recursive rules. Eur J Oper Res 130:29–53 Atanassov KT (1999) Intuitionistic fuzzy sets theory and applications. Physica-Verlag, Heidelberg Breiman L (1984) Classification and regression trees. Kluwer Academic Publishers, New York Breiman L (2001) Random forests. Mach Learn 40:5–32 Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46 Cordon O, del Jesús MJ, Herrera F (1999) A proposal on reasoning methods in fuzzy rule-based classification systems. Int J Approx Reason 20(1):21–45 Demsar J (2006) Statistical comparisons of classifiers over multiple datasets. J Mach Learn Res 7:1–30
Dubois D, Prade H (2002) Possibility theory. Probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 32:35–66 Dubois D, Prade H (2006) A bipolar possibilistic representation of knowledge and preferences and its applications. Fuzzy Logic Appl 3849:1–10 Dubois D, Prade H (2008) An introduction to bipolar representations of information and preference. Int J Intell Syst 23(8):866–877 Fünkranz J, Hüllermeier E, Loza Mencía E, Brinker K (2008) Multilabel classification via calibrated label ranking. Mach Learn 73(2):133– 153 García S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple datasets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694 García S, Fernandez A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064 Goldberg D (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Boston Gómez D, Montero J (2004) A discussion on aggregation operators. Kybernetika 40(1):107–120 Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70 Hullermeier E (2005) Fuzzy methods in machine learning and data mining: status and prospects. Fuzzy Sets Syst 156(3):387–406 Ishibuchi H, Yamamoto T, Nakashima T (2005) Hybridization of fuzzy GBML approaches for pattern classification problems. IEEE Trans Syst Man Cybern Part B Cybern 35(2):359–365 Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26. https://doi.org/10.18637/jss.v028.i05 Kumar R, Verma R (2012) Classification algorithms for data mining: a survey. Int J Innov Eng Technol 2:7–14 Lim TS, Loh WY, Shih YS (2000) A comparison of prediction accuracy. Complexity, and training time of thirty-three old and new classification algorithms. Mach Learn 40:203–228 Montero J, Gómez D, Bustince H (2007) On the relevance of some families of fuzzy Sets. Fuzzy Sets Syst 158(22):2429–2442 Montero J, Bustince H, Franco C, Rodríguez JT, Gómez D, Pagola M, Fernandez J, Barrenechea E (2016) Paired structures in knowledge representation. Knowl Based Syst 100:50–58 Osgood CE, Suci GJ, Tannenbaum PH (1957) The measurement of meaning. University of Illinois Press, Urbana Ozturk M, Tsoukiàs A (2007) Modeling uncertain positive and negative reasons in decision aiding. Decis Support Syst 43:1512–1526 Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge Rodríguez JT, Vitoriano B, Montero J (2011) Rule-based classification by means of bipolar criteria. In: IEEE symposium on computational intelligence in multicriteria decision-making (MDCM) vol 2011, p 197–204 Rodríguez JT, Vitoriano B, Montero J (2012) A general methodology for data-based rule building and its application to natural disaster management. Comput Oper Res 39(4):863–873 Rodríguez JT, Vitoriano B, Gómez D, Montero J (2013) Classification of disasters and emergencies under bipolar knowledge representation. In: Vitoriano B, Montero J, Ruan D (eds) Decision aid models for disaster management and emergencies, Atlantis computational intelligence systems, vol 7, p 209–232 Rodríguez JT, Turunen E, Ruan D, Montero J (2014) Another paraconsistent algebraic semantics for Lukasiewicz–Pavelka logic. 
Fuzzy Sets Syst 242:132–147
Rojas K, Gómez D, Montero J, Rodríguez JT, Valdivia A, Paiva F (2014) Development of child's home environment indexes based on consistent families of aggregation operators with prioritized hierarchical information. Fuzzy Sets Syst 241:41–60
Sivanandam SN, Deepa SN (2007) Introduction to genetic algorithms. Springer, Berlin
Turunen E, Ozturk M, Tsoukiàs A (2010) Paraconsistent semantics for Pavelka style fuzzy sentential logic. Fuzzy Sets Syst 161:1926–1940
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, Berlin
Villarino G, Gómez D, Rodríguez JT (2017) Improving supervised classification algorithms by a bipolar knowledge representation. In: Advances in fuzzy logic and technology 2017, p 518–529
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1:80–83
Willighagen E (2005) Genalg: R based genetic algorithm. http://cran.r-project.org/
Zadeh LA (1988) Fuzzy logic. Computer 21(4):83–93
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.