ISSN 1064-2307, Journal of Computer and Systems Sciences International, 2009, Vol. 48, No. 3, pp. 401–414. © Pleiades Publishing, Ltd., 2009. Original Russian Text © K.A. Naidenova, 2009, published in Izvestiya Akademii Nauk. Teoriya i Sistemy Upravleniya, 2009, No. 3, pp. 73–88.
ARTIFICIAL INTELLIGENCE
Reducing One Class of Machine Learning Algorithms to Logical Operations of Plausible Reasoning

K. A. Naidenova

Military Medicine Academy, St. Petersburg, Russia

Received May 24, 2007; in final form, August 26, 2008
Abstract—The ways to transform a wide class of machine learning algorithms into processes of plausible reasoning based on known deductive and inductive rules of inference are shown. The employed approach to machine learning problems is based on the concept of a good classification (diagnostic) test for a given set of positive and negative examples. The problem of inferring all good diagnostic tests is to search for the best approximations of the given classification (partition) on the established set of examples. The theory of algebraic lattices is used as a mathematical language to construct algorithms of inferring good classification tests. The advantage of the algebraic lattice is that it is given both as a declarative structure, i.e., a structure for knowledge representation, and as a system of dual operations used to generate elements of this structure. In this work, algorithms of inferring good tests are decomposed into subproblems and operations that are the main rules of plausible human inductive and deductive reasoning. The process of plausible reasoning is considered as a sequence of three mental acts: applying a rule of reasoning (inductive or deductive) to obtain a new assertion, refining the boundaries of the reasoning domain, and choosing a new rule of reasoning (deductive or inductive). DOI: 10.1134/S1064230709030083
INTRODUCTION

Online modeling of plausible human reasoning is critical to creating intelligent computer systems. Much effort is put into integrating two mutually complementary processes, inductive inference of rules (concepts) from examples and deductive reasoning. Two approaches to solving this problem are described. The first is implemented within inductive logic programming (ILP), where the inductive inference is based on reversing deductive rules of inference such as the resolution rule or implication rule [1]. Within the second approach, the integration is done by combining the well-known models of deductive inference based on first-order predicate logic and inductive methods of machine learning, where symbolic values of attributes are used to describe data and knowledge [2–7]. Machine learning is based on the notion of a concept. Within ILP, it is difficult to interpret concepts as subsets of learning examples, as is done in machine learning.

In terms of synthesizing plausible reasoning, the JSM method of automated hypothesis generation seems to be the most interesting [8–12]. Theoretically, this method is synthesized from several cognitive procedures, viz. empirical induction based on simulating John Stuart Mill's joint rule of agreement and distinction, with the initials forming the name of the method [13], and the causal analogy and abduction of C. Peirce [14]. The way the JSM method is implemented is described in [15–17]. Inductive generalizations in the JSM method are inferences of logical formulas with the generality quantifier from the set of examples that can be represented in some extension of the language of multivalued first-order predicate logic [17]. Rules of plausible inference are stated in terms of the same language, while procedures of inductive inference are given as those that calculate the areas of truth of the corresponding predicates.

The purpose of this work is to show that one class of machine learning algorithms can be transformed into a process of integral reasoning, where different rules (deductive, inductive, abductive, traductive, etc.) alternate and support each other. These are mental acts that can be found in any reasoning: stating new propositions, choosing the relevant part of knowledge and/or data for further steps of reasoning, and involving a new rule of reasoning (deductive, abductive, inductive, traductive, etc.).

The concept acts as the principal "atom" in plausible reasoning. This reasoning is based on the knowledge system, with objects, properties (values of attributes), and classifications (attributes) being its elements. If we take into account that implications express relations between concepts (the object and the class, the object and the property, the property and the class), we can assume that schemes of inferring and applying implications (rules of the "if–then" type) form the core of classification processes, which, in turn, form the basis of plausible reasoning. Deductive steps of plausible reasoning imply using known facts and statements of the "if–then" type to infer consequences from them. To do this, we apply deductive rules of inference, the main forms of which are modus ponens, modus tollens, modus ponendo tollens, and modus tollendo ponens. Inductive steps imply applying known facts and existing knowledge to infer new implicational statements and to correct those that turned out to be in contradiction with the existing knowledge. These steps rely on inductive rules of reasoning represented by the inductive canons stated in [13]. These canons are known as five inductive methods, viz. the method of unique agreement, the method of unique distinction, the joint method of agreement and distinction, the method of concomitant variations, and the method of residues.

The approach to machine learning problems we apply is based on the concept of a good classification (diagnostic) test for the involved set of positive and negative examples. The problem of inferring all good diagnostic tests is to search for the best approximations of the given classification (partition) on the given set of examples. The concept of the good diagnostic test (GDT) was initially formed within the subject of searching for functional and implicational dependencies in relational databases [18]. It was shown later that many problems of searching for logical dependencies in machine learning can be reduced to the problem of inferring all GDTs for the given classification of examples [19].

We consider the theory of mathematical lattices as a mathematical language for constructing algorithms of inferring good classification tests. The advantage of an algebraic lattice is that, being an algebraic structure, it can be described by algebraic expressions (by declarations) and, at the same time, by dual operations (procedures), which generate elements of the lattice and relations between them. At present, many researchers use algebraic lattices to represent data or knowledge in algorithms.
In this connection, works that deal with inductive inference of concepts from examples [20], conceptual clusterization [21], and inference of functional, implicational, and associative dependencies from data [22–32] are of interest. In addition, we mention publications that deal with searching for new attributes [33, 34]. In [35], formal concept analysis (FCA) was proposed on the basis of the concept lattice. This approach is widely covered in scientific publications [36–38]. Some algorithms for generating the set of all formal concepts over the set of object descriptions are given in [39–42].

In this work, we use the concept of inductive transition introduced in [32] that helps generate all nearest elements for any element of the algebraic lattice. The detailed analysis of algorithms of searching for all GDTs in terms of constructing the algebraic lattice allowed us not only to determine the structure of inferences but also to decompose algorithms into subproblems and operations that represent known deductive and inductive modes (modus operandi) of plausible reasoning.
1. LOGICAL RULES OF REASONING

To implement plausible reasoning, we need rules of three types.

EXAMPLES, or expressions for relations between real observable objects (facts or events). We can treat examples as logical rules with the lowest generalization degree. On the other hand, examples act as a source for deriving generalized rules based on learning samples.

FIRST-TYPE RULES, or knowledge. These rules describe regular relations between objects (classes) and their properties, between classes of objects, and between properties of different objects (classes). First-type rules are either given by experts in problem domains or inductively inferred when analyzing examples.

SECOND-TYPE RULES, or rules of reasoning that help apply, generate, and modify first-type rules. Second-type rules cover both inductive and deductive rules of inference.

1.1. First-Type Rules

First-type rules can be represented by only one class of logical propositions based on implicational dependencies between names. We use names to denote (represent) concepts, things, events, situations, or any substances. In formal expressions of logical rules, names can be treated as values of attributes. Then, we use the letters A, B, C, D, a, b, c, d, … for values of attributes in rules. We consider the following first-type rules.

Implication: a, b, c → d. This rule means that if all values in its left-hand side are true simultaneously, the value in the right-hand side of the rule is true as well.

Inhibition rule (inhibition), an implication of a special type: a, b, c → false (never). This rule prohibits the combination of values listed in its left-hand side. We can transform an inhibition into several implications as follows: a, b → not c; a, c → not b; b, c → not a.

Compatibility: a, b, c → VA, where VA is the frequency of occurrence of the rule.
Experts in many research areas use this rule to express the following observation: "values in the left-hand side of the rule do not always occur simultaneously but can occur rather frequently (or seldom)." Generally, the compatibility rule represents the most common combination of values, which is characterized by an insignificant number of exceptions (contrary examples), in contrast to a regularity, i.e., a rule that always holds. The compatibility is equivalent to the collection of implications a, b → c (VA); a, c → b (VA); b, c → a (VA).

Diagnostic rule: x, d → a; x, b → not a; d, b → false. For instance, d and b can be two values of one and the same attribute. This rule works when it is proved that 'x' is true and we need to find out whether 'a' is true. If 'x & d' is true, 'a' is true, but if 'x & b' is true, 'a' is false.
The rule of alternative: a or b → true (always); a, b → false. This rule tells us that a and b cannot be true simultaneously; either a or b is true but not both. This rule is a version of inhibition.

1.2. Second-Type Rules of Deductive Reasoning (Inferences)

Deductive steps of reasoning imply inferring consequences from some observable or known facts using propositions of the "if–then" type (i.e., knowledge, or first-type rules). To do this, we apply the deductive rules of inference of natural plausible reasoning, such as modus ponens, modus tollens, modus ponendo tollens, and modus tollendo ponens.

Suppose x is the collection of true values of some attributes (values, in what follows) observed simultaneously. We consider first-type rules used in deductive inferences.

Implication. Suppose r is an implication, left(r) is the left-hand side of r, and right(r) is the right-hand side of r. If left(r) ⊆ x, then x can be extended: x ← x ∪ right(r). The implication is applied on the basis of modus ponens: if A, then B; A; hence, B.

Inhibition. Now suppose r is an implication of the type y → not k. If left(r) ⊆ x, then k is a prohibited value for all extensions of x. The inhibition is applied on the basis of modus ponendo tollens: either A or B (A and B are alternatives); A; hence, not B; either A or B; B; hence, not A.

Compatibility. If r = 'a, b, c → k, VA', where VA is the value of the special attribute that characterizes the frequency of occurrence of the rule, and left(r) ⊆ x, then k can be used to extend x along with the value VA calculated for this extension. Calculating the estimate VA requires special consideration. In any case, to do this, we need a function that is monotonic, continuous, and bounded above. The compatibility rule is used on the basis of modus ponens.

Diagnostic rule. Suppose r is a diagnostic rule 'x, d → a; x, b → not a', where 'x' is true and 'a', 'not a' are hypotheses on potential values of some attribute.
The diagnostic rule is based on modus ponens and modus ponendo tollens. There exist several ways to disprove one of the hypotheses: (1) to infer either d or b using existing knowledge; (2) to involve new facts and/or propositions to infer new first-type rules that distinguish between the hypotheses 'a' and 'not a' (using the second-type rules of inductive reasoning) and to apply the newly obtained rules; (3) to get a direct answer on whether d or b is true in this case by involving an external source of knowledge.

The rule of alternative. If 'a' and 'b' are two alternatives (mutually exclusive hypotheses) about the values of some attribute, and the truth of one of the hypotheses can be established by a second-type rule of inference, then the second hypothesis is rejected. The rule is based on modus tollendo ponens: either A or B (A and B are alternatives); not A; hence, B; either A or B; not B; hence, A.

We can call the rules listed above the rules of "forward inference." Another common way to include implicational patterns (first-type rules) in natural reasoning can be called "backward inference"; it is implemented by a number of rules.

Generation of hypothesis, or the abduction rule. Suppose r is some implication y → k. Then the hypothesis "if k is true, y may be true" is generated.

Modus tollens. As before, r is some implication y → k. If 'not k' is inferred, 'not y' is also inferred.
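The forward and backward uses of first-type rules can be sketched as plain data plus a few small functions. The rule contents and function names below are invented for illustration; this is a minimal sketch, not the paper's machinery.

```python
# Illustrative sketch (rule contents invented): first-type rules as data,
# used forward (modus ponens, inhibition) and backward (abduction, tollens).
IMPLICATIONS = [({"a", "b"}, "d"), ({"y1", "y2"}, "k")]
INHIBITIONS = [{"a", "b", "e"}]                # a, b, e -> false

def forward_close(x):
    """Extend the set of true values x by modus ponens; fail on an inhibition."""
    x = set(x)
    changed = True
    while changed:
        changed = False
        for left, right in IMPLICATIONS:
            if left <= x and right not in x:
                x.add(right)
                changed = True
    for bad in INHIBITIONS:
        if bad <= x:
            raise ValueError(f"prohibited combination: {sorted(bad)}")
    return x

def abduce(observed):
    """Abduction: for each rule y -> k with k observed, hypothesize y."""
    return [left for left, right in IMPLICATIONS if right == observed]

def modus_tollens(refuted):
    """From 'not k' and a rule y -> k, conclude that y cannot hold in full."""
    return [left for left, right in IMPLICATIONS if right == refuted]

print(sorted(forward_close({"a", "b"})))   # ['a', 'b', 'd']
print(abduce("k"))                         # the hypothesis y = {y1, y2}
```

Forward closure repeats modus ponens until no rule fires, then checks inhibitions; the backward functions simply scan rules by their right-hand side.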
When applied, the rules given above generate reasoning that is not demonstrative. The purpose of the reasoning is to infer all possible hypotheses on the value of some objective attribute. It is essential that hypotheses contradict neither the knowledge (first-type rules) nor the observable real situation in which the reasoning takes place. Inference of hypotheses is reduced to constructing all intrinsically consistent extensions of the set of values x in which the number of involved attributes is the maximum possible (there are no prohibited pairs of values in such extensions). Each extension corresponds to one hypothesis only. The hypotheses have different admissibility, which is determined by the quantity and "quality" of the compatibility rules involved in the inference of each of them.

1.3. Second-Type Rules of Inductive Reasoning (Inferences)

Performing inductive steps of plausible inference, we operate with known facts and propositions, observations, and experimental results to obtain or correct first-type rules. The British logician John Stuart Mill stated the main forms of induction as canons that combine the method of unique agreement, the method of unique distinction, the joint method of agreement and distinction, the method of concomitant variations, and the method of residues. In this work, we use the first three canons.

The method of unique agreement means that if previous events (values, in our terms) A, B, and C lead to events (values) a, b, and c, and events (values) A, D, and E generate events (values) a, d, and e, then A causes a.

The method of unique distinction means that if previous events (values, in our terms) A, B, and C lead to (cause) events (values) a, b, and c, and events (values) B and C generate events (values) b and c, then A causes a.

The joint method of agreement and distinction implies applying the two previous methods simultaneously.
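The first two canons admit a direct set-theoretic reading. The toy sketch below (event names taken from the statements above, helper names invented) returns A as the candidate cause in both examples:

```python
# Toy illustration of Mill's first two canons (helper names invented).

def method_of_agreement(observations, effect):
    """Candidate causes: antecedents common to every case showing the effect."""
    cases = [set(ants) for ants, effects in observations if effect in effects]
    return set.intersection(*cases) if cases else set()

def method_of_distinction(with_effect, without_effect):
    """Candidate causes: antecedents present only when the effect occurs."""
    return set(with_effect) - set(without_effect)

obs = [({"A", "B", "C"}, {"a", "b", "c"}),
       ({"A", "D", "E"}, {"a", "d", "e"})]
print(method_of_agreement(obs, "a"))                       # {'A'}
print(method_of_distinction({"A", "B", "C"}, {"B", "C"}))  # {'A'}
```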
Table 1. Classification of examples 1 (taken from [20])

Example   Height index   Hair color   Eye color   Example class
   1          Low           Blond        Bleu           1
   2          Low           Brown        Bleu           2
   3          Tall          Brown        Embrown        2
   4          Tall          Blond        Embrown        2
   5          Tall          Brown        Bleu           2
   6          Low           Blond        Embrown        2
   7          Tall          Red          Bleu           1
   8          Tall          Blond        Bleu           1

Remark. Here and in what follows, please treat values of attributes as symbols rather than English words.
2. A CONCEPT OF A GOOD DIAGNOSTIC TEST

We base our approach to machine learning on the concept of a GDT, which is defined as the best approximation of a classification given on some set of learning examples [18, 19]. We now define a GDT.

Let R be a set of examples, and let S = {1, 2, …, i, …, n} be the set of indices of examples, where n is the number of examples in R. We introduce R(+) and S(+) to denote the set of positive examples and the set of their indices, respectively. We use R(–) = R/R(+) and S(–) = S/S(+) to denote the set of negative examples and the set of their indices, respectively. Let U be a set of attributes and T a set of values of attributes (values, for the sake of brevity), each of which occurs in at least one example of the set R. We use s(A), A ∈ T, to denote the subset {i ∈ S: A occurs in ti, ti ∈ R}, where S = {1, 2, …, n}. Following [26], we call s(A) the interpretation of A in R. We can naturally define s(t) for any collection of values t, t ⊆ T: if t = A1A2 … Am, then s(t) = s(A1) ∩ s(A2) ∩ … ∩ s(Am).

D e f i n i t i o n 1. The collection of values t ⊆ T (s(t) ≠ ∅) is a diagnostic test for the set of examples R(+) if and only if the condition t ⊄ t*, ∀t*, t* ∈ R(–), is satisfied (its equivalent condition is s(t) ⊆ S(+)).

Let k be the name of some set of examples R(k). Stating that the collection of values t is a diagnostic test for R(k) is equivalent to stating that t "covers" no example t*, t* ∉ R(k), i.e., t ⊄ t* for all t* ∉ R(k). At the same time, it follows from the condition s(t) ⊆ S(k) that the implicational dependence 'if t, then k' is true. Thus, a diagnostic test (DT), as a collection of values, forms the left-hand side of some first-type rule. It is obvious that the set of all DTs for the given set of examples R(+) (we denote this set by DT(+)) is the set of all collections of values t for which s(t) ⊆ S(+) holds.
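Definition 1 can be checked mechanically on the data of Table 1. In the sketch below, the encoding of examples as Python sets is assumed (not from the paper), and class 1 is taken as R(+):

```python
# Definition 1 on the data of Table 1; class 1 is taken as R(+).
R = {1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
     3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
     5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
     7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"}}
S_plus = {1, 7, 8}                      # indices of the class-1 examples

def s(t):
    """The interpretation of t: indices of all examples covered by t."""
    return {i for i, ti in R.items() if set(t) <= ti}

def is_test(t):
    """t is a diagnostic test for R(+) iff s(t) is nonempty and s(t) ⊆ S(+)."""
    return bool(s(t)) and s(t) <= S_plus

print(sorted(s({"Blond", "Bleu"})))     # [1, 8]
print(is_test({"Blond", "Bleu"}))       # True
print(is_test({"Blond"}))               # False: also covers examples 4 and 6
```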
For any pair of DTs ti, tj ∈ DT(+), exactly one of the following relations holds: s(ti) ⊆ s(tj), s(ti) ⊇ s(tj), or s(ti) ≈ s(tj), where the last relation means that s(ti) and
s(tj) are not comparable, i.e., s(ti) ⊄ s(tj) and s(tj) ⊄ s(ti). This observation leads to the concept of a GDT.

D e f i n i t i o n 2. The collection of values t ⊆ T (s(t) ≠ ∅) is a GDT for the set of examples R(+) if and only if s(t) ⊆ S(+) and, simultaneously, the condition s(t) ⊂ s(t*) ⊆ S(+) is not satisfied for any t*, t* ⊆ T, such that t* ≠ t.

D e f i n i t i o n 3. The collection of values t ⊆ T is irredundant if for any value A ∈ t the condition s(t) ⊂ s(t/A) is satisfied.

If the collection of values t is a good test for R(+) and, simultaneously, an irredundant collection of values, then no proper subset of t is a test for R(+).

D e f i n i t i o n 4. The collection of values t ⊆ T is maximally redundant if for any implicational dependence X → A satisfied in R, the condition X ⊆ t implies that t also includes A: A ∈ t.

If t is a maximally redundant collection of values, then for any value A ∉ t, A ∈ T, s(t) ⊃ s(t ∪ A) holds. In other words, the maximally redundant collection of values t covers more examples than any collection of values (t ∪ A), where A ∉ t. If the diagnostic test t for the given set of examples R(+) is a good test and a maximally redundant collection of values simultaneously, then for any value A ∉ t, A ∈ T, the following proposition holds: (t ∪ A) is not a good test for R(+).

Any example t in R is a maximally redundant collection of values since for any value A ∉ t, A ∈ T, the set s(t ∪ A) is ∅.

For instance, in Table 1, the 'Blond Bleu' collection is an irredundant test for class 1 and, simultaneously, a maximally redundant collection of values. The 'Blond Embrown' collection is a test for class 2, yet not a good one, and, simultaneously, a maximally redundant collection of values. The 'Embrown' value is a good irredundant test for class 2, while the 'Red' value is a good irredundant test for class 1. The 'Tall Red Bleu' collection is a good maximally redundant test for class 1.
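Definitions 2 and 3 can likewise be verified by brute force over all collections of values. The sketch below (same assumed Python encoding of Table 1, class 1 as R(+)) confirms that 'Blond Bleu' is a good irredundant test for class 1:

```python
# Brute-force check of Definitions 2 and 3 on Table 1 (encoding assumed).
from itertools import combinations

R = {1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
     3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
     5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
     7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"}}
S_plus = {1, 7, 8}
T = set().union(*R.values())

def s(t):
    return {i for i, ti in R.items() if set(t) <= ti}

def is_test(t):
    return bool(s(t)) and s(t) <= S_plus

def is_good(t):
    """No collection t* yields a strictly larger s(t*) still inside S(+)."""
    return is_test(t) and not any(
        s(t) < s(c) <= S_plus
        for r in range(1, len(T) + 1) for c in combinations(T, r))

def is_irredundant(t):
    """Removing any value strictly enlarges the interpretation."""
    return all(s(t) < s(set(t) - {v}) for v in t)

t = {"Blond", "Bleu"}
print(is_good(t), is_irredundant(t))    # True True
```

The exhaustive scan over all subsets of T is only practical for toy data; the lattice-based search described in Sec. 3 avoids it.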
It is obvious that good irredundant tests are the best tests both for pattern recognition and for the detection of data regularities. These tests help construct the shortest first-type rules with the highest degree of generalization. One possible way to search for good irredundant tests (GITs) for a given class of positive examples is as follows: first, find all good maximally redundant tests (GMTs); then, for each GMT, find all GITs it includes. This search strategy is handy since each GIT belongs to one GMT only, their interpretations being the same [18], and, moreover, GITs are generated by the same algorithm as GMTs [31]. Note that, taking into account [12] and the introduced GIT definitions, the JSM method of automated inference of hypotheses, which employs combinatorial algorithms based on the exhaustive search of intersections of objects, allows us to find all maximally redundant intersections (MRIs) and, consequently, all good MRIs that correspond to maximal subsets of objects. The
JSM method does not construct all good irredundant intersections of objects, except for the particular case when the good irredundant intersection coincides with its equivalent (with respect to the intersected objects) MRI. Note also that the minimal intersection in the JSM method does not correspond to any good irredundant collection of values in our approach.

3. DUALITY OF GOOD DIAGNOSTIC TESTS

In the GDT definition, we implicitly used the relation G on S × T and two mappings, S → T and T → S. Let s ⊆ S, t ⊆ T. We define the mappings as follows: S → T: t(s) = {intersection of all ti: ti ⊆ T, i ∈ s} and T → S: s(t) = {i: i ∈ S, t ⊆ ti}. These mappings are known in mathematics as a Galois correspondence [43, 44], which forms the basis of the definition of the concept proposed by R. Wille [45]. They are also used in the CHARADE knowledge mining system [46].

When s is extended by adding the index j* of some new example, the number of values in the intersection of the examples with indices from the extended set is equal to or less than the number of values in the intersection of the examples with indices from the original set (generating a more general attribute of examples): s ⊆ (s ∪ j*) leads to t(s ∪ j*) ⊆ t(s). When t is extended by adding a new value A, the number of examples possessing the common attribute 'tA' (covered by the set 'tA') is equal to or less than the number of examples possessing the attribute 't' (covered by the set 't'): t ⊆ (t ∪ A) leads to s(t ∪ A) ⊆ s(t).

We also introduce the following generalization operators (functions): generalization_of(t) = t' = t(s(t)) and generalization_of(s) = s' = s(t(s)).

For the generalization of the collection s, i.e., the sequence of mappings s → t(s) → s(t(s)), we have s(t(s)) ⊇ s. As a matter of fact, this generalization operator gives the maximum set of examples possessing the attribute t(s). For the generalization of the set t, i.e., the chain of mappings t → s(t) → t(s(t)), we have t(s(t)) ⊇ t.
The meaning of this generalization operator is that it finds the maximal common attribute (collection of values) of the examples whose indices form s(t).

These generalization operators are not artificially constructed. One can mentally perform a great number of such operations in a short period of time. We give several examples. Suppose a person has watched two movies (s) starring Gerard Depardieu (t(s)). Then, this person wants to find all other movies featuring this actor (s(t(s))). Suppose he knows that Gerard Depardieu plays together with Pierre Richard (t) in several movies (s(t)). Then, he can learn that these are movies of the same director, Francis Veber (t(s(t))). It is these generalization operators that are used in searching for GDTs.

We describe a DT as a dual object, i.e., as a pair (SL, TA), SL ⊆ S, TA ⊆ T, SL = s(TA), and TA = t(SL).
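The two mappings, the generalization operators, and the dual lattice operations of Sec. 3.1 can all be sketched directly on the data of Table 1 (the Python set encoding of the examples is assumed):

```python
# Sketch of the Galois mappings, generalization operators, and dual lattice
# operations on the data of Table 1 (the set encoding is assumed).
R = {1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
     3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
     5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
     7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"}}

def t_of(s):
    """S -> T: the common values of the examples with indices in s."""
    return set.intersection(*(R[i] for i in s)) if s else set()

def s_of(t):
    """T -> S: the indices of all examples containing every value of t."""
    return {i for i, ti in R.items() if set(t) <= ti}

def generalization_of_t(t):
    return t_of(s_of(t))              # t(s(t)) includes t

def generalization_of_s(s):
    return s_of(t_of(s))              # s(t(s)) includes s

def join(p, q):                       # ((s* ∪ s), (t* ∩ t))
    return (p[0] | q[0], p[1] & q[1])

def meet(p, q):                       # ((s* ∩ s), (t* ∪ t))
    return (p[0] & q[0], p[1] | q[1])

print(t_of({7, 8}))                   # common values of examples 7 and 8
print(generalization_of_s({5, 8}))    # all examples sharing t({5, 8})
print(join(({7}, t_of({7})), ({8}, t_of({8}))))   # join of two dual objects
```

For instance, generalization_of_s({5, 8}) returns {5, 7, 8}: examples 5 and 8 share 'Tall Bleu', and example 7 also carries both values.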
D e f i n i t i o n 5. Suppose PM = {s1, s2, …, sm} is a family of subsets of some set M. Then, PM is called a Sperner system [47] if the condition si ⊄ sj and sj ⊄ si, ∀(i, j), i ≠ j, i, j = 1, …, m, holds.

We consider R, S, R(+), S(+), R(–), and S(–) to be defined as before.

D e f i n i t i o n 6. Finding all GMTs for the given set of examples R(+) means constructing the family PS of subsets s1, s2, …, sj, …, snp of the set S(+) such that

(1) PS is a Sperner system;

(2) each sj is a maximal set in the sense that adding to it the index i of an example ti such that i ∉ sj, i ∈ S(+), leads to the condition s(t(sj ∪ i)) ⊄ S(+); in other words, t(sj ∪ i) is not a test for the set R(+) since there exists an example t*, t* ∈ R(–), for which t(sj ∪ i) ⊆ t*.

We can give the set TGOOD of all GMTs as {t: t = t(sj), sj ∈ PS, j = 1, …, np}.

D e f i n i t i o n 7. Finding all GITs for the given set of examples R(+) means constructing the family PRT of subsets t1, t2, …, tj, …, tnq of the set T such that

(1) tj ⊄ t, ∀j = 1, …, nq, ∀t, t ∈ R/R(+), and, simultaneously, for any tj, j = 1, …, nq, s(tj) ≠ ∅, there is no set s* ≠ s(tj), s* ⊆ S, such that s(tj) ⊂ s* ⊆ S(+);

(2) PRT is a Sperner system;

(3) each tj is a minimal set in the sense that removing any value A from it generates the condition s(tj/A) ⊄ S(+); this means that the obtained collection of values is no longer a test for the given set of examples.

3.1. Generating Dual Objects by Lattice Operations

Suppose that R is a set of examples and S and T have the same meaning as before. We introduce MUT, the set of all dual objects, viz. the set of all pairs (s, t), s ⊆ S, t ⊆ T, s = s(t), and t = t(s). This set is partially ordered with respect to the relation ≤, where (s, t) ≤ (s*, t*) holds if and only if s ⊆ s* and t ⊇ t*.
The set Ψ = (MUT, ∪, ∩) is an algebraic lattice, where the operations ∪ and ∩ are given for all pairs (s*, t*), (s, t) ∈ MUT as follows [45]: (s*, t*) ∪ (s, t) = ((s* ∪ s), (t* ∩ t)) and (s*, t*) ∩ (s, t) = ((s* ∩ s), (t* ∪ t)) (applying these operations over all elements of MUT yields the unity and the zero of the lattice, respectively).

To construct good tests, we need to find, for any element (s*, t*) ∈ MUT, all its nearest elements with respect to the relation ≤ in the lattice, i.e., all such (s, t) that (s*, t*) ≤ (s, t) and there is no element (s**, t**) for which (s*, t*) ≤ (s**, t**) ≤ (s, t), or all such (s, t) that (s*, t*) ≥ (s, t) and there is no element (s**, t**) that satisfies the condition (s*, t*) ≥ (s**, t**) ≥ (s, t).

The inference of chains of lattice elements ordered with respect to inclusion is the basis of generating all types of DTs:

(1) s0 ⊆ … ⊆ si ⊆ si+1 ⊆ … ⊆ sm, t(s0) ⊇ t(s1) ⊇ … ⊇ t(si) ⊇ t(si+1) ⊇ … ⊇ t(sm);
(2) t0 ⊆ … ⊆ ti ⊆ ti+1 ⊆ … ⊆ tm, s(t0) ⊇ s(t1) ⊇ … ⊇ s(ti) ⊇ s(ti+1) ⊇ … ⊇ s(tm).

3.2. Inductive Rules to Construct Elements of the Dual Lattice

To construct chains of elements of the dual lattice, we introduce the following variants of inductive transition from one element of a chain to its next element:

(1) from sq = (i1, i2, …, iq) to sq+1 = (i1, i2, …, iq, iq+1);

(2) from tq = (A1, A2, …, Aq) to tq+1 = (A1, A2, …, Aq, Aq+1);

(3) from sq = (i1, i2, …, iq) to sq–1 = (i1, i2, …, iq–1);

(4) from tq = (A1, A2, …, Aq) to tq–1 = (A1, A2, …, Aq–1).

Inductive transitions are processes of extending or narrowing collections of values (or of indices of examples). Transitions can be smooth or boundary. For smooth transitions, collections of values (indices of examples) are extended (narrowed) with a controllable property of these collections preserved, such as "is a test for examples of the given class" ("corresponds to a test for examples of the given class"), "is an irredundant collection of values", "is not a test for examples of the given class" ("does not correspond to a test for examples of the given class"), "is a good test for examples of the given class" ("corresponds to a good test for examples of the given class"), etc. For boundary transitions, the controllable property of the collections is changed to the opposite one.

Note that the reasoning starts from using schemes of reducing the exhaustive search when searching for tests and the directed selection of subspaces of the test search that necessarily include the desired solutions. Thus, we need methods that help (a) avoid the construction of all subsets of each set of values (indices of examples) and (b) choose values (indices of examples) that are critical for the solving process. To do this, we introduce the concepts of admissible and essential values (indices of examples).
Constructing tests implies generating deductive first-type rules (implications, inhibitions, compatibility and diagnostic rules), which are immediately applied to reduce the search space. For inductive transitions to be implemented, special rules are needed.

3.2.1. Rule of generalization. We apply the rule of generalization to obtain all collections of indices sq+1 = {i1, i2, …, iq+1} from a collection of indices sq = {i1, i2, …, iq} such that both t(sq) and t(sq+1) are tests for the given set of positive examples.

Extending a set of objects with some property of it preserved represents a typical method of inductive reasoning. For instance, the mathematician Bachet de Meziriac discovered the theorem of four squares (though he did not prove it), justifying the truth of his hypothesis for all integer numbers up to 325 inclusive [48]. Any chain of form (1) is considered to be terminated if for
all extensions sq+1 of the collection sq, the collection of values t(sq+1) is not a test for the given set of positive examples. We consider some implementations of this rule for the GMT search.

The main inductive rule for extending sets s. Let S(test) be a partially ordered set of elements s = {i1, i2, …, iq}, q = 1, 2, …, nt – 1, to be extended and satisfying the following condition: t(s) is a test for the positive examples from R(+). Here, nt is the number of positive examples. We use STGOOD to denote the partially ordered set of elements s for which t(s) is a GMT for the positive examples from R(+) and, hence, s cannot be extended.

The rule of extending elements of the set S(test), i.e., constructing the collections {i1, i2, …, iq+1} out of the collections {i1, i2, …, iq}, q = 1, 2, …, nt – 1, is based on the following consideration. If the collection {i1, i2, …, iq+1} corresponds to a test for the positive examples R(+), then all its proper subcollections should also correspond to tests and, consequently, they must either be included in S(test) or be subcollections of some collections of the set {S(test) ∪ STGOOD}. Having obtained the collection sq+1 = {i1, i2, …, iq+1}, we check whether it corresponds to a test. If t(sq+1) is not a test, we remove sq+1 from consideration; otherwise, this collection is stored in S(test). If no extension sq+1 of sq corresponds to a test, sq corresponds to a GMT and is moved from S(test) to STGOOD. Elements of the sets S(test) and STGOOD are arranged in the lexicographic order. The function to_be_test(t) is given as follows: if s(t) ∩ S(+) = s(t), then true; otherwise, false.

The rule of generalization is a rule of inductive extension that implies selecting indices admissible for the extension of sq. We introduce the way of selecting indices admissible for the extension of s. We assume that S(test) and STGOOD are not empty and s ∈ S(test). We construct the set V: V = {∪s', s ⊆ s', s' ∈ {S(test) ∪ STGOOD}}.
The set V combines all collections of indices from S(test) and STGOOD that include s; hence, s can be found in their intersection. If we want the extension of the set s to be found in none of the elements of the set {S(test) ∪ STGOOD}, we need to extend s using only those indices that are not present in V and s simultaneously. The set of indices that are candidates for the extension of s is the set CAND(s): CAND(s) = nts/V, where nts = {∪ s, s ∈ S(test)}. Forming CAND(s) is a step of controlling the region of the test search. The index j* ∈ CAND(s) is not admissible for the extension of s if there is at least one index i ∈ s such that the pair {i, j*} either does not correspond to a test or corresponds to a GMT (i.e., the pair {i, j*} belongs to STGOOD and cannot be extended). We use Q to denote the set of all pairs of indices prohibited for extension: Q = {{i, j}: t({i, j}) is not a test for positive examples R(+)}. Then, we finally obtain the set select(s) of indices admissible for the extension of s: select(s) = {i, i ∈ CAND(s): ∀j (j ∈ s), {i, j} ∉ {STGOOD or Q}}.

Forming the set Q, we generate deductive rules of inhibition and then use them to select a subset of admissible indices. The set Q can be constructed at the very beginning of the process of searching for all GMTs for R(+). For the set select(s), the EXTENSION(s) procedure returns the set ext(s) of all possible extensions of s of the form snew = (s ∪ j), j ∈ select(s), such that snew corresponds to a test for R(+). This procedure also executes the function generalization_of(snew) for each element snew ∈ ext(s). If the sets ext(s) and V are empty, s corresponds to some GMT for R(+) and s is moved from S(test) to STGOOD. If ext(s) consists of one and only one element, this element corresponds to some GMT and is put into STGOOD, with s being removed from S(test). In all other cases, the set ext(s) replaces s in S(test). The EXTENSION(s) procedure implements the inductive rule of joint agreement and distinction, selecting only those sets of indices that correspond to DTs, i.e., to first-type rules whose left-hand side characterizes the set of positive examples and, at the same time, can be found in no negative examples.

The rule of generalization is a complex process in which both deductive and inductive second-type rules are performed, as illustrated in Table 2. When the collection of indices is extended, new knowledge is formed (the sets Q, S(test), and STGOOD) that is immediately used to narrow the region of the test search. The extension of s results in subsets of positive examples that are ever more powerful and possess ever more general attributes (sets of values). For instance, R. Michalski applied a similar rule to generate "stars" in conceptual clusterization [49]. The NIAGaRa algorithm for the GMT search based on the rule of generalization given above is considered in [32].

Table 2. Using deductive and inductive rules for extending collections of indices

  Inductive rule: the rule of generalization
  Process                     Deductive and inductive second-type rules used
  Forming Q                   Generating inhibition rules
  Forming CAND(s)             Narrowing the search region
  Forming select(s)           Using inhibition rules
  Forming ext(s)              Using the method of joint agreement and distinction
  Function to_be_test(t)      Using the implication
  Generalization_of(snew)     Using the lattice operation

3.2.2. The rule of specialization. This rule is used to search for all GITs included in some already found GMT. To search for GMTs, we need to extend collections of values, i.e., to construct a chain (2) of the type t0 ⊆ … ⊆ ti ⊆ ti + 1 ⊆ … ⊆ tm (s(t0) ⊇ s(t1) ⊇ … ⊇ s(ti) ⊇ s(ti + 1) ⊇ … ⊇ s(tm)). The condition of termination of any chain of specializations is as follows: all extensions tq + 1 of the collection tq are either redundant collections of values or tests for the given set of positive examples.

The main inductive rule of extending collections t. Suppose t* is a GMT for R(+), and TGOOD is a partially ordered set of elements t ⊆ t* that satisfy the following condition: t is a GIT for R(+). We use SAFE to denote the set of collections t ⊆ t* such that t is an irredundant collection of values and not a test for R(+). The inductive rule of extending elements of the set SAFE, i.e., constructing collections tq + 1 = (A1, A2, …, Aq + 1) out of tq = (A1, A2, …, Aq), q = 1, 2, …, na* – 1, where na* is the number of values in the collection t*, is based on the following consideration: if the collection of values {A1, A2, …, Aq + 1} is irredundant, all its proper subcollections should also be irredundant collections of values and, hence, they are included in the set SAFE. Forming the collection of values tq + 1 = {A1, A2, …, Aq + 1}, we check whether it is irredundant. If tq + 1 is redundant, we remove it from the set SAFE. If this collection is a test for R(+), we move it from SAFE to TGOOD. If tq + 1 is an irredundant collection of values and is not a test for R(+), it remains a candidate to be extended. To extend collections, we use the function to_be_irredundant(t): if ∀Ai (Ai ∈ t), s(t) ≠ s(t/Ai), then true; otherwise, false. This variant of inductive extension of collections of values to construct GITs is considered in [18, 27, 50]. A similar rule of inductive extension of collections of values is also applied in the Apriori and AprioriTid algorithms proposed in [51] to search for association rules in large databases of "market basket" records. The rule of specialization is a rule of inductive extension that implies selection of values admissible for the extension of t.
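The function to_be_irredundant can be sketched as follows (an illustration under our own encoding of the examples; the names EXAMPLES and s_of are assumptions, not the paper's notation):

```python
# Examples as sets of values (illustrative). A collection of values t is
# irredundant if removing any single value A changes s(t), the set of
# examples that contain t.
EXAMPLES = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c"}, {"c", "d"}]

def s_of(t):
    """s(t): indices of all examples that include the collection of values t."""
    return frozenset(i for i, ex in enumerate(EXAMPLES) if t <= ex)

def to_be_irredundant(t):
    """True iff for every value A in t, s(t) != s(t \\ {A})."""
    return all(s_of(t) != s_of(t - {A}) for A in t)

# In this data every example containing "b" also contains "a", so "a" is
# redundant in {"a", "b"}: removing it leaves s(t) unchanged.
```

For instance, here to_be_irredundant({"a", "b"}) is False, while to_be_irredundant({"a", "c"}) is True, so only the latter collection would remain a candidate for extension in SAFE.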
In pattern recognition, the process of inferring hypotheses on unknown values of attributes, provided that the values of some other attributes are true, is reduced to the maximum possible extension of the collection of known values so that no pair of prohibited values is included in this extension.

Dual rules of generalization (specialization) narrow the collection of values (indices of examples). The dual rule of generalization is used to obtain all collections of values tq – 1 = (A1, A2, …, Aq – 1) from the collection tq = (A1, A2, …, Aq) so that tq and tq – 1 are tests
for the given set of positive examples. Thus, collections of values are narrowed when dependent or spurious values are removed from some redundant collection of values. The dual rule of specialization is designed for obtaining all collections of indices sq – 1 = (i1, i2, …, iq – 1) out of the collection sq = (i1, i2, …, iq) such that t(sq – 1) and t(sq) are tests for the given set of positive examples. Collections of indices of examples (and, respectively, collections of objects) are narrowed, for instance, when a subclass of objects is formed within some class of them. Thus, we separate lemons within the class of citrus plants.

3.2.3. Rules of implementation of boundary inductive transitions. We consider boundary transitions for obtaining all collections of
(1) values tq = (A1, A2, …, Aq) out of tq – 1 = (A1, A2, …, Aq – 1) so that tq – 1 is not, whereas tq is, a test for the given set of positive examples;
(2) values tq – 1 = (A1, A2, …, Aq – 1) out of tq = (A1, A2, …, Aq) so that tq is a test for the given set of positive examples while tq – 1 is not;
(3) indices sq – 1 = (i1, i2, …, iq – 1) out of sq = (i1, i2, …, iq), provided that t(sq) is not a test and t(sq – 1) is a test for the given set of positive examples;
(4) indices sq = (i1, i2, …, iq) out of sq – 1 = (i1, i2, …, iq – 1) so that t(sq – 1), unlike t(sq), is a test for the given set of positive examples.
All boundary transitions are interpreted as mental operations. Transition (1) is interpreted, for instance, as the search for an attribute that makes a difference between two illnesses with similar symptoms. Transition (2) occurs when some class of objects is included in a more general class; for instance, squares are included in the subclass of parallelograms with equal sides. Some intelligence psychological tests ask us to exclude the superfluous object from a given collection of objects (a rose, a butterfly, a phlox, and a dahlia) (transition (3)).
Transition (4) can be seen as a search for a disproving example when extending the inductive base of some hypothesis.

Boundary transitions implement the method of unique distinction or the method of concomitant variations. To perform such transitions, we need special inductive diagnostic rules of reasoning. These rules assume searching for essential values and for indices of essential examples. The inductive diagnostic rule forms the collection of values tq + 1 = {A1, A2, …, Aq + 1} out of the collection tq = {A1, A2, …, Aq} so that tq, unlike tq + 1, is not a test for the given set of positive examples. To extend the collection tq, those values are used that simultaneously occur in examples from R(+) and are not present in any negative example from R(–) that includes tq. The set of essential values ess(t) is V(+)/V(–) for t, where the sets V(–) and V(+) are constructed as follows: V(–) = {∪ t': t ⊆ t', t' ∈ R(–)}; V(+) = {∪ t': t ⊆ t', t' ∈ R(+)}.
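The set of values admissible for extending a non-test collection t, namely those in V(+) but not in V(–), can be computed as follows (a sketch with illustrative data; the name extension_values is an assumption):

```python
# Illustrative positive and negative examples.
R_PLUS = [{"a", "b", "c"}, {"a", "b", "d"}]
R_MINUS = [{"a", "c", "e"}, {"b", "e"}]

def extension_values(t):
    """Values occurring in positive examples that include t (V(+)),
    minus values occurring in negative examples that include t (V(-))."""
    v_plus = set().union(*[ex for ex in R_PLUS if t <= ex])
    v_minus = set().union(*[ex for ex in R_MINUS if t <= ex])
    return v_plus - v_minus
```

Here extension_values({"a"}) yields {"b", "d"}; adding either value to {"a"} produces a collection contained in no negative example, i.e., a test.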
D e f i n i t i o n 8. Let t be a collection of values that represents a test for the given set of positive examples. The value A in t is essential if t/A is not a test for the given set of positive examples.

In the general case, it is important to find the maximum subset sbmax(t) ⊂ t such that t is, whereas t' = sbmax(t) is not, a test for the given set of positive examples. Then, sbmin(t) = t/sbmax(t) is the minimal set of essential values in t. Applying the inductive diagnostic rule results in obtaining diagnostic first-type rules. We can consider this inductive diagnostic rule itself as an implementation of the inductive method of unique distinction. R. Michalski applied a similar rule [52].

The dual inductive diagnostic rule forms the collection of indices sq – 1 = (i1, i2, …, iq – 1) out of the collection of indices sq = (i1, i2, …, iq) so that t(sq – 1) is, and t(sq) is not, a test for the given set of positive examples. This rule requires a method for searching for indices that can be removed from sq. Similarly to an essential value, we define an essential example.

D e f i n i t i o n 9. Let s be a subset of indices of positive examples. We assume that t(s) is not a test. The example tj, j ∈ s, is called essential if t(s/j) turns out to be a test for the given set of positive examples.

In the general case, it is important to find the maximum subset sbmax(s) ⊂ s such that t(s) is not, whereas t' = t(sbmax(s)) is, a test for the given set of positive examples. Then, sbmin(s) = s/sbmax(s) is the minimal set of indices of essential examples in s. We use the dual inductive diagnostic rule to infer diagnostic first-type rules of compatibility. The number of indices in sbmax(s) can be treated as the measure of reliability for the rule being formed, i.e., t(sbmax(s)) → k(R(+)), frequently, where k(R(+)) is the name of the set of positive examples. We assume that s* is the collection of indices of positive examples such that t(s*) is not a test.
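Essential examples in the sense of Definition 9 can be found by trying single removals (a sketch on illustrative data; this is not the quasi-maximal-subset procedure from [32]):

```python
# Illustrative data: t({0, 1, 2}) = {"a"} is not a test, because {"a"} is
# contained in the negative example {"a", "d"}.
R_PLUS = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
R_MINUS = [{"a", "d"}]

def t_of(s):
    """t(s): intersection of the positive examples indexed by s."""
    result = set(R_PLUS[next(iter(s))])
    for i in s:
        result &= R_PLUS[i]
    return result

def is_test(t):
    return bool(t) and not any(t <= neg for neg in R_MINUS)

def essential_examples(s):
    """Indices j in s such that t(s \\ {j}) becomes a test (Definition 9)."""
    return {j for j in s if len(s) > 1 and is_test(t_of(s - {j}))}
```

Here essential_examples({0, 1, 2}) returns {0, 1}: dropping either of the first two examples makes the intersection specific enough to exclude the negative example.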
In [32], a procedure is given that forms a quasi-maximum subset qsbmax(s*) ⊂ s* such that t(qsbmax(s*)) is a test for the given set of positive examples. The dual inductive diagnostic rule is based on the method of unique distinction.

Deductive first-type rules of compatibility formed when searching for good tests are used in the same process to speed up the search for new tests. The rules of constructing DTs as elements of the dual lattice are used to search for deductive first-type rules of reasoning, as shown in Table 3.

4. DECOMPOSING THE SEARCH OF GOOD DIAGNOSTIC TESTS INTO SUBPROBLEMS

To transform the search of GDTs into a step-by-step process, we introduce two kinds of subproblems [53] stated for the given set of positive examples:
(1) for the particular positive example t, find all GMTs included in t;
(2) for some nonempty collection of values X (it may consist of one value only), which is not a test, find all GMTs that include X.

Subproblem of the first kind. We introduce the concept of the projection of a given positive example onto the given set of positive examples. The projection proj(R)[t] is the set Z = {z: (z is a nonempty intersection of t and t') & (t' ∈ R(+)) & (z is a test for R(+))}. If the projection proj(R)[t] is nonempty and consists of more than one element, it forms a subproblem of inferring all GMTs included in t. If the projection includes one and only one element that equals t, then t is a GMT.

Subproblem of the second kind. We use the concept of the attribute projection of a given value A onto the given set of positive examples: proj(R)[A] = {t: (t ∈ R(+)) & (A appears in t)}. There is another way to define the attribute projection: proj(R)[A] = {ti: i ∈ (s(A) ∩ s(+))}. If the attribute projection is not empty and consists of more than one element, it forms the subproblem of inferring all GMTs that include A. If A appears in one and only one example, A does not belong to any GMT different from this example. It makes sense to obtain the projection of A only if A is not a test itself, i.e., s(A) ⊄ s(+), and the intersection of all positive examples that include A, t' = t(s(A) ∩ s(+)), is not a test for the given set of positive examples.

To decompose the inference of good tests into subproblems, we need to introduce special rules that implement the following operations: choosing the example (the value) for a subproblem, forming the subproblem, removing values or examples from the subproblem, and some other rules that control the inference of good tests. The following theorem justifies reducing projections of both the first and the second kind.

T h e o r e m. Let A be a value from the set T, X be a GMT for the given set R(+) of positive examples, and s(A) ⊆ s(X). Then, A does not belong to any GMT for R(+) that differs from X.
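The two projections defined above can be sketched as follows (illustrative data; the function names proj_example and proj_value are assumptions):

```python
# Illustrative positive and negative examples.
R_PLUS = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
R_MINUS = [{"c", "d", "e"}]

def is_test(t):
    return bool(t) and not any(t <= neg for neg in R_MINUS)

def proj_example(t):
    """proj(R)[t]: nonempty intersections of t with positive examples
    that are themselves tests for R(+)."""
    intersections = {frozenset(t & ex) for ex in R_PLUS}
    return {z for z in intersections if z and is_test(z)}

def proj_value(A):
    """proj(R)[A]: positive examples in which the value A appears."""
    return [ex for ex in R_PLUS if A in ex]
```

Since proj_example({"a", "b", "c"}) contains more than one element here, it defines a subproblem of the first kind; proj_value("a"), which contains two examples, defines one of the second kind.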
The theorem is proved in [30, 32]. To decompose the GMT inference into subproblems of the first and second kinds, it is convenient to choose essential values in some example and essential examples in some projection. The rules of constructing DTs as elements of the dual lattice are summarized in Table 3.

Table 3. Deductive first-type rules formed by inductive rules of test inference

  Inductive rules                 Actions                            Deductive first-type rules inferred
  Rule of generalization          Extending s (narrowing t)          Implications
  Rule of specialization          Extending t (narrowing s)          Implications
  Inductive diagnostic rule       Searching for essential values     Deductive diagnostic rules
  Dual inductive diagnostic rule  Searching for essential examples   Compatibility rules (approximated implications)

We give an example of inferring all GMTs for the examples of class 1 given in Table 4.

Table 4. Classification of examples

  Example index   Height   Hair color   Eye color   Example class
  1               Low      Blond        Bleu        1
  2               Low      Brown        Bleu        1
  3               Tall     Brown        Embrown     1
  4               Tall     Blond        Embrown     2
  5               Tall     Brown        Bleu        2
  6               Low      Blond        Embrown     2
  7               Tall     Red          Bleu        2
  8               Tall     Blond        Bleu        2

Table 4 shows that s(+) = {1,2,3}, s(Low) = {1,2,6}, s(Brown) = {2,3,5}, s(Bleu) = {1,2,5,7,8}, s(Tall) = {3,4,5,7,8}, s(Embrown) = {3,4,6}, and s(Blond) = {1,4,6,8}. We find that the value 'Low' is essential in examples 1 and 2. Then, it is reasonable to form the subproblem of the second kind as shown in Table 5.

Table 5. Subproblem for the value 'Low'

  Example index   Height   Hair color   Eye color   Example class
  1               Low      Blond        Bleu        1
  2               Low      Brown        Bleu        1
  6               Low      Blond        Embrown     2

Table 5 gives s(+) = {1,2}, s(Low) = {1,2,6}, s(Brown) = {2}, s(Bleu) = {1,2}, and s(Blond) = {1,6}. We find that s(Bleu) = {1,2} ⊆ s(+). This means that the collection of values 'Low Bleu' is a test for the examples of class 1. Similarly, for the value 'Brown', we have s(Brown) = {2} ⊆ s(+). Hence, the collection of values 'Low Brown' is a test for the examples of class 1; however, it is not a good test since s(Brown) = {2} ⊆ s(Bleu). It is obvious that these values cannot belong to any GMT different from the already obtained tests. We remove 'Brown' and 'Bleu' from further consideration in this subproblem; but once they are removed, examples 1 and 2 stop being tests for examples of class 1. Hence, the subproblem is finished. We return to Table 4. Now, we can remove the value 'Low' from further analysis since we have obtained all GMTs for class 1 that include this value. However, we know that the value 'Low' is essential for examples 1 and 2, which means these examples stop being tests once this value is removed. At the next step, we can infer all irredundant tests for class 1 that cover only one example, example 3, and are
included in this example. This search yields the collection of values 'Brown Embrown'.

The recursive procedure based on using attribute subproblems for GMT inference can be found in [30]. In the next section, we give the algorithm based on the subproblem of the first kind combined with the search for essential examples. This algorithm constructs GMTs only.

4.1. Algorithm for Deriving GMTs Based on the Subproblem of the First Kind

The DIAGaRa algorithm is the main recursive procedure for solving subproblems of the first kind. The projection of a positive example onto the current set of positive examples is the initial information for searching for all GMTs included in this example. It is essential that the projection is just a subset of examples given on the bounded subset of values t*. Let s* be the subset of indices of positive examples included in the projection. We introduce the characteristic W(t) for any collection of values t, called the weight of t in the projection: W(t) = ||s(t) ∩ s*|| is the number of positive examples of the projection that include t. Let WMIN be the minimum admissible value of the weight, and let STGOOD be a partially ordered set of elements s such that t(s) is a GMT for R(+). The main algorithm performs the following steps.

S t e p 1. We check whether the intersection of all examples (elements) of the projection is a test and, if so, we store s* in STGOOD provided that s* corresponds to a good test at the current instant. In this case, the subproblem is solved. Otherwise, we perform the next step (we use the function to_be_test(t): if s(t) ∩ s(+) = s(t) (s(t) ⊆ s(+)), then true; otherwise, false).

S t e p 2. For each value A in the projection, we form the set splus(A) = {s* ∩ s(A)} and the weight W(A) = ||splus(A)||; if the weight is less than the minimum admissible weight WMIN, we remove the value A from the projection.
We can also remove the value A if W(A) equals WMIN and t(splus(A)) is not a test; in this case, A cannot appear in any GMT t with W(t) equal to or greater than WMIN.

S t e p 3. We perform the generalization operation t' = t(splus(A)), A ∈ t*; if t' is a test, we remove the value A from the projection and store splus(A) in STGOOD if splus(A) corresponds to a good test at the current instant.

S t e p 4. We can remove the value A from the projection if splus(A) ⊆ s' for some s' ∈ STGOOD.

S t e p 5. If we removed at least one value from the projection, the projection should be reduced. The reduction implies removing elements of the projection that are no longer tests (as a result of the previous elimination of values). If at least one element left the projection in the course of the reduction process, we repeat Steps 2–5.
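Steps 2 and 5 (weight-based pruning followed by reduction) can be sketched as follows (a toy projection; the dict encoding, the data, and WMIN = 2 are our assumptions):

```python
# A toy projection: index -> example restricted to the projection's values.
WMIN = 2
R_MINUS = [{"x", "y"}]
projection = {1: {"a", "b", "x"}, 2: {"a", "b", "y"}, 3: {"a", "c"}}

def is_test(t):
    return bool(t) and not any(t <= neg for neg in R_MINUS)

changed = True
while changed:
    changed = False
    # Step 2: remove values whose weight W(A) falls below WMIN.
    for A in sorted(set().union(*projection.values())):
        splus = {i for i, ex in projection.items() if A in ex}
        if splus and len(splus) < WMIN:
            for i in splus:
                projection[i] = projection[i] - {A}
            changed = True
    # Step 5: reduction - drop projection elements that are no longer tests.
    for i in [i for i, ex in projection.items() if not is_test(ex)]:
        del projection[i]
        changed = True
```

On this data the values x, y, and c are pruned (each has weight 1 < WMIN), and every remaining element of the projection still corresponds to a test, so the reduction removes nothing.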
S t e p 6. We check whether the subproblem is finished. The subproblem is finished when either the projection is empty or the intersection of all its elements is a test (see Step 1). If the subproblem is not solved, we choose an essential example in the projection and use it to form a new subproblem. New subsets s* and t* are constructed, and the main algorithm works at a new recursion level.

Forming the partially ordered set STGOOD is an important part of the algorithm. An approach to forming the set STGOOD. Let L(S) be the set of all subsets of the set S; L(S) is a lattice of sets [54]. Elements in the lattice of sets are arranged according to the relation of set-theoretic inclusion. We say that the subset s1 is absorbed by the subset s2, i.e., s1 ≤ s2, if and only if the inclusion relation s1 ⊆ s2 is fulfilled for them. When we form STGOOD, we store the collection of indices s in it if and only if it is not absorbed by any element of this set. We also need to remove from STGOOD all elements absorbed by s if s is stored in STGOOD. Thus, once the algorithm is finished, the set STGOOD includes all collections of indices that correspond to GMTs for the given set of positive examples and nothing but such collections. It is essential that STGOOD is formed by the incremental search for all maximal elements of the partially ordered set. The set TGOOD of all GMTs can be obtained as follows: TGOOD = {t: t = t(s), (∀s) (s ∈ STGOOD)}. We give an example of how the DIAGaRa algorithm operates in the Appendix.

4.2. Approach to Incremental Inference of Good Diagnostic Tests

We need the incremental process when new portions of examples appear occasionally or examples appear sequentially in time. Suppose each new example comes along with the indication of its belonging to some particular class.
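The absorption-based maintenance of STGOOD described in Section 4.1 can be sketched as follows (the function name is illustrative):

```python
def store_in_stgood(stgood, s):
    """Insert the index collection s, keeping only maximal (unabsorbed) elements."""
    if any(s <= kept for kept in stgood):
        return stgood                            # s is absorbed: discard it
    return [kept for kept in stgood if not kept <= s] + [s]

stgood = []
for s in [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 4})]:
    stgood = store_in_stgood(stgood, s)
```

The final stgood holds {2, 3} and {1, 2, 4}; the earlier {1, 2} was removed once {1, 2, 4} absorbed it, so only maximal collections survive.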
Then, the following actions should be performed once a new example appears:
(1) if possible, generalize the existing tests for the class to which the new example belongs (the class of positive examples), i.e., extend the set of examples covered by the existing tests (if the new example includes these tests, i.e., these collections of values);
(2) infer new tests included in the new example;
(3) check the validity of the existing tests for negative examples and, if necessary, correct the tests that turned out to be invalid (a test for negative examples is not valid if it is included in the new positive example, i.e., it treats a positive example as if it were a negative example).
The incremental process of inferring tests is decomposed into subproblems that are matched with three acts of reasoning: recognizing patterns, or using the existing rules (tests) to determine the class of belonging of the new positive example, and generalization of
the rules that take part in recognition (using implications, or deductive second-type rules of reasoning, and increasing the degree of belief of the inferred knowledge); inferring extra rules (tests) generated by the new positive example, which corresponds to forming new bonds with different groups of positive examples (inductive inference of new knowledge); and correcting rules (tests) for alternative (negative) classes that treat the new positive example as a negative example (these rules do not allow distinguishing the new positive example from some negative examples), based on deductive and inductive diagnostic procedures.

The first act determines the existing first-type rules met by the new example; the inductive base of these rules is extended. It is done by implications, or deductive second-type rules. The second act requires solving the subproblem of inferring tests of the first kind. The third act can be reduced to applying the inductive diagnostic rule and then forming and solving subproblems of inferring tests of the first kind. This act can also be reduced to forming and solving subproblems of inferring tests of the second kind.

Thus, decomposing the inference of tests into subproblems allows singling out fragments of knowledge relevant to newly arrived examples in the course of step-by-step reasoning and solving local problems of knowledge modification, both from the point of confirming the existing knowledge and from the point of developing the structure of knowledge or correcting it.

Table 6. Set of positive examples R(+)

  1:  A1 A2 A5 A6 A21 A23 A24 A26
  2:  A4 A7 A8 A9 A12 A14 A15 A22 A23 A24 A26
  3:  A3 A4 A7 A12 A13 A14 A15 A18 A19 A24 A26
  4:  A1 A4 A5 A6 A7 A12 A14 A15 A16 A20 A21 A24 A26
  5:  A2 A6 A23 A24
  6:  A7 A20 A21 A26
  7:  A3 A4 A5 A6 A12 A14 A15 A20 A22 A24 A26
  8:  A3 A6 A7 A8 A9 A13 A14 A15 A19 A20 A21 A22
  9:  A16 A18 A19 A20 A21 A22 A26
  10: A2 A3 A4 A5 A6 A8 A9 A13 A18 A20 A21 A26
  11: A1 A2 A3 A7 A19 A20 A21 A22 A26
  12: A2 A3 A16 A20 A21 A23 A24 A26
  13: A1 A4 A18 A19 A23 A26
  14: A23 A24 A26

CONCLUSIONS

In this work, we aimed to show that one can transform a sufficiently large class of machine learning
problems into problems of plausible inference based on well-known deductive and inductive logical rules of reasoning. We considered learning problems within a single approach that treats them as problems of classification approximation. Earlier, we showed that many machine learning problems, such as inferring functional, implicational, and associative dependencies from data, constructing decision classification trees from examples, etc., can be reduced to the problem of inferring GDTs for a given partition of the set of learning examples into classes. The problem of inferring GDTs serves as an ideal model for inductive reasoning since its very statement requires applying the inductive canons discovered in our thinking and stated by the British logician John Stuart Mill.

We relied upon the theory of algebraic lattices as the mathematical language for constructing GDTs. The DT was defined as a dual object, i.e., an element of the conceptual lattice considered in formal concept analysis. Inferring chains of elements of the lattice ordered by the inclusion relation is the basis for generating all types of DTs. We studied four variants of inductive transition from one element of a chain to its nearest elements in the lattice. We constructed rules for implementing these transitions: the rules of generalization and specialization, the inductive diagnostic rule, and the dual inductive diagnostic rule. To show that the constructed rules of inductive transitions can be reduced to processes of plausible reasoning, we needed to consider plausible reasoning (inductive and deductive) as an integral system of interconnected rules. We distinguished two types of rules of plausible reasoning. First-type rules are implicational logical statements that help describe observable objects. Second-type rules (rules of reasoning) are used to apply and modify first-type rules or to infer them from existing data and already made conclusions. The method of unique agreement, the method of unique distinction, and the joint method of agreement and distinction represent inductive second-type rules. Deductive second-type rules are based on rules of reasoning that use implications (modus ponens, modus ponendo tollens, modus tollendo ponens, and modus tollens).

Having analyzed the algorithms of constructing elements of the conceptual lattice, we could show that the rules of inductive transitions implement reasoning involving both inductive and deductive second-type rules. In the course of generating the conceptual lattice, new first-type rules (implications, inhibition and compatibility rules) are formed, which are immediately involved in the inference to help direct and control the lattice construction. We also used other inductive methods of constructing algorithms, viz. decomposition of the GDT inference into subproblems. Decomposition allowed for a principal transformation of the processes of inferring good tests into incremental processes of reasoning. The DIAGaRa algorithm
serves as an example of a step-by-step algorithm for inferring all GMTs from data.

APPENDIX

Example of the operation of the DIAGaRa algorithm. Initial data are given in Table 6 (the set of positive examples) and Table 7 (the set of negative examples). We start with s* = S(+) = {1, 2, …, 14}, t* = T = {A1, A2, …, A26}, and SPLUS = {splus(Ai): Ai ∈ t*} (the set SPLUS is given in Table 8).

Table 7. Set of negative examples R(–)

  A3 A8 A16 A23 A24
  A7 A8 A9 A16 A18
  A1 A21 A22 A24 A26
  A1 A7 A8 A9 A13 A16
  A2 A6 A7 A9 A21 A23
  A10 A19 A20 A21 A22 A24
  A1 A20 A21 A22 A23 A24
  A1 A3 A6 A7 A9 A16
  A2 A6 A8 A9 A14 A15 A16
  A1 A4 A5 A6 A7 A8 A16
  A7 A13 A19 A20 A22 A26
  A1 A2 A3 A6 A7 A16
  A1 A2 A3 A5 A6 A13 A16
  A1 A3 A7 A13 A19 A21
  A1 A4 A5 A6 A7 A8 A13 A16
  A1 A2 A3 A6 A12 A14 A15 A16
  A1 A2 A5 A6 A14 A15 A16 A26
  A1 A2 A3 A7 A9 A13 A18
  A1 A5 A6 A8 A9 A19 A20 A22
  A2 A8 A9 A18 A20 A21 A22 A23 A26
  A1 A2 A4 A5 A6 A7 A9 A13 A16
  A1 A2 A6 A7 A8 A13 A16 A18
  A1 A2 A3 A4 A5 A6 A7 A12 A14 A15 A16
  A1 A2 A3 A4 A5 A6 A9 A12 A13 A16
  A1 A2 A3 A4 A5 A6 A14 A15 A19 A20 A23 A26
  A2 A3 A4 A5 A6 A7 A12 A13 A14 A15 A16
  A2 A3 A4 A5 A6 A7 A9 A12 A13 A14 A15 A19
  A1 A2 A3 A4 A5 A6 A12 A16 A18 A19 A20 A21 A26
  A4 A5 A6 A7 A8 A9 A12 A13 A14 A15 A16
  A3 A4 A5 A6 A8 A9 A12 A13 A14 A15 A18 A19
  A1 A2 A3 A4 A5 A6 A7 A8 A9 A12 A13 A14 A15
  A1 A3 A4 A5 A6 A7 A12 A13 A14 A15 A16 A23 A24
  A1 A2 A3 A4 A5 A6 A8 A9 A12 A14 A16 A18 A22
  A2 A8 A9 A12 A14 A15 A16

Table 8. The set SPLUS of collections splus(A) for all A in Tables 6 and 7, SPLUS = {splus(Ai) = s(Ai) ∩ s(+): Ai ∈ T}

  splus(A*)  = {2, 8, 10}          splus(A22) = {2, 7, 8, 9, 11}
  splus(A13) = {3, 8, 10}          splus(A23) = {1, 2, 5, 12, 13, 14}
  splus(A16) = {4, 9, 12}          splus(A3)  = {3, 7, 8, 10, 11, 12}
  splus(A1)  = {1, 4, 11, 13}      splus(A4)  = {2, 3, 4, 7, 10, 13}
  splus(A5)  = {1, 4, 7, 10}       splus(A6)  = {1, 4, 5, 7, 8, 10}
  splus(A12) = {2, 3, 4, 7}        splus(A7)  = {2, 3, 4, 6, 8, 11}
  splus(A18) = {3, 9, 10, 13}      splus(A24) = {1, 2, 3, 4, 5, 7, 12, 14}
  splus(A2)  = {1, 5, 10, 11, 12}  splus(A20) = {4, 6, 7, 8, 9, 10, 11, 12}
  splus(A+)  = {2, 3, 4, 7, 8}     splus(A21) = {1, 4, 6, 8, 9, 10, 11, 12}
  splus(A19) = {3, 8, 9, 11, 13}   splus(A26) = {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13, 14}

In Tables 8 and 9, A* stands for the collection of values {A8, A9}, and A+ combines the values {A14, A15}, since splus(A8) = splus(A9) and splus(A14) = splus(A15). The DIAGaRa algorithm finds all good maximally redundant tests whose weight is equal to or greater than WMIN = 4 for the set of positive examples given in Table 6.

First of all, note that splus(A12) = {2,3,4,7} and t({2,3,4,7}) is a test; hence, we can remove A12 from t* and put splus(A12) into STGOOD. Then, W(A*), W(A13), and W(A16) are less than WMIN; hence, we can remove A*, A13, and A16 from t*. Now, the example t10 is no longer a test and can be removed. Having modified splus(A) for A5, A18, A2, A3, A4, A6, A20, A21, and A26, we find that W(A5) = 3; hence, we remove A5 from t*. Then, W(A18) turns out to be less than WMIN, and we remove A18, which results in eliminating t13. Then, we modify splus(A) for A1, A19, A23, A4, and A26 and find that splus(A4) = {2,3,4,7}; A4 is removed from t*. Finally, W(A1) is less than WMIN,
Table 9. The sets STGOOD and TGOOD for examples from Tables 6 and 7

  No.   STGOOD            TGOOD
  1     {2, 3, 4, 7}      A4 A12 A+ A24 A26
  2     {1, 2, 12, 14}    A23 A24 A26
  3     {4, 6, 8, 11}     A7 A20 A21
and we remove A1. We also remove the values A2 and A19 since W(A2) = W(A19) = 4, t(splus(A2)) and t(splus(A19)) are not tests, and, hence, these values will not appear in any maximally redundant test with a weight equal to or greater than 4. Having removed these values, we can remove the examples t9 and t5 since A19 is the essential value in t9 and A2 is the essential value in t5. Then, it turns out that splus(A23) = {1,2,12,14} and t({1,2,12,14}) is a test; thus, A23 is removed from t* and splus(A23) is put into STGOOD. We treat A22 and A6 similarly since W(A22) and W(A6) are now 4, t(splus(A22)) and t(splus(A6)) are not tests, and these values will not appear in any maximally redundant test with a weight equal to or greater than 4. Now, t14 and t1 are not tests, and we can remove them.

We choose the example t12 to form a subproblem since this example is now essential in splus(A21) and splus(A24). Having solved this subproblem, we find that t12 includes no new test and, hence, we do not consider this example in what follows. Now, splus(A21) is {4,6,8,11} and t({4,6,8,11}) is a test; hence, we remove A21 from t* and put splus(A21) into STGOOD. We can also remove the value A24 since t(splus(A24)) is the known (obtained earlier) good maximally redundant test. We remove the value A3 since W(A3) is now 4, t(splus(A3)) is not a test, and this value will not appear in any maximally redundant test with a weight equal to or greater than 4. We remove the example t6 since it is no longer a test, and the value A20 since t(splus(A20)) is the good maximally redundant test obtained earlier. These removals mean that all the remaining rows (examples) t2, t3, t4, t7, t8, and t11 are no longer tests. Table 9 lists the obtained good maximally redundant tests with the weight equal to WMIN = 4.

REFERENCES

1. S. Muggleton and L. De Raedt, "Inductive Logic Programming: Theory and Methods," Journal of Logic Programming 19/20, 629–679 (1994).
2. N. Lavrac, D.
Gamberger, and V. Jovanoski, “A Study of Relevance for Learning in Deductive Databases,” Journal of Logic Programming 40 ((2/3)), 215–249 (1999). 3. N. Lavrac and P. Flash, An Extended Transformation Approach to Inductive Logic Programming, CSTR00 002 (University of Bristol, Department of Computer Science, Bristol, 2000).
4. F. Lisi and D. Malerba, "Inducing Multi-Level Association Rules from Multiple Relations," Machine Learning 55, 175–210 (2004).
5. M. Schmidt-Schauss and G. Smolka, "Attributive Concept Descriptions with Complements," Artif. Intell. 48 (1), 1–26 (1991).
6. S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases (Springer, 1990).
7. C. Giraud-Carrier and T. Martinez, "An Incremental Learning Model for Commonsense Reasoning," in Proceedings of the 7th International Symposium on Artificial Intelligence, ISAI'94, 1994, pp. 134–141.
8. V. K. Finn, "Inductive Models of Knowledge Representation in Man–Machine and Robotic Systems," in Proceedings of VINITI A, 58–76 (1984).
9. V. K. Finn, "Plausible Conclusions and Plausible Reasoning," Itogi Nauki i Tekhniki, Ser. Probability Theory, Mathematical Statistics, and Technical Cybernetics 28, 3–84 (1988).
10. V. K. Finn, "Plausible Reasoning in Intelligent Systems of JSM Type," Itogi Nauki i Tekhniki, Ser. Informatika 15, 54–101 (1991).
11. V. K. Finn, "Synthesis of Cognitive Procedures and the Induction Problem," NTI, Ser. 2, Nos. 1, 2, 8–44 (1999).
12. M. I. Zabezhailo, V. G. Ivashko, S. O. Kuznetsov, et al., "Algorithmic and Programming Tools of the JSM Method of Automatic Hypotheses Generation," NTI, Ser. 2, No. 10, 1–14 (1987).
13. J. S. Mill, A System of Syllogistic and Inductive Logic (Knizhnoe Delo, Moscow, 1900) [in Russian].
14. C. S. Peirce, Philosophical Writings of Peirce, Ed. by J. Buchler (Dover Publications, New York, 1995), pp. 150–156.
15. D. V. Vinogradov, "Logical Programs for Quasi-Axiomatic Theories," NTI, Ser. 2, Nos. 1, 2, 61–64 (1999).
16. B. A. Galitsky, S. O. Kuznetsov, and D. V. Vinogradov, JASMINE: A Hybrid Reasoning Tool for Discovering Causal Links in Biological Data, http://www.dcs.bbk.ac.uk/~galitsky/Jasmine.
17. O. M. Anshakov, D. P. Skvortsov, and V. K. Finn, "Logical Tools of the JSM Method of Automatic Generation of Hypotheses: A System of Rules and Its Deductive Simulation," NTI, Ser. 2, No. 11, 21–30 (1987).
18. K. A. Naidenova and Yu. G. Polezhaeva, "An Algorithm for Constructing Good Diagnostic Tests," in Proceedings of the 4th All-Union Conference on Application of Methods of Mathematical Logic, Tallinn, Estonia, 1986, pp. 63–67.
19. K. A. Naidenova, "Reduction of Machine Learning Problems to Approximation of a Given Classification on the Set of Examples," in Proceedings of the 5th National Conference with International Participation on Artificial Intelligence (Kazan, Tatarstan, 1996), Vol. 1, pp. 275–279 [in Russian].
20. J.-G. Ganascia, "EKAW-89 Tutorial Notes: Machine Learning," in Proceedings of the 3rd European Workshop on Knowledge Acquisition for Knowledge-Based Systems, Paris, France, 1989, pp. 287–296.
21. C. Carpineto and G. Romano, "A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval," Machine Learning 24, 95–122 (1996).
22. J. Demetrovics and D. T. Vu, "Generating Armstrong Relation Schemes and Inferring Functional Dependencies from Relations," Int. J. Information Theory & Applications 1 (4), 3–12 (1993).
23. H. Mannila and K. J. Raiha, "On the Complexity of Inferring Functional Dependencies," Discrete Applied Mathematics 40, 237–243 (1992).
24. H. Mannila and K. J. Raiha, "Algorithms for Inferring Functional Dependencies," Data & Knowledge Engineering 12, 83–99 (1994).
25. Y. Huhtala, J. Karkkainen, P. Porkka, et al., "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies," The Computer J. 42, 100–111 (1999).
26. S. Cosmadakis, P. Kanellakis, and N. Spyratos, "Partition Semantics for Relations," J. Computer and System Sciences 33 (2), 203–233 (1986).
27. I. A. Megretskaya, "Construction of Natural Classifying Tests in Generation of Knowledge Bases," in Problems of Application of Expert Systems in the National Economy (Chisinau, Moldova, 1988), pp. 89–93 [in Russian].
28. K. A. Naidenova, "Machine Learning as a Diagnostic Task," in Proceedings of the Short-Term Scientific Workshop on Knowledge-Dialogue-Solution, Ed. by I. Arefiev, St. Petersburg, Russia, 26–36 (1992).
29. K. A. Naidenova, Yu. G. Polezhaeva, and Yu. G. Iserlis, "The Concept of Knowledge Eliciting Based on the Best Diagnostic Tests," in Proceedings of the International Conference on Knowledge-Dialogue-Solution, Yalta, Ukraine, 1995, Vol. 1, pp. 85–95.
30. K. A. Naidenova, M. V. Plaksin, and V. L. Shagalov, "Inductive Inferring All Good Classification Tests," in Proceedings of the International Conference on Knowledge-Dialogue-Solution, Yalta, Ukraine, 1995, Vol. 1, pp. 79–84.
31. K. A. Naidenova, "DIAGARA: An Incremental Algorithm for Inferring Implicative Rules from Examples," Int. J. Information Theories and Applications 12 (2), 171–186 (2005).
32. K. A. Naidenova, "An Incremental Learning Algorithm for Inferring Implicative Rules from Examples," in Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, Ed. by E. Triantaphyllou and G. Felici (Springer, New York, 2006), pp. 89–147.
33. G. Stumme, R. Taouil, Y. Bastide, et al., "Fast Computation of Concept Lattices Using Data Mining Techniques," in Proceedings of the 7th International Workshop on Knowledge Representation Meets Databases, KRDB'2000, pp. 129–139.
34. E. Mephu Nguifo and P. Njiwoua, "Using Lattice-Based Framework as a Tool for Feature Extraction," in Feature Extraction, Construction, and Selection: A Data Mining Perspective, Ed. by H. Liu and H. Motoda (Kluwer, 1998).
35. R. Wille, "Concept Lattices and Conceptual Knowledge Systems," Computer Math. Appl. 23 (6–9), 493–515 (1992).
36. G. Stumme, R. Taouil, Y. Bastide, et al., "Fast Computation of Concept Lattices Using Data Mining Techniques," in Proceedings of the 7th International Workshop on Knowledge Representation Meets Databases, KRDB'2000, pp. 129–139.
37. C. E. Dowling, "On the Irredundant Generation of Knowledge Spaces," J. Math. Psych. 37 (1), 49–62 (1993).
38. S. Salzberg, "A Nearest Hyperrectangle Learning Method," Machine Learning 6, 277–309 (1991).
39. B. Ganter, Two Basic Algorithms in Concept Analysis, FB4-Preprint No. 831 (TH Darmstadt, 1984).
40. L. Nourine and O. Raynaud, "A Fast Algorithm for Building Lattices," Information Processing Letters 71, 199–204 (1999).
41. S. O. Kuznetsov, "A Fast Algorithm for Constructing All Intersections of Objects from a Finite Semi-Lattice," NTI, Ser. 2, No. 1, 17–20 (1993).
42. S. O. Kuznetsov and S. A. Obiedkov, "Comparing Performance of Algorithms for Generating Concept Lattices," J. Exp. Theor. Artif. Intell. 14 (2–3), 183–216 (2001).
43. O. Ore, "Galois Connections," Trans. Amer. Math. Society 55 (1), 493–513 (1944).
44. J. Riguet, "Relations Binaires, Fermetures, Correspondances de Galois," Bull. Soc. Math. France 76 (3), 114–155 (1948).
45. R. Wille, "Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts," in Ordered Sets (Reidel, Dordrecht–Boston, 1982), pp. 445–470.
46. J.-G. Ganascia, "CHARADE: A Rule Learning System," in Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI), Milan, Italy, 1987, pp. 345–347.
47. E. Sperner, "Ein Satz über Untermengen einer endlichen Menge," Math. Z. 27 (11), 544–548 (1928).
48. G. Polya, Mathematics and Plausible Reasoning (IL, Moscow, 1957) [in Russian].
49. R. S. Michalski, "A Theory and Methodology of Inductive Learning," Artif. Intell. 20, 111–161 (1983).
50. K. A. Naidenova and J. G. Polegaeva, "SISIF, the System of Knowledge Acquisition from Experimental Facts," in Proceedings of the 3rd International Conference on Industrial Applications of Artificial Intelligence, Ed. by L. James and L. I. Mikulich (North-Holland, Amsterdam, 1991), pp. 87–92.
51. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487–499.
52. R. S. Michalski and I. B. Larsen, "Selection of Most Representative Training Examples and Incremental Generation of VL1 Hypotheses: The Underlying Methodology and the Description of Programs ESEL and AQ11," Report No. 78-867, University of Illinois at Urbana-Champaign, IL, USA, 1978.
53. K. A. Naidenova and A. E. Ermakov, "Decomposition of the Inference Algorithm for Good Diagnostic Tests," in Proceedings of the 4th International Conference on Computer-Aided Design of Discrete Devices, CAD DD'2001, Minsk, Belarus, 2001, Vol. 3, pp. 61–69.
54. H. Rasiowa, An Algebraic Approach to Non-Classical Logics (North-Holland, Amsterdam–London, 1974).
JOURNAL OF COMPUTER AND SYSTEMS SCIENCES INTERNATIONAL
Vol. 48
No. 3
2009