APPLIED PROBLEMS
A Stochastic Approach for Association Rule Extraction A. A. Oliinyk and S. A. Subbotin Zaporizhzhya National Technical University, ul. Zhukovskogo 64, Zaporozhye, 69063 Ukraine email:
[email protected] Abstract—This paper addresses the problem of association rule extraction. To extract quantitative association rules from given sets of observations, a stochastic method is proposed. The developed method improves the reliability and interpretability of recognition models based on association rules, employs the stochastic approach to search through various combinations of sets of elements in association rules, and uses a priori information about the informativity of intervals of feature values. A system of criteria for estimating associa tion rules is developed that can be used to automate the analysis of properties and to compare various models based on association rules when solving pattern recognition problems. Keywords: association rules, confidence, extraction, feature interval, feature, pattern recognition, stochastic approach. DOI: 10.1134/S1054661816020139
INTRODUCTION When processing datasets that contain a lot of missing values or are represented in the form of trans action databases, in which all observations (transac tions) contain values of certain possible features of objects being recognized, it is reasonable to employ methods of association rule extraction [1, 2]. Such methods make it possible to find hidden relationships in data and to reduce the dimension of data, thus increasing the level of generalization and reducing the structural and parametrical complexity of models syn thesized based on these relationships. Such cases involve analyzing input data in which certain feature values or output parameter values may be undefined. Methods of association rule extraction yield a set A = { A1, A2,..., A N A } containing rules of the form Ar : Pr → Tr , where Pr is the antecedent (left part of the rth rule Ar ) that specifies the set of conditions for execution of the rule Ar , Tr is the consequent (right part of the rth rule Ar ) that defines the value of the out put parameter provided that the conditions Pr for the rule Ar are met, and N A = A is the number of extracted rules [3–6]. When generating frequently encountered sets in the process of rule synthesis, wellknown methods of association rule extraction (SCF, SETM, Apriori, DHP, Eclat, etc.) [1–4] use the antimonotonic prop erty of support (according to which the support of a set of elements does not exceed that of any of its subsets) or other procedures [2–6], which accounts for the fol lowing disadvantages of these methods: Received January 15, 2015
• such an approach makes it impossible to generate (in the process of search) rules Ar from feature sets that contain combinations with low values of support (rarely encountered combinations of features); • similarly to the greedy search, all possible combi nations with high values of support are analyzed, which, given a great number of features P in the initial set S, requires checking a large number of feature combinations Pr, which, in turn, involves performing a great many searches through the base S and requires a great deal of memory and time resources; • when using such an approach, only the rules based on frequently encountered sets are extracted, and promising rules Ar : Pr → Tr with high confidence conf ( Ar ) but low support supp ( Ar ) are skipped. This considerably reduces the approximation and generali zation capabilities of the recognition model synthe sized based on the extracted set A = { A1, A2,..., A N A } of association rules. Moreover, most of the methods for association rule extraction are designed to process binary data. How ever, most of the real pattern recognition problems involve numerical data processing when most of the features take values in a certain interval. The necessity for overcoming these disadvantages motivates the development of a new method for asso ciation rule extraction. The purpose of this work is to develop a stochastic method for association rule extraction.
1. FORMULATION OF THE PROBLEM Let a set of observations S = 〈P,T 〉 be defined, where P is the set of characteristics (features) of obser vations, T is the set of values of the output parameter,
ISSN 10546618, Pattern Recognition and Image Analysis, 2016, Vol. 26, No. 2, pp. 419–426. © Pleiades Publishing, Ltd., 2016.
420
χk:
OLIINYK, SUBBOTIN p1
p2
p3
p4
p5
pM
T
g1
g2
g3
g4
g5
gm
gNg
Δpn1 Δpn2 Δpn3 Δpn4 Δpn5
ΔpnM ΔTn
Representation of the structure χ k for the process of asso ciation rule extraction.
result, a set of clusters is obtained Clm = {Cl1m, Cl 2m,...Cl N intm}. When using cluster analysis methods that require specifying the number of clusters N int (for example, fuzzy cmeans clustering [5, 6]), like the parameter N int m , the number (reduced by a factor of NA) of exemplars N(pm) for which the value of the feature pm is defined can be used. The parameter NA is found by the formula M ⎛ ⎞ 1 (2) N A = ceil ⎜ N ( pm ) ⎟ , ⎜ M ⋅ N int mean ⎟ ⎝ ⎠ m =1 where N int mean is the expected mean of the partition intervals (clusters) for each feature pm (m = 1,2,..., M ) and ceil ( x) is the function that returns the integer part of x. The number N int mean should be small (it is recom mended to set N int mean = 10, N int mean Q ) to ensure low requirements for memory resources when carrying out the corresponding computations and, at the same time, to provide satisfactory partitioning of the feature values into intervals.
∑
pqm is the value of the mth attribute of the qth observa tion (m = 1, 2, ..., M , q = 1, 2, ..., Q ), tq is the value of the output parameter for the qth observation, M is the number of attributes, and Q is the number of observa tions. Thus, the problem of extracting quantitative asso ciation rules consists in constructing the rule base A = { A1, A2,..., A N A } in which the rules of the form Ar : Pr → Tr satisfy a certain value of a given quality criterion. As such a criterion, the confidence of the rule [1–4] can be used, which is defined as the ratio between the support of the rule supp ( Pk ∪ Tk ) and the support of its antecedent supp ( Pk ) :
supp ( Pk ∪ Tk ) , (1) supp ( Pk ) where supp ( Pk ∪ Tk ) is the support of the rule Ak : Pk → Tk , which is the ratio between the number of exemplars N ( Pk ∪ Tk ) in the sample S that contain sets of conditions Pk and are characterized by the value of the output parameter (Tk) and the number of exem plars N(Pk) in the sample S that meet conditions of the antecedent Pk of the rule Ak : Pk → Tk . conf ( Ak ) =
2. THE STOCHASTIC METHOD FOR ASSOCIATION RULE EXTRACTION In this paper, to extract association rules Ar : Pr → Tr from given numerical datasets S, the value ranges of features P are preliminarily partitioned into intervals taking into account the width of the value ranges and the accuracy with which features fall within each of the intervals; then, using the stochastic approach, the association rules Ar : Pr → Tr with a high level of confidence conf ( Ar ) are found. At the initial stage of the proposed stochastic method for extracting quantitative association rules, the values of features P are partitioned into intervals. For this purpose, cluster analysis is performed in the set of values pqm of each feature pm [7–11], or other methods of artificial intelligence are used [6, 11, 13– 16] to select groups of exemplars compactly arranged in the onedimensional space of each feature. As a
Next, based on the boundaries Cl n min m and Cl n max m (n = 1,2,..., N int m , m = 1,2,..., M ) of the selected clus ters, intervals of values (terms) [Cl n min m; Cl n max m) of the features pm are found. Then, Nχ solutions are generated for the stochastic search procedure. When extracting association rules, the solution χk is represented as a set of parameters χ k = {g1k , g 2k ,..., g N g k }, where gmk is the mth parameter of the solution, which contains information about the number of the interval Δpnm of the mth feature pm (or about its absence) in the kth association rule Ak : Pk → Tk (see figure). It can be seen that the solution χk also contains information about the interval ΔTn of the output parameter T (provided that it takes real values). If the output parameter T is binary, then the solution χk does not contain the last parameter g N g . For each kth solution χ k , the values of the mth parameter gm are found as follows. Random number rnd is generated on the interval [0;1]: rnd = rand[0;1], where rand[0;1] is the function that returns the number randomly generated on the interval [0;1]. If the gener ated number rnd does not exceed the frequency ν m of occurrence of the mth feature pm in the sample S (rnd ≤ ν m ), then the value of the parameter g m in the solution χk is assigned zero ( g mk = 0 ) to indicate that there is no mth feature in the kth association rule χ k . The frequency νm is found by the formula
νm =
N ( pm ) . Q
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 26
(3) No. 2
2016
A STOCHASTIC APPROACH FOR ASSOCIATION RULE EXTRACTION
If rnd > ν m , then integer random number rndc is generated on the interval [1; N int m]; rndc defines the number of the interval of the feature pm in the rule χk: rndc = randc[1; N int m]. If the condition
rnd ∈ randc[ν m ; ν m + ν nm (1 − ν m )]
(4)
(5)
holds, the parameter g m is assigned the value rndc: gm = rndc. Condition (5) implies that the nth interval Δpnm of the mth feature pm has a probability of inclusion into the kth rule χk that increases with increasing frequency ν nm of its occurrence in the observations of the sample S (which characterize the values of the mth feature):
N ( pnm ) . (6) N ( pm ) When the output parameter T takes a numerical value, the value of ν nm for the kth solution χk is found as a frequency of the occurrence of the interval Δ pnm in the observations of the sample S (which characterize the values of the mth feature), and the value of the out put parameter T is g M +1,k . If condition (5) does not hold, then it is believed that the nth interval Δ pnm of the feature pm cannot be included in the rule χ k ; so, random number rnd = rand[0;1] is generated again and condition (5) is checked; the process continues until the value of the parameter gmk is found. Similarly, M parameters gm are generated for all solutions Nχ of the initial population R(0) = ν nm =
(0) {χ1(0), χ (0) 2 ,..., χ N χ }, which is a set of association rules. In the case of a numerical (not binary) output parameter T of the sample S, the value of the parame ter g M +1,k , which defines the number of the interval for the output parameter in the rule χk, is generated ran domly depending on the frequency of occurrence of the intervals in the sample S. For this purpose, the probability that the value of the output parameter T for exemplars of the sample S falls within the nth interval Tn of its variation range is found by the formula
N (Tn ) ; (7) Q next, each interval Tn is made to correspond to the interval with ρi (Tn ) ∈ (Tn−1,max;Tn−1,max + ρ (Tn )] ρi (T1 ) ∈ [0; ρ (T1 )]. After that, random number rnd = rand[0; 1] is generated. The value of the parameter g M +1,k corresponds to the number of the interval ρi (Tn ) that includes random number rnd: g M +1,k = n, rnd ∈ ρi (Tn ) . ρ (Tn ) =
PATTERN RECOGNITION AND IMAGE ANALYSIS
421
Then, the confidence conf ( χ k ) of each solution is evaluated. For this purpose, the solutions χk are trans formed into association rules: χ k → Ak . The associa tion rule Ak is composed of nonzero parameters gmk of the solution χ k ; thus, we have the implication of the form
∩( p
(8) ∈ pnm ) → T . For real values of the output parameter T, the con sequence of implication (8) can be represented as T ∈ ΔT ( g M +1,k ), where ΔT ( g M +1,k ) is the interval Tn of the output parameter T, which is defined by the num ber specified in the last parameter g M +1,k of the solu tion χ k . Upon transforming χ k → Ak , the confidence conf ( Ak ) of the rules Ak is evaluated. Then, confidence association rules Ak (confidence conf ( Ak ) of which exceeds minconf: conf ( Ak ) ≥ min conf) are included into the set A. For a more detailed analysis of the association rules, in addition to confidence criterion (1), we propose some other crite ria that characterize the frequency of rule execution, as well as the informativity and interpretability (easi ness for a human to comprehend) of rules: • the support of the rule supp ( Ak ) allows estimat ing the execution frequency of rules in the sample S; this criterion is calculated as follows: (9) supp ( Ak ) = supp ( Pk ∪ Tk ) = N ( Pk ∪ Tk ) , N ( Pk ∪ Tk ) , (10) supp ( Ak ) = supp ( Pk ∪ Tk ) = Q where formula (9) defines the absolute support as the number of exemplars in the sample S for which the rule Ak : Pk → Tk is executed, while formula (10) defines the relative support as the ratio between N ( Pk ∪ Tk ) and the total number of exemplars Q in the sample S; • the general support of the rule suppG ( Ak ) takes into account not only the fulfillment frequency of pos itive conditions of rules Pk → Tk but also that of nega tive conditions Pk → Tk ; this criterion is calculated by the formula suppG ( Ak ) = supp ( Pk ∪ Tk ) + supp ( Pk ∪ Tk ) ; (11) the greater the value of this criterion, the more signif icant the rule Ak : Pk → Tk in the sample S, since a large number of simultaneous fulfillments (simulta neous nonfulfillments) of conditions in the antecedent Pk and consequence Tk is indicative of an essential connection between Pk and Tk ; • the general confidence of the rule suppG ( Ak ), similarly to suppG ( Ak ), takes into account the fulfill ment frequency of both positive and negative condi m
Vol. 26
No. 2
2016
422
OLIINYK, SUBBOTIN
tions of rules Ak : Pk → Tk ; this criterion is calculated by the formula
confG ( Ak ) = 1 (conf ( Pk → Tk ) + conf ( Pk → Tk )) 2 (12) ⎛ supp ( Pk ∪ Tk ) supp ( Pk ∪ Tk ) ⎞ 1 = ⎜ + ⎟; 2 ⎜⎝ supp ( Pk ) supp ( Pk ) ⎟⎠ • the complexity of the rule VS ( Ak ) is the ratio between the number of features (conditions) N ( Pk ) in the left part (antecedent) of the rule Ak : Pk → Tk and the total number of features M in the sample S: N ( Pk ) ; (13) M the greater the value of this criterion, the simpler (more interpretable) the rule Ak : Pk → Tk and the wider its coverage, which provides better generaliza tion of data. The more conditions N ( Pk ) that are in the antecedent (or the smaller the value of VS ( Ak ) ), the more specific the rule; • the index characterizing the maximum confi dence of the rule confG ( Ak ) and individual estimates of feature informativity VC ( Ak ) : VS ( Ak ) = 1 −
VC ( Ak ) = confG ( Ak )
∑
Vm ,
(14)
m: pm∈Pk
where V m is the estimate of individual significance of the mth feature pm in the antecedent Pk of the rule Ak : Pk → Tk . Note that, along with the estimate of confG ( Ak ), the confidence value conf ( Ak ) can also be used in formula (14); • the index characterizing the maximum confi dence of the rule confG ( Ak ) and minimum number of selected features VM ( Ak ) :
VM ( Ak ) =
1 confG A ; ( k) N ( Pk )
(15)
• the informativity of the rule VI ( Ak ) is a criterion that takes into account both the confidence (signifi cance) of the whole rule Ak : Pk → Tk and the individ ual informativity Vm of each feature pm in its anteced ent Pk. This criterion is calculated by the formula
VI ( Ak ) =
∑
1 confG A ( k) N ( Pk ) m: p
Vm .
(16)
m∈Pk
This criterion allows finding rules Ak : Pk → Tk with the maximum confidence confG ( Ak ) , maximum estimates of individual informativity Vm of features, and the minimum number of features pm in the ante cedent Pk of the rule.
The estimate of individual informativity Vm for the mth feature pm is defined as the sum of individual informativities Vnm of intervals of this feature: N int m
Vm =
∑V
nm ,
(17)
n =1
where Vnm stands for the estimate of informativity of the nth interval Δpnm for the mth feature pm , which, in turn, is found by the formula 1 , (18) V = nm
N int T
1+ e
− ∑ ρ( Δpnm,Tl ) log ρ( Δpnm,Tl ) l =1
N ( pnm,Tl ) is the conditional prob N ( pnm ) ability that the value of the output parameter T falls within the lth interval Tl on condition that the mth fea ture pm falls within the nth interval Δpnm; N ( pnm,Tl ) is the number of exemplars (in the sample S) for which the value of the output parameter T falls within the lth interval of its variation range Tl on the condition that the value of their mth feature falls within the nth inter val pmn; and N int (T ) is the number of intervals into which the value range of T is partitioned. The proposed system of criteria (9)–(16), which takes into account various characteristics of associa tion rules such as confidence, execution frequency, informativity, and interpretability, can be used to auto mate the analysis of properties and to compare between models based on association rules when solv ing problems of diagnostics, pattern recognition, and nondestructive testing. It should be noted that, to select rules Ak : Pk → Tk that are to be included into the set A = { A1, A2,..., A N A } (which is a base of synthesized association rules), one can use a single criteria (for example, the informativity VI ( Ak ) ) or a combination of several criteria from the proposed system (9)–(16). Moreover, estimates of these criteria can also be obtained on a test sample, which allows taking into account generalization prop erties of extracted rules. Upon estimating the quality of extracted rules χk ( Ak : Pk → Tk ), k = 1,2,..., N χ , and upon including the best ones into the set A = { A1, A2,..., A N A } , the search termination criteria are checked: reaching the maximum number of rules in the set A ( N A ≥ N A max ), exceeding the maximum number of iterations N It , and the impossibility (for a given number of iterations) of extracting the rules Ak : Pk → Tk with satisfactory values of the estimation criteria. If these search termination criteria are not met, new Nχ solutions χk are generated. A set RP (i) of solu tions χk eligible for generating a new set of solutions where ρ ( Δpnm,Tl ) =
R(i +1) is constructed. The most fitted structures χk from
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 26
No. 2
2016
A STOCHASTIC APPROACH FOR ASSOCIATION RULE EXTRACTION
{
}
the set of solutions R (i) = χ1(i), χ(2i),..., χ(Ni)χ (which are obtained at the ith iteration of search) are included (depending on the values of the criteria for estimating the association rules Ak ) into the set RP (i). Then, based on two solutions χparant1 = {g1 parent1, g 2 parent1,..., g N g parent1} ∈ RP (i) and χparant2 =
{g1 parent 2, g 2 parent 2,..., g N g parent 2} ∈ RP (i) , a new solution is generated χ child . The parameters g mchild of the child χ child are found by the formula ⎧ ⎡ ⎞ Vmparent1 ⎟; ⎪ g mparent1, rnd ∈ ⎢0; ⎪ ⎣ Vmparent1 + Vmparent 2 ⎠ (19) g mchild = ⎨ ⎡ ⎤ V 1 mparent ⎪g , rnd ∈ ⎢ ;1⎥ , ⎪ mparent 2 ⎩ ⎣Vmparent1 + Vmparent 2 ⎦ where V mparent1 and V mparent 2 are the informativity char acteristics of the g mparent1 th and g mparent 2 th intervals of the mth feature, respectively, and rnd = rand[0;1]. Formula (19) enhances the probability of including the parameters gmk, which correspond to the intervals Δ pnm of the features with high estimates of individual informativity Vnm, into the new solution χchild; it can be seen from formula (19) that the higher the relative informativity V mparent1 (V mparent1 + V mparent 2 ) of the g mparent1 th interval associated with the mth feature of the first parent χ parent1, the higher the probability that the parameter g mchild is formed based on the corre sponding parameter g mparent1 of the first parent. Thus, N cross = β N χ solutions of the form χ(ki +1) = (i +1) (i +1) (i +1) {g1k , g2k ,..., g N gk } are generated, where β is the parameter that characterizes the significance of con structing a new set of solutions R(i +1) according to the crossover procedure described above. Then, N mutation = γ N χ solutions are generated by applying the mutation operator, where γ is the param eter that characterizes the significance of constructing a new set of solutions R(i + 1) according to the mutation procedure. The solution χparent, in which values of cer tain parameters g mparent determining the number of the interval Δpnm of the mth feature pm in the correspond ing association rule are varied to obtain a new solution χchild, is randomly selected from the set RP (i). New val ues of the modified parameters g mchild are found sto chastically taking into account estimates of individual significances Vnm of intervals Δpnm of the correspond ing feature. Each interval Δpnm (n = 1,2,..., N int m ) of the mth feature pm is made to correspond to the interval gI ( Δ pnm ) ∈ [ gI min ( Δpnm ) ; gI max ( Δpnm )) , where PATTERN RECOGNITION AND IMAGE ANALYSIS
423
gI min ( Δpnm ) = gI max ( Δpn−1,m ) is the minimum value of the interval = gI ( Δ p1m ); gI max ( Δpnm ) gI min ( Δpnm ) + VN nm is the maximum value of the interval gI ( Δ pnm ) ; VNnm is the normalized estimate of individual significance V nm of the interval Δpnm, which is obtained by the formula max Vnm − Vnm n =1,2,...,N int m ; (20) VN nm = max Vnm − min Vnm n =1,2,...,N int m
n =1,2,...,N int m
and gI min ( Δp1m ) = 0 is the minimum value in the interval gI ( Δ p1m ) of the first interval Δ p1m for the fea ture pm. Thus, the greater the value of Vnm (VNnm), the wider the value range gI ( Δ pnm ) of the interval Δpnm. Upon determining the boundaries of the intervals gI ( Δ pnm ) (n = 1,2,..., N int m , m = 1,2,..., M ), a random number is generated rnd = rand [0;1) . New values for the parameters g mchild of the solution χparent selected for mutation correspond to the number n of the interval gI ( Δ pnm ) within which the number rnd falls: (21) g mchild = n, rnd ∈ gI ( Δpnm ). Therefore, the wider the range gI ( Δ pnm ) , the higher the probability for the interval to be included into the solution χchild. In addition to the N cross = β N χ and N mutation = γ N χ solutions obtained by the crossover and mutation pro cedures, the N elite = α N χ most fitted solutions χ(ki) ∈ R (i) (α is the parameter that characterizes the sig nificance of including the best solutions into a new set R(i +1) ) are also included into the set R(i +1) ; these solu tions have the best values of the estimation criteria for association rules Ak in the population R (i) . Then, the confidence conf ( χ k ) and other criteria for estimating the solutions χ k from the new popula tion R(i + 1) are calculated, and the best solutions χk are included into the set A = { A1, A2,..., A N A } ; thus, a new set of solutions R(i + 2) is constructed until the stochas tic search is terminated. The stochastic search yields the set A = { A1, A2,..., AN A } of association rules Ak : Pk → Tk with satisfactory values of the given estimation criteria. The proposed stochastic method for extracting quantitative association rules implies that feature val ues are preliminarily partitioned into intervals taking into account the width of the value range and the fre quency with which features fall within each of these intervals. The proposed method employs a probabilis tic approach to search through various combinations of antecedents and consequences of association rules and uses a priori information about the significance of intervals and features, which makes it possible to pro Vol. 26
No. 2
2016
424
OLIINYK, SUBBOTIN
Experimental results on extraction of association rules Method
supp, % conf, % confG, %
FARM [17] FWARM [18] Method for association rule synthesis consid ering significance of features (MCSF) [5] Stochastic method for extracting quantitative association rules Method of evolutionary search (MES) [15] Simple genetic algorithm (SGA) [6, 12, 16]
NA
N(Pk)
VS
VC
VM
VI 6.63 7.73 9.40
6.2 5.4 4.7
82.3 87.7 91.2
73.7 77.5 82.1
121 87 82
6.78 6.32 6.46
0.52 0.55 0.54
3.05 3.09 3.92
10.87 12.26 12.71
4.2
90.1
89.2
133
5.06
0.64
3.61
17.63 14.10
5.7 5.6
88.9 88.3
85.7 87.2
114 108
5.51 5.18
0.61 0.63
3.59 3.48
15.55 11.82 16.83 12.96
cess numerical information when extracting associa tion rules, to avoid a large number of searches through a given base of observations, and to find rules with high values of confidence and other quality criteria. 3. EXPERIMENTS AND RESULTS Let us investigate experimentally the proposed sto chastic method for extracting quantitative association rules by comparing it with some wellknown methods for association rule extraction: FARM [17], FWARM [18], and the method for association rule synthesis considering the significance of features (MCSF) that was proposed in [5]. It should be noted that, in this experimental investigation, we consider problems of rule extraction from bases of numerical observations, which makes it difficult to apply some wellknown methods (Apriori, SETM, etc.), since they are designed to extract association rules from binary data. To extract association rules from given bases of obser vations S = 〈P,T 〉 with the proposed and wellknown methods, software modules were written in C#. These modules were used to solve problems of pattern recog nition in technical diagnosis of aircraft engines. When testing aircraft engines, the parameters char acterizing the quality of their operation in different modes are checked [19]. Such a test process, however, is rather timeconsuming, involves a great number of trials (cycles) for each product in different test modes, and requires substantial material costs (fuel) for test trials in each cycle. Moreover, test equipment has a limited capacity. Therefore, it is important to reduce the test time and the number of test modes in order to reduce material costs for aircraft engine production. This requires finding hidden dependencies between characteristics of aircraft engines in the process of test ing. Finding such dependencies will make it possible to reduce the number of test modes. The data sample contains values of the following characteristics measured during test trials in four test modes (takeoff, nominal, first cruising, and second cruising) [19]: p1 is the compressor turbine speed (rpm), p2 is the turbine inlet temperature (°C), p3 is the turbine flow, p4 is the engine inlet temperature
(°C), p5 is the number of stages, p6 is the blade angle of the inlet guide vane, p7 is the reduced power, p8 is the airflow, p9 is the air compression ratio, p10 is the adia batic pressure (mm), and p11–p14 are the flow sections for the nozzle diaphragm of the first, second, third, and fourth stages, respectively. Some data, however, may be missing from the sam ple due to the human factor, malfunction, failure of the measuring equipment, etc. Moreover, for some aircraft engines, information for only a few test modes is available. Taking into account that the original sam ple S contains missing values, it is reasonable to use the apparatus of association rules to find hidden relation ships in the data. The experimental results of investigating different methods for association rule extraction when solving the problem of finding hidden dependencies between the parameters of aircraft engines in different test modes are shown in the table (the original sample con tains information about 484 products). For the sto chastic method, the method of evolutionary search (MES) [15], and a simple genetic algorithm (SGA) [6, 12, 16], the average informativity of the rule VI(Ak) is used as a quality criterion that takes into account the confidence of the rule, as well as the individual infor mativity V m of all features pm in its antecedent Pk. The results shown in the table allow comparing the proposed method and the wellknown methods for association rule extraction (FARM [17], FWARM [18], and MCSF [5]), including popular stochastic methods (MES [15] and SGA [6, 12, 16]). Let us first compare the proposed method with other methods for association rule extraction: FARM [17], FWARM [18], and MCSF [5]. The table gives average values of the parameters supp, conf, confG, N ( Pk ), VS, VC, VM, and VI, which characterize the quality of association rules. As a result of the investigation, the dependencies Ak : Pk → Tk are found between different parameters of aircraft engines; these parameters describe the quality of engine operation in different modes, which allows developing some recommendations on reducing the
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 26
No. 2
2016
A STOCHASTIC APPROACH FOR ASSOCIATION RULE EXTRACTION
number of test trials in order to reduce the material cost for aircraft engine production. It can be seen (see table) that the average support of the association rules extracted by the proposed method (supp = 4,2 ). is somewhat below that of the rules extracted by the FARM [17] (supp = 6,2 ), FWARM [18] (supp = 5,4 ), and MCSF [5] (supp = 4,7 ); this is due to the fact that, in addition to frequently encountered confident rules, the proposed method also finds dependencies based on rarely encountered rules. This fact is also supported by the large number of extracted rules NA ( N A = 133 for the proposed method, N A = 121 for the FARM, N A = 87 for the FWARM, and N A = 82 for the MCSF). The average confidence of the quantitative associa tion rules extracted by the proposed stochastic method (conf = 90,1). is higher than that of the rules extracted by the FARM [17] (conf = 82,3 ) and FWARM [18] (conf = 87,7)), which testifies that the proposed method extracts more confident rules (owing to sto chastic searching through various combinations of antecedents and consequences of association rules and owing to taking into account a priori information about the significance of intervals and features). As compared to the MCSF (conf = 91,2 )., the stochastic method has a slightly lower confidence (conf = 90,1), since the proposed method uses the informativity of the rule VI ( Ak ) as a criterion for estimating associa tion rules, which takes into account not only the con fidence (confG) but also other characteristics of rules. The proposed method yields a base A = { A1, A2,..., A N A } of association rules Ak : Pk → Tk that is characterized by a higher average total confi dence of rules (confG = 89,2. versus confG = 73,7 for the FARM, confG = 77,5 for the FWARM, and confG = 82,1 for the MCSF), which takes into account the fulfillment frequency of both positive Pk → Tk and negative PPk → T conditions. k → Tkk Higher values of the criteria N(Pk), VS, VC, VM, and VI (for example, the average complexity VS ( Ak ) of association rules is VS = 0.64 for the proposed method versus VS = 0.52, 0.55, and 0.54 for the FARM, FWARM, and MCSF, respectively) are also due to using the informativity VI(Ak) (the criteria N(Pk), VS, VC, VM, and VI are interdependent) as a criterion for estimating the quality of extracted rules. This allows extracting a greater number N A of associ ation rules Ak : Pk → Tk that are simpler and more interpretable (have a lesser number of conditions N(Pk) in the antecedent Pk), as well as more confident and informative (have higher values of confG, VC, VM, and VI), as compared to the rules extracted by the other methods. As compared to other stochastic methods (MES and SGA) [6, 12, 15, 16], the proposed method shows PATTERN RECOGNITION AND IMAGE ANALYSIS
425
(see table) higher results, particularly, higher average informativity of rules VI ( Ak ) (14.10 versus 11.82 and 12.96 for the MES and SGA, respectively); as men tioned above, this criterion is used as an optimality cri terion for the search process. Moreover, it should be noted that the runtime of the proposed method (t = 82.7 ms) is considerably shorter than that of the MES and SGA (t = 492.1 and 561.9, respectively). This is due to the fact that the proposed method uses a priori information about the significance of feature intervals to construct a new set of solutions by applying the probabilistic operators of crossover and mutation (19)–(21), which, in turn, makes it possible to move the initial points of search closer to the optimal ones, thus reducing the time of optimization. In summary, the proposed method allows extract ing quantitative association rules with higher quality characteristics. CONCLUSIONS In this paper, the important problem of automating the extraction of quantitative association rules is solved and mathematical support of association rule extraction is developed, which allows improving the confidence and interpretability characteristics of rec ognition models based on association rules. The scientific novelty of this paper is that it pro poses the stochastic method for extracting quantitative association rules that implies preliminary partitioning of feature values into intervals taking into account the width of the value range and the frequency with which features fall within each of these intervals. The pro posed method employs the probabilistic approach to search through various combinations of antecedents and consequences of association rules and uses a priori information about the significance of intervals and features, which makes it possible to process numerical information when extracting association rules, to avoid a large number of searches through a given base of observations, and to find rules with high values of confidence and other quality criteria. In contrast to other stochastic methods (for exam ple, methods of genetic and evolutionary search) [6, 11, 12, 15, 16], the proposed method takes into account a priori information about the significance of feature intervals both at the stage of initialization (when generating the initial set of solutions) and in the process of search when constructing a new set of solu tions by applying the probabilistic operators of cross over and mutation (19)–(21); this makes it possible to move the initial points of search closer to the optimal ones, thus reducing the time of optimization. A system of criteria is proposed that takes into account different characteristics of association rules, such as confidence, execution frequency, informativ ity, and interpretability. This system of criteria can be used to automate the analysis of properties and to Vol. 26
No. 2
2016
426
OLIINYK, SUBBOTIN
compare between various models based on association rules when solving pattern recognition problems. The usefulness of the results obtained in this work is that software modules (written in C#) were devel oped that allow extracting association rules from given bases of observations by using the proposed and well known methods. These software modules were used to solve the problem of decision making in technical diagnosis of aircraft engines. ACKNOWLEDGMENTS This work was carried out in the framework of the state research and development project “Intelligent information technologies of automating the design, modeling, control, and diagnosis of manufacturing processes and systems” (project no. 0112U005350) and the international project “Centers of Excellence for Young Researchers” (CERES) as part of the pro gram “Tempus” of the European Commission (project no. 544137TEMPUS120131SKTEMPUS JPHES).
14. 15. 16. 17.
18. 19.
odology, structure, and applied problems,” Pattern Recogn. Image Anal. 24 (3), 333–346 (2014). V. I. Donskoi and Yu. Yu. Dyulicheva, “Inductive model of rcorrect empirical forest,” in Proc. Int. Conf. on Inductive Simulation (Lvov, 2002), pp. 54–58. M. Yu. Gen, Introduction to Evolutionary Algorithms (Decision Engineering) (Springer, London, 2010), p. 418. S. Smith and S. Cagnoni, Genetic and Evolutionary Computation: Medical Applications (John Wiley and Sons, Chichester, 2011), p. 250. D. Dubois, E. Hullermeier, and H. Prade, “A system atic approach to the assessment of fuzzy association rules,” Data Mining Knowledge Discovery 13, 167– 192 (2006). M. S. Khan, M. Muyeba, and F. Coenen, “Weighted association rule mining from binary and fuzzy data,” Lecture Notes in Comput. Sci. 5077, 200–212 (2008). A. V. Boguslaev, Al. A. Oleinik, An. A. Oleinik, D. V. Pavlenko, and S. A. Subbotin, Progressive Technologies for Simulating, Optimizing and Intellectual Automation for Aircraft’s Engines Lifetime Stages (Motor Sich, Zaporozhye, 2009).
Translated by Yu. Kornienko
REFERENCES 1. J. Rauch, Observational Calculi and Association Rules (SpringerVerlag, Berlin, 2013), p. 296. 2. J.M. Adamo, Data Mining for Association Rules and Sequential Patterns: Sequential and Parallel Algorithms (SpringerVerlag, New York, 2001), p. 259. 3. L. Petrala, Discovery of Association Rules in Datasets via Evolutionary Algorithms (Lambert Acad. Publ., Ham burg, 2014), p. 96. 4. J.M. Adamo, Computational Structures and Algorithms for Association Rules: the Galois Connection (Cre atespace, Seattle, 2011), p. 276. 5. A. A. Oleinik, T. A. Zaiko, and S. A. Subbotin, The Way to Synthesize Diagnostic and Recognition Models on the Base of Hybrid NeuroFuzzy Technologies of Computa tional Intelligence (Kompaniya Smit, Kharkov, 2014), p. 284 [in Russian]. 6. A. Engelbrecht, Computational Intelligence: an Intro duction (John Wiley and Sons, Sidney, 2007), p. 597. 7. J. Abonyi and B. Feil, Cluster Analysis for Data Mining and System Identification (Birkhäuser, Basel, 2007), p. 303. 8. I. T. Jolliffe, Principal Component Analysis (Springer Verlag, Berlin, 2002), p. 489. 9. A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis (John Wiley and Sons, New York, 2001), p. 481. 10. J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction (Springer, New York, 2007), p. 308. 11. C. Y. Shin and C. Xu, Intelligent Systems: Modeling, Optimization, and Control (CRC Press, Boca Raton, 2009), p. 456. 12. S. Bow, Pattern Recognition and Image Preprocessing (Marcel Dekker, New York, 2002), p. 698. 13. I. B. Gurevich and Yu. I. Zhuravlev, “Computer sci ence: subject, fundamental research problems, meth
Andrii Oleksandrovich Oliinyk. Born 1983. Graduated from Zapor izhzhya National Technical Univer sity in 2007. Received candidate’s degree in 2009. Associate Professor at the Department of Software Tools of Zaporizhzhya National Technical University. Scientific interests: artifi cial intelligence, pattern recognition, evolutionary optimization, multi agent algorithms, decision trees, asso ciation rules, neuro and neurofuzzy technologies. Winner of the Prize of the Verkhovna Rada of Ukraine.
Sergey Aleksandrovich Subbotin. Born 1978. Graduated from Zapor izhzhya National Technical Univer sity in 2000. Received candidate’s degree in 2005 and doctoral degree in 2014. Professor at the Department of Software Tools of Zaporizhzhya National Technical University. Scien tific interests: pattern recognition, neurofuzzy technologies, technical and biomedical diagnostics. Winner of the Prize of the President of Ukraine and Prize of the Verkhovna Rada of Ukraine.
PATTERN RECOGNITION AND IMAGE ANALYSIS
Vol. 26
No. 2
2016