Neural Comput & Applic (2009) 18:891–901 DOI 10.1007/s00521-008-0206-2
ORIGINAL ARTICLE
Review of input determination techniques for neural network models based on mutual information and genetic algorithms

Maria Paula da Costa Couto
Received: 17 January 2008 / Accepted: 30 October 2008 / Published online: 28 November 2008
Springer-Verlag London Limited 2008
Abstract  The use of artificial neural network (ANN) models has grown considerably over the last decade. One of the difficulties in using ANNs is that in most cases a large number of potential input variables is available. In the past, there was a tendency to use a large number of inputs in ANN applications. This can have a number of detrimental effects on the network during training, and it also requires a greater amount of data to estimate the connection weights efficiently. Additional inputs tend to increase the training time and the risk of the training algorithm becoming stuck in a local minimum. A large number of inputs also increases the risk of including spurious variables that merely add noise to the forecasts. Consequently, it is important to use an appropriate input variable selection technique in order to obtain the smallest number of independent inputs that are useful predictors for the system under study. The aim of this paper is to review techniques for selecting appropriate model inputs, based in particular on mutual information and genetic algorithms.

Keywords  Input determination techniques · Artificial neural networks · Mutual information · Genetic algorithms
1 Introduction

Artificial neural networks (ANNs) are good functional approximators and are potentially well suited to the task of estimation. One of the main features of ANNs is their capacity for generalization, that is, to interpolate successfully on the basis of previously seen examples. Moreover, once trained, ANNs can run at a much higher speed than iterative matrix methods. The selection of an appropriate set of variables, from a set of measured potential input variables, to include as inputs to a model of the system under study is a vital step in the development of the model. This is particularly important in data-driven techniques such as ANNs, since the performance of the final model depends heavily on the input variables used to develop it. If a linear relation characterises the underlying system (the dependent variable being a linear function of all model predictors), a measure of linear dependence such as the correlation coefficient can be sufficient. If the system is more complex, which is normally the case for any real physical system, linear methods are likely to produce misleading predictors and a badly formulated forecasting approach. Consequently, the dependencies between input and output variables, as well as the conditional dependencies between variables, are difficult to measure. This article reviews various techniques: for the linear case, the correlation coefficient and principal components analysis; for the non-linear case, brute force, techniques based on mutual information, and genetic algorithms.
2 Variable subset selection: filters, wrapper and embedded techniques
M. P. da Costa Couto (&) Departamento de Matema´tica, Faculdade de Cieˆncias e Tecnologia da UNL, Quinta da Torre, Monte de Caparica, 2829-516 Monte de Caparica, Portugal e-mail:
[email protected]
Many variable selection techniques include variable ranking as a principal or auxiliary selection mechanism because of its simplicity, scalability, and good empirical success. Following the classification of Kohavi and John [15]
variable ranking is a filter technique: it is a pre-processing step, independent of the choice of the predictor. The wrapper methodology, popularized by Kohavi and John [15], offers a simple and powerful way of solving the variable selection problem, regardless of the chosen learning machine. In fact, the learning machine is treated as a perfect black box, and the technique lends itself to the use of off-the-shelf machine learning software packages. In its most general formulation, the wrapper methodology consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables [9]. In the filter approach, the best subset of features is selected in one pass by evaluating some predefined criterion, independent of the actual generalization performance of the learning machine; a higher speed can therefore usually be obtained. However, it may fail to select the right subset of variables if the criterion used deviates from the one used for training the learning machine. The wrapper methodology offers a simple but powerful way to address the variable selection problem, although it involves more computational complexity and requires more execution time than the filter methodology. It is argued that filters have better generalization properties, since they are independent of any specific learning technique. In addition to wrappers and filters, embedded techniques form another category of variable selection; they perform the selection in the process of training and are usually specific to given learning machines [9].
3 Review of input determination techniques

3.1 Correlation coefficient/determination coefficient

Pearson's correlation is a measure of the linearity of the relation between two variables. The determination coefficient is the square of the correlation coefficient and takes only positive values, between 0 (for a non-existent correlation) and 1 (for a perfect correlation). Consider a set of m examples {x_k, y_k} (k = 1,…, m) consisting of n input variables x_{k,i} (i = 1,…, n) and an output variable y_k, and consider first the prediction of a continuous outcome y. The Pearson correlation coefficient is defined as

R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}}    (1)
where cov designates the covariance and var the variance. The estimate of R(i) is given by:
R(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \sum_{k=1}^{m} (y_k - \bar{y})^2}}    (2)
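As an illustration, a minimal sketch of correlation-based input ranking using (2) is given below, assuming the candidate inputs are stored column-wise in a NumPy array X and the output in a vector y; the function name and the use of NumPy are illustrative choices, not part of the original formulation.

import numpy as np

def correlation_ranking(X, y):
    """Rank candidate inputs by the squared Pearson correlation of Eq. (2).

    X : (m, n) array of m examples of n candidate input variables
    y : (m,) array of the output variable
    Returns the input indices ordered from most to least correlated.
    """
    Xc = X - X.mean(axis=0)              # centre each candidate input
    yc = y - y.mean()                    # centre the output
    num = Xc.T @ yc                      # numerator of Eq. (2) for every input
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den                        # Pearson correlation R(i), i = 1..n
    return np.argsort(-r ** 2)           # rank by the determination coefficient R(i)^2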
Correlation criteria such as R(i) can only detect linear dependencies between a variable and the target. A simple way of overcoming this restriction is to make a non-linear fit of the target with single variables and rank according to the goodness of fit. To limit the risk of overfitting, we can alternatively apply a non-linear pre-processing step (e.g. squaring, taking the square root, the logarithm, the inverse, etc.) and then use a simple correlation coefficient.

3.2 Principal components analysis

A common technique for reducing the number of variables used as input to a network is to take the principal components of the original data set. After calculating the covariance matrix, pairs of variables with high covariance are combined in such a manner that a new single variable is created that describes the original data as well as possible. This is equivalent to drawing a regression line through the points described by the original data pairs and describing each new point as the distance along that line. Repeating this N times for an N-dimensional data set produces a new set of points which still describe the original data perfectly, but in a different coordinate system. The advantage of the new coordinate system is that we can order the coordinates with respect to the amount of variance in the data set they account for, and simply discard those coordinates which account for the least. The reduction performed by principal components takes a covariance matrix and produces a set of vectors, called eigenvectors, that can be used to project a set of variables from the original coordinate system into the new one. To each eigenvector corresponds a value called the eigenvalue; the percentage of the sum of all eigenvalues contributed by an individual eigenvalue tells us the percentage of the variation in the data accounted for by its associated vector. The clear disadvantage of this technique is that it considers only the linear case and ignores existing non-linear structures.
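A minimal sketch of this principal-components reduction is given below, assuming the inputs are the columns of a NumPy array X; the 95% retained-variance threshold used to decide how many components to keep is an arbitrary illustrative choice.

import numpy as np

def pca_reduce(X, retained_variance=0.95):
    """Project the inputs onto the leading principal components.

    X : (m, n) array of m examples of n input variables
    Returns the scores in the new coordinate system, keeping only the
    components that together account for `retained_variance` of the variance.
    """
    Xc = X - X.mean(axis=0)                        # centre the data
    cov = np.cov(Xc, rowvar=False)                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigen-decomposition
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum() # cumulative variance accounted for
    k = int(np.searchsorted(explained, retained_variance)) + 1
    return Xc @ eigvecs[:, :k]                     # scores on the first k components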
3.3 Brute force

A technique designated brute force for removing model variables is to build a neural model with many inputs and a small number of hidden units. If the weights in the network are initialised with small, random values, then there will be no pressure for the weights from the less useful input variables to move far from their original position. Consequently, the input units whose weights have not moved far from their original values in the trained network may be most safely discarded. A second network can then be trained on the new data set. The disadvantage of this technique is the time it takes to train a large network, and obviously the number of inputs cannot be allowed to become large or a combinatorial explosion will occur. The advantage is that the performance of the second network can be compared with that of the first, to check whether the reduction of the variables was carried out successfully. There is yet another advantage: there is no need for extra mathematical analysis or for writing software to compute entropy values or principal components.

3.4 Mutual information techniques

If a linear relationship characterizes the underlying system (the dependent variable being a linear function of all model predictors), then linear dependence measures are sufficient. But if the system is more complex, which is usually the case, measures are needed that take into account not only the linear but also the non-linear relationships. This is the case for measures based on mutual information.

3.4.1 MI (mutual information) calculus

Mutual information measures the dependence between two variables. The mutual information between variables X and Y is expressed as

MI(X, Y) = \int \int f_{x,y}(x, y) \log_e \frac{f_{x,y}(x, y)}{f_x(x) f_y(y)} \, dx \, dy    (3)

where f_x(x) and f_y(y) are the marginal probability density functions of X and Y, respectively, and f_{x,y}(x, y) is the joint probability density function of X and Y. Mutual information measures the reduction in uncertainty about Y when variable X is known. If there is no dependence between X and Y, that is, the two random variables are statistically independent, f_{x,y}(x, y) equals the product of the marginal densities f_x(x) f_y(y), and the MI score in (3) equals 0 (the ratio of the joint and marginal densities being one, giving the logarithm a value of zero). A high MI value indicates a strong dependence between the two variables [8, 19, 20]. For any given bivariate sample, the MI score in (3) can be estimated as

MI = \frac{1}{n} \sum_{i=1}^{n} \log_e \frac{f_{x,y}(x_i, y_i)}{f_x(x_i) f_y(y_i)}    (4)

where x_i and y_i are the ith bivariate sample data pair in a sample of size n, and f_x(x_i), f_y(y_i) and f_{x,y}(x_i, y_i) are the respective univariate and joint densities estimated at the sample data points.
The key to an accurate estimate of the MI is the accurate estimation of the marginal and joint probability densities in (4) from a finite set of examples. Some earlier versions of the MI function used crude measures of the probability density, such as a histogram with a fixed bin width. More recently, kernel estimation techniques have been used because of their stability, efficiency and robustness [22, 24]. The kernel density estimator adopted can be a Gaussian kernel function [22], expressed as

\hat{f}_x(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi)^{d/2} \lambda^d \det(S)^{1/2}} \exp\left( -\frac{(x - x_i)^T S^{-1} (x - x_i)}{2\lambda^2} \right)    (5)

where \hat{f}_x(x) is the multivariate kernel density estimate of the d-dimensional variable set X at coordinate location x; x_i is the ith multivariate data point, for a sample of size n; S is the sample covariance of the variable set X; det() represents the determinant operation; and \lambda is a smoothing parameter, known as the "bandwidth" of the kernel density estimate. In order to obtain an accurate density estimate, the bandwidth \lambda should be chosen carefully. A large value of \lambda results in an over-smoothed probability density, with subdued modes and over-enhanced tails. A low value, on the other hand, can lead to density estimates overly influenced by individual data points, with noticeable bumps in the tails of the probability density. Only for intermediate values of \lambda can a good density estimate be obtained. Several operational rules for choosing optimal values of the bandwidth \lambda are available in the literature. Sharma [23] used a relatively simple and computationally efficient choice, known as the Gaussian reference bandwidth [22, 24]:

\lambda_{ref} = \left( \frac{4}{d + 2} \right)^{1/(d+4)} n^{-1/(d+4)}    (6)

where n and d refer to the sample size and the dimension of the multivariate variable set, respectively. The kernel density implementation of the MI criterion was first proposed by Moon et al. [19]; this was an improvement over the original MI criterion [8], which used a histogram as the means of estimating the joint and marginal probabilities.
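A hedged sketch of the kernel-based MI estimate of (4)-(6) for two scalar variables is given below, using NumPy; the helper names are illustrative, and numerical safeguards (e.g. against singular covariance matrices) are omitted for brevity.

import numpy as np

def gaussian_kde(data):
    """Gaussian kernel density estimate of Eq. (5), evaluated at the sample points.

    data : (n, d) array of n observations of a d-dimensional variable set.
    Uses the Gaussian reference bandwidth of Eq. (6).
    """
    n, d = data.shape
    S = np.cov(data, rowvar=False).reshape(d, d)             # sample covariance
    S_inv, det_S = np.linalg.inv(S), np.linalg.det(S)
    lam = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))   # Eq. (6)
    norm = (2 * np.pi) ** (d / 2) * lam ** d * np.sqrt(det_S)
    diff = data[:, None, :] - data[None, :, :]                # pairwise x - x_i
    dist2 = np.einsum('abi,ij,abj->ab', diff, S_inv, diff)    # (x - x_i)^T S^-1 (x - x_i)
    return np.exp(-dist2 / (2 * lam ** 2)).sum(axis=1) / (n * norm)

def mutual_information(x, y):
    """Sample MI estimate of Eq. (4) between two scalar variables."""
    fx = gaussian_kde(x[:, None])
    fy = gaussian_kde(y[:, None])
    fxy = gaussian_kde(np.column_stack([x, y]))
    return np.mean(np.log(fxy / (fx * fy)))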
3.4.2 Partial mutual information

The MI technique can be used to identify appropriate model inputs but cannot account for redundant inputs [23]. For instance, if input x has a strong dependency relationship with input y, as indicated by the MI value, then a second input z = 2x also has a high MI value. However, this new variable, although flagged as a significant input, is a redundant one. Recently, Sharma [23] proposed an input determination technique based on the partial mutual information (PMI) criterion to overcome the limitation of the correlation coefficient in selecting appropriate model inputs. The PMI criterion is able to capture adequately all dependence (linear and/or non-linear) between two variables and avoids the need for any major assumptions about the underlying model structure; as such, it can be considered a model-free approach. The PMI criterion is an extension of the concept of mutual information. The PMI between the output (dependent variable) y and the input (independent variable) x, for a set of pre-existing inputs z, is given by

PMI = \int \int f_{x',y'}(x', y') \log_e \frac{f_{x',y'}(x', y')}{f_{x'}(x') f_{y'}(y')} \, dx' \, dy'    (7)

where

x' = x - E[x|z], \quad y' = y - E[y|z]    (8)
and E[·] denotes the expectation operation. By using the conditional expectations in (8), the variables x' and y' contain only the residual information remaining after the effect of the pre-existing set of inputs z has been taken into consideration. The discrete version of (7) can be used to provide a sample estimate of the PMI score, and is given by

PMI = \frac{1}{n} \sum_{i=1}^{n} \log_e \frac{f_{X',Y'}(x'_i, y'_i)}{f_{X'}(x'_i) f_{Y'}(y'_i)}    (9)

where x'_i and y'_i are the ith residuals in a sample data set of size n, and f_{X'}(x'_i), f_{Y'}(y'_i) and f_{X',Y'}(x'_i, y'_i) are the respective marginal and joint probability densities.

3.4.3 Modified PMI

This technique uses an alternative stopping criterion for the implementation of the PMI technique, one that is efficient and robust [7]. Significant PMI scores are treated as outliers among the computed PMI scores, and a robust outlier detection technique, the Hampel identifier [5], is used to evaluate the significance of selected candidate inputs. If the highest PMI score is found to be an outlier among the computed PMI scores for a particular step, then the input corresponding to that score is a significant input.
The algorithm of the modified PMI technique is defined as follows:

1. Identify the set of potential inputs, which we designate z_in.
2. Calculate the PMI between the output and each of the potential new inputs in z_in.
3. Calculate the Hampel distance corresponding to the highest PMI score in step 2.
4. If the Hampel distance for the highest PMI value is greater than 3, add the input variable corresponding to the highest PMI score to the selected input set z and remove it from z_in. If the highest PMI score is not an outlier, go to step 6.
5. Repeat steps 2-4 as many times as needed.
6. This step is only reached when all significant inputs have been obtained.
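As a sketch of the stopping rule in steps 3-4, the Hampel distance of the highest PMI score can be computed as below; the scaling constant 1.4826 and the threshold of 3 follow the standard Hampel identifier, and the function is an illustrative outline rather than the authors' exact implementation.

import numpy as np

def hampel_distance(pmi_scores):
    """Hampel distance of the highest PMI score among the current candidates.

    A candidate is treated as a significant input when this distance
    exceeds 3 (step 4 of the modified PMI algorithm).
    """
    scores = np.asarray(pmi_scores, dtype=float)
    median = np.median(scores)
    mad = 1.4826 * np.median(np.abs(scores - median))   # robust scale estimate
    return np.abs(scores.max() - median) / mad

# usage sketch: keep adding the best candidate while its score is an outlier,
# e.g.  while candidates and hampel_distance(pmi_of(candidates)) > 3: ...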
3.4.4 Mutual information-based feature selection

Let us consider an approach to the following problem: given an initial set F with n features, find the subset S ⊂ F with k features that minimizes H(C|S), that is, that maximizes the mutual information I(C; S). The MI between vector variables is approximated using the MI between the individual components of the vectors: instead of calculating the mutual information I(C; S) between a feature subset and the class variable C, we only calculate I(f; C) and I(f; f') where f and f' are individual features. In this way, an exact MI computation that is computationally infeasible is replaced by a series of feasible calculations, and the analysis of all subsets is replaced by a "greedy" algorithm. Given the set of already-selected features, the algorithm chooses the next feature as the one that maximizes the information about the class, corrected by subtracting a quantity proportional to the average MI with the selected features. In order to be selected, a feature must be informative about the class without being predictable from the current set of features. For example, if two features f and f' are highly dependent, I(f; f') will be large and, after the better one is picked, the selection of the second one is penalized [2]. The MIFS algorithm is defined as follows:

1. (Initialization) Set F ← "initial set of n features"; S ← "empty set".
2. (Computation of the MI with the output class) For each feature f ∈ F compute I(C; f).
3. (Choice of the first feature) Find the feature f that maximizes I(C; f); set F ← F \ {f}; set S ← {f}.
4. (Greedy selection) Repeat until the desired number of features is selected:
   (a) (Computation of the MI between variables) For all pairs of variables (f, s) with f ∈ F, s ∈ S, compute I(f; s), if it is not already available.
   (b) (Selection of the next feature) Choose the feature f that maximizes I(C; f) − β Σ_{s∈S} I(f; s); set F ← F \ {f}; set S ← S ∪ {f}.
5. Output the set S containing the selected features.
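A minimal sketch of this greedy selection is given below, assuming the MI values have already been estimated and are passed in as dictionaries; the default β = 0.7 is an illustrative value within the 0.5-1 range discussed next.

def mifs(mi_with_class, mi_between, k, beta=0.7):
    """Greedy MIFS selection (steps 1-5 above).

    mi_with_class : dict f -> I(C; f) for each candidate feature
    mi_between    : dict (f, s) -> I(f; s), assumed symmetric in its keys
    k             : number of features to select
    """
    F = set(mi_with_class)                             # step 1: initial feature set
    first = max(F, key=lambda f: mi_with_class[f])     # step 3: first feature
    S, F = [first], F - {first}
    while F and len(S) < k:                            # step 4: greedy selection
        best = max(F, key=lambda f: mi_with_class[f]
                   - beta * sum(mi_between[(f, s)] for s in S))
        S.append(best)
        F.remove(best)
    return S                                           # step 5: selected features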
The parameter β determines the relative importance of the MI between the candidate feature and the already-selected features. If β is zero, only the MI with the output class is considered for each feature. As β increases, this measure is discounted by a quantity proportional to the total MI with respect to the already-selected features. In practice, a value of β between 0.5 and 1 has been found appropriate for many classification tasks.

3.4.5 Modified mutual information feature selection

The calculation of the exact MI value between all possible subsets and the output classes is computationally infeasible [1]. On the other hand, considering individual features (as the MIFS algorithm does) is not an appropriate solution, since it does not consider sets of features together. The MMIFS algorithm instead finds the first top four features by considering the MI between the output classes and subsets of four features (not all subsets are considered). To select the fifth feature, four MI values are calculated between the outputs and subsets of four features (three selected features and a new one). The feature that gives the maximum sum of those four MI values is selected. This ensures that the new selected feature has, on average, the maximum MI value in relation to the four previously selected features. From the four previously selected features, the three that work better than the new selected feature will be used with the new one in the selection of the sixth feature. This procedure is repeated until the desired number of features, M, is reached. The MMIFS algorithm is defined as follows:
1. Set F ← "initial set of N features"; S ← {∅}.
2. Compute I(C; f), f ∈ F. Set F1 ← "the M/2 features that maximise I". (No. of calculated I's = N.)
3. For each f′ = F1(j), j = 1 : M/2, compute I(C; f, f′), f ∈ F, f ≠ f′. Set F2 ← "the M/2 subsets (of 2 features) that maximise I". (No. of calculated I's = (M/2)(N − 1).)
4. For each subset {f′, f″} = F2(j), j = 1 : M/2, compute I(C; f, f′, f″), f ∈ F, f ∉ F2(j). Set F3 ← "the M/2 subsets (of 3 features) that maximise I". (No. of calculated I's = (M/2)(N − 2).)
5. For each subset {f′, f″, f‴} = F3(j), j = 1 : M/2, compute I(C; f, f′, f″, f‴), f ∈ F, f ∉ F3(j). Using the maximum value of I, set F4 = S = {f, f′, f″, f‴}. (No. of calculated I's = (M/2)(N − 3).)
6. Repeat until |S| = M: for each f ∈ F, f ∉ S, compute the four MI values obtained by combining f with each subset of three of the four features in F4, and sum them. (No. of calculated I's = 4(N − |S|).) Take the f that gives the maximum value, substitute it in F4 for the element that has the lowest I compared with the other three, and set S ← S ∪ {f}.
3.4.6 MIFS-U

Fano [6] has shown that maximizing the mutual information between the transformed data and the desired target achieves a lower bound on the probability of error. Bearing this in mind, Battiti developed the MIFS feature selection method [2, 18]. The technique evaluates the mutual information between individual features and the class labels, and selects those features that have maximum mutual information with the class labels and are less redundant. However, because of large errors in estimating the mutual information, the performance of MIFS can be degraded. Kwak and Choi [18] enhanced the MIFS technique under the assumption of uniform distributions of the information of input features and presented an algorithm called MIFS-U. MIFS-U differs from MIFS in step 4, which in the MIFS-U case becomes:

4. (Greedy selection) Repeat until the desired number of features is selected:
   (a) (Computation of entropy) For all s ∈ S, compute H(s) if it is not already available.
   (b) (Computation of the MI between variables) For all pairs of variables (f, s) with f ∈ F, s ∈ S, compute I(f; s), if it is not already available.
   (c) (Selection of the next feature) Choose the feature f ∈ F that maximizes I(C; f) − β Σ_{s∈S} [I(C; s)/H(s)] I(f; s); set F ← F \ {f}; set S ← S ∪ {f}.

Here β is a redundancy parameter that accounts for the redundancy among input features. If β = 0, the mutual information between input features is not taken into consideration and the algorithm selects features in the order of their mutual information with the output classes; the redundancy between input features is never reflected. As β increases, the mutual information between input features starts to influence the selection procedure and redundancy is reduced. But if β is too large, the algorithm considers only the relationships between inputs and does not reflect the input-output relationship well.
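Only the scoring rule of step 4(c) changes with respect to MIFS, so a sketch of that rule alone is given below; the argument names are illustrative and the MI and entropy estimates are assumed to be precomputed.

def mifs_u_score(f, S, mi_with_class, mi_between, entropy, beta=0.7):
    """MIFS-U ranking of a candidate feature f given the selected set S.

    entropy : dict s -> H(s); the redundancy term is weighted by I(C; s)/H(s).
    """
    redundancy = sum(mi_with_class[s] / entropy[s] * mi_between[(f, s)] for s in S)
    return mi_with_class[f] - beta * redundancy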
3.4.7 MICC (a filter approach to feature selection based on mutual information)

MIFS-U provides a better estimation of the mutual information criterion than MIFS, but it fails to give an accurate estimation when the information distributions of the input features deviate considerably from uniform [11]. In addition, in both MIFS and MIFS-U the parameter β, related to the redundancy of the selected features, needs to be preset by the user. An inappropriate β value can greatly affect the correct choice of features, and choosing an appropriate value for β has proved a difficult problem. To overcome the limitations of MIFS and MIFS-U, a constructive criterion for feature ranking under arbitrary distributions of input features, based on mutual information, was devised, and a new feature selection algorithm, the MICC algorithm, was built. The MICC algorithm differs from MIFS and MIFS-U in step 4:

4. (Greedy selection) Repeat until the desired number of features is selected:
   (a) (Computation of entropy) For all s ∈ S, compute H(s) if it is not already available.
   (b) (Computation of the MI between variables) For all pairs of variables (f, s) with f ∈ F, s ∈ S, compute H(f) and I(f; s), if they are not already available.
   (c) (Selection of the next feature) Choose the feature f ∈ F that maximizes I(C; f) − (1/N_s) Σ_{s∈S} I(C; f) I(f; s) / min{H(f), H(s)}; set F ← F \ {f}; set S ← S ∪ {f}.

This algorithm was designated the MI-based constructive approach (MICC). It has the same computational cost as MIFS-U, for which the computational effort grows roughly as n² as the number of features n increases for a given number of examples. This indicates that the MICC algorithm can be applied to large problems without excessive computational cost.
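A sketch of the MICC scoring rule of step 4(c) is shown below, under the same assumption of precomputed MI and entropy estimates; note that no user-set β appears.

def micc_score(f, S, mi_with_class, mi_between, entropy):
    """MICC ranking of a candidate feature f given the selected set S (step 4c).

    The redundancy term is normalised by min{H(f), H(s)} and averaged over
    the already-selected features, so no redundancy parameter is required.
    """
    if not S:
        return mi_with_class[f]
    redundancy = sum(mi_with_class[f] * mi_between[(f, s)]
                     / min(entropy[f], entropy[s]) for s in S)
    return mi_with_class[f] - redundancy / len(S)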
3.4.8 ACOMI

This technique, designated ACOMI, uses both ant colony optimization (ACO) and mutual information [27].

3.4.8.1 Ant colony optimization

Ant colony optimization (ACO) is derived from the behaviour of real ants in nature. In the real case, in addition to the ability to co-operate with other ants by exchanging pheromone trails, each ant has the ability to exploit a food source. In ACO, these counterparts can be found, and the parameter Exploit is in charge of the balance between the two abilities. The ACO algorithm is defined as follows:

Set parameters and initialize pheromone trails
repeat
  for k = 1 to m do
    construct an assignment X_k
  update pheromone trails using {X_1, …, X_m}
until best solution or maximum cycle reached

3.4.8.2 The feature selection technique

In real systems of ants, in addition to the co-operation between ants by exchanging pheromone trails, each ant also has the ability to exploit a food source. In ACO, this self-exploiting ability is determined at random.

3.4.8.2.1 The self-exploiting of an individual ant using mutual information

We substitute mutual information for the random choice, as a problem-specific local heuristic, as follows:

1. The mutual information between each input candidate and each output is estimated using the Fraser and Swinney technique [8].
2. In the process of self-exploiting of every ant, the inclusion of an input feature is decided on the basis of the probabilities P1, P3 and a random draw: if the mutual information for the ith candidate input is large, it is included in the input feature subset with a large probability P1; if it is moderate, it is included at random; if it is small, it is excluded from the input feature subset with a large probability P3.

3.4.8.2.2 The procedure of feature selection using the hybrid of ACO and MI (ACOMI)

Successive populations are generated using the co-operation between ants and the self-exploiting of each ant. The procedure of feature selection using the hybrid of ACO and MI (ACOMI) is as follows:

1. A determined number of generations and the population of the ant colony are initialized.
2. Each individual ant visits all features and builds a complete solution (keeping all features as binary bits in a bit string).
3. After all individuals have completed building solutions, the selected features are used for ANN training. The accuracy rates of the ANNs tested by the forecaster are the objective function, which determines whether the termination condition is satisfied (if true, the system exits; otherwise it continues).
4. The global optimal solution from the beginning of a trial is retained, and the pheromone intensities of its features, Δτ^k_{i,0} and Δτ^k_{i,1}, are defined by
\Delta\tau^k_{i,1} = \begin{cases} ARate_k / Q & \text{if ant } k \text{ selects feature } i \\ 0 & \text{otherwise} \end{cases}    (10)

\Delta\tau^k_{i,0} = \begin{cases} ARate_k / Q & \text{if ant } k \text{ does not select feature } i \\ 0 & \text{otherwise} \end{cases}    (11)

where Q is a constant and ARate_k is the accuracy rate of ant k.

5. The pheromone intensity of each feature is given by

\tau_{i,0} = (1 - \rho)\tau_{i,0} + \frac{\Delta\tau_{i,0}}{N_0}    (12)

\tau_{i,1} = (1 - \rho)\tau_{i,1} + \frac{\Delta\tau_{i,1}}{N_1}    (13)

where \rho represents the evaporation rate, introduced in order to avoid unlimited accumulation of trails;

\Delta\tau_{i,0} = \sum_{k=1}^{N} \Delta\tau^k_{i,0}    (14)

\Delta\tau_{i,1} = \sum_{k=1}^{N} \Delta\tau^k_{i,1}    (15)

are, respectively, the sums of the pheromone intensities of non-selected and selected features laid by all ant individuals, with N the population of the ant colony; and

N_0 = \sum_{k=1}^{N} Solution_{k,i,0}    (16)

N_1 = \sum_{k=1}^{N} Solution_{k,i,1}    (17)

are, respectively, the number of ants in which feature i is not selected and the number of ants in which feature i is selected.

6. A random number is generated and compared with the exploitation probability parameter Exploit. If this random number is greater than or equal to Exploit, whether feature i is selected or not is decided by the pheromone trails; otherwise it is decided by the self-exploiting ability of the ant. In the latter case, whether the feature is selected or not is determined by the mutual information; in the former case, feature i is chosen according to the larger of the two following probability parameters:

P_1 = \tau_{i,1}^{\alpha} [d_i + 1]^{\beta}    (18)

P_0 = \tau_{i,0}^{\alpha} [d_i]^{\beta}    (19)

where d_i = \sum_{j=1}^{i-1} Solution_{k,j,1} is the number of features already selected (Solution_{k,j,1} equals 1 if the feature is selected and 0 otherwise), and \alpha and \beta are factors governing, respectively, the influence of the pheromone intensity and of the number of selected features on this feature's selection.
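A hedged sketch of the per-feature decision in step 6 is given below; the MI thresholds used to label a score as "large" or "small", and the deterministic treatment of those two cases, are simplifying assumptions, since the paper describes them only as inclusion or exclusion "with large probability".

import random

def choose_feature(tau1, tau0, d_i, mi, exploit, alpha=1.0, beta=1.0,
                   mi_high=0.5, mi_low=0.1):
    """Decide whether an ant selects a feature (step 6); a sketch, not the paper's code.

    tau1, tau0 : pheromone intensities for selecting / not selecting the feature
    d_i        : number of features already selected by this ant (Eqs. 18-19)
    mi         : estimated mutual information of the feature with the output
    mi_high, mi_low : illustrative thresholds for 'large' and 'small' MI
    """
    if random.random() >= exploit:
        # co-operation: follow the pheromone trails, Eqs. (18)-(19)
        p1 = tau1 ** alpha * (d_i + 1) ** beta
        p0 = tau0 ** alpha * d_i ** beta
        return p1 >= p0
    # self-exploiting: follow the mutual information heuristic
    if mi >= mi_high:
        return True          # large MI: include (simplification of "with probability P1")
    if mi <= mi_low:
        return False         # small MI: exclude (simplification of "with probability P3")
    return random.random() < 0.5   # moderate MI: include at random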
3.5 Techniques based on genetic algorithms and hybrid genetic algorithms

Genetic algorithms (GAs) are search algorithms based on the theory of natural selection and population genetics. In this application, each individual in the population is a string of bits "1" and "0" (genes) and indicates a possible selection of inputs. Successive populations are generated using a breeding process that favours fitter individuals. Individuals with high fitness have a greater probability of contributing descendants to the next generation. Three types of operators combine with each other to give rise to a new generation. In replication, individual vectors are copied into the next generation; the higher the fitness value of an individual, the greater the probability of it being copied. New individuals are created through the mating of existing individuals; the probability of a vector being chosen as a parent depends on its fitness. A number of crossover points are randomly chosen along the vector, and a child is created by copying from one parent until a crossover point is reached, then switching to the other parent, and repeating this process as many times as necessary. Vectors created by both reproduction and crossover can be mutated. Mutation is necessary so that new generations are more than a reorganisation of existing genetic material. After each generation, each individual is assessed, and the process is repeated until an optimum solution is obtained. The procedure of the genetic algorithm for feature selection is as follows [26]:

Initialization
  N  ← population size
  P  ← initial population with N subsets of Y
  Pc ← crossover probability
  Pm ← mutation probability
  T  ← maximum number of generations
  K  ← 0
Evolution
  Evaluate the fitness of P
  While (K ≤ T) do
    Selection
    Crossover with Pc
    Mutation with Pm
    Evaluate the fitness of P
    K ← K + 1
  End while
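A minimal sketch of such a binary-chromosome GA for input selection is given below; the population size, probabilities, truncation-style parent selection and elitist replication are illustrative simplifications of the fitness-proportional scheme described above, and "fitness" stands for any evaluation of an ANN trained on the selected inputs.

import random

def ga_feature_selection(n_features, fitness, pop_size=30, pc=0.8, pm=0.02, T=50):
    """Binary-chromosome GA for input selection (a sketch of the procedure above).

    fitness : callable mapping a 0/1 list (selected inputs) to a score to maximise.
    """
    pop = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(T):
        scored = sorted(pop, key=fitness, reverse=True)
        new_pop = scored[:2]                                   # replication of the fittest
        while len(new_pop) < pop_size:
            p1, p2 = random.sample(scored[:pop_size // 2], 2)  # parent selection
            if random.random() < pc:                           # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if random.random() < pm else g for g in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)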
3.5.1 Feature selection approach by hybrid genetic algorithm

Assume A is the set of initial features, A = S ∪ F, where S is the subset of already-selected features, F is the subset of unselected features, C is the set of output classes and f_i ∈ F [11-14]. It has been theoretically proved that genetic algorithms can find the optimum solution of a problem, in the sense of probability, in a random manner. However, simple genetic algorithms have some flaws, such as premature convergence and a weak capability of fine-tuning near local optimum points. To improve the fine-tuning capability and efficiency of simple GAs in feature selection problems, a hybrid GA with local search operations was suggested, in which I(C; f_i|S) is used to rank candidate features. The local search measure I(C; f_i|S) has three properties:

1. If I(C; f_i|S) = 0, feature f_i cannot offer any additional information about the classification after the set S has been selected, and f_i should not be included in S.
2. If I(C; f_i|S) > 0, feature f_i can offer some additional information about the classification after the set S has been selected, and f_i has a close relation to the set of outputs C, so f_i should be included in S.
3. If I(C; f_i|S) < 0, instead of offering new information about the classification after the subset S has been selected, feature f_i produces a reduction in the mutual information between the output class C and the subset S; in this case feature f_i must obviously be excluded.
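For discrete-valued features and classes, the local search measure I(C; f_i|S) can be estimated from plug-in joint entropies, as in the sketch below (in nats); this counting estimator is an illustrative choice and is not necessarily the estimator used in the cited papers.

import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy of a discrete sequence, in nats."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def conditional_mi(C, f, S_columns):
    """I(C; f | S) = H(C,S) + H(f,S) - H(C,f,S) - H(S) for discrete data.

    C : sequence of class labels; f : candidate feature values;
    S_columns : list of sequences holding the already-selected features.
    """
    S = list(zip(*S_columns)) if S_columns else [()] * len(C)
    H_S   = entropy(S)
    H_CS  = entropy(list(zip(C, S)))
    H_fS  = entropy(list(zip(f, S)))
    H_CfS = entropy(list(zip(C, f, S)))
    return H_CS + H_fS - H_CfS - H_S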
The local search operations are performed in a filter manner and have all the advantages of filter-type methods, such as efficiency, quick calculation and simplicity. First they eliminate the "bad genes" from all the chromosomes of a population, corresponding to the elimination of insignificant features from the subsets generated by the GA in each generation; learning machines are then trained on the pre-processed feature subsets. The optimal subset of features is found by the GA optimization over the generations, according to the performance of the series of trained learning machines. The implementation of the hybrid genetic algorithm for feature selection mainly involves the encoding scheme of the chromosomes, the evaluation of the fitness function, the local search operations, the design of the selection, genetic operations such as crossover and mutation, and the stopping criterion. The procedure of the HGA for feature selection, where M is the size of the population P(t), is as follows:
Begin
  Initialize P(0)
  t = 0
  While (t ≤ T) do
    Local improvement of each chromosome of P(t) by I(C; f_i|S)
    For i = 1 to M do
      Train a classifier on the feature subset selected by each individual of P(t) and evaluate the fitness
    End for
    If stopping conditions are satisfied, break
    Selection operation on P(t)
    Crossover operation on P(t)
    Mutation operation on P(t)
    For i = 1 to M do
      P(t + 1) = P(t)
    End for
    t = t + 1
  End while
End
3.5.2 Mutual information and genetic algorithm combined

In order to reduce the time taken to compute the MI between a single input and the whole data set, some data are randomly selected from the data set with a probability of 0.5 [26]; the set of data selected is called the MI set. Using Fraser and Swinney's technique [8], the mutual information x_i between each candidate input and each output in the MI set is estimated, giving a data set D = {x_i, i = 1,…, N}, where x_i represents the mutual information of the ith candidate input and N is the number of candidate inputs. We then calculate the mean x̄ and standard deviation S_N of the x_i:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (20)

S_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}    (21)

We consider three sets satisfying D = D_1 ∪ D_2 ∪ D_3:

D_1 = \{ x_i : x_i - \bar{x} > S_N/2 \}    (22)

D_2 = \{ x_i : -S_N/2 \le x_i - \bar{x} \le S_N/2 \}    (23)

D_3 = \{ x_i : x_i - \bar{x} < -S_N/2 \}    (24)

In the GA, we use the mutual information between each candidate input and the output to orient the mutation, according to the criterion

g_i = \begin{cases} 1 & x_i \in D_1 \\ \text{rand} & x_i \in D_2 \\ 0 & x_i \in D_3 \end{cases}    (25)
where g_i represents the ith gene in a binary chromosome, corresponding to the ith candidate input. If the mutual information x_i of the ith candidate input belongs to D_1, it is a highly correlated input for the output, so it is included in the input feature subset; if x_i belongs to D_2, it is a generally correlated input, so it is included at random; if x_i belongs to D_3, it is a weakly correlated input, so it is excluded from the input feature subset. The procedure of the proposed technique for feature selection is the same as the procedure of the GA for feature selection, except for the step "mutation with Pm", which becomes:

Mutation with Pm
  If x_i of the ith candidate input belongs to D_1, include it in the input feature subset;
  If x_i of the ith candidate input belongs to D_2, randomly include it in the input feature subset;
  If x_i of the ith candidate input belongs to D_3, exclude it from the input feature subset.
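A minimal sketch of this MI-guided gene assignment (Eqs. 20-25) is given below, assuming the per-input MI estimates are supplied as an array; the function name and the use of NumPy are illustrative.

import numpy as np

def mi_guided_mutation(mi_values, rng=None):
    """Set the genes of a chromosome from the MI scores, following Eqs. (20)-(25).

    mi_values : array of MI estimates x_i, one per candidate input.
    Returns a 0/1 gene vector: 1 for D1, random for D2, 0 for D3.
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(mi_values, dtype=float)
    mean, s_n = x.mean(), x.std()                  # Eqs. (20)-(21)
    genes = rng.integers(0, 2, size=x.size)        # default: random (set D2)
    genes[x - mean > s_n / 2] = 1                  # set D1: strongly related inputs
    genes[x - mean < -s_n / 2] = 0                 # set D3: weakly related inputs
    return genes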
" # n X 1=2 xi wji ; Dj ¼ X Wj ¼
j ¼ 1; . . .; M
i¼1
ð28Þ The PE with the lowest value for Dj is the winner during recall. To ensure biological plausibility, lateral interaction with neighbouring PEs is enforced by applying arbitrary network structures called neighbourhood sets, Nc. Throughout the process, all PEs within the winner’s neighbourhood set will have their weights updated, whilst PEs outside of this set left intact. The width or radius of Nc can be time variable. The following procedure is given by Wj ðtÞ þ aðtÞðXðtÞ Wj ðtÞÞ if j 2 Nc ðtÞ Wj ðt þ 1Þ ¼ Wj ðtÞ if j 62 Nc ðtÞ ð29Þ where a is a scalar-valued adaptation gain 0 \ a(t) \ 1. After the weights have been updated, the next input is presented to the network and the process continues until convergence has been reached. After successively presenting different inputs to the SOM, the net effect is that the weights reflect the topological relationship that exists within the input data [10].
3.5.3.1 SOM (self-organizing map) The self-organizing map (SOM); Kohonen [16, 17] is a technique which is able to arrange a set of multivariate data by similarities maintaining the topological structure of the data [3]. In competitive learning, neurons in the network adapt gradually to become sensitive to different input categories. The SOM network generally consists of two layers, an input layer and a Kohonen layer, which in most common applications is two-dimensional. None of the processing elements (PEs) in the Kohonen layer are connected to each other. The PEs in the Kohonen layer are connected to each other. The PEs in the Kohonen layer measure the distance of their weights to the input pattern. The procedure to determine the winning PE is as follows:First, we determine the extent to which the weights of each PE match the corresponding input pattern. The input data is defined as
3.5.3.2 General regression neural networks The general regression neural network (GRNN) is a type of supervised feedforward ANN and has been shown to be a universal approximator for smooth functions [21]. Determined by Specht [25] the GRNN does not require any special form than the regression function to be assumed, but rather, it uses a non-parametric estimate of the probability density function for the observed data. The advantage of using the general regression neural networks is
X ¼ fxi ; i ¼ 1; . . .; ng
1. Implementation of SOM The implementation of SOM aims at defining groups of similar input variables. By the sampling of one input from each cluster, it is possible to remove highly correlated, redundant variables from the original data set. There is no pre-defined criterion to determine the optimum size of the Kohonen layer [4], hence, the Kohonen layer should be kept large enough to ensure that a suitable number of clusters are formed from the training
ð26Þ
each of the M PEs in the Kohonen layer will also have n weight values is defined as ð27Þ Wj ¼ wi;j ; j ¼ 1; . . .; M; i ¼ 1; . . .; n For each of the M Kohonen PEs, the Euclidean distance is calculated using the following formula:
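A minimal sketch of one SOM training step, combining the winner search of Eq. (28) with the update of Eq. (29), is given below; representing the neighbourhood set as a callable returning PE indices is an illustrative assumption.

import numpy as np

def som_step(X, W, neighbours, alpha):
    """One SOM update: find the winning PE (Eq. 28) and move its neighbourhood (Eq. 29).

    X          : (n,) input pattern
    W          : (M, n) weight matrix, one row per Kohonen-layer PE
    neighbours : callable j -> indices of the PEs in the neighbourhood set Nc of PE j
    alpha      : adaptation gain, 0 < alpha < 1
    """
    D = np.sqrt(((X - W) ** 2).sum(axis=1))   # Euclidean distances D_j, Eq. (28)
    winner = int(np.argmin(D))                # the PE with the lowest D_j wins
    Nc = neighbours(winner)
    W[Nc] += alpha * (X - W[Nc])              # Eq. (29): update only PEs inside Nc
    return winner, W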
3.5.3.2 General regression neural networks

The general regression neural network (GRNN) is a type of supervised feedforward ANN and has been shown to be a universal approximator for smooth functions [21]. As described by Specht [25], the GRNN does not require any special form of the regression function to be assumed; rather, it uses a non-parametric estimate of the probability density function of the observed data. The advantages of using general regression neural networks are:

• The non-linear relationships between inputs and outputs can be analysed.
• The network architecture does not need to be determined.
• Training times are much shorter than for other ANNs.

3.5.3.3 Implementation of the SOM-GAGRNN technique

1. Implementation of SOM. The implementation of the SOM aims at defining groups of similar input variables. By sampling one input from each cluster, it is possible to remove highly correlated, redundant variables from the original data set. There is no pre-defined criterion to determine the optimum size of the Kohonen layer [4]; hence, the Kohonen layer should be kept large enough to ensure that a suitable number of clusters are formed from the training data.
For each Kohonen layer PE containing a cluster, only one input is sampled. The criterion for choosing a representative input for each cluster is to choose the input with the smallest Euclidean distance to the cluster's weights. The SOM input procedure is implemented for each variable in order to obtain a subset of lags with limited cross-dependence. Then, the lags that have been identified for each variable are combined into one data set, and the SOM input determination procedure is conducted again, this time to ensure that there is limited cross-dependence between the variables selected.

2. Implementation of the hybrid GAGRNN. Substantial variation amongst the input variables does not necessarily imply any relationship with the output variable. After reducing the number of input variables, a new technique (e.g. GAGRNN) is used to define which independent variables have a more significant relationship with the output. For the problem of determining the inputs to a GRNN, a population of randomly selected GRNNs is generated, each with a different subset of input variables as described by a binary string; for n variables the vector has size n. If the jth element of the vector is equal to 1 the input is included; if it is equal to 0, the input is not included. The models are trained and the output of each GRNN is used to determine the predictive error, or fitness, of the solution. Based on the fitness of each member of the population, a selection criterion is used to create a new population for the next generation. In this way, fit solutions are allowed to live and subsequently breed in the process of crossover. In the crossover process, two parent strings are cut and parts of the strings exchanged to produce two new individuals. To maintain genetic diversity and ensure that no important genetic material is overlooked, a mutation operation introduces small random changes. The method iterates until the error of the GRNN model converges to an acceptable value or a pre-specified maximum number of generations is completed. Due to the selective pressure applied over the generations, the overall trend is the evolution of chromosomes with higher fitness, representing improved input subsets.

4 Conclusion

The intention was to collect techniques for the reduction of input variables for neural network models, based on mutual information and genetic algorithms. The techniques reviewed have been tested on a wide variety of data sets, which limits the possibility of making comparisons across papers. The purpose of the next paper is to carry out such a comparison on water resources applications of ANNs.

References
1. Ahmed A, Dey L (2005) A feature selection technique for classificatory analysis. Pattern Recognit Lett 26(1):43-56. doi:10.1016/j.patrec.2004.08.015
2. Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537-550
3. Bowden GJ, Maier HR, Dandy GC (2005) Input determination for neural network models in water resources applications, Part 1. Background and methodology. J Hydrol (Amst) 301:75-92. doi:10.1016/j.jhydrol.2004.06.021
4. Cai S, Toral H, Qiu J, Archer JS (1994) Neural network based objective flow regime identification in air-water two phase flow. Can J Chem Eng 72:440-445
5. Davies L, Gather U (1993) The identification of multiple outliers. J Am Stat Assoc 88(423):782-792. doi:10.2307/2290763
6. Fano RM (1961) Transmission of information: a statistical theory of communications. Wiley, New York
7. Fernando TMG, Maier HR, Dandy GC, May R (2005) Efficient selection of inputs for artificial neural network models
8. Fraser AM, Swinney HL (1986) Independent coordinates for strange attractors from mutual information. Phys Rev A 33(2):1134-1140. doi:10.1103/PhysRevA.33.1134
9. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157-1182. doi:10.1162/153244303322753616
10. Islam S, Kothari R (2000) Artificial neural networks in remote sensing of hydrological processes. J Hydrol Eng 5(2):138-144. doi:10.1061/(ASCE)1084-0699(2000)5:2(138)
11. Huang J, Cai Y, Xu X (2006a) A filter approach to feature selection based on mutual information. In: Proceedings of the 5th IEEE international conference on cognitive informatics
12. Huang J, Lv N, Li W (2006b) A novel feature selection approach by hybrid genetic algorithm. In: PRICAI 2006, LNAI, vol 4099, pp 721-729
13. Huang J, Cai Y, Xu X (2006c) A wrapper for feature selection based on mutual information. In: Proceedings of the 18th international conference on pattern recognition (ICPR'06)
14. Huang J, Cai Y, Xu X (2007) A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognit Lett 28:1825-1844. doi:10.1016/j.patrec.2007.05.011
15. Kohavi R, John G (1997) Wrappers for feature selection. Artif Intell 97(1-2):273-324. doi:10.1016/S0004-3702(97)00043-X
16. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59-69. doi:10.1007/BF00337288
17. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464-1480. doi:10.1109/5.58325
18. Kwak N, Choi C-H (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143-159. doi:10.1109/72.977291
19. Moon YI, Rajagopalan B, Lall U (1995) Estimation of mutual information using kernel density estimators. Phys Rev E 52(3):2318-2321
20. Moon YI, Rajagopalan B, Lall U (1995) Estimation of mutual information using kernel density estimators. Phys Rev E 52(3):2318-2321
21. Sarle WS (1997) Neural network FAQ, periodic posting to the Usenet newsgroup. www.comp.ai.neural-nets
22. Scott DW (1992) Multivariate density estimation: theory, practice and visualization. Probability and mathematical statistics. Wiley, New York, p 317
23. Sharma A (2000) Seasonal to inter-annual rainfall probabilistic forecasts for improved water supply management: Part 1—a strategy for system predictor identification. J Hydrol (Amst) 239:232-239. doi:10.1016/S0022-1694(00)00346-2
24. Silverman BW (1986) Density estimation for statistics and data analysis. Monographs on statistics and applied probability. Chapman and Hall, New York, p 175
25. Specht DF (1991) A general regression neural network. IEEE Trans Neural Netw 2(6):568-576. doi:10.1109/72.97934
26. Zhang CK, Hu H (2005a) An effective feature selection scheme via genetic algorithm using mutual information. In: Fuzzy systems and knowledge discovery, PT 2, Proceedings, vol 3614, pp 73-80
27. Zhang CK, Hu H (2005b) Feature selection using the hybrid of ant colony optimization and mutual information for the forecaster. In: Proceedings of the 2005 international conference on machine learning and cybernetics, vol 1-9, pp 1728-1732