Neural Process Lett https://doi.org/10.1007/s11063-017-9764-6
First-Order Sensitivity Analysis for Hidden Neuron Selection in Layer-Wise Training of Networks

Bo Li¹ · Cheng Chen²

Bo Li: [email protected]
Cheng Chen: [email protected]

¹ College of Engineering, Department of Electrical Engineering, Nanjing Agricultural University, Nanjing 210031, Jiangsu, People's Republic of China
² Advanced Computing and Big Data Laboratory, Global Energy Interconnection Research Institute, Nanjing 210000, Jiangsu, People's Republic of China
© Springer Science+Business Media, LLC, part of Springer Nature 2017
Abstract  Multilayer neural networks are a current trend in machine learning. Although complex architectures bring high performance, keeping the neurons in each layer sparse can save memory, energy, and computational resources. In this paper, we aim to balance the benefits of architectural complexity against the sparsity of neurons. An algorithm is proposed to prune neurons in multilayer neural networks through global sensitivity analysis. Motivated by layer-wise training, we construct autoencoders with linear decoders, so the mathematical models of multilayer neural networks can be treated as additive models. Hence, a first-order sensitivity analysis method, called random balance designs (RBD), is employed to select redundant neurons in hidden layers. This paper provides a novel framework for applying RBD to multilayer neural networks. Multiple experimental results demonstrate the generality and effectiveness of the proposed approach for structural learning of neural networks. After removing superfluous hidden neurons, higher accuracy can be obtained in most cases with less computation.

Keywords  Layer-wise training · Hidden neuron selection · Feature detection · Sensitivity analysis · Neural networks
1 Introduction

Compared with shallow architectures, deep structures of neural networks always exhibit more power to handle highly non-linear and highly varying problems. In recent years, deep
models such as deep belief networks (DBNs) [1], stacked autoencoders (SAEs) [2], convolutional neural networks (CNNs) [3], and recurrent neural networks (RNNs) [4] have become increasingly used in pattern recognition, data mining, and so on. The learning of deep models primarily relies on layer-wise training, which is an effective way to overcome the long-standing difficulties of error backpropagation (BP). Therefore, this topic receives considerable attention in the field of computational intelligence.

With the successes of huge networks, the scale of neural network parameters has continuously expanded from millions [3] to billions [5]. However, because of the lack of a priori knowledge about a reasonable structure, neural networks are often over-parameterized. Neural networks with oversized architectures not only cost more chip resources, such as memory, energy, and computational time, but are also more likely to overfit. Therefore, how to obtain a sparse network without decreasing accuracy is an important research direction. This task is called structural optimization or structural learning, and together with parameter learning it constitutes the complete process of network learning. According to their different relationships with the parameter learning of networks, the existing structural optimization methods can be grouped into two categories: tight coupling approaches and loose coupling approaches.
1.1 Tight Coupling Approaches

Tight coupling approaches modify the network simultaneously with parameter learning. Thus, it is difficult to separate them from the iterative training phase. Moreover, their only objective is error minimization. Typical algorithms such as Dropout [6], DropConnect [7], and hard dropout [8] set some neuron weights to zero or delete neurons according to a certain probability or a threshold.
1.2 Loose Coupling Approaches

Loose coupling approaches have weak connections with network training and can easily be separated from parameter learning. Thus, these approaches can be used as validation tools to evaluate the rationality of network structures. Among such approaches, sensitivity analysis is an important technology. Sensitivity analysis was originally used for the structural optimization of neural networks in [9]. It evaluates the extent to which the outputs of neural networks are influenced by network components such as input neurons, hidden neurons, and weights. These influences are typically measured on the errors [10,11] or variances [12,13] of the output. In this way, the structures of neural networks are optimized by deleting the components that have minor effects on output errors or output variances. Compared with error-based algorithms, which focus heavily on error minimization, variance-based algorithms focus on the fluctuation of the output values; thus, they have better generalization ability and avoid over-fitting to some extent. Hence, we primarily focus on variance-based sensitivity analysis.

Variance-based sensitivity analysis consists of local sensitivity-based analysis and global sensitivity-based analysis. Local sensitivity-based methods are stable in cases of small variations of neural networks, but they are sensitive to significant network changes. By contrast, methods based on global sensitivity analysis have two main advantages. First, the whole bounded factor space is explored rather than a neighbourhood of a nominal point. Second, the global analysis takes the variation of all factors into consideration. As the quickest global sensitivity-based method, the extended Fourier amplitude sensitivity test (EFAST) computes the total effects for all network components and achieves satisfactory performance in structural optimization [14–16]. In our previous work [17], another global
sensitivity-based algorithm, called random balance designs (RBD), was introduced for the structural optimization of neural networks. RBD is a first-order sensitivity analysis method that only computes the main effects of network parameters rather than all interaction effects. Hence, RBD has improved computational efficiency and runs faster than EFAST. However, that method was restricted to networks with a single hidden layer and a single output neuron.

In this paper, we improve our previous work [17] and generalize the RBD-based structural learning algorithm to multilayer neural networks. First, following the framework of layer-wise training, a multilayer neural network is decomposed into stacked networks, each with a single hidden layer. Then, an RBD-based algorithm is proposed. Our main contribution is to provide a bridge between first-order sensitivity analysis and the structural learning of multilayer neural networks. According to the RBD sensitivity analysis, redundant neurons in the hidden layers are selected and deleted, and the structure of the original neural network is thus optimized. Compared with existing methods, our method has two advantages. First, it is based on RBD, which is quicker than existing global sensitivity-based methods while achieving the same accuracy. Second, our method is a loose coupling approach; in other words, it can be performed separately after parameter training has finished.

The remainder of this paper is organized as follows. Section 2 provides a problem description illustrating how to extend our former work from neural networks with a single hidden layer to general multilayer networks. Section 3 presents a brief introduction to sensitivity analysis, including the theory of RBD, which is the core of our proposed approach. In Sect. 4, the implementation details of the proposed RBD-based algorithm for hidden neuron selection are presented. Multiple experiments are conducted to illustrate the advantages of our algorithm in Sect. 5, together with a discussion of the algorithm. Finally, we draw conclusions and present directions for future work in Sect. 6.
2 Description of Hidden Neuron Selection

2.1 Additive Models for Multilayer Neural Networks

In our previous work [17], self-organizing neuro-fuzzy networks were taken as an example to study how to rationalize the structure of neural networks. As typical multiple-input single-output (MISO) models, self-organizing neuro-fuzzy networks have only a single hidden layer and one output neuron. In cases where the activation functions of the output neurons are linear, the output of such networks can be fitted by additive models [18], which take the following form:

$$y = \beta_0 + \sum_{i=1}^{k} f_i(x_i) + \varepsilon \quad (1)$$
where $x_i$ is an input factor, $k$ is the number of factors, $y$ is the output, $f_i(\cdot)$ is a smooth function fitted from the data, and $\beta_0$ and $\varepsilon$ denote a constant and white noise, respectively. Because of this connection to additive models, the first-order sensitivity analysis method RBD can be utilized for the structural adjustment of such networks. Compared to EFAST, RBD operates more quickly without loss of accuracy [17]. Thus, RBD can be generalized to all neural networks with a single hidden layer and a single linear output neuron.

Although our RBD-based structural optimization algorithm achieves reliable results, its application imposes strict constraints on the structures of networks. For a multilayer neural network, which is multiple-input multiple-output (MIMO) and more common, the RBD-based algorithm cannot be directly utilized because the output neurons of such networks do not form additive models.
In this paper, the connection between multilayer networks and additive models is established. A novel framework, motivated by the learning algorithm of deep neural networks, is provided to generalize the RBD-based algorithm to multilayer neural networks.

In contrast to traditional training algorithms for neural networks, deep learning algorithms are composed of two stages: layer-wise pre-training and fine-tuning. In layer-wise pre-training, each hidden layer and its previous layer are extracted separately to construct an autoencoder, which is a three-layer neural network whose output layer has the same number of nodes as the input layer. To reconstruct its own inputs, the outputs of an autoencoder should be equal to its inputs. The mathematical expression is given as follows:

$$H_j = f\left(\sum_{i=1}^{n} w^1_{ij} I_i + B^1_j\right), \qquad O_t = g\left(\sum_{j=1}^{k} w^2_{jt} H_j + B^2_t\right) \quad (2)$$

where $I$, $H$, and $O$ denote variables from the input layer, the hidden layer, and the output layer, respectively. In autoencoders, these two equations correspond to the encoding and decoding phases: inputs are encoded into hidden variables, and the hidden variables are then decoded into outputs, which are equal to the inputs. $w^1_{ij}$, $w^2_{jt}$, $B^1_j$, and $B^2_t$ are the weights and biases for encoding and decoding, respectively. $n$ and $k$ are the numbers of input and hidden neurons, respectively, and $i$, $j$, and $t$ are indices of neurons in each layer. Through these operations, a multilayer neural network can be decomposed into several single-hidden-layer networks.

Next, the output activation functions $g(\cdot)$ of the autoencoders in Eq. 2 are discussed. Since the outputs are equal to the inputs in autoencoders, if $g(\cdot)$ is nonlinear, such as a sigmoid function, then the inputs should be constrained to the range (0, 1) [19]. However, it is difficult to ensure that all types of input data meet this requirement after preprocessing, particularly for colour image data. Therefore, the output activation functions in this paper are defined as $g(x) = x$, which is easier to apply and more robust to parameter variations. The decoders of the autoencoders in this paper are thus linear, as expressed in Eq. 3:

$$O_t = \sum_{j=1}^{k} w^2_{jt} H_j + B^2_t \quad (3)$$
Consequently, the output models of the autoencoders are converted into additive models, and our RBD-based algorithm for structural adjustment can be generalized to multilayer neural networks.
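To make this concrete, the following minimal sketch (Python/NumPy; the function and variable names are our own, and a sigmoid encoder is assumed) computes the forward pass of one autoencoder with a linear decoder and shows why each output is an additive model of the hidden activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(I, W1, B1, W2, B2):
    """Forward pass of one autoencoder with a linear decoder (Eqs. 2-3).

    I  : (n,) input vector
    W1 : (k, n) encoder weights,  B1 : (k,) encoder biases
    W2 : (m, k) decoder weights,  B2 : (m,) decoder biases
    """
    H = sigmoid(W1 @ I + B1)   # encoding: H_j = f(sum_i w1_ij I_i + B1_j)
    O = W2 @ H + B2            # linear decoding: O_t = sum_j w2_jt H_j + B2_t
    return H, O

# Each output O_t is a weighted sum of the hidden activations plus a bias,
# i.e. an additive model in the H_j, which is the form required by RBD.
```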
2.2 Fundamental Framework

We take SAEs as an example to illustrate the fundamental framework of the proposed hidden neuron selection algorithm, which is shown in Fig. 1. This framework is similar to the layer-wise training of deep networks, and each hidden layer is simplified individually. First, an autoencoder is constructed. The first autoencoder A1 is composed of the hidden layer H1, its previous layer (i.e. the input layer), and an output layer that has the same number of neurons as the input layer. A1 has a linear decoder, and its inputs are the training set. Next, our structural optimization algorithm is applied to the autoencoder to prune redundant hidden neurons in H1. After the structural optimization of H1, a new iteration starts to extract the next hidden layer H2, and so forth. Assume that the input SAE in Fig. 1 has m hidden layers; m autoencoders are constructed for hidden neuron selection in total. For autoencoder Al (1 < l ≤ m), its inputs are equal to the outputs of H(l−1). The parameters of each autoencoder are already available for a well-trained original SAE. In this case, structural optimization is independent of the parameter learning and has no influence on the network weights.
Fig. 1 General flow of hidden neuron selection in multilayer neural networks
However, when the original SAE has not yet been trained, the autoencoder construction can be embedded in the layer-wise training, so that structural optimization and parameter learning are performed simultaneously. Finally, all hidden layers are optimized and hierarchically stacked into a new SAE model, which is more compact and efficient. Considering that linear autoencoders can also be utilized in the pre-training of CNNs for feature extraction, the proposed algorithm is likewise applicable to CNNs to reduce the number of convolutional filters.

The core of the framework is the structural optimization algorithm based on RBD, which is a specific sensitivity analysis method. Next, some details of sensitivity analysis are presented, as well as how to generalize the sensitivity analysis of MISO networks to MIMO models.
3 Sensitivity Analysis

3.1 Description of Variance-Based Sensitivity Analysis

Sensitivity analysis is widely used to determine insignificant model parameters, which can then be eliminated. Hence, sensitivity analysis can also be used to quantify the relevance among the components of neural networks. The most common form is variance-based sensitivity analysis [14]. Figure 2 depicts the basic operation flow of variance-based sensitivity analysis by sampling. The main steps are listed as follows [20]:

1. Define the model, its input factors, and output variables.
2. Assign probability density functions or ranges of variation to each input factor.
Fig. 2 The operation flow of variance-based sensitivity analysis by sampling
3. Generate an input matrix through a sampling design.
4. Evaluate the output.
5. Assess the relevance of each input factor to the output variable.

The outcome of variance-based sensitivity analysis represents the contribution of each input factor to the variance of the output. Thus, it is possible to identify the most influential factors. EFAST and RBD are two variance-based sensitivity analysis methods by sampling. Because RBD uses a random permutation of sampling nodes in each input space, it has two merits: simpler sampling without considering the interference of multiple factors, and a smaller sample set. Thus, RBD performs faster than EFAST. However, RBD needs one more precondition than EFAST, namely that the objective model must be an additive model.

In [21], for a generative model $y = f(x_1, x_2, \cdots, x_k)$, the following equation is established:

$$\sum_{i=1}^{k} S_{Ti} = 1 \quad (4)$$

where $x_1, x_2, \cdots, x_k$ are the input factors of the system and $y$ is its output. $S_{Ti}$ is the total sensitivity index of factor $x_i$. It can be calculated by adding all the sensitivity indices corresponding to this factor, i.e.

$$S_{Ti} = S_i + \sum_{j \neq i} S_{ij} + \cdots + S_{12\cdots k} \quad (5)$$

where $S_i$ is the first-order sensitivity index and $S_{ij}$ to $S_{12\cdots k}$ are the higher-order sensitivity indices. For a purely additive model, Eq. 6 holds:

$$\sum_{i=1}^{k} S_i = 1 \quad (6)$$

Thus, the higher-order sensitivity indices of all factors are equal to zero, which means that Eq. 7 holds:

$$S_i = S_{Ti} \quad (7)$$

Hence, if the neural network is an additive model, then RBD can be used for sensitivity analysis in this network. Through the computation of RBD, the first-order sensitivity index of $I_i$ can represent its full contribution to the variance of the output $y$.
3.2 Description of RBD

In RBD, all factors are sampled with the same frequency ω. This avoids the time-consuming process of frequency selection for the total effect. The parametric equation used in EFAST is given as follows:

$$I_i(s_j) = G_i(\sin \omega_i s_j), \quad \forall i = 1, \cdots, k; \ \forall j = 1, \cdots, N \quad (8)$$

where $I_i$ is the $i$-th input factor of the model, $G_i$ is a manually designed function for obtaining the desired probability density of $I_i$, and $s$ is a varying parametric variable obtained by $N$-times sampling in the range $(-\pi, \pi)$. If all frequencies $\omega_i$ are set to the same value, the parametric curve of each input only covers a subset of its respective input space. For example, when ω is set to 1, the result of sampling in the input space of two factors is shown in Fig. 3a. Hence, a random permutation is applied to the sampling of each input to generate different points. The parametric equation becomes

$$I_i(s_{ij}) = G_i(\sin \omega s_{ij}), \quad \forall i = 1, \cdots, k; \ \forall j = 1, \cdots, N \quad (9)$$
Fig. 3 A sampling example of two factors with different sampling strategies
where $s_{i1}, s_{i2}, \cdots, s_{iN}$ denotes the $i$-th random permutation of the $N$ points. Then, to systematically explore the space of input factors, the variation of $I_i$ restricted to the range $[a, b]$ is represented as in Eq. 10:

$$I_i(s_{ij}) = \frac{b+a}{2} + \frac{b-a}{\pi} \arcsin(\sin \omega s_{ij}) \quad (10)$$

The sampling result in the input space of two factors is shown in Fig. 3b, where the ranges of variation of the two factors are [−1, 1]. With a sample of size N, the model is evaluated using Eq. 11:

$$Y(s_j) = f(I_1(s_{1j}), I_2(s_{2j}), \cdots, I_k(s_{kj})) \quad (11)$$
Therefore, we obtain a model output of size N. Note that the sample size of RBD only needs to satisfy $N \geq 2M\omega_h + 1$, where $M$ is typically set from 4 to 6 and $\omega_h$ is the highest assigned frequency, whereas the sample size of EFAST is $N = k(2M\omega_h + 1)$, where $k$ is the number of factors needed for sensitivity analysis. Obviously, the sample size of RBD is considerably smaller than that of EFAST, which is why RBD always runs faster than EFAST. Then, the outputs $Y(s_j)$ are reordered as $Y^R(s_j)$ according to the values of $I_i(s_{ij})$ ranked in increasing order. The sensitivity of $Y$ to $I_i$ is determined by the harmonic content of the reordered model output, which is quantified by its Fourier spectrum in Eq. 12:

$$F(\omega) = \frac{1}{\pi}\left| \sum_{j=1}^{N} Y^R(s_j) \exp(-\mathrm{i}\omega s_j) \right|^2 \quad (12)$$

If the highest order of harmonics is restricted to $M$, the estimate of the variance $V_i$ explained by $I_i$ is computed as in Eq. 13:

$$\hat{V}_i = V[E(Y \mid I_i)] = \sum_{\omega=1}^{M} F(\omega) \quad (13)$$

Hence, the estimate of the first-order effect of $I_i$ on $y$ is computed as in Eq. 14:

$$\hat{S}_i = \frac{\hat{V}_i}{V} = \frac{\sum_{\omega=1}^{M} F(\omega)}{E[Y^R - E(Y^R)]^2} \quad (14)$$
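The estimation procedure of Eqs. 9–14 can be sketched as follows (Python/NumPy). This is only an illustrative implementation of the steps described above, with assumed defaults (ω = 1 and a common range [a, b] for all factors); practical RBD implementations differ in details such as the periodic re-ordering of the output and bias correction.

```python
import numpy as np

def rbd_first_order(model, k, N=501, M=6, a=-1.0, b=1.0, seed=None):
    """Estimate first-order sensitivity indices with RBD (Eqs. 9-14).

    model : callable mapping an (N, k) input matrix to an (N,) output vector
    k     : number of input factors
    N     : sample size, chosen so that N >= 2*M*omega + 1
    M     : highest harmonic retained in the spectrum
    a, b  : common range of variation assumed for all factors
    """
    rng = np.random.default_rng(seed)
    omega = 1
    s = np.linspace(-np.pi, np.pi, N)                        # parametric variable
    perm = np.array([rng.permutation(N) for _ in range(k)])  # random balance designs
    s_ij = s[perm]                                           # (k, N) permuted points, Eq. 9
    X = (b + a) / 2 + (b - a) / np.pi * np.arcsin(np.sin(omega * s_ij))  # Eq. 10
    Y = model(X.T)                                           # Eq. 11
    V = np.var(Y)                                            # total output variance
    S = np.zeros(k)
    for i in range(k):
        Y_r = Y[np.argsort(X[i])]              # reorder Y by increasing I_i
        C = np.fft.rfft(Y_r)                   # harmonic content, Eq. 12
        power = np.abs(C) ** 2 / N ** 2        # squared Fourier coefficients
        S[i] = 2.0 * power[1:M + 1].sum() / V  # Eqs. 13-14
    return S
```

For a purely additive test model such as `model = lambda X: X @ w`, the returned indices approximate the variance shares of the individual factors, consistent with Eqs. 6–7.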
Fig. 4 Hidden neuron selection in a MIMO model with a single hidden layer. Dotted lines mean removing a hidden neuron and its relative links
4 Algorithm Based on RBD

The layer-wise training of a hidden layer always yields a MIMO model, except for neural networks with a single output. Hence, in this section, we first propose a method to realize sensitivity analysis for MIMO models. Then, based on this method, the entire framework for the structural adjustment of multilayer neural networks is presented.
4.1 Sensitivity Analysis for MIMO Models

In [17], we used RBD-based sensitivity analysis for MISO models. The relevance between each hidden neuron and the single output can be represented as $S_i$, $i = 1, 2, \cdots, k$, where $k$ is the number of neurons in the hidden layer. We set $\lambda_{down}$ as the threshold to evaluate the importance of a neuron in the network. When $S_i < \lambda_{down}$, hidden neuron $i$ and its related links are removed from the network.

For the MIMO model in Fig. 4, the relevance between hidden neurons and output variables is denoted as $S = \{S_{ij}\}$, $i = 1, 2, \cdots, k$; $j = 1, 2, \cdots, m$. For hidden neuron $i$, we have the sensitivity index vector $S_i = \{S_{i1}, S_{i2}, \cdots, S_{im}\}$. Each output variable can be expressed as follows:

$$y_j = \sum_{i=1}^{k} w_{ij} h_i, \quad j = 1, 2, \cdots, m \quad (15)$$
where $w_{ij}$ is the weight between hidden neuron $i$ and output neuron $j$, and $h_i$ is the output of hidden neuron $i$. Since each output variable is represented as an additive model in Eq. 15, $S$ can be calculated through RBD. The importance of hidden neuron $i$ is evaluated by the effect indices in $S_i$. As in Eqs. 16 and 17, the maximum of $S_i$ is denoted by $S_i^{max}$, and the sum of $S_i$ is denoted by $S_i^{T}$:

$$S_i^{max} = \max(S_i) = \max(S_{i1}, S_{i2}, \cdots, S_{im}) \quad (16)$$

$$S_i^{T} = \mathrm{sum}(S_i) = \sum_{j=1}^{m} S_{ij} \quad (17)$$

If both $S_i^{max}$ and $S_i^{T}$ of hidden neuron $i$ are lower than certain thresholds, hidden neuron $i$ is negligible in this network. In other words, when $S_i^{max} < \zeta_i^{max}$ and $S_i^{T} < \zeta_i^{T}$, hidden neuron $i$ and its related links are removed from the network, like neuron $H_{k-1}$ in Fig. 4.
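A sketch of this selection rule, assuming S is the k × m first-order index matrix produced by RBD for one hidden layer (variable names are ours):

```python
import numpy as np

def select_hidden_neurons(S, zeta_max, zeta_T):
    """Keep/prune decision for hidden neurons in a MIMO layer (Eqs. 16-17).

    S        : (k, m) matrix, S[i, j] = first-order index of hidden neuron i
               with respect to output neuron j
    zeta_max : threshold(s) on S_i^max (scalar or length-k array)
    zeta_T   : threshold(s) on S_i^T  (scalar or length-k array)
    Returns a boolean mask that is True for neurons to keep.
    """
    S_max = S.max(axis=1)                          # Eq. 16
    S_T = S.sum(axis=1)                            # Eq. 17
    prune = (S_max < zeta_max) & (S_T < zeta_T)    # prune only if both fall below
    return ~prune
```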
4.2 Algorithms and Tricks

The main steps of our approach for SAEs and CNNs are given in Algorithm 1 and Algorithm 2, respectively. Both algorithms are applied in Sect. 5. In Algorithm 2, due to the layer-by-layer treatment of CNNs, sensitivity analysis is deployed during the construction of each convolution layer.

Algorithm 1 Structure learning on SAEs.
Require: The collections of influence thresholds, ζ^T and ζ^max; the number of autoencoders with a linear decoder stacked in the SAE, m; the number of neurons in each hidden layer, K = {k_1, ..., k_m}.
1: Train the first layer as an autoencoder by minimizing the objective function;
2: Train the second layer as an autoencoder by taking the output of the first layer as input. Iterate this step until the desired number of layers is reached;
3: Set l = 1;
4: For the l-th autoencoder, conduct the global sensitivity analysis of the first-order index of influence for every hidden neuron;
5: If l > 1, set all first-order indices of influence S_ij to zero, where j is the index of a deleted neuron in the (l − 1)-th hidden layer;
6: Iterate over the k_l neurons in the l-th hidden layer. If S_i^T < ζ_i^T and S_i^max < ζ_i^max, remove the i-th neuron together with the weights related to it. If l < m, set l = l + 1 and go to step 4; else, go to step 7;
7: Fine-tune the parameters of all layers with the BP method in a supervised way.
In Algorithms 1 and 2, the collections of influence thresholds ζ^T = {ζ_i^T} and ζ^max = {ζ_i^max} are set in percentage form relative to the total and maximum sensitivity values of all hidden neurons. This parameter-setting trick is easier to use than a fixed value: according to the results of the sensitivity analysis, the usefulness of a neuron is determined by ranking. In Sect. 5, we use the upper quartile, median, and lower quartile to obtain various networks and their performances. According to the different performances, the percentage parameter can be adjusted to obtain a satisfactory network.
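One way to read this trick, sketched below, is to turn the percentage of a pruning rule into concrete thresholds by taking the corresponding quantiles (e.g. the median for a 50% rule) of the ranked sensitivity values; the shared scalar thresholds here are an assumption of this sketch.

```python
import numpy as np

def thresholds_from_pruning_rule(S, a_fraction):
    """Convert a percentage pruning rule into thresholds by ranking.

    S          : (k, m) first-order index matrix of one hidden layer
    a_fraction : fraction of neurons to prune, e.g. 0.5 for a 50% rule
    """
    S_max = S.max(axis=1)
    S_T = S.sum(axis=1)
    zeta_max = np.quantile(S_max, a_fraction)   # e.g. median when a_fraction = 0.5
    zeta_T = np.quantile(S_T, a_fraction)
    return zeta_max, zeta_T
```

The returned values can be passed directly to the selection rule sketched in Sect. 4.1.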
Algorithm 2 Structure learning on CNNs.
Require: The collections of influence thresholds, ζ^T and ζ^max; the number of convolution and pooling layers in the CNN, m; the number of neurons in each hidden layer, K = {k_1, ..., k_m}.
1: Set l = 1;
2: Generate the l-th convolution layer by an autoencoder with a linear decoder for feature extraction;
3: For each hidden neuron of the autoencoder, perform global sensitivity analysis of the first-order index of influence;
4: Iterate over the k_l neurons in the l-th hidden layer. If S_i^T < ζ_i^T and S_i^max < ζ_i^max, remove the i-th neuron together with the weights related to it;
5: Set the hidden layer of the trained autoencoder as the convolution layer and add pooling layers. If l < m, set l = l + 1 and go to step 2; else, go to step 6;
6: Fine-tune the parameters of all layers with the BP method in a supervised way.
5 Experiments

In this paper, three types of experiments are performed to demonstrate the generality and effectiveness of the proposed approach: two image recognition cases and one traffic flow forecasting case, corresponding to pattern classification and data prediction problems, respectively. The image recognition cases use the public MNIST [3] and STL-10 [22] databases with SAEs and CNNs. The data for the traffic flow forecasting case come from Guangzhou, China, which is a real-world problem.

The evaluation criteria for both types of problems are briefly introduced here. The accuracy of classification problems is calculated as follows:

$$Accuracy = \frac{\sum_{i=1}^{n} flag_i}{n} \quad (18)$$

where $n$ is the size of the test database and $flag_i$ reflects the classification result: if an image is correctly classified, $flag_i = 1$; otherwise, $flag_i = 0$. The precision in prediction problems is measured by the root-mean-square error (RMSE) and the mean absolute error (MAE), which are defined as follows:

$$RMSE = \left(\frac{1}{n}\sum_{i=1}^{n} (y(i) - \hat{y}(i))^2\right)^{1/2} \quad (19)$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y(i) - \hat{y}(i)| \quad (20)$$

where $n$ is the number of test samples, and $y(i)$ and $\hat{y}(i)$ are the actual output and desired output of sample $i$, respectively. As the most commonly used metrics in time series forecasting, RMSE and MAE have different merits: RMSE is better at exposing large deviations, while MAE is easier to interpret. As a result, both are provided to comprehensively reflect the characteristics of the different algorithms.
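For reference, the three criteria of Eqs. 18–20 are straightforward to compute; a minimal sketch (Python/NumPy, names ours):

```python
import numpy as np

def accuracy(flags):
    """Eq. 18: fraction of correctly classified samples (flag_i is 1 or 0)."""
    return np.mean(flags)

def rmse(y_true, y_pred):
    """Eq. 19: root-mean-square error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Eq. 20: mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```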
5.1 SAEs for MNIST

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. All digit images are size-normalized and centred in a fixed 28 × 28 frame. In this paper, the SAE model is utilized for this recognition task. Raw images in the database are taken as the inputs of the SAE, and the ten category labels of the digits are set as the outputs.
Table 1 Performance comparison of SAEs and pruned SAEs for different scales in MNIST

Original scale(a) | Pruning rule | Final scale(a) | Accuracy (%) | Number of weights | Compression ratio
400–100   | –(b)    | 400–100  | 97.97 | 355,110   | –(c)
          | 25, 75% | 326–30   | 97.80 | 266,030   | 0.75
          | 50, 75% | 243–31   | 97.67 | 198,639   | 0.56
800–100   | –(b)    | 800–100  | 98.02 | 709,110   | –(c)
          | 50, 75% | 475–31   | 98.03 | 387,951   | 0.55
1200–100  | –(b)    | 1200–100 | 98.16 | 1,063,110 | –(c)
          | 50, 75% | 696–31   | 98.12 | 568,287   | 0.53
1400–100  | –(b)    | 1400–100 | 98.12 | 1,240,110 | –(c)
          | 50, 75% | 807–33   | 98.13 | 660,499   | 0.53

(a) The scales before and after pruning are the original scale and final scale, respectively. All scales are represented by the number of neurons in the hidden layers
(b) "–" indicates that no pruning rule is applied to the original network
(c) "–" indicates that the final scale equals the original one, i.e. no compression
The experimental results of our framework on SAEs with two hidden layers at four scales are listed in Table 1. The desired average activation of the hidden units is 0.1, the weight decay parameter is set to 0.003, and the weight of the sparsity penalty term is 3. In this paper, the structure of a neural network is represented as $k_I - (k_1 - k_2 - \cdots - k_l - \cdots - k_m) - k_O$, where $k_I$ and $k_O$ denote the number of neurons in the input layer and the output layer, respectively, $m$ is the number of hidden layers, and $k_l$ denotes the number of neurons in the $l$-th hidden layer. Sometimes $k_I$ and $k_O$ are omitted. Correspondingly, the pruning rule is defined in percentage form as $(a_1\%, a_2\%, \cdots, a_l\%, \cdots, a_m\%)$, where $a_l\%$ determines the proportion of neurons removed from the $l$-th hidden layer. It is assigned to the values of $\zeta_i^T$ and $\zeta_i^{max}$, which take the same value for simplicity in the implementation. For the networks with two hidden layers in this experiment, the pruning rule is written as $(a_1, a_2\%)$. Meanwhile, the compression ratio is defined in Eq. 21 to indicate the pruning scale in a more straightforward manner:

$$CR = \frac{\sum_{i=1}^{n-1} (m_i^P + 1)\, m_{i+1}^P}{\sum_{i=1}^{n-1} (m_i + 1)\, m_{i+1}} \quad (21)$$

where $n$ is the number of layers in the original neural network, including the input, hidden, and output layers, and $m_i$ and $m_i^P$ are the numbers of neurons in layer $i$ of the original neural network and its pruned counterpart, respectively.

The accuracy comparison between pruned networks and initial networks at different scales is shown in Fig. 5, where the pruning rule is fixed at (50, 75%) and the compression ratio is approximately 0.5. It is observed that in most cases the pruned network obtains slightly higher accuracy than the original one. However, if the original network is limited, i.e. has a relatively small number of neurons, the pruned network may perform slightly worse than the original one. For example, the initial network with a (400–100) scale performs better than the pruned one with a (243–31) scale in Fig. 5. This result may be due to the inevitable loss of some important features after structure simplification: the network is over-pruned and cannot recover the initial classification accuracy. This phenomenon is discussed further in Sect. 5.5. With the help of fine-tuning, the refined network typically has better generalization ability and obtains a slight improvement in accuracy.
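The compression ratio of Eq. 21 only needs the neuron counts per layer. The following sketch (the MNIST input and output sizes 784 and 10 are assumed for the example) reproduces the 355,110 and 198,639 weight counts and the 0.56 ratio of Table 1:

```python
def n_weights(layers):
    """Number of parameters of a fully connected network, counting a bias
    for every neuron in the receiving layer: sum of (m_i + 1) * m_{i+1}."""
    return sum((layers[i] + 1) * layers[i + 1] for i in range(len(layers) - 1))

def compression_ratio(original_layers, pruned_layers):
    """Eq. 21: parameter count of the pruned network over the original one."""
    return n_weights(pruned_layers) / n_weights(original_layers)

# Example from Table 1 (assuming 784 inputs and 10 outputs for MNIST):
# n_weights([784, 400, 100, 10]) -> 355110
# n_weights([784, 243, 31, 10])  -> 198639
# compression_ratio([784, 400, 100, 10], [784, 243, 31, 10]) -> ~0.56
```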
Fig. 5 Correlation between number of weights and accuracy under the (50, 75%) pruning rule
Fig. 6 The statistical analysis of accuracy for SAEs at multiple runs. All sparse networks are obtained through the (50, 75%) pruning rule
Meanwhile, we perform twenty independent runs for each scale of the original networks used in Table 1; the sparse networks all use the (50, 75%) pruning rule. The maximum, minimum, and mean values of each group are shown in Fig. 6. It is observed that the randomness of the parameters does not greatly influence the final performance.
5.2 CNNs for STL-10

This experiment takes its data from the STL-10 dataset [22], an image recognition dataset consisting of ten classes of images. All images are 96 × 96 colour images. We randomly select 2000 training images and 3200 test images from the dataset. Following the approach in [19], we construct a CNN with one convolutional layer and one pooling layer. In this experiment, we test three scales of the convolutional layer. As shown in Table 2, with the 50% pruning rule, the accuracy of the refined CNN is nearly equal to that of the original CNN. However, with the simplified network model, the CPU time for classification is clearly reduced.
Table 2 Performance comparison of the convolutional layer for different scales in STL-10

Original scale | Pruning rule | Final scale | Accuracy (%)
300 | –(a) | 300 | 80.188
    | 50%  | 237 | 80.188
400 | –(a) | 400 | 80.188
    | 25%  | 316 | 80.531
    | 50%  | 217 | 79.969
600 | –(a) | 600 | 80.656
    | 50%  | 323 | 80.344

The CNN has one convolutional layer and one pooling layer
(a) "–" indicates that no pruning rule is applied to the original network
Fig. 7 The features acquired from the autoencoder with a linear decoder. a 300 features for the original convolutional layer; b 237 features for the refined convolutional layer
Moreover, the larger the pruned scale is, the more prominent this reduction becomes. Besides, it can be observed that in the test with a network scale of 400, the accuracy under the 25% pruning rule is higher than that under the 50% pruning rule, and even surpasses that of the original network. This interesting result is discussed in Sect. 5.5.

Next, we give the parameter settings of the autoencoder with a linear decoder, which is used for feature extraction from images in this experiment. The desired average activation of the hidden units is set to 0.035. The weight decay parameter and the weight of the sparsity penalty are set to 0.003 and 5, respectively. After the 50% pruning rule is applied, the scale of the convolutional layer is reduced from 300 to 237, and the original 300 features are correspondingly reduced to 237 features, as shown in Fig. 7.
5.3 Short-Term Traffic Flow Forecasting

Because of the complexity of traffic systems, short-term traffic flow forecasting is a difficult problem and a hot topic in transportation research. To handle the tidal characteristics of traffic, a traffic flow forecasting module is indispensable in traffic management systems. In this paper, we use the field data from the PtMS of the Tianhe area in Guangzhou [23].
Fig. 8 The intersection model in a road network for traffic flow forecasting (links 1–3 are the upstream links and link 4 is the downstream link)
Table 3 Performance comparison of different algorithms in traffic flow prediction

Algorithm | Scale | RMSE | MAE
SAEs              | 50–50    | 87.6318 | 70.2279
                  | 25–27(a) | 80.7753 | 63.4404
Bayesian approach | –(b)     | 92.1224 | 79.0676

(a) Obtained by pruning the original network with the (50, 50%) pruning rule
(b) "–" indicates that the hidden neuron scale is not applicable
The size of the collected data is one week, and the recording interval of the vehicle flow rate is 5 minutes. The first six days are used as the training data, and the last day is used as the test data. The model of one intersection in the road network is shown in Fig. 8. Let G(∗) be a function that reflects the relationship of traffic flows between upstream and downstream links. The traffic flow on the downstream link at time t is predicted from the traffic flows on the upstream links at time (t − 1). The mathematical model is as follows:

$$tf_4(t) = G(tf_1(t-1),\, tf_2(t-1),\, tf_3(t-1)) \quad (22)$$
where $tf_i(t)$ is the vehicle flow rate on link $i$ at time $t$. Thus, the aim of traffic flow forecasting is to find an optimal approximation of the function G(∗). We first use SAEs to predict the traffic flow on the downstream link, as shown in Table 3. The SAE model is initialized with two hidden layers at a 50–50 scale, and the (50, 50%) pruning rule is used to refine the network. In Table 3, the RMSE and MAE values of the refined network are clearly lower than those of the original network, which means that the refined network performs better. The comparison method is the Bayesian approach [24], in which the number of Gaussian models in the mixture Gaussian model is optimized using X-means clustering [25].
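As an illustration of Eq. 22, the samples for learning G(∗) can be arranged as one-step-ahead input–output pairs; a minimal sketch (array names are ours, assuming aligned 5-minute flow series for the four links):

```python
import numpy as np

def build_forecasting_samples(tf1, tf2, tf3, tf4):
    """Arrange flow series into (input, target) pairs following Eq. 22:
    the downstream flow tf4(t) is predicted from the upstream flows at t - 1.
    tf1..tf4 : 1-D arrays of 5-minute vehicle flow rates on links 1-4.
    """
    X = np.column_stack([tf1[:-1], tf2[:-1], tf3[:-1]])  # upstream flows at t - 1
    y = np.asarray(tf4)[1:]                              # downstream flow at t
    return X, y
```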
5.4 Comparison Experiment

Since EFAST is the quickest existing global sensitivity-based method that achieves satisfactory performance in structural optimization, it is taken as the benchmark for comparison. We take the task of traffic flow forecasting as an example and implement RBD and EFAST on the same
Fig. 9 Indices of RBD and EFAST on the autoencoder 3-(50)-3 of SAE. Computational results of RBD and EFAST are represented by ◦ and , respectively
SAE. The structure of the SAE is 3-(50-50)-1, the same as the neural network in Sect. 5.3. Thus, two autoencoders are constructed in the layer-wise learning: the first has a scale of 3-(50)-3 and the second of 50-(50)-50. Figure 9 shows the result of the sensitivity analysis between the 50 hidden neurons and the 3 output neurons of the first autoencoder. The x-axis denotes the label of each hidden neuron, and the y-axis denotes the score obtained by the sensitivity analysis between hidden neurons and output neurons. It can be observed that the total sensitivity indices and the first-order sensitivity indices are equal; hence, the outcomes of sensitivity analysis using RBD and EFAST are the same, and there is no difference in accuracy between the two methods. However, compared to EFAST, the advantage of RBD is its low computational load. The computational times on the two autoencoders are given in Table 4. All implementations are performed on a PC with a 2.30 GHz Intel Core i5-6200U CPU and 8 GB memory. As Table 4 shows, RBD greatly reduces the computation time.
5.5 Discussion

In this section, we briefly discuss the balance between precision and compactness of the structure in our algorithm. The aim of our approach is to remove superfluous neurons from neural networks. This operation indeed makes the original network more compact. However, the threshold separating unimportant neurons from important ones is a subjective decision.
Table 4 The average CPU times for sensitivity analysis by RBD and EFAST(a) on SAEs

Method | Time on the first autoencoder (s)(b) | Time on the second autoencoder (s)(c)
RBD   | 0.509528 | 4.352818
EFAST | 8.614174 | 70.311076

(a) The EFAST source code is obtained from http://malthus.micro.med.umich.edu/lab/usadata/
(b) The first autoencoder takes the form 3-(50)-3
(c) The second autoencoder takes the form 50-(50)-50
Although ranking the hidden neurons and setting the parameters in percentage form, as in Sect. 4.2, make this decision more reasonable and manoeuvrable, when the refined network is too small some hidden neurons capturing feature details are deleted and the precision decreases slightly. This phenomenon occurs in the first two experiments. Hence, we still need to determine the pruning level through multiple trials.
6 Conclusion and Future Work

In this paper, the linear decoder is represented as an additive model. We show that global sensitivity analysis of the first-order effect can be used for hidden neuron selection and feature detection within layer-wise training. To support this result, RBD is implemented on two classic multilayer neural network models, SAEs and CNNs. To demonstrate the generality and effectiveness of the proposed algorithm in structural learning, experiments are performed on two benchmark problems in image recognition and on one real-world problem. The advantages of our framework can be summarized as follows:

1. Inheriting the advantages of loose coupling approaches, it does not interact with the training stage and therefore requires no modification of existing training algorithms.
2. Our framework saves more time than methods based on EFAST because of the good characteristics of RBD.
3. Our framework is suitable for all models whose training process can be replaced by layer-wise training.
4. It can be used to validate the rationality of network structures.
5. Our approach is a potential tool for various real-world problems, such as image recognition and traffic flow forecasting.

In the future, we will attempt to extend the structural adjustment technology in this paper to multilayer neural networks with self-organization. Meanwhile, more advanced sensitivity analysis technologies can be applied in our framework for further improvements.

Acknowledgements This work is supported by the National Natural Science Foundation of China under Grant 61503187 and the Fundamental Research Funds for the Central Universities KJQN201624.
References

1. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
2. Schölkopf B, Platt J, Hofmann T (2007) Greedy layer-wise training of deep networks. In: Proceedings of the international conference and workshop on neural information processing systems, pp 153–160
3. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
4. Sutskever I (2013) Training recurrent neural networks. Ph.D. dissertation, University of Toronto, Canada
5. Coates A, Huval B, Wang T, Wu D, Catanzaro B, Andrew N (2013) Deep learning with COTS HPC systems. In: Proceedings of the 30th international conference on machine learning, pp 1337–1345
6. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
7. Wan L, Zeiler M, Zhang S, Cun YL, Fergus R (2013) Regularization of neural networks using dropconnect. In: Proceedings of the 30th international conference on machine learning, pp 1058–1066
8. Han S, Pool J, Tran J, Dally WJ (2015) Learning both weights and connections for efficient neural networks. arXiv:1506.02626
9. Engelbrecht A (2001) A new pruning heuristic based on variance analysis of sensitivity information. IEEE Trans Neural Netw 12(6):1386–1399
10. Mozer MC, Smolensky P (1989) Skeletonization: a technique for trimming the fat from a network via relevance assessment. Adv Neural Inf Process Syst 1:107–115
11. Karnin ED (1990) A simple procedure for pruning backpropagation trained neural networks. IEEE Trans Neural Netw 1:239–242
12. Ruck DW, Rogers SK, Kabrisky M (1990) Feature selection using a multilayer perceptron. Neural Netw Comput 2(2):40–48
13. Tarr G (1991) Multilayered feedforward networks for image segmentation. Ph.D. dissertation, Air Force Institute of Technology
14. Lauret P, Fock E, Mara T (2006) A node pruning algorithm based on a Fourier amplitude sensitivity test method. IEEE Trans Neural Netw 17(2):273–293
15. Fock E (2014) Global sensitivity analysis approach for input selection and system identification purposes: a new framework for feedforward neural networks. IEEE Trans Neural Netw Learn Syst 25(8):1484–1495
16. Han H, Qiao J (2010) A self-organizing fuzzy neural network based on a growing-and-pruning algorithm. IEEE Trans Fuzzy Syst 18(6):1129–1143
17. Chen C, Wang F-Y (2013) A self-organizing neuro-fuzzy network based on first order effect sensitivity analysis. Neurocomputing 118:21–32
18. Hastie T, Tibshirani R (1990) Generalized additive models. Taylor and Francis, Abingdon
19. Ng A, Ngiam J, Foo CY, Mai Y, Suen C, Coates A, Maas A, Hannun A, Huval B, Wang T, Tandon S (2016) Unsupervised feature learning and deep learning tutorial. http://ufldl.stanford.edu/tutorial/
20. Saltelli A, Chan K, Scott E (2000) Sensitivity analysis. Wiley, New York
21. Saltelli A (2008) Global sensitivity analysis: the primer. Wiley, New York
22. Coates A, Lee H, Ng AY (2011) An analysis of single-layer networks in unsupervised feature learning. J Mach Learn Res 15:215–223
23. Wang F-Y (2010) Parallel control and management for intelligent transportation systems: concepts, architectures, and applications. IEEE Trans Intell Transp Syst 11(3):630–638
24. Sun S, Zhang C, Yu G (2006) A Bayesian network approach to traffic flow forecasting. IEEE Trans Intell Transp Syst 7(1):124–132
25. Pelleg D, Moore A (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 727–734