Algorithmica (1998) 22: 198–210
© 1998 Springer-Verlag New York Inc.
A Class of Asymptotically Stable Algorithms for Learning-Rate Adaptation¹

S. M. Rüger²

Abstract. A stability criterion for learning is given. In the case of learning-rate adaptation of backpropagation, a class of asymptotically stable algorithms is presented and studied, including a convergence proof. Simulations demonstrate relevance and limitations.

Key Words. Convex optimization, Neural learning, Learning-rate adaptation, Line search, Stability criterion.
1. Introduction. Many neural-learning algorithms aim at optimizing a certain function such as the training error, the generalization error, the likelihood of the data, the model quality, etc. In general, it is a nontrivial problem to find a weight vector w* ∈ R^n that minimizes a specific cost function E: R^n → R. There is a huge amount of mathematical literature on convex optimization that has already been transformed into well-known algorithms (Press et al., 1992). Nevertheless, one of the most popular neural-learning approaches to minimizing the function E is gradient descent. Given w^0 ∈ R^n and a fixed η > 0, the iteration rule is simply

(1)    w^{t+1} = w^t − η∇E(w^t).

The so-called learning rate or step size η parametrizes this method. If η is "sufficiently small", one hopes that t ↦ w^t converges to an optimal weight vector w*. Backpropagation is an algorithm that efficiently implements gradient descent in the case where the cost function E is mainly given by a neural network of simple functions. One reason for the popularity of backpropagation is the ease of its implementation. The following theorem, however, summarizes the major disadvantages of backpropagation that have been observed before (e.g., Amari, 1967; Amari, 1993; Hertz et al., 1991).

TRADE-OFF THEOREM FOR BACKPROPAGATION. Let E: R^n → R be the cost function of a neural network with a regular minimum at w* ∈ R^n, i.e., E is expansible into a Taylor series about w* with vanishing gradient ∇E(w*) and a positive definite Hessian matrix H(w*). Let λ denote the largest eigenvalue of H(w*). Then, in general, backpropagation with a fixed learning rate η > 2/λ cannot converge to w*.

PROOF. Let U be an orthogonal matrix that diagonalizes H(w*), i.e., D := U^T H(w*) U is diagonal. Using the coordinate transformation x = U^T(w − w*) and Taylor expansion,
¹ This work was partly supported by the Fujitsu European Centre for Information Technology.
² Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, England. [email protected], http://www-ala.doc.ic.ac.uk/~smr.
Received January 29, 1997; revised June 19, 1997. Communicated by P. Auer and W. Maass.
E(w) − E(w*) can be approximated by F(x) := x^T D x/2. Since gradient descent does not refer to the coordinate system, the asymptotic behaviour of backpropagation for E near w* is the same as for F near 0. In the latter case, backpropagation calculates the weight components x_i^t = x_i^0 (1 − D_ii η)^t at epoch t. The diagonal elements D_ii are the eigenvalues of H(w*); convergence of all the geometric sequences t ↦ x_i^t thus requires η < 2/λ.
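The geometric-sequence argument of the proof can be checked numerically. The following sketch (illustrative code, not from the paper; all names are made up) runs gradient descent (1) on a diagonal quadratic whose largest Hessian eigenvalue is λ = 10, once with η < 2/λ and once with η > 2/λ:

```python
# Numerical check of the trade-off theorem.  Gradient descent on the
# diagonal quadratic F(x) = sum_i D_i x_i^2 / 2 updates each component as
# x_i <- (1 - eta * D_i) * x_i, so it converges exactly when eta < 2/lambda,
# lambda being the largest eigenvalue D_i.

def gradient_descent(D, x0, eta, steps):
    """Run x <- x - eta * grad F(x) for the diagonal quadratic F."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - eta * d * xi for xi, d in zip(x, D)]
    return x

D = [1.0, 10.0]                    # Hessian eigenvalues; lambda = 10
x0 = [1.0, 1.0]

small = gradient_descent(D, x0, eta=0.15, steps=100)   # 0.15 < 2/10: stable
large = gradient_descent(D, x0, eta=0.25, steps=100)   # 0.25 > 2/10: diverges

print(max(abs(v) for v in small))  # tends to 0
print(max(abs(v) for v in large))  # blows up
```

The divergence is driven entirely by the component belonging to the largest eigenvalue, exactly as in the proof.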
The trade-off theorem states that, given η, a large class of minima cannot be found, namely those for which the largest eigenvalue of the corresponding Hessian matrix exceeds 2/η. Fewer minima might be overlooked by using a smaller η, but then the algorithm becomes intolerably slow.

Instability is another disadvantage of backpropagation. Transforming gradient descent (1) into a differential equation, one arrives at ∂w^t/∂t = −∇E(w^t). Gradient descent with constant step size η can then be viewed as Euler's method for solving this differential equation. One serious drawback of Euler's method is that it is unstable: each finite step leaves the trajectory of a solution without trying to get back to it. Virtually any other differential-equation solver surpasses Euler's method.

However, in the context of function minimization, this notion of stability ("do not drift away too far from a trajectory") would appear to be too strong. Indeed, differential-equation solvers put much effort into a good estimation of points that are as close as possible to the trajectory under consideration. What is really needed for minimization is asymptotic stability: ensuring that the cost function values decrease at the end of learning, i.e., at the end of the function minimization process. This weaker stability criterion allows for greedy steps in the initial phase of learning.

Stability is also of relevance for a technique called early stopping. Here, the cost function is the training error given by the performance of the network on a training set. In order to prevent overfitting, training is stopped when the test error (given by the performance of the network on an independently chosen test set) increases. The goal is not to minimize the training error perfectly, but to achieve good generalization. Unstable learning makes it difficult, if not impossible, to determine when to stop learning.
2. Stable Dynamic Learning-Rate Adaptation. The introduction suggests making the learning rate η a variable ηt that adapts itself to the surface of the cost function during learning. This is what dynamic learning-rate adaptation is all about. There are several successful examples of dynamic learning-rate adaptation for backpropagation: Newton and quasi-Newton methods (Becker and le Cun, 1989) as an adaptive η-tensor; individual learning rates for the weights (Jacobs, 1988; Silva and Almeida, 1990); conjugate gradient as a one-dimensional η-estimation (Kramer and Sangiovanni-Vincentelli, 1989); or straightforward η-adaptation (Battiti, 1989; Salomon and van Hemmen, 1996). A particularly interesting example of dynamic learning-rate adaptation was proposed by Salomon and van Hemmen (1996): let ζ > 1; at every step t of the backpropagation algorithm test two values for η, a somewhat smaller one, ηt /ζ , and a somewhat larger one, ηt ζ ; use as ηt+1 the value with the better performance, i.e., the smaller cost function
Fig. 1. Unstable learning-rate adaptation.
value:

(2)    η_{t+1} = η_t/ζ   if E(w^t − (η_t/ζ)∇E(w^t)) ≤ E(w^t − η_tζ ∇E(w^t)),
       η_{t+1} = η_tζ    otherwise.
The setting of the new parameter ζ proves to be uncritical (all values work, especially sensible ones being those between 1.2 and 2.1). This method outperforms many other gradient-based algorithms, but it is nonetheless unstable. The problem arises from a rapidly changing length and direction of the gradient, which can result in a huge leap away from a minimum, although the latter may have been almost reached. Figure 1(a) shows the contour lines of a simple quadratic cost function E: R² → R along with the weight vectors w^0, w^1, ... (bold dots) resulting from (1) and (2). This effect was probably the reason why Salomon suggested using a normalized gradient descent w^{t+1} = w^t − η_t∇E(w^t)/|∇E(w^t)| with dynamic learning-rate adaptation

(3)    η_{t+1} = η_t/ζ   if E(w^t − (η_t/ζ)∇E(w^t)/|∇E(w^t)|) ≤ E(w^t − η_tζ ∇E(w^t)/|∇E(w^t)|),
       η_{t+1} = η_tζ    otherwise,

thus eliminating the changes in the length of the gradient. Although this works much better, Figure 1(b) shows the instability of this algorithm due to the change in the gradient's direction. There is enough evidence that these algorithms eventually converge for a purely quadratic cost function (Salomon and van Hemmen, 1996). Why bother with stability, then? One would like to prove that an algorithm asymptotically finds the minimum, rather than occasionally leaping far away from it and thus leaving the region where the quadratic Hessian term of a globally nonquadratic cost function dominates.
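The competition in (3) between the two trial step sizes can be sketched as follows (illustrative code, not from the paper; a toy quadratic stands in for a network's cost function):

```python
# A sketch of the dynamic learning-rate adaptation (3): normalized gradient
# descent where, at every epoch, the step sizes eta*zeta and eta/zeta compete
# and the one yielding the smaller cost becomes the next learning rate.
import math

def E(w):                          # toy quadratic cost, eigenvalues 1 and 10
    return 0.5 * w[0] ** 2 + 5.0 * w[1] ** 2

def grad_E(w):
    return [w[0], 10.0 * w[1]]

def salomon_step(w, eta, zeta=1.8):
    g = grad_E(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:                            # gradient vanished: stop
        return w, eta
    u = [gi / norm for gi in g]                # normalized gradient direction
    trial = lambda s: E([wi - s * ui for wi, ui in zip(w, u)])
    eta = eta / zeta if trial(eta / zeta) <= trial(eta * zeta) else eta * zeta
    return [wi - eta * ui for wi, ui in zip(w, u)], eta

w, eta = [1.0, 1.0], 0.1
for _ in range(200):
    w, eta = salomon_step(w, eta)
print(E(w))                        # far below the initial cost of 5.5
```

On this purely quadratic cost the iteration settles down; the instability discussed above only becomes visible when the gradient's direction changes rapidly, as in Figure 1(b).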
3. A Class of Asymptotically Stable Algorithms. In this section a class of algorithms is defined by generalizing (2) and (3) and by adding stability. This class provides not only a proof of asymptotic convergence, but also a significant improvement in speed.

CLASS OF ASYMPTOTICALLY STABLE ALGORITHMS. Let E: R^n → R be a cost function of a neural network with random weight vector w^0 ∈ R^n. Let ζ > 1, η_0 > 0, 0 < c ≤ 1, and 0 < a ≤ 1 ≤ b. At step t of the algorithm, choose a vector g^t restricted only by the
conditions that g^t·∇E(w^t)/(|g^t||∇E(w^t)|) ≥ c and that either 1/|g^t| ∈ [a, b] holds for all t or |∇E(w^t)|/|g^t| ∈ [a, b] holds for all t; i.e., the vectors g^t have a minimal positive projection onto the gradient and either have a uniformly bounded length or are uniformly bounded by the length of the gradient. Note that this is always possible by choosing g^t as the gradient or the normalized gradient. Let e: η ↦ E(w^t − ηg^t) denote the one-dimensional cost function given by E, w^t, and g^t. Repeat (until the gradient vanishes or an upper limit of t or a lower limit E_min of E is reached) the iteration w^{t+1} = w^t − η_{t+1}g^t with

(4)    η_{t+1} = η* := (η_tζ/2) / (1 + (e(η_tζ) − e(0))/(η_tζ · g^t·∇E(w^t)))   if e(0) < e(η_tζ),
       η_{t+1} = η_t/ζ                                                          if e(η_t/ζ) ≤ e(η_tζ) ≤ e(0),
       η_{t+1} = η_tζ                                                           otherwise.

The first case for η_{t+1} is a stabilizing term η*, which definitely decreases the cost function value when the cost function is approximately quadratic, i.e., near a minimum. η* is put into effect when the cost function value e(η_tζ), which would occur in the next step if η_{t+1} = η_tζ were chosen, exceeds the cost function value e(0) produced by the present weight vector w^t. By construction, η* results in a value less than η_tζ/2 if e(η_tζ) > e(0); hence, given ζ < 2, the learning rate is decreased as expected, no matter what E looks like. Typically (if the values for ζ are not extremely high), the other two cases apply, where η_tζ and η_t/ζ compete for a lower cost function value.

Note that, instead of gradient descent, this class of algorithms proposes a "g^t descent", and the vectors g^t may differ from the gradient. A particular algorithm is given by a specification of how to choose g^t. For every algorithm asymptotic convergence is guaranteed by the following theorem.

THEOREM OF ASYMPTOTIC CONVERGENCE. Let E: w ↦ Σ_{i=1}^n λ_i w_i²/2 with λ_i > 0. For all ζ > 1, 0 < c ≤ 1, 0 < a ≤ 1 ≤ b, η_0 > 0, and w^0 ∈ R^n, every algorithm from the above class produces a sequence t ↦ w^t that converges to the minimum 0 of E with an at least exponential decay of t ↦ E(w^t).

PROOF. The statement follows if a constant q < 1 exists with E(w^{t+1}) ≤ q E(w^t) for all t. Then lim_{t→∞} w^t = 0, since w ↦ √E(w) is a norm on R^n. Fix w^t, η_t, and g^t according to the premise. Since E is a positive definite quadratic form, e: η ↦ E(w^t − ηg^t) is a one-dimensional quadratic function with a minimum at, say, η*. Note that e(0) = E(w^t) and e(η_{t+1}) = E(w^{t+1}). e is completely determined by e(0), e′(0) = −g^t·∇E(w^t), η_tζ, and e(η_tζ). Omitting the algebra, it follows that η* can be identified with the stabilizing term of (4). If e(η_tζ) > e(0), by (4) η_{t+1} will be set to η*; hence, w^{t+1} has the smallest possible cost function value e(η*) along the line given by g^t. Otherwise, the three values 0, η_t/ζ, and η_tζ cannot all have the same cost function value, as e is quadratic; e(η_tζ) or e(η_t/ζ) must be less than e(0), and the argument with the better performance is used as η_{t+1}. The sequence t ↦ E(w^t) is strictly decreasing; hence, a q ≤ 1 exists. The rest of the proof
Fig. 2. Steps in estimating a bound q for the improvement of E.
shows the existence of a q < 1; this part may be skipped by the less mathematically inclined reader. Assume there are two constants 0 < q_e, q_η < 1 with

(5)    η_{t+1}/η* ∈ [q_η, 2 − q_η],
(6)    e(η*) ≤ q_e e(0).

η_{t+1} must be in (0, 2η*), since t ↦ E(w^t) is strictly decreasing. Consider the case η_{t+1} ≥ η*: using first the convexity of e together with e(2η*) = e(0) (e is quadratic and symmetric about η*), then (5) and (6), one obtains

    e(η_{t+1}) = e( ((η_{t+1} − η*)/η*) · 2η* + (1 − (η_{t+1} − η*)/η*) · η* )
               ≤ ((η_{t+1} − η*)/η*) e(0) + (1 − (η_{t+1} − η*)/η*) e(η*)
               ≤ (1 − q_η) e(0) + q_η e(η*)
               ≤ (1 − q_η(1 − q_e)) e(0) =: q e(0).

Figure 2 shows how the estimations work. Similar arguments yield the same result E(w^{t+1}) ≤ q E(w^t) with q := 1 − q_η(1 − q_e) < 1 for the other case η_{t+1} ≤ η*. It remains to specify the bounds q_η and q_e.

q_e: Writing e as e(η) = E(w − ηg) = α(η − η*)² + e(η*) yields

(7)    η* = (Σ_i λ_i w_i g_i) / (Σ_i λ_i g_i²)    and    e(η*) = e(0) − (η*/2) Σ_i λ_i w_i g_i.

Hence,

    e(η*)/e(0) = 1 − (g·∇E(w))² / ((Σ_i λ_i g_i²)(Σ_i λ_i w_i²))
               ≤ 1 − c² (Σ_i g_i²)(Σ_i λ_i² w_i²) / ((Σ_i λ_i g_i²)(Σ_i λ_i w_i²))
               ≤ 1 − c² λ^</λ^>  =: q_e < 1

with λ^< := min_i{λ_i} and λ^> := max_i{λ_i}. The first inequality exploits the prerequisite that g·∇E(w)/(|g||∇E(w)|) ≥ c. The second inequality follows from the observation that the involved quotients are convex combinations of the λ_i and of the 1/λ_i, which in turn are minimized by λ^< and 1/λ^>, respectively.

q_η: The condition e(η/ζ) ≤ e(ηζ) is equivalent to (η/ζ − η*)² ≤ (ηζ − η*)², which in turn is equivalent to η ≥ 2η*ζ/(ζ² + 1). Hence, in the case of a quadratic cost function, the asymptotically stable learning-rate adaptation (4) may be rewritten as

(8)    η_{t+1} = η*      if η_t > 2η*/ζ,
       η_{t+1} = η_t/ζ   if 2η*ζ/(ζ² + 1) ≤ η_t ≤ 2η*/ζ,
       η_{t+1} = η_tζ    else.

The stabilizing step size η* depends not only on w^t but also on the chosen direction g^t; to emphasize this, the notation η*(w^t, g^t) is used from now on. Formula (8) allows us to formulate the inductive membership

    η_{t+1}/η*(w^t, g^t) ∈ [ min{ 2/(ζ² + 1), (η_t/η*(w^{t−1}, g^{t−1})) ζ } · η*(w^{t−1}, g^{t−1})/η*(w^t, g^t),
                             max{ 2/ζ², 2ζ²/(ζ² + 1) } ]                                     if t > 0,
    η_1/η*(w^0, g^0)     ∈ [ min{ 2/(ζ² + 1), (η_0/η*(w^0, g^0)) ζ }, max{ 2/ζ², 2ζ²/(ζ² + 1) } ]   if t = 0.

Carefully tracing this formula back to the origin of the induction yields, for all t ≥ 0,

    η_{t+1}/η*(w^t, g^t) ∈ [ min_{0≤s≤t} { (2/(ζ² + 1)) ζ^s η*(w^{t−s}, g^{t−s})/η*(w^t, g^t), ζ^{t+1} η_0/η*(w^t, g^t) },
                             max{ 2/ζ², 2ζ²/(ζ² + 1) } ].

In order to find uniform bounds for η*, (7) is transformed into

    η*(w, g) = [g·∇E(w)/(|g||∇E(w)|)] · [1/Σ_i λ_i (g_i/|g|)²] · [|∇E(w)|/|g|],

where the first factor lies in [c, 1] and the second in [1/λ^>, 1/λ^<].
The underlying class of algorithms has two alternative restrictions for the length of the direction g: if g is always chosen such that |∇E(w)|/|g| ∈ [a, b], then η* is uniformly bounded, namely η*(w, g) ∈ [ca/λ^>, b/λ^<]. In this case, q_η may be set to

    q_η := min{ (2ζ/(ζ² + 1)) · caλ^</(bλ^>), (ζ²/(ζ² + 1)) · η_0λ^</b } > 0.

If g is always chosen such that 1/|g| ∈ [a, b], then the relation

    |∇E(w)|² = Σ_i λ_i² w_i² = ( Σ_i λ_i · λ_i w_i²/(Σ_j λ_j w_j²) ) · 2E(w) ∈ [2λ^< E(w), 2λ^> E(w)]

can be exploited. It follows that η*(w, g) ∈ [ca √(2λ^< E(w))/λ^>, b √(2λ^> E(w))/λ^<]. Since t ↦ E(w^t) is monotonically decreasing, q_η may be set to

    q_η := min{ (2ζ/(ζ² + 1)) · (ca/b)(λ^</λ^>)^{3/2}, (ζ²/(ζ² + 1)) · η_0λ^< / (b √(2λ^> E(w^0))) } > 0.

REMARK. Many other choices of η* in (4) are possible, e.g., η* := η_t/ζ^{k+1} with the smallest k ∈ {0, 1, 2, ...} such that e(η_t/ζ^k) ≤ e(0). The resulting algorithms are stable (not only asymptotically stable), and a proof of asymptotic convergence can still be given. However, additional cost function evaluations would be necessary for the calculation of η*.
4. Simulations and Comparisons. In this section we present simulations of typical neural-network problems in order to assess the relevance of the algorithms given in the previous section. As stated before, a particular algorithm is given by a specification of how to choose the vector g^t. Several choices for the direction g^t come immediately to mind, e.g., g^t = ∇E(w^t) or

(9)    g^t = ∇E(w^t)/|∇E(w^t)|.

However, it is well known that the sign-changed gradient of a function is not necessarily the best direction in which to look for a minimum. The momentum term of a modified backpropagation version uses old gradient directions; Newton or quasi-Newton methods explicitly or implicitly exploit second-order derivatives for a change of direction; another choice of direction is given by conjugate gradient methods (Press et al., 1992). The algorithms from Section 3 allow almost any direction, as long as it is not nearly perpendicular to the gradient. Since they estimate a good step size, these algorithms can be regarded as a sort of "trial-and-error" line search that does not bother to find an exact minimum in the given direction, but utilizes any progress made so far. One could exploit the Polak–Ribière rule for conjugate vectors,

    d^{t+1} = ∇E(w^{t+1}) + [ (∇E(w^{t+1}) − ∇E(w^t))·∇E(w^{t+1}) / (∇E(w^t))² ] d^t

with d^0 = ∇E(w^0), to propose

(10)    g^t := d^t/|d^t|
for an explicit algorithm from Section 3. As in the conjugate gradient method, the direction d^t should be reset to the gradient direction after every n updates (n being the number of weights). Another reason for resetting the direction arises when g^t does not have the minimal positive projection c, say 0.01, onto the normalized gradient.

Comparisons of Salomon's algorithm with many other methods have already been published (Moreira and Fiesler, 1995; Salomon and van Hemmen, 1996; Pfister and Rojas, 1996). This paper not only demonstrates that significant improvements are brought about by Polak–Ribière directions (10), but also gives comparisons to pure backpropagation and conjugate gradient with the Polak–Ribière rule. The following problems were studied:

(a) Regression: a 3-2-4 network (three input nodes, two hidden nodes, and four output nodes) with a particular weight vector generates pattern-target samples (p^i, t^i), where a small amount of noise was added to the outputs. A network of the same structure with random initial weights shall approximate the data by minimizing

    w ↦ E(w) = (1/2) Σ_i (t^i − out_w(p^i))²,
where out_w is the network function.
(b) Approximation: as in (a), but without adding noise.
(c), (d), (e): These problems are given by the artificial cost function

(11)    w ↦ E(w) = Σ_{i=1}^{n=76} i |w_i|^p
with different powers p. The value p = 2, as in (c), gives a purely quadratic cost function that is interesting for the asymptotic behaviour. Other powers, as in problems (d) and (e), should pose no problem as long as p exceeds, say, 1.5: all algorithms discussed here exploit the fact that the gradient vanishes at the minimum, which in turn is numerically questionable for powers like 1.1. Powers smaller than 1 produce a nonconvex cost function; they are commented on later. A typical minimum, however, is approximately given by powers like 2, 4, ...; hence, problem (e) was chosen to be the case p = 4.
(f) Encoder: an 8-3-8 network is supposed to learn the identity restricted to the unit vectors of R^8; the hidden layer with three nodes typically evolves a binary coding of the eight possible input vectors. Since the activation function of the output nodes is usually given by the Fermi function x ↦ (1 + exp(−x))^{−1}, the desired output values 0 and 1 can only be realized asymptotically, i.e., if at least one absolute value of the weight vector components is infinitely large. Hence, the minima of the cost function lie at the boundary of its domain R^n. These minima, though not covered by the proof of Section 3, are quickly found (see Table 1). Indeed, the ability to increase the learning rate geometrically helps these algorithms approach the boundary in a few steps.

Table 1 shows the average number of epochs of Salomon's algorithm (3), the corresponding stabilized version (4) + (9), and another one, (4) + (10), of the class in Section 3 that uses Polak–Ribière directions. The average was taken over 2700 runs with nine different values of ζ ∈ [1.7, 2.1] and random initial weight vectors, where each component was independently and identically uniformly distributed in the interval [−√3, √3] (standard deviation 1). The root mean square error of the averaging process is shown as a percentage.
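The Polak–Ribière direction choice (10), together with the two reset rules described above, can be sketched as follows (illustrative code, not from the paper; the helper name pr_direction is made up):

```python
# A sketch of the Polak-Ribiere direction (10): reset to the gradient after
# every n updates, and whenever the projection onto the normalized gradient
# drops below the minimal positive projection c (here 0.01).
import math

def pr_direction(grad_new, grad_old, d_old, t, n, c=0.01):
    """Return (g, d): the normalized search direction and the raw PR vector."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    if d_old is None or t % n == 0:
        d = list(grad_new)                          # periodic reset
    else:
        beta = (sum((a - b) * a for a, b in zip(grad_new, grad_old))
                / sum(b * b for b in grad_old))     # Polak-Ribiere coefficient
        d = [a + beta * di for a, di in zip(grad_new, d_old)]
    g = [di / norm(d) for di in d]
    # reset when the minimal positive projection onto the gradient is lost
    if sum(gi * ai for gi, ai in zip(g, grad_new)) / norm(grad_new) < c:
        d = list(grad_new)
        g = [di / norm(d) for di in d]
    return g, d
```

Plugging this g^t into the iteration of Section 3 gives the algorithm labelled (4) + (10) in the tables below.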
Table 1. Average number of epochs for some problems.

Problem                      E_min      (3)           (4) + (9)     (4) + (10)
(a) 3-2-4 regression         10^0       168 ± 100%    164 ± 95%     49 ± 60%
(b) 3-2-4 approximation      10^-4      970 ± 125%    940 ± 120%    138 ± 70%
(c) Pure square (n = 76)     10^-16     454 ± 14%     473 ± 19%     117 ± 10%
(d) Power 1.8 (n = 76)       10^-4      411 ± 27%     494 ± 27%     84 ± 21%
(e) Power 4 (n = 76)         10^-16     26 ± 11%      29 ± 10%      34 ± 12%
(f) 8-3-8 encoder            10^-3      830 ± 55%     815 ± 55%     198 ± 95%
Load per epoch                          1.5           1.5           1.5
The stabilization (4) + (9) of Salomon's algorithm (3) does not seem to improve the speed of convergence. Problems (c)–(e), which have an artificial cost function, might even indicate that the stabilized version is slower: this is explained by the fact that stabilizing decreases the step size in a curve of the solution's trajectory. Although (3) is thrown off the track, as demonstrated by Figure 1, the regularity of the cost function brings the weight vectors back to the minimum. More realistic optimization problems like (a), (b), or (f) do not show this structure; hence, the stabilization (4) + (9) helps slightly. Significant improvements are brought about by the algorithm using Polak–Ribière directions, (4) + (10). This is to be expected, since second-order information is incorporated. Only in the case of the cost function (e), where the Hessian matrix at the minimum vanishes, do Polak–Ribière directions not seem to be useful: all learning-rate adaptation methods find the minimum very quickly in this case.

Table 2 shows the average number of epochs of pure backpropagation (1) with a fixed learning rate. As indicated by the trade-off theorem in Section 1, the convergence of backpropagation depends on a proper choice of the learning rate. An optimal learning rate η_opt for each problem was determined in several separate runs and used for the experiments. Problem (e) has a minimum with vanishing Hessian matrix; it cannot be solved efficiently with a fixed learning rate.

Table 2. Average number of epochs of two traditional methods.

Problem                      E_min      η_opt     (1) with η_opt    PR
(a) 3-2-4 regression         10^0       0.2       590 ± 75%         29 ± 55%
(b) 3-2-4 approximation      10^-4      1.0       1,970 ± 105%      60 ± 95%
(c) Pure square (n = 76)     10^-16     2/77      755 ± 4%          64 ± 3%
(d) Power 1.8 (n = 76)       10^-4      0.005     635 ± 24%         45 ± 9%
(e) Power 4 (n = 76)         10^-16     0         ∞                 18 ± 5%
(f) 8-3-8 encoder            10^-3      2.0       21,350 ± 10%      140 ± 95%*
Load per epoch                                    1                 4–8

* 11% of the experiments did not converge.

All learning-rate adaptation algorithms from Table 1 need fewer epochs than the corresponding arduously tuned optimal backpropagation. This is especially obvious where the weight vectors have to go a long way, as is the case for problem (f): the encoder
features minima at the boundary of R^n, and during the optimization different learning rates would be appropriate. Many similar problems are studied in Salomon and van Hemmen (1996).

Table 2 also shows the results for conjugate gradient with Polak–Ribière directions (PR). The number of epochs for the pure square problem (c) agrees nicely with the theoretical result that conjugate gradient methods need at most n epochs to reach the minimum of an n-dimensional quadratic form exactly. Based on the number of epochs, conjugate gradient seems to outperform all learning-rate adaptations studied so far. One epoch of conjugate gradient, however, involves a minimization procedure along a ray (line search). Quite a few cost function evaluations are involved here, whereas backpropagation uses only one forward pass (function evaluation) and an equally expensive backward pass (gradient computation) per epoch. Hence, a "load per epoch" factor of around 4–8 relative to backpropagation must be taken into consideration. Note that, owing to the two test steps, each epoch of a learning-rate adaptation has an overhead of around 50% compared with a corresponding epoch of backpropagation. The effective computing time is thus still much less for a learning-rate adaptation with Polak–Ribière directions, (4) + (10). What is more, the nonconvergence of conjugate gradient in a significant number of cases (see Table 2) is not observed with (4) + (10), presumably owing to its intrinsic asymptotic stability.
5. Relevance for Neural Learning. The proposed algorithms of Section 3 fall into the class of descent algorithms that use a one-dimensional line search as an important element. Indeed, (2), (3), and (4) can be regarded as a primitive form of line search. The major task of a line search in connection with descent algorithms lies not so much in finding a minimum along a ray, but rather in achieving a "sufficient decrease" in the cost function. This has also been one of the main issues of the convergence proof in Section 3.

Usually, the computation of a real function value E(w) is much less expensive than the computation of a gradient vector (∂E(w)/∂w_1, ∂E(w)/∂w_2, ..., ∂E(w)/∂w_n)^T. In the context of neural learning, however, the computation of both E(w) and ∇E(w) costs just twice as much time as a single computation of E(w). Thus, the efficiency of a neural-learning descent algorithm depends mainly on the number of function evaluations in the line search. Quite a few line search procedures have been proposed (e.g., Fletcher, 1980; Gill et al., 1981; Press et al., 1992), but none of them combines the virtues of asymptotic convergence and stability with a competition between two wild guesses like η_tζ and η_t/ζ in (4). It is hardly imaginable that a line search could employ fewer cost function evaluations than (4).

Table 3 demonstrates the overall efficiency of the "trial-and-error" line search (4) by listing the average number of cost function evaluations (including gradient computations) needed to solve typical problems; this number is directly related to computing time. Two line search algorithms (linmin and dlinmin; Press et al., 1992) that made their way into modern software libraries were chosen for comparison. Yet another routine (lnsrch; Press et al., 1992) was tested, but seemed to fail completely for problems (e) and (f).
Table 3. Number of cost function evaluations (including gradient computations).

Problem                      E_min      linmin + (10)    dlinmin + (10)    (4) + (10)
(a) 3-2-4 regression         10^0       366 ± 65%        448 ± 70%         147 ± 60%
(b) 3-2-4 approximation      10^-4      760 ± 100%       840 ± 85%         414 ± 70%
(c) Pure square (n = 76)     10^-16     510 ± 1%         560 ± 1%          351 ± 10%
(d) Power 1.8 (n = 76)       10^-4      640 ± 9%         840 ± 9%          252 ± 21%
(e) Power 4 (n = 76)         10^-16     381 ± 7%         491 ± 6%          102 ± 12%
(f) 8-3-8 encoder            10^-3      1870 ± 95%*      2220 ± 90%†       594 ± 95%

* 12% of the experiments did not converge.
† 13% of the experiments did not converge.
6. Limitations. All algorithms studied within the scope of this article are made for convex optimization. This also applies to the learning-rate adaptation algorithms. A critical example would be the cost function (11) with p = 1/2 and n = 2:

(12)    w ↦ E(w) = √|w_1| + 2√|w_2|.

The gradient of this nonconvex function is singular if one component of the argument vanishes (and especially at the minimum 0; for an illustration see Figure 3). Two problems arise here: during learning, the length of the gradient increases without bound, and the angle between the gradient and the desired direction to the minimum becomes as large as π/2. The first problem can easily be compensated for by learning-rate adaptation. Figure 4 demonstrates this with the one-dimensional cost function w ↦ √|w|. The second problem seems to be fatal: all gradient-based methods hope that the direction of the gradient is at least a helpful hint for the direction of a minimum. Learning-rate adaptation in the square root singularity (12) typically traps a weight vector near a coordinate axis. Here, the gradient is almost perpendicular to the axis (i.e., the direction
Fig. 3. The square root singularity (12).
Fig. 4. Learning-rate adaptation (4) + (9) for w ↦ √|w| with η_0 = 0.1 and ζ = 1.3.
to the minimum 0), and the length of the gradient increases without bound as the weight vector comes nearer to the coordinate axis. Figure 5 shows the situation. An asymptotically stable algorithm of Section 3 cannot escape this situation either, since the descent directions g^t must not be perpendicular to the gradient.

The algorithms in this article are made for batch learning, i.e., the minimization of a specific cost function, as opposed to on-line learning (Heskes and Kappen, 1993; Amari, 1993; Barkai et al., 1995), where successive steps aim at minimizing different cost functions (each given by a single pattern). Although the methods proposed in Section 3 cannot be expected to always converge in the case of on-line learning, experiments have shown that they work quite well (especially if an intermediate technique is used, where a few samples with a certain batch size contribute to each cost function).
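Returning to the square root singularity (12), its two numerical problems can be illustrated directly (a sketch, not from the paper):

```python
# Numeric illustration of the square root singularity (12): near the
# coordinate axis w2 = 0 the gradient length blows up, and the gradient
# direction becomes almost perpendicular to the direction to the minimum 0.
import math

def grad_E(w):
    """Gradient of E(w) = sqrt|w1| + 2 sqrt|w2| (undefined on the axes)."""
    return [math.copysign(0.5 / math.sqrt(abs(w[0])), w[0]),
            math.copysign(1.0 / math.sqrt(abs(w[1])), w[1])]

for w2 in (1e-2, 1e-4, 1e-6):
    g = grad_E([1.0, w2])
    length = math.sqrt(g[0] ** 2 + g[1] ** 2)
    # cosine of the angle between the gradient and the direction away from 0
    cos = (g[0] * 1.0 + g[1] * w2) / (length * math.sqrt(1.0 + w2 ** 2))
    print(length, cos)
```

As w2 shrinks, the gradient length grows like 1/√w2 while the cosine tends to 0, i.e., the angle to the direction of the minimum approaches π/2, just as described above.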
Fig. 5. Learning-rate adaptation (3) in the square root singularity (12).

7. Conclusions. Two major aspects are relevant for learning rules in neural networks: stability and speed. The proposed class of asymptotically stable learning-rate adaptation algorithms was developed with respect to these two criteria. Algorithms from this class have several advantages: there is a proof of (at least) first-order convergence; they use only simple-to-calculate terms like the gradient of the cost function or the cost function itself; there is no need to adjust the learning rate; and second-order information can be incorporated (as has successfully been demonstrated with Polak–Ribière directions). Moreover, the idea of dynamic learning-rate self-adaptation could be transferred to other parameters such as the momentum parameter. The property of asymptotic stability seems to be a balanced compromise between a strong demand for stability, which would cost much computing time without visible gain, and potential instability, which could destroy convergence.
References

Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers EC-16(3), 299–307.
Amari, S. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing 5(4–5), 185–196.
Barkai, N., H. S. Seung, and H. Sompolinsky (1995). Local and global convergence of on-line learning. Physical Review Letters 75, 1415–1418.
Battiti, R. (1989). Accelerated backpropagation learning: two optimization methods. Complex Systems 3, 331–342.
Becker, S., and Y. le Cun (1989). Improving the convergence of back-propagation learning with second order methods. In D. Touretzky, G. Hinton, and T. Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37. Morgan Kaufmann, San Mateo, CA.
Fletcher, R. (1980). Practical Methods of Optimization, Volume 1. Wiley, Chichester.
Gill, P. E., W. Murray, and M. H. Wright (1981). Practical Optimization. Academic Press, London.
Hertz, J., A. Krogh, and R. G. Palmer (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Heskes, T. M., and H. J. Kappen (1993). On-line learning processes in artificial neural networks. In J. Taylor (ed.), Mathematical Foundations of Neural Networks, pp. 199–233. Elsevier, Amsterdam.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295–307.
Kramer, A. H., and A. Sangiovanni-Vincentelli (1989). Efficient parallel learning algorithms for neural networks. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, pp. 40–48. Morgan Kaufmann, San Mateo, CA.
Moreira, M., and E. Fiesler (1995). Neural networks with adaptive learning rate and momentum terms. Technical Report 95-04, Institut Dalle Molle d'Intelligence Artificielle Perceptive, Martigny.
Pfister, M., and R. Rojas (1996). Hybrid learning algorithms for neural networks—the adaptive inclusion of second order information. Zeitschrift für Angewandte Mathematik und Mechanik 76(1), 215–218.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C (2nd edn.). Cambridge University Press, Cambridge.
Salomon, R., and J. L. van Hemmen (1996). Accelerating backpropagation through dynamic self-adaptation. Neural Networks 9(4), 589–601.
Silva, F. M., and L. B. Almeida (1990). Speeding up backpropagation. In Proceedings of NSMS—International Symposium on Neural Networks for Sensory and Motor Systems. Elsevier, Amsterdam.