Optimization and Engineering, 2, 75–129, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
A Primer on Differentiation

MARK S. GOCKENBACH, Department of Mathematical Sciences, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931-1295, USA

Received February 4, 2000; Revised April 4, 2001
Abstract. The central idea of differential calculus is that the derivative of a function defines the best local linear approximation to the function near a given point. This basic idea, together with some representation theorems from linear algebra, unifies the various derivatives—gradients, Jacobians, Hessians, and so forth—encountered in engineering and optimization. The basic differentiation rules presented in calculus classes, notably the product and chain rules, allow the computation of the gradients and Hessians needed by optimization algorithms, even when the underlying operators are quite complex. Examples include the solution operators of time-dependent and steady-state partial differential equations. Alternatives to the hand-coding of derivatives are finite differences and automatic differentiation, both of which save programming time at the possible cost of run-time efficiency.

Keywords: differentiation, solution operators, finite differences, automatic differentiation
1. Introduction
Throughout their study of calculus, students are introduced to derivatives of various types. These include:

• The (ordinary) derivative f′(x) of a real-valued function f of a single variable. The number f′(x0) is the slope of the line tangent to the graph of y = f(x) at x = x0. It is also interpreted as the instantaneous rate of change of y = f(x) at x = x0.

• The partial derivatives

  ∂g/∂x1 (x1, x2, ..., xn), ∂g/∂x2 (x1, x2, ..., xn), ..., ∂g/∂xn (x1, x2, ..., xn)

of a real-valued function of several variables. These numbers are interpreted as the instantaneous rates of change of y = g(x1, x2, ..., xn) as one variable is changed and the others held fixed.

• The gradient vector

  ∇g(x1, x2, ..., xn) = [∂g/∂x1 (x1, ..., xn), ∂g/∂x2 (x1, ..., xn), ..., ∂g/∂xn (x1, ..., xn)]^T.
This vector has various geometric interpretations; for example, if g(x1, x2) is a function of two variables, then ∇g(x1, x2) is orthogonal to the contour line of g through (x1, x2).

• The directional derivative Du g(x1, x2, ..., xn). This number is the instantaneous rate of change of y = g(x1, x2, ..., xn) at (x1, x2, ..., xn) in the direction of the unit vector u, and can be computed using the gradient:

  Du g(x1, x2, ..., xn) = ∇g(x1, x2, ..., xn) · u.

• The partial derivatives of a vector-valued function of several variables. If

  F(x1, x2, ..., xn) = [F1(x1, ..., xn), F2(x1, ..., xn), ..., Fm(x1, ..., xn)]^T,

then the partial derivatives form the entries of the Jacobian matrix

  J(x1, x2, ..., xn) = [∂Fi/∂xj (x1, x2, ..., xn)], i = 1, ..., m, j = 1, ..., n,
which is an m × n matrix.

• The tangent vector ṙ(t) to a curve r = r(t). The vector-valued function

  r(t) = [r1(t), r2(t), ..., rn(t)]^T

traces out a curve in n-space, and at the point r(t) on the curve, ṙ(t) is the tangent vector.

Other interpretations are attached to the above quantities, and students usually learn to compute these derivatives for functions defined as combinations of polynomials, trigonometric functions, and other elementary functions. However, in my experience, students often miss the key underlying principle—that of a local linear approximation—that unifies these various kinds of derivatives. It is possible to get an undergraduate mathematics degree (at
least in the United States) without encountering a course that makes this principle explicit.1 Moreover, the elementary rules of differentiation as learned in calculus courses—the product rule, chain rule, and so forth—can leave a student ill-prepared to compute derivatives of the complicated functions and operators that arise in advanced engineering and applied mathematics research.

The purpose of this paper is to explain the concept of derivative from the point of view of local linear approximation, to show how the various types of derivatives mentioned above fit into the concept, and to work through several important and nontrivial examples. In the following section, I discuss the basic definitions and notation needed. The setting for these definitions is a normed vector space—a vector space with a norm. For this reason, linear algebra is important. In Section 3, I present the elementary representation theorems of linear algebra, and show how they lead to the various scalars, vectors, and matrices that arise in calculus courses in the context of differentiation. This is followed by a brief discussion of the rules of differentiation (Section 4), simple representations for operators on infinite-dimensional spaces (Section 5), and second derivatives (Section 6). In addition to several examples included in the sections described above, I discuss two more involved examples: the adjoint state method for handling finite difference solution operators (Section 7), and a direct computation of the derivative (and its adjoint) of a finite element solution operator (Section 8). Finally, in Section 9, I discuss two alternatives to programming derivatives by hand: finite differences and automatic differentiation. Throughout this paper, the emphasis is on the structure of maps and their derivatives, not on the analytic details. Therefore, most technical proofs are omitted.

2. Definitions and notation
2.1. Normed vector spaces; inner products
The various derivatives described in the introduction can all be discussed in the context of a function (operator, map) f mapping one Euclidean space into another. I will write Rn for Euclidean n-space, and denote a vector x ∈ Rn as x = (x1, x2, ..., xn), regarded as a column vector. Note that R1 is (isomorphic to) R, the set of real numbers. The following examples were discussed in the introduction:

• f : R → R, a real-valued function of a single variable;
• f : Rn → R, a real-valued function of several variables;
• f : Rn → Rm, a vector-valued function of several variables;
• f : R → Rn, a vector-valued function of a single variable.
From now on I adopt vector notation and write, for example, g(x) instead of g(x1, x2, ..., xn). Also, I distinguish vectors from scalars only by context. Now, Euclidean n-space is equipped with an inner product, namely, the dot product:

  (x, y) = x · y = x1 y1 + x2 y2 + · · · + xn yn, x, y ∈ Rn.
A more general setting is an inner product space (which need not be Euclidean or finite-dimensional). An inner product space is just a vector space V with an inner product (·, ·)V, which is a mapping from V × V into R satisfying the following properties:

• (αu + βv, w)V = α(u, w)V + β(v, w)V for all u, v, w ∈ V, α, β ∈ R;
• (u, v)V = (v, u)V for all u, v ∈ V;
• (v, v)V ≥ 0 for all v ∈ V, and (v, v)V = 0 if and only if v = 0.

An inner product on V induces a norm ‖·‖V on V:

  ‖v‖V = √((v, v)V) for all v ∈ V.
It is sometimes necessary to work with norms that are not defined by inner products. A general norm ‖·‖U on a vector space U is a mapping from U into R satisfying

• ‖u‖U ≥ 0 for all u ∈ U, and ‖u‖U = 0 if and only if u = 0;
• ‖αu‖U = |α| ‖u‖U for all u ∈ U, α ∈ R;
• ‖u + v‖U ≤ ‖u‖U + ‖v‖U for all u, v ∈ U (the triangle inequality).

It can be shown that if (·, ·)V is an inner product on a vector space V, then ‖v‖V = √((v, v)V) defines a norm on V.

The reason I discuss vector spaces more general than Euclidean space is that many practical problems cannot be described using finite-dimensional spaces. For example, suppose Ω is an open subset of Rn and f : Ω → Rn. Then under appropriate conditions on f, for any closed and bounded subset W of Ω, there exists α > 0 such that, for each x0 ∈ W, the Initial Value Problem (IVP)

  ẋ = f(x), x(0) = x0   (1)

has a unique solution x : [−α, α] → Rn. Thus the IVP (1) defines an operator S : W → (C[−α, α])n, where (C[−α, α])n is the space of all continuous functions u : [−α, α] → Rn. This space is infinite-dimensional and therefore cannot be identified with any Euclidean space. I pursue this example in Section 5.5, where I compute the derivative of S.
2.2. Definition of the derivative
Now suppose X and Y are normed linear spaces, suppose U is an open subset of X, and assume that f : U → Y. As I explained in Section 2.1, the types of functions encountered in calculus all fit under this description, as do many other important examples. First recall the following definition.

Definition 2.1. Suppose X and Y are vector spaces, and L : X → Y. The operator L is linear if

  L(x + z) = Lx + Lz for all x, z ∈ X and L(αx) = αLx for all x ∈ X, α ∈ R

(or, more concisely, L(αx + βz) = αLx + βLz for all x, z ∈ X and α, β ∈ R).

Next is the fundamental definition in this paper.

Definition 2.2. Let x ∈ U. Suppose there is a continuous linear operator L : X → Y such that

  lim_{δx→0} ‖f(x + δx) − f(x) − Lδx‖_Y / ‖δx‖_X = 0.

Then f is said to be differentiable at x, and L is called the derivative of f at x, denoted L = Df(x).

According to this definition, if f is differentiable at x, then Df(x) defines a linear approximation to f near x; indeed, if E(x, δx) = f(x + δx) − f(x) − Df(x)δx, then

  f(x + δx) = f(x) + Df(x)δx + E(x, δx)

and

  ‖E(x, δx)‖_Y / ‖δx‖_X → 0 as δx → 0.

This last condition is abbreviated by

  E(x, δx) = o(‖δx‖_X) as δx → 0
(read “E(x, δx) is little-oh of ‖δx‖_X”), which indicates that the error E(x, δx) is small compared to ‖δx‖_X when ‖δx‖_X is small.

It is easy to show that, if f is differentiable at x, then no other linear map K : X → Y defines a better local linear approximation to f near x; that is, if K ≠ Df(x), then the error in the approximation f(x + δx) ≈ f(x) + Kδx is larger than the error in the approximation f(x + δx) ≈ f(x) + Df(x)δx, in the sense that

  lim_{δx→0} ‖f(x + δx) − f(x) − Kδx‖_Y / ‖δx‖_X = 0

cannot hold.
Now, in addition to the basic definition of derivative just given, there is really just one key idea in this paper: the linear map Df(x) has different representations, depending on the particular X and Y involved. There is an underlying question here, which properly belongs to linear algebra (or functional analysis when the spaces are infinite-dimensional): Given two normed vector spaces X and Y, find a convenient representation for a continuous linear map L : X → Y. I will address this question in Sections 3 and 5 below; here I preview those sections by answering the question for X = Y = R.

Now suppose that L : R → R is linear. If a = L(1), which is a real number, then, since x = x · 1 for all x ∈ R,

  Lx = xL(1) = ax.

Thus, if L : R → R is linear, there is a real number a ∈ R such that

  Lx = ax for all x ∈ R.

That is, a linear map from R to R is represented by a real number. Therefore, if f : R → R has a derivative at x, it is customary in elementary calculus courses to define f′(x) to be the number

  f′(x) = lim_{δx→0} (f(x + δx) − f(x)) / δx,

which is equivalent to

  lim_{δx→0} (f(x + δx) − f(x) − f′(x)δx) / δx = 0.

It now becomes clear that, under this definition, the number f′(x) is just the representer of the linear map Df(x).
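The local-linear-approximation point of view can be checked numerically: the error f(x + δx) − f(x) − f′(x)δx should vanish faster than δx. A minimal sketch in Python (the function sin and the base point 0.7 are arbitrary choices for illustration):

```python
import math

f = math.sin          # example function; any smooth f works
fprime = math.cos     # its known derivative
x = 0.7               # arbitrary base point

for dx in (1e-1, 1e-2, 1e-3):
    err = f(x + dx) - f(x) - fprime(x) * dx   # E(x, dx)
    # The ratio E(x, dx)/dx tends to 0 as dx shrinks, confirming
    # that f'(x) dx is the best local linear approximation.
    print(dx, err / dx)
```

Replacing fprime(x) by any other slope makes the printed ratios stall at a nonzero value, illustrating why no other linear map approximates f as well near x.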
It may seem overly pedantic to distinguish between the linear map and its representer. However, when the vector spaces X and Y are not both one-dimensional, I believe it is essential to make the distinction. Before I go on to more examples in Section 3, where this should become clearer, I need to define continuity of the derivative, and also the concept of partial derivatives.

2.3. Continuity of the derivative
I assume again that f : U → Y, where X and Y are normed vector spaces and U ⊂ X is open. If f is differentiable at each x ∈ U, then Df becomes an operator; for each x ∈ U, Df(x) belongs to L(X, Y), the space of all continuous linear maps from X into Y:

  Df : U → L(X, Y).

When f is differentiable at every x ∈ U, f is simply said to be differentiable. Now, L(X, Y) is a vector space, since there is a natural way to add operators and multiply them by scalars, and continuity and linearity are obviously preserved by these operations. Also, L(X, Y) has a natural norm:

  ‖L‖_{L(X,Y)} = sup{ ‖Lx‖_Y / ‖x‖_X : x ∈ X, x ≠ 0 }.

Note that this definition of norm implies

  ‖Lx‖_Y ≤ ‖L‖_{L(X,Y)} ‖x‖_X for all x ∈ X.

The norm of an operator thus measures the largest factor by which the operator “stretches” or “magnifies” any vector in its domain. It can be shown that a linear operator L : X → Y is continuous if and only if ‖L‖_{L(X,Y)} < ∞ (if this latter condition holds, then L is said to be bounded). Since L(X, Y) is a normed vector space, the concept of continuity is defined for Df. The operator f : U → Y is called continuously differentiable (or just C1 for short) if Df : U → L(X, Y) exists and is continuous. The concept of continuous differentiability is important for relating a function's derivative to its partial derivatives.
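For a matrix A representing L : Rn → Rm, with the Euclidean norms on both spaces, the operator norm is the largest singular value of A. A small numerical illustration in Python, assuming NumPy is available (the matrix and the sampled directions are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))        # a linear map L : R^4 -> R^3

# Operator norm: sup of |Ax| / |x| over nonzero x.
# For Euclidean norms this equals the largest singular value of A.
sigma_max = np.linalg.norm(A, 2)

# Sample random directions; no stretch ratio exceeds the operator norm.
ratios = []
for _ in range(1000):
    x = rng.standard_normal(4)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))

print(sigma_max, max(ratios))  # max(ratios) <= sigma_max
```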
2.4. Partial derivatives
Suppose X, Y , and Z are all normed linear spaces and f : X × Y → Z , where X × Y is the product space X × Y = {(x, y) : x ∈ X, y ∈ Y }.
(Of course, f might be defined on a subset of X × Y, but the exposition is simpler if I assume that the domain of f is all of X × Y.) Since X × Y is a vector space, an operator like f is just another example that fits into the discussion above: f is differentiable at (x, y) if there is a continuous linear operator L : X × Y → Z such that

  lim_{(δx,δy)→(0,0)} ‖f(x + δx, y + δy) − f(x, y) − L(δx, δy)‖_Z / ‖(δx, δy)‖_{X×Y} = 0.
(For the norm on X × Y, the obvious choices are ‖(x, y)‖ = √(‖x‖_X² + ‖y‖_Y²), ‖(x, y)‖ = max{‖x‖_X, ‖y‖_Y}, and ‖(x, y)‖ = ‖x‖_X + ‖y‖_Y. I will only use the property that ‖(x, 0)‖_{X×Y} = ‖x‖_X and ‖(0, y)‖_{X×Y} = ‖y‖_Y, which holds for any of the above.) On the other hand, given any y ∈ Y,

  g(x) = f(x, y) for all x ∈ X

defines an operator g : X → Z. Similarly, for any x ∈ X,

  h(y) = f(x, y) for all y ∈ Y

defines an operator h : Y → Z. The question now arises: What is the relationship between Df, Dg, and Dh? The answer to this question is very simple when the structure of operators in L(X × Y, Z) is understood.

Theorem 2.3. Let X, Y, and Z be normed linear spaces. Then L ∈ L(X × Y, Z) if and only if there exist L1 ∈ L(X, Z) and L2 ∈ L(Y, Z) such that

  L(x, y) = L1 x + L2 y for all x ∈ X, y ∈ Y.

Proof: Suppose L ∈ L(X × Y, Z). Define L1 ∈ L(X, Z) by

  L1 x = L(x, 0) for all x ∈ X,

and L2 ∈ L(Y, Z) by

  L2 y = L(0, y) for all y ∈ Y.

It is easy to prove that L1 and L2 are indeed linear and bounded. Moreover, for any (x, y) ∈ X × Y,

  L(x, y) = L((x, 0) + (0, y)) = L(x, 0) + L(0, y) = L1 x + L2 y,

as desired.
On the other hand, it is easy to verify that if L1 ∈ L(X, Z), L2 ∈ L(Y, Z), and L : X × Y → Z is defined by

  L(x, y) = L1 x + L2 y for all (x, y) ∈ X × Y,

then L ∈ L(X × Y, Z). ✷
It is now easy to prove the following theorem.

Theorem 2.4. Let X, Y, and Z be normed linear spaces, suppose f : X × Y → Z, and let (x0, y0) ∈ X × Y. Define g : X → Z by g(x) = f(x, y0) and h : Y → Z by h(y) = f(x0, y). Suppose f is differentiable at (x0, y0). Then g is differentiable at x0, h is differentiable at y0, and

  Df(x0, y0)(δx, δy) = Dg(x0)δx + Dh(y0)δy.

The operators Dg(x0) and Dh(y0) are called the partial derivatives of f, and are denoted Dx f(x0, y0) and Dy f(x0, y0), respectively. Thus

  Df(x0, y0)(δx, δy) = Dx f(x0, y0)δx + Dy f(x0, y0)δy.

Proof: By the preceding theorem, there exist L1 ∈ L(X, Z) and L2 ∈ L(Y, Z) such that

  Df(x0, y0)(δx, δy) = L1 δx + L2 δy for all δx ∈ X, δy ∈ Y.

In particular, Df(x0, y0)(δx, 0) = L1 δx, so

  lim_{δx→0} ‖f(x0 + δx, y0) − f(x0, y0) − L1 δx‖_Z / ‖(δx, 0)‖_{X×Y} = 0.

This is equivalent to

  lim_{δx→0} ‖g(x0 + δx) − g(x0) − L1 δx‖_Z / ‖δx‖_X = 0,

so L1 = Dg(x0). Similarly, L2 = Dh(y0), and the proof is complete. ✷

Note that, for example, Dx f(x, y) ∈ L(X, Z),
that is, Dx f : X × Y → L(X, Z). Similarly, Dy f : X × Y → L(Y, Z).

The following theorem is only slightly harder to prove.

Theorem 2.5. Suppose X, Y, and Z are normed linear spaces, f : X × Y → Z, and the partial derivatives of f, Dx f(x, y) and Dy f(x, y), exist and are continuous on an open set U ⊂ X × Y. Then f is C1 on U, and

  Df(x, y)(δx, δy) = Dx f(x, y)δx + Dy f(x, y)δy.

Note that the continuity of Dx f and Dy f is necessary; it is not the case that if Dx f(x0, y0) and Dy f(x0, y0) exist, then Df(x0, y0) must exist. The above results obviously generalize to an operator of the form f : X1 × X2 × · · · × Xn → Z; the basic equation is

  Df(x)δx = Dx1 f(x)δx1 + Dx2 f(x)δx2 + · · · + Dxn f(x)δxn,

where x, δx ∈ X1 × X2 × · · · × Xn.

3. Representation of linear operators on Euclidean spaces

3.1. The basic theorem
I will now give the fundamental representation theorem for linear operators on Euclidean spaces. Specializing this result to the various contexts described in the introduction (R → R, Rn → R, Rn → Rm, and R → Rn) will account for the various types of derivatives described there.

Theorem 3.1. Let L : Rn → Rm. Then L is linear if and only if there is an m × n matrix A such that

  Lx = Ax for all x ∈ Rn.

Proof: Let {e1, e2, ..., en} be the standard basis for Rn (so the ith component of ei is one and all other components are zero). Let c1 = Le1, c2 = Le2, ..., cn = Len, and define A to be the m × n matrix whose columns are the vectors c1, c2, ..., cn. That is, Aij is (cj)i, the ith component of the vector cj. Then, since

  x = x1 e1 + x2 e2 + · · · + xn en,
the linearity of L yields

  Lx = x1 Le1 + x2 Le2 + · · · + xn Len = x1 c1 + x2 c2 + · · · + xn cn.

However, by the definition of matrix multiplication,

  Ax = x1 c1 + x2 c2 + · · · + xn cn

also holds. Thus Lx = Ax for all x ∈ Rn. ✷
Thus every linear operator L : Rn → Rm can be represented by an m × n matrix. If f : Rn → Rm is differentiable, then Df(x) : Rn → Rm is linear. Therefore, there is an m × n matrix J representing Df(x). This matrix J turns out to be the Jacobian matrix mentioned in the introduction. To see this, it is convenient to first consider certain special cases.

3.2. Representation of derivatives in special cases
3.2.1. m = n = 1. In the special case m = n = 1, so that f is a real-valued function of a real variable, Df(x) is represented by a single number f′(x), as was already shown.

3.2.2. m = 1, n > 1. In the case m = 1, n > 1, so that f is a real-valued function of several variables, the result of Section 2.4 applies:

  Df(x)δx = Dx1 f(x)δx1 + Dx2 f(x)δx2 + · · · + Dxn f(x)δxn.

Here δx1, δx2, ..., δxn are the components of the vector δx ∈ Rn. Moreover, regarded as a function of xi with the other components of x held fixed, f defines a real-valued function of a real variable. Thus Dxi f(x) can be represented by a single number, which is usually denoted ∂f/∂xi (x). Thus

  Df(x)δx = ∂f/∂x1 (x)δx1 + ∂f/∂x2 (x)δx2 + · · · + ∂f/∂xn (x)δxn,   (2)

which can be recognized as a matrix-vector product if the numbers

  ∂f/∂x1 (x), ∂f/∂x2 (x), ..., ∂f/∂xn (x)
are gathered in a row, that is, a 1 × n matrix:

  [∂f/∂x1 (x), ∂f/∂x2 (x), ..., ∂f/∂xn (x)].

This is the representer of Df(x) suggested by Theorem 3.1. There is a slightly different representer of Df(x) that is to be preferred because it generalizes to the infinite-dimensional case: Eq. (2) is recognized as the inner product of δx with the vector

  ∇f(x) = [∂f/∂x1 (x), ∂f/∂x2 (x), ..., ∂f/∂xn (x)]^T,

which is called the gradient of f at x. Thus

  Df(x)δx = (∇f(x), δx)_{Rn} for all δx ∈ Rn.
The gradient ∇f(x) is the usual representer of Df(x).

3.2.3. m > 1, n = 1. In the case m > 1, n = 1, so that f is a vector-valued function of a real variable (often called a curve since its image is a curve in m-space), Df(t) : R → Rm is represented by an m × 1 matrix. Now, f can be written

  f(t) = [f1(t), f2(t), ..., fm(t)]^T,

where each fi is a real-valued function of a real variable. By considering the definition of the derivative in this case, which implies that

  lim_{δt→0} |fi(t + δt) − fi(t) − (Df(t)δt)i| / |δt| = 0,

it is easy to see that (Df(t)δt)i = fi′(t)δt. Therefore the representer of Df(t) is

  [f1′(t), f2′(t), ..., fm′(t)]^T,
which is written as f′(t) or ḟ(t) in calculus courses. Since an m × 1 matrix can be thought of as a vector, the usual interpretation of f′(t) as the tangent vector to the curve x = f(t) holds. If the curve is traced out by a particle as t varies, then, at time t, the particle is at x = f(t), while at time t + δt, it is approximately at f(t) + f′(t)δt.

3.2.4. The general case m > 1, n > 1. Finally, consider the case m > 1, n > 1, so that f is a vector-valued function of a vector variable. Then, by the results on partial derivatives,

  Df(x)δx = Dx1 f(x)δx1 + Dx2 f(x)δx2 + · · · + Dxn f(x)δxn.

Now, regarded as a function of xi with the other components of x held fixed, f defines a function of the type considered in Section 3.2.3. The representer of Dxi f(x) is the column vector

  ∂f/∂xi (x) = [∂f1/∂xi (x), ∂f2/∂xi (x), ..., ∂fm/∂xi (x)]^T,

and it follows from this that the matrix J representing Df(x) has

  ∂f/∂x1 (x), ∂f/∂x2 (x), ..., ∂f/∂xn (x)

as columns. Thus J is the m × n matrix whose (i, j) entry is ∂fi/∂xj (x), which is the Jacobian matrix mentioned in the introduction. (Note that the gradients of the component functions fi(x) form the rows of J.)
3.3. Summary

Here is a summary of the results that I have presented:

• If f : R → R, then the representer of Df(x) is the scalar f′(x).
• If f : Rn → R, then the representer of Df(x) is the vector ∇f(x). Recall that this is a slight departure from the general framework, as Df(x) is represented via inner product with a (column) vector rather than via matrix multiplication with a row vector.
• If f : R → Rm, then the representer of Df(t) is the (column) vector ḟ(t).
• If f : Rn → Rm, then the representer of Df(x) is the Jacobian matrix J defined by

  Jij = ∂fi/∂xj (x).
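These representers can be verified numerically by building the Jacobian column by column, since column j of J is the partial derivative of f in the direction ej. A sketch in Python, assuming NumPy is available (the map f : R² → R² is an arbitrary smooth choice):

```python
import numpy as np

def f(x):
    # Example map f : R^2 -> R^2 (an arbitrary smooth choice)
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian(x):
    # Analytic Jacobian: rows are the gradients of the components f_i
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2 * x[1]]])

x = np.array([0.5, -1.2])
h = 1e-6
# Column j of J is Dx_j f(x), approximated by a forward difference
# in the standard basis direction e_j.
J_fd = np.column_stack([(f(x + h * e) - f(x)) / h for e in np.eye(2)])
print(np.max(np.abs(J_fd - jacobian(x))))  # small finite-difference error
```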
3.4. Example: A quadratic function
Suppose A is an n × n symmetric matrix (A^T = A), b ∈ Rn, γ ∈ R, and f : Rn → R is defined by

  f(x) = (1/2)(x, Ax)_{Rn} + (b, x)_{Rn} + γ.

To compute ∇f(x), one method is to write f in terms of the components of x,

  f(x) = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Aij xi xj + Σ_{i=1}^{n} bi xi + γ,

and compute the partial derivatives of f. While this is possible, it is easier to proceed from the definition: Write f(x + δx) − f(x) as a term that is linear in δx plus a smaller remainder. Now,

  f(x + δx) − f(x) = (1/2)(x + δx, A(x + δx)) + (b, x + δx) + γ − (1/2)(x, Ax) − (b, x) − γ
                   = (Ax + b, δx) + (1/2)(δx, Aδx).

Note that the symmetry of A was used to conclude that (δx, Ax) = (x, Aδx). The term (1/2)(δx, Aδx) is small compared to ‖δx‖ when ‖δx‖ is small; in fact,

  (1/2)(δx, Aδx) = O(‖δx‖²) = o(‖δx‖).

Therefore,

  f(x + δx) − f(x) = (Ax + b, δx) + o(‖δx‖),

that is, Df(x)δx = (Ax + b, δx). This equation exhibits the representer ∇f(x) of Df(x):

  ∇f(x) = Ax + b.
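A quick numerical sanity check of ∇f(x) = Ax + b, sketched in Python with NumPy (the matrix, vector, and base point are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M + M.T                      # symmetric, as the example requires
b = rng.standard_normal(4)
gamma = 2.0

def f(x):
    # f(x) = (1/2)(x, Ax) + (b, x) + gamma
    return 0.5 * x @ A @ x + b @ x + gamma

x = rng.standard_normal(4)
grad = A @ x + b                 # the representer derived above

# Central differences approximate each partial derivative of f.
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(4)])
print(np.max(np.abs(grad - grad_fd)))  # agreement up to rounding error
```

Because f is quadratic, the central difference is exact up to floating-point rounding, so the printed discrepancy is tiny.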
4. Rules for differentiation
I will now review the important rules for differentiating functions.

4.1. The derivative of a linear function
Suppose f : X → Y is linear and continuous. Then

  f(x + δx) − f(x) = f(x) + f(δx) − f(x) = f(δx).

It follows that Df(x)δx = f(δx), that is, Df(x) = f. This holds independently of x ∈ X, and is the analogue of the rule from calculus which states that the derivative of a linear function is a constant.

4.2. The chain rule
Now suppose that h : Y → Z, g : X → Y are C1, and f : X → Z is the composition of h and g:

  f(x) = h(g(x)) for all x ∈ X.

Then

  f(x + δx) = h(g(x + δx))
            = h(g(x) + Dg(x)δx + o(‖δx‖))
            = h(g(x)) + Dh(g(x))(Dg(x)δx + o(‖δx‖)) + o(‖Dg(x)δx + o(‖δx‖)‖)
            = f(x) + Dh(g(x))Dg(x)δx + o(‖δx‖).

Thus

  Df(x)δx = Dh(g(x))Dg(x)δx.

This is the chain rule.
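In matrix terms, the chain rule says that the Jacobian of h ∘ g is the product of the Jacobians. A minimal sketch in Python (NumPy assumed; the maps g and h are arbitrary smooth choices for illustration):

```python
import numpy as np

def g(x):          # g : R^2 -> R^2
    return np.array([x[0] ** 2, x[0] + 3.0 * x[1]])

def Dg(x):         # Jacobian of g
    return np.array([[2 * x[0], 0.0], [1.0, 3.0]])

def h(y):          # h : R^2 -> R
    return np.sin(y[0]) * y[1]

def Dh(y):         # gradient of h, written as a 1 x 2 Jacobian
    return np.array([[np.cos(y[0]) * y[1], np.sin(y[0])]])

x = np.array([0.4, 1.1])
# Chain rule: the Jacobian of f = h(g(x)) is Dh(g(x)) Dg(x).
J_chain = Dh(g(x)) @ Dg(x)

eps = 1e-6
J_fd = np.array([[(h(g(x + eps * e)) - h(g(x - eps * e))) / (2 * eps)
                  for e in np.eye(2)]])
print(np.max(np.abs(J_chain - J_fd)))  # small
```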
4.3. The product rule
Suppose X, Y, Z, and W are normed vector spaces, and P : Y × Z → W is continuous and bilinear; that is,

  P(α1 y1 + α2 y2, z) = α1 P(y1, z) + α2 P(y2, z) for all y1, y2 ∈ Y, z ∈ Z, α1, α2 ∈ R,
  P(y, α1 z1 + α2 z2) = α1 P(y, z1) + α2 P(y, z2) for all y ∈ Y, z1, z2 ∈ Z, α1, α2 ∈ R.
Then P defines some sort of generalized product, and it is reasonable to write P(y, z) = yz. (However, P is not assumed to be symmetric; that is, the product may not be commutative.) Suppose f : X → Y and g : X → Z are C1, and define h : X → W by

  h(x) = P(f(x), g(x)) = f(x)g(x).

Now,

  h(x + δx) − h(x) = f(x + δx)g(x + δx) − f(x)g(x)
    = (f(x) + Df(x)δx + o(‖δx‖_X))(g(x) + Dg(x)δx + o(‖δx‖_X)) − f(x)g(x)
    = f(x)g(x) + f(x)Dg(x)δx + (Df(x)δx)g(x) + o(‖δx‖_X) − f(x)g(x)
    = f(x)Dg(x)δx + (Df(x)δx)g(x) + o(‖δx‖_X).

Thus

  Dh(x)δx = f(x)Dg(x)δx + (Df(x)δx)g(x),

that is,

  Dh(x)δx = P(f(x), Dg(x)δx) + P(Df(x)δx, g(x)).

This is the product rule.
4.4. Example: A simple instance of the product rule

A simple example of the product rule is provided by the following: Let f : Rn → Rm, g : Rn → Rm, and define h : Rn → R by h(x) = (f(x), g(x))_{Rm}. Then

  Dh(x)δx = (f(x), Dg(x)δx)_{Rm} + (Df(x)δx, g(x))_{Rm}.

To find ∇h(x), the usual representer of Dh(x), let F and G be the Jacobian matrices representing Df(x) and Dg(x), respectively. Then

  Dh(x)δx = (f(x), Gδx)_{Rm} + (Fδx, g(x))_{Rm}
          = (G^T f(x), δx)_{Rn} + (δx, F^T g(x))_{Rn}
          = (δx, G^T f(x) + F^T g(x))_{Rn}.
Thus

  ∇h(x) = G^T f(x) + F^T g(x).
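This gradient formula is easy to check numerically; a sketch in Python with NumPy (the maps f and g and the base point are arbitrary illustrative choices):

```python
import numpy as np

def f(x):           # f : R^2 -> R^2
    return np.array([x[0] * x[1], x[0] ** 2])

def F(x):           # Jacobian of f
    return np.array([[x[1], x[0]], [2 * x[0], 0.0]])

def g(x):           # g : R^2 -> R^2
    return np.array([np.exp(x[1]), x[0] + x[1]])

def G(x):           # Jacobian of g
    return np.array([[0.0, np.exp(x[1])], [1.0, 1.0]])

def h(x):           # h(x) = (f(x), g(x)), the inner product
    return f(x) @ g(x)

x = np.array([0.3, -0.7])
grad = G(x).T @ f(x) + F(x).T @ g(x)   # the formula derived above

eps = 1e-6
grad_fd = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.max(np.abs(grad - grad_fd)))  # small
```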
4.5. Example: Differentiating an inverse
Suppose L : X → L(Y, Z), and assume that L(x)⁻¹ exists and is continuous for each x ∈ X. Define f : X → L(Z, Y) by f(x) = L(x)⁻¹. Assuming that L is C1, what is Df(x)? Both the chain rule and the product rule are involved in the answer.

Define φ : U → L(Z, Y) by φ(K) = K⁻¹, where U = {K ∈ L(Y, Z) : K⁻¹ exists and is continuous}. Then

  Kφ(K) = I,

where I : Z → Z is the identity operator. Differentiating both sides yields

  K Dφ(K)δK + δK φ(K) = 0.   (3)

To obtain this result, the product rule was applied to the mapping P : L(Y, Z) × L(Z, Y) → L(Z, Z) defined by P(K, L) = KL. Also, note that I ∈ L(Z, Z) is constant, so its derivative is zero. Now, φ(K) = K⁻¹, so (3) yields

  K Dφ(K)δK = −δK K⁻¹

or

  Dφ(K)δK = −K⁻¹ δK K⁻¹.   (4)
(The reader will notice the shadow of the calculus rule

  f(x) = 1/x ⇒ f′(x) = −1/x²

here.) Equation (4) can now be combined with the chain rule to find Df(x) for f(x) = L(x)⁻¹. Since f is the composition of φ and L, the chain rule yields

  Df(x)δx = Dφ(L(x))DL(x)δx = −L(x)⁻¹ (DL(x)δx) L(x)⁻¹.
This expression for Df(x)δx is the product (composition) of three linear operators, L(x)⁻¹, DL(x)δx, and L(x)⁻¹. Note that since the product of linear operators is not commutative, order is important in this formula.
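Formula (4) can be verified numerically on matrices, where the inverse is computed directly. A sketch in Python with NumPy (the matrix K and perturbation δK are arbitrary test data; K is shifted to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
K = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # well-conditioned, invertible
dK = 1e-6 * rng.standard_normal((3, 3))             # small perturbation

Kinv = np.linalg.inv(K)
# Derivative formula (4): Dphi(K) dK = -K^{-1} dK K^{-1}
lin = -Kinv @ dK @ Kinv

# Compare against the actual change in the inverse.
actual = np.linalg.inv(K + dK) - Kinv
print(np.linalg.norm(actual - lin) / np.linalg.norm(dK))  # nearly 0
```

The printed ratio is o(1) in ‖δK‖, as the definition of the derivative requires.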
5. Simple representations on infinite-dimensional spaces
If X and Y are both infinite-dimensional, then little can be said in general about the representation of L ∈ L(X, Y). However, if one of the spaces is finite-dimensional, then it is not difficult to derive some useful results.
5.1. Real-valued functions defined on Hilbert spaces
First I take the special case of f : X → R, where X is a Hilbert space—a complete inner product space. In this case, the Riesz Representation Theorem is available.

Theorem 5.1 (Riesz Representation Theorem). Let X be a Hilbert space. If L : X → R is linear and continuous, then there exists a unique v ∈ X such that

  L(x) = (x, v)_X for all x ∈ X.
This theorem follows immediately from Theorem 3.1 if X is finite-dimensional. For a proof in the infinite-dimensional case, see any book on Hilbert spaces or functional analysis.

Now suppose f : X → R is differentiable at x ∈ X. Then Df(x) is a continuous linear function defined on X, and, by the Riesz Representation Theorem, there exists a vector v ∈ X satisfying

  Df(x)δx = (δx, v)_X for all δx ∈ X.

Just as in the finite-dimensional case, the vector v is called the gradient of f at x, and is denoted by ∇f(x).
5.2. Example: The gradient of a nonlinear least-squares function
A common optimization problem is to minimize the nonlinear least-squares function

  f(x) = (1/2)‖F(x)‖²_Y = (1/2)(F(x), F(x))_Y,

where F : X → Y is a nonlinear operator and X, Y are Hilbert spaces. I will now apply the results developed above to compute the gradient of f. I will also specialize the results to X = Rn, Y = Rm. By the product rule,

  Df(x)δx = (1/2)(F(x), DF(x)δx)_Y + (1/2)(DF(x)δx, F(x))_Y = (DF(x)δx, F(x))_Y.
Now, an m × n matrix A has the property that

  (Ax, y)_{Rm} = (x, A^T y)_{Rn} for all x ∈ Rn, y ∈ Rm.

Similarly, for every operator L ∈ L(X, Y), there is a unique adjoint operator L* defined by the equation

  (Lx, y)_Y = (x, L*y)_X for all x ∈ X, y ∈ Y.
(The existence and uniqueness of L* can be proved using the Riesz Representation Theorem.) Therefore,

  Df(x)δx = (DF(x)δx, F(x))_Y = (δx, DF(x)*F(x))_X,

which shows that

  ∇f(x) = DF(x)*F(x).

Computing the adjoint of DF(x) can be quite challenging in some applications; see Section 7 for a nontrivial example. In the case of X = Rn, Y = Rm, DF(x) is represented by the Jacobian matrix J, and therefore DF(x)* is represented by its transpose. It follows that

  ∇f(x) = J^T F(x),

where J is the Jacobian matrix of F at x.
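In the finite-dimensional case this gradient formula is easy to test; a sketch in Python with NumPy (the residual map F : R² → R³ is an arbitrary smooth choice):

```python
import numpy as np

def F(x):
    # Example residual map F : R^2 -> R^3 (an arbitrary smooth choice)
    return np.array([x[0] - 1.0, x[1] ** 2, np.sin(x[0] * x[1])])

def J(x):
    # Jacobian of F
    return np.array([[1.0, 0.0],
                     [0.0, 2 * x[1]],
                     [x[1] * np.cos(x[0] * x[1]), x[0] * np.cos(x[0] * x[1])]])

def f(x):
    # Least-squares objective f(x) = (1/2)||F(x)||^2
    return 0.5 * F(x) @ F(x)

x = np.array([0.8, 0.6])
grad = J(x).T @ F(x)          # gradient formula derived above

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.max(np.abs(grad - grad_fd)))  # small
```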
5.3. Finite-dimensional operators on Hilbert space
Next I consider the case of F : X → Rm, where X is again a Hilbert space. Clearly F can be represented as

  F(x) = [F1(x), F2(x), ..., Fm(x)]^T,

where Fi : X → R, i = 1, 2, ..., m. It follows that

  DF(x)δx = [DF1(x)δx, DF2(x)δx, ..., DFm(x)δx]^T
          = [(∇F1(x), δx)_X, (∇F2(x), δx)_X, ..., (∇Fm(x), δx)_X]^T.

Thus the derivative of F can be represented by m vectors in X, namely, ∇F1(x), ∇F2(x), ..., ∇Fm(x). By analogy with the finite-dimensional case, these vectors can be thought of as forming the rows of a matrix (with infinitely many columns).
5.4. Operators with a finite-dimensional domain
5.4.1. The case of a one-dimensional domain. Suppose Y is a normed linear space, and assume F : R → Y is differentiable. Then DF(t) is a continuous linear operator from R into Y. It is simple to represent such operators, for if L ∈ L(R, Y) and z = L(1), then

  Lt = tL(1) = tz for all t ∈ R.

Thus L is represented by an element z of Y. It follows that DF(t) is represented by an element of Y, which is denoted Ḟ(t):

  DF(t)δt = δt Ḟ(t) for all δt ∈ R.

5.4.2. The case of a finite-dimensional domain. Now suppose F : Rn → Y is differentiable. Then DF(x) ∈ L(Rn, Y), and the structure of such an operator must be determined. Let L ∈ L(Rn, Y) and let {e1, e2, ..., en} be the standard basis for Rn. Then for any x ∈ Rn,

  x = x1 e1 + x2 e2 + · · · + xn en,

and so

  Lx = x1 Le1 + x2 Le2 + · · · + xn Len.
That is, there are n vectors Le1, Le2, ..., Len in Y, and each image Lx is a linear combination of these n vectors. These n vectors represent L. By analogy with the case Y = Rm, one can think of the representer of L as a matrix with n columns (each of which is a vector in Y). It is now easy to see that DF(x) is represented by n vectors, each of which is the representer of a partial derivative of F at x. Again, one can think of the representer of DF(x) as a matrix with n columns, each a representer of a partial derivative of F at x.
5.5. Example: The solution operator of an IVP
As an important example of the previous section, consider a vector field f : Ω → Rn, where Ω ⊂ Rn is an open set. The vector field f defines an autonomous (i.e., time-independent) ordinary differential equation (ODE) ẋ = f(x). By a standard result of the theory of ODEs, if W ⊂ Ω is closed and bounded and f is C1, then there exists a positive number α such that, for each x0 ∈ W, there exists x ∈ (C[−α, α])n satisfying the Initial Value Problem (IVP)

  ẋ = f(x), x(0) = x0.   (5)
That is, there is an operator S : W → (C[−α, α])^n with S(x_0) = x, the solution of (5). I call S the solution operator of the IVP. Recall that (C[−α, α])^n is the space of all continuous, vector-valued functions defined on [−α, α]. The usual norm on (C[−α, α])^n is

‖u‖_∞ = max{‖u(t)‖_2 : t ∈ [−α, α]}.

This definition implies that if ‖u − v‖_∞ is small, then ‖u(t) − v(t)‖_2 is uniformly small on the interval [−α, α]. For this reason, ‖·‖_∞ is sometimes called the uniform norm. The derivative DS(x_0) is computed by finding the local linear approximation to S(x_0 + δx_0) − S(x_0). Write z = S(x_0 + δx_0) and x = S(x_0). Then z satisfies ż = f(z), z(0) = x_0 + δx_0, and x satisfies (5). Therefore, if w = z − x, then

ẇ = ż − ẋ = f(z) − f(x) = Df(x)(z − x) + o(‖z − x‖_∞) = Df(x)w + o(‖w‖_∞)

and

w(0) = z(0) − x(0) = x_0 + δx_0 − x_0 = δx_0.

Since a linear (in δx_0) approximation to w is desired, it is reasonable to drop the o(‖w‖_∞) term from the ODE and consider the solution u to

u̇ = Df(x(t))u, u(0) = δx_0.   (6)
Note that u really does depend linearly on δx_0. Indeed, if u solves (6) and v solves

v̇ = Df(x(t))v, v(0) = δy_0,

then y = βu + γv satisfies

ẏ = βu̇ + γv̇ = βDf(x(t))u + γDf(x(t))v = Df(x(t))(βu + γv) = Df(x(t))y,

and

y(0) = βu(0) + γv(0) = βδx_0 + γδy_0.

Therefore u, the solution of (6), depends linearly on δx_0, and it is an approximation to w, since it is obtained by solving an IVP with the same initial condition as that satisfied by w and with a slightly changed vector field. It can be proved, in fact, that

w = u + o(‖w‖_∞) = u + o(‖δx_0‖_2).

(This is a standard theorem about the continuous dependence of the solution to an IVP on the vector field.) Therefore DS(x_0)δx_0 = u, where u is the solution of the IVP (6).
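The conclusion DS(x_0)δx_0 = u can be checked numerically by integrating the linearized IVP (6) along the reference trajectory and comparing with a divided difference of the solution operator itself. The Python sketch below is my own illustration, not part of the original text; the pendulum vector field and the classical RK4 integrator are arbitrary choices:

```python
import numpy as np

# Vector field and its Jacobian for the pendulum equation x1' = x2, x2' = -sin(x1).
f  = lambda x: np.array([x[1], -np.sin(x[0])])
Df = lambda x: np.array([[0.0, 1.0], [-np.cos(x[0]), 0.0]])

def rk4(g, y0, T, steps):
    """Classical fourth-order Runge-Kutta integration of y' = g(y) up to time T."""
    y, h = np.array(y0, dtype=float), T / steps
    for _ in range(steps):
        k1 = g(y); k2 = g(y + 0.5*h*k1); k3 = g(y + 0.5*h*k2); k4 = g(y + h*k3)
        y = y + (h/6.0)*(k1 + 2*k2 + 2*k3 + k4)
    return y

def S(x0, T=1.0, steps=2000):
    """Solution operator (sampled at the final time): x0 -> x(T)."""
    return rk4(f, x0, T, steps)

def DS(x0, dx0, T=1.0, steps=2000):
    """DS(x0)dx0 = u(T), where u solves the linearized IVP (6),
    u' = Df(x(t))u, u(0) = dx0, integrated jointly with the reference trajectory."""
    g = lambda z: np.concatenate([f(z[:2]), Df(z[:2]) @ z[2:]])
    return rk4(g, np.concatenate([x0, dx0]), T, steps)[2:]

x0, dx0, h = np.array([0.8, 0.0]), np.array([1.0, -0.5]), 1e-6
fd = (S(x0 + h*dx0) - S(x0)) / h        # finite-difference directional derivative
print(np.max(np.abs(DS(x0, dx0) - fd)))  # O(h) agreement
```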
6. Second derivatives
In elementary calculus, if f : R → R is twice differentiable, then the scalar f′(x) representing Df(x) is called the first derivative. Since, in this way of looking at things, f′ : R → R is the same type of function as f itself, it is natural to define f″(x) as the derivative of f′ at x, so that f″(x) is also a scalar and f″, like f and f′, maps R into R. As I will now explain, this is another instance in which the one-dimensional case gives a completely misleading picture. Suppose X and Y are normed linear spaces, U ⊂ X is open, and f : U → Y is differentiable. Then the derivative Df is also an operator mapping one normed linear space into another; however, it is not of the same type as f, since Df maps U into L(X, Y). It does make sense to ask whether Df : U → L(X, Y) is differentiable; to examine this question, Definition 2.2 is applied. The operator Df is differentiable at x ∈ U if there exists a continuous linear operator L ∈ L(X, L(X, Y)) such that

lim_{δx→0} ‖Df(x + δx) − Df(x) − Lδx‖_{L(X,Y)} / ‖δx‖_X = 0.
If such an L exists, then f is said to be twice-differentiable at x, and L is denoted by D 2 f (x). If f is twice-differentiable at each x ∈ U , then f is called twice-differentiable, in which
case D²f is an operator mapping U into L(X, L(X, Y)). If this operator is continuous, then f is called C². Now, clearly Definition 2.2 can be used to discuss derivatives of order three and higher. However, things become quite awkward. For example, if f is three times differentiable, then D³f(x) ∈ L(X, L(X, L(X, Y))), and if D⁴f(x) exists, then D⁴f(x) ∈ L(X, L(X, L(X, L(X, Y)))). Fortunately, a simplification is afforded by the nature of the spaces L(X, L(X, Y)), L(X, L(X, L(X, Y))), .... Consider L ∈ L(X, L(X, Y)). By definition, L(x) ∈ L(X, Y) and L(x)z ∈ Y for each x, z ∈ X. In other words, L defines an operator B : X × X → Y by B(x, z) = L(x)z. It is easy to see that B is bilinear, that is, that

B(α_1 x_1 + α_2 x_2, z) = α_1 B(x_1, z) + α_2 B(x_2, z) for all x_1, x_2, z ∈ X, α_1, α_2 ∈ R,
B(z, α_1 x_1 + α_2 x_2) = α_1 B(z, x_1) + α_2 B(z, x_2) for all x_1, x_2, z ∈ X, α_1, α_2 ∈ R.

Indeed, L(α_1 x_1 + α_2 x_2) = α_1 L(x_1) + α_2 L(x_2), from which it follows that

B(α_1 x_1 + α_2 x_2, z) = L(α_1 x_1 + α_2 x_2)z = α_1 L(x_1)z + α_2 L(x_2)z = α_1 B(x_1, z) + α_2 B(x_2, z).

Similarly, L(z) is linear, so

B(z, α_1 x_1 + α_2 x_2) = L(z)(α_1 x_1 + α_2 x_2) = α_1 L(z)x_1 + α_2 L(z)x_2 = α_1 B(z, x_1) + α_2 B(z, x_2).
If X, Y, and Z are normed linear spaces, then the space of continuous bilinear operators B : X × Y → Z is denoted by L_2(X, Y, Z). This space has a natural norm:

‖B‖_{L_2(X,Y,Z)} = sup{ ‖B(x, y)‖_Z / (‖x‖_X ‖y‖_Y) : x ∈ X, y ∈ Y, x ≠ 0, y ≠ 0 }.
It is a standard result that a bilinear operator B : X × Y → Z is continuous if and only if ‖B‖_{L_2(X,Y,Z)} < ∞. Above I showed that L ∈ L(X, L(X, Y)) defines a bilinear operator B : X × X → Y; it is now easy to see that the boundedness (continuity) of L implies that B is bounded (continuous):

‖B(x, z)‖_Y = ‖L(x)z‖_Y ≤ ‖L(x)‖_{L(X,Y)} ‖z‖_X ≤ ‖L‖_{L(X,L(X,Y))} ‖x‖_X ‖z‖_X.

This shows that B ∈ L_2(X, X, Y) and

‖B‖_{L_2(X,X,Y)} ≤ ‖L‖_{L(X,L(X,Y))}.

In fact, it is not difficult to show that this inequality is actually an equality, and that this correspondence between L ∈ L(X, L(X, Y)) and B ∈ L_2(X, X, Y) is an isometry—a one-to-one correspondence that preserves norms. Therefore, since D²f(x) is a member of L(X, L(X, Y)), it can equally well be regarded as a continuous bilinear operator on X × X. The reader will probably not find it difficult to believe that, if f is smooth enough for higher derivatives to exist, then D³f(x) can be regarded as a trilinear operator, D⁴f(x) as a 4-linear operator, and, in general, D^k f(x) as a k-linear operator. I will not discuss derivatives of order higher than two any further, but the reader should be able to generalize the discussion of second derivatives to higher derivatives if necessary. The following result is fundamental.

Theorem 6.1. Suppose f : X → Y is twice-differentiable at x ∈ X, where X and Y are normed linear spaces and Y is complete. Then the bilinear operator D²f(x) is symmetric:

D²f(x)δxδy = D²f(x)δyδx for all δx, δy ∈ X.

The proof is too technical to present here.

6.1. Representation of bilinear operators on finite-dimensional spaces
Suppose f : Rn → Rm is twice differentiable at x ∈ Rn . Then D 2 f (x) is a bilinear operator mapping Rn × Rn into Rm . The following basic theorem shows how such bilinear operators are represented.
Theorem 6.2. Suppose B : R^p × R^n → R^m is bilinear. Then there is a tensor of order three (a 3-tensor) T = (T_{ijk}) such that B(x, y) = Txy for all x ∈ R^p, y ∈ R^n. That is, the ith component of B(x, y) is given by

(B(x, y))_i = \sum_{j=1}^{n} \sum_{k=1}^{p} T_{ijk} x_k y_j.

Proof: (Sketch) Since B(x, ·) defines a linear operator from R^n into R^m for each x ∈ R^p, it follows that B(e_k, ·) is represented by an m × n matrix A^{(k)} for each standard basis vector e_k, k = 1, 2, ..., p. Define the 3-tensor T by T_{ijk} = A^{(k)}_{ij}. The result follows. ✷

Thus D²f(x) is represented by a 3-tensor T ∈ R^{m×n×n}. The formula for T_{ijk} is most easily derived using the following results about partial derivatives.
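The contraction in Theorem 6.2 is a one-liner in NumPy via einsum. The sketch below (my own illustration; the tensor is random) also checks bilinearity and the slices A^(k) = B(e_k, ·) from the proof:

```python
import numpy as np

# A bilinear map B : R^p x R^n -> R^m stored as a 3-tensor T with
# (B(x, y))_i = sum_{j,k} T[i, j, k] * x[k] * y[j], as in Theorem 6.2.
rng = np.random.default_rng(0)
m, n, p = 2, 3, 4
T = rng.standard_normal((m, n, p))

B = lambda x, y: np.einsum('ijk,k,j->i', T, x, y)

# Bilinearity in the first argument: B(a*x1 + b*x2, y) = a*B(x1, y) + b*B(x2, y).
x1, x2, y = rng.standard_normal(p), rng.standard_normal(p), rng.standard_normal(n)
a, b = 2.0, -3.0
assert np.allclose(B(a*x1 + b*x2, y), a*B(x1, y) + b*B(x2, y))

# The matrices A^(k) = B(e_k, .) from the proof are exactly the slices T[:, :, k].
k = 1
e_k = np.eye(p)[:, k]
assert np.allclose(B(e_k, y), T[:, :, k] @ y)
```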
6.2. Second partial derivatives
Suppose f : X × Y → Z, where X, Y, and Z are normed linear spaces, is twice differentiable. Then D²f(x, y) belongs to the space L_2(X × Y, X × Y, Z). Such operators have the following representation.

Theorem 6.3. Suppose B ∈ L_2(X × Y, X × Y, Z), where X, Y, and Z are normed linear spaces. Then there exist operators

B_{11} ∈ L_2(X, X, Z), B_{12} ∈ L_2(Y, X, Z), B_{21} ∈ L_2(X, Y, Z), B_{22} ∈ L_2(Y, Y, Z)

such that, for all (x, y), (u, v) ∈ X × Y,

B((x, y), (u, v)) = B_{11}(x, u) + B_{12}(y, u) + B_{21}(x, v) + B_{22}(y, v).

The proof is similar to others given above and will be omitted. Applying the theorem to D²f(x, y), it is not difficult to believe that the operators B_{11}, B_{12}, B_{21}, and B_{22} are the second partial derivatives of f, resulting in the formula

D²f(x, y)(δr, δz)(δx, δy) = D²_{xx}f(x, y)δrδx + D²_{xy}f(x, y)δzδx + D²_{yx}f(x, y)δrδy + D²_{yy}f(x, y)δzδy.
Note that the notation D²_{xy}f denotes the partial derivative with respect to y of D_x f, and similarly for D²_{yx}f. By Theorem 6.1, each of D²f(x, y), D²_{xx}f(x, y), and D²_{yy}f(x, y) is symmetric. It follows easily that

D²_{xy}f(x, y)δzδx = D²_{yx}f(x, y)δxδz for all δx ∈ X, δz ∈ Y.
Of course, these results can be generalized to the case of f : X_1 × X_2 × ··· × X_n → Z, in which case the fundamental formulas are

D²f(x)δxδr = \sum_{i=1}^{n} \sum_{j=1}^{n} D²_{x_i x_j} f(x) δx_j δr_i,

and

D²_{x_i x_j} f(x) δx_j δr_i = D²_{x_j x_i} f(x) δr_i δx_j.
6.3. Representation of second derivatives on finite-dimensional spaces
Now I return to the case of f : R^n → R^m and derive the formula for the 3-tensor representing D²f(x). Since R^n can be regarded as the product of n copies of R, D²f(x) can be expressed in terms of the second partial derivatives of f:

D²f(x)δxδr = \sum_{j=1}^{n} \sum_{k=1}^{n} D²_{x_j x_k} f(x) δx_k δr_j.

Recall that D²_{x_j x_k} f is the derivative with respect to x_k of D_{x_j} f; also recall that D_{x_j} f : R^n → L(R, R^m), or, effectively, D_{x_j} f : R^n → R^m (since each operator in L(R, R^m) is represented by a vector in R^m, and vice versa). Specifically, D_{x_j} f(x) is represented by the vector

∂f/∂x_j (x).

By the same reasoning, D²_{x_j x_k} f(x) is also represented by a vector in R^m, namely,

∂²f/(∂x_k ∂x_j) (x).
It follows that

D²f(x)δxδr = \sum_{j=1}^{n} \sum_{k=1}^{n} ∂²f/(∂x_k ∂x_j)(x) δx_k δr_j.

Thus,

(D²f(x)δxδr)_i = \sum_{j=1}^{n} \sum_{k=1}^{n} ∂²f_i/(∂x_k ∂x_j)(x) δx_k δr_j,

which shows that D²f(x) is represented by the 3-tensor T, with

T_{ijk} = ∂²f_i/(∂x_k ∂x_j)(x).

6.4. The Hessian
In Sections 3.2.2 and 5.1, I showed that, in the case of f : X → R, the usual representer of Df(x) is the gradient vector ∇f(x). When X = R^n, this is not quite the same as the Jacobian matrix (which is a row matrix in that case); the gradient is adopted instead precisely because it generalizes to the case in which X is a Hilbert space. In the same way, the representer for D²f(x) has a special form when the range of f is R. Indeed, in this case D²f(x) is a bilinear operator mapping X × X into R, and the following theorem holds.

Theorem 6.4. Suppose X is a Hilbert space and B ∈ L_2(X, X, R) is a bilinear form. Then there exists a linear operator L ∈ L(X, X) such that

B(x, y) = (Lx, y)_X.

This theorem can be proved using the Riesz representation theorem, since, for fixed x, the map y ↦ B(x, y) defines a continuous, linear, real-valued function on X. In the case of D²f(x), where f : X → R, the linear operator representing the bilinear operator is called the Hessian operator and is denoted ∇²f(x). That is, ∇²f(x) ∈ L(X, X) is defined by

D²f(x)δxδy = (∇²f(x)δx, δy)_X.
In the case of f : R^n → R, the Hessian matrix is just the specialization of the tensor T discussed above. Since m = 1 in this case, the 3-tensor can obviously be identified with a 2-tensor, i.e. a matrix (just as the Jacobian matrix can be identified with a vector, the gradient, in this case). Therefore,

(∇²f(x))_{ij} = ∂²f/(∂x_j ∂x_i)(x).

6.5. Example: The Hessian of a nonlinear least-squares function
I now return to the example of Section 5.2. Let F : X → Y, where X and Y are Hilbert spaces, and define f(x) = ½(F(x), F(x))_Y. I showed earlier that Df(x)δx = (DF(x)δx, F(x))_Y. By the product rule, it follows that

D²f(x)δxδr = (DF(x)δx, DF(x)δr)_Y + (D²F(x)δxδr, F(x))_Y.   (7)
This gives a formula for the second derivative of f, but it must be rearranged to exhibit the Hessian operator. The first term is easy to handle, since

(DF(x)δx, DF(x)δr)_Y = (δx, DF(x)*DF(x)δr)_X.

The operator DF(x)*DF(x) thus forms part of the Hessian; indeed, in small-residual least-squares problems, this operator is a good approximation to the Hessian, at least for x near the minimizer. For this reason, it is often used as an approximation to the Hessian, and is referred to as the Gauss-Newton Hessian. To handle the second term, write B(δx, δr) = D²F(x)δxδr; then

(D²F(x)δxδr, F(x))_Y = (B(δx, δr), F(x))_Y = (δx, (B(·, δr))*F(x))_X,

where I write B(·, δr) for the linear operator defined by δx ↦ B(δx, δr).
Now, it is easy to see that δr ↦ (B(·, δr))*F(x) defines a linear operator mapping X to X, and it can be shown to be bounded (continuous). This operator depends on x through F(x) and D²F(x), and I will denote it by S(x), so that S(x)δr = (B(·, δr))*F(x). With this notation,

(D²F(x)δxδr, F(x))_Y = (δx, S(x)δr)_X,

and therefore

∇²f(x) = DF(x)*DF(x) + S(x).

Lastly, I will compute S(x) in the case F : R^n → R^m. In that case, D²F(x) is represented by the 3-tensor T, where

T_{ijk} = ∂²F_i/(∂x_k ∂x_j)(x).
Therefore, setting z = F(x), (D 2 F(x)δxδr, z)Rm =
m
(D 2 F(x)δxδr )i z i
i=1
=
m n n
Ti jk δxk δr j z i
i=1 j=1 k=1
=
n n m k=1
= δx,
j=1
i=1
j=1
i=1
n m
Ti jk z i δrj δxk
(S(x)δr )k = =
j=1
i=1
j=1
i=1
n m
Ti jk z i δrj
and so n m
Ti jk z i δr j ∂Fi (x)Fi (x) δr j . ∂ xk ∂ x j
, Rn
This shows that S(x) is represented by the matrix whose (k, j) entry is

\sum_{i=1}^{m} ∂²F_i/(∂x_k ∂x_j)(x) F_i(x),

and hence

S(x) = \sum_{i=1}^{m} F_i(x) ∇²F_i(x).
The matrix representing S(x) has been referred to as the "mess matrix," and the above formula shows that it is expensive to compute. This explains the popularity of the Gauss-Newton Hessian. However, in large-residual least-squares problems, use of the full Hessian (or an approximation to it) is necessary.
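The decomposition ∇²f(x) = DF(x)*DF(x) + S(x) can be illustrated numerically. In the Python sketch below (the residual map F, its Jacobian, and the evaluation point are made-up examples, not from the text), the full Hessian agrees with a finite-difference Hessian of f, while the Gauss-Newton Hessian is off by exactly the residual term:

```python
import numpy as np

# A hypothetical residual map F : R^2 -> R^2 with its Jacobian and component Hessians.
F = lambda x: np.array([x[0]**2 - x[1], x[0] + x[1]**2 - 1.0])
J = lambda x: np.array([[2*x[0], -1.0], [1.0, 2*x[1]]])
H = [np.diag([2.0, 0.0]), np.diag([0.0, 2.0])]   # Hessians of F_1 and F_2

f = lambda x: 0.5 * F(x) @ F(x)

x = np.array([0.3, -0.7])
gauss_newton = J(x).T @ J(x)                                   # drops the "mess matrix"
full = gauss_newton + sum(Fi * Hi for Fi, Hi in zip(F(x), H))  # + sum_i F_i(x) Hess F_i(x)

# Central-difference Hessian of f for comparison.
h, n = 1e-5, 2
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = np.eye(n)[i], np.eye(n)[j]
        fd[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                    - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4*h*h)

print(np.max(np.abs(full - fd)))          # small: full Hessian matches
print(np.max(np.abs(gauss_newton - fd)))  # O(1): off by the residual term here
```

At this particular point the residual F(x) is not small, so the Gauss-Newton approximation is visibly wrong; near a small-residual minimizer the two would nearly coincide.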
7. Example: The adjoint state method
As a more involved example, I will discuss the computation of DG(c) and DG(c)* for a nonlinear operator G defined by an (explicit) finite-difference simulation. This discussion is taken from the paper (Gockenbach et al., in press). The problem described here arises, for example, when one or more coefficients in a partial differential equation are to be estimated by the Output Least-Squares (OLS) technique. In this technique, the parameters are chosen to produce simulated data as close as possible (in a norm induced by an inner product) to observed data. Specifically, the OLS problem is

min_c J(c), J(c) = ½‖G(c) − D^{obs}‖²,   (8)
where c ∈ C denotes the unknown parameters, D^{obs} is the observed data, and G is the forward map, that is, the operator embodying the mathematical model of the dependence of the data on the parameters. Thus the OLS problem is just a nonlinear least-squares problem of the type discussed above, and

∇J(c) = DG(c)*(G(c) − D^{obs}).

In the application I consider here, G is defined by an explicit finite-difference simulation followed by sampling (in many applications, only part of the field simulated by finite differences is observable). I will therefore assume that

G(c) = SU = \sum_{n=0}^{N} S_n U^n,

where U^n ∈ U is (related to) the nth time level of the simulated field, S : U^{N+1} → D is the sampling operator, and D is the data space (that is, D^{obs} ∈ D).
Note that S is defined by

SU = \sum_{n=0}^{N} S_n U^n,

where S_n : U → D for n = 0, 1, ..., N. That is, each time level of the computed field is sampled, and the results are accumulated as the data. This formalism provides an efficient way to abstractly represent several different sampling possibilities. For example, the entire time level U^n may be recorded for certain values of n, in which case S_n is the zero operator for all other values of n. Alternatively, every time level could be sampled at a few "receiver" locations (as in the typical seismic experiment), and the results recorded as time series. At the other extreme, the entire history of the field could be retained. All of these possibilities can be accommodated within the above formalism by appropriate choice of S. Any finite-difference scheme can be considered to be formally two-level, by concatenating several time levels if necessary. Therefore,

U^{n+1} = H_n(c, U^n), n = 0, 1, ..., N − 1.
I call H_n : C × U → U the stencil operator.

7.1. A convection-diffusion example
I will now pause to give an explicit example of the situation described above. Consider the following initial-boundary value problem for the convection-diffusion equation:

u_t + a(x)u_x = 0, 0 < x < 1, t > 0,
u(x, 0) = φ(x), 0 < x < 1,
u(0, t) = 0, t > 0,

where a(x) > 0 for all x ∈ [0, 1]. Define a grid on the rectangle (x, t) ∈ [0, 1] × [0, T] by setting

x_j = jδx, δx = 1/M, t_n = nδt, δt = T/N,

and write u_j^n for the approximation to u(x_j, t_n). Since the characteristics of the PDE point up and to the right in the (x, t) plane, it is natural to discretize using a forward difference in time and a backward difference in space to obtain

(u_j^{n+1} − u_j^n)/δt + a_j (u_j^n − u_{j−1}^n)/δx = 0,
where a_j = a(x_j). Taking into account the initial and boundary values yields

u_j^{n+1} =
  0,                                       j = 0, n = 0, 1, ..., N − 1,
  φ_j − a_j (δt/δx)(φ_j − φ_{j−1}),        n = 0, j = 1, 2, ..., M,
  u_1^n − a_1 (δt/δx) u_1^n,               n = 1, 2, ..., N − 1, j = 1,
  u_j^n − a_j (δt/δx)(u_j^n − u_{j−1}^n),  n = 1, 2, ..., N − 1, j = 2, 3, ..., M.   (9)
The stencil operator H for this example is therefore defined by H(a, u^n) = u^{n+1}, where u^{n+1} is defined by (9). In terms of the above notation, U is (M + 1)-dimensional space, while C is M-dimensional space. To introduce sampling, suppose that sensors are placed at several grid points on the spatial grid, say at x_{j_1}, x_{j_2}, ..., x_{j_ℓ}, and that the observed data consist of the time series

u_{j_i}^1, u_{j_i}^2, ..., u_{j_i}^N, i = 1, 2, ..., ℓ.

If each time series forms a column of a matrix, then the data D is an (N + 1) × ℓ matrix, and we have

D = \sum_{n=0}^{N} S_n U^n,

where S_n U^n is the matrix with every row equal to zero except the nth, which has entries u_{j_1}^n, u_{j_2}^n, ..., u_{j_ℓ}^n.
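The scheme (9), together with the receiver-style sampling just described, takes only a few lines to implement. The Python sketch below is my own illustration (the pulse initial condition, the receiver indices, and the grid sizes are arbitrary choices; the time step is chosen so that the CFL ratio a δt/δx stays below 1):

```python
import numpy as np

def simulate(a, phi, T=0.5, M=100, N=200, receivers=(30, 60)):
    """Explicit upwind scheme (9) for u_t + a(x) u_x = 0 with u(0, t) = 0 and
    u(x, 0) = phi(x), followed by sampling: record time series at receiver nodes."""
    dx, dt = 1.0 / M, T / N
    x = np.linspace(0.0, 1.0, M + 1)
    aj, r = a(x), dt / dx
    u = phi(x).copy()
    u[0] = 0.0                                  # boundary condition
    data = np.zeros((N + 1, len(receivers)))
    data[0] = u[list(receivers)]                # S_0 U^0
    for n in range(N):
        # backward difference in space, forward in time (interior and right edge)
        u[1:] = u[1:] - aj[1:] * r * (u[1:] - u[:-1])
        u[0] = 0.0
        data[n + 1] = u[list(receivers)]        # S_{n+1} U^{n+1}
    return data

# Constant speed a = 1: the exact solution is the translation u(x, t) = phi(x - t).
a   = lambda x: np.ones_like(x)
phi = lambda x: np.exp(-200.0 * (x - 0.25) ** 2)
D = simulate(a, phi)
print(D.shape)   # (201, 2): one column of time samples per receiver
```

The receivers see the pulse pass by at successively later times, smeared slightly by the numerical diffusion of the upwind scheme.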
7.2. Back to the general case
The linearization of the map c ↦ G(c) is the result of first-order perturbation of the time-stepping equations:

DG(c)δc = \sum_{n=0}^{N} S_n δU^n,

where

δU^{n+1} = D_c H_n(c, U^n)δc + D_U H_n(c, U^n)δU^n

and δU^n = (DU(c)δc)^n. Note that if the original finite-difference scheme is linear (really affine: linear plus constant), then it can be written as

U^{n+1} = A(c)U^n + F^n,
where A(c) = D_U H_n(c, U) (D_U H_n is independent of the time level n in this case). It follows that δU satisfies

δU^{n+1} = A(c)δU^n + (DA(c)δc)U^n.

Therefore, in this common case, the linearization is computed by a finite-difference simulation identical to the original, except that the "right-hand side" F^n is replaced by (DA(c)δc)U^n. I will now show how to compute the adjoint of DG(c). The spaces C and U require inner products; these inner products will be denoted (·,·)_C and (·,·)_U, respectively. The field U belongs to U^{N+1}, and I define the inner product on U^{N+1} by

(U, V)_{U^{N+1}} = \sum_{n=0}^{N} (U^n, V^n)_U.
For convenience, and suppressing the dependence on c, write A_n for D_U H_n(c, U^n) and δF^{n+1} for D_c H_n(c, U^n)δc, with δF^0 = 0, so that the linearized scheme can be written as

δU^0 = 0, δU^{n+1} − A_n δU^n = δF^{n+1}, n = 0, 1, ..., N − 1.

This can also be written as MδU = δF, where M : U^{N+1} → U^{N+1} is the block linear operator

M = [   I      0      0    ···      0
      −A_0     I      0    ···      0
        0    −A_1     I    ···      0
        ⋮             ⋱     ⋱       ⋮
        0     ···     0  −A_{N−1}   I ]
(note that M depends on c, but I suppress this dependence). Then δU = M^{−1}δF, and the explicit time-stepping scheme is equivalent to solving MδU = δF by forward substitution. Now, write B for the operator mapping δc to δF:

(Bδc)^n = { 0,                          n = 0,
            D_c H_{n−1}(c, U^{n−1})δc,  n = 1, 2, ..., N

(again suppressing the fact that B depends on c). Then

DG(c) = S M^{−1} B,
and so

DG(c)* = B* M^{−*} S*.

Assuming that S_n, S_n^*, and the stencil operator H_n and its derivatives and adjoints D_c H_n(c, U), D_U H_n(c, U), D_c H_n(c, U)*, and D_U H_n(c, U)* are known (the reader might find it instructive to compute these derivatives and adjoints for the convection-diffusion example given above), I will now show how to compute DG(c)* from them. Note that DG(c)*δD = B* M^{−*} S* δD. Write δV = S*δD. Then, as is easy to verify,

δV^n = (S*δD)^n = S_n^* δD, n = 0, 1, ..., N.
From my choice of inner product on U^{N+1}, it follows that M* is the block linear operator

M* = [ I  −A_0^*    0     ···     0
       0    I    −A_1^*   ···     0
       ⋮          ⋱        ⋱      ⋮
       0   ···     0       I   −A_{N−1}^*
       0   ···     0       0      I ]
Write δW = M^{−*}δV, so that δW solves M*δW = δV. Since M* is block upper triangular, δW can be found by back substitution, which is equivalent to the following reverse time-stepping scheme:

δW^N = δV^N, δW^{n−1} = A_{n−1}^* δW^n + δV^{n−1}, n = N, N − 1, ..., 1.
I will refer to δW as the adjoint state and to the equation M*δW = δV as the adjoint state equation. Next I compute B*. Note that

(Bδc, δW)_{U^{N+1}} = (0, δW^0)_U + \sum_{n=1}^{N} (D_c H_{n−1}(c, U^{n−1})δc, δW^n)_U
= \sum_{n=1}^{N} (δc, D_c H_{n−1}(c, U^{n−1})^* δW^n)_C
= ( δc, \sum_{n=1}^{N} D_c H_{n−1}(c, U^{n−1})^* δW^n )_C.

This shows that

B*δW = \sum_{n=1}^{N} D_c H_{n−1}(c, U^{n−1})^* δW^n.
Thus the procedure for computing DG(c)*δD, for δD ∈ D, is:

1. Solve the simulation problem to produce the field U (needed in steps 3b and 3c).
2. Set δc to zero.
3. For n = N, N − 1, ..., 1:
   (a) Compute δV^n = S_n^* δD.
   (b) Compute δW^n by taking one step (backward in time) on the adjoint state equation (or simply δW^N = δV^N when n = N).
   (c) Add D_c H_{n−1}(c, U^{n−1})^* δW^n to the output vector δc.

A logistical problem immediately asserts itself: U is produced by stepping forward in time, δW by stepping backward. Unless the state space has small dimension (which is certainly not the typical case), storage of the entire time history of the reference field U is very expensive in terms of memory. On the other hand, one could, at each step of the backward time-stepping algorithm, re-compute the needed time level U^n by forward time-stepping from U^0. This is obviously expensive in terms of computation time. To balance the need for storage and recomputation, a checkpointing scheme due to Andreas Griewank (1992), extended in Symes et al. (1998), can be employed. The idea is to save ("checkpoint") various time levels U^n to use as intermediate initial data to restart the computation of U during the solution of the adjoint state system. A complete description of the algorithm appears in Gockenbach et al. (in press).
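A standard way to validate such an implementation is the dot-product test: (DG(c)δc, δD) = (δc, DG(c)*δD) should hold to round-off. The Python sketch below is my own illustration, not code from the paper; the stencil derivatives D_U H_n and D_c H_n and the sampling operators S_n are replaced by random matrices, and the forward and adjoint recursions above are run against each other:

```python
import numpy as np

# Dot-product (adjoint) test for the forward/adjoint time-stepping pair.
# A_n stands in for D_U H_n, C_n for D_c H_n, S_n for the sampling operators;
# all are random matrices here, placeholders for the derivatives of a real stencil.
rng = np.random.default_rng(1)
s, p, d, N = 4, 3, 2, 5            # state, parameter, data dims; number of steps
A = [rng.standard_normal((s, s)) * 0.5 for _ in range(N)]
C = [rng.standard_normal((s, p)) for _ in range(N)]
S = [rng.standard_normal((d, s)) for _ in range(N + 1)]

def forward(dc):
    """dU^0 = 0, dU^{n+1} = A_n dU^n + C_n dc; output DG(c)dc = sum_n S_n dU^n."""
    dU = np.zeros(s)
    dD = S[0] @ dU
    for n in range(N):
        dU = A[n] @ dU + C[n] @ dc
        dD = dD + S[n + 1] @ dU
    return dD

def adjoint(dD):
    """dW^N = S_N^T dD, dW^{n-1} = A_{n-1}^T dW^n + S_{n-1}^T dD (back substitution);
    accumulate C_{n-1}^T dW^n into the output parameter vector, n = N, ..., 1."""
    dW = S[N].T @ dD
    dc = C[N - 1].T @ dW
    for n in range(N - 1, 0, -1):
        dW = A[n].T @ dW + S[n].T @ dD
        dc = dc + C[n - 1].T @ dW
    return dc

dc, dD = rng.standard_normal(p), rng.standard_normal(d)
lhs = forward(dc) @ dD             # (DG(c) dc, dD)
rhs = dc @ adjoint(dD)             # (dc, DG(c)* dD)
print(abs(lhs - rhs))              # zero up to round-off
```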
8. Example: Differentiating a finite element solution operator in an inverse problem
As my final example, I will discuss the computation of the derivative and its adjoint when the operator is the (approximate) solution operator, as implemented using the finite element method, of an elliptic partial differential equation. Suppose Ω is a bounded polygonal region in the plane and consider the boundary value problem (BVP)

−∇·(a∇u) = p in Ω,
        u = 0 on ∂Ω.   (10)

This BVP models, for example, the small transverse displacements u of an elastic membrane under a transverse pressure p. The coefficient a describes the elastic properties of the membrane, and when the membrane is heterogeneous, a is a function of space: a = a(x). The usual direct problem is to compute u given the functions p and a; that is, given the elastic properties of the membrane and the pressure to which it is subjected, determine its displacement. In many applications, it is necessary to solve an inverse problem, such as: given p and a measurement of u, estimate a; that is, by observing the displacement of the membrane under a known pressure, estimate the elastic properties of the membrane. (It would also be possible to consider p as needing to be measured, that is, that it forms part of the data of the problem. To simplify the presentation, I will assume that the pressure p is known.)
One way to solve the inverse problem numerically is to use the Output Least-Squares approach, as described in the previous section, in conjunction with the finite element method for solving the BVP. Suppose that the measured data is denoted u^{obs} and a is to be chosen so that the predicted displacement u, as simulated by piecewise linear finite elements, is as close to u^{obs} as possible in the L²(Ω) norm. It is necessary to have a representation for the unknown coefficient a, and I will represent it using a piecewise linear function. Let T^{(h)} be a triangulation of Ω, and define

P^{(h)} = {φ : Ω̄ → R | φ is continuous and piecewise linear relative to T^{(h)}},
P_0^{(h)} = {φ ∈ P^{(h)} | φ = 0 on ∂Ω}.

Suppose the nodes of the triangulation T^{(h)} are x_1, x_2, ..., x_m and φ_i is the element of P^{(h)} defined by

φ_i(x_j) = { 1, j = i,
             0, j ≠ i.
Then {φ_1, φ_2, ..., φ_m} is the standard basis for the space P^{(h)}, and every element u ∈ P^{(h)} satisfies

u = \sum_{i=1}^{m} u(x_i) φ_i.
The basis functions that correspond to interior nodes comprise a basis for P_0^{(h)}; I will denote this basis by {ψ_1, ψ_2, ..., ψ_n} (there exists a sequence i_1, i_2, ..., i_n such that ψ_j = φ_{i_j}, j = 1, 2, ..., n). The finite element method for estimating the solution of (10) takes the form

find u ∈ P_0^{(h)} such that \int_Ω a∇u·∇ψ_i = \int_Ω pψ_i for all i = 1, 2, ..., n.   (11)
Upon substituting

u = \sum_{j=1}^{n} U_j ψ_j,

(11) can be written as the matrix-vector equation KU = P, where

K_{ij} = \int_Ω a∇ψ_j·∇ψ_i, i, j = 1, 2, ..., n,
P_i = \int_Ω pψ_i, i = 1, 2, ..., n.
Note that K ∈ Rn×n is symmetric and positive definite.
Now define the (approximate) solution operator of (10) as f : P^{(h)} → P_0^{(h)}, where

f : a ↦ u = \sum_{i=1}^{n} U_i ψ_i, U = K^{−1}P,

and K ∈ R^{n×n} and P ∈ R^n are defined as above. The OLS approach is then to minimize the function J : P^{(h)} → R defined by

J(a) = ½‖f(a) − u^{obs}‖²_{L²(Ω)}.

This is another nonlinear least-squares function, and its gradient is given by

∇J(a) = Df(a)*(f(a) − u^{obs}).

It is easier to compute Df(a) and Df(a)* if I explicitly recognize the fact that the bases for P^{(h)} and P_0^{(h)} make it possible to identify them with R^m and R^n, respectively. Define E : R^m → P^{(h)} by

EA = \sum_{i=1}^{m} A_i φ_i,

and note that, as discussed above, E^{−1} is defined by

(E^{−1}a)_i = a(x_i), i = 1, 2, ..., m.
Similarly, define E_0 : R^n → P_0^{(h)} by

E_0 U = \sum_{i=1}^{n} U_i ψ_i;

then (E_0^{−1}u)_k = u(x_{i_k}), k = 1, 2, ..., n. I can then write f = E_0 ∘ F ∘ E^{−1}, where F : R^m → R^n is defined by

F(A) = U, where a = \sum_{i=1}^{m} A_i φ_i, u = \sum_{i=1}^{n} U_i ψ_i, U = K^{−1}P.
I will now show how to compute DF(A) and DF(A)*. The matrix K depends on A, so I will write K = K(A). With

a = \sum_{k=1}^{m} A_k φ_k,

it follows that

K_{ij}(A) = \int_Ω a∇ψ_j·∇ψ_i
= \int_Ω ( \sum_{k=1}^{m} A_k φ_k ) ∇ψ_j·∇ψ_i
= \sum_{k=1}^{m} ( \int_Ω φ_k ∇ψ_j·∇ψ_i ) A_k
= \sum_{k=1}^{m} T_{ijk} A_k,

where

T_{ijk} = \int_Ω φ_k ∇ψ_j·∇ψ_i, i, j = 1, 2, ..., n, k = 1, 2, ..., m.
It then follows that, for any A, δA ∈ R^m,

(DK(A)δA)_{ij} = \sum_{k=1}^{m} T_{ijk} δA_k = (K(δA))_{ij},

that is, DK(A)δA = K(δA). This result is to be expected, because the operator K : A ↦ K(A) is linear in A. Since F is defined by F(A) = K(A)^{−1}P, the result from Section 4.5 applies, and

DF(A)δA = −K(A)^{−1}(DK(A)δA)K(A)^{−1}P = −K(A)^{−1}K(δA)U,

where U = F(A). This formula shows that computing DF(A)δA for a given δA is no more expensive than computing the simulated displacement U (assuming U is computed first, so that K(A) and U are already known), and may be much less expensive if the matrix K(A) has already been factored.
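The formula DF(A)δA = −K(A)^{−1}K(δA)U is easy to test against a finite difference when K(A) is assembled from a 3-tensor as above. In the Python sketch below (my own illustration), the slices T_k are random symmetric positive semidefinite matrices standing in for the finite element integrals, so that K(A) is SPD for positive parameters:

```python
import numpy as np

# Verify DF(A)dA = -K(A)^{-1} K(dA) U against a finite difference, where the
# stiffness-like matrix K(A) = sum_k A_k T_k is linear in the parameters A.
rng = np.random.default_rng(2)
n_dof, m_par = 5, 4

# Each slice T_k is symmetric positive semidefinite (as the FEM integrals
# \int phi_k grad(psi_j).grad(psi_i) would be for phi_k >= 0).
T = np.empty((n_dof, n_dof, m_par))
for k in range(m_par):
    G = rng.standard_normal((n_dof, n_dof))
    T[:, :, k] = G.T @ G

K = lambda A: np.einsum('ijk,k->ij', T, A)
P = rng.standard_normal(n_dof)
F = lambda A: np.linalg.solve(K(A), P)      # F(A) = K(A)^{-1} P

A  = 1.0 + rng.random(m_par)                # positive parameters keep K(A) SPD
dA = rng.standard_normal(m_par)

U = F(A)
dF_analytic = -np.linalg.solve(K(A), K(dA) @ U)

h = 1e-7
dF_fd = (F(A + h * dA) - F(A)) / h
print(np.max(np.abs(dF_analytic - dF_fd)))  # O(h) agreement
```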
I will now turn to the computation of DF(A)*. Note that

(K(δA)U)_i = \sum_{j=1}^{n} (K(δA))_{ij} U_j
= \sum_{j=1}^{n} \sum_{k=1}^{m} T_{ijk} δA_k U_j
= \sum_{k=1}^{m} ( \sum_{j=1}^{n} T_{ijk} U_j ) δA_k.

I now define the matrix L = L(U) by

(L(U))_{ik} = \sum_{j=1}^{n} T_{ijk} U_j,

which allows me to write K(δA)U = L(U)δA and

DF(A)δA = −K(A)^{−1}L(U)δA.

The formula for DF(A)* now follows:

DF(A)*δU = −L(U)^T K(A)^{−1} δU

(where I used the fact that K is symmetric).
which allows me to write K (δ A)U = L(U )δ A and D F(A)δ A = −K (A)−1 L(U )δ A. The formula for D f (A)∗ now follows: D F(A)∗ δU = −L(U )T K (A)−1 δU (where I used the fact that K is symmetric). The relationship between D f (a) and D F(A) is straightforward, and, indeed, is exactly analogous to the relationship between f and F. If a = E A and δa = Eδ A, that is, a=
m
Ai φi ,
δa =
i=1
n
δ Ai φi ,
i=1
then δu = D f (a)δa and δU = D F(A)δ A satisfy δu =
n
δUi ψi .
i=1
Indeed, this follows from the chain rule applied to the relationship f(a) = E_0 F(E^{−1}a).
The chain rule (along with the linearity of E^{−1} and E_0) yields

Df(a)δa = E_0 DF(E^{−1}a) E^{−1} δa,   (12)

which is the same relationship stated above. Determining the relationship between Df(a)* and DF(A)* requires taking into account the difference in inner products for P^{(h)} and R^m, and for P_0^{(h)} and R^n. Equation (12) yields

Df(a)* = (E^{−1})* DF(E^{−1}a)* E_0^*.   (13)
The operators E and E_0 map Euclidean spaces to finite element spaces, and so two different inner products are involved in the definition of each. For instance, the operator E_0^* is defined by the condition

(E_0 U, w)_{L²(Ω)} = (U, E_0^* w)_{R^n} for all U ∈ R^n, w ∈ P_0^{(h)}.
Consider V ∈ R^n, w ∈ P_0^{(h)}. Then

(E_0 V, w)_{L²(Ω)} = ( \sum_{i=1}^{n} V_i ψ_i, \sum_{j=1}^{n} w(x_{i_j}) ψ_j )_{L²(Ω)}
= \sum_{i=1}^{n} \sum_{j=1}^{n} (ψ_i, ψ_j)_{L²(Ω)} w(x_{i_j}) V_i
= \sum_{i=1}^{n} ( \sum_{j=1}^{n} (ψ_i, ψ_j)_{L²(Ω)} (E_0^{−1}w)_j ) V_i.

Defining M_0 ∈ R^{n×n} by (M_0)_{ij} = (ψ_i, ψ_j)_{L²(Ω)} (M_0 is a Gram matrix—in the context of finite elements, it is referred to as a mass matrix), this yields

(E_0 V, w)_{L²(Ω)} = \sum_{i=1}^{n} (M_0 E_0^{−1} w)_i V_i = (V, M_0 E_0^{−1} w)_{R^n}.

This shows that E_0^* = M_0 E_0^{−1}. Next, I note that

E*(E^{−1})* = (E^{−1}E)* = I* = I
by the fundamental rule (AB)* = B*A*, which shows that

(E^{−1})* = (E*)^{−1}.   (14)

The calculation of E* is exactly the same as that of E_0^*; the result is

E* = ME^{−1},   (15)

where M ∈ R^{m×m} is defined by M_{ij} = (φ_i, φ_j)_{L²(Ω)}. Together, (14) and (15) yield

(E^{−1})* = EM^{−1}.

The matrix M is symmetric and positive definite, and hence invertible; this follows from the fact that {φ_1, φ_2, ..., φ_m} is linearly independent. Using the expressions for E_0^* and (E^{−1})*, (13) yields

Df(a)* = EM^{−1} DF(E^{−1}a)* M_0 E_0^{−1}.

The appearance of the trivial mappings E and E_0^{−1} in this formula is no more significant than it was in the formula for Df(a). On the other hand, the Gram matrices M^{−1} and M_0 appear because of the different inner products used for the two pairs of isomorphic spaces.
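The role of the Gram matrices can be seen in a small numerical example: with respect to weighted inner products (x, y)_M = x^T M y, the adjoint of a matrix A mapping (R^n, M_n) into (R^m, M_m) is M_n^{−1}A^T M_m, not the bare transpose. A Python sketch (my own illustration; the random SPD matrices are placeholders for mass matrices):

```python
import numpy as np

# Adjoint with respect to weighted inner products (x, y)_M = x^T M y.
rng = np.random.default_rng(3)
n, m = 4, 3

def spd(k):
    """A random symmetric positive definite matrix, standing in for a mass matrix."""
    G = rng.standard_normal((k, k))
    return G.T @ G + k * np.eye(k)

Mn, Mm = spd(n), spd(m)
A = rng.standard_normal((m, n))
A_star = np.linalg.solve(Mn, A.T @ Mm)   # adjoint: M_n^{-1} A^T M_m

x, y = rng.standard_normal(n), rng.standard_normal(m)
lhs = (A @ x) @ (Mm @ y)        # (Ax, y) in the M_m inner product
rhs = x @ (Mn @ (A_star @ y))   # (x, A*y) in the M_n inner product
print(abs(lhs - rhs))           # zero up to round-off
print(np.allclose(A_star, A.T)) # False in general: the transpose is not the adjoint
```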
9. Avoiding the need to program derivatives
The user of a software package implementing numerical optimization algorithms is required to provide some computer code (usually a subroutine written in a given language) to evaluate the objective and constraint functions. (This is how the user specifies his or her problem to the optimization code.) Typically, the optimization code will need values of various derivatives, which can be obtained in several ways:

1. The user can provide hand-written computer codes to evaluate the derivatives.
2. The optimization code can estimate the derivatives using finite differences.
3. The derivatives can be produced, either by the user or the optimization code, using automatic differentiation.

The emphasis of my presentation so far has been on understanding the basic theory of derivatives, particularly the linear algebraic foundations, and on using this theory to derive formulas for derivatives of specific functions. Such understanding is essential for hand-coding derivatives. Suppose, though, that a user wishes to avoid the labor (and risk) of programming the derivatives of the problem functions. In this final section, I will briefly discuss the advantages
and disadvantages of the other two approaches to the computation of derivatives, finite differences and automatic differentiation.

9.1. Finite difference estimation of derivatives
Optimization codes generally use the representers of the relevant derivatives: the gradient and Hessian of a real-valued function, the Jacobian matrix of a vector-valued function. In order to be concise, I will mostly limit my discussion to the computation of the gradient of a real-valued function. Suppose3 f : Rn → R. Then ∇ f (x) is the vector in Rn whose ith component is ∂f f (x + hei ) − f (x) , (x) = lim h→0 ∂ xi h
(16)
where e_i is the ith standard basis vector (that is, the vector with every component equal to zero, except the ith, which is one). When the only information available about f is a “black box” that will return its value for a given x, it is not possible to implement (16) exactly, as the limit operation implies an infinite calculation. A natural way of approximating ∂f/∂x_i is to simply truncate the limit operation by choosing a small but nonzero value of h:

    ∂f/∂x_i(x) ≈ (f(x + h e_i) − f(x))/h.     (17)
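Formula (17) is a one-liner in most languages; the following Python sketch (with a made-up test function, not one from the text) illustrates it:

```python
def forward_diff(f, x, i, h):
    """Estimate the i-th partial derivative of f at x via formula (17)."""
    xh = list(x)
    xh[i] += h                      # form x + h*e_i
    return (f(xh) - f(x)) / h

# Illustrative function: f(x) = x1^2 + 3*x2, so df/dx1 at (2, 5) equals 4.
f = lambda x: x[0]**2 + 3.0*x[1]
est = forward_diff(f, [2.0, 5.0], 0, 1e-6)   # close to 4; error is O(h)
```

Each call costs one evaluation of f beyond f(x), a point that matters below.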
Indeed, Taylor's theorem,

    f(x + h e_i) = f(x) + ∂f/∂x_i(x) h + (1/2) ∂²f/∂x_i²(x + θ h e_i) h²,   θ ∈ (0, 1),

can easily be rearranged to show that

    (f(x + h e_i) − f(x))/h = ∂f/∂x_i(x) + (1/2) ∂²f/∂x_i²(x + θ h e_i) h,   θ ∈ (0, 1).
Thus, the error in (17) is O(h); this error is referred to as the truncation error. The question now arises: What value of h should be chosen in practice? At first glance, it would appear that smaller values of h (the smaller, the better) would tend to lead to better approximations of the partial derivative. Though this is true in exact arithmetic, it does not take into account the effects of floating point (computer) arithmetic. First of all, h cannot be chosen too small in comparison to x_i; otherwise, the values of x_i and x_i + h, rounded to the nearest floating point number, will be identical (and therefore, necessarily, so will be f(x + h e_i) and f(x)). More subtly, the magnitude of ∂²f/∂x_i² plays a part. A computer subroutine implementing the evaluation of f will inevitably return inexact results, because of round-off error if for no other reason. Suppose the implemented function
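The first effect is easy to observe in double precision; a quick Python check (values chosen for illustration):

```python
x = 1.0e8
h = 1.0e-9      # far smaller than the spacing of doubles near 1e8 (about 1.5e-8)
collision = (x + h == x)   # True: x + h rounds back to x, so
                           # f(x + h*e_i) - f(x) would be exactly zero
```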
actually returns f̂(x) = f(x) + e(x), with |e(x)| ≤ ε for all relevant values of x. Then formula (17) will be implemented as

    ∂f/∂x_i(x) ≈ (f̂(x + h e_i) − f̂(x))/h
               = (f(x + h e_i) + e(x + h e_i) − f(x) − e(x))/h
               = (f(x + h e_i) − f(x))/h + (e(x + h e_i) − e(x))/h
               = ∂f/∂x_i(x) + (1/2) ∂²f/∂x_i²(x + θ h e_i) h + (e(x + h e_i) − e(x))/h.

There is no reason to expect that the function e is differentiable, so all that can be said about the last term in the above expression is that

    |(e(x + h e_i) − e(x))/h| ≤ 2ε/h.

If the second partial derivative of f is bounded by M, then

    |∂f/∂x_i(x) − (f̂(x + h e_i) − f̂(x))/h| ≤ Mh/2 + 2ε/h.     (18)
This bound suggests that the total error in the approximation can grow as h → 0, since the round-off error (or at least the bound for it) grows as h is decreased. Thus smaller values of h are not necessarily better in practice, and so the question remains: How should h be chosen in practice? One idea would be to choose h to minimize the bound in (18). This leads to the value

    h = 2√ε/√M.

However, this result is of limited use, since the value M is not available in general. It does suggest, however, that

    h = O(√ε)
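This tradeoff between truncation and round-off error is easy to demonstrate; a small Python experiment with f = exp (chosen for illustration, since the exact derivative is known):

```python
import math

x = 1.0
exact = math.exp(x)                 # f'(x) = exp(x)
errors = {}
for k in range(1, 16):
    h = 10.0**(-k)
    est = (math.exp(x + h) - math.exp(x)) / h
    errors[k] = abs(est - exact)

# The error shrinks roughly like h down to h near sqrt(machine epsilon) ~ 1e-8,
# then grows again as the round-off contribution (bounded by 2*eps/h) takes over.
```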
is reasonable, and an estimate of ε might be available. The usual choice is

    h = sign(x_i)|x_i|√ε,     (19)

with some adjustment made if |x_i| is too close to zero. With h determined by some variation on (19), the error in the computed partial derivative is O(√ε). This leads to the first disadvantage of using finite difference estimates for partial derivatives, and thus for gradients: the attainable accuracy in the computed minimizer is limited. After all, algorithms for numerical optimization are based on the necessary condition that ∇f(x) = 0 at a minimizer (or the analogous Lagrange multiplier conditions, which also involve ∇f(x), for a constrained minimization problem). It is easy to see that the minimizer cannot be reliably computed to an accuracy greater than the accuracy with which the gradient is computed.

The foregoing disadvantage of finite differences is only important for small problems, when it is reasonable (and may be important) to compute the solution to a high degree of accuracy. A more serious objection is related to the computational cost of using finite differences: to estimate ∇f(x) costs n evaluations of the function f (assuming that f(x) must be computed anyway in the course of the optimization algorithm). For any problem in which it is expensive to evaluate f, or n is large, or both, this cost may be unacceptable. By comparison, the examples given in Sections 7 and 8 yield formulas that will result in the computation of the gradient at a cost equal to a small multiple of the cost of computing the function itself. (Note, though, that in the case of the adjoint state method detailed in Section 7, this efficiency depends on the use of the checkpointing scheme that was only briefly mentioned.)

The major advantage of using finite differences is obvious: the user need only implement the problem functions and not their derivatives. The optimization code can then take care of all details concerning the estimation of derivatives, including the choice of the step size h (although the user may need to provide an estimate of ε).
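A full gradient estimator along these lines might look like the following Python sketch (the safeguard near zero is one simple choice among many):

```python
import math

def fd_gradient(f, x, eps=2.2e-16):
    """Forward-difference gradient; step size per rule (19), safeguarded near zero."""
    sqrt_eps = math.sqrt(eps)
    fx = f(x)                        # f(x) is assumed to be needed anyway
    g = []
    for i in range(len(x)):
        h = x[i] * sqrt_eps if abs(x[i]) > sqrt_eps else sqrt_eps
        xh = list(x)
        xh[i] += h                   # n extra evaluations of f in total
        g.append((f(xh) - fx) / h)
    return g

# Illustration: f(x) = x1^2 + x1*x2 has gradient (2*x1 + x2, x1) = (8, 3) at (3, 2).
g = fd_gradient(lambda x: x[0]**2 + x[0]*x[1], [3.0, 2.0])
```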
When the cost is affordable and there is no need for high accuracy solutions, this makes finite differences an attractive option. Although I will not discuss it here, finite difference methods can also be devised for computing Jacobian and Hessian matrices.

9.2. Automatic differentiation
Automatic, or algorithmic, differentiation (AD) is a term applied to a collection of techniques for automatically producing derivatives of functions implemented in computer subroutines. AD tools can analyze a computer program that implements a mathematical function or operator, and systematically apply the rules of differentiation, notably the chain rule, producing a new computer program implementing the desired derivative. There are two primary approaches to automatic differentiation, operator overloading and source transformation. I will only discuss the source transformation approach, and, indeed, will focus on a single AD tool, TAMC (Giering, 1999). Another source transformation
tool is ADIFOR (Bischof et al., 1992). An AD package that uses the operator overloading approach is ADOL-C (Griewank et al., 1996). For a more complete discussion of automatic differentiation, see Griewank (2000). The following discussion is taken from the report (Gockenbach, 2000), which may be consulted for more details.

The Tangent linear and Adjoint Model Compiler (TAMC), designed and implemented by Ralf Giering (1999), is an automatic differentiation (AD) package that produces linearized and adjoint code for nonlinear operators. To be more precise, given a Fortran subroutine implementing an operator of the form F : Rⁿ → Rᵐ, TAMC can produce code that computes DF(x)δx and DF(x)*δy. TAMC can also produce derivatives and adjoints for operators defined on product spaces, such as G : Rⁿ × Rᵏ → Rᵐ.

Although TAMC produces correct and efficient code, the exact operation of TAMC-generated code can be slightly counter-intuitive to those not well-versed in AD. I will present an explicit mathematical model for an operator as implemented by a computer program, and explain the output of TAMC in terms of this model. The following simple example will serve to introduce some of the issues encountered in using TAMC. Define F : R → R by y = F(x) = x². This operator is implemented in the following Fortran subroutine:

      subroutine F(x,y)
      double precision x,y
      y = x*x
      return
      end
TAMC generates the following adjoint code (stripped of TAMC-generated comments):

      subroutine adf(x,adx,ady)
      implicit none
      double precision adx
      double precision ady
      double precision x
      adx = adx+2*ady*x
      ady = 0.d0
      end
This code correctly computes the adjoint of DF(x); however, the value DF(x)*δy is added to (rather than assigned to) the output variable. Moreover, the input variable is assigned the value of zero after it is used. That is, instead of implementing

    δx ← DF(x)*δy,
the TAMC-generated code implements

    δx ← δx + DF(x)*δy,
    δy ← 0.

Below I will show how this result could have been predicted.

9.2.1. The mathematical structure of a subroutine implementing an operator. Consider an operator F : Rⁿ → Rᵐ. A Fortran subroutine implementing y ← F(x) would have arguments x, y, as well as (possibly) other arguments involved in the definition of the operator (grid parameters, constants, etc.). The subroutine (which may call other subroutines) consists of a sequence of statements which together perform the desired calculation. A number of variables are involved: x and y, any variables required to hold intermediate quantities, loop control indices, etc. Some of the variables merely serve to control the flow of the executable statements, and are not important in developing a mathematical model of the subroutine. The crucial variables are the active variables, which are necessarily of floating point type. A variable u is active if, for some step i,

• the final value of an output variable (i.e. one of the components of y) depends on the value of u at step i, and
• the value of u at step i depends on the initial value of an input variable (one of the components of x).

The input and output variables are active by definition. The phrase “variable w at step j depends on the value of variable z at step i” will be left undefined. The intuitive meaning is that the value of w at step j is linked to the value of z by a sequence of assignment statements, the last of which assigns a value to w, and the first of which (and perhaps others) has z, holding its value from step i, in the right-hand side. (A precise definition of this concept would be required to implement a package such as TAMC, but is not needed to understand how it works.) Now let S be the set of all active variables appearing in the subroutine (or in subroutines called by it—I will not bother to make this distinction).
Identifying S with a Euclidean space Rᴺ, F : Rⁿ → Rᵐ can be viewed as the composition of operators

    F = P ∘ F_M ∘ F_{M−1} ∘ · · · ∘ F_1 ∘ Q*,

where

    F_i : S → S,   i = 1, 2, . . . , M,

and Q and P are the natural projections onto the domain and range, respectively, of F. (That is, P : S → Rᵐ is defined by

    (Ps)_i = s_{j_i},   i = 1, 2, . . . , m,
where y_i = s_{j_i} (recall that each active variable, including every component of y, is identified with a component of s). The projection Q : S → Rⁿ is defined similarly.) Each statement assigning a value to an active variable can be thought of as implicitly defining one of the operators F_i, and it is in the sense of these operators that I spoke of “steps” in the previous paragraph. Of course, most steps will only involve a few variables, so most of the active variables will retain their previous values. The role of the projectors Q and P should be clear: Q* assigns to the input variables their initial values and assigns zero to all other variables; P extracts from the set of all active variables the output F(x).

There may be other floating point variables that are not active by this definition. For example, if the final value of an output variable depends on the value of z at some step, but that value of z does not depend on the initial value of any input variable, then z plays the role of a constant. (Input arguments to the subroutine other than x can be constants.) It is also possible to have variables which depend on the input variables, but do not influence any output variable. Such a variable can be called diagnostic. For an example of a diagnostic variable, consider the variable z in the following program fragment:

      z = x(1)*x(1)+x(2)*x(2)
      if (z.gt.1.0d0) then
         y(1) = 2.0d0*x(1)
      else
         y(1) = x(1)*x(1)
      endif
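The composition F = P ∘ F_M ∘ · · · ∘ F_1 ∘ Q* can be made concrete in a toy Python model (the operator and the layout of S here are invented for illustration): the state s holds all active variables, Q* injects the inputs, each F_i updates one slot, and P extracts the output.

```python
# Toy model of F(x) = (x1 + x2)^2 with state s = (x1, x2, t, y).

def Q_star(x):                       # Q*: R^2 -> S, inputs in, all else zero
    return [x[0], x[1], 0.0, 0.0]

def F1(s):                           # step 1: t = x1 + x2
    return [s[0], s[1], s[0] + s[1], s[3]]

def F2(s):                           # step 2: y = t*t (other slots unchanged)
    return [s[0], s[1], s[2], s[2] * s[2]]

def P(s):                            # P: S -> R, extract the output variable
    return s[3]

def F(x):
    return P(F2(F1(Q_star(x))))

value = F([1.0, 2.0])                # (1 + 2)^2 = 9
```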
Constant and diagnostic variables are called passive variables.

By way of example, consider an assignment statement of the form w = g(u). Assuming u and w are active variables, say u = s_{i_1}, w = s_{i_2} (s ∈ S), this implicitly defines the operator F_k : S → S,

    (F_k(s))_j = { g(s_{i_1}),  j = i_2,
                 { s_j,        j ≠ i_2.

It is instructive to compute DF_k(s) and (DF_k(s))*. The derivative is given by

    (DF_k(s)δs)_j = { g′(s_{i_1}) δs_{i_1},  j = i_2,
                    { δs_j,                 j ≠ i_2.

Therefore,

    (DF_k(s)δs, δr) = Σ_{j≠i_2} δs_j δr_j + g′(s_{i_1}) δs_{i_1} δr_{i_2}
                    = Σ_{j≠i_1,i_2} δs_j δr_j + δs_{i_1}(δr_{i_1} + g′(s_{i_1}) δr_{i_2}) + δs_{i_2}·0,

which shows that

    ((DF_k(s))*δr)_j = { δr_{i_1} + g′(s_{i_1}) δr_{i_2},  j = i_1,
                       { 0,                               j = i_2,
                       { δr_j,                            j ≠ i_1, i_2.
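The adjoint formula can be checked numerically through the defining identity (DF_k(s)δs, δr) = (δs, (DF_k(s))*δr); a Python sketch with g = sin and arbitrarily chosen indices i_1 = 1, i_2 = 3:

```python
import math, random

# State of N = 4 variables; the statement w = g(u) with u = s[1], w = s[3].
i1, i2, N = 1, 3, 4
g, gprime = math.sin, math.cos

def DFk(s, ds):                     # tangent: only slot i2 changes
    out = list(ds)
    out[i2] = gprime(s[i1]) * ds[i1]
    return out

def DFk_adj(s, dr):                 # adjoint, per the formula above
    out = list(dr)
    out[i1] = dr[i1] + gprime(s[i1]) * dr[i2]
    out[i2] = 0.0
    return out

random.seed(0)
s  = [random.random() for _ in range(N)]
ds = [random.random() for _ in range(N)]
dr = [random.random() for _ in range(N)]
lhs = sum(a * b for a, b in zip(DFk(s, ds), dr))
rhs = sum(a * b for a, b in zip(ds, DFk_adj(s, dr)))
# lhs and rhs agree up to round-off, confirming the adjoint formula.
```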
This calculation shows the basic pattern, and explains some of the apparently anomalous code generated by TAMC. The behavior of TAMC can be understood as follows: it regards a subroutine as a sequence of operators acting on the set of active variables, and it ignores the projectors P and Q. That is, TAMC identifies the set of active variables and then identifies each assignment statement in a program as implementing an operator G_i : S → S. Thus, instead of viewing a subroutine implementing F as the composition

    P ∘ F_M ∘ F_{M−1} ∘ · · · ∘ F_1 ∘ Q*,

TAMC regards it as the composition

    G_K ∘ G_{K−1} ∘ · · · ∘ G_1.

For some subroutines, K = M and G_i = F_i for all i, but this is not necessarily the case.

9.2.2. Two examples. Consider first F : R → R defined by F(x) = x². This operator is implemented in the following Fortran subroutine:

      subroutine F(x,y)
      double precision x,y
      y = x*x
      return
      end

The only active variables are x and y, and I will write a typical element s ∈ S as (x, y). The “desired” interpretation of the subroutine is that it implements F = P ∘ F_1 ∘ Q*, where

    Q*x = (x, 0),
    F_1(x, y) = (x, x²),
    P(x, y) = y.

The interpretation adopted by TAMC is that the subroutine implements the operator G_1(x, y) = (x, x²). Note that

    DG_1(x, y)(δx, δy) = (δx, 2xδx),
    (DG_1(x, y))*(δx, δy) = (δx + 2xδy, 0).

The derivative of F is given by DF(x)δx = 2xδx. TAMC produces the following subroutine for the derivative:

      subroutine g_f(x,g_x,g_y)
      implicit none
      double precision g_x
      double precision g_y
      double precision x
      g_y = 2*g_x*x
      end

This subroutine performs the computation DG_1(x, y)(δx, δy) = (δx, 2xδx); that is, it performs the operation

    δx ← δx (implicitly),
    δy ← 2xδx.

In this case, the subroutine can also be regarded as performing the desired operation δy ← DF(x)δx. The adjoint of DF(x) is given by

    DF(x)*δy = 2xδy.
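Both the tangent-linear action 2xδx and the adjoint action 2xδy, with TAMC's add-to and zero-out conventions, can be mimicked in a few lines of Python (a sketch of the semantics, not TAMC output):

```python
def g_f(x, g_x):
    """Tangent-linear code: returns g_y = DF(x) g_x = 2*x*g_x."""
    return 2.0 * x * g_x

def adf(x, adx, ady):
    """Adjoint code: adx accumulates DF(x)^* ady, and ady is zeroed."""
    adx = adx + 2.0 * ady * x
    ady = 0.0
    return adx, ady

dy = g_f(3.0, 1.0)                  # DF(3) applied to 1: gives 6
dx, dy_out = adf(3.0, 0.0, 1.0)     # DF(3)^* applied to 1, with adx initialized to 0
```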
TAMC produces the following adjoint code:

      subroutine adf(x,adx,ady)
      implicit none
      double precision adx
      double precision ady
      double precision x
      adx = adx+2*ady*x
      ady = 0.d0
      end
This subroutine performs the operation (δx, δy) ← (DG_1(x, y))*(δx, δy), as would be predicted by my discussion above. In terms of the original operator F, the Fortran command

      call adf(x,dx,dy)

does the following:

• adds the value DF(x)*δy to dx, assuming that x and dy have been initialized to hold the values of x and δy, respectively;
• sets dy to zero.

To use adf in the desired manner (given x, δy, compute δx = DF(x)*δy), one would have to perform the following steps:

1. initialize x and dy to the values of x and δy, respectively;
2. set dx to zero;
3. save dy (assuming that its value is wanted later);
4. call adf(x,dx,dy);
5. restore the value of dy.
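These bookkeeping steps can be hidden in a small driver; a Python sketch (mimicking adf's semantics for F(x) = x² rather than calling the Fortran code):

```python
def adf(x, adx, ady):
    """Mimic of the TAMC adjoint for F(x) = x^2 (add-to output, zero input)."""
    adx = adx + 2.0 * ady * x
    ady = 0.0
    return adx, ady

def apply_adjoint(x, dy):
    """Return dx = DF(x)^* dy, following steps 1-5 above."""
    dx = 0.0                    # step 2: zero the accumulator
    saved_dy = dy               # step 3: save dy
    dx, dy = adf(x, dx, dy)     # step 4: call the adjoint code
    dy = saved_dy               # step 5: restore dy
    return dx

dx = apply_adjoint(3.0, 2.0)    # DF(3)^* applied to 2: 2*3*2 = 12
```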
Alternatively, one could hand-edit the routine adf so as to

1. replace the “add-to” statements with simple assignments;
2. remove the statements that change the input variable (dy in this example).

As a second example, suppose the operator

    F(x, y) = x²y
is implemented so that the result F(x, y) overwrites one of the inputs, say (arbitrarily) y. This is done in the following subroutine:

      subroutine F(x,y)
      double precision x,y,w
      w = x*x
      y = w*y
      return
      end
The active variables are now x, y, w, and F can be written as F = P ∘ F_2 ∘ F_1 ∘ Q*, where

    F_1(x, y, w) = (x, y, x²),
    F_2(x, y, w) = (x, wy, w).

TAMC regards the subroutine as implementing G_2 ∘ G_1, where, again, G_1 = F_1 and G_2 = F_2. Now,

    DG_1(x, y, w)(δx, δy, δw) = (δx, δy, 2xδx),
    DG_2(x, y, w)(δx, δy, δw) = (δx, δwy + δyw, δw),

and

    (DG_1(x, y, w))*(δx, δy, δw) = (δx + 2xδw, δy, 0),
    (DG_2(x, y, w))*(δx, δy, δw) = (δx, δyw, δw + yδy).

The TAMC-generated code for the derivative is:

      subroutine g_f(x,y,g_x,g_y)
      implicit none
      double precision g_x
      double precision g_y
      double precision x
      double precision y
      double precision g_w
      double precision w
      g_w = 2*g_x*x
      w = x*x
      g_y = g_w*y+g_y*w
      end
The first executable statement computes the action of DG_1(x, y, w), the second computes G_1(x, y, w), and the third computes the action of DG_2(G_1(x, y, w)). The behavior of this subroutine is exactly as expected, and as desired; the Fortran command

      call g_f(x,y,dx,dy)

overwrites dy with DF(x, y)(δx, δy), assuming that x, y, dx, and dy have previously been initialized with the values x, y, δx, and δy, respectively. TAMC generates the following code for the adjoint:

      subroutine adf(x,y,adx,ady)
      implicit none
      double precision adx
      double precision ady
      double precision x
      double precision y
      double precision adw
      double precision w
      adw = 0.d0
      w = x*x
      adw = adw+ady*y
      ady = ady*w
      adx = adx+2*adw*x
      adw = 0.d0
      end
This subroutine initializes the local variable adw to zero (the first executable statement), computes G_1(x, y, w) (the second statement), applies (DG_2(G_1(x, y, w)))* (statements three and four), and applies (DG_1(x, y, w))* (statements five and six). This behavior is consistent with my description above; note that its effect is the following:

    δy ← x²δy,
    δx ← δx + 2xyδy.

The desired behavior of the subroutine is

    δy ← x²δy,
    δx ← 2xyδy.
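This effect can be checked with a line-by-line Python mimic of adf (a sketch of its semantics; x = 3, y = 5, δy = 1 chosen for illustration):

```python
def adf(x, y, adx, ady):
    """Mimic of the TAMC adjoint for F(x, y) = x^2 * y."""
    adw = 0.0                   # statement 1: initialize the local adjoint variable
    w = x * x                   # statement 2: recompute G1
    adw = adw + ady * y         # statements 3-4: apply (DG2)^*
    ady = ady * w
    adx = adx + 2.0 * adw * x   # statements 5-6: apply (DG1)^*
    adw = 0.0
    return adx, ady

# Expect dy <- x^2 * dy = 9 and dx <- dx + 2*x*y*dy = 30.
dx, dy = adf(3.0, 5.0, 0.0, 1.0)
```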
Therefore, to use adf in the desired manner (given x, y, δz, compute (δx, δy) = DF(x, y)*δz), one would have to perform the following steps:

1. initialize x, y, and dy to the values of x, y, and δz, respectively;
2. set dx to zero;
3. call adf(x,y,dx,dy).

Alternatively, one could hand-edit the routine adf so as to

1. replace the “add-to” statement computing adx with a simple assignment.

The operator F has two partial derivatives, which must be handled differently by the code, since the subroutine implementing F treats the two input variables differently (one is overwritten with the output). I will show how TAMC handles the partial derivative with respect to x. Suppose y is fixed, and G is defined by G(x) = F(x, y). The same subroutine is used to implement G; this subroutine is unusual in that y plays the role of both a constant and the output variable. To fit this subroutine into the mathematical structure defined above, write G = P ∘ F_3 ∘ F_2 ∘ F_1 ∘ Q*. The active variables are still x, y, w, and now Q* takes the following form:

    Q*x = (x, 0, 0).

As before, P(x, y, w) = y. Let y_0 be the value of y input to the subroutine; then

    F_1(x, y, w) = (x, y_0, w),
    F_2(x, y, w) = (x, y, x²),
    F_3(x, y, w) = (x, wy, w).

However, TAMC views the subroutine as still implementing G_2 ∘ G_1, where G_1 and G_2 are exactly as before. The effect of this is that the TAMC-generated code for the partial derivative actually computes the total derivative DF(x, y)(δx, δy). To use it in the desired manner requires that δy be initialized to zero. The TAMC-generated code for the adjoint computes both δx and δy from the input δy, as described above. The desired value of δx is computed, provided the variable passed to the subroutine to hold this value is first initialized to zero.

9.2.3. Discussion. The above description of the AD tool TAMC is intended to give the reader some feel for the capabilities of automatic differentiation, and to provide a basis for discussing its advantages and disadvantages. Although the examples I gave were very simple, they will serve to illustrate the points I make below.
The apparent advantages of AD are:

• The user avoids the labor-intensive and error-prone task of implementing derivatives by hand.
• If it is necessary to modify the original function, its derivatives can be modified automatically by the AD tool, again saving time spent writing and debugging code.
• Modern AD tools can handle very complex code and produce, in many cases, efficient derivative code.

The disadvantages are not quite so obvious, and stem from the fact that the foregoing advantages are not quite realized:

• My discussion of the code generated by TAMC shows that, in fact, the code produced by a fully automatic AD tool may need some modification by hand to achieve the desired results as efficiently as possible.
• Code produced by a fully automatic AD tool can be inefficient for certain applications. An example of this is provided by the adjoint state calculation of Section 7; an AD tool would tend to either save or re-compute all of the intermediate computations needed for the reverse time-stepping calculation. Either approach is significantly inefficient in some applications.

In summary, my view is that automatic differentiation is very useful in a variety of situations, but it must be used with care if efficiency is a prime consideration. In particular, “fully automatic” differentiation may not be satisfactory for some applications.⁴

Notes

1. I am far from the first to notice this. See, for example, Groetsch (1980): A closely guarded secret in some elementary calculus courses is the fact that the basic idea of differential calculus is the local approximation of a nonlinear function by a linear function.
To quote from Dieudonné (1969), “In the classical teaching of calculus, this idea is immediately obscured by the accidental fact that, on a one-dimensional vector space, there is a one-to-one correspondence between linear forms and numbers, and therefore the derivative at a point is defined to be a number instead of a linear form.” The texts of Dieudonné and Groetsch are general references for the material in this paper; see also the survey paper by Tapia (1971).

2. Probably the most common difficulty encountered when using packaged optimization software results from the user's providing incorrect derivatives.

3. It is possible to compute gradients without referring to coordinates explicitly, for instance in the formula ∇J(x) = DF(x)*(F(x) − d) for J(x) = (1/2)‖F(x) − d‖². However, finite difference derivatives depend on an explicit coordinate representation, and so I may as well assume that the function is defined on Rⁿ.

4. See Griewank (2000), page 92: As a rule, a general-purpose AD tool will not produce transformed code as efficient as that produced by a special-purpose translator designed to work only with underlying code of a particular structure, since the latter can make assumptions (often with far-reaching consequences), whereas the former can only guess.
A tool that requires the user to rewrite the underlying code in order to make explicit such factors as variable dependence, structural sparsity, interface width, and memory access patterns will be able to produce efficient transformed code more easily than a tool that must use internal analysis of the underlying program to deduce these structural features, but that does not require such a great effort from the user. On the other hand, the first sort of tool is difficult to apply to legacy code. A possible compromise is a tool that allows, but does not require, the user to insert directives into the program text. The latest version of TAMC does allow user-defined directives, making efficient code more attainable.
References

C. Bischof, A. Carle, G. Corliss, A. Griewank, and P. Hovland, “ADIFOR: Generating derivative code from Fortran programs,” Scientific Programming, Vol. 1, pp. 1–29, 1992.
J. Dieudonné, Foundations of Modern Analysis, Academic Press: New York, London, 1969.
R. Giering, Tangent Linear and Adjoint Model Compiler, User Manual 1.4, 1999. URL: http://puddle.mit.edu/~ralf/tamc
M. S. Gockenbach, “Understanding code generated by TAMC,” Department of Computational and Applied Mathematics, Rice University, Houston, TX, Technical Report TR00-30, 2000.
M. S. Gockenbach, D. R. Reynolds, and W. W. Symes, “Efficient and automatic implementation of the adjoint state method,” in press.
A. Griewank, “Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation,” Optimization Methods and Software, Vol. 1, pp. 35–54, 1992.
A. Griewank, D. Juedes, and J. Utke, “ADOL-C, a package for the automatic differentiation of algorithms written in C/C++,” ACM TOMS, Vol. 22, pp. 131–167, 1996.
A. Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM: Philadelphia, 2000.
C. W. Groetsch, Elements of Applicable Functional Analysis, Marcel Dekker: New York, 1980.
W. W. Symes, J. O. Blanch, and R. Versteeg, “A numerical study of linear inversion in layered viscoacoustic media,” in Comparison of Seismic Inversion Methods on a Single Real Dataset, R. Keys and D. Foster, eds., Society of Exploration Geophysicists: Tulsa, 1998.
R. Tapia, “The differentiation and integration of nonlinear operators,” in Nonlinear Functional Analysis and Applications, L. Rall, ed., Academic Press: New York, 1971.