PSYCHOMETRIKA--VOL. 55, NO. 3, 417-428, SEPTEMBER 1990
MAJORIZATION AS A TOOL FOR OPTIMIZING A CLASS OF MATRIX FUNCTIONS
HENK A. L. KIERS
UNIVERSITY OF GRONINGEN
The problem of minimizing a general matrix trace function, possibly subject to certain constraints, is approached by means of majorizing this function by one having a simple quadratic shape and whose minimum is easily found. It is shown that the parameter set that minimizes the majorizing function also decreases the matrix trace function, which in turn provides a monotonically convergent algorithm for minimizing the matrix trace function iteratively. Three algorithms based on majorization for solving certain least squares problems are shown to be special cases. In addition, by means of several examples, it is noted how algorithms may be provided for a wide class of statistical optimization tasks for which no satisfactory algorithms seem available. Key words: trace optimization, majorization, alternating least squares.
Introduction

Many multivariate statistical techniques are based on fitting a model to observed data. In a great variety of cases, these models are fitted by least squares, and as a consequence, one has to minimize a least squares loss function, possibly subject to certain constraints on the parameters in the model. Least squares loss functions can often be expressed in terms of rather simple matrix trace functions. The main purpose of the present paper is to provide a general algorithm that can be used for optimizing such matrix trace functions. It is not our intention to provide an algorithm that is the most efficient in all situations, but merely to offer a simple algorithm that is easily programmed by anyone who is faced with matrix trace optimization problems for which no straightforward solutions are available. A class of functions that would include all those of interest in multivariate statistical analysis can probably not be given, but in practice many statistical problems can be reformulated as special cases of the optimization of the general function

f(X) = c + tr AX + Σ_{j=1}^{m} tr B_j X C_j X'.   (1)
Here, A is a fixed p × n matrix; B_j a fixed n × n matrix; C_j a fixed p × p matrix, j = 1, ..., m; X an unknown n × p matrix; and c a constant that does not depend on X. The matrices A, B_j, and C_j at this point are arbitrary but of the required orders; therefore, the problem of maximizing f(X) can readily be translated into the problem of minimizing f(X) by merely changing the sign of the matrices A and B_j.
The Netherlands organization for scientific research (NWO) is gratefully acknowledged for funding this project. This research was conducted while the author was supported by a PSYCHON-grant (560-267-011) from this organization. The author is obliged to Jos ten Berge, Willem Heiser, and Wim Krijnen for helpful comments on an earlier version of this paper. Requests for reprints should be addressed to Henk A. L. Kiers, Department of Psychology, Grote Kruisstr. 2/1, 9712 TS Groningen, THE NETHERLANDS.
Thus, without loss of generality, the general problem of optimizing f(X) can always be treated as the problem of minimizing f(X), and this is assumed in the sequel.

The present paper focuses on algorithms that minimize matrix trace functions monotonically. That is, at each step the function to be minimized is decreased, until no further decrease is obtained. If the function is bounded from below, convergence of such algorithms is guaranteed. The algorithms proposed for minimizing the general function f(X) are based on majorizing f(X) by a function g(X) whose minimum is easily obtained. The majorizing function is chosen such that the parameter set for which the majorizing function attains its minimum will also decrease the function to be minimized. This technique has been applied in the context of multidimensional scaling (e.g., de Leeuw, 1988; de Leeuw & Heiser, 1980; Meulman, 1986) and, more recently, for minimizing the type of function treated here. In fact, for certain special cases of f(X), the technique has been used explicitly by Meulman (1986, pp. 148-149), Heiser (1987), Bijleveld and de Leeuw (1987), and Kiers, ten Berge, Takane, and de Leeuw (1990).

The above type of algorithms can easily be incorporated in algorithms for minimizing functions over more than one set of parameters. Just as in alternating least squares (see de Leeuw, Young, & Takane, 1976; Wold, 1966; Young, de Leeuw, & Takane, 1976), where the function to be minimized is minimized over each of the subsets of parameters in an alternating fashion, the present type of algorithms can be used to minimize such a function by alternatingly decreasing it over each of the subsets. Each step in this alternating procedure results in a monotone decrease of the function, and the iterations continue until no considerable decrease in the function value is observed. Convergence of this procedure is guaranteed by the monotonicity of each of the separate steps.

The function f(X) can be minimized over arbitrary X, but in many practical applications, constraints are imposed, the most frequent one being columnwise orthonormality: X'X = I_p. Another common constraint is that of reduced rank; that is, the rank of X is forced to be less than or equal to r (r < min (n, p)). The present paper discusses only these three cases: unrestricted X, columnwise orthonormal X, and X of reduced rank. The next section presents an algorithm for minimizing f(X) over arbitrary X, which is then used in the subsequent section to develop methods for minimizing f(X) when X is columnwise orthonormal or of reduced rank. It is shown that the recent algorithms in the literature can be interpreted as special cases of the general algorithms proposed here. In addition, the algorithms for the general function can easily be adapted for a wide class of minimization problems, and we illustrate this using several optimization problems in multivariate statistics. Typically, the algorithms currently available for these problems are problematic in the sense that no proof of monotone convergence is given. However, using the general function algorithms, monotone convergence is guaranteed.

Minimization of the General Function over Arbitrary X

To minimize f(X) over X, an algorithm is given by which X is updated iteratively, and thus, it is determined entirely by the formula that expresses the update of X in terms of the matrix of current values X̃. The algorithm decreases f(X) monotonically when the matrix Y (n × p), the update of X̃, is chosen such that f(Y) ≤ f(X̃).
The convergence of any algorithm that monotonically decreases f(X) is guaranteed, provided that the function is bounded from below. In many practical applications, boundedness is obtained by imposing constraints on X, but it is also possible that the function is bounded for arbitrary X because of some relation that exists between the matrices A, B_j, and C_j, j = 1, ..., m. A simple example is given by m = 1, A = -2H'G, B_1 = G'G, and C_1 = H'H. Then,
f(X) = -2 tr H'GX + tr G'GXH'HX' = ||I - GXH'||² - tr I. From ||I - GXH'||² ≥ 0, it follows that f(X) ≥ -tr I, and hence, f is bounded from below. In the sequel, it will be assumed that the function at hand is bounded from below, since if this were not the case, the minimization problem would be without solution and of no practical interest. Assuming boundedness, finding a monotonically convergent algorithm for minimizing f(X) reduces to the construction of an updating formula that expresses the update Y in terms of X̃ so that f(Y) ≤ f(X̃). The following theorem provides such an update Y.
Theorem 1. Let

Y = X̃ - (2 Σ_j α_j)^{-1} (A' + Σ_j B_j X̃ C_j + Σ_j B_j' X̃ C_j'),   (2)

where Σ_j denotes Σ_{j=1}^{m}, and α_j is to be chosen as a scalar greater than or equal to the first eigenvalue of the symmetric part of (B_j ⊗ C_j'), where ⊗ denotes the Kronecker product, with the requirement that Σ_j α_j > 0. For any matrix Y satisfying (2),

f(Y) ≤ f(X̃).   (3)
The updating formula for X given in Theorem 1 is in fact based on the idea of finding a function g that majorizes the original function f, while g(X̃) = f(X̃). Because of the latter equality, minimizing the majorizing function g over X yields a Y for which f(Y) never exceeds f(X̃), as is made clear in Figure 1. In this way, f is decreased by replacing X̃ by the update Y. A more detailed explanation of this idea can be found in Meulman (1986, pp. 185-192).
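To make the update concrete, the following sketch (ours, not part of the original paper; all variable names are illustrative) implements (2) in Python/NumPy for the bounded example given above. For that example, B_1 and C_1 are symmetric p.s.d., so the first eigenvalue of the symmetric part of (B_1 ⊗ C_1') is simply the product of their largest eigenvalues.

```python
import numpy as np

def f(X, c, A, Bs, Cs):
    """The general trace function (1): f(X) = c + tr AX + sum_j tr B_j X C_j X'."""
    val = c + np.trace(A @ X)
    for B, C in zip(Bs, Cs):
        val += np.trace(B @ X @ C @ X.T)
    return val

def update(X, A, Bs, Cs, alphas):
    """The update (2): Y = X - (2 sum_j alpha_j)^(-1) (A' + sum_j B_j X C_j + sum_j B_j' X C_j')."""
    S = A.T.copy()
    for B, C in zip(Bs, Cs):
        S += B @ X @ C + B.T @ X @ C.T
    return X - S / (2.0 * sum(alphas))

# Bounded example from the text: f(X) = ||I - G X H'||^2 - tr I, that is,
# m = 1, A = -2 H'G, B_1 = G'G, C_1 = H'H, and c = 0.
rng = np.random.default_rng(0)
q, n, p = 4, 5, 3
G, H = rng.standard_normal((q, n)), rng.standard_normal((q, p))
A, Bs, Cs = -2.0 * H.T @ G, [G.T @ G], [H.T @ H]

# B_1 and C_1 are symmetric p.s.d., so lambda_1 = lambda(B_1) * lambda(C_1).
alphas = [np.linalg.eigvalsh(Bs[0]).max() * np.linalg.eigvalsh(Cs[0]).max()]

X = rng.standard_normal((n, p))
values = [f(X, 0.0, A, Bs, Cs)]
for _ in range(200):
    X = update(X, A, Bs, Cs, alphas)
    values.append(f(X, 0.0, A, Bs, Cs))
# Theorem 1 guarantees a monotone decrease at every iteration:
assert all(v2 <= v1 + 1e-8 for v1, v2 in zip(values, values[1:]))
```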
Proof. The proof of Theorem 1 is a simple elaboration of similar proofs given by Heiser (1987), and Bijleveld and de Leeuw (1987). Let the matrix E be defined as E = X - X̃. Then f(X) = f(X̃ + E) can be rewritten as

f(X) = f(X̃) + tr AE + Σ_j tr B_j E C_j X̃' + Σ_j tr B_j X̃ C_j E' + Σ_j tr B_j E C_j E'
     = f(X̃) + tr (A + Σ_j C_j X̃' B_j + Σ_j C_j' X̃' B_j') E + Σ_j tr E' B_j E C_j.   (4)
The crucial point in proving that Y is an update for X such that f(Y) ≤ f(X̃) is to show that the last term on the right-hand side of (4) is majorized by a simple function of E, and hence of X. Explicitly, if λ_j is the greatest eigenvalue of the symmetric part of (B_j ⊗ C_j'), that is, λ_j is the greatest eigenvalue of (½ (B_j ⊗ C_j') + ½ (B_j' ⊗ C_j)), then

tr E' B_j E C_j ≤ λ_j tr E'E.   (5)
[Figure 1. Decreasing the function f(X) by means of minimizing the majorizing function g(X).]

The proof of (5) can be given as follows. The left-hand side of (5) can be rewritten as

tr E' B_j E C_j = e'(B_j ⊗ C_j')e = e'(½ (B_j ⊗ C_j') + ½ (B_j' ⊗ C_j))e,   (6)

where e denotes Vec (E), the vector with the elements of E strung out rowwise into a column vector. Because tr E'E = e'e, (5) follows immediately from the well-known inequality for the Rayleigh quotient

e'Se / e'e ≤ λ(S),   (7)
where S is a symmetric matrix, and λ(S) is its greatest eigenvalue. It follows at once from (5) that, for any α_j greater than or equal to λ_j,

tr E' B_j E C_j ≤ α_j tr E'E.   (8)
Combining (4) and (8),

f(X) ≤ f(X̃) + tr (A + Σ_j C_j X̃' B_j + Σ_j C_j' X̃' B_j') E + Σ_j α_j tr E'E.   (9)

In the sequel it will be assumed that Σ_j α_j > 0, as can always be arranged. Let F be defined as F ≡ -(2 Σ_j α_j)^{-1} (A' + Σ_j B_j X̃ C_j + Σ_j B_j' X̃ C_j'). Then (9) can be rewritten as

f(X) ≤ f(X̃) + Σ_j α_j (tr E'E - 2 tr F'E),   (10)

or, equivalently,

f(X) ≤ f(X̃) + Σ_j α_j (||F - E||² - ||F||²).   (11)
Replacing E by (X - X̃) in (11), we obtain
f(X) ≤ f(X̃) + Σ_j α_j (||(F + X̃) - X||² - ||F||²),   (12)
which holds for any matrix X. The update Y for X is in fact found by minimizing the right-hand side of (12). The right-hand side of (12) can be considered a function of X:

g(X) = f(X̃) + Σ_j α_j (||(F + X̃) - X||² - ||F||²).   (13)
Because Σ_j α_j > 0, the minimum of g(X) is trivially given by Y = (F + X̃), which is equivalent to the expression for Y in (2). As a result, g(Y) ≤ g(X̃). Moreover, because of (12), function g majorizes f, that is, f(X) ≤ g(X) for every X; hence, in particular, f(Y) ≤ g(Y). Finally, it follows at once from (13) that g(X̃) = f(X̃). Combining these results,

f(Y) ≤ g(Y) ≤ g(X̃) = f(X̃).   (14)  □
The present section described an algorithm for minimizing f(X) over unconstrained X. In practice, some constraints are usually imposed on X. The present derivation mainly served as a basis for the derivation of algorithms for minimizing f(X) subject to certain constraints. The next section shows that when constraints are imposed on X, finding an update for X is slightly more complicated but can still be based on minimizing the majorizing function.

Minimization of the General Function over Constrained X

In many practical examples constraints are imposed on the matrix X. Although many different types of constraints are conceivable, only the cases where X is columnwise orthonormal or of reduced rank will be treated explicitly here. For the minimization of the function f(X) subject to X'X = I_p (p ≤ n), a procedure can be used that resembles the unconstrained case. The update Y for X is found according to the following theorem.
Theorem 2. Consider the singular value decomposition

X̃ - (2 Σ_j α_j)^{-1} (A' + Σ_j B_j X̃ C_j + Σ_j B_j' X̃ C_j') = PDQ',   (15)

with P'P = Q'Q = QQ' = I_p, and D a diagonal matrix with nonnegative diagonal elements in weakly descending order. Then, for Y = PQ', Y'Y = I_p, and if X̃'X̃ = I_p, f(Y) ≤ f(X̃).
Proof. It follows immediately from P'P = QQ' = I_p that Y'Y = I_p. To show that when X̃'X̃ = I_p, f(Y) ≤ f(X̃) for the thus chosen Y, a proof similar to that of the preceding section is used. Specifically, inequality (12) holds for any matrix X, and for the present case, X is constrained to be columnwise orthonormal. Again, X = X̃ is a possible choice for X that satisfies the constraint, because X̃ is itself a matrix that is columnwise orthonormal. Again, minimizing the majorizing function g over X yields a Y for which f(Y) ≤ f(X̃). The only difference in the present problem is that the majorizing function is to be minimized over X satisfying X'X = I_p (instead of over arbitrary X).
Minimizing g(X) essentially reduces to minimizing ||(F + X̃) - X||². Let (F + X̃) = PDQ' be a singular value decomposition. Then, the minimum of ||(F + X̃) - X||² over columnwise orthonormal X is attained by Y = PQ' (Cliff, 1966), which is exactly the formula for updating X given in Theorem 2. Thus, for this Y, f(Y) ≤ f(X̃) if X̃'X̃ = I_p.  □
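A minimal sketch of this orthonormality-constrained update (hypothetical Python/NumPy code, using the same representation of A, B_j, C_j, and α_j as in the earlier sketch, and assuming n ≥ p):

```python
import numpy as np

def update_orthonormal(X, A, Bs, Cs, alphas):
    """Theorem 2: form the matrix in (15), take its thin singular value
    decomposition PDQ', and return Y = PQ', which satisfies Y'Y = I_p."""
    S = A.T.copy()
    for B, C in zip(Bs, Cs):
        S += B @ X @ C + B.T @ X @ C.T
    M = X - S / (2.0 * sum(alphas))              # left-hand side of (15)
    P, _, Qt = np.linalg.svd(M, full_matrices=False)
    return P @ Qt
```

A columnwise orthonormal starting configuration can be obtained, for instance, from the QR decomposition of a random n × p matrix (np.linalg.qr).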
The second constrained minimization problem to be discussed is the minimization of f(X) subject to the constraint that X be of rank less than or equal to r (r < min (n, p)). As might be expected, the solution is similar to that for unconstrained X, or for columnwise orthonormal X, and is again reduced to minimizing g(X) over X, subject to the constraint. Specifically, one has to minimize ||(F + X̃) - X||² over X, where X has rank smaller than or equal to r. The solution is given by means of the Eckart-Young theorem (Eckart & Young, 1936). If (F + X̃) = PDQ' is a singular value decomposition, P_r and Q_r are matrices that contain the first r columns of P and Q, respectively, and D_r is the r × r diagonal matrix containing the first r diagonal elements of D, then Y = P_r D_r Q_r' gives the minimum of ||(F + X̃) - X||² over X subject to the constraint that X is of rank less than or equal to r. Therefore, this particular Y minimizes g(X), and is an update for X for which f(Y) ≤ f(X̃). This result is stated explicitly in Theorem 3.
Theorem 3. Consider the singular value decomposition given by (15) in Theorem 2. If X̃ has rank less than or equal to r, then for Y = P_r D_r Q_r', f(Y) ≤ f(X̃), and Y has rank less than or equal to r.

Each of the above theorems provides an update Y for X such that f(X) decreases. Therefore, repeatedly computing updates according to any one of these theorems provides a monotonically convergent algorithm for minimizing f(X), provided that it is bounded from below. In all formulas for finding the update Y of X, the α_j parameters are needed, and in certain cases, their computation can be facilitated, as will be discussed below.
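A corresponding sketch for the reduced-rank update of Theorem 3 (again hypothetical code, same conventions as the earlier sketches):

```python
import numpy as np

def update_reduced_rank(X, A, Bs, Cs, alphas, r):
    """Theorem 3: truncate the singular value decomposition of the matrix
    in (15) to its first r singular triplets (Eckart & Young, 1936),
    yielding an update Y = P_r D_r Q_r' of rank at most r."""
    S = A.T.copy()
    for B, C in zip(Bs, Cs):
        S += B @ X @ C + B.T @ X @ C.T
    M = X - S / (2.0 * sum(alphas))
    P, d, Qt = np.linalg.svd(M, full_matrices=False)
    return P[:, :r] @ np.diag(d[:r]) @ Qt[:r, :]
```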
Alternative Choices of α_j

The algorithms discussed above require the determination of the α_j coefficients subject to α_j ≥ λ_j and Σ_j α_j > 0. The most obvious choice seems to be α_j = λ_j, and if the requirement Σ_j α_j > 0 is not met, all α_j that are negative can be set equal to zero, provided that at least one of the α_j is positive. Unfortunately, this is not always an attractive way to choose α_j, because the computation of λ_j can be rather cumbersome. The Kronecker product (B_j ⊗ C_j') is needed, as well as the greatest eigenvalue of (½ (B_j ⊗ C_j') + ½ (B_j' ⊗ C_j)), which can be rather time-consuming since a large matrix is involved. Given this, a different choice for α_j might be preferred in practice. An alternative set of α_j parameters will necessarily consist of some that are strictly greater than λ_j, implying that the choice for an update of X will differ from that based on α_j = λ_j. If α_j is chosen to be very large, (2 Σ_j α_j)^{-1} will be very small, and as a consequence, the update Y will differ little from X̃, slowing down convergence of the algorithm. Thus, an alternative set of α_j parameters is sought that are as small as possible and computed easily. As one possibility, suppose σ(B_j) and σ(C_j) are the first singular values of B_j and C_j, respectively. If α_j is given as the product of σ(B_j) and σ(C_j), the constraints α_j ≥ λ_j are satisfied (see Bijleveld & de Leeuw, 1987), as shown by the following two Lemmas.

Lemma 1. Let σ(A) be the greatest singular value of a square matrix A, and λ(A) the greatest eigenvalue of the symmetric part of A. Then, σ(A) ≥ λ(A).
Proof. It is well-known that λ(A) is the maximum of x'(½ A + ½ A')x = x'Ax over x, subject to x'x = 1, and σ(A) is the maximum of x'Ay over x and y, subject to x'x = 1 and y'y = 1. Because x'Ax is a possible value of x'Ay, the maximum of x'Ax can never be greater than that of x'Ay, from which σ(A) ≥ λ(A) follows.  □

Lemma 2. Let σ(A ⊗ B) be the greatest singular value of A ⊗ B, and σ(A) and σ(B) the greatest singular values of A and B, respectively. Then, σ(A ⊗ B) = σ(A) σ(B).

Proof. The singular values of A ⊗ B are given by the products of all possible pairs of singular values of A and B, respectively. Because of the nonnegativity of singular values, the greatest singular value of A ⊗ B is given by the product of the greatest singular values of A and B, respectively.  □

The choice of α_j as the product of σ(B_j) and σ(C_j) is much simpler computationally than setting α_j equal to λ_j. It only requires the first singular values of matrices that are much smaller than their Kronecker product.
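The trade-off between the two choices can be illustrated with a small sketch (hypothetical code; the function names are ours): the exact bound λ_j requires an eigendecomposition of an np × np matrix, whereas the bound of Lemmas 1 and 2 needs only two largest singular values.

```python
import numpy as np

def alpha_exact(B, C):
    """lambda_j: greatest eigenvalue of the symmetric part of (B kron C'),
    exact but expensive (an np x np eigenvalue problem)."""
    K = np.kron(B, C.T)
    return np.linalg.eigvalsh((K + K.T) / 2.0).max()

def alpha_cheap(B, C):
    """sigma(B) sigma(C): an easily computed upper bound for lambda_j,
    valid by Lemmas 1 and 2."""
    return np.linalg.norm(B, 2) * np.linalg.norm(C, 2)

rng = np.random.default_rng(1)
B, C = rng.standard_normal((6, 6)), rng.standard_normal((4, 4))
assert alpha_cheap(B, C) >= alpha_exact(B, C) - 1e-12   # alpha_j >= lambda_j
```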
The Special Case Where λ_j ≤ 0 for j = 1, ..., m

In the particular case where λ_j ≤ 0 for j = 1, ..., m, the formulas for updating X, and the algorithm based on the update, can be simplified further, because a computation of the parameters α_j is not needed. When all λ_j are negative or zero, α_j can be set to any nonnegative number, provided at least one α_j > 0, j = 1, ..., m. Even simpler is the case where f(X) is to be minimized subject to X'X = I_p, and λ_j ≤ 0 for j = 1, ..., m. Here, the α_j parameters are not needed at all. Specifically, (9) holds for any α_j ≥ λ_j, j = 1, ..., m. Now, suppose λ_j ≤ 0 for j = 1, ..., m, and we merely set α_j = 0 for all j. Equation (9) can then be expressed as

f(X) ≤ f(X̃) + tr (A + Σ_j C_j X̃' B_j + Σ_j C_j' X̃' B_j') E.   (16)
If F ≡ -½ (A' + Σ_j B_j X̃ C_j + Σ_j B_j' X̃ C_j'), (16) can be rewritten as

f(X) ≤ f(X̃) - 2 tr F'E,   (17)
and upon substitution of E = (X - X̃),

f(X) ≤ f(X̃) + 2 tr F'X̃ - 2 tr F'X,   (18)
or, equivalently,

f(X) ≤ f(X̃) + ||F - X||² - tr F'F + 2 tr F'X̃ - p.   (19)
If the majorizing function h(X) is defined as h(X) = f(X̃) + ||F - X||² - tr F'F + 2 tr F'X̃ - p, then setting X = X̃ yields h(X̃) = f(X̃), and hence the minimum of h(X) is attained for a Y for which f(Y) ≤ f(X̃). Letting the singular value decomposition of F be F = PDQ', the minimum of h(X), found by minimizing ||F - X||², is attained by setting Y = PQ'. Note that taking Y = -UV', resulting from the singular value decomposition (A' + Σ_j B_j X̃ C_j + Σ_j B_j' X̃ C_j') = UΔV', gives exactly the same result, and this solution for X does not involve any parameter α_j.
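In code, this special case reduces to a single singular value decomposition per iteration (again a hypothetical sketch in the conventions used above):

```python
import numpy as np

def update_orthonormal_no_alpha(X, A, Bs, Cs):
    """Special case lambda_j <= 0 for all j, with X'X = I_p: take the
    singular value decomposition U Delta V' of
    A' + sum_j B_j X C_j + sum_j B_j' X C_j' and return Y = -U V'."""
    S = A.T.copy()
    for B, C in zip(Bs, Cs):
        S += B @ X @ C + B.T @ X @ C.T
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return -U @ Vt
```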
Identification of Certain Algorithms as Special Cases of the Procedures for Minimizing the General Function

A number of minimization problems have been approached by means of algorithms based on minimizing a majorizing function, and for three of these, we show that they are special cases of minimizing f(X).

The first algorithm was proposed by Bijleveld and de Leeuw (1987) for a least squares fit of a model for linear dynamical systems. Following their notation, let B be a fixed n × n matrix; D, F, and H fixed p × p matrices; X and Y fixed n × p matrices; and Z a variable n × p matrix. One of their subproblems is to minimize the function

f_1(Z) = ||Z - BZF' - XD'||² + ||Y - ZH||²,   (20)

over Z, subject to the constraint Z'Z = I_p. This function can be expanded as

f_1(Z) = c_1 + tr BZF'FZ'B' - 2 tr Z'BZF' - 2 tr Z'XD' + 2 tr FZ'B'XD' + tr ZH'HZ' - 2 tr Y'ZH
       = c_1 - 2 tr DX'Z + 2 tr F'DX'BZ - 2 tr H'Y'Z + tr ZH'HZ' + tr B'BZF'FZ' - 2 tr BZF'Z',   (21)
where c_1 denotes a constant term not depending on Z. It is readily verified that f_1(Z) is the special case of f(X) with X replaced by Z, A = -2DX' + 2F'DX'B - 2H'Y', B_1 = I_n, B_2 = B'B, B_3 = -2B, C_1 = H'H, C_2 = F'F, and C_3 = F'. Thus, according to Theorem 2, an update for Z is given by PQ', where P and Q are based on the singular value decomposition

Z̃ - (2 Σ_j α_j)^{-1} (A' + Σ_j B_j Z̃ C_j + Σ_j B_j' Z̃ C_j') = PDQ'.   (22)
Substituting the expressions given above for the coefficient matrices A, B_j, and C_j, j = 1, ..., 3, and joining equal terms in (22) gives

Z̃ - (2 Σ_j α_j)^{-1} (-2XD' + 2B'XD'F - 2YH + 2Z̃H'H + 2B'BZ̃F'F - 2BZ̃F' - 2B'Z̃F) = PDQ'.   (23)
The eigenvalues that bound the α_j parameters from below are λ_1 = λ(H'H), λ_2 = λ(B'B)λ(F'F), and λ_3 = λ(-(B ⊗ F) - (B' ⊗ F')); thus, letting α_1 = λ_1 and α_2 = λ_2 poses no computational problems, and the choice for α_3 can be facilitated by choosing α_3 = σ(-2B)σ(F) = 2σ(B)σ(F) instead of α_3 = λ_3, as was explained in a previous section. The procedure for updating Z proposed by Bijleveld and de Leeuw (1987) is analogously based on P̃Q̃' obtained from a singular value decomposition M = P̃D̃Q̃', where

M = Z̃ + (α + β)^{-1} (B'Z̃F - B'BZ̃F'F - B'XD'F + YH - Z̃H'H - Z̃ + BZ̃F' + XD')
  = Z̃(α + β - 1)/(α + β) - (2α + 2β)^{-1} (-2B'Z̃F + 2B'BZ̃F'F + 2B'XD'F - 2YH + 2Z̃H'H - 2BZ̃F' - 2XD');   (24)
here, α = (1 + σ(B)σ(F))² and β = λ(H'H). The product P̃Q̃', and hence the update for Z, are not changed by replacing M by M̃ = M(α + β)/(α + β - 1). The expression for M̃ is then given by
M̃ = Z̃ - (2α + 2β - 2)^{-1} (-2B'Z̃F + 2B'BZ̃F'F + 2B'XD'F - 2YH + 2Z̃H'H - 2BZ̃F' - 2XD'),   (25)
where 2α + 2β - 2 = 2λ(B'B)λ(F'F) + 4σ(B)σ(F) + 2λ(H'H). It can be shown that the expression (25) for M̃ is equal to the left-hand side of (23) when the α_j parameters are chosen as α_1 = λ(H'H), α_2 = λ(B'B)λ(F'F), and α_3 = 2σ(B)σ(F). Thus, the algorithm proposed by Bijleveld and de Leeuw (1987) is identical to the one derived above from the general algorithm.

The second minimization problem was solved by Heiser (1987), and a special case by Meulman (1986, pp. 148-149). Let F be a fixed and X a variable matrix of order n × p, and M a positive semidefinite matrix of order n × n. Heiser's problem (1987, p. 344) is to minimize the function

f_2(X) = tr (X - F)'M(X - F),   (26)

over X, subject to the constraint X'X = I_p, which he solves by majorization as follows. If β is the first eigenvalue of M, and X̃ + β^{-1} M(F - X̃) = PDQ' is a singular value decomposition, an update Y for X is given by Y = PQ'. Heiser's problem can be recognized as a special case of minimizing f(X) over X, subject to X'X = I_p. Function f_2(X) can be expanded as

f_2(X) = c_2 - 2 tr F'MX + tr MXX',   (27)
where c_2 is a constant not depending on X. This is the special case of f(X) with A = -2F'M, B_1 = M, and C_1 = I_p, and according to Theorem 2, the update Y for X is given by Y = PQ', where P and Q are taken from the singular value decomposition X̃ - (2α_1)^{-1}(2MX̃ - 2MF) = X̃ + α_1^{-1} M(F - X̃) = PDQ'. The parameter α_1 is chosen so that α_1 ≥ λ(M), where λ(M) is the first eigenvalue of M. Obviously, setting α_1 = β satisfies this requirement, demonstrating the equivalence of the update proposed by Heiser and the one that follows from our general procedure.

The third problem is the least squares fitting of the DEDICOM model (Harshman, 1978), which can be written as minimizing
f_3(X, R) = ||F - XRX'||²,   (28)
where F is an arbitrary fixed n × n matrix, and X and R are variable matrices of orders n × r and r × r, respectively, with X columnwise orthonormal. Kiers et al. (1990) have given an algorithm for this problem based on majorization that can be interpreted as alternately updating R by R = X̃'FX̃, and X by X = PQ', where P and Q are taken from the singular value decomposition 2αX̃ + FX̃R' + F'X̃R = PDQ', and α is taken as a value greater than or equal to the greatest eigenvalue of the symmetric part of (-F ⊗ R). (In fact, their algorithm combines these two steps into one that immediately updates X.) Here, only the update for X is reconsidered. The DEDICOM function, with R considered fixed, can be rewritten as

f_3(X, *) = c_3 - 2 tr F'XRX',   (29)
where * denotes that R is considered fixed, and c_3 is a constant not depending on X. This is a special case of f(X) with A = 0, B_1 = -2F', C_1 = R, and thus, the solution for an update Y of X according to Theorem 2 is given by Y = PQ', where P and Q are taken from the singular value decomposition X̃ - (2α_1)^{-1}(-2FX̃R' - 2F'X̃R) = X̃ + α_1^{-1}(FX̃R' + F'X̃R) = PDQ'. The choice for α_1 is made so that α_1 is greater than or equal to the first eigenvalue of the symmetric part of (-2F ⊗ R).
It is clear that the choice regions for α in Kiers et al. (1990) and for ½α_1 are the same, and consequently, ½α_1 can be replaced by α. The matrix X̃ + α_1^{-1}(FX̃R' + F'X̃R) = X̃ + (2α)^{-1}(FX̃R' + F'X̃R), for which a singular value decomposition is taken according to Theorem 2, is equal to the matrix for which a singular value decomposition is taken in Kiers et al., up to the positive scalar 2α. Hence, the Kiers et al. algorithm is a special case of minimizing f(X) subject to X'X = I_p.

New Algorithms Based on Minimizing the General Function

In the previous section algorithms have been identified as special cases of those for minimizing the general function, and although this may be of some general interest, a more exciting feature of the general function algorithms is their possible use in solving a wide class of unsolved minimization problems, or in offering alternatives for problematic algorithms. The present section discusses one such problem briefly and indicates how other minimization tasks can be solved along similar lines.

The first minimization problem is to fit the "orthonormal INDSCAL model" (see Kroonenberg, 1983, pp. 110, 118). Let S_j be a fixed symmetric matrix of order n × n, X a variable matrix of order n × p, and D_j a variable diagonal matrix of order p, j = 1, ..., m. Fitting the orthonormal INDSCAL model is equivalent to minimizing the function

f_4(X, D_j) = Σ_j ||S_j - XD_jX'||²,   (30)
subject to X'X = I_p. Kroonenberg uses a procedure that essentially minimizes

g_4(X, D_j, Y) = Σ_j ||S_j - XD_jY'||²,   (31)
over X and Y, subject to X'X = I_p and Y'Y = I_p. In practice, one tends to find Y = X at convergence, but no proof is available, and therefore, a procedure that minimizes f_4 directly is desirable. Ten Berge, Knol, and Kiers (1988) provide a method for minimizing f_4, but monotone convergence has been proven only for the case where S_j is positive semidefinite (p.s.d.), j = 1, ..., m. For this case, it can be shown that in Kroonenberg's algorithm, Y can be set equal to X after convergence to the global optimum without loss of fit, and thus, Kroonenberg's algorithm poses no problems for such data either. In many cases, however, the matrices S_j will not necessarily be p.s.d., and in fact, ten Berge et al.'s method for finding the VARIMAX rotation is based on minimizing f_4 with a special choice of S_j that is certainly not p.s.d. Here, an alternative to the ten Berge et al. approach is proposed for which monotone convergence is guaranteed regardless of the nature of the S_j matrices, which obviously gives an alternative VARIMAX algorithm for which monotone convergence is also guaranteed.

As in ten Berge et al. (1988), f_4 will be minimized by alternatingly updating X and D_j, j = 1, ..., m. An update E_j for D_j such that f_4(X̃, E_j) ≤ f_4(X̃, D_j) is given by E_j = Diag (X̃'S_jX̃) (see Kiers, 1989). Updating X, however, is more complicated. Function f_4, with D_j, j = 1, ..., m, considered fixed, can be rewritten as

f_4(X) = c_4 - 2 Σ_j tr S_jXD_jX',   (32)

where c_4 is a constant with respect to X. This is the particular case of f(X) with A = 0, B_j = -2S_j, C_j = D_j, j = 1, ..., m, and thus, an update Y of X, according to Theorem 2, is given by Y = PQ', where P and Q are taken from the singular value decomposition X̃ - (2 Σ_j α_j)^{-1}(-4 Σ_j S_jX̃D_j) = X̃ + (½ Σ_j α_j)^{-1} Σ_j S_jX̃D_j = PDQ'.
The choice for α_j is made so that α_j is greater than or equal to the first eigenvalue of the symmetric part of (-2S_j ⊗ D_j), provided Σ_j α_j > 0. The constraint Σ_j α_j > 0 is not irrelevant now, because when S_j is, for instance, positive definite for all j, so is D_j. Then, the first eigenvalue of the symmetric part of (-2S_j ⊗ D_j) is -2μ(S_j)μ(D_j), where μ(·) denotes the smallest eigenvalue of the matrix between the parentheses. Because S_j and D_j are p.s.d., μ(S_j) ≥ 0 and μ(D_j) ≥ 0, and thus, -2μ(S_j)μ(D_j) ≤ 0, implying that choices for α_j satisfying this first constraint do not necessarily satisfy Σ_j α_j > 0.

As has been mentioned in the section on the special case where λ_j ≤ 0 for all j, the algorithm can be simplified considerably when the first eigenvalue of the symmetric part of (-2S_j ⊗ D_j) is smaller than or equal to 0 for all j. This occurs when all S_j are p.s.d., because then D_j = Diag (X̃'S_jX̃) is p.s.d. as well. Thus, the simplified algorithm based on the singular value decomposition of -4 Σ_j S_jX̃D_j can be used, and it can be shown to be the algorithm proposed by ten Berge et al. (1988). Therefore, the algorithm proposed above for minimizing f_4 is a generalization.
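A compact sketch of the resulting alternating algorithm (hypothetical Python/NumPy code; fit_orthonormal_indscal and its defaults are ours, not from the paper):

```python
import numpy as np

def fit_orthonormal_indscal(Ss, p, iters=100, seed=0):
    """Alternating majorization for f_4: update each D_j as Diag(X'S_jX),
    then update X via Theorem 2 with B_j = -2 S_j, C_j = D_j, and
    alpha_j = sigma(-2 S_j) sigma(D_j), a valid bound by Lemmas 1 and 2."""
    n = Ss[0].shape[0]
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))          # X'X = I_p
    for _ in range(iters):
        Ds = [np.diag(np.diag(X.T @ S @ X)) for S in Ss]      # E_j = Diag(X'S_jX)
        alphas = [2.0 * np.linalg.norm(S, 2) * np.linalg.norm(D, 2)
                  for S, D in zip(Ss, Ds)]
        if sum(alphas) == 0.0:                                # degenerate case guard
            break
        G = sum(-2.0 * (S @ X @ D + S.T @ X @ D) for S, D in zip(Ss, Ds))
        M = X - G / (2.0 * sum(alphas))                       # left-hand side of (15)
        P, _, Qt = np.linalg.svd(M, full_matrices=False)
        X = P @ Qt
    Ds = [np.diag(np.diag(X.T @ S @ X)) for S in Ss]
    return X, Ds
```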
The algorithm proposed above has not been tested extensively in practice. It is described merely to illustrate how one can rather easily construct an algorithm using the methods for minimizing f(X). Along similar lines, algorithms can be constructed for the following problems: fitting the IDIOSCAL model (Carroll & Chang, 1972) by minimizing

f_5(X, R_j) = Σ_j ||S_j - XR_jX'||²,   (33)

over R_j and X, subject to X'X = I_p; fitting the Tucker-3 model (Kroonenberg & de Leeuw, 1980) with equality constraints on two of the modes, which reduces to minimizing
f_6(X, C, H_1, ..., H_r) = Σ_j ||S_j - X (Σ_{t=1}^{r} c_{jt} H_t) X'||²,   (34)
over X, C, and H_1, ..., H_r, subject to X'X = I_p and C'C = I_r; and minimizing a function that arises in the PINDIS context (Lingoes & Borg, 1978),

f_7(X) = ||S - AXB||²,   (35)
over X, subject to X'X = I_p.

Discussion

Algorithms have been proposed for minimizing a rather general function subject to certain constraints on the parameters. We have shown that special cases of these algorithms have already been used for minimizing certain special cases of the general function, and that minimization of certain other special cases of the general function can be approached by the algorithms presented. This does not imply that these are the only possible algorithms for such problems, or the best ones. The importance of the present algorithms for minimizing f(X) is that they provide a tool for constructing algorithms for least squares fitting problems of various kinds that are guaranteed to converge monotonically, provided the function is bounded from below.

As is the case for practically all iterative minimization procedures, one runs the risk of hitting local optima. This risk can be reduced by choosing good starting configurations, but this is easier said than done. A different strategy is to use several restarts from different starting configurations, or to change the α_j parameters that might influence
the occurrence of local optima and affect the "step-size" of the algorithm. These parameters play a role similar to that of the acceleration parameters proposed by Ramsay (1975).

The type of problem that can be handled by the algorithms presented here is by no means limited to least squares fitting, and includes every problem that involves the optimization of a function that is a special case of f(X), provided that the function is bounded. Moreover, other constraints might be imposed, such as: X is diagonal, has columnwise or rowwise unit sums of squares, or must lie in a prescribed subspace. In these cases, the minimization of the general function can be performed by an algorithm that finds an update Y for X by minimizing ||(F + X̃) - X||² over X (with F as in (10)), with the respective constraints imposed on X.

References

Bijleveld, C., & de Leeuw, J. (1987, June). Fitting linear dynamical systems by alternating least squares. Paper presented at the European Meeting of the Psychometric Society, Twente, The Netherlands.
Carroll, J. D., & Chang, J. J. (1972, March). IDIOSCAL: A generalization of INDSCAL allowing IDIOsyncratic reference systems as well as an analytic approximation to INDSCAL. Paper presented at the Spring Meeting of the Psychometric Society, Princeton, NJ.
Cliff, N. (1966). Orthogonal rotation to congruence. Psychometrika, 31, 33-42.
de Leeuw, J. (1988). Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5, 163-180.
de Leeuw, J., & Heiser, W. (1980). Multidimensional scaling with restrictions on the configuration. In P. R. Krishnaiah (Ed.), Multivariate analysis V (pp. 501-522). Amsterdam: North-Holland.
de Leeuw, J., Young, F. W., & Takane, Y. (1976). Additive structure in qualitative data: An alternating least squares method with optimal scaling features. Psychometrika, 41, 471-503.
Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218.
Harshman, R. A. (1978, August). Models for analysis of asymmetric relationships among N objects or stimuli. Paper presented at the First Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology, Hamilton, Ontario.
Heiser, W. J. (1987). Correspondence analysis with least absolute residuals. Computational Statistics and Data Analysis, 5, 337-356.
Kiers, H. A. L. (1989). INDSCAL for the analysis of categorical data. In R. Coppi & S. Bolasco (Eds.), Multiway data analysis (pp. 155-167). Amsterdam: Elsevier Science.
Kiers, H. A. L., ten Berge, J. M. F., Takane, Y., & de Leeuw, J. (1990). A generalization of Takane's algorithm for DEDICOM. Psychometrika, 55, 151-158.
Kroonenberg, P. M. (1983). Three mode principal component analysis: Theory and applications. Leiden: DSWO.
Kroonenberg, P. M., & de Leeuw, J. (1980). Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45, 69-97.
Lingoes, J. C., & Borg, I. (1978). A direct approach to individual differences scaling using increasingly complex transformations. Psychometrika, 43, 491-519.
Meulman, J. J. (1986). A distance approach to nonlinear multivariate analysis. Leiden: DSWO.
Ramsay, J. O. (1975). Solving implicit equations in psychometric data analysis. Psychometrika, 40, 337-360.
ten Berge, J. M. F., Knol, D. L., & Kiers, H. A. L. (1988). A treatment of the Orthomax rotation family in terms of diagonalization, and a re-examination of a singular value approach to Varimax rotation. Computational Statistics Quarterly, 3, 207-217.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate analysis II (pp. 391-420). New York: Academic Press.
Young, F. W., de Leeuw, J., & Takane, Y. (1976). Regression with qualitative and quantitative variables: An alternating least squares method with optimal scaling features. Psychometrika, 41, 505-529.

Manuscript received 1/12/89
Final version received 9/5/89