Adaptive control of average Markov decision chains under the Lyapunov stability condition

This note concerns discrete-time Markov decision processes with denumerable state space. A control policy is graded by the long-run expected average ...

0 downloads 380 Views 302KB Size

Download PDF

Math Meth Oper Res (2001) 54 : 63±99

aaaaa 2001

Adaptive control of average Markov decision chains under the Lyapunov stability condition* Rolando Cavazos-Cadena Departamento de EstadõÂstica y CaÂlculo, Universidad AutoÂnoma Agraria Antonio Narro, Buenavista, Saltillo COAH, 25315, MEÂXICO (e-mail: [email protected]), and Centro de Investigaciones SocioeconoÂmicas, Universidad AutoÂnoma de Coahuila, Edi®cio S, Unidad Campo Redondo, Saltillo COAH, 25479, MEÂXICO (e-mail: [email protected]). Manuscript received: November 2000

Abstract. This note concerns discrete-time Markov decision processes with denumerable state space. A control policy is graded by the long-run expected average reward criterion, and the main feature of the model is that the reward function and the transition law depend on an unknown parameter. Besides standard continuity-compactness restrictions, it is supposed that the controller can use the observed history to generate a consistent estimation scheme, and that the system's transition-reward structure satis®es an adaptive version of the Lyapunov function condition. Within this context, a special implementation of the non stationary value iteration method is studied, and it is shown that this technique produces convergent approximations to the solution of the optimality equation, result that is used to build an optimal adaptive policy. Key words: Schweitzer's Transformation, Pause Control, Non stationary value iteration, Consistent estimation, Adaptive optimal policy, Discrepancy function. AMS Classi®cation: 90C40, 93E20. 1 Introduction This work concerns discrete-time Markov decision processes (MDPs) endowed with the (long-run) expected average reward criterion and denumerable state space, and it is supposed that the system's transition-reward structure depends on an unknown parameter. In this case, to drive the system to optimality the controller must use the observed history of the process to estimate the true parameter value, so that the actions applied at each decision times are adapted * This work was generously supported by the PSF Organization under Grant No. 01-00/300.

64

R. Cavazos-Cadena

to the available knowledge. The main objective is to build an optimal adaptive policy, i.e. a policy that combines the estimation and control tasks in such a way that produces the optimal average reward. This problem will be analyzed via the value iteration scheme (VI) with converging parameters introduced by Federgruen and Schweitzer (1981), which was ®rstly used in adaptive control problems in HernaÂndez-Lerma and Marcus (1985) for discounted problems. The application of the value iteration scheme to adaptive control problems has been widely studied in the literature (HernaÂndez-Lerma, 1989), so that it is interesting to point out that the main di¨erence between the results in this note and those already available concerns the assumptions used to derive the optimality results. As usual in adaptive control problems, this note assumes that the reward function and the transition law depend continuously on the action and the parameter value, and that the controller is able to use the observed history to construct consistent estimates of the true unknown parameter; see Assumptions 2.1 and 2.3 below. On the other hand, to ensure that the average reward criterion is `well-behaved', in this work it is supposed that the system satis®es an adaptive version of the Lyapunov function condition (Hordjik, 1977, Cavazos-Cadena and HernaÂndez-Lerma 1992), which is formally stated below in Assumption 2.2. This assumption is substantially weaker than the strong forms of the simultaneous Doeblin condition (Thomas, 1965) usually imposed to obtain optimal adaptive policies for the average reward criterion. Indeed, under this latter assumption, the value iteration scheme produces approximations that converge at a geometric rate, fact that allows to use ideas from the discounted case that greatly simplify the analysis (Cavazos-Cadena and HernaÂndez-Lerma, 1991; for MDPs with more general state space see, for instance, HernaÂndez-Lerma 1989 and the references therein, and MinjaÂrezSosa, 1999 for more recent results). In contrast, under Assumption 2.3 such a geometric convergence does not hold, and di¨erent arguments are necessary to analyze the value iteration method; the results of this note are based on the ideas presented in Cavazos-Cadena (1998), where the VI method was analyzed for MDPs satisfying the Lyapunov function condition in a non adaptive context; see also Cavazos-Cadena and FernaÂndez-Gaucherand (1996). Essentially, this work consists of two parts: The ®rst one is dedicated to study the value iteration method with converging parameters, and the results in this direction are used in the second part to build an optimal adaptive policy. The organization of the material is as follows: In Section 2 the decision model is formally described and the basic assumptions are stated. In Section 3 this model is modi®ed by applying a transformation due to Schweitzer (1971), and incorporating an arti®cial action whose role is to allow a pause in the evolution of the system (Cavazos-Cadena, 1998). Next, using this modi®ed model, the VI scheme with parameters converging to a target value is implemented in Section 4 and, after the technical preliminaries in Sections 5 and 6, it is shown in Section 7 that the approximations obtained from the VI method converge to the solution of the optimality equation associated to the target value. In Section 8 the non stationary VI policy is introduced, and its adaptive optimality is proved using the tools in Section 9. In Section 10 a simple example is given to illustrate the application of the adaptive policy and, ®nally, the presentation conludes in Section 11 with some brief comments. Notation. Throughout the remainder N denotes the set of nonnegative integers and, for an event W, I W stands for its indicator function. Also, the cartesian

Adaptive control under the LFC

65

product of topological spaces is always endowed with the corresponding product topology. 2 Decision model Let Y be a ®xed metric space, hereafter referred to as the parameter space, and for each y A Y let My hS; A; fAxg; R; ; y; f px; y ; ygi be an MDP, where the state S is a denumerable set endowed with the discrete topology, the metric space A is the action set, and for each x A S, Ax is the class of admissible actions at state x; the set of admissible pairs is de®ned by K fx; a j a A Ax; x A Sg and is considered as a topological subspace of the Cartesian product S A. On the other hand, R; ; y is the real-valued reward function de®ned on K, whereas f px; y ; yg stands for the controlled transition law. For each parameter y, model My represents a system evolving as follows: At each time t A N, the state of the system is observed, say Xt x A S, and an action a A Ax is chosen by the controller. In this situation, a reward Rx; a; y is gained and, regardless of the previous states and actions, the state of the system at time t 1 will be Xt1 y with probability px; y a; y; this latter description corresponds to the Markov property of the decision process. Assumption 2.1. (i) For each x A S the set Ax is a compact subspace of A. (ii) Given x; y A S, the mappings a; y 7! Rx; a; y and a; y 7! px; y a; y are continuous in a; y A Ax Y. Policies. For each t A N the space Ht of admissible histories up to time t is recursively de®ned by H0 S, and Ht K Ht 1 , t b 1; a generic member of Ht is denoted by ht x0 ; a0 ; . . . ; xt 1 ; at 1 ; xt . A policy p fpt g for model My is a special sequence of stochastic kernels: for each t A N and ht A Ht , pt jht is a probability measure on the Borel sets of A satisfying pt Axt jht 1, whereas, for each Borel subset B of the action space A, the mapping ht 7! pt Bjht is a Borel measurable Q function of ht A Ht . The class of all policies is denoted by P. Set F : x A S Ax, so that F consists of all functions f : S ! A such that f x A Ax for every x A S. A policy p is stationary if there exists f A F such that, given ht , the action prescribed by p is f xt , that is pt f f xt gjht 1 always holds; in this case p and f are naturally identi®ed and, with this convention, F H P. A policy p is Markovian if there exists a sequence f ft A Fg such that, under the action of p, the equality At ft Xt always occur with probability 1, whereas a policy p is (history-dependent) deterministic, if there exists a sequence f ft : Ht ! Ag such that for every t A N and ht A Ht , ft ht A Axt and pt f ft ht gjht 1; in the two latter cases, p is naturally identi®ed with the corresponding sequence f ft g. Given the initial state X0 x, the parameter value y A Y such that the system evolves according to My , and the policy p A P used to drive the system, the distribution Py;px of the state-action process fXt ; At g is uniquely determined and Ey;p x stands for the corresponding expectation operator; see, for instance, HernaÂndez-Lerma (1989), Puterman (1994), or Sennott (1999). Stability Condition. In this work a control policy will be graded via the long-run average performance index introduced below. To ensure that this criterion is well-behaved it is necessary to impose a stability requirement on the transition-

66

R. Cavazos-Cadena

reward structure of the system (Thomas, 1965), and throughout the remainder the conditions in Assumption 2.2 below will be enforced; this is an adaptive version of an assumption used by Foster (1953) in an uncontrolled context, and by Hordjik (1977) for controlled Markov chains. Assumption 2.2. [Adaptive Lyapunov Function Condition (LFC).] There exists a state z A S and a function l : S ! 0; y satisfying the following conditions (i)±(iii), where the ®rst passage time to state z is de®ned by T minfn > 1 j Xn zg:

2:1 P

(i) For every x; a A K and y A Y, 1 jRx; a; yj y0z px; y a; yl y a lx. P (ii) For each x A S, the mapping a; y 7! y px; y a; yl y is continuous in a; y A Ax Y. (iii) Given x A S, y A Y and f A F, Ey;f x lXn I T > n ! 0 as n ! y. The function l in this condition is referred to as a Lyapunov function for the family fMy g. Performance Index. Let y A Y be ®xed and suppose that the system's dynamics is described by the MDP My . When the system is driven by p A P, the long-run expected average reward associated to y at state x is de®ned by " # n X 1 Ey;p x RXt ; At ; y ; 2:2 Jp; x; y lim sup nAy n 1 t0 whereas Jx; y sup Jp; x; y pAP

2:3

is the optimal average reward at x corresponding to y; a policy p is optimal when y is the parameter value if Jp ; x; y Jx; y for every x A S. Under Assumptions 2.1 and 2.2, the expectation in (2.2) is well-de®ned and both the average reward and the optimal value in (2.2) and (2.3), respectively, are ®nite. Moreover, as shown by Hordjik (1977), if the controller knows the parameter value y such that model My describes the evolution of the system, the main goal in a control problem, namely, ®nding an optimal policy, can be achived by determining a solution to the optimality equation given in (2.4) below. Lemma 2.1 (Hordjik, 1977). Suppose that Assumptions 2.1 and 2.2 hold, and let y A Y be ®xed. In this case there exist a constant gy and a function hy : S ! R satisfying the following assertions (i)±(v). (i) The optimal value function associated to y is constant and equals gy , i.e., Jx; y gy , x A S. (ii) hy z 0 and there exists a constant C such that jhy xj a Clx for every x A S.

Adaptive control under the LFC

67

(iii) The following average reward optimality equation (AROE) is satis®ed by the pair gy ; hy : " # X px; y a; yhy y ; x A S: 2:4 gy hy x sup Rx; a; y a A Ax

y

(iv) For each x A S the term within brackets in (2.4) has a maximizer fy x A Ax, and the stationary policy fy is optimal when y is the parameter value, i.e., J fy ; x; y gy for every x A S. (v) The pair gy ; hy in (2.4) is unique when the function hy is restricted to satisfy the conditions in statement (ii) above. Remark 2.1. The following consequences of Assumption 2.2 will be useful. (i) The optimal average reward gy satis®es jgy j a lz

1;

2:5

Cavazos-Cadena (1998). (ii) From Assumption 2.2(i), an induction argument using the Markov propPn 1 jRXt ; At ; yjI T > t erty yields that the inequality Ey;p x t0 lXn1 a lx always holds. Then, for each x A S, p A P and y A Y, " # y X p jRXt ; At j 1I T > t Ey; x t0

"

lim E p n!y y; x

n X

# jRXt ; At j 1I T > t a lx:

t0

Pn I T > t a lx < y. In particular, Ey;p x T limn!y Ey;p x t0 (iii) A speci®c value for the constant C in Lemma 2.1(ii) can be determined from (2.5) and the above remark. Let y A Y be ®xed, and select the policy fy 1 f A F as in Lemma 2.1(iv). In this case (Hordjik, 1977) for each x A S, P gy I T > t, so that hy x Ey;f x y t0 RXt ; At ; y " jhy xj a Ey;f x

# jRXt ; At ; yj jgy jI T > t

t0

" a Ey;f x

y X

# jRXt ; At ; yj jgy j 1I T > t

t0

" a Ey;f x

y X

y X

# jRXt ; At ; yj lzI T > t

t0

where (2.5) was used to obtain the last inequality. Since l b 1, it follows that

68

R. Cavazos-Cadena

" jhy xj a lzEy;f x

y X jRXt ; At ; yj 1

#

t0

and then jhy xj a lzlx, by the previous remark. Thus, the constant C in Lemma 2.1 can be chosen as C lz. (iv) The following convergence holds for every p A P, x A S and y A Y (Hordjik, 1977, Cavazos-Cadena and HernaÂndez-Lerma, 1992): lim

n!y

1 E p lXn 0: n 1 y; x

(v) Let Y 0 be a subset of the parameter space, and suppose that R 0 : K Y 0 ! R is a continuous function such that jR 0 j a jRj. In this case Assumptions 2.1 and 2.2 hold when R and Y are replaced by R 0 and Y 0 , respectively, so that, when these substitutions are made, the conclusions in Lemma 2.1 remain valid. Throughout the remainder of the paper it is assumed that the behaviour of the system is described by some model My , but the controller does not know in advance the true parameter value. Therefore, as the system is in progress, the observed history of the decision process must be used to improve the controller's guess about `the correct' parameter, and at each step the actions applied should be `adapted' to the current estimate. The following assumption concerning the estimation of the unknown parameter will be enforced. Assumption consistently y A Y, h Py;px lim

t!y

2.3. There exists a sequence of functions fy^t : Ht ! Yg which estimates y, in the following sense: For every x A S, p A P and i y^t X0 ; A0 ; . . . ; Xt 1 ; At 1 ; Xt y 1:

Within the framework determined by Assumptions 2.1±2.3, the main objective of the paper is to build an optimal adaptive policy, i.e., a policy p^ A P satisfying J^ p; x; y Jx; y;

x A S; y A Y:

2:6

This problem will be analyzed via an implementation of the value iteration technique with converging parameters, approach that has been successfully used for controlled Markov chains satisfying strong forms of the Simultaneous Doeblin condition; see for instance, (HernaÂndez-Lerma, 1989). This method allows to build an optimal adaptive policy `starting from scratch'; other approach, usually referred to as the Principle of Estimation and Control (PEC), will be brie¯y discussed later in Section 11. Before presenting the value iteration scheme, it is convenient to introduce some modi®cations on the original family of decision models fMy g. Remark 2.1. Even without explicit reference, throughout the remainder of the paper Assumptions 2.1±2.3 are supposed to be valid.

Adaptive control under the LFC

69

3 Transformation of the original model As stated in Puterman (1996), when implementing the value iteration scheme in an MDP endowed with the average criterion, it is convenient to have that the transition matrices associated to stationary policies are aperiodic, a property that does not follow from the assumptions in the previous section. To guarantee this latter property, a transformation introduced in Schweitzer (1971) will be applied to the original transition law; see (3.4) below. Moreover, to obtain the convergence results in the following sections, an arti®cial action and reward will be introduced at the distinguished state z in Assumption 2.2 (CavazosCadena, 1998). Transformed Model. Let a A 0; 1 be ®xed, and let D be an object outside of the action set A. De®ne the metric space U by U A W fDg; where D is considered as an isolated point. For each parameter y A Y the new MDP ~ y hS; U; fUxg; r; ; y; fqx; y ; ygi M is determined as follows:

. The space of admissible actions at each state x A S is given by Ux Ax;

x A Snfzg;

Uz Az W fDg:

3:1

. The reward function r; ; y is de®ned by rx; a; y Rx; a; y;

a A Ax; x A S;

and rz; D; y

B;

3:2

where B lz 1:

3:3

. The controlled transition law is determined as follows: qx; y a; y 1

a a px; y a; y;

a A Ax; x A S;

3:4

whereas qz; z D; y 1:

3:5

~ y , the arti®cial action D is only available at state Notice that for model M z, and that when it is applied the system stays at z; in this sense, a pause in the transitions at z can be done at the expense of incurring in the negative reward B; see (3.2) and (3.5). As it will be shown below, the optimal average reward ~ y coincide, so that combining (2.5) with (3.2), (3.3) and for models My and M ~ y to opti(3.5), it follows that a controller using a stationary policy to drive M

70

R. Cavazos-Cadena

mality will not apply action D; however, the presence of D will be very useful to obtain the bounds in Section 5. On the other hand, (3.4) and (3.5) yield that qx; x a; y b 1

a > 0;

a A Ux; x A S:

3:6

~ y and Xt x is obParticularly, when a stationary policy f is used to drive M served, then Xt1 x occurs with the positive probability 1 a, so that the transition matrix corresponding to f is aperiodic; (3.6) will play an essential role in the analysis of Section 6. Remark 3.1. (i) Since D is an isolated point in the space U, from (3.1)±(3.5) ~ y j y A Yg satis®es the continuityit is not di½cult to see that the family fM compactness conditions in Assumption 2.1. ~ and when M ~y ~ y will be denoted by P, (ii) The space of policies of model M p ~ is driven by p and X0 x, Py; x denotes the distribution of the state-action process fXt ; At g, whereas E~y;p x is the associated expectation operator. The Q ~ : ~ y will be denoted by F set of stationary policies for M x A S Ux; notice ~ then f A F whenever f z 0 D, by (3.1). that, if f A F, The remainder of the section is dedicated to establish two important facts, ~ y g satis®es the adaptive Lyapunov function condinamely, that the family fM tion, and that the solutions of the optimality equations corresponding to each ~ y can be easily obtained from each other. pair of models My and M ~ y g be determined by (3.1)±(3.5), and de®ne the Lemma 3.1. Let the family fM function L : S ! 1; y by Lx

B1 lx; a

x A S:

3:7

~ y g. More explictly, statements (i)± In this case, L is a Lyapunov function for fM (iii) below occur. P (i) For every x A S, a A Ux and y A Y, 1 jrx; a; yj y0z qx; y a; y L y a Lx. P (ii) For each x A S, the mapping a; y 7! y qx; y a; yLy is continuous in a; y A Ax Y. ~ E~ f LXn I T > n ! 0 as n ! y. (iii) Given x A S, y A Y and f A F, y; x Proof. (i) Let y A Y be ®xed and consider an admissible P pair x; a A K. In this case, Assumption 2.1(i) yields that 1 jRx; a; yj y0z px; y a; ylyalx, and then X px; y a; yB 1l y 1 jRx; a; yj y0z

" a B 1 1 jRx; a; yj

X y0z

a B 1lx

# px; y a; yly

Adaptive control under the LFC

so that (see (3.7)) 1 jRx; a; yj

X y0z

71

px; y a; yaL y a aLx:

Using (3.2) and (3.4), this relation implies that X 1 jrx; a; yj qx; y a; yL y a Lx;

x A S; a A Ax:

y0z

Since Ux Ax for x 0 z and Uz Az W fDg, by (3.1), to conclude the proof of part (i) it is su½cient to verify that the inequality in the above statement remains valid when x z and a D. With this in mind, notice that (3.2), (3.5) and (3.7) together yield that 1 jrz; D; yj

X y0z

qz; y D; yL y 1 jBj 1 B a

1B lz Lz; a

recall that lz b 1 and a A 0; 1. case, for y A Y and a A Ax Ux, P (ii) Let x A Snfzg be ®xed. In thisP aLx a y px; y a; yLy, y qx; y a; yLy 1 P so that, form Assumption 2.2(ii) and the de®nition of L in (3.7), y qx; y a; yL y is a continuous function of a; y A Ux Y. On the other hand, from (3.4), (3.5) and (3.7), X qz; y a; yLy y

Lz; if a D P 1 aLz B 1 y px; y a; yl y if a A UznfDg;

since D is an isolated point in Uz, Assumption 2.2(ii) yields that P q y z; y a; yLy is a continuous mapping of a; y A Uz Y. ~ y . To estab(iii) Select y A Y and let f be a stationary policy for model M lish that lim E~y;f x LXn I T > n 0;

3:8

n!y

consider the following two exhaustive cases: Case 1: f z 0 D. In this situation, f is also an admissible policy for model My (see Remark 3.1(ii)), and a simple induction argument using (2.1) and (3.4) yields that for every positive integer n, E~y;f x LXn I T > n ( Pn

n k k f n a Ey; x LXk I T > k; k0 k 1 a P n 1 n 1 n 1 k k1 f a Ey; x LXk1 I T a k0 k 1

if x A Snfzg > k 1;

if x z 3:9

72

R. Cavazos-Cadena

Suppose now that x 0 z and let e > 0 be arbitrary. Using Assumption 2.2(iii) pick N A N such that B1 f Ey; x lXk I T > k < e; a

Ey;f x LXk I T > k

k > N:

From (3.9) it follows that, for n > N, E~y;f x LXn I T > n a

N X n

k

k0

e;

1

a n k a k Ey;f x LXk I T > k

x0z

so that lim sup E~y;f x LXn I T > n n!y

a lim sup n!y

N X n k0

k

1

a n k a k Ey;f x LXk I T > k e e;

and since e > 0 is arbitrary and L is a positive function, it follows that (3.8) holds for x 0 z, whereas the case x z can be handled similarly. Case 2: f z D. In this case (2.1) and (3.5) yield that P~y;f z T 1 1, so that E~y;f z LXn I T > n 0 for n > 0, and (3.8) certainly occurs when x z. To conclude, it is su½cient to show that the convergence in (3.8) is also valid when the state x does not coincide with z. To achive this goal, de®ne a policy f 0 A F as follows: f 0 y f y for y 0 z, and f 0 z a 0 , where a 0 is a ®xed action in Az. Observing that E~y;f x LXk I T > k depends only on the actions selected before the ®rst visit to state z, it is not di½cult to obtain the following relation analogous to (3.9): E~y;f x LXn I T > n

n X n k0

k

1

0

a n k a k Ey;f x LXk I T > k;

x 0 z:

From this equality, (3.8) can be established for x 0 z along the lines used in Case 1. r ~ y g; According to Lemma 3.1, Assumption 2.2, is satis®ed by the collection fM since this family also satis®es Assumption 2.1 (see Remark 3.1(i)), Lemma 2.1 ~ y , and that yields that the optimal average reward is constant for each model M an optimal policy can be determined if a solution to the corresponding AROE is known. The following lemma shows explicitly the relation between the solu~ y. tions of the optimality equations associated to My and M Lemma 3.2. For each y A Y, let the pair gy ; hy as in Lemma 2.1, and de®ne ~hy : S ! R by

Adaptive control under the LFC

~hy x 1 hy x; a

73

x A S:

3:10

In this case, statements (i) and (ii) below occur: hy xja (i) ~hy z 0 and ~hy is dominated by L, i.e., for some constant C, j~ CLx; x A S. ~ y , that is, for each (ii) The pair gy ; ~hy satis®es the AROE associated to M x A S, " gy ~hy x sup

a A Ux

X

rx; a; y

y

# qx; y a; y~ hy y :

3:11

Proof. Part (i) follows immediately combining Lemma 2.1(ii) and (3.10). To establish part (ii), observe that for every x A S, the AROE in (2.4) and (3.10) together yield that for each x A S " gy a~hy x sup

a A Ax

and adding 1

Rx; a; y

X y

# px; y a; ya~ hy y ;

a~hy x to both sides of this equality, it follows that "

gy ~hy x sup

a A Ax

a~ hy x

Rx; a; y 1

X y

# ~ px; y a; yahy y :

Recalling the de®nition of the reward function and the transition law for model ~ y , this latter equality is equivalent to M " gy ~hy x sup

a A Ax

rx; a; y

X y

# ~ qx; y a; yhy y ;

x A S:

3:12

Since Ux Ax when x 0 z, this establishes (3.11) for x 0 z. To conclude, it is su½cient to show that gy ~hy z > rz; D; y

X y

qz; y D; y~ hy y;

3:13

Indeed, since Uz Az W fDg, this inequality and the case x z in (3.12) together imply that (3.11) is also valid for x z. To verify (3.13), just observe that (3.2), (3.3) and (3.5) together yield that rz; D; y

X y

qz; y D; y~hy y

so that (3.13) is equivalent to gy > by (2.5).

lz lz

1~ hy z; 1, inequality that is certainly true, r

74

R. Cavazos-Cadena

4 Value iteration with converging parameters Throughout this and the following sections fyn j n 0; 1; 2; . . .g H Y is a ®xed sequence converging to the target value y A Y: lim yn y:

4:1

n!y

Also, it is convenient to introduce the following notation: y0n y0 ; y1 ; . . . ; yn ;

n A N;

and

y0k y0 ;

k < 0:

4:2

De®nition 4.1. [Non Stationary Value Iteration.] The sequence fVn jy0n : S ! Rg of value iteration (VI) functions associated to the sequence fyn g is recursively de®ned as follows: V 1 x; y0 1 0;

x A S;

4:3

and " Vn x; y0n

sup

a A Ux

rx; a; yn

X y

# qx; y a; yn Vn 1 y; y0n 1

;

x A S; n A N;

4:4

Remark 4.1. (i) Notice that the VI functions have been de®ned in terms of the ~ y g, and not via the original models fMy g in Section 2. transformed family fM (ii) Vn ; y0n is the optimal value function associated to a non stationary MDP endowed with the expected total-reward criterion and running from time 0 to time n. For this model, the terminal reward is null and the reward function and transition law at time t are r; ; yn t and fqx; y ; yn t g, respectively. (iii) The value iteration functions are ®nite. Indeed, from an induction argument using Lemma 3.1, it is not di½cult to verify that the inequality jVn x; y0n j a Lx n 1Lz is always valid; these bounds will be improved in Section 5 below. The VI functions will be now used to build approximations to the solution of AROE associated to model My , where y is as in (4.1). De®nition 4.2. The sequence fgn ; y0n : S ! Rg of di¨erential rewards is de®ned by gn x; y0n Vn x; y0n

Vn 1 x; y0n 1 ;

x A S; n A N:

whereas the sequence fhn ; y0n : S ! Rg of relative value functions is given by hn x; y0n Vn x; y0n

Vn z; y0n ;

x A S; n

1; 0; 1; 2; . . .

Adaptive control under the LFC

75

With this notation, it is not di½cult to see that (4.4) can be equivalently written as " # X n n n 1 qx; y a; yn hn 1 y; y0 ; gn z; y0 hn x; y0 sup rx; a; yn a A Ux

y

x A S; n A N;

4:5

~ y ; see (3.11). The folequality that resembles the AROE corresponding to M lowing is the main technical result of this note and will play an essential role to establish the adaptive optimality of the policy introduced in Section 8. Theorem 4.1. Let fyn g H Y be as in (4.1). With the notation in De®nition 4.2, the following convergences hold: lim gn z; y0n gy ;

n!y

and

lim hn x; y0n ~ hy x;

n!y

x A S;

~ y ; see Lemma where gy ; ~hy is the solution to the AROE corresponding to M 3.2. The approach used below to establish this result follows the ideas used in Cavazos-Cadena (1998), where the case in which Y is a singleton was analyzed. The argument consists in two steps: First, it will be shown that the sequence of di¨erential rewards at z, i.e., fgn z; y0n g, is bounded, and that the relative value functions hn ; y0n are dominated by a constant multiple of the Lyapunov function L; in this part the presence of the arti®cial control D will be relevant. Next, the asymptotic behaviour of the di¨erential rewards fgn ; y0n g will be studied, and the results obtained will rely heavily on the inequality (3.6), which stems from Schweitzer's transformation (3.4). After these steps are performed, the proof of Theorem 4.1 will be presented in Section 7. 5 Useful bounds The ®rst step of the proof of Theorem 4.1 is the following result, establishing bounds for the the di¨erential rewards and the relative value functions; notice that (5.1) below improves the bounds that can be directly obtained from Remark 4.1(iii). Theorem 5.1. Let the sequences fgn z; y0n g and fhn ; y0n g be as in De®nition 4.2. In this case, there exists a constant C, which does not depend on the sequence fyn g such that for every n A N jgn z; y0n j a C;

and

jhn x; y0n j a CLx;

x A S:

5:1

The proof of this result has been splitted into the following two lemmas; the arguments rely on equation (4.5), and the ®rst lemma uses the arti®cial action D.

76

R. Cavazos-Cadena

Lemma 5.1. The following assertions (i) and (ii) occur; see (3.3) and (3.7). (i) gn z; y0n b B; n A N (ii) hn x; y0n a BLx; x A S; n A N W f 1g: Proof. (i) From (4.5) it follows that gn z; y0n hn z; y0n b rz; D; yn

X y

qz; y D; yn hn 1 y; y0n 1

B hn 1 z; y0n 1

where the equality used (3.2) and (3.5). Since the relative value functions satisfy that hk z; y0k 0 for every integer k b 1 (see De®nition 4.2), part (i) follows. (ii) The argument is by induction. Notice that, the conclusion certainly holds for n 1, since h 1 ; y0 1 1 0. Assume now that hn 1 ; y0n 1 a BL for some n A N, and observe that (4.5) implies that for every x A S, " # X n n n 1 qx; y a; yn hn 1 y; y0 hn x; y0 sup rx; a; yn gn z; y0 a A Ux

y

" a sup

a A Ux

rx; a; yn B

" sup

a A Ux

rx; a; yn B

X y

X y0z

# qx; y a; yn hn 1 y; y0n 1 #

qx; y a; yn hn 1 y; y0n 1

where the inequality follows from part (i) and the second equality uses the fact that hn 1 z; y0n 1 0. Using that B > 1, the induction hypothesis and the above relation together yield that for each state x " # X n qx; y a; yn BLy hn x; y0 a sup Bjrx; a; yn j B a A Ux

y0z

" B sup

a A Ux

jrx; a; yn j 1

X y0z

# qx; y a; yn Ly

and then, from Lemma 3.1(i), it follows that hn x; y0n a BLx; this completes the induction argument, since x A S is arbitrary. r Lemma 5.2. The following assertions (i) and (ii) below are valid. (i) gn z; y0n a BLz; n A N. (ii) hn x; y0n b BLzLx; x A S; n A N W f 1g: Proof. (i) Using that hk z; y0k 0 for every k, (4.5) with x z and Lemma 5.1(ii) together yield that for every n A N

Adaptive control under the LFC

77

" gn z; y0n

sup

rz; a; yn

a A Uz

" a sup

rz; a; yn

a A Uz

sup

X y0z

Bjrz; a; yn j

a A Uz

" B sup

a A Uz

qz; y a; yn hn 1 y; y0n 1

y0z

since B > 1, this relation implies " gn z; y0n a

#

X

jrz; a; yn j

# qz; y a; yn BL y ;

X y0z

X y0z

# qz; y a; yn BL y # qz; y a; yn Ly

and then, via Lemma 3.1(i), it follows that gn z; y0n a BLz. (ii) The argument is by induction. Observe that the inequality hn ; y0n b BLzL clearly holds for n 1, since h 1 ; y0 1 1 0. Suppose now that hn 1 ; y0n 1 b BLzL for some n A N. Starting form (4.5) and using part (i), it follows that for every x A S " # X n n n 1 qx; y a; yn hn 1 y; y0 hn x; y0 sup rx; a; yn gn z; y0 a A Ux

y

" b sup

a A Ux

jrx; a; yn j

BLz

X y0z

# qx; y a; yn hn 1 y; y0n 1

;

recall that hn 1 z; y0n 1 0. Hence, the inequality BLz > 1 and the induction hypothesis together yield " hn x; y0n b sup

a A Ux

BLz

BLzjrx; a; yn j X y0z

BLz #

qx; y a; yn Ly "

b BLz inf

a A Ux

jrx; a; yn j 1

X y0z

# qx; y a; yn Ly ;

and applying Lemma 3.1(i) it follows that hn x; y0n b BLzLx; this completes the induction argument, since x A S is arbitrary. r Setting C BLz, the conclusion of Theorem 5.1 follows from Lemmas 5.1 and 5.2. Observe that, by De®nitions 4.1 and 4.2, for each state x the equality gn x; y0n gn z; y0n hn x; y0n

hn 1 x; y0n 1

5:2

78

R. Cavazos-Cadena

is valid for every integer n b 0, so that (5.1) yields jgn x; y0n j a 2CLx C a 3CLx;

x A S; n A N:

5:3

6 Asymptotic behaviour of the di¨erential rewards This section concerns the limit behaviour of the sequence of di¨erential rewards. To begin with, it is convenient to introduce the following notation for the extreme limit points of fgn ; y0n g. De®nition 6.1. The functions L0 : S ! R and L1 : S ! R are de®ned as follows: For each x A S, L0 x lim inf gn x; y0n ; n!y

and

L1 x lim sup gn x; y0n : n!y

The main objective of this section is to establish the following result. Theorem 6.1. (i) State z maximizes L1 , i.e., for every x A S, L1 x a L1 z:

6:1

Moreover, if the sequence fnk gy k1 of positive integers goes to y and satis®es lim gnk z; y0nk L1 z;

6:2

k!y

then, lim gnk

k!y

nk 1 1 z; y0

L1 z:

6:3

Similarly, (ii) L0 x b L0 z for every x A S, and if the sequence fnk gy k1 of positive integers converges to y and satis®es that limk!y gnk z; y0nk L0 z, then limk!y gnk 1 z; y0nk 1 L0 z. The proof of this result uses the following lemma, which is derived from the ~ y g satis®es the conditions in Assumptions 2.1 and 2.2; fact that the family fM see Remark 3.1(i) and Lemma 3.1. Lemma 6.1. Let fyn g H Y and y be as in (4.1). With the notation in Sections 3 and 4, statements (i)±(iv) below are valid. (i) For each x A S, the following convergences hold as n ! y. sup jrx; a; yn

rx; a; yj ! 0

sup jqx; y a; yn

qx; y a; yj ! 0;

a A Ux

a A Ux

and y A S:

Adaptive control under the LFC

79

(ii) Let x A S and the compact subset C H Y be ®xed. Given e > 0, there exists a ®nite set Gx; e 1 G H S such that X qx; y a; dLy < e; a A Ux; d A C: 6:4 yBG

(iii) Let fHn : S ! Rg be a sequence satisfying that, for some constant k, jHn xj a kLx;

x A S; n A N:

Under this condition, for every x A S X X qx; y a; yn Hn y qx; y a; yn 1 Hn y ! 0 sup a A Ux y y

as n ! y:

Proof. (i) As already noted in Remark 3.1(i), Assumption 2.1 and the construction in Section 3 together imply that for each x; y A S, the mappings a; d 7! rx; a; y and a; d 7! qx; y a; y are continuous in a; d A Ux Y. Since each set Ux is a compact topological space, the `tube lemma' in Munkres (1975), p. 169, yields that, given x A S, for every e > 0 there exists a neighborhood V y H Y such that sup jrx; a; d

rx; a; yj < e;

sup jqx; y a; d

qx; y a; yj < e;

a A Ux

a A Ux

and d A V y;

since fyn g converges to y, the conclusion follows from this statement. (ii) Let fGn g be a sequence of subsets of S such that Gn % S as n % y. In this case, for every x A S X X qx; y a; dLy % qx; y a; dLy; a A Ux; d A Y; y

y A Gn

using that each set Gn is ®nite, Remark 3.1(i) and Lemma 3.1(ii) yield that both sides in this convergence depend continuously on a; d A Ux Y. Since Ux is a compact metric space, and C is a compact subset of Y, the product Ux C is compact, so that by Dini's Theorem (Royden, 1968), the above convergence is uniform in Ux C, i.e., given e > 0 there exists N A N for which X X X qx; y a; dLy qx; y a; dLy qx; y a; dLy < e; y

y B GN

y A GN

a A Ux; d A C; so that the desired conclusion is satis®ed by G 1 GN . (iii) Let x A S be ®xed, and set C fyn g W fyg which, since fyn g converges to y, is a compact subset of Y. Given x A S and e > 0, select a ®nite set G such that (6.4) holds and notice that for every positive integer n, the condition jHt j a kL yields

80

R. Cavazos-Cadena

X sup qx; y a; yn Hn y a A Ux y a

X

X y

sup jqx; y a; yn

qx; y a; yn 1 j jHn yj

y A G a A Ux

X

sup

a A Ux y B G

ak

X

qx; y a; yn jHn yj sup

sup jqx; y a; yn X

a A Ux y B G

ak

X

X

a A Ux y B G

qx; y a; yn 1 jHn yj

qx; y a; yn 1 jL y

y A G a A Ux

k sup

qx; y a; yn 1 Hn y

qx; y a; yn Ly k sup

X

a A Ux y B G

sup jqx; y a; yn

qx; y a; yn 1 Ly

qx; y a; yn 1 jL y 2ke

y A G a A Ux

where, observing that yk A C for every k, the last inequality follows from (6.4) Then, via part (i), it follows that X X qx; y a; yn Hn y qx; y a; yn 1 Hn y a 2ke; lim sup sup n!y a A Ux y y and the conclusion follows, since e > 0 and x A S are arbitrary.

r

Proof of Theorem 6.1. Let n be a positive integer. Since the VI functions are dominated by the Lyapunov function L plus a constant term (see Remark 4.1 (iii)), from Lemma 3.1(ii), Remark 3.1(i) and Proposition 18 in Royden (1968) p. 232, it is not di½cult to see that for each x A S the term within brackets in (4.4) is a continuous function of a A Ux. Since the action sets Ux are com~ such that for every x A S pact, for each positive integer n there exists fn A F X qx; y fn x; yn Vn 1 y; y0n 1 Vn x; y0n rx; fn x; yn y

rx; fn x; yn

X y

X

qx; y fn x; yn Vn 2 y; y0n 1

Vn 2 z; y0n 2 rx; fn x; yn

X y

y

qx; y fn x; yn Vn 1 y; y0n 1 Vn 2 y; y0n 2

X y

Vn 2 z; y0n 2

qx; y fn x; yn gn 1 y; y0n 1

qx; y fn x; yn hn 2 y; y0n 1 Vn 2 z; y0n 2

Adaptive control under the LFC

81

where the third equality follows from De®nition 4.2. On the other hand, (4.4) with n 1 instead of n yields X qx; y fn x; yn 1 Vn 2 y; y0n 2 Vn 1 x; y0n 1 b rx; fn x; yn 1 y

rx; fn x; yn 1

X y

qx; y fn x; yn 1

Vn 2 y; y0n 2

Vn 2 z; y0n 2 Vn 2 z; y0n 2 X rx; fn x; yn 1 qx; y fn x; yn 1 hn 2 y; y0n 2 y

Vn 2 z; y0n 2 From these relations and De®nition 4.2, it follows that X gn x; y0n a qx; y fn x; yn gn 1 y; y0n 1 gn x; x A S;

6:5

y

where, for each x A S, gn x : rx; fn x; yn X y

rx; fn x; yn 1

X y

qx; y fn x; yn hn 2 y; y0n 1

qx; y fn x; yn 1 hn 2 y; y0n 2 ;

notice that Theorem 5.1 and Lemma 6.1 together imply that lim gn x 0;

n!y

x A S:

6:6

Next, let x A S be ®xed, and select an increasing sequence of positive integers fnk g such that lim gnk x; y0nk L1 x;

6:7

k!y

and observe that gn k

nk 1 1 ; y0

A

Y

3CLx; 3CLx;

6:8

xAS

where C is the constant in Theorem 5.1; see (5.3). Since the product space in this inclusion and Ux are compact metric spaces, taking a subsequence of fnk g, if necessary, it can be assumed that in addition to (6.7) the following limits exist: lim gnk

k!y

nk 1 1 y; y0

: Ly;

y A S;

and

lim fnk x : ax A Ux;

k!y

6:9

82

R. Cavazos-Cadena

notice that, since L1 y is the largest limit point of the whole sequence fgn y; y0n g, Ly a L1 y:

6:10

Combining (6.8)±(6.10) with the continuity property in Lemma 3.1(ii), Proposition 18 in Royden (1968) p. 232 yields lim

k!y

X y

qx; y fnk x; ynk gnk

X y

nk 1 1 y; y0

qx; y ax ; yLy a

X y

qx; y ax ; yL1 y:

6:11

Replacing n by nk in (6.5), and taking limit as k goes to in®nity in both sides of the resulting inequality, (6.6), (6.7) and (6.11) together imply that L1 x a

X y

qx; y ax ; yL1 y

~ by f x : ax , x A S, This argument is valid for each x A S, and de®ning f A F the above inequality yields that, for every state x, L1 x a E~y;f x L1 X1 L1 zP~y;f x T 1 E~y;f x L1 X1 I T > 1; and via an induction argument using the Markov property, it follows that L1 x a L1 z

n X k1

P~y;f x T k E~y;f x L1 Xn1 I T > n 1;

x A S n A N:

Observe now that (5.3) and De®nition 6.1 together imply that jL1 ja3CL, and then, for an arbitrary initial state x, E~y;f x L1 Xn I T > n ! 0 as n ! y, ~ y . Thus, taking limit as n goes to y in the above by Remark 2.1(v) applied to M displayed inequality, this latter convergence implies that for each x A S, L1 x a L1 z

y X k1

P~y;f x T k L1 zP~y;f x T < y a L1 z:

This establishes (6.1), and to complete the proof of part (i), it will be proved that (6.2) implies (6.3). Select a sequence fnk g of positive integers such that (6.2) holds, i.e., lim gnk z; y0nk L1 z:

k!y

To verify that limk!y gnk 1 z; y0nk 1 L1 z, it is su½cient to prove that each limit point of the sequence fgnk 1 z; y0nk 1 g coincides with L1 z. Let Lz be

Adaptive control under the LFC

83

an arbitrary limit point of gnk 1 z; y0nk 1 ), so that taking a subsequence of fnk g, without loss of generality it can be supposed that lim gnk

k!y

nk 1 1 z; y0

Lz:

Moreover, after taking an additional subsequence, it can be assumed that the following convergences hold: lim gnk

k!y

nk 1 1 y; y0

: Ly;

y A Snfzg;

and

lim fnk z az A Uz:

k!y

In this case, the convergence in (6.11) is valid with x z: X X lim qz; y fnk z; ynk gnk 1 y; y0nk 1 qz; y az ; yL y k!y

y

y

Replacing x by z and n by nk in (6.5) and taking limit as k goes to y in both sides of the resulting inequality, the four last displayed convergences and (6.6) together yield that X X qz; y az ; yLy a qz; y az ; yL1 y; L1 z a y

y

for the second inequality recall that L1 y is the limit superior of the whole sequence fgn y; y0n g. Since L1 a L1 z, it follows that X X qz; y az ; yLy a qz; y az ; yL1 y a L1 z; L1 z a y

y

and then L y L1 y if qz; y az ; y > 0, so that from (3.6) it follows that Lz L1 z. In short, it has been proved that when (6.2) is valid, then an arbitrary limit point of fgnk 1 z; y0nk 1 g coincides with L1 z, so that (6.3) occurs. This completes the proof of part (i), whereas part (ii) can be established along the same lines. r 7 Proof of Theorem 4.1 This section presents a proof of Theorem 4.1. The argument has been splitted into Lemmas 7.2 and 7.3 below, which rely on Theorem 6.1 as well as on the following continuity result derived from Lemma 6.1. Lemma 7.1. Let fyn g H Y and y be as in (4.1). Suppose that fHn : S ! Rg is a sequence satisfying jHn xj a kLx;

x A S; n A N;

where k is a constant, and lim Hn x Hx;

n!y

x A S:

In this case, for each state x

84

R. Cavazos-Cadena

(i) As n ! y, " # X qx; y a; yn Hn y sup rx; a; yn a A Ux y "

X

rx; a; y

y

and, consequently, " (ii) lim

sup

rx; a; yn

n!y a A Ux

X y

" sup

a A Ux

# qx; y a; y yH y ! 0

rx; a; y

# qx; y a; yn Hn y

X y

# qx; y a; yHy

Proof. (i) Paralleling the argument used to prove Lemma 6.1(iii), the following convergence can be established for each x A S: X X qx; y a; yn Hn y qx; y a; yH y ! 0 as n ! y: 7:1 sup a A Ux y y Since

" # X qx; y a; yn Hn y sup rx; a; yn a A Ux y " rx; a; y

X y

a sup jrx; a; yn a A Ux

# qx; y a; yH y rx; a; yj

X qx; y a; yn Hn y sup a A Ux y

X y

qx; y a; yH y

the conclusion follows combining (7.1) and Lemma 6.1(i). (ii) This part is obtained from part (i) via Lemma 3.3 in Hinderer (1970). r Lemma 7.2. Let the functions L1 and L0 be as in De®nition 6.1. In this case (i) L1 z gy . (ii) L0 z gy . Consequently, (iii) For each x A S, limn!y gn x; y0n gy .

Adaptive control under the LFC

85

Proof. (i) Select a sequence fnk g H N such that limk!y gnk z; y0nk L1 z, and set gk ; y0k 1 0 for k < 0 and hk ; y0k 1 0 for k < 1. Applying repeatedly Theorem 6.1, it follows that lim gnk t z; y0nk t L1 z;

tAN

k!y

7:2

whereas, by Theorem 5.1, Y CLx; CLx : W; hn ; y0n A

7:3

xAS

since W is a compact metric space, Cantor's diagonal method can be used to obtain a subsequence of fnk g, say fmk g, such that lim hmk t x; y0mk t : Ht x;

k!y

x A S; t A N:

7:4

Using equation (4.5) with mk t instead of n and taking limit as k ! y in both sides of the resulting equality, Lemma 7.1 and (7.2)±(7.4) together yield that for every t A N, " # X qx; y a; yHt1 y ; x A S: 7:5 L1 z Ht x sup rx; a; y a A Ux

y

This implies that the inequality L1 z Ht x b rx; a; y

X y

qx; y a; yHt1 y

is always valid, and using this fact, an induction argument yields that for every ~ policy p A P, " # n X p ~ rXt ; At ; y Hn1 Xn1 ; x A S; n A N: n 1L1 z H0 xb E y; x

k0

7:6 On the other hand jHk j a CL, by (7.3) and (7.4), so that 1 ~p C ~p E jHn1 Xn1 j a E LXn1 ! 0 n 1 y; x n 1 y; x

as n ! y;

~ y. where the convergence follows from Remark 2.1(iv) applied to model M Therefore, (7.6) leads to " # n 1 ~p X ~ Ey; x rXt ; At ; y ; x A S; p A P: 7:7 L1 z b lim sup n!y n 1 t0 Using Lemma 3.1(ii) and Remark 3.1(i), an application of Proposition 18 in Royden (1968) p. 232 yields that the term within brackets in (7.5) is a contin-

86

R. Cavazos-Cadena

uous function of a A Ux; since the action sets U are compact, for each ~ such that t A N there exists ft A F X qx; y ft x; yHt1 y; t A N; x A S: L1 z Ht x rx; ft x; y y

De®ning the Markov policy p by p f f0 ; f1 ; f2 ; . . .g, this equality yields, via an induction argument using the Markov property, that " # n X p ~ rXt ; At ; y Hn1 Xn1 ; x A S; n A N; n 1L1 z H0 x E y; x

k0

and paralleling the argument leading to (7.7), it follows that " # n 1 ~p X Ey; x L1 z lim sup rXt ; At ; y ; x A S: n!y n 1 t0

7:8

From (7.7) and (7.8) it is clear that L1 z is the optimal average reward for ~ y ; since this latter value is gy , by Lemma 3.2, part (i) follows, whereas model M part (ii) can be obtained along the same lines. (iii) From de®nition 6.1 it is clear that L0 a L1 and then, combining Theorem 6.1 with parts (i) and (ii) above, it follows that gy a L0 a L1 a gy , so that lim inf gn x; y0n L0 x gy L1 x lim sup gn x; y0n ; n!y

n!y

completing the proof.

x A S; r

Lemma 7.3. As n goes to y, hn x; y0n ! ~hy x;

x A S:

Proof. Since the sequence fhn ; y0n g is contained in the compact metric space W (see (7.3)), it is su½cient to prove that each one of its limit points in W coincides with ~hy . With this in mind, let h A W be a limit point of fhn ; y0n g, and select a sequence of positive integers fnk g increasing to y and satisfying that lim hnk x; y0nk hx;

k!y

x A S:

7:9

In this case, using (5.2), Lemma 7.2 yields that lim hnk

k!y

nk 1 1 x; y0

hx;

x A S:

7:10

Thus, replacing n by nk in (4.5) and taking limit as k converges to y in both sides of the resulting equality, Lemmas 7.1 and 7.2 together imply that " # X gy hx sup rx; a; y qx; y a; yhy ; x A S; a A Ux

y

Adaptive control under the LFC

87

see (7.3), (7.9) and (7.10). Therefore, the pair gy ; h satis®es the AROE ~ y . From (7.3) and (7.9) it is clear that jhxj a CLx for each associated to M x A S, whereas the fact that hk z; y0k 0 for every k yields that hz 0, so that ~ y implies that h the uniqueness statement in Lemma 2.1(v) applied to M ~hy . To summarize, it has been shown that every limit point of fhn ; y n g 0 r coincides with ~hy , and as already noted, this completes the proof. The conclusion of Theorem 4.1 follows combining Lemmas 7.2 and 7.3. Before leaving this section, it is useful to point out the simple but useful remark in Lemma 7.4 below. As noted in the proof of Theorem 6.1, for each n A N ~ such that there exists a stationary policy fn A F X qx; y fn x; yn Vn 1 y; y0n 1 ; x A S 7:11 Vn x; y0n rx; fn x; yn y

According to the following lemma, policy fn belongs to the space F as soon as n is large enough. ~ be as in (7.11). In this case, there exists Lemma 7.4. For each n A N, let fn A F N A N such that fn z 0 D for n > N. Proof. Observe that (2.5) and (3.3) yield that jgy j < B, so that, by Theorem 4.1, there exists N A N such that jgn z; y0n j < B

for n > N

7:12

On the other hand, form (3.2), (3.5) and (7.11), it follows that fn z D ) Vn z; y0n ) gn z; y0n

B Vn 1 z; y0n 1 B;

see De®nition 4.2. Thus, together with (7.12), this yields that fn z 0 D when n > N. r 8 The adaptive policy The previous results on the VI scheme with converging parameters will be now used to construct an optimal adaptive policy; before going ant further, it should be emphasized that the dynamics of the underlying system is supposed to be described by some of the models in the original family fMy g in Section 2, but ~ y g used in the connot by one of the members in the auxiliary collection fM ^ struction of the VI functions. Let fyt : Ht ! Yg be the sequence of consistent estimators in Assumption 3.1. As the system is driven by the controller, the estimate y^t y^t ht becomes available at time t, and the VI function Vt ; y^0t can be determined in terms of the Vt 1 ; y^0t 1 , which has been already computed: " # X Vt x; y^t sup rx; a; y^t qx; y a; y^t Vt 1 x; y^t 1 : 8:1 0

a A Ux

y

0

88

R. Cavazos-Cadena

As before, the term within brackets is a continuous function of a A Ux, and the compactness of the action sets U yields that there exists a policy ft ; y^0t A ~ such that F X qx; y ft x; y^0t ; y^t Vt 1 x; y^0t 1 ; x A S: Vt x; y^0t rx; ft x; y^0t ; y^t y

8:2 Moreover, using Proposition D.2 in HernaÂndez-Lerma (1989) p. 130, it is not di½cult to see that for each x A S, the mapping ht ! ft x; y^0t can be assumed to be a Borel measurable function of ht A Ht , a condition that, hereafter, is supposed to hold. ~ be as in (8.2), and select a ®xed De®nition 8.1. For each t A N, let ft ; y0t A F action a A Az. The deterministic policy p f ft g is de®ned as follows: For each t A N and ht A Ht , 8 ^t > < ft xt ; y0 ; if xt 0 z ft ht ft z; y^0t ; if xt z and ft z; y^0t 0 D > : a ; if xt z and ft z; y^0t D. A controller using this policy to drive the system chooses actions as follows: At each time t A N, function Vt ; y^0t is computed in terms of Vt 1 ; y^0t 1 , which is already known. Then, if Xt xt is observed, the maximizer at ft xt ; y^0t A Uxt is determined and the action applied is at , except when at D, in which case the action a A Az is used; recall that the arti®cial ac~ y g but not for tion D is only available at state z for the auxiliary family fM the models in fMy g. This policy p is referred to as the non stationary value iteration (NVI) policy. The following is the main result in this note. Theorem 8.1. The NVI policy p in De®nition 8.1 is adaptive optimal, i.e., Jp ; x; y J x; y;

x A S; y A Y:

The argument used to establish this result depends on the following idea (Mandl, 1974). De®nition 8.2. For each y A Y, let the pair gy ; hy be as in Lemma 2.1. The discrepancy function F : K Y ! R associated to the family fMy g is de®ned by X px; y a; yhy y; Fx; a; y gy hy x Rx; a; y y

x A S; a A Ax; y A Y: Notice that the AROE in (2.5) yields that F is a nonnegative function.

Adaptive control under the LFC

89

Moreover, from Lemma 2.2(ii) and Assumptions 2.1 and 2.2 it follows, via Proposition 18 in Royden (1968) p. 232, that for each y A Y, a 7! Fx; a; y is a continuous function of a A Ax. In particular, since the action sets are compact, for each x A S and y A Y, Fx; ; y is a bounded function. The key tool to establish Theorem 8.1 is the following result. Theorem 8.2. For the NVI policy p the following convergence holds for every y A Y and x A S: " # n X 1 p E FXt ; At ; y 0: lim n!y n 1 y; x t0 The somewhat technical proof of this theorem will be given in Section 9 below. Using this result, Theorem 8.1 follows immediately. Proof of Theorem 8.1. From the de®nition of the discrepancy function, it is clear that the equality gy hy x Rx; a; y Fx; a; y

X y

px; y a; yhy y

always holds, and via an induction argument it yields that for every policy p A P, x A S and y A Y, " n 1gy hy x Ey;p x

# n X RXt ; At ; y FXt ; At ; y hy Xn1 ; t0

n A N: Then, applying Remark 2.1(iv), it follows that " # n X 1 Ep RXt ; At ; y FXt ; At ; y : gy lim n!y n 1 y; x t0 Using this equality with p instead of p, Theorem 8.2 implies that " # n X 1 p E RXt ; At ; y ; gy lim n!y n 1 y; x t0 so that Jp ; x; y gy J x; y; see Lemma 2.1(i). Since x A S and y A Y r are arbitrary, the NVI policy p is adaptive optimal. 9 Proof of Theorem 8.2 The argument used to prove Theorem 8.2 has been splitted into three parts presented below as Lemmas 9.1±9.3 below. First, it will be established that if

90

R. Cavazos-Cadena

the discrepancy function is truncated to zero outside a ®nite set, then the average reward associated to this truncation under the NVI policy p is null. Lemma 9.1. (i) Let x0 ; a0 ; x1 ; a1 ; x2 ; a2 ; . . . A Ky be a trajectory of the stateaction process fXt ; At for which the following convergence hold: ^ t y: lim yh

9:1

t!y

In this case, for each x A S, limt!y Fx; ft x; y^0t ; y 0; see (8.2) and De®nition 8.2. (ii) For each x; y A S and y A Y, lim FXt ; At ; yI Xt y 0

t!y

Py;px -almost surely:

Consequently (iii) For each ®nite subset G of S, " # n X 1 p E FXt ; At ; yI Xt A G 0: lim n!y n 1 y; x t0 Proof. (i) By (9.1) and Lemma 7.4, ft z; y^0t 0 D is t is large enough. In this case straightforward calculations using (8.2) and De®nition 4.2 show that for each x A S, gt z; y^0t aht x; y^0t Rx; ft x; y^0t ; yt X px; y ft x; y^0t ; yt aht 1 y; y^0t 1 y

so that Fx; ft x; y^0t ; y gy hy x Rx; ft x; y^0t ; y X px; y ft x; y^0t ; yhy y y

gy gt z; y^0t hy x aht x; y^0t " Rx; ft x; y^t ; y Rx; ft x; y^t ; yt 0

X y

X y

0

px; y ft x; y^0t ; yhy y # t t 1 ^ ^ px; y ft x; y0 ; yt aht 1 y; y0

Adaptive control under the LFC

91

Each one of the terms within brackets in the last equality converges to zero. Indeed, by (9.1) and Theorem 4.1, as t % y, gy gt z; y^0t ! 0 and hy x aht x; y^0t ! hy x a~hy x 0; see (3.10). Concerning the third term, its absolute value is bounded above by X px; y a; yhy y sup Rx; a; y Rx; a; yt a A Ax y X y

px; y a; yt aht 1 y; y^0t 1 ;

term that also converges to zero as t goes to y; this latter fact can be obtained along the lines used to prove Lemma 7.1(i). Thus, limt!y Fx; ft x; y^0t ; y 0. (ii) Let H H Ky be the set of trajectories such that (9.1) holds. For each trajectory in H, the selector ft ; y^0t satis®es that ft z; y^0t 0 D for t large enough and, in this case, under p the action At applied at time is ft Xt ; y^0t with probability one (see De®nition 8.1), so that for every x A S and y A Y, the following assertion holds: Along a trajectory in H;

FXt ; At ; yI Xt y

F y; ft y; y^0t ; yI Xt y

if t is large enough:

Since Py;p x H 1, by Assumption 2.3, the conclusion follows from part (i). (iii) As already noted, Fy; ; y is a bounded function, so that by part (ii) and dominated convergence, limt!y Ey;p x FXt ; At ; yI Xt y 0 always holds. Then for each ®nite set G H S, X lim Ey;p x FXt ; At ; yI Xt y 0; lim Ey;p x FXt ; At ; yI Xt A G t!y

yAG

t!y

x A S; y A Y; and the conclusion follows immediately.

r

The next step consists in studying the average reward associated to the discrepancy function truncated to zero within a ®nite set when the system is driven by the NVI policy p ; the main goal is to show that such an average reward can be made as small as desired if the ®nite set is large enough. To establish this result, the following lemma will be useful. Lemma 9.2. For each y A Y there exists a function Ly : S ! 1; y such that X px; y a; yLy y a Ly x; x A S; a A Ax; 9:2 1 Fx; a; y y0z

moreover, for some constant C1 , Ly a C1 l:

9:3

92

R. Cavazos-Cadena

Proof. Given y A Y, let gy ; hy be the pair satisfying the AROE (2.4). In this case, recalling that hy z 0, the de®nition of the discrepancy function yields that for every x A S and a A Ax, hy x Rx; a; y Fx; a; y

gy

X y0z

px; y a; yhy y:

On the other hand, from Assumption 2.2, lx b jRx; a; yj 1

X y0z

px; y a; yly;

so that for each constant b > 0 blx hy x b bjRx; a; yj Rx; a; y b gy Fx; a; y X px; y a; y bl y hy y: y0z

Set b lz b 1 and Ly hy bl. Using (2.5), it follows that (9.2) holds, whereas (9.3) is obtained from Lemma 2.1(ii). r Let y A Y be ®xed, and consider the MDP M hS; A; fAxg; F; ; y; f px; y ; ygi for which the discrepancy function is the reward function. Lemma 9.2 implies that M satis®es the Lyapunov function condition, i.e, Assumption 2.2 is satis®ed by M when Ly replaces l and the parameter space is reduced to the singleton fyg. Indeed, in this case (9.2) yields the ®rst part of Assumption 2.3, the second part follows from (9.3) and Proposition 18 in Royden (1968) p. 232, and (9.3) yields the third part. For each G H S de®ne FG x; a; y

Fx; a; y; 0;

if x B G if x A G.

9:4

By Remark 2.1(v), the optimal average reward associated to this reward function is a constant gG : " #! n X 1 p Ey; x FG Xt ; At ; y gG sup lim sup p n!y n 1 t0

9:5

and the corresponding AROE is satis®ed: " gG hG x sup

a A Ax

FG x; a

X y

where the function hG : S ! R satis®es

# px; y a; yhG y ;

9:6

Adaptive control under the LFC

93

jhG j a Ly zLy x;

9:7

see Remark 2.1(iii). Lemma 9.3. Let y A Y and e > 0 be ®xed. With the notation in (9.4)±(9.7), there exists a ®nite set G H S, possibly depending on y, such that gG < e: Proof. Let fGn g be a sequence of ®nite subsets of S increasing to S, and observe that, since the discrepancy function is nonnegative, (9.4) and (9.5) yield that fgGn g is a decreasing sequence of nonnegative numbers. Set g lim gGn : n!y

It will be shown that g 0, equality that yields that gGn < e for n large enough, establishing the lemma. For each n A N, consider the optimality equation " gGn hGn x sup

FGn x; a

a A Ax

X y

# px; y a; yhGn y ;

x A S:

Using (9.7) pick a sequence fnk g of positive integers converging to y such that limk!y hGnk w : hw exists for every w A S. Replacing n by nk in the last displayed equality and taking limit as k % y, it follows that " g hx lim sup

k!y a A Ax

X y

# px; y a; yhGnk y ;

x A S;

observe that, since the sets Gn form a sequence increasing to S, for a given state x there exists N A N such that x A Gn for n > N, so that FGn x; ; y 0. On the other hand, form (9.3), (9.7) and Assumption 2.2, the following convergence can be established along the lines used to prove Lemma 7.1(ii): " lim sup

k!y a A Ax

X y

#

"

px; y a; yhGnk y sup

a A Ax

X y

# px; y a; yh y ;

so that " g hx sup

a A Ax

X y

# px; y a; yh y ;

x A S:

Thus, the pair g; h satis®es the AROE associated to the null reward function, and then Lemma 2.1(v) yields that g 0. r Proof of Theorem 8.2. Let y A Y, x A S and e > 0 be arbitrary but ®xed, and select the ®nite set G H S as in Lemma 9.3. In this case,

94

R. Cavazos-Cadena

" # n X 1 p Ey; x lim sup FXt ; At ; y n!y n 1 t0 " n X 1 lim sup Ey;p x FXt ; At ; yI Xt A G n!y n 1 t0 # FXt ; At ; yI Xt B G " # n X 1 p Ey; x FXt ; At ; yI Xt A G a lim sup n!y n 1 t0 " # n X 1 p Ey; x lim sup FXt ; At ; yI Xt B G n!y n 1 t0 " # n X 1 p Ey; x FG Xt ; At ; y lim sup n!y n 1 t0 0 is arbitrary, it follows that " # n X 1 Ep FXt ; At ; y 0; x A S; y A Y; lim n!y n 1 y; x t0 establishing the desired conclusion.

r

10 An example This section presents an example for which the NVI policy is adaptive optimal. Example 10.1. Consider a service station with three servers and customers arriving according to a Poisson stream fxt g, with mean y A Y 0; 1=2; it is supposed that each server can complete a job in a given unit of time. Let Xt be the number of customers waiting for service at time t A N. When Xt x > 0 is observed, the controller must decide the number of servers At a that will be assigned to attend the customers, where a b 1 and a a minf3; xg. In this case Xt1 , the number of customers waiting for service at time t 1, will be given by Xt1 Xt

A t xt ;

10:1

whereas if Xt 0, the controller just wait for the customers that will arrive in the interval t; t 1, so that Xt1 xt . The reward function in this example will be taken as

Adaptive control under the LFC

Rx; a

d0 x

d1 a

95

d2 ;

10:2

where d0 measures the unit cost of the waiting time of the customers, d1 is the running service cost, and d2 is the ®xed cost for having the three servers available; here, di b 0 for i 1; 2; 3. This example can be modeled by a family of MDPs fMy gy A Y , where Y 0; 1=2. The components of My hS; A; fAxg; R; f px; y a; ygi are given as follows: S N;

A f0; 1; 2; 3g

and for each x A S, A0 f0g;

A1 f1g;

A2 f1; 2g;

and

Ax f1; 2; 3g;

x b 3; the reward function is determined by (10.2), while using (10.1), the transition lay is given by the following expression: For each x A N, a A Ax and y A 0; 1=2, qx; y a; y e

y

y y x a ; y x a!

y A N;

and

ybx

a:

10:3

It will be shown that Assumptions 2.1±2.3 are satis®ed in this example, so that the NVI policy is adaptiv optimal. Since the action space A is ®nite, (10.2) and (10.3) immediately yield that Assumption P n 1 2.1 occurs. On the other hand, xt converges to Ey x1 y with by the strong law of large numbers, n 1 t0 probability one, so that setting y^0 0 and ^ n 1 yh n

n X

xt

xt

1

at 1 ;

n 1; 2; 3; . . . ; hn A Hn ;

t1

Assumption 2.3 is satis®ed for this sequence of estimators; see (10.1). The veri®cation of Assumption 2.2 uses the following technical result which is obtained by direct calculations. Proposition 10.1. Let x be a random variable with Poisson distribution with parameter y A 0; y, and set m1 y Ey x y;

my Ey x 2 y y 2 ;

m3 y Ey x 3 y 3 3y 2 y: For each x A Nnf0g assertions (i) and (ii) below hold. (i) x 2 (ii) x 3 1

Ey x 1 x 2 2x Ey x 1 x 3 3x m3 y.

11 m1 y 1 m2 y. 1 2 1 m1 y 3x 11

m2 y

This proposition will be used to show that Assumption 2.2 is satis®ed in Example 10.1. Set z 0 and de®ne l2 : N ! R by

96

R. Cavazos-Cadena

l2 x 8Dx 2 e 1=2 ;

x A N;

10:4

where D d0 3d1 d2 1:

10:5

Proposition 10.2. Setting z 0 and l l2 , Assumption 2.2 is satis®ed in Example 10.1.

.

Proof. Let x A Nnf0g be ®xed. Since l2 is increasing, (10.3) yields that for every a A Ax X

qx; y a; yl2 y Ey l2 x

a x1 I x

a x1 0 0

y00

a Ey l2 x

a x1 a Ey l2 x

1 x1

Thus, for each y A Y 0; 1=2, (10.4) and Proposition 10.1 together yield l2 x

X

qx; y a; yl2 y b l2 x

Ey l2 x

1 x1

Ex

1 x1 2

y00

8Dx 2 8D2x

11

m1 y 1

m2 y

Since y A 0; 1=2, it follows that m1 y a 1=2 and m2 y a 3=4, so that l2 x

X y0z

qx; y a; yl2 y b 8D x

1 1 4

D8x

6 b Dx 1;

using (10.2) and (10.5) this implies that l2 x b

X

qx; y a; yl2 y jRx; aj 1;

x A Nnf0g; a A Ax;

y00

y A 0; 1=2:

10:6

Next, observe that for every y A 0; 1=2, l2 0

X

q0; y 0; yl2 y 8De 1=2

Ey l2 x1 I x1 0 0

y00

8DEy x12 I x1 00

8De 1=2 8De 1=2

y

b 8D1

m2 y b 2D

8Dm2 y

8De 1=2 Py x1 00

Adaptive control under the LFC

97

where the last inequality used that y A 0; 1=2, so that m2 y a 3=4. Then, (10.2) and (10.5) yield that X q0; y 0; yl2 y jR0; 0j 1; y A Y; l2 0 b y00

together with (10.6), this inequality shows that part (i) of Assumption 2.2 occurs. P Observe that for each x A N and a A Ax, y 7! y qx; y a; yl2 y Ey l2 x a x1 is a polynomial function of y; since the action set is ®nite, part (ii) in Assumption 2.2 certainly holds. Set l3 x 2x 3 12x 2 13e 1=2 . With this notation, straightforward calculations using that l3 is increasing and Proposition 10.1 yield that for each positive integer x; a A Ax and y A 0; 1=2, X qx; y a : yl3 y l3 x Ey l3 x a x1 I l3 x a x1 00 l3 x

. .

y00

b l3 x

Ey l3 x

a x1

b l3 x

Ey l3 x

1 x1

b 23x

1 2 1

m3 y

m2 y

121

3 3 4

1 2

b 3x

m1 y 1

where it was used that m1 y a 1=2, m3 y a 11=8 and m2 y a 3=4 for every y A 0; 1=2. From the last inequality, it is not di½cult to see that for every positive integer x, X qx; y a : yl3 y b x 2 1; a A Ax; y A 0; 1=2: 10:7 l3 x y00

On the other hand, observe that for each y A 0; 1=2 X l3 0 qx; y 0 : yl3 y l3 0 Ey l3 x1 I x1 0 0 y00

Ey 2x13 12x12 13e 1=2 I x1 0 0

13e 1=2 13e 1=2 b 13

y

11 4

2m3 y

12m2 y

9;

recall that y A 0; 1=2, so that m3 y a 11=8 and m2 y a 3=4. Therefore X l3 0 qx; y 0; yl3 y b 1; y00

98

R. Cavazos-Cadena

since A0 f0g, it follows that (10.7) also holds for x 0. To conclude observe that for each stationary policy f and y A 0;1=2, (10.7) yields that the inequality x 2 1 Ey;f x l3 X1 I T > 1 a l3 x is valid for every state x, and by induction it follows that n X t0

Ey;f x Xt2 1I T > t Ey;f x l3 Xn1 I T > n 1 a l3 x; n A N; y A 0; 1=2:

P f 2 Taking limit as n goes to y, it follows that the series y t0 Ey; x Xt 1I T > t is convergent, for every state x and y A 0; 1=2, so that limt!y Ey;f x Xt2 1I T > t 0. Therefore, by (10.4), lim Ey;f x l2 Xt I T > t 0;

t!y

i.e., part (iii) in Assumption 2.2 is satis®ed.

r

11 Conclusion This work considered average-reward MDPs whose reward function and transition law depend on an unknown parameter. Under the mild continuitycompactness conditions in Assumption 2.1, and the adaptive version of the Lyapunov function condition in Assumption 2.2, an implementation of the non stationary value iteration technique was analyzed. This procedure was not applied to the original model, but to a suitably modi®ed family of MDPs introduced in Section 3. Given a sequence of parameters converging to a target value, the transformation allowed to prove that the di¨erential rewards and the relative value functions, obtained from the approximation scheme as described in De®nition 4.2, converge to the solution of the AROE of the modi®ed model corresponding to the limit parameter; see Theorem 4.1. Using this result, it was shown that the NVI policy in Section 8 is adaptive optimal; the argument used in this direction relayed on the existence of a Lyapunov function for the discrepancy function; see Lemma 9.2. As already noted, another adaptive policy p 0 , referred to as the PEC policy, is frequently used, and its construction is as follows: At each time t A N the es^ t is computed, and an optimal stationary policy f ^ corresponding timate yh yht 0 to Myh ^ t as in Lemma 2.1 is determined. In this case, p prescribes the applica0 tion of action At fyh ^ t Xt . The adaptive optimality of the PEC policy p can be obtained along the lines used to prove Theorem 8.1. However, there is an important practical di¨erence between the PEC and NVI policies, namely, the usage of p 0 requires either (a) the a priori knowledge of fyh ^ , or (b) to solve the ^ t . Assuming that (a) holds is, in tgeneral, unrealistic, AROE (2.4) with y yh whereas (b) requires to implement a method for solving the an AROE at each time t A N, perhaps the VI scheme itself (without unknown parameters). Since this computations must be done at each decision time, the application of the

Adaptive control under the LFC

99

PEC policy p 0 implies a heavy computational burden. In contrast, to use the NVI policy just one iteration of the VI scheme is required at each decision time. Finally, extending the results in this note to more general contexts, as MDPs with Borel state space, or to combine Theorem 4.1 with non parametric estimates to obtain optimal adaptive policies seem to be interesting problems; research on these directions is currently in progress. References Cavazos-Cadena R, HernaÂndez-Lerma O (1991) Recursive adaptive control of Markov decision processes with the average reward criterion. Journal of Applied Mathematics and Optimization 23:193±207 Cavazos-Cadena R, HernaÂndez-Lerma O (1992) Equivalence of Lyapunov stability criteria in a class of Markov decision processes. Journal of Applied Mathematics and Optimization 26:113± 137 Cavazos-Cadena R, FernaÂndez-Gaucherand E (1996) Value iteration in a class of average controlled Markov chains with unbounded costs: necessary and su½cient conditions for pointwise convergence. Journal of Applied Probability 35:986±1002 Cavazos-Cadena R (1998) A pause control approach to the value iteration scheme in average Markov decision processes. Systems and Control Letters 33:209±219 Federgruen A, Schweitzer PJ (1981) Nonstationary Markov decision problems with converging parameters. Journal of Optimization Theory and Applications 34:207±241 Foster FG (1953) On the stochastic processes associated with certain queueing systems. Annals of Mathematical Statistics 24:355±360 HernaÂndez-Lerma O, Marcus SI (1985) Adaptive control of discounted Markov decision chains. Journal of Optimization Theory and Applications 46:227±235 Hinderer K (1970) Foundations of non-stationary dynamic programming with discrete time parameter. Springer-Verlag, New York HernaÂndez-Lerma O (1988) Adaptive Markov control processes. Springer-Verlag, New York Hordjik A (1977) Dynamic programming and Markov potential theory. Mathematical Centre Tract No. 51, Mathematisch Centrum, Amsterdam Mandl P (1974) Estimation and control in Markov chains. Advances in Applied Probability 6:40± 60 MinjaÂrez-Sosa JA (1999) Nonparametric adaptive control for discrete-time Markov processes with unbounded cost under the average criterion. Applicationes Mathematicae 26:267±280 Munkres J (1975) Topology: a ®rst course. Prentice-Hall, Englewood Cli¨s, New Jersey Puterman M (1994) Markov decision processes. Wiley, New York Royden HL (1968) Real analysis. Macmillan, London Schweitzer PJ (1971) Iterative solution of the functional equation for undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications 34:495±501 Sennott LI (1999) Stochastic dynamic programming and the control of queueing systems. Wiley, New York Thomas LC (1965) Conectedness conditions for denumerable state Markov decision processes. In: Hartley R, Thomas LC, White DJ (Eds.), Recent Developments in Markov Decision Processes, Academic Press, New York, pp. 181±204

Adaptive control of average Markov decision chains under the Lyapunov stability condition

Recommend Documents