Journal of the Operational Research Society (2001) 52, 1402–1407
© 2001 Operational Research Society Ltd. All rights reserved. 0160-5682/01 $15.00 www.palgrave-journals.com/jors
Variance reduction for Markov chain processes using state space evaluation for control variates

FA Dahl*

Norwegian Defence Research Establishment, Kjeller, Norway

We develop a method for reducing variance in Monte Carlo simulation of Markov chain processes, based on extracting accurate control variates from state space evaluation functions. An example is given in the form of a simple combat model, where the net variance reduction (adjusted for additional calculation) is larger than a factor of 80. We also indicate how our algorithm may be applied to discrete event simulations and system dynamic models with discrete random events.

Keywords: simulation; Markov chains; neural net training; variance reduction

*Correspondence: FA Dahl, Division for System Analysis, Norwegian Defence Research Establishment, PO Box 25, NO-2027 Kjeller, Norway.
Introduction

Stochastic modelling is an important class of OR techniques. Such models quickly become too complex to allow analytical solutions, and one is reduced to running Monte Carlo (MC) computer simulations. The basic idea of MC simulation is to produce a large number of independent runs of the model, and take the average output as the estimate of the model's expected output. By the Lindeberg–Levy central limit theorem [1], such simulations have the general property that the standard deviation of the estimate is inversely proportional to the square root of the number of simulation runs. Therefore one must increase the number of runs by a factor of 100 to reduce the error by a factor of 10. This slow rate of convergence is the reason why we are interested in techniques that can reduce the variance of MC simulations.

We formulate our algorithm for Markov processes with discrete time and space (Markov chains). Markov chain simulation is an established technique in military OR [2], and our application is a simplified combat model. Our method requires that the simulation model terminates with a numeric output, and that our problem is to estimate the expected output value. The output may of course be multidimensional, in which case we apply our method to each component.

Variance reduction in simulation

In an MC simulation one typically uses a sequence of basic pseudo-random variables, uniformly distributed from 0 to 1, which can be viewed as input to the otherwise deterministic model. The model can therefore be viewed as a function
mapping the vector of basic variables to an outcome. This function may include any kind of algorithm that can be implemented on a computer.

We refer to Wilson [3] and Nelson [4] for introductions to variance reduction in general. There are essentially three different approaches one can take to variance reduction: antithetic variates, conditional expectancies and control variates. To simplify the explanation, we like to think of high output values as advantageous.

The idea of antithetic variates is to manipulate the basic variables so that good and bad luck are evenly distributed between the different simulation instances. One standard implementation of this idea is to run the simulations in pairs. The first run of a pair is simulated without any modifications, and in the second, each basic variable is flipped over to its mirror image about 1/2.

When using conditional expectancies, one utilises the fact that, when some of the basic variables have been assigned values, the expected output value of the model may be calculated exactly. This reduces the variance that would have resulted from those variables.

The third approach to variance reduction is that of control variates, see Nelson [5]. To use this method one needs an alternative—often simplified—model $\hat{M}$, which uses (a subset of) the same set of basic variables as the actual simulation model. The alternative model must have a known mean m, and should otherwise be as similar as possible to the actual model. In the work of Myers [6] the model $\hat{M}$ is referred to as a response surface, while Tew [7] uses the term 'metamodel'. For each simulation run, the actual and alternative models are evaluated with the same value assignment for the basic variables, and the actual model's output is modified by subtracting $(\hat{M} - m)$ from it. This does not introduce any bias in the estimate, because $(\hat{M} - m)$ by construction has mean 0, and if $\hat{M}$ is similar to the actual model, it will reduce the variance. One can see the correction term $(\hat{M} - m)$ as an unbiased estimate of how lucky the drawing of the basic variables was.

A common way of constructing control variates is by building linear models, which estimate the expected model outcome from the basic variables (or from simple functions of them). Regression is then used to optimise the model. In the present paper we give an algorithm that works by utilising knowledge about a simulated Markov chain. This knowledge has the form of estimates of expected outcomes for the different non-terminal states of the process. From this knowledge we calculate unbiased estimates of how 'lucky' each state transition is, and subtract the estimated accumulated luck for the entire simulation from the actual outcome.

Markov chains

We define our algorithm for time-homogeneous Markov processes that are discrete in time and space, ie Markov chains. Let S be the finite state space of the process, and denote the sequence of states visited by the process by $\{X_k\}$, where the initial state $X_0 \in S$ is given. Being Markov implies that the process is without memory. This means that, if $X_k = x$, the process has no memory of states prior to $X_k$ apart from the information encoded in the current state x. This is equivalent to independence of history, in the sense that, if $i < j < k$, then $P(X_k = y \mid X_j = x)$ is independent of $X_i$. Due to this Markov property, there exists a unique probability of transition from state x to y in one step: $P_{xy} = P(X_{k+1} = y \mid X_k = x)$.

As a part of the problem definition, a set of terminal states $D \subset S$ is given, so that the process X stops the first time it enters D. We assume that X reaches D with probability 1, and we let t denote the random time when this happens. A function g is defined on D, and our task is to find an estimate of the expected g value when X stops: $E[g(X_t)]$. This formalism also allows for deterministic stopping times: if the process is to be stopped after T steps, one redefines the state space by adding the discrete time as a state dimension, transforming the state space from S to $S \times \{1, \ldots, T\}$, which is still finite.

The basic MC estimator is $(1/N) \sum_{i=1}^{N} g(X^i_{t_i})$, where the $X^i$ are independent simulations of X. The following simple algorithm implements this:

sum ← 0
repeat N times
    k ← 0
    X_0 ← x
    while (X_k ∉ D)
        ⟨draw X_{k+1} according to the distribution P_{X_k, X_{k+1}}⟩
        k ← k + 1
    sum ← sum + g(X_k)
return (sum/N)
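For concreteness, a minimal Python sketch of this basic estimator might read as follows (our own illustration, not the paper's code; the sparse representation of the transition probabilities as a function P(x) returning (state, probability) pairs is a hypothetical choice):

```python
import random

def mc_estimate(x0, P, D, g, N, rng=random.Random(0)):
    """Basic Monte Carlo estimate of E[g(X_t)].

    P(x) returns a list of (next_state, probability) pairs,
    D is the set of terminal states, g maps terminal states to payoffs.
    """
    total = 0.0
    for _ in range(N):
        x = x0
        while x not in D:
            # Draw the next state according to the transition distribution.
            u, cum = rng.random(), 0.0
            for y, p in P(x):
                cum += p
                if u < cum:
                    x = y
                    break
            else:                 # guard against floating-point round-off
                x = P(x)[-1][0]
        total += g(x)
    return total / N
```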
As we mentioned in the introduction, the problem with this simple MC algorithm is that the standard deviation of the random error is inversely proportional to $\sqrt{N}$. Recall that our task is to reduce the variance of the MC estimate by constructing a control variate.

Note that, because S is finite, it is possible to enumerate it, and represent the process state as a number: $X_k \in \{1, 2, 3, \ldots, n\}$, where n is the size of S. The process can be characterised by its transition matrix $M \in \mathbb{R}^{n \times n}$, where the matrix entry $m_{ij} = P_{ij} = P(X_{k+1} = j \mid X_k = i)$ gives the probability of jumping from state i to state j. While transition matrices are extremely useful for deriving general properties of Markov chains, they are seldom useful for non-trivial applications. This is because the set of possible transitions usually is small relative to $n^2$, so that storing only the non-zero transition probabilities is more efficient. We therefore stick to the notation of the state space S with typical elements x and y.

The algorithm

Let $f: S \to \mathbb{R}$ be a function that assigns estimated expected end state payoffs to process states: $f(x) \approx E[g(X_t) \mid X_k = x]$. (If we enumerate the states and take $\{1, 2, \ldots, n\}$ as the state space, the function f takes the form of a vector in $\mathbb{R}^n$.) The function f represents our knowledge about the process X, which we will use for extracting control variates.

The idea of our algorithm is simple: for each random transition, the function f is applied to each state that may result from the draw, and the expected f value after the transition is calculated. The difference between the f value of the state actually drawn and the expected f value gives the 'luck' of the draw. These 'luck estimates' are summed over the simulation, and subtracted from the g value of the terminal state.

To formalise this, we need to make some definitions. The generator A of the process is an operator that maps the function f to a function A(f), so that $A(f): S \to \mathbb{R}$. This means that, if $x \in S$, then $A(f)(x) \in \mathbb{R}$. Assume $X_k = x$, and consider $f(X_k) = f(x)$. Then $A(f)(x)$ is defined as the expected change to $f(X_k)$ after one time step:

$$A(f)(x) = E[f(X_{k+1}) - f(X_k) \mid X_k = x]. \qquad (1)$$

We can express the generator using the state transition probabilities:

$$A(f)(x) = \sum_{y \in S} P_{xy} f(y) - f(x).$$

It can be shown that a function f satisfies $f(x) = E[g(X_t) \mid X_0 = x]$ for all x if, and only if,

$$A(f)(x) = 0 \text{ for all } x, \text{ and } f = g \text{ on } D. \qquad (2)$$

If one enumerates the states, (2) gives a linear system of equations determining f. Therefore, if the number of states is not too large, f can be computed directly from (2) with standard matrix algorithms. Otherwise, reinforcement learning algorithms like temporal difference learning [8] (TD(λ) learning) can produce approximate solutions by iteratively minimising the average error in expressions related to (2). We are interested in the latter case, where the system of equations (2) is too large to solve, and we have access to some approximate solution f. We will show how the insight represented in f can be utilised in a variance reduction algorithm.
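To make the generator concrete, here is a small Python sketch (our own illustration, not from the paper) that evaluates A(f)(x) from a sparse transition representation, and solves the linear system (2) exactly for a chain small enough to enumerate. The representation of f as a dict and of P(x) as (state, probability) pairs is a hypothetical choice:

```python
import numpy as np

def generator(P, f, x):
    """A(f)(x) = sum_y P_xy f(y) - f(x), with P(x) a list of (y, prob) pairs."""
    return sum(p * f[y] for y, p in P(x)) - f[x]

def solve_expected_payoff(states, P, D, g):
    """Solve (2) as a linear system: A(f)(x) = 0 off D, f = g on D."""
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    coeff, rhs = np.zeros((n, n)), np.zeros(n)
    for x in states:
        i = idx[x]
        if x in D:
            coeff[i, i], rhs[i] = 1.0, g(x)   # boundary condition f = g on D
        else:
            coeff[i, i] = -1.0                # -f(x) + sum_y P_xy f(y) = 0
            for y, p in P(x):
                coeff[i, idx[y]] += p
    return dict(zip(states, np.linalg.solve(coeff, rhs)))
```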
Define a stochastic process $\{L_k\}$ by

$$L_{k+1} = L_k + f(X_{k+1}) - f(X_k) - A(f)(X_k), \quad L_0 = 0, \qquad (3)$$

or equivalently

$$L_k = f(X_k) - f(X_0) - \sum_{i=0}^{k-1} A(f)(X_i).$$

By identity (1), L is a martingale [9]. Simply put, a martingale is a stochastic process with the property that its expected value at a later time is equal to its current value: $E[X_k \mid X_i = x] = x$, where $i < k$. Our control variate is $L_t$:

$$\tilde{g} = g(X_t) - L_t.$$

Our variance reduced Monte Carlo algorithm, with initial value x and N iterations, is given by

sum ← 0
repeat N times
    s ← 0
    k ← 0
    X_0 ← x
    while (X_k ∉ D)
        s ← s + A(f)(X_k)
        ⟨draw X_{k+1} according to the distribution P_{X_k, X_{k+1}}⟩
        k ← k + 1
    g̃ ← g(X_k) − f(X_k) + f(X_0) + s
    sum ← sum + g̃
return (sum/N)

We give a direct proof of the following proposition, which states that our algorithm gives unbiased estimates, although it may be regarded as trivial in light of general martingale theory.

Proposition  $E[\tilde{g}] = E[g(X_t)]$.

Proof  The process L is defined by (3) for $k \le t$. For $k > t$ we let $L_k = L_{k-1}$, thereby defining $L_k$ for all $k \ge 0$. Let $e_k = L_{k+1} - L_k$. For all $x \in S$ such that $P(X_k = x) > 0$, we have $E[e_k \mid X_k = x] = 0$. Therefore, $E[e_k] = \sum_{x \in S: P(X_k = x) > 0} E[e_k \mid X_k = x] P(X_k = x) = 0$. It follows that $E[L_k] = E[\sum_{i=0}^{k-1} e_i] = 0$ for all $k \ge 0$.

Because we assume $P(t = \infty) = 0$, it follows that $\lim_{k \to \infty} L_k = L_t$ almost surely. We will use the dominated convergence theorem [1] to show that $E[L_t] = 0$. Assume $X_k = x$. There must exist a finite sequence of state transitions taking X from x to D with positive probability, as otherwise $P(t = \infty) > 0$. Because the set of states is finite, there must exist an N and $\epsilon > 0$ such that $P(X_{k+N} \in D) > \epsilon$. Therefore a K exists so that $P(t > k) < \exp(-k/K)$. Because the set of states is finite, and $e_k$ only depends on $X_k$ and $X_{k+1}$, there exists an F so that $|e_k| \le F$ for all k. Define $\hat{L} = \sup_{k \ge 0} |L_k|$, and let $I_k$ be the indicator variable of the event $t = k$ ($I_k = 1$ if $t = k$; 0 otherwise). Because $P(t = \infty) = 0$, we have

$$E[\hat{L}] = E\left[\sum_{k=0}^{\infty} I_k \hat{L}\right] \le E\left[\sum_{k=0}^{\infty} I_k \sum_{i=0}^{k-1} |e_i|\right] \le \sum_{k=0}^{\infty} \left(\exp(-k/K)\, kF\right) < \infty.$$

By construction, $\hat{L}$ dominates the sequence $\{L_k\}$, so dominated convergence gives $E[L_t] = 0$, and the result follows. □

From the preceding analysis, the correction term $(f(X_k) - f(X_0) - s)$ has expected value 0, so that each $\tilde{g}$ is an unbiased estimate of $E[g(X_t)]$.
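The following Python sketch implements the variance reduced estimator above (our own illustration, not the paper's code; the sparse transition representation P(x) and the callable evaluation f are hypothetical choices):

```python
import random

def vr_mc_estimate(x0, P, D, g, f, N, rng=random.Random(0)):
    """Variance reduced MC estimate of E[g(X_t)], given an approximate
    state evaluation f. P(x) yields (next_state, probability) pairs."""
    def A_f(x):
        # Generator: expected one-step change of f from state x.
        return sum(p * f(y) for y, p in P(x)) - f(x)

    total = 0.0
    for _ in range(N):
        x, s = x0, 0.0
        while x not in D:
            s += A_f(x)                        # accumulate expected drift
            u, cum = rng.random(), 0.0
            for y, p in P(x):                  # draw the next state
                cum += p
                if u < cum:
                    x = y
                    break
            else:
                x = P(x)[-1][0]
        total += g(x) - f(x) + f(x0) + s       # g~ = g(X_t) - L_t
    return total / N
```

The inner loop in A_f is where the extra work lies: f must be evaluated at every state reachable in one step from each visited state, which is consistent with the slowdown factors reported for the combat application below.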
Combat application

We now give an application to a simple combat model that is similar to models studied by Jaiswal et al [2]. There are two sides, denoted Blue and Red. At time t, b(t) and r(t) give the number of live weapon systems for the two sides. We assume that each live weapon system is able to engage one opposing system. During a short time interval dt, each Blue weapon system has a probability $p_b\,dt$ of hitting and destroying a Red system. Similarly, $p_r$ gives Red's effect on Blue systems. Combat lasts until one side has lost all its weapon systems, making the other side the winner.

The model has continuous time, apparently requiring a discretisation of the time dimension, but this can be avoided by the following simplification procedure. Observe that weapon systems of the same side are indistinguishable. We can therefore regard all live Blue weapon systems as a single system, having probability $b(t)p_b\,dt$ of killing a Red system within the next dt. As long as we are only interested in the probability of a side winning the battle, we can eliminate the time dimension. Because the relative likelihood of Blue and Red scoring the next hit is constant until a hit occurs, the probability that Blue delivers the next hit is given by $b(t)p_b / (r(t)p_r + b(t)p_b)$. Let $b_k$ and $r_k$ represent the number of live Blue and Red systems after a total of k systems have been killed. Then our combat process is a Markov chain with transition probabilities

$$P[(b,r) \to (b, r-1)] = \frac{b p_b}{r p_r + b p_b}, \qquad P[(b,r) \to (b-1, r)] = \frac{r p_r}{r p_r + b p_b},$$

giving this generator:

$$A(f)(b,r) = \frac{b p_b}{r p_r + b p_b}\, f(b, r-1) + \frac{r p_r}{r p_r + b p_b}\, f(b-1, r) - f(b,r).$$
The set of terminal states is given by $D = \{(b,r): br = 0\}$, and if we see the process from Blue's perspective, the payoff function on D is

$$g(b,r) = \begin{cases} 1 & \text{if } r = 0 \\ 0 & \text{if } b = 0 \end{cases}$$

As an example, we let $b_0 = 20$ and $r_0 = 10$. Also let $p_r = 4p_b$. This means that Blue has twice as many systems as Red, while the Red systems are four times as effective as the Blue ones. According to Lanchester's square law [10,11], this should give an even battle. Lanchester's law is formulated for differential equations, and it might be interesting to test its validity for our discrete Markov model. The problem is easily solved with a computer using equation (2), so it may be classified as a toy problem. Still, the problem serves to illustrate the algorithm. We have implemented two different kinds of state evaluations, discussed in the following subsections. The exact solution is that Blue wins with probability 0.47883, which fits the predictions from Lanchester's square law well. The variance is 0.24955.
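The chain is small enough that (2) can be checked directly by dynamic programming over the states (b, r). A minimal sketch (our own; it fixes $p_b = 1$, since only the ratio $p_r/p_b$ affects the win probability):

```python
from functools import lru_cache

PB, PR = 1.0, 4.0   # only the ratio p_r / p_b matters for the win probability

@lru_cache(maxsize=None)
def blue_wins(b, r):
    """Exact probability that Blue wins from state (b, r), solving (2)."""
    if r == 0:
        return 1.0   # terminal payoff g = 1 when Red is wiped out
    if b == 0:
        return 0.0   # g = 0 when Blue is wiped out
    p_blue_hit = b * PB / (r * PR + b * PB)
    return (p_blue_hit * blue_wins(b, r - 1)
            + (1 - p_blue_hit) * blue_wins(b - 1, r))

print(round(blue_wins(20, 10), 5))   # ~0.47883
```

Since the outcome seen from Blue's perspective is Bernoulli, the quoted variance 0.24955 is simply $p(1-p)$ with $p = 0.47883$.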
Heuristic state evaluation

One approach to constructing a state evaluation function is to use common sense together with experimentation. From the Lanchester law, it is reasonable to demand that f be increasing as a function of b and return 0.5 if $b = 2r$. A first guess at a function may be $f(b,r) = 0.5 + a(b - 2r)$, for some constant a. However, f approximates probabilities, so we truncate the calculated value at 0 and 1. After some experimentation, we set $a = 0.1$, which gave a variance of 0.0165 and a variance reduction factor of approximately 15. The variance reduction calculations slow the program down by a factor close to 3, so the net variance reduction factor is about 5.

Note that in the construction of this f, we did not include any knowledge derived from the calculated solution. We only used a theoretical result from a related model, and some quick experimenting. A good thing about the method is that the results speak for themselves, in the sense that we can sample the variance, and search for a function that makes it low. Recall that, from our proposition, the algorithm gives unbiased estimates regardless of how well f approximates the true expected outcome for process states.

State evaluation with self-trained neural nets

We have also implemented a more automatic state evaluation function, in the form of a small neural net (NN). The NN takes $b_k$ and $r_k$ as input, and gives as output an estimate of the probability that Blue eventually wins. The NN is a standard multi-layer perceptron with one hidden layer of four nodes, updated with back-propagation of errors. For an introduction to NNs, we refer to Hassoun [12]. We train the NN by the reinforcement learning algorithm TD(λ) [8]. This algorithm essentially modifies the NN's internal parameters, called weights, using feedback collected from sample paths of the process. We have found it useful to apply our variance reduction method within the learning algorithm as well, thereby improving the quality of the NN's feedback. However, the algorithmic details of this go beyond the scope of the present paper.

The variance of the modified estimator $\tilde{g}$ has been sampled to 0.00015, which gives a variance reduction factor of approximately 1660. However, the variance reduction calculations slow the simulation down by a factor of 20, so the net gain from the variance reduction is approximately a factor of 83. The NN training was performed once and for all, before the variance reduction tests, so it is a constant cost that does not affect the variance reduction factor. In the quoted experiment, approximately an hour of computer time (on a 150 MHz PC) was spent on training. With respect to runtime efficiency, this time should be regarded as a part of the time it takes to develop the program (except that the programming is performed by the computer itself).
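As a usage example, the heuristic evaluator of the first subsection can be dropped straight into the variance reduced estimator sketched earlier (again our own code, reusing the hypothetical vr_mc_estimate from that sketch):

```python
PB, PR = 1.0, 4.0                       # as in the example above

def f_heuristic(state, a=0.1):
    """Heuristic win-probability estimate f(b, r) = 0.5 + a(b - 2r),
    truncated to [0, 1]."""
    b, r = state
    return min(1.0, max(0.0, 0.5 + a * (b - 2 * r)))

def transitions(state):
    """Sparse transition probabilities of the combat chain."""
    b, r = state
    p_blue = b * PB / (r * PR + b * PB)
    return [((b, r - 1), p_blue), ((b - 1, r), 1.0 - p_blue)]

class Terminal:
    """Membership test for D = {(b, r): br = 0}."""
    def __contains__(self, state):
        return 0 in state

estimate = vr_mc_estimate(x0=(20, 10), P=transitions, D=Terminal(),
                          g=lambda s: 1.0 if s[1] == 0 else 0.0,
                          f=f_heuristic, N=100_000)
```

Comparing the sample variance of the corrected outputs with that of the plain estimator then measures the variance reduction factor directly.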
Scope of our algorithm

Our algorithm is formulated for Markov chains, which we believe appears more restrictive than it actually is. In the following we attempt to clarify the scope of our algorithm.

The algorithm is defined for stochastic processes, which implies that there is a time dimension in the model. Note, however, that the process time only serves as a means of ordering the random state transitions, and does not necessarily correspond to actual time in a physical sense. If there is no time dimension in the model, one can still argue that there is a single random state transition, and apply our algorithm. However, this is rather pointless, because the algorithm then demands that all possible end states of the model be calculated (as a part of the calculation of $A(f)(X_0)$), in which case one might just as well calculate $E[g(X_t)] = E[g(X_1)]$ directly, and no MC simulation is called for. The algorithm is therefore not useful unless there is a sequence of random state transitions.
The process is assumed to be Markov, which means that the current state contains sufficient information for the calculation of the probability distribution of the next state transition. If a process X is not Markov, one can always construct a Markov process Z by adding the process history to the state representation: $Z_k = [X_k, X_{k-1}, \ldots, X_0]$. The first component of $Z_k$ clearly has the same distribution as $X_k$, so the question of the 'Markovness' of a process can always be viewed as a question of choosing a large enough state space. If one is able to simulate the process in the first place, the computer program that performs the simulation must be able to calculate the transition probabilities. Therefore, the process representation used by the program must be rich enough to ensure the Markov property. We therefore conclude that the Markov property is trivially satisfied for any process that can be simulated.

We assume that the process is discrete in space and time, but this can be relaxed. If the Markov process is discrete in space, continuous in time and time-homogeneous, one can always transform it to a process with discrete time by the procedure used in our combat example. One simply eliminates the time dimension by observing that the relative probabilities for the possible state transitions are independent of how long it takes before a transition occurs. If the process is discrete in time and has probability densities for the state transitions, the state space S will be continuous. In this case, the generator is given by

$$A(f)(x) = \int_S f(y)\,\varphi_x(y)\,dy - f(x),$$
where $\varphi_x$ is the probability density for the state transition from state x. With this definition, the algorithm is still applicable, although its practical usefulness will depend on how easy it is to evaluate $A(f)(X_k)$. If the Markov process has continuous time and continuous paths in space (ie X is a diffusion), the generator A is a semi-elliptic second order differential operator [9]. In this case, our algorithm is still applicable, and the 'accumulated luck process' is given by

$$L_t = f(X_t) - f(X_0) - \int_0^t A(f)(X_s)\,ds.$$

However, the details of this go beyond the scope of the present paper.

The use of Markov chains gives associations to simple dynamics, where the transition from one state to the next is performed by drawing a new state according to a list of given transition probabilities. In a strict mathematical sense this is correct, as the set of states can be enumerated, and transition probabilities can be calculated once and for all. However, the set of states may be too large to make this feasible in practice. Also, there may be complex deterministic dynamics that control the process between the random draws. From the Markov chain perspective, all such calculations are performed 'in advance', and the outcomes tabulated.

In the following we show how the Markov chain formalism (and thereby our algorithm) can relate to OR methods like system dynamics (SD) and discrete event simulation (DES). An SD model in its purest form consists of a graph where nodes represent numeric entities and edges give interactions between them. The graph indirectly represents a system of ordinary differential equations that can be simulated by numerical methods such as Runge–Kutta schemes. As such, there is no randomness, and no need for variance reduction. Assume, however, that some model parameters are drawn randomly at various points in time. As an example, the model may represent an airline's transport of passengers, and for each day a random set of airports may be inoperable due to weather conditions. The end state evaluation function g might, for example, be the profit made by the airline during a month of traffic. In the simulation model, one would start each simulated day by randomly drawing the set of inoperable airports, whereafter the Runge–Kutta module would calculate the flow of passengers on that day. From the Markov chain perspective, the transition of the model state from one morning to the next is a direct consequence of the random draw of the weather, even though the calculations implied are non-trivial. In this setting, the function f would supply estimates of expected total profits based on model states.

One might also consider modelling this airline traffic problem by DES, where the taking off and landing of airplanes are discrete events. Just as in the SD example above, the model state on one morning would be a function of the model state on the preceding morning and the random weather on that day. There may also be random draws within the simulation of a single day, which would likewise be viewed as state transitions in the Markov chain. Again, the function f would supply estimates of expected profit based on model states. A minimal sketch of this day-step view is given below.

To sum up, Markov chain modelling, and therefore our algorithm, is applicable to any simulation model that features a sequence of discrete random draws, regardless of how deterministic aspects are modelled.
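The promised sketch of the day-step view follows; it is entirely hypothetical (the weather model, profit figures and evaluation function f are our own inventions, chosen only to make the example runnable). The deterministic daily dynamics stand in for the Runge–Kutta or DES module, the weather draw is the Markov transition, and the luck correction is accumulated exactly as in the discrete algorithm:

```python
import random

P_BAD = 0.2        # hypothetical probability that a day has bad weather
N_DAYS = 30

def run_one_day(backlog, bad_weather):
    """Hypothetical deterministic daily dynamics: bad weather grounds
    flights and grows the passenger backlog."""
    return backlog + 40 if bad_weather else max(0, backlog - 100)

def f(day, backlog):
    """Hypothetical state evaluation: estimated end-of-month profit
    from a morning state."""
    return 10_000 - 2.0 * backlog - 80.0 * P_BAD * (N_DAYS - day)

def sample_profit(rng):
    """One variance reduced sample of monthly profit, g~ = g(X_t) - L_t."""
    backlog, luck = 0, 0.0
    for day in range(N_DAYS):
        # Expected f after the draw: average over the two weather outcomes.
        expected_f = (P_BAD * f(day + 1, run_one_day(backlog, True))
                      + (1 - P_BAD) * f(day + 1, run_one_day(backlog, False)))
        backlog = run_one_day(backlog, rng.random() < P_BAD)  # the random draw
        luck += f(day + 1, backlog) - expected_f   # luck of this day's draw
    return f(N_DAYS, backlog) - luck               # here g = f on terminal states
```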
Conclusions

We have given a variance reduction method for Markov chain processes that relies on control variates extracted from approximate state evaluation functions. The algorithm has been shown to work well on a simplified combat model. In this model, a simple handcrafted evaluation function gave a net variance reduction factor of 5 (taking account of the additional variance reducing calculations). We have shown that the state evaluator may also be given as a neural net, trained by TD(λ) learning. This gave a net variance reduction factor of more than 80. We also note that the variance reduction can improve the TD(λ) algorithm itself, although the algorithmic details are not given in this paper. Finally, we have indicated that a wide range of modelling paradigms, such as system dynamics and discrete event simulations, can be viewed as Markov chains (and may therefore benefit from our algorithm), provided they feature sequences of discrete random events.

References

1 Billingsley P (1995). Probability and Measure, 3rd edn. Wiley: New York.
2 Jaiswal NK, Sangeeta Y and Gaur SC (1995). Stochastic modelling of combat with reinforcement. Eur J Opl Res 100: 225–235.
3 Wilson JR (1984). Variance reduction techniques for digital simulation. Am J Math Mngt Sci 4: 277–312.
4 Nelson BL (1987). A perspective on variance reduction in dynamic simulation experiments. Commun Statist B16: 385–225.
5 Nelson BL (1990). Control variate remedies. Opns Res 38: 974–992.
6 Myers RH (1976). Response Surface Methodology. Edwards Brothers: Ann Arbor, MI.
7 Tew JD (1995). Simulation metamodel estimation using a combined correlation-based variance reduction technique for first and higher-order metamodels. Eur J Opl Res 87: 349–367.
8 Kaelbling LP, Littman ML and Moore AW (1996). Reinforcement learning: a survey. J Artif Intell Res 4: 237–285.
9 Øksendal B (1998). Stochastic Differential Equations: An Introduction with Applications, 5th edn. Springer: Berlin.
10 Taylor JG (1983). Lanchester Models of Warfare, vols 1 and 2. Military Applications Section, Operations Research Society of America: Arlington, VA.
11 Roberts DM and Conolly BW (1992). An extension of the Lanchester square law to inhomogeneous forces with an application to force allocation methodology. J Opl Res Soc 43: 741–752.
12 Hassoun MH (1995). Fundamentals of Artificial Neural Networks. MIT Press: Cambridge, MA.
Received August 2000; accepted June 2001 after two revisions