Experimental Economics, 8:325–345 (2005). © 2005 Economic Science Association
Why We Should Not Be Silent About Noise

JOHN D. HEY
Universities of York and Bari; Department of Economics, University of York, Heslington, York YO10 5DD, UK
email: [email protected]

Received July 5, 2004; Revised February 6, 2005; Accepted February 6, 2005
Abstract

There is an odd contradiction about much of the empirical (experimental) literature: The data is analysed using statistical tools which presuppose that there is some noise or randomness in the data, but the source and possible nature of the noise are rarely explicitly discussed. This paper argues that the noise should be brought out into the open, and its nature and implications openly discussed. Whether the statistical analysis involves testing or estimation, the analysis is inevitably built upon some assumed stochastic structure to the noise. Different assumptions justify different analyses, which means that the appropriate type of analysis depends crucially on the stochastic nature of the noise. This paper explores such issues and argues that ignoring the noise can be dangerous.

Keywords: noise, stochastic assumptions, testing, estimation, inferences
JEL Classification: B41, C50, C91, D81
1. Introduction
If one looks at the nature of the empirical analysis reported in economics journals, one notes that it is almost always statistical inference. There is virtually no logical deduction from data. Why is this so? The reason seems to be that the data almost always contains a stochastic component. In everyday language, there is noise in the data. The existence of this noise means that one has to use statistical inference, rather than logical deduction, when analysing the data. The nature of this noise, or, in more formal terms, the stochastic structure of the data, determines the appropriate process of statistical inference that should be carried out on the data. If one makes the wrong assumptions about the stochastic structure of the noise, then one usually makes wrong inferences from the data. The understanding of the nature of the noise is therefore crucial to the issue of drawing correct inferences. This is the key point of this paper. We illustrate it with particular reference to the field of inferences from experimental data—where all too often the assumptions being made about the stochastic structure of the data are considered of secondary importance. In fact, they are of primary importance.

Given that noise is generally pervasive in economic data, it is perhaps somewhat surprising that noise is largely absent in economic theory. The paper begins by discussing this apparent contradiction and what economists typically do about it. Usually it is left to the
econometrician to reconcile these two things—which he or she typically does by adding some noise into the theory, or to the application of the theory to the available data set. We argue that this is perhaps something that should be done by the economist, partly because it seems often to be the case that the noise is something about which the economist can say something—if it disappears with time, for example, it might be attributed to learning in some sense. This naturally leads us on to two issues: first, how we might model the noise; and second, what differences different specifications make to the nature of the statistical inferences that we carry out on the data. We also discuss the type of statistical inferences that we might want to draw—and, particularly, we discuss whether economists should be testing hypotheses or estimating models. I should perhaps reveal at this stage that I have a prejudice against testing—for reasons related to the existence of noise in data. The point is simple: if there is noise in data, and if some hypothesis under test is not exactly true (which no hypothesis ever is), then you can always reject any hypothesis you like at any level of significance you like—as long as you have enough data. What, therefore, does one learn from the fact that some hypothesis has been rejected?

At this point, we are inevitably drawn into a discussion of how we (relatively) evaluate economic theories, given that they very rarely fit any given data set perfectly—once again, because of the existence of noise in the data. Since it follows that no theory exactly explains the data (because of the existence of noise in the data), we are naturally led into a discussion of 'how close' theories are to explaining the data—some are obviously better than others. This then turns into a discussion of the parsimony of competing theories—once again a discussion initiated and justified by the existence of noise in the data. I cannot hope to resolve all the issues involved in this discussion, but I hope to be able to convince the reader that noise is important and should not be ignored.

2. Noise in theory and in practice
Let me begin with a discussion of what I perceive economists are trying to do. My view is that we are trying to explain the behaviour of groups of individuals, with the ultimate objective of trying to predict the behaviour of those individuals (or some other group of individuals) in some other context. The ultimate objective is prediction; the route to that objective is through explanation. Economists use theories to help in this process—the idea being that one can use the theory to make predictions in some context other than that in which the theory has been investigated. Economists use data to evaluate and calibrate theories.

To fix ideas suppose that the following simple Competition is announced.1 Some population of the world is specified and two sets of decision problems specified. Two random samples of people are selected from this population; the first sample is presented with the first set of decision problems in an incentive compatible way, and the second sample is presented with the second set of decision problems, again in an incentive compatible way. The answers of the first sample are published—let us call this The Data. The answers of the second sample are not published until the end of the Competition. Entrants are not allowed to undertake any further sampling of the population. The objective of the Competition is to predict the answers of the second sample. The winner of the Competition will be that
person who predicted most accurately the aggregate2 decisions of that second sample.3 In more sophisticated competitions, some demographic data might also be published. Note crucially that the two samples are different (although possibly overlapping) and that the two sets of decision problems are different (although possibly overlapping).

How would an economist proceed? The economist would enlist the aid of theory. To fix ideas, let me suppose that the decision problems are all risky pairwise choice problems. Here the economist has a whole raft of theories on which to draw. The economist would try and see which theory, or which combination of theories, best fitted The Data, and would then use the best-fitting theory, or the best-fitting combination, to make the predictions about the second sample. There are several ways that this could be done. I will begin with a discussion of situations in which The Data is analysed individual by individual. Later I will discuss the issues involved when the data is pooled across individuals.

It will be helpful to introduce some notation. Let me denote the decision made by individual j from the first sample on decision problem i in the first decision set by x_ij (i varies over 1, ..., I while j varies over 1, ..., J).4 In a pairwise choice decision problem x just takes one of two values. The Data is the set x_ij (i = 1, ..., I; j = 1, ..., J) and, of course, the set of questions posed. Let us presume that there is some underlying Data Generating Process (DGP) which determined the set x_ij (i = 1, ..., I; j = 1, ..., J). The DGP is presumably unknown and unobservable. As a consequence we cannot say whether this underlying DGP is stochastic or deterministic.

Let me suppose to begin with that the economist is searching for a single theory to use in the Competition. If that were Expected Value maximisation (EV), then the prediction task is simple and the economist would not need to use The Data in order to make the prediction—because no parameters are involved in the EV model. However, the economist might use The Data to see how well EV explained the responses of the first sample. It could be the case that EV explains the data perfectly, in which case there would be no further work to do. If EV did not fit exactly, the economist might search for other theories. Most other theories involve parameters, and this is where the interesting problems begin.

Let me write the prediction of theory k on the ith problem in the first decision set as x_i^k(α), where the vector α denotes some parameter vector of the theory. x_i^k(α) could be stochastic, but let me first consider the case when it is deterministic. For most theories of individual decision making this is the case, but there are exceptions (see Chew et al. (1991), Machina (1995) and Hey and Carbone (1995)) where the underlying theory is stochastic. Suppose that theory k exactly explains the behaviour of individual j and suppose that his or her value of the parameter α is α_j^k. Then, if the economist is checking to see whether The Data for this individual is explicable by the theory, the economist will search for the value of α that best explains the data. Call this a_j^k. Presumably the economist will find a_j^k = α_j^k. Because I have supposed that this theory exactly explains the behaviour of this individual, this theory with the correct value of the parameter will fit The Data for this individual exactly and there will be no error. However, this is very rarely the case. Suppose it is not.
Then either this theory does not explain The Data for this individual, or the individual’s behaviour is explained by this theory plus some error. By this I mean that the individual’s intended behaviour is described by that theory, but for various reasons the individual implements
his or her intentions with some error. There is evidence from experiments that this story is valid—we see that when the same individual is confronted with the same decision problem on several occasions, he or she gives different responses on different occasions. If all the theories that the economist is considering are deterministic, and none of them fit The Data exactly for any parameter vector, but we want to use one or other of the theories to describe the behaviour of that individual, then we are forced to make this interpretation. So, let us now assume that theory k does describe the intended behaviour of the individual but that the individual implements the theory with some error. I express this by writing

x_ij = x_i^k(α_j^k) + u_ij^k    (1)
where u_ij^k represents the error made by the jth individual when responding to question i. In standard econometric terminology, these are the disturbance terms in the equation. Again, the parameter α_j^k represents the 'true' (intended) value of the parameter α for that individual. At this stage, the economist will try and fit the theory to The Data for that subject and will try to estimate the true parameter α_j^k. Let us denote again the estimated parameter by a_j^k. The economist will try and find the best-fitting parameter in the equation

x_ij = x_i^k(a_j^k) + e_ij^k    (2)
either by minimising the sum of squared errors Σ_{i=1}^{I} (e_ij^k)^2 or by some other procedure. As I will discuss later, the appropriate procedure depends upon the stochastic structure of the disturbance terms in equation (1). For example, if the errors are not independent of each other, then one should take that into account in the estimation process; or, if the error variance depends upon the nature of the question, so that we have heteroscedasticity, then it would be inefficient if we did not take that into account while we are estimating the parameter vector. The issues here are exactly the same as in any econometric discussion. In equation (2), the e_ij^k are the residuals in the estimated equation. Note the difference between equations (1) and (2). The former is the true DGP while the latter is the estimated DGP. We could call the terms u_ij^k the noise in the individual's behaviour. It is manifested in the estimated equation through the residuals. Let me denote the minimised value of Σ_{i=1}^{I} (e_ij^k)^2 by SSE_j^k. This is a measure of the goodness of fit of the data to the theory. (There are some procedures which lead to different measures of goodness of fit, though I do not want to elaborate on this point here.)

This procedure might be repeated for a set of different preference functionals. For each functional (or theory) k, the economist could find the best-fitting parameter vector a_j^k for individual j and the implied SSE_j^k. This whole procedure could be repeated for every individual in the first sample. What will almost certainly be found is that the best-fitting functional varies from individual to individual. Moreover, the best-fitting parameter vector for any given preference functional will vary from individual to individual.
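To make the fitting step concrete, here is a minimal sketch of estimating the best-fitting parameter a_j^k for one individual by minimising the sum of squared errors in equation (2). Everything in it is illustrative: the one-parameter EU functional with u(x) = x^a, the questions and the choices are all invented, and a crude grid search stands in for a proper optimiser.

```python
def eu(lottery, a):
    """Expected utility of [(prob, outcome), ...] under u(x) = x**a."""
    return sum(p * x**a for p, x in lottery)

# Ten invented pairwise-choice problems; x_ij = 0 if Left chosen, 1 if Right.
problems = [([(1.0, 10 + i)], [(0.5, 28 + 2 * i), (0.5, 0.0)]) for i in range(10)]
x = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]

def sse(a):
    """Sum of squared errors between the theory's predictions and the data."""
    preds = [int(eu(R, a) > eu(L, a)) for L, R in problems]   # x_i^k(a): 0 = Left, 1 = Right
    return sum((xi - pi) ** 2 for xi, pi in zip(x, preds))

grid = [i / 100 for i in range(10, 100)]          # crude grid over the parameter a
a_hat = min(grid, key=sse)                        # a_j^k: best-fitting parameter for this subject
print(a_hat, sse(a_hat))                          # and the implied SSE_j^k
```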
At this point, a decision has to be made as to how the various estimated functionals should be used in the prediction part of the Competition. One possibility is to choose just one functional, and to assume that this functional describes the approximate behaviour of the individuals in the second sample from the population. There is then a problem in knowing the distribution of the parameter vector in the second sample. It could be assumed that the whole of the population has the same preference functional with some population distribution of the parameter vector. The distribution of the estimated parameter vector in the first sample could be used as an estimate of this population distribution, and this in turn could be used to make the prediction for the second sample. Alternatively, it could be assumed that different proportions of the population have different preference functionals, with distributions of the relevant parameter vector over the population. The proportions in the population with different preference functionals could be estimated from the proportions in the first sample with the respective best-fitting functionals, and the distributions of the relevant parameter vector estimated from the estimated distributions in each sector of the first sample.

This latter story could also be implemented in a slightly different way. Instead of fitting preference functionals individual by individual, one could adopt one or other of the following procedures: (1) it could be assumed that all individuals have the same preference functional with a given distribution over the population, and the parameters of the distribution estimated by pooling together all the first sample; (2) it could be assumed that different proportions of the population had different preference functionals, with corresponding distributions of the relevant parameter vectors over the different proportions, and the proportions and the parameters of the distributions estimated by pooling together all the first sample. Econometricians might prefer one or other of the two methods described in this paragraph, as the estimation might be more efficient (if there were no specification biases).

In the experimental literature, we see both types of procedure adopted—some in which the data is analysed subject by subject, and some in which the data is pooled across all subjects. Much of the earlier literature adopted this latter stance, often with different decision problems posed to different groups of subjects. In this case the statistical analyses were different—because often the fitted functional forms were different for the two groups. In essence the two groups were not identical and some story had to be invoked to explain the differences. The usual story was that the parameters of the functionals were different subject by subject, with both groups being random drawings from some greater population. In this case, the natural story was that the noise was differences between subjects, rather than noise within subjects.

All of the above discussion has assumed that the theories being used are deterministic. Things are slightly different if the theories are stochastic—as the noise is built into the theory. But the same considerations apply. Once again, the economist will try and estimate the underlying DGP—this time possibly aided by the theory in terms of specifying the appropriate stochastic nature of the observations. The same considerations apply to the prediction part of the Competition. However, I deliberately do not devote much space to the consideration of stochastic theories, as most of the existing theories are deterministic. Let me now turn to a point that I have effectively ignored throughout the above discussion—the importance of the correct modelling of the noise in behaviour.

3. How to model the noise
This section argues three things: first, that the noise in experimental data might be between subjects or within subjects, or both; second, that in order to get a sensible interpretation
of experimental data, we should take account of the possible existence of both; third, that implicit in any statistical analysis of experimental data is some assumption concerning the stochastic properties of the noise in the experimental data—and that it is better that any implicit assumptions are made explicit (and tested) so that their validity can be exposed to discussion and test.

Economists have a long tradition of modelling noise in data, and econometricians have developed sophisticated techniques for taking account of it. In the early—pre-experimental—days, it was usually thought that the noise came from uncontrolled elements in the application of the theory: in the collection of the data, there were variables, not taken into account by the theory, which varied across the sample; or there were errors of measurement. The noise came from these uncontrolled factors. However, with experimental data, where the experimenter tries to control for irrelevant factors, we have to think again about the sources and nature of noise. If we give the same decision task to the same individual on more than one occasion, and the individual gives different responses on different occasions, we are virtually forced to conclude that the noise is within the subject, rather than outside. This implies that we are forced to use different econometric methods to take the noise into account. This difference between noise within and noise between subjects is important and I will turn to that next, before turning to the more general issue of the appropriate modelling of the noise. I shall adopt an historical perspective.

Hidden in any statistical analysis is some implicit assumption about the noise in the data—the stochastic structure of the data generating process. It is useful to try and make explicit the hidden implicit assumption. I will give an extended example at the end of this section. Consider the Competition once more. Suppose the set of questions being posed to the two samples is a set of I pairwise choice questions, all of the form: "Do you prefer the lottery on the Left or the lottery on the Right?". All the theories under consideration come up with predictions about what the answers will be, conditional on some appropriate model-specific parameter values. There are two kinds of noise observed: noise between subjects (people are different); and noise within subjects (people's behaviour has some random component). How one processes the data, and how one specifies the stochastic component (the noise) depends on the objectives of the experimenter and the nature of the data. Let me consider a number of possibilities, beginning with the chronologically early ones.

I start with a very simple example which was typical of the early tests carried out by people like Maurice Allais—a good early example is Allais (1953), while a collection of early results can be found in Allais and Hagen (1979). Suppose one group of subjects is asked one pairwise choice question (between L_1 and R_1), while a second group is asked a different pairwise choice question (between L_2 and R_2), and suppose the experiment is designed to test Expected Utility (EU) theory against a rather vague unspecified class of non-EU theories. Suppose further that the two pairwise choice questions are designed so that if a person obeys EU then he or she prefers L_1 to R_1 if and only if he or she prefers L_2 to R_2.
Suppose the experimenter observes that the proportion choosing Left on the first question is p_1 while the proportion choosing Left on the second is p_2. What the experimenter typically does at this stage, in order to test whether 'the subjects obey EU or not', is to test whether p_1 and p_2 are significantly different, using the usual test of the difference between two proportions. What is the justification for this test procedure? If the subjects posed the two
questions are different, then the justification is the following (for the null hypothesis that subjects follow EU): a proportion p of the population would choose L_1 on the first question and L_2 on the second (note, under the null hypothesis that all the population are EU, the same proportion would choose Left on the first and Left on the second). Therefore, if we take a random sample of size n_1 from the population to answer the first question, and a random sample of size n_2 to answer the second question, and these are independent random samples, then p_i is distributed N(p, p(1 − p)/n_i) for i = 1, 2, p_1 and p_2 are independent, and hence p_1 − p_2 is distributed N(0, p(1 − p)/n_1 + p(1 − p)/n_2). Thus a standard test of the difference between two proportions is valid if the two samples (the two groups of subjects) are independent random samples from the population at large. Implicit in this procedure is the notion that the noise is across subjects—that subjects are different. A proportion p in the population chooses L_1 and the same proportion chooses L_2: if we choose one subject at random from the population the probability is p that this subject chooses Left. The choice varies randomly across subjects—the noise is across subjects. The test used by these early experimenters was valid.

Later experimenters thought that asking just one question to a subject was inefficient, so they followed a different procedure: instead of presenting question 1 to one group of subjects and question 2 to a second, they presented both questions to the same set of subjects. The data set is therefore richer and more can be done with it than before. In fact the data takes the form of the table below.
                       Number choosing L_2    Number choosing R_2    Total
Number choosing L_1          n_11                   n_12             n_1*
Number choosing R_1          n_21                   n_22             n_2*
Total                        n_*1                   n_*2             n_**
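As an aside, here is a minimal sketch of the standard test of the difference between two proportions that the early, between-subjects designs relied on. The counts are invented purely for illustration, and the pooled-variance form of the statistic is used, as implied by the null hypothesis of a common p.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n1, left1 = 100, 62          # group 1: choices on (L_1, R_1), 62 chose Left
n2, left2 = 100, 48          # group 2: choices on (L_2, R_2), 48 chose Left

p1_hat, p2_hat = left1 / n1, left2 / n2
p_pool = (left1 + left2) / (n1 + n2)              # common p under the EU null
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
ts = (p1_hat - p2_hat) / se
p_value = 2 * (1 - phi(abs(ts)))
print(ts, p_value)           # compare |ts| with 1.96 for a 5% two-sided test
```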
Some experimenters used the full set of data—and I will discuss a particular use of it at the end of this section after I have developed some further notation and technical apparatus. However, some other experimenters effectively threw away some of the data available in the above table and carried out exactly the same statistical test as Allais and his followers had done. To be precise, they compared the proportion choosing L_1 on the first question (n_1*/n_**) with the proportion choosing L_2 on the second question (n_*1/n_**), and tested if they were significantly different from each other. How was this justified? Was it that the noise was only across subjects? Clearly not, as we have the same set of subjects in each group. If individual subjects are not noisy (that is, there is no error in their decision-making) then we can test EU individual by individual: EU is rejected for any one subject if that subject chose L_1 and R_2 (or L_2 and R_1). But, typically, experimenters did not do this—instead they carried out the same statistical test as they had done when the questions were asked to different sets of subjects. How do we begin to make sense of this? Only if there is noise within subjects.5

Clearly the idea that there is noise within subjects makes empirical sense. As a matter of fact, if we ask the same questions to the same subjects they typically give different answers, even in situations where they 'should not'. A good example is reported in Hey (2001) in which subjects were given the same set of questions on five different occasions. For almost
all the subjects, the answers given differed on all five occasions—though there were subjects for whom their answers converged: these may be regarded as subjects who were learning through the repetitions (either about the questions or about their preferences). Learning usually, but not always, implies a reduction in noise and there are clearly situations (for example, market experiments) in which some of the noise (that, for example, relating to the market price) effectively disappears after sufficient repetitions. How you treat the data depends on what you are interested in: the steady state (and whether it is affected by the noise in the early stages) or the early stages (replete with noise) themselves. But, whatever it is that you are interested in, you still have to model the noise—if only for the purposes of justifying your econometric methods.

To summarise: we could have noise between subjects (people are different) or noise within subjects (people's behaviour has some random component) or both. Noise between subjects means that the parameters (of whatever preference functional) differ between individuals. One way to model this is to assume some distribution of the parameters over the population and assume that your subject pool is a random sample from that population.6 Alternatively, you could analyse subjects individual by individual. In the literature, we see both types of procedure; which is the 'appropriate' one depends to a large extent on the nature of the data and the nature of the assumptions that you are prepared to make on the stochastic structure. I will return to this point later. For simplicity at the moment, let us assume that we have decided to analyse our subjects one by one—on the grounds that they are different (and that there is no econometric reason for pooling all the subjects together).7

How do we model noise within a subject? How do we make explicit what has been implicit in the discussion of Section 2? In practice we see three main approaches. We could call these: (1) the constant error probability story; (2) the noisy measurement story; and (3) the random preferences story. Let us take these in turn, using V(G) to denote a particular theory's evaluation (through the relevant preference function) of a particular gamble/lottery G. V(G) could be deterministic or random; I begin with the first of these.

Suppose subjects are presented with I pairwise choice questions (L_i, R_i) for i = 1, ..., I, on each of which they have to state whether they prefer L_i or R_i (we will assume that the experimenter does not allow subjects to state indifference). If the preference functional of a theory being investigated is V(·), and this is deterministic, as I am assuming at the moment, then, if the subject does not make any errors, he or she states that he or she prefers L_i to R_i if V(L_i) > V(R_i), that he or she prefers R_i to L_i if V(R_i) > V(L_i), and states some preference at random if V(L_i) = V(R_i). These may be considered the subject's true preferences. Usually it is the case that no data set (with a sufficiently large number of questions) is exactly consistent with any preference functional for any given individual. One interpretation of this is that the subject makes some error when reporting their true preferences. You may take some other interpretation—but you have to make some interpretation. I make mine because it allows me to take economic theory into consideration while at the same time acknowledging the existence of noise.
If you do not like this story, then it is necessary to come up with an alternative one in order to justify the econometric techniques that you have chosen to analyse the data. In the meantime, let me discuss the implications of my interpretation.
How might a subject have true deterministic preferences yet not report them? I would argue that this is because subjects make mistakes. There are two main stories about how this might happen. The first and simplest one is just that on any question the subject reports the wrong preferences by accident: L_i is actually preferred but R_i is reported as preferred with probability p_i. The story is just that the subject reports the preference with some error—there is some moment of inattention, or the subject hits the wrong key by mistake, or the subject misreads the question. Quite clearly we could hypothesise that p_i is dependent on i, the question, but it is much simpler to assume that p_i is independent of i, that is, that it is constant. This is the first of our models—the constant error probability model. It was brought to prominence by Harless and Camerer (1994). It is a bit odd in that it assumes that the probability of making a mistake is constant, independent of the nature of the question. In contrast, we might think that there are certain questions for which the choice is so obvious that the chances of a mistake are very small. The second of our models incorporates this feature.

This second model obviously has a slightly different structure. It has recently been used by Hey and Orme (1994) though its use, particularly in psychology, dates back much earlier. Here the story is that the subject evaluates the two gambles with his or her preference functional V(·) but does so with some measurement error. So L_i is evaluated as V(L_i) + ε_Li and R_i is evaluated as V(R_i) + ε_Ri, and the difference between them as V(L_i) − V(R_i) + ε_i where ε_i = ε_Li − ε_Ri. The analogy to think of here is that of measuring the length of a table: if you are given a tape measure and are asked to measure the length of a table, you will inevitably measure it with error—and every time that you measure it, you will get a different length. This is an interesting analogy as it helps to shed light on the accuracy of the measurement—clearly this will be greater, the higher the incentive: if you are given no incentive to measure the table accurately, there will be a lot of error; if instead, you are going to have a dining room alcove made to fit the table neatly, you will probably measure it more accurately. The same, we feel, is true in economics experiments.

The two stories above assume that the subject actually knows his or her preference function. This may well not be true. There is an additional problem with both these stories: they both imply that, faced with a choice between two lotteries, one of which dominates the other, the subject will with positive probability choose the dominated lottery. This seems counter-intuitive—it is also counter to the facts (as long as the dominance is 'obvious'). One error story that gets round this latter problem is our third—the random preferences model. This assumes that the subject does not have a unique preference functional, but a set of them. When faced with a decision problem, the subject draws one preference functional out of that set and uses that to come up with an answer to that decision problem. (I note, in passing, that a subject who follows this procedure will never violate dominance—choose the dominated lottery—as long as all the preference functionals in the set satisfy monotonicity.) We have three possible stories—all of which have some kind of economic justification.
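To make the three stories concrete, here is a minimal simulation sketch of how each one generates a choice in a single pairwise choice problem. The lotteries, the EU functional with u(x) = x^r, and all parameter values are invented for illustration; nothing here is taken from the paper.

```python
# A sketch (under invented assumptions) of the three error stories for one
# pairwise choice (L_i, R_i), using expected utility with u(x) = x**r.
import random
random.seed(1)

def eu(lottery, r):
    """Expected utility of [(prob, outcome), ...] with u(x) = x**r."""
    return sum(p * x**r for p, x in lottery)

L = [(1.0, 10.0)]                    # safe lottery: 10 for sure
R = [(0.5, 25.0), (0.5, 0.0)]        # risky lottery

def choice_constant_error(r=0.5, tremble=0.1):
    """Story 1: true preference reported, but flipped with constant probability."""
    prefers_L = eu(L, r) > eu(R, r)
    if random.random() < tremble:
        prefers_L = not prefers_L
    return "L" if prefers_L else "R"

def choice_noisy_measurement(r=0.5, sigma=0.5):
    """Story 2: each lottery's value is measured with additive noise."""
    vL = eu(L, r) + random.gauss(0.0, sigma)
    vR = eu(R, r) + random.gauss(0.0, sigma)
    return "L" if vL > vR else "R"

def choice_random_preference(r_low=0.3, r_high=0.9):
    """Story 3: a fresh parameter (hence preference functional) is drawn each time."""
    r = random.uniform(r_low, r_high)
    return "L" if eu(L, r) > eu(R, r) else "R"

for f in (choice_constant_error, choice_noisy_measurement, choice_random_preference):
    draws = [f() for _ in range(10000)]
    print(f.__name__, draws.count("L") / len(draws))   # frequency of choosing Left
```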
The crucial point for the present paper is that the econometric procedures that you would use to analyse the data are critically different story by story. If, for example, you want to fit a particular preference functional to a subject's data—to see (1) what parameters of that preference functional best fit that subject's behaviour, and (2) how well that preference functional explains that subject's behaviour—then the appropriate econometric procedure is determined by the error story that you choose. This is an important point, but I do not
want to get bogged down in technical details, so I will restrict my discussion to the first two of the three error models discussed above. Let me slightly change the way I denote the preference functional that we are using to V(·|α) (where α denotes the parameter vector) to emphasise the point that virtually all preference functionals embody some parameters.8 Our objective is to find the 'best estimate' of α—which presumably means the estimate of α for which the data and the model are closest—and the implied 'goodness of fit' of that particular preference functional. At this point we could go on a long econometric detour exploring what we mean by 'best fitting'. But life is too short and the discussion inconclusive, so I will just take the Maximum Likelihood procedure as being the one that 'best' achieves this objective. The point that I am making remains valid whatever 'best fitting' procedure we are using: the point is that the results depend upon the error story.

The procedure thus is the following: we write down the (log-)likelihood function9 given our error specification. We then take as our estimate of the parameter vector α the one that maximises the (log-)likelihood, and the resulting maximised (log-)likelihood is our measure of the goodness of fit of that preference functional. Thus, in order to make our point, that the error specification matters, all we need to do is to show that the different error specifications lead to different (log-)likelihoods. This is straightforward, though notationally a bit messy.

The decisions of the subject are either L_i or R_i on each question i = 1, ..., I. It simplifies our notation if we use d to indicate the decision, with d = 0 indicating that the subject chose Left and d = 1 indicating that the subject chose Right. Whatever preference functional we use comes up with some predicted decisions on each question. Let us denote these by c_1(α), c_2(α), ..., c_I(α) and let us assume that these take the same two values—0 for Left predicted and 1 for Right predicted. Note that they obviously depend upon the parameter vector. It will save notational clutter if we define the error (or the difference between the predictions and the data) by e_i(α) = |c_i(α) − d_i|. Note that e_i(α) = 0 if the data agrees with the prediction of the theory while e_i(α) = 1 if the data does not agree.

Let us now construct the likelihood function under the first error story—the constant error probability story. The likelihood is simply given by:

L = Π_{i=1}^{I} p^{e_i(α)} (1 − p)^{1 − e_i(α)}    (3)
This is because if e_i = 1 then there is a mistake (according to that preference function and for that parameter vector), while if e_i = 0 there is not a mistake. The respective probabilities of these are p and 1 − p. At this stage, it is easier to see what is going on if we examine the log-likelihood. This is given by:

LL = Σ_{i=1}^{I} {e_i(α) ln(p) + [1 − e_i(α)] ln(1 − p)}    (4)
This has to be maximised with respect to p and α. Elementary calculus shows that the optimal value of p for any parameter vector is simply p* = Σ_{i=1}^{I} e_i(α) / I; that is, the
maximum-likelihood estimate of the probability of making a mistake is simply equal to the proportion of wrong predictions. To save writing in what follows, let me denote by E(α) the number of wrong predictions of a particular preference functional combined with a particular parameter vector. Then let us put p = p* = E(α)/I in equation (3) above. We get that the log-likelihood is:

LL = E(α) ln[E(α)/I] + [I − E(α)] ln[(I − E(α))/I]

from which it is clear that the log-likelihood maximising value of α is that which maximises I − E(α)—that is, which maximises the number of correct predictions. As is well known, this is the Maximum Score estimator of α.

If, however, we follow the second error story, we have to proceed in a different direction. The specification of the likelihood function depends upon the specification of the error term ε. Usually for practical reasons (connected with the availability of computer software) this is assumed to be normal or logistic. What is crucial is that the distribution function is specified. Let me denote this by F(·). Thus F(E) denotes the probability that ε is less than E. Our procedure is as follows. Given this second error story, the decision maker chooses Left on question i if V(L_i|α) − V(R_i|α) + ε_i > 0 and chooses Right if V(L_i|α) − V(R_i|α) + ε_i < 0. (The subject is indifferent with probability 0 if the variable ε is continuous.) So the probability of the subject choosing Left is the probability that V(L_i|α) − V(R_i|α) + ε_i > 0, which is equal to the probability that ε_i > V(R_i|α) − V(L_i|α), which, in turn, is equal to 1 − F[V(R_i|α) − V(L_i|α)]. The probability of the subject choosing Right is, therefore, F[V(R_i|α) − V(L_i|α)]. Given data as before, denoted by d_1, d_2, ..., d_I, the log-likelihood function is given by:

LL = Σ_{i=1}^{I} {d_i ln F[V(R_i|α) − V(L_i|α)] + (1 − d_i) ln(1 − F[V(R_i|α) − V(L_i|α)])}    (5)
(Recall that d_i = 0 indicates that the subject chose Left and d_i = 1 indicates that the subject chose Right.) We note that the preference functional V(·|α) depends upon the parameter vector α and hence the log-likelihood depends upon α. The log-likelihood maximising value of this parameter vector is the one that maximises the log-likelihood in equation (5). Now compare (5) with (4). They are different. It therefore follows that the maximum-likelihood estimates of the parameters are different and the implied 'goodnesses of fit' are different. The same is true under the third specification, but we refrain from going into detail. Here the stochastic component is the parameter vector itself and the estimation/fitting procedure consists in finding the best-fitting distribution of this parameter vector. This line of enquiry has been followed by Loomes et al. (2002). Of course, one can combine two or more of these three stories into one (as Loomes et al. have been doing) or evolve different stories. The crucial point is that the estimation procedure differs according to the stochastic specification and hence the resulting estimates differ, as does the goodness of fit.
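To illustrate how the two specifications lead to different estimates, here is a minimal sketch in which both log-likelihoods, (4) with p at its optimum and (5) with a normal F, are maximised over crude grids for the same invented data. The EU functional with u(x) = x^a, the questions, the subject's choices and the grids are all assumptions made purely for the illustration.

```python
from math import erf, log, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def eu(lottery, a):
    """Expected utility of [(prob, outcome), ...] under u(x) = x**a."""
    return sum(p * x**a for p, x in lottery)

# Ten invented pairwise-choice questions (L_i, R_i) and one subject's choices,
# coded d_i = 0 for Left, 1 for Right, as in the text.
questions = [([(1.0, 10 + i)], [(0.5, 30 + 2 * i), (0.5, 0.0)]) for i in range(10)]
d = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]

def ll_constant_error(a):
    """Log-likelihood (4) with p set at its optimum E/I (the Maximum Score idea)."""
    E = sum(1 for (L, R), di in zip(questions, d)
            if int(eu(R, a) > eu(L, a)) != di)        # e_i(a): is the prediction wrong?
    I = len(d)
    if E in (0, I):                                   # the corners give LL = 0
        return 0.0
    return E * log(E / I) + (I - E) * log((I - E) / I)

def ll_noisy_measurement(a, sigma):
    """Log-likelihood (5) with normal F of scale sigma."""
    total = 0.0
    for (L, R), di in zip(questions, d):
        pR = phi((eu(R, a) - eu(L, a)) / sigma)       # P(choose Right)
        pR = min(max(pR, 1e-12), 1 - 1e-12)           # guard against log(0)
        total += di * log(pR) + (1 - di) * log(1 - pR)
    return total

grid = [i / 100 for i in range(20, 100)]
a1 = max(grid, key=ll_constant_error)
a2, s2 = max(((a, s) for a in grid for s in (0.5, 1, 2, 4)),
             key=lambda t: ll_noisy_measurement(*t))
print("constant-error estimate of a:", a1)
print("noisy-measurement estimate of (a, sigma):", a2, s2)
```

The two procedures need not, and in general will not, return the same parameter estimate, which is the point being made in the text.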
As is well known from the econometric literature, if one mis-specifies the stochastic structure of the data generating process, then the inferences one makes from the data are misleading and may be biased.

A number of detailed points could be made at this point, but I will restrict attention to two broad points—one connected with the general specification of the estimation and the other connected with testing the stochastic assumptions made. In the above discussion, we have been quite restrictive in describing what might be done with the data—here we have assumed that the estimation is done subject by subject and that the estimation is concerned with estimating the parameters of particular preference functionals along with the estimation of any parameters describing the stochastic specification. Other possibilities are clearly allowable. Here is a partial list, in no particular order. Much depends on how much data you have and what you want to do with it.

(a) You could estimate subject by subject and assume particular functional forms in the preference functionals (rather than parameterise them);
(b) You could pool all your subjects together and assume that they all have the same stochastic parameters;
(c) You could pool all your subjects together and put some assumed structure on the distribution of the parameters across the subjects;
(d) You could pool all your subjects together, and, rather than try and estimate particular preference functionals, simply assume that a certain proportion of them are EU, a certain proportion are Rank Dependent EU, and so on, and then estimate these various proportions.

Of course, you could assume the same or differing error mechanisms for different subjects. The list is almost endless.

The final main point to note in this section is that one cannot just go around assuming any kind of structure on the noise. Though it is true that, in experiments, as in any other area of empirical economics, the economist tends to choose the specification that is easiest in some sense (to model or to compute or to estimate), the economist should make efforts to test the assumed structure. A good example is the paper by Ballinger and Wilcox (1997). Such testing is well established in more conventional areas of econometrics but could be practised more in experimental areas. Note that these tests should extend to other implicit assumptions. In the Allais-type experiments discussed above, there is the assumption that the observations are independent of each other. If the observations come from different subjects, who have no contact with each other, it may well be true that the observations are independent. If, however, the observations come from a series of decisions by one individual, the assumption of independence may be open to doubt. For example, a subject who has just chosen the riskier gamble in a pairwise choice problem may well choose the safer gamble on the next problem. This is something that is rarely tested, but which should be. Similarly, the assumption that the residuals are homoscedastic should be tested. Some preliminary work on the existence of heteroscedasticity is reported in Hey (1995) and more sophisticated analysis in Blavatskyy (2005). Further discussion on the appropriate modelling of noise can be found in Loomes and Sugden (1995). Again there is the usual implication: if the assumed stochastic structure is wrong, then the inferences made will be wrong. For some reason, experimental economists seem less
concerned about this than their non-experimental colleagues. Perhaps there are good reasons why this is so, but these reasons need to be made explicit.

Perhaps it would be useful to conclude this section with an extended example of explicit and apparently honest and innocent assumptions which turn out to have hidden implications. Consider Starmer and Sugden (1989), who provide a sophisticated test of various non-Expected Utility theories.10 Their strategy is an extension of that used by Allais (1953) and by Camerer (1989). The description that follows discusses only a subset of their analysis, but it does illustrate well the point I want to make. Let me restrict attention to their analysis of the responses of subjects to two pairwise choice questions. Using my notation, let me write them as (L_1, R_1) and (L_2, R_2): Starmer and Sugden ask the subjects whether they prefer L_1 or R_1; and then they ask them whether they prefer L_2 or R_2. They then classify the responses—either L_1 L_2 or L_1 R_2 or R_1 L_2 or R_1 R_2. The questions have been constructed in such a way that, if the individual's preferences respect Expected Utility theory, then L_1 is preferred to R_1 if and only if L_2 is preferred to R_2. The violations of EU are the observations in the categories L_1 R_2 and R_1 L_2. They note that there are lots of subjects in these categories but not the same number in L_1 R_2 as in R_1 L_2. They want to test whether these are significantly different in some sense. They employ the following null hypothesis, and I quote at length because it is important (page 169, but changing their notation to mine): "The problem is to determine what pattern of deviation from EU is implied by an hypothesis of random error. There is no single, obviously correct answer to this question. There is no received theory of the mental processes by which individuals arrive at choices that are consistent with EU: EU is a theory of what people do but not of how they do it. Because of this, we cannot derive from EU a theory of how people might deviate from it. Our strategy is to use a simple and salient null hypothesis, namely that for any pair of questions, L_1 R_2 and R_1 L_2 observations are equally likely." Later on page 169, they repeat: "our null hypothesis of random switching is that L_1 R_2 and R_1 L_2 observations are equally likely to occur".

This seems extremely plausible, but it contrasts with an equally plausible story. Consider the noisy measurement story. Here L_1 is stated as preferred to R_1 if V(L_1) − V(R_1) + ε_1 > 0 and L_2 as preferred to R_2 if V(L_2) − V(R_2) + ε_2 > 0. So the probability that L_1 is stated as preferred to R_1 is P[ε_1 > V(R_1) − V(L_1)] while the probability that L_2 is stated as preferred to R_2 is P[ε_2 > V(R_2) − V(L_2)]. Now let us use the fact that the gambles were constructed in such a way that EU predicts that L_1 is preferred to R_1 if and only if L_2 is preferred to R_2. The relevant questions in Starmer and Sugden's paper were built using the common ratio effect and it follows, therefore, using the properties of the EU preference functional, that V(R_2) − V(L_2) = λ[V(R_1) − V(L_1)] where λ is a fixed number, either greater than or less than 1, depending upon how the lotteries were constructed. This means that, without error, V(R_2) > V(L_2) if and only if V(R_1) > V(L_1), that is, L_1 is preferred to R_1 if and only if L_2 is preferred to R_2.
Taking into account the measurement error, it follows that the probability that L_1 is chosen over R_1 is P[ε_1 > V(R_1) − V(L_1)] while the probability that L_2 is chosen over R_2 is P[ε_2 > λ{V(R_1) − V(L_1)}]. Note that these are not equal if we assume that ε_1 and ε_2 have the same distribution. Note further that if V(L_1) > V(R_1)—that is, if L_1 is really preferred over R_1—then the probability that L_2 will be chosen over R_2 will be less (more) than the probability that L_1 is chosen over R_1 if λ is smaller (greater) than 1. Let me give an example to show the implications. Suppose
that the probability that L_1 is chosen over R_1 is 0.85 and the probability that L_2 is chosen over R_2 is 0.55. Suppose further that ε_1 and ε_2 are independent. Then we have the following respective probabilities for the four possible outcomes:

L_1 L_2: 0.4675    L_1 R_2: 0.3825    R_1 L_2: 0.0825    R_1 R_2: 0.0675
It follows therefore that it is not true that L_1 R_2 and R_1 L_2 are equally likely. Indeed, in this example, which is by no means implausible, L_1 R_2 is almost 5 times more likely than R_1 L_2. This suggests either that Starmer and Sugden's innocent-looking assumption is in fact not so innocent, or that the measurement error story is less innocent than it appears. Whatever is the case, it cautions us to be careful about the stochastic assumptions we make about our data.
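The numbers above are easy to reproduce. The sketch below assumes standard normal errors and invented values for V(R_1) − V(L_1) and λ, chosen only so that the two choice probabilities come out at roughly the 0.85 and 0.55 used in the example; with ε_1 and ε_2 independent, the four cell probabilities, and the roughly five-to-one asymmetry between L_1 R_2 and R_1 L_2, follow immediately.

```python
# Illustration of the noisy measurement story in the Starmer-Sugden example:
# EU implies V(R_2) - V(L_2) = lambda * (V(R_1) - V(L_1)), so with a common
# error distribution the two choice probabilities, and hence the cells, differ.
from math import erf, sqrt

def phi(x):                          # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

v_diff_1 = -1.04                     # assumed V(R_1) - V(L_1): L_1 genuinely preferred
lam = 0.12                           # assumed common-ratio scaling factor lambda

p_L1 = 1.0 - phi(v_diff_1)           # P(choose L_1) = P(eps_1 > V(R_1) - V(L_1)), about 0.85
p_L2 = 1.0 - phi(lam * v_diff_1)     # P(choose L_2) = P(eps_2 > lambda*(V(R_1) - V(L_1))), about 0.55

cells = {
    "L1L2": p_L1 * p_L2,             # independence of eps_1 and eps_2
    "L1R2": p_L1 * (1 - p_L2),
    "R1L2": (1 - p_L1) * p_L2,
    "R1R2": (1 - p_L1) * (1 - p_L2),
}
print(cells)                          # L1R2 and R1L2 are far from equally likely
print(cells["L1R2"] / cells["R1L2"])  # roughly the "almost 5 times" in the text
```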
4. What should we do with our data, given that there is noise in it?
If we were Bayesians, the answer to the above question would be simple: at any stage we have some beliefs over the set of all possibilities and we update these beliefs in the light of any new evidence. So, in the context of decision making under risk, we have probabilities attached to each possible preference functional, and probability distributions over the parameter vector associated with each preference functional. Every time we get new evidence, we update these distributions in the light of the data. In practice we do not do this, partly because we are not all Bayesians, partly because of the computational difficulties in calculating the posterior distributions, partly because it is difficult to get a consensus on the prior distributions, and partly because we are not equally happy about accepting the experimental data of others.11 We could circumvent some of these problems by simply reporting the likelihood implied by any new experimental data and leaving it to others to update whatever priors they had. This would be simpler. But note what the objective of this is—to take into account the implications of new data.

If we are not Bayesians, then we are forced to adopt alternative procedures. Suppose that we are Classical statisticians—as is implicit in virtually all publications in economics and particularly those in experimental economics. Classical statistics has two main inferential procedures—estimation and testing. If we look at the literature, we see that both these procedures are used. I would like to argue that we should largely adopt the first of these. This is for two reasons: first, that with a sufficient number of observations one can always reject any null hypothesis (as long as there is some noise in the data); second, that this hypothesis testing procedure, certainly as practised in the literature, almost inevitably fragments the data available to the economist—so the economist ends up not looking at the whole picture painted by the data. I expand on each of these points below. I continue to make my comments in relation to the literature on decision making under risk.

Consider the early literature testing Expected Utility theory, which essentially took a pair of pairwise choice questions (L_1, R_1) and (L_2, R_2), constructed in such a way that if a person obeyed EU then he or she either preferred L_1 to R_1 and L_2 to R_2, or R_1 to L_1 and R_2 to L_2 (or was indifferent between L_1 and R_1 and between L_2 and R_2). Violations
of EU were reported if an individual preferred L_1 to R_1 and R_2 to L_2, or R_1 to L_1 and L_2 to R_2, or, taking noise into account, if the proportion of people choosing L_1 over R_1 was significantly different from the proportion choosing L_2 over R_2. I am sceptical about this procedure because, as long as there is some departure from EU, however small, then, given enough data, the test will reject the null hypothesis that behaviour is consistent with EU. This may be true, but it may not be interesting. Whether it is or not seems to depend more on the economic significance of departures from EU.

Let me demonstrate the point I am trying to make. Suppose a proportion p_1 in the population would choose L_1 in the first question and a proportion p_2 in the population would choose L_2 in the second. Denote by p̂_1 the sample proportion choosing L_1 and by p̂_2 the sample proportion choosing L_2. Then to test the null hypothesis that p_1 = p_2 one calculates the test statistic

TS = (p̂_1 − p̂_2) / √[p̂_1(1 − p̂_1)/n + p̂_2(1 − p̂_2)/n]

and checks whether it is greater than some appropriate critical value—for example, 1.96 at a 5% level of significance. Suppose that the actual difference in the population is p_1 − p_2 = δ. Then it is trivially true that sooner or later the observed value of TS is bound to be bigger than the critical value, as long as δ is not zero and n is sufficiently large. Setting p̂_1 − p̂_2 = δ in the formula above gives TS = δ√n / √[p̂_1(1 − p̂_1) + p̂_2(1 − p̂_2)], which can clearly be made larger than any critical value by choosing n large enough—as long as, of course, δ is not zero and neither p̂_1 nor p̂_2 is 0 or 1.
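A minimal numerical sketch of this point, with invented population proportions that differ by only δ = 0.02, so that EU is 'nearly' true:

```python
# With a small but non-zero difference delta between the population proportions,
# TS grows like sqrt(n) and eventually exceeds any critical value.
from math import sqrt

p1, p2 = 0.52, 0.50                      # invented true proportions: delta = 0.02
for n in (100, 1000, 10000, 100000):     # sample size per group
    # for the illustration, plug the true proportions in for the sample ones
    ts = (p1 - p2) / sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    print(n, round(ts, 2), "reject at 5%" if ts > 1.96 else "do not reject")
```

With the true proportions plugged in for the sample ones, the statistic crosses the 5% critical value somewhere between n = 1,000 and n = 10,000 per group, purely because n has grown.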
So, as long as the two population proportions are not precisely equal, then sooner or later you will reject the hypothesis that they are equal. Hence, unless EU is exactly true (on average), sooner or later you will reject the hypothesis that it is true. When (and if) you do simply reflects how many observations you have and how much noise there is in the data. What seems to be more important to me is not whether EU is exactly true (which we know it is not) but how far away from the truth it is. Is the departure big in economic terms? Is the departure from EU of economic significance? The statistical significance seems to me to be of a secondary order of importance.

The second, and more serious, point—that the hypothesis testing framework encourages the fragmentation of the data—can best be illustrated by example. Let me turn to the paper by Starmer (1992). This is an excellent example of a type of statistical inference that is widely practised by economists. In the experiment on which the analysis is based, subjects were asked 20 pairwise choice questions, but only 10 of these are relevant to the analysis carried out by Starmer. His central analysis is contained in Table 3 of that paper, in which he reports tests based on 13 comparisons of the answers to one pairwise choice question with the answers to a second. So, for example, the first row of this table compares the answers to question 1 with the answers to question 2, the second row compares the answers to question 2 with the answers to question 3, and so on. So we have the answers to 10 questions and 13 different comparisons—the tests of which are all carried out completely independently of each other. Of course, this is not a valid procedure as the comparisons are not independent of each other. One is losing information by not taking all the data together. Formally, if
one is interested in the set of comparisons listed in Table 3 one needs to formulate an appropriate test of all 13 comparisons jointly. Clearly this is a difficult procedure—and that is the reason why Starmer (and many other economists testing hypotheses) avoid it. Until such test procedures are developed, the use of hypothesis testing as one's inferential procedure encourages this fragmentation. I would argue that, having done an experiment, one should report the implication of the experiment as a whole, not a series of fragmented, and possibly opposing, implications obtained from certain subsets of the data. If one takes the Starmer data, and fits models to it all, then one can provide an overall picture of the implications of the experiment as a whole. This takes us back to what we were doing before—fitting models to data, finding the best-fitting parameters and seeing how well the model fits the data. If you like, you can do formal econometric/statistical tests of whether one model fits significantly better than another. If one of the models is nested inside the other this is particularly easy—you just do some kind of chi-square test of the differences between the maximised log-likelihoods. Non-nested models can also be compared, though the procedure is less straightforward. I should note, however, that these are statistical tests and suffer from the same qualifications that I discussed in the paragraph above.

The crucial point is the appropriate trade-off between the descriptive power of a model and its prescriptive parsimony: other things being equal we prefer a theory that predicts better; other things being equal we prefer a theory that is more parsimonious (has fewer 'parameters' in its implementation). The latter is why, a priori, we prefer EU to, say, Rank-Dependent EU: EU has fewer parameters, and is more powerful when it comes to prediction—though we recognise that Rank-Dependent EU better fits the data. There is no obvious way to trade off descriptive power and prescriptive parsimony, though Selten has come up with one way (Selten, 1991). This is not the time or the place to enter into a detailed discussion of these issues—though we note that, whatever the appropriate trade-off, the specification of the stochastic structure of the data lies at its very heart.

In essence, the appropriate procedure depends upon the task in hand—it depends on what you are going to use the analysis for. One possibility is that one wants to win the Competition described in Section 2. Although this is very specific, it seems to me to capture what the economist is trying to do, and takes us back to the ultimate objective of economics. I appreciate that not all share my view—and this is manifested in the types of experiments that are conducted. Some are intended simply to explore what might exist out there in the real world. Indeed, this was a view expressed by a referee on an earlier version of this paper, and amplified by the Editors of this special issue. They argued that ". . . there is a defensible role for a type of testing which is immune to that [my] criticism because it is directed towards identifying new regularities (not rejecting theories). . . ". I am all in favour of using experiments to identify interesting new regularities, though I appreciate that many journals do not like experiments which are hypothesis-generating experiments and that they prefer hypothesis-testing ones.
But this hides an inconsistency on the part of journals—one cannot use the same experiment (the same data set) both to generate new hypotheses and to test them. There should be a place for reporting theory-generating experiments. But this is a separate point from that of the appropriate statistical procedure (if any) for analysing the data from such experiments. In a sense the issues discussed in the paragraph above are peripheral to the central theme of this paper, but it may be useful to comment a little more about what is going on. Let
me eliminate some potential problems by assuming that the typical subject pool used in economic experiments is 'representative' of some larger population, so that the results from the experiment can be used in predicting the behaviour of this larger population. There is a more serious problem as to the 'representativeness' of the problems posed to subjects in experiments. Suppose, however, that they are representative of some larger set of problems encountered by economic agents in the real world. Then, for example, the conclusion that Expected Utility theory is rejected in favour of some more general theory is interesting, and suggests that using EU as your prediction tool could lead to biases in the prediction. However, one needs to know how serious these biases are. Moreover, in order to apply some more general theory for prediction purposes one requires more information. There is a cost to acquiring this extra information and it is not clear that there is an overall gain to the economist in collecting it. After all, the value of prediction depends (positively) upon the precision of the predictions and also (negatively) on the cost of making those predictions.

I have assumed above both that the set of subjects is in some sense representative of some population and that the set of questions is in some sense representative. The importance of the first assumption can, of course, be investigated by using different samples, collecting demographic data and exploring its effect on the decisions of the subjects. The second assumption is more tricky. Let me give an example. As is now well-documented, departures from EU seem to depend upon the question. As Munier and his associates have shown (see, for example, Abdellaoui and Munier, 1998), departures seem more likely around the edges of the Marschak-Machina Triangle than in the centre. How should we take this into account? I suppose the answer goes back to how we want to use our results: if we are going to predict choice behaviour in the middle of the Triangle, then perhaps we can use EU; if, however, we are interested in behaviour around the edges, then EU may not be particularly reliable. The trouble is that it is difficult to define what we mean by a set of 'representative questions'. Nevertheless, it does force us back to thinking seriously about the uses to which we are going to put our research. It may also stop us from looking for a single 'best' theory.

A further important point that is emerging from recent research is that earlier results might need re-interpretation when the existence of noise is taken into account. It is already well known that some part of the common ratio effect, for example, can be explained by the existence of noise. This suggests that ignoring the noise leads to an incorrect, or at least a biased, interpretation of the data. Even apparently robust departures from rational behaviour—such as the preference reversal phenomenon—seem to be partly explicable by the existence of noise in behaviour. As Schmidt and Hey (2004) have shown, subjects display different amounts of noise in preference tasks than they do in pricing tasks; taking this into account leads to a sizable reduction in the preference reversal phenomenon. There is also a working paper by Blavatskyy (2005) which applies the same ideas; I shall mention his work in Section 5.
I do not want to suggest that these papers are representative of all recent research, but they make the point that I think is important: taking into account the existence of noise in subjects’ behaviour changes our interpretation of the experimental results.
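To make the model-fitting and nested-model comparison discussed above a little more concrete, here is a minimal illustrative sketch: it is my own construction, not code from any of the papers cited, and the functional forms, parameter values and variable names are all assumptions made purely for illustration. Simulated pairwise-choice data are fitted by maximum likelihood under an Expected Utility model with Fechner ('measurement error') noise and under a one-parameter rank-dependent generalisation; the two are then compared by referring twice the difference between the maximised log-likelihoods to a chi-square distribution with degrees of freedom equal to the number of restricted parameters.

```python
# A minimal illustrative sketch (not code from any paper cited here): fit an EU model
# with Fechner ("measurement error") noise to simulated pairwise-choice data, fit a
# one-parameter rank-dependent generalisation, and compare the two with a
# likelihood-ratio (chi-square) test. All functional forms and values are assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)

# Each problem: lottery A pays xA with probability pA (else 0); similarly for B.
n = 200
pA, pB = rng.uniform(0.05, 0.95, n), rng.uniform(0.05, 0.95, n)
xA, xB = rng.uniform(10, 100, n), rng.uniform(10, 100, n)

def u(x, r):
    """CRRA-style utility, normalised so that u(100) = 1."""
    return (x / 100.0) ** (1.0 - r)

def w(p, g):
    """One-parameter probability weighting; g = 1 reduces it to the identity (EU)."""
    return p ** g / (p ** g + (1.0 - p) ** g) ** (1.0 / g)

def value(p, x, r, g):
    """Value of an (x with probability p, else 0) lottery under weighted utility."""
    return w(p, g) * u(x, r)

def neg_loglik(theta, choices, eu_only):
    """Negative log-likelihood with Fechner noise: choose A iff V(A) - V(B) + eps > 0."""
    r, sigma = theta[0], theta[-1]
    g = 1.0 if eu_only else theta[1]
    dv = value(pA, xA, r, g) - value(pB, xB, r, g)
    prob_A = norm.cdf(dv / sigma)
    prob = np.where(choices, prob_A, 1.0 - prob_A)
    return -np.sum(np.log(np.clip(prob, 1e-12, None)))

# Simulate a hypothetical subject whose 'true' model is the rank-dependent one.
dv_true = value(pA, xA, 0.5, 0.7) - value(pB, xB, 0.5, 0.7)
choices = (dv_true + rng.normal(0.0, 0.1, n)) > 0

fit_eu = minimize(neg_loglik, x0=[0.5, 0.1], args=(choices, True),
                  bounds=[(-2.0, 0.99), (1e-3, 2.0)])
fit_rd = minimize(neg_loglik, x0=[0.5, 1.0, 0.1], args=(choices, False),
                  bounds=[(-2.0, 0.99), (0.2, 3.0), (1e-3, 2.0)])

# EU is nested in the rank-dependent model (g = 1), so twice the difference in
# maximised log-likelihoods is asymptotically chi-square with 1 degree of freedom.
lr = 2.0 * (fit_eu.fun - fit_rd.fun)
print(f"LR statistic = {lr:.2f}, p-value = {chi2.sf(lr, df=1):.4f}")
```

Note that, with enough simulated (or real) data, this procedure will reject the restricted model whenever it is not exactly true, which is precisely the qualification about testing raised earlier in the paper.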
5. A brief historical perspective
It may be useful, before concluding, to give an overview of the way that the literature has been developing insofar as the formal consideration of noise is concerned. As I have already noted, the early pioneering work was carried out by Allais (1953). His work was largely concerned with the testing of non-EU theories against EU, and the question of the nature of the stochastic foundations of the data was of secondary importance. From that beginning, there was an accelerating growth in experimental work, in the consequent development of new theories and in their empirical investigation using experiments. Particularly influential in this work was Camerer (1989 and 1992); the titles of these two papers, "An Experimental Test of Several Generalized Utility Theories" and "Recent Tests of Generalizations of EU Theories", emphasise the point that the methodology up to then was very largely that of testing, often using the fragmentary approach discussed above. The nature of the stochastic process generating the data was still not particularly a subject for discussion, but the statistical tests were becoming more sophisticated (if not better justified). Camerer (1995) also provides an influential survey of developments up to that point, and a more up-to-date survey can be found in Starmer (2000). In 1994, there was a shift of emphasis in the literature, from testing to estimation, in two path-breaking articles published in the same issue of Econometrica: Harless and Camerer (1994) and Hey and Orme (1994). The first of these was unusual and innovative in that it combined data from 23 other experiments in order to try to answer the question "what is the best predictor of choice under risk". It provided a methodology for combining the results from different data sets, rather than fragmenting them. Hey and Orme (1994) were in pursuit of a similar objective, but using a quite different methodology—trying to find the 'best' preference functional individual by individual from one large data set. Both of these articles shifted the focus of empirical exploration from testing to estimation. Since then, others have taken up the challenge: particularly Carbone in conjunction with Hey (Carbone, 1997 and 1998; Carbone and Hey, 1994 and 1995), Camerer and Ho (1994), Wu and Gonzalez (1996 and 1998), Gonzalez and Wu (1999), and the Loomes, Moffatt and Sugden team. Different authors use different noise specifications and estimate in different ways: the Carbone and Hey team estimate subject by subject and use the measurement error story. Camerer and Ho and, following them, Wu and Gonzalez, assume what Wu and Gonzalez term a "single-agent stochastic choice model"—where all agents have the same preferences but utility is random. This is similar in approach to the random preferences model but with the restriction that all agents have the same preferences. Gonzalez and Wu (1999) estimate subject by subject and assume a measurement error story to account for the noise. Neilson and Stowe (2002) explore some of the implications of the estimated functions. Loomes et al. (2002) use all three specifications that I discussed in Section 3, and include, for good measure, a trembling hand, in order to get a grip on one of the key issues emerging from this literature: the correct specification of the noise term. Other papers which tackle the same issues are Carbone and Hey (2000) and Loomes and Sugden (1998). Loomes et al.
(2002) conclude that “In relation to the set of problems presented in this experiment, our subjects’ decision making behaviour can be modelled as converging towards a random preference model of stochastic choice with expected utility as the core theory.” This is the third of
my stories discussed in Section 3 above. A more recent paper, that by Blavatskyy (2005), which ". . . proposes a new model that explains the violations of expected utility theory through the role of random errors", comes up with somewhat different conclusions. After analysing the data from 10 previous studies, he concludes that ". . . [my] model fits the data from ten well-known experimental studies at least as good as cumulative prospect theory." Interestingly, he refines previous analyses by taking into account the heteroscedastic nature of the data, which is one of the points raised in Hey (1995). In a sense we have come full circle: by adding in appropriately specified noise, Expected Utility theory seems to emerge as no worse than any of the new theories of decision making under risk. We could argue that the very existence of these new theories might be attributed to insufficient attention to, and incorrect specification of, the noise in experimental data. A rather milder conclusion is voiced by Harrison and List (2004) in their survey of field experiments: "Tests of expected utility theory have provided a dramatic illustration of the importance of thought experiments being explicitly linked to stochastic assumptions involved in their use. Several studies offer a rich array of different error specifications leading to very different inferences about the validity of expected utility theory, and particularly about what part of it appears to be broken."
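To make the random preference specification singled out by Loomes et al. (2002) above a little more concrete, the following minimal sketch (again my own illustration, with assumed names, lotteries and distributions, not taken from any of the cited papers) shows its defining feature: each time a decision problem is faced, a fresh preference parameter is drawn from a population distribution, and the option with the higher expected utility under that draw is then chosen with no further error. The same pair of lotteries, offered repeatedly, can therefore produce different choices.

```python
# A minimal illustrative sketch of a random-preference EU model (assumed names and
# distributions, purely for illustration): all of the noise comes from drawing a fresh
# risk-aversion parameter for each decision; given the draw, choice is deterministic.
import numpy as np

rng = np.random.default_rng(1)

def eu(p, x, r):
    """Expected utility of a lottery paying x with probability p (else 0), CRRA index r."""
    return p * (x / 100.0) ** (1.0 - r)

def random_preference_choice(pA, xA, pB, xB, mean_r=0.2, sd_r=0.3):
    r = rng.normal(mean_r, sd_r)          # fresh preference parameter for this decision
    return eu(pA, xA, r) > eu(pB, xB, r)  # True means lottery A is chosen

# The same pair offered 20 times typically produces a mixture of choices:
# A is the 'safer' lottery (high probability, low prize), B the 'riskier' one.
pair = dict(pA=0.9, xA=30.0, pB=0.3, xB=90.0)
choices = [random_preference_choice(**pair) for _ in range(20)]
print(sum(choices), "choices of A out of", len(choices))
```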
6. Conclusions
The key points are that noise exists and that the correct modelling of it is crucial. Just as in any other area of economics, the correct specification of the stochastic structure underlying the data is critical—we cannot just worry about the correct modelling of the deterministic component of behaviour, we must also get right the modelling of the stochastic component. Perhaps experimental economists have an advantage over their non-experimental colleagues in that the stochastic component—the noise in behaviour—is a product of the behaviour of the subjects. Perhaps experimental economists are therefore better placed to model how and why there is noise in the behaviour of subjects. More importantly, experimental economists can use their experimental tools to investigate the various hypotheses concerning the noise in experiments. We already know a certain amount: for example, we know that in certain types of experiments (certain types of decision task) the noise declines through time; in other types the noise persists. We also know that the variance of the noise is sensitive to the incentives provided to the subjects in experiments: the greater the incentives, the less the noise. We also know that in certain decision problems noise is almost non-existent—for example, when in a pairwise choice one of the pair dominates the other. Nevertheless, we are still a long way from fully understanding the nature of noise in experiments and, more generally, the nature of noise in economic decision-making.

Notes

1. Obviously variations on this theme can be proposed. I deliberately keep the example simple.
2. I presume that economists are more interested in predicting the behaviour of groups of individuals rather than that of single individuals.
3. I realise that this is a rather stylised account of what economists do—but it captures the essence sufficiently for the purposes of the discussion of this paper.
4. It could be the case that I depends on j, or, more generally, that different members of the sample are given different sets of decision problems, but to incorporate this into the notation would create too much notational clutter.
5. I should note that this by itself is not sufficient to make sense of the procedure—but it is a necessary condition. To fully justify the test procedure we would need to add further structure—as we discuss later.
6. Though this could be a rather flimsy justification if you were just using a 'standard subject pool' of students at your own university—which is hardly a random sample from the population at large.
7. Though there may be such a justification. I will return to this point later.
8. The only one that I can think of that does not is the Expected Value Maximisation model.
9. We can use either the likelihood or the log-likelihood, since they are related in an increasing monotone way. In practice, it is often computationally simpler to use the log-likelihood.
10. I should emphasise that there are many authors who make similar hidden assumptions. I have chosen this particular example simply because Chris Starmer very kindly pointed me towards it.
11. This statement, if explored, could lead to the opening of a large can of worms.
References

Abdellaoui, M. and Munier, B. (1998). "The Risk-Structure Dependence Effect: Experimenting with an Eye to Decision Making." Annals of Operations Research. 80, 237–252.
Allais, M. (1953). "Le Comportement de l'Homme Rationnel devant le Risque: Critique des Postulats et Axiomes de l'École Américaine." Econometrica. 21, 503–546.
Allais, M. and Hagen, O. (1979). Expected Utility Hypotheses and the Allais Paradox. Dordrecht: Reidel.
Ballinger, T.P. and Wilcox, N.T. (1997). "Decisions, Error and Heterogeneity." Economic Journal. 107, 1090–1105.
Blavatskyy, P. (2005). "A Stochastic Expected Utility Theory." Working paper, Institute for Empirical Research in Economics, University of Zurich.
Camerer, C. (1989). "An Experimental Test of Several Generalized Utility Theories." Journal of Risk and Uncertainty. 2, 61–104.
Camerer, C. (1992). "Recent Tests of Generalizations of EU Theories." In Edwards, W. (ed.), Utility: Theories, Measurement and Applications. Kluwer.
Camerer, C. (1995). "Individual Decision Making." In Kagel, J. and Roth, A. (eds.), The Handbook of Experimental Economics. Princeton: Princeton University Press, 587–703.
Camerer, C. and Ho, T. (1994). "Violations of the Betweenness Axiom and Nonlinearity in Probability." Journal of Risk and Uncertainty. 8, 167–196.
Carbone, E. (1997). "Investigation of Stochastic Preference Theory Using Experimental Data." Economics Letters. 57, 305–311.
Carbone, E. and Hey, J.D. (1994). "Estimation of Expected Utility and Non-Expected Utility Preference Functionals Using Complete Ranking Data." In Munier, B. and Machina, M.J. (eds.), Models and Experiments on Risk and Rationality. Kluwer Academic Publishers, 119–139.
Carbone, E. and Hey, J.D. (1995). "A Comparison of the Estimates of EU and non-EU Preference Functionals Using Data from Pairwise Choice and Complete Ranking Experiments." Geneva Papers on Risk and Insurance Theory. 20, 111–133.
Carbone, E. and Hey, J.D. (2000). "Which Error Story is Best?" Journal of Risk and Uncertainty. 20, 161–176.
Chew, S.H., Epstein, L.G., and Segal, U. (1991). "Mixture Symmetry and Quadratic Utility." Econometrica. 59, 139–164.
Gonzalez, R. and Wu, G. (1999). "On the Shape of the Probability Weighting Function." Cognitive Psychology. 38, 129–166.
Harless, D.W. and Camerer, C.F. (1994). "The Predictive Utility of Generalized Expected Utility Theories." Econometrica. 62, 1251–1290.
Harrison, G.W. and List, J.A. (2004). "Field Experiments." Journal of Economic Literature. 42, 1009–1055.
Hey, J.D. (1995). "Experimental Investigations of Errors in Decision Making Under Risk." European Economic Review. 39, 641–648.
Hey, J.D. (2001). "Does Repetition Improve Consistency?" Experimental Economics. 4, 5–54.
Hey, J.D. and Carbone, E. (1995). "Stochastic Choice with Deterministic Preferences: An Experimental Investigation." Economics Letters. 47, 161–167.
Hey, J.D. and Orme, C.D. (1994). "Investigating Generalisations of Expected Utility Theory Using Experimental Data." Econometrica. 62, 1291–1326.
Loomes, G. and Sugden, R. (1995). "Incorporating a Stochastic Element into Decision Theories." European Economic Review. 39, 641–648.
Loomes, G. and Sugden, R. (1998). "Testing Different Stochastic Specifications of Risky Choice." Economica. 65, 581–598.
Loomes, G., Moffatt, P.G., and Sugden, R. (2002). "A Microeconometric Test of Alternative Stochastic Theories of Risky Choice." Journal of Risk and Uncertainty. 24, 103–130.
Machina, M.J. (1985). "Stochastic Choice Functions Generated from Deterministic Preferences over Lotteries." Economic Journal. 95, 575–594.
Neilson, W. and Stowe, J. (2002). "A Further Examination of Cumulative Prospect Theory Parameterizations." Journal of Risk and Uncertainty. 24, 31–46.
Schmidt, U. and Hey, J.D. (2004). "Are Preference Reversals Errors?" Journal of Risk and Uncertainty. 29, 207–218.
Selten, R. (1991). "Properties of a Measure of Predictive Success." Mathematical Social Sciences. 21, 153–167.
Starmer, C. (1992). "Testing New Theories of Choice Under Uncertainty Using the Common Consequence Effect." Review of Economic Studies. 59, 813–830.
Starmer, C. (2000). "Developments in Non-Expected Utility Theory: The Hunt for a Descriptive Theory of Choice Under Risk." Journal of Economic Literature. 38, 332–382.
Starmer, C. and Sugden, R. (1989). "Probability and Juxtaposition Effects: An Experimental Investigation of the Common Ratio Effect." Journal of Risk and Uncertainty. 2, 159–178.
Wu, G. and Gonzalez, R. (1996). "Curvature of the Probability Weighting Function." Management Science. 42, 1676–1690.
Wu, G. and Gonzalez, R. (1998). "Common Consequence Conditions in Decision Making under Risk." Journal of Risk and Uncertainty. 16, 115–139.