Synthese DOI 10.1007/s11229-014-0640-x
Measuring the overall incoherence of credence functions Julia Staffel
Received: 3 July 2014 / Accepted: 17 December 2014 © Springer Science+Business Media Dordrecht 2015
Abstract Many philosophers hold that the probability axioms constitute norms of rationality governing degrees of belief. This view, known as subjective Bayesianism, has been widely criticized for being too idealized. It is claimed that the norms on degrees of belief postulated by subjective Bayesianism cannot be followed by human agents, and hence have no normative force for beings like us. This problem is especially pressing since the standard framework of subjective Bayesianism only allows us to distinguish between two kinds of credence functions—coherent ones that obey the probability axioms perfectly, and incoherent ones that don’t. An attractive response to this problem is to extend the framework of subjective Bayesianism in such a way that we can measure differences between incoherent credence functions. This lets us explain how the Bayesian ideals can be approximated by humans. I argue that we should look for a measure that captures what I call the ‘overall degree of incoherence’ of a credence function. I then examine various incoherence measures that have been proposed in the literature, and evaluate whether they are suitable for measuring overall incoherence. The competitors are a qualitative measure that relies on finding coherent subsets of incoherent credence functions, a class of quantitative measures that measure incoherence in terms of normalized Dutch book loss, and a class of distance measures that determine the distance to the closest coherent credence function. I argue that one particular Dutch book measure and a corresponding distance measure are particularly well suited for capturing the overall degree of incoherence of a credence function. Keywords Bayesianism · Credence · Incoherence · Coherence · Measure · Probability
J. Staffel (B) Department of Philosophy, Washington University in St. Louis, One Brookings Drive, St. Louis, MO 63130-4899, USA e-mail: [email protected]
1 Introduction Many philosophers hold that the probability axioms constitute norms of rationality governing degrees of belief. This view is widely known as subjective Bayesianism. While this view is the foundation of a broad research program, it is also widely criticized for being too idealized. It is claimed that the norms on degrees of belief postulated by subjective Bayesianism cannot be followed by human agents, and hence that these norms have no normative force for beings like us. This problem is especially pressing since the standard framework of subjective Bayesianism only allows us to distinguish between two kinds of credence functions—coherent ones that obey the probability axioms perfectly, and incoherent ones that don’t. An attractive response to this problem is to extend the framework of subjective Bayesianism in such a way that we can capture differences between incoherent credence functions. Being able to measure to what degree a credence function is incoherent helps us model the degrees of belief of non-ideal agents. Further, a framework that allows for degrees of incoherence enables us to explain how the ideal rules of Bayesianism can be approximated by non-ideal agents, and hence to explain how such rules can be normative for non-ideal agents. And such a framework will also give us an important tool for evaluating reasoning processes of non-ideal agents. In this paper, I will first explain in more detail why we need to be able to measure degrees of incoherence, and what such a measure is supposed to capture. I will propose that we should look for a measure that captures what I call the ‘overall degree of incoherence’ of a credence function. I will then move on to examine various different incoherence measures that have been proposed in the literature, and evaluate whether they are suitable for measuring overall incoherence. The competitors are a qualitative measure that relies on finding coherent subsets of incoherent credence functions, a class of quantitative measures that measure incoherence in terms of normalized Dutch book loss, and a class of distance measures that determine the distance to the closest coherent credence function. I argue that one particular Dutch book measure and a corresponding distance measure are particularly well suited for capturing the overall degree of incoherence of a credence function.
2 Why we need to measure degrees of incoherence

On the standard subjective Bayesian view, any rational credence function, which is an assignment of degrees of belief to propositions, must obey the axioms of probability. To set up the probability calculus, we begin with a set of atomic statements {A_i}, and we combine it with the standard sentential logical operators to define a language L. We also assume that the relation of logical entailment is defined in the classical way. A probability function P on L must satisfy the following axioms:

Normalization: For any statement A, if A is a tautology, P(A) = 1
Non-Negativity: For any statement A, P(A) ≥ 0
Finite Additivity: For any two mutually exclusive statements A, B, P(A ∨ B) = P(A) + P(B)
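To make the axioms concrete, here is a minimal sketch (my illustration, not part of the paper) that checks a finite credence assignment against them. The proposition labels, the helper function, and the hand-supplied logical facts (which sentences are tautologies, which disjuncts are mutually exclusive) are all hypothetical.

```python
def axiom_violations(cr, tautologies, exclusive_disjunctions, tol=1e-9):
    """Report violations of Normalization, Non-Negativity, and Finite
    Additivity for a credence assignment cr (a dict from sentence labels to
    numbers). The logical facts are supplied by hand for this toy language."""
    violations = []
    for sentence, value in cr.items():                        # Non-Negativity
        if value < 0:
            violations.append(f"Non-Negativity fails for {sentence}")
    for t in tautologies:                                      # Normalization
        if t in cr and abs(cr[t] - 1) > tol:
            violations.append(f"Normalization fails for {t}")
    for disj, a, b in exclusive_disjunctions:                  # Finite Additivity
        if all(s in cr for s in (disj, a, b)) and abs(cr[disj] - (cr[a] + cr[b])) > tol:
            violations.append(f"Finite Additivity fails for {disj}")
    return violations

# A toy incoherent assignment over p, ~p, and the tautology p v ~p:
cr = {"p": 0.4, "~p": 0.6, "p v ~p": 0.95}
print(axiom_violations(cr, tautologies=["p v ~p"],
                       exclusive_disjunctions=[("p v ~p", "p", "~p")]))
# -> reports that Normalization and Finite Additivity are violated
```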
An agent’s unconditional degrees of belief, or credences, are considered probabilistically coherent if and only if they obey these probability axioms. Moreover, it is standardly assumed that agents must also have coherent conditional credences, which means that their credences must obey the following ratio formula:1 Conditional Probability For any A and B in L, if P(B) > 0, then P(A|B) = P(A&B)/P(B) While the standard Bayesian system doesn’t let us distinguish any further between various incoherent credence functions, it is easy to see that, intuitively, some incoherent credence functions are more incoherent than others. Suppose, for example, that there are two agents, Sally and Polly, whose credence functions are the same, except for their respective credences in some tautology T. According to the probability axioms, the correct credence to assign to T is 1. Sally’s credence in T is 0.99, whereas Polly’s credence in T is 0.2. Intuitively, Polly’s credence in T displays a greater failure of coherence than Sally’s. Yet, we cannot capture this difference in the standard Bayesian framework. Being able to extend the Bayesian framework so as to capture these differences is a desirable goal in itself, but there are at least three further reasons to think that a measure of probabilistic incoherence would be desirable. The first reason is that having such a measure would help us respond to one of the standard objections to Bayesianism, which claims that Bayesianism has no application to real agents (e.g. Hacking 1967; Harman 1986; Christensen 2004). This is because the Bayesian norms are so demanding that no actual agent could be expected to have a credence function that complies with all of them. Human beings, having limited cognitive capacities, are not in a position to make complicated calculations easily, and they don’t have an immediate grasp of complex logical relations and properties, which is arguably needed in order to keep one’s credences coherent at all times. As Zynda (1996) has argued, it is hard to see how an ideal norm can apply to non-ideal agents if there is no meaningful sense in which these agents can come closer to complying with the norm. We can capture this worry in a general principle: Norm Governance For any agent A and end E, if A could never achieve E, then E can serve as a norm for A just in case (i) we can grade how closely A approximates E, and (ii) relative to this gradation, it is within A’s power to approximate E to varying degrees. If we apply Norm Governance to the standard Bayesian framework, the problem is immediately obvious: the standard Bayesian view doesn’t allow us to measure degrees of incoherence. But if we can’t measure degrees of incoherence, then Norm Governance tells us that probabilistic coherence is not an end that can serve as a norm for human agents. However, if we can measure probabilistic incoherence, then we can
1 In the examples I consider in this paper, I will mostly be concerned with unconditional credences. However, the measure I end up favoring can measure the incoherence of conditional as well as unconditional credences, so nothing here depends on focusing on unconditional credences for most of the paper. For a more detailed overview of the requirements on rational credences, see Weisberg (2011). For an excellent discussion of the status of the rules of probability as requirements of rationality, see Ch. 1–4 of Titelbaum (2013).
explain how agents can approximate perfect coherence, and hence we can explain how the ideal Bayesian norms apply to non-ideal agents. A similar idea is expressed by Earman in a brief remark that predates Zynda’s detailed discussion. Addressing the concern that Bayesian norms are too hard to comply with for humans, he says: “The response that Bayesian norms should be regarded as goals towards which we should strive even if we always fall short is idle puffery unless it is specified how we can take steps to bring us closer to the goals.” (Earman 1992, p. 56) Of course, a measure of incoherence doesn’t yet provide a self-help guide for how to improve one’s epistemic state, but at least it spells out what it means to get closer to the epistemic ideal.
A second reason why it would be helpful to be able to measure probabilistic incoherence is that we could thereby evaluate the methods by which non-ideal agents revise their degrees of belief. Normal agents, who never comply with all the ideal Bayesian norms, clearly revise their degrees of belief routinely. One dimension along which we can evaluate reasoning strategies is with respect to their effect on the agent’s degree of incoherence. Surely, even if an agent has credences that are already incoherent, it would be epistemically bad (all other things being equal) for her to reason in ways that make her even more incoherent. Yet, if all we can say is that non-ideal agents always move from one incoherent state to another, we can’t provide any useful assessment along these lines. By contrast, if we can measure how incoherent an agent’s credences are before and after she employs a certain reasoning strategy, we have a tool by which reasoning strategies can be assessed.
A third reason why it is valuable to have a measure of degrees of incoherence is that it can help us evaluate the accuracy of people’s credences. In a recent study, Wang et al. (2011) collected probabilistic forecasts about the outcome of the 2008 U.S. presidential elections from 15940 subjects, and generated predictions of the election results by aggregating the individual judgments. This task was complicated by the fact that most of the forecasts weren’t probabilistically coherent, and by the challenge of identifying which judges’ estimates were the most credible. The authors showed that measures of incoherence could be used to help solve both of these problems: first, they demonstrated that the forecasters whose judgments were the least incoherent also turned out to be the most accurate (where accuracy is measured by the Brier score2 as distance from the true outcomes of the election), and that this method of finding the most trustworthy judges was better than relying on people’s self-reported confidence in their judgment. Furthermore, the authors used an incoherence measure to find the coherent forecast of the election result that was closest to the aggregate of the incoherent forecasts. This forecast turned out to be remarkably accurate, as it only predicted one state incorrectly on the day before the election.3

2 The Brier score is a proper scoring rule that can be used to measure the accuracy, or epistemic utility, of an agent’s credence, or credence function, at a given world. It corresponds to the squared distance measure, and it is essentially a way of measuring how far away the agent’s credences are from the truth (or some other privileged credence function)—the larger the difference between an agent’s credence in some proposition p and the truth value of p at that world, the higher the inaccuracy, or epistemic disutility, of that credence. More formally: suppose c is a credence function defined over a set of propositions F, and the function I indicates the truth values of the propositions in F at world w by mapping them onto {0, 1}. Then the following function gives us the Brier score of c at w: Brier(c, w) = Σ_{A∈F} (c(A) − I_w(A))²
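A minimal sketch of this computation (my illustration, with made-up propositions and numbers, not data from the study):

```python
def brier_score(credences, truth_values):
    """Sum of squared differences between each credence and the truth value
    (1 if true, 0 if false) of that proposition at the given world."""
    return sum((credences[a] - truth_values[a]) ** 2 for a in credences)

forecast = {"p": 0.9, "q": 0.6}      # hypothetical credences
world = {"p": 1, "q": 0}             # p is true, q is false at this world
print(round(brier_score(forecast, world), 4))  # (0.9 - 1)^2 + (0.6 - 0)^2 = 0.37
```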
3 What are we trying to measure?

The fact that there are intuitive differences in incoherence between credence functions is not a very precise starting point for laying out conditions that a suitable incoherence measure should meet. There could be different incoherence measures, each of which fulfills a useful function. Hence, we must become clearer about what our target is. A useful analogy to think about is how to measure the wealth of a country. One common way of doing so is by determining the size of its gross domestic product, which is essentially a measure of the size of a country’s economy. Another common way is to measure the gross domestic product based on purchasing power per capita, which relativizes the gross domestic product of a country to the size of its population, and takes into account the local cost of living.4 In principle, there are also other, nonstandard measures we could adopt. For example, we could measure a country’s wealth according to the average income of the richest 1 % of the population, or the poorest 1 %. It is easy to see why these nonstandard measures aren’t as good as the commonly used ones. What we are trying to get at is a way of measuring the overall wealth of a country. That means a good measure must either take into account, roughly speaking, the wealth of every inhabitant, or it must somehow extrapolate from the wealth of a group that is deemed representative. But the nonstandard measures I mentioned rely on extrapolating from groups that aren’t representative of the entire country. For example, a very poor country with a small rich elite could mistakenly be classified as wealthy. Hence, even if our common sense notion of a country’s wealth isn’t precise enough to pick out a specific measure, it is still useful for ruling out measures that clearly don’t capture how wealthy a country is overall. How does this shed light on measuring incoherence? I submit that we should look for a measure of the overall incoherence of a credence function. Such a measure will track global differences in coherence between credence functions in a way that is sensitive to changes in coherence across all of the agent’s credences. And just like in the case of measuring wealth, we should reject measures that determine overall incoherence based on a non-representative sample of the agent’s credences. If an incoherence measure exhibits this behavior, I will say that it has a swamping problem. The swamping problem obtains when two credence functions appear to differ in how incoherent they are, but the measure in question assigns them the same degree of incoherence, because it relies on an unrepresentative sample of the agent’s credences, and relevant differences between the two functions are thereby swamped.
3 Thanks to an anonymous referee for pointing me to this result.
4 Source: http://www.worldbank.org/depweb/beyond/global/chapter2.html, accessed on November 26, 2013.
These considerations, of course, may still not be precise enough to single out one measure as the best measure of overall incoherence, but I am confident that they will be very helpful in ruling out measures that clearly don’t capture our target. We will also have to rely on judgments about differences in incoherence in particular cases. If our goal is to give a precise account of overall incoherence that is an explication of our common sense understanding of degrees of incoherence, our judgments about particular cases are needed to delineate our target. One might worry that we often lack a firm grasp of differences in incoherence in particular cases. But in order to delineate our target, it will be sufficient to consider examples in which our judgments about differences in incoherence are clear. A measure of overall incoherence is particularly well suited for the applications I mentioned in the previous section: capturing intuitive differences between incoherent agents, explaining how non-ideal agents can approximate ideal norms, tracking changes in incoherence that result from reasoning, and identifying accurate forecasters. In the following sections, I will evaluate different incoherence measures that have been proposed in the literature, and examine whether they are measures of overall incoherence that are suitable for these applications.

4 Zynda’s measure of incoherence

The claim that we need a measure of degrees of incoherence in order to explain how the ideal Bayesian norms can apply to real agents is what motivates Zynda (1996) in developing his measure. His discussion suggests that he is, like me, interested in measuring the overall incoherence of credence functions. The basic idea behind Zynda’s measure is that we can compare incoherent credence functions by comparing their maximal coherent restrictions. That means we compare incoherent credence functions by comparing the largest subsets of those functions that aren’t incoherent. He defines a credence function as a set of ordered pairs of propositions and their assigned credences, where the propositions must form a Boolean algebra.5 A maximal coherent restriction of an incoherent credence function can be generated by removing the smallest possible number of proposition/credence pairs from the agent’s credence function, such that the remaining credences can be extended to a coherent credence function over a Boolean algebra. Often, there will be more than one way of doing so, in which case all possible ways of creating a maximal coherent restriction must be considered. In comparing two different credence functions f and g, one can arrive at one of four different outcomes: (1) f and g have the same maximal coherent restrictions, (2) for all maximal coherent restrictions of f and g, each of the maximal coherent restrictions of f is a proper subset of one of the maximal coherent restrictions of g, (3) for all maximal coherent restrictions of f and g, each of the maximal coherent restrictions of g is a proper subset of one of the maximal coherent restrictions of f, or (4) none of the above. In the first case, f and g are equally coherent; in the second case, g is more coherent than f; in the third case, f is more coherent than g; and in the fourth case, f and g are incommensurable. We can use this method to order credence functions that are commensurable with respect to how incoherent they are.

5 A set of propositions has the structure of a Boolean algebra just in case it contains every logically distinct proposition that can be expressed by combining the atomic propositions in the set with the standard logical connectives.

I will now argue that Zynda’s measure is problematic, for three reasons: (i) it orders incoherent credence functions in ways that are intuitively incorrect, (ii) credence functions that are intuitively comparable are incommensurable in his framework, and (iii) Zynda’s measure is ill-suited to evaluate reasoning processes of incoherent agents. I will first show that the measure doesn’t capture intuitive differences between incoherent credences, and hence produces counterintuitive orderings of incoherent credence functions. For example, consider two credence functions f and g, which are both defined over the same set of propositions {p, ∼p, ⊥, T}. They assign credences as follows:

f(p) = 0.5      g(p) = 0.5
f(∼p) = 0.5     g(∼p) = 0.5
f(⊥) = 0        g(⊥) = 0
f(T) = 0.5      g(T) = 0.99
The two functions have the same maximally coherent restrictions, namely the following set of proposition/credence pairs {<p, 0.5>, <∼p, 0.5>, <⊥, 0>}. This means that according to Zynda’s measure, they are equally incoherent. However, this is not an intuitive result at all. Given that f assigns a degree of belief of 0.5 to the tautology, whereas g assigns it a degree of belief of 0.99, and that is their only difference, it seems much more natural to think that f is more incoherent than g. Hence, Zynda’s measure fails to capture the intuitive difference in incoherence between two credence functions. And since capturing intuitive differences between incoherent credence functions is one important goal of a measure of incoherence, this is a significant problem for this measure.6

6 Zynda is in fact aware of this kind of result of his measure, and he comments on it in a footnote of his paper.
Consider, for example, a person whose degree of belief function f is thoroughly incoherent but is everywhere numerically close to a probability function. […] Intuitively, there is a sense in which such a person’s state of opinion is very ‘close’ to being coherent, but it would come out very badly on my account, since very little of it is actually coherent. […] This is a distinct sense of comparative coherence from the one offered above; in my view, both senses are interesting and worth developing in greater detail. (Zynda 1996, p. 215)
Interestingly, Zynda acknowledges that his measure does not capture the very intuitive idea that the degree of incoherence of a credence function depends on numerical closeness to a probability function. Yet, he claims that there is a different graded notion of incoherence, which only depends on which parts of a credence function are actually coherent. I don’t think that this notion is what we are aiming for when we try to find a measure of probabilistic incoherence. If we want to know how much an agent diverges from being perfectly coherent, it seems natural and important to take numerical differences between agents’ credences into account. Zynda’s measure may still be of technical interest, but I think it fails to capture our most natural and interesting judgments about degrees of incoherence.

The second problem with Zynda’s measure concerns cases in which two credence functions can intuitively be compared, but are incommensurable according to Zynda’s measure. Suppose again that we are comparing credence functions defined over the set of propositions {p, ∼p, ⊥, T}. We want to compare two credence functions f and g:

f(p) = 0.5       g(p) = 0.9
f(∼p) = 0.51     g(∼p) = 0.9
f(⊥) = 0         g(⊥) = 0
f(T) = 1         g(T) = 1
Notice that the sum of the credences in p and ∼p is 1.01 for f, whereas it is 1.8 in the case of g. Since rationality requires that any agent’s credence in two propositions p and ∼p sum to 1, it seems intuitively obvious that g displays a greater failure of coherence than f. However, Zynda’s measure doesn’t let us compare the two credence functions. For each credence function, we can create two maximally coherent restrictions, either by removing the credence in p, or by removing the credence in ∼p. Doing so reveals that they neither have the same maximally coherent restrictions, nor do their maximally coherent restrictions stand in a subset relation, which means that they are incommensurable. This is an undesirable result, since it seems intuitively unproblematic to compare the two functions.
The last problem concerns a serious limitation of Zynda’s framework. He must assume that every credence function is defined on a full Boolean algebra of propositions, i.e. contains every logically distinct proposition that can be formed from some set of atomic propositions and the standard logical operators. However, this is an undesirable idealizing assumption, because real agents most likely have “gaps” in their credence functions, which are propositions they have never entertained, and thus don’t have a credence in. Filling in gaps in one’s credence function is actually one important form of reasoning agents can engage in: an agent may consider some proposition p that she had not previously entertained, and wonder what credence to assign to it based on the credences she already has. Yet, the Boolean algebra requirement prevents us from evaluating whether an agent has successfully executed this kind of augmentative reasoning. Since it involves adding a credence to one’s existing credence function, it follows that either the agent’s initial credences, or her resulting credences, or both, cannot be defined over a Boolean algebra. Of course, the Boolean algebra requirement only presents a problem for Zynda’s measure if it cannot be relaxed. And it unfortunately turns out that Zynda needs this assumption, because otherwise his measure gives clearly incorrect results, as we can see from the following example: Suppose an agent reasons in the following way, augmenting her existing credences by adding a credence in ∼(p ∨ ∼p).

f(p) = 0.5         ⇒    f(p) = 0.5
f(∼p) = 0.5             f(∼p) = 0.5
f(p ∨ ∼p) = 1           f(p ∨ ∼p) = 1
                        f(∼(p ∨ ∼p)) = 0.1
In evaluating this instance of augmentative reasoning, it is immediately obvious that the agent’s new credence in ∼(p ∨ ∼p) makes her credences incoherent, even though her credences were coherent initially. Hence, an adequate measure of incoherence should tell us that the agent increased her degree of incoherence by reasoning in a way that made her incoherent.
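One way to make this before-and-after check concrete is sketched below (my illustration; the coherence conditions are hard-coded for this particular set of sentences rather than derived from a general logic):

```python
def violations(cr, tol=1e-9):
    """List the ways a credence assignment over {p, ~p, p v ~p, ~(p v ~p)}
    fails to be extendable to a coherent (probabilistic) credence function."""
    errs = []
    if "p" in cr and "~p" in cr and abs(cr["p"] + cr["~p"] - 1) > tol:
        errs.append("credences in p and ~p do not sum to 1")
    if "p v ~p" in cr and abs(cr["p v ~p"] - 1) > tol:
        errs.append("credence in the tautology p v ~p is not 1")
    if "~(p v ~p)" in cr and abs(cr["~(p v ~p)"]) > tol:
        errs.append("credence in the contradiction ~(p v ~p) is not 0")
    return errs

before = {"p": 0.5, "~p": 0.5, "p v ~p": 1.0}
after = dict(before, **{"~(p v ~p)": 0.1})   # the augmented credence function

print(violations(before))  # [] -- the initial credences are coherent
print(violations(after))   # the added credence introduces a violation
```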
With Zynda’s measure, however, we cannot evaluate how incoherent the initial credence function f is, because it is not defined over a full Boolean algebra. Its domain only becomes a Boolean algebra when ∼(p ∨ ∼p) is added. If Zynda allowed his measure to apply to f before and after ∼(p ∨ ∼p) is added, both credence functions would turn out to be equally incoherent, because they have the same maximal coherent restriction. However, f is initially coherent, but with f(∼(p ∨ ∼p)) = 0.1 added, it is incoherent, so this cannot be true. It is precisely to avoid this sort of result that Zynda’s measure requires probability functions to be defined over complete Boolean algebras of propositions. In sum, we saw that Zynda’s measure has three major problems. The first problem is that his measure does not take into account numerical differences between incoherent agents’ credences, which leads to counterintuitive ways of ordering various credence functions according to their degree of incoherence. The second problem is that the measure renders incommensurable credence functions that can intuitively be easily compared. The third problem stems from the indispensable requirement that a credence function must be defined over a Boolean algebra of propositions, which makes the measure unsuitable for evaluating very common ways of reasoning. These problems are important to keep in mind, because they give us guidance for finding a better measure of incoherence. In the next sections, I will review how Dutch book arguments work, and I will discuss a class of Dutch book measures of incoherence that can overcome the problems that befall Zynda’s approach.

5 Incoherence and Dutch books

Dutch book arguments are one of the standard ways of arguing that a rational agent’s credences must obey the probability axioms. They show that an agent whose credence function violates the probability axioms is vulnerable to a guaranteed betting loss from a set of bets that are individually sanctioned as fair by her credences. By contrast, a coherent agent faces no such guaranteed loss. But having degrees of belief that sanction as fair each bet in a set of bets that, by the laws of logic, jointly guarantee a monetary loss is rationally defective. Therefore, since only probabilistic credences avoid sanctioning as fair individual bets that lead to a sure loss when combined, only probabilistic credences avoid being rationally defective. As Christensen (2004) has emphasized, the reason why Dutch books indicate rational defectiveness is not that the agent is actually in danger of being impoverished by a clever bookie. Being cheated out of one’s money is a practical problem, not an epistemic one. Vulnerability to Dutch book loss indicates an epistemic defect, because the evaluation of the bets in an unfair betting situation as fair derives its justification directly from the agent’s credences. Each of the bets in the Dutch book is fair in light of the agent’s credences, yet the logically guaranteed outcome of the combination of these bets ensures an unfair advantage for the bookie. Yet, the credences of a rational agent should not justify regarding as fair each bet in a set of bets that logically guarantees an unfair advantage for the bookie. Hence, since having incoherent credences puts the agent in such a situation, incoherent credences are rationally defective.7

7 For a good overview of the ongoing debate about Dutch book arguments, see Hájek (2008).
Dutch book arguments rest on the basic assumption that there is a connection between an agent’s degrees of belief in a proposition and the cost the agent is willing to incur for a bet on that proposition. As Christensen (2004) argues, this connection is normative, in the sense that one’s credence in a proposition justifies, or sanctions as fair, paying a specific cost for a bet on that proposition. If one’s degree of belief in some proposition p is x, then one should consider it fair to pay a cost whose utility is xY in order to get a reward whose utility is Y if p is true, and nothing if p is false. In this scenario, we will say that an agent who takes part in this sort of transaction is buying a bet on p. Likewise, one should consider it fair to be on the other end of a gamble of this kind, so that one receives a payment whose utility is xY, and one must pay out a reward whose utility is Y just in case p is true. In this case, we will say that an agent who takes part in this sort of transaction is selling a bet on p. Hence, the agent’s credence marks a point that determines a fair price for both buying and selling a bet. Of course, an agent should also consider it fair to buy the same bet at a lower price, or sell it for a higher price, but the indifference point marked by the agent’s credence is special, because it is the highest buying price, and the lowest selling price justified by the agent’s credence. And since, in a Dutch book, we are trying to make the agent lose as much as possible from a set of bets, each of which she considers fair, we must take advantage of the highest buying prices and lowest selling prices that are justified by her credences. It is common practice to represent these gambles as actual monetary gambles, even though it is obviously an idealizing assumption that utility can be represented linearly in terms of dollar amounts. For ease and familiarity of exposition, I will from now on frame my arguments in terms of monetary gambles. Since Dutch books establish a direct connection between credences and betting prices, it is a natural idea that we can use them to measure incoherence. An agent who has coherent credences represents the world in a unified way, and her credences can thereby guide her actions in ways that avoid being self-undermining. The more incoherent an agent’s credences are, the more disunified is her picture of the world, and the more of a risk she runs of acting in ways that are self-undermining. And since gambling is just one way of acting, we may conjecture that more incoherence will lead to higher guaranteed Dutch book losses. We can further underpin this idea by looking at a simple example. Consider Penny and Jenny, and their credence in the tautology ∼(p & ∼p). Penny’s credence in it is 0.2, but Jenny’s is 0.99. Neither of them is assigning the correct credence of 1, but intuitively, Jenny’s credence is more coherent than Penny’s. This is reflected in the Dutch book loss to which they are vulnerable. If we make each of them sell us a ticket that says “Pay $1 to the owner of this ticket if ∼(p & ∼p)”, Penny’s credence justifies a transaction that leads to a sure loss of $0.80 for her, whereas Jenny’s credence justifies a transaction that leads to a sure loss of $0.01 for her. Measuring incoherence in terms of the Dutch book loss to which an incoherent agent is vulnerable also seems like a good strategy to avoid the problems with Zynda’s measure. 
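As an illustration of this connection (my own sketch, retaining the utility-as-dollars idealization just mentioned), the sure loss from selling a bet on a tautology at one’s fair price is simply the stakes minus the price received:

```python
# With credence x in a proposition A, the fair price for a bet with stakes S
# on A is x * S -- whether the agent buys the bet (pays x * S, receives S if
# A is true) or sells it (receives x * S, pays out S if A is true).

def sure_loss_from_selling_tautology_bet(credence, stakes=1.0):
    """Guaranteed loss from selling, at one's fair price, a bet on a
    tautology: the agent receives credence * stakes but must always pay out
    the full stakes, since a tautology is true at every world."""
    return stakes - credence * stakes

print(sure_loss_from_selling_tautology_bet(0.2))   # Penny: sure loss of $0.80
print(sure_loss_from_selling_tautology_bet(0.99))  # Jenny: sure loss of $0.01
```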
In setting up Dutch books, numerical differences in agents’ credences lead to different guaranteed losses, as we saw in the examples just discussed. Hence, a Dutch book based measure can avoid Zynda’s measure’s problem of being insensitive to such numerical differences. Moreover, since incoherence would be measured in terms of
guaranteed monetary loss, a Dutch book measure would not face the incommensurability problem that besets Zynda’s measure. Lastly, an agent’s credences don’t have to be defined over a Boolean algebra of propositions in order to make her vulnerable to a Dutch book if her credences are incoherent. Hence, a Dutch book measure of incoherence will avoid the Boolean algebra requirement that created problems for Zynda’s measure. These features seem like promising advantages of a Dutch book measure. Yet, there is an important problem that any such measure must find a solution to, the so-called normalization problem.8 It is easy to see how the problem arises: the standard way in which Dutch book arguments are formulated makes no prescriptions about the sizes of the bets involved in a Dutch book. We are told which buying or selling price would be justified for a given bet by the agent’s credence, but nothing constrains the amount of the payout. For example, if Sally and Polly both have a credence of 0.6 in some tautology T, I could make Sally sell a $1 bet on T for $0.60, thereby making her lose $0.40, and I could make Polly sell a $10 bet on T for $6, thereby making her lose $4. In this scenario, Polly would lose ten times as much as Sally. But of course, this difference in monetary loss does not reflect any difference in incoherence in this case. The only difference between Sally and Polly is that I chose to make them sell bets of different sizes. Similarly, if the number of times agents can be made to bet on the same proposition isn’t restricted, we can achieve differences in guaranteed loss without a difference in incoherence. Hence, without any way of normalizing Dutch book loss, we can’t use Dutch books to measure incoherence, because there simply won’t be a way of determining which Dutch book indicates how incoherent a credence function is. In order to be able to formulate any Dutch book measure of incoherence, we need to build some kind of normalization of the bet sizes into the measure in order to rule out scenarios like the one just described. Otherwise, the same incoherent credences can lead to wildly different Dutch book losses, which makes these losses useless for measuring degrees of incoherence. In effect, any Dutch book measure can just be viewed as a particular way of solving the normalization problem. In what follows, I will discuss several different approaches to solving this problem.

6 Schervish, Seidenfeld, and Kadane’s Dutch book measures of incoherence

6.1 Introducing a class of measures

In a series of articles, Schervish et al. (2000, 2002a, 2002b—henceforth SSK) explore how we can exploit Dutch book losses to measure degrees of incoherence. They discuss a variety of measures that differ in how they solve the normalization problem. While they point out interesting differences in the formal features of the measures, they don’t favor any particular measure over the others, and they don’t discuss whether the measures explicate any particular pretheoretic notion(s) of incoherence. Yet, two of their measures stand out as serious candidates for being measures of overall incoherence.

8 The name for this problem is due to Schervish, Seidenfeld, and Kadane.
I will argue that one of them suffers from the ‘swamping problem’, but the other one is a good candidate for measuring overall incoherence. SSK distinguish two different aspects of solving the normalization problem: measuring the size of a single bet, and measuring the size of a collection of bets. I’ll begin by discussing their proposals for measuring the size of a single bet, and then I’ll explain how these options can be combined with ways of measuring the size of a collection of bets. The agent is the person who has incoherent credences, and the bookie is the person who is setting up the Dutch book.9 They make the following three proposals for measuring the size of a single bet:

(i) The agent’s escrow: the size of the bet is measured in terms of the amount of money that the agent needs in order to cover her part of the bet. For example, if the agent is willing to pay $0.40 for a bet that pays out $1, then the agent’s escrow is $0.40, since this is the most the agent can lose from this bet. In other words, the agent’s escrow is the agent’s highest potential net loss from the bet.
(ii) The bookie’s escrow: the size of the bet is measured in terms of the amount of money that the bookie needs in order to cover her part of the bet. For example, if the agent is willing to pay $0.40 to the bookie for a bet that pays out $1, then the bookie’s escrow is $0.60, since this is the most the bookie can lose from this bet. In other words, the bookie’s escrow is the bookie’s highest potential net loss from the bet.
(iii) The neutral normalization: The neutral normalization is the sum of the bookie’s and the agent’s escrow, which is also sometimes called the stakes of the bet. In the betting scenario mentioned in (i) and (ii), the neutral normalization would be $1.

Each of these single-bet-size measures can then be used to normalize a collection of multiple bets. SSK focus specifically on two ways of normalizing a collection of bets, which they call the ‘sum’ and the ‘max’ normalization. To employ the sum-normalization for a collection of bets, the guaranteed loss from this collection of bets is divided by the sum of the single-bet-normalizations. For example, if we use the bookie’s escrow as our single-bet normalization, we have to divide the guaranteed loss from a collection of bets by the sum of the bookie’s escrows for each of the bets in the collection. If instead the ‘max’ normalization is used in combination with the bookie’s escrow for a collection of bets, the guaranteed loss from the collection is divided by the largest bookie’s escrow from any of the bets involved in the collection.10 Hence, we end up with six ways of measuring incoherence, by combining each single-bet normalization with each of the two normalizations for collections of bets. The idea behind all of the proposals is similar: while we don’t initially prescribe how large the bets are supposed to be that are used to measure incoherence, we normalize the resulting guaranteed loss by measuring the sizes of the individual bets involved, and dividing the overall guaranteed loss from a set of bets by a quantity that is determined

9 SSK use the opposite terminology, i.e. the bookie is the incoherent person and the agent sets up the Dutch book. However, I’ve found it to be more common that the person who is setting up the Dutch book is called the bookie, so I am diverging from SSK’s terminology here.
10 As SSK point out, there can also be intermediate normalizations that lie in between the ‘sum’ and the ‘max’ proposals.
I won’t discuss them separately here, since the two extreme proposals are most interesting for our purposes.
by the sizes of the individual bets involved in the gamble. In order to measure how incoherent an agent’s (finite) credence function is, we choose one of the six combinations of normalizations, and, using each of the agent’s credences for at most one bet, we choose a combination of bets based on her credences that maximizes her degree of incoherence according to our chosen normalization.11 I will first explain why the neutral normalization is better than the other two single-bet normalizations, and then go on to discuss the relative merits of the neutral/sum and neutral/max incoherence measures. The problem with both the agent’s escrow and the bookie’s escrow is that they don’t work well when we consider bets on tautologies and contradictions. Suppose Sally has a single incoherent credence in a tautology, Cr(T) = 0.5, and suppose we choose either the bookie’s escrow/max, or the bookie’s escrow/sum normalization (if there’s only one bet, they are the same). To determine Sally’s degree of incoherence, we’d have to maximize the following quantity: Sally’s guaranteed loss from selling a bet on T, divided by the bookie’s escrow for the bet on T. But the problem is that the bookie can’t lose this bet, hence her escrow is 0, and we can’t get a well-defined value for Sally’s degree of incoherence. Alternatively, suppose we instead choose the agent’s escrow/max, or the agent’s escrow/sum normalization. Then, to determine Sally’s degree of incoherence, we’d have to maximize the following quantity: Sally’s guaranteed loss from selling a bet on T, divided by Sally’s escrow for the bet on T. But this quantity is obviously going to be 1, no matter what incoherent credence Sally has in T. Hence, we can’t track differences in Sally’s degree of incoherence with respect to her credence in T in this way. Fortunately, the neutral normalization for individual bets is not vulnerable to either one of these problems. SSK recognize this, and as a result, they express some preference for measures based on the neutral normalization, even though they keep working with the other measures as well (SSK 2002b).

6.2 The neutral/sum measure

I will thus focus on the remaining two measures, neutral/sum and neutral/max, and investigate whether either of them is suitable to determine the overall incoherence of an agent’s credence function. The neutral/sum measure can be described informally in the following way: the degree of incoherence of a credence function is determined by looking at the worst Dutch book that can be made against someone with that credence function (for a more formal version of the arguments in this section, please see the appendix). The worst Dutch book can be found by determining which set of bets makes the agent lose the most money relative to the sum of the stakes of all the bets involved in this Dutch book. In order to find this set of bets, the bookie may include bets on or against as many propositions in the agent’s credence function as necessary, as long as each proposition is used for no more than one bet. In this sort of arrangement, bets of any size are permissible, but since the guaranteed loss from these bets has to be

11 I choose an informal presentation of SSK’s measures here to make my discussion more reader-friendly. Readers who are interested in the formal details of SSK’s proposals are encouraged to consult the appendices, as well as their presentations of the material.
SSK’s proposals can be applied to probability intervals as well; I am focusing on precise probabilities to simplify the discussion.
divided by the sum of the total stakes of the bets involved in order to determine the degree of incoherence, the resulting degree of incoherence is normalized. We can illustrate this with a simple example. Suppose, for example, that I have the following credence function:

f(p) = 0.6
f(∼p) = 0.5
f(T) = 0.9

We’ll assume for simplicity that all bets we can use for Dutch books have $1 stakes. This is not required by SSK’s measure, but in this case, we can safely make this simplifying assumption, because it does not change the result. Now, if these are my credences, then my degree of incoherence is determined by the Dutch book that creates the highest guaranteed loss relative to the sum of the stakes of the involved bets. There are three prima facie plausible candidates for being the worst Dutch book:

(I) Buy one $1 bet on each of p, ∼p, costing $0.60 and $0.50, respectively. Result: guaranteed loss of $0.10, hence the loss ratio is 0.1/2 = 1/20.
(II) Sell one $1 bet on T for $0.90. Result: guaranteed loss of $0.10, hence the loss ratio is 0.1/1 = 1/10.
(III) Make all of the bets in (I) and (II) at the same time. Result: guaranteed loss of $0.20, hence the loss ratio is 0.2/3 = 1/15.

As we can easily see, the Dutch book that results in the highest loss ratio is (II), and so, according to the neutral/sum measure, this is the Dutch book that actually determines f’s degree of incoherence, which is 1/10. The interesting result to notice is that in order to find the worst Dutch book that can be made against an agent, it is not necessarily optimal to include bets on or against all of the propositions in which the agent has incoherent credences. The Dutch book that does this, which is (III), in fact leads to a lower loss ratio than Dutch book (II), which includes only one credence. And this is exactly the aspect of SSK’s measure that makes it problematic, because it makes the measure unsuitable to determine the overall incoherence of a credence function. We can see in this example that the neutral/sum measure suffers from what I call a swamping problem. Since only the worst incoherent credences in the agent’s credence function, which lead to the highest loss ratio, determine the agent’s degree of incoherence, these credences wind up swamping any other incoherent credences the agent might have, which then never get reflected in the degree of incoherence. As I will show now, the swamping problem is the reason why this measure gives us unintuitive results if we try to use it to compare credence functions with respect to their total incoherence. First, suppose there is an agent whose credences are defined over the propositions in the following set: {p, ∼p, q, ∼q}. The agent can adopt one of two credence functions f and g:
f(p) = 0.6      g(p) = 0.6
f(∼p) = 0.6     g(∼p) = 0.6
f(q) = 0.5      g(q) = 0.6
f(∼q) = 0.5     g(∼q) = 0.6
Intuitively, f is less incoherent than g, because f contains incoherent credences only for the partition {p, ∼p}, whereas g additionally contains incoherent credences for the partition {q, ∼q}. However, since the neutral/sum measure only looks at the Dutch book that delivers the worst loss ratio, it doesn’t deliver this intuitive result. Instead, it judges f and g to be equally incoherent, because in both cases, the worst Dutch book that can be made leads to a loss ratio of 0.1. This is because both credence functions allow us to generate a guaranteed loss of $0.20 from two $1 bets, and there is no betting arrangement that leads to a worse loss ratio. This result is clearly an instance of the swamping problem, because the credences that don’t contribute to the worst Dutch book get swamped in the neutral/sum measure, even though, intuitively, they should be taken into account. The fact that f is coherent on the partition {q, ∼q}, but g isn’t, makes no difference to the measure, whereas it clearly makes a difference to our intuitive judgments about the total incoherence of the two credence functions. Hence, if two credence functions have the same loss ratio from their worst Dutch book, then the remaining credences cannot influence the degree of incoherence. And, as the example shows very clearly, the measure therefore does not agree with intuitive judgments about the overall incoherence of different credence functions. In fact, because of the swamping problem, SSK’s measure can even give us the opposite of what is intuitively the correct ranking of credence functions in order of their total incoherence. Assume again that there is an agent whose credences are defined over the propositions in the following set: {p, ∼p, q, ∼q}. The agent can adopt one of two credence functions g and h:

g(p) = 0.6      h(p) = 0.600001
g(∼p) = 0.6     h(∼p) = 0.6
g(q) = 0.6      h(q) = 0.5
g(∼q) = 0.6     h(∼q) = 0.5
Intuitively, g is more incoherent than h, because g is much more incoherent than h with respect to the partition {q, ∼q}, and g is only minutely less incoherent than h with respect to the partition {p, ∼p}. However, because the worst Dutch book that can be made against h leads to a loss ratio of 0.200001/2, which is very slightly higher than the loss ratio of 0.2/2 from the worst Dutch book that can be made against g, the measure predicts that h is more incoherent than g. Again, the credences that don’t contribute to the worst Dutch book get swamped, which means that the differences in coherence on the partition {q, ∼q} have no effect on the measure. And, as a result, the measure’s verdicts are directly opposed to those of intuition. The fact that the neutral/sum measure faces a swamping problem is important, because it makes it unsuitable for one of the main applications of a measure of incoherence, the evaluation of reasoning processes. Recall that one way in which we might
evaluate the goodness of a reasoning process is by checking how it affects the degree to which the agent’s credences are incoherent. The swamping problem has the effect that this measure is not sensitive to relevant differences in an agent’s credences, as long as the reasoning process does not change the worst Dutch book that can be made against the agent. Hence, the neutral/sum measure is not well suited for evaluating the reasoning processes of incoherent agents, and it fails as a measure of overall incoherence.

6.3 The neutral/max measure

The neutral/sum measure turns out to be a bad measure of the total incoherence of a credence function for the same reason that the average income of the poorest 1 % is a bad measure of the wealth of a country: in both cases, we try to extrapolate from the worst case, but the worst case is not guaranteed to be representative of the overall wealth of a country, or the overall incoherence of a credence function. An incoherence measure that avoids the swamping problem must avoid using a non-representative sample of the agent’s credences to determine her overall degree of incoherence. It turns out that the remaining one of SSK’s measures—the neutral/max measure—does far better in this respect than the neutral/sum measure. The neutral/max measure requires that we measure the degree of incoherence of an agent’s credence function by maximizing the following quantity: the guaranteed loss that can be generated from a collection of bets that involve some or all of the agent’s credences, divided by the largest stakes of any bet included in the collection. So, for example, if, given that we’re maximizing this quantity, the bet with the largest stakes in the collection is a $2 bet, and all the other bets have stakes that are $2 or smaller, then the denominator of the equation would be set to 2, and the degree of incoherence of the agent is the loss generated by the bets included in the Dutch book, divided by 2. In fact, we can simply require that the stakes of the bets involved must be in the interval [−1, 1], which means that we can measure a credence function’s degree of incoherence by determining the largest guaranteed loss we can achieve by using each of the agent’s credences for at most one bet, where each bet’s stakes are in the interval [−1, 1]. The neutral/max measure is superior to the neutral/sum measure as a measure of overall incoherence, because it doesn’t single out the worst Dutch book that can be made against an agent in order to determine how incoherent her credences are. Rather, it seeks the optimal way of exploiting as many of the agent’s credences as possible to create Dutch book losses, and sums the losses from all of them to determine a credence function’s degree of incoherence. In what follows, I will again simplify the calculations to make the arguments reader-friendly; please consult the appendix and SSK’s work for the full formalism. Let’s revisit the examples from the previous section. The first example exhibited the swamping problem, because the neutral/sum measure was only sensitive to the incoherence in the credence in T.

f(p) = 0.6
f(∼p) = 0.5
f(T) = 0.9
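Before turning to how the neutral/max measure handles this credence function, here is a rough sketch of the two normalizations applied to it (my own illustration, not SSK’s formalism, with every bet fixed at $1 stakes as in the text):

```python
# Recompute the example with $1 stakes throughout: f(p) = 0.6, f(~p) = 0.5,
# f(T) = 0.9, where T is a tautology. All figures are guaranteed losses.

loss_partition = (0.6 + 0.5) - 1.0   # buy $1 bets on p and ~p: pay $1.10, collect $1
loss_tautology = 1.0 - 0.9           # sell a $1 bet on T for $0.90, pay out $1

# Neutral/sum: guaranteed loss divided by the summed stakes of the bets used;
# the measure is fixed by whichever combination maximizes this ratio.
candidate_dutch_books = {
    "bets on p and ~p": loss_partition / 2,                    # 1/20
    "bet on T alone": loss_tautology / 1,                      # 1/10
    "all three bets": (loss_partition + loss_tautology) / 3,   # 1/15
}
neutral_sum = max(candidate_dutch_books.values())   # 0.1 -- only the T bet counts

# Neutral/max: total guaranteed loss from using each credence for at most one
# bet, divided by the largest stakes involved ($1).
neutral_max = (loss_partition + loss_tautology) / 1.0          # 0.2

print(round(neutral_sum, 3), round(neutral_max, 3))   # 0.1 0.2
```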
The neutral/max measure, by contrast, gets the correct result. Recall that we must normalize the Dutch book loss by the stakes of the largest bet. Suppose that the stakes of the bet on T are $1, so the agent will sell it for $0.90, hence losing $0.10. We can easily confirm that the agent’s degree of incoherence is maximized when the stakes of the bets on p and ∼p are also $1. We make the agent buy the bets on p and ∼p for a total of $1.10, guaranteeing a loss of $0.10. Hence, the agent’s degree of incoherence is $0.10 + $0.10, divided by the stakes of the largest bet (=$1), which is 0.2. The neutral/max measure thus registers all the ways in which this credence function is incoherent. The second example involved a comparison between two credence functions f and g, which the neutral/sum measure incorrectly judged to be equally incoherent:

f(p) = 0.6      g(p) = 0.6
f(∼p) = 0.6     g(∼p) = 0.6
f(q) = 0.5      g(q) = 0.6
f(∼q) = 0.5     g(∼q) = 0.6
The neutral/max measure, by contrast, correctly judges f to be less incoherent than g. As in the previous example, the degree of incoherence is maximized if all the bets are assigned the same stakes, for example $1. If the agent’s credence function is f, she must buy the bets on p and ∼p, leading to a guaranteed loss of $0.20 (divided by 1), which is her degree of incoherence according to the neutral/max measure. If the agent’s credence function is g, she must buy a bet on each proposition in her credence function, leading to a total guaranteed loss of $0.40, which is her degree of incoherence according to the neutral/max measure. Hence, the measure ranks the two functions correctly. The reader is invited to verify that the neutral/max measure also gives the correct result for the last example in the discussion of the neutral/sum measure. It is easy to see that the neutral/max measure is a better measure of overall incoherence than all of the other measures we’ve considered, especially the neutral/sum measure. Recall that the problem with the neutral/sum measure was that it determined the degree of incoherence of a credence function solely based on the subset of the agent’s credences that is most incoherent, which is similar to determining the wealth of a country based on the wealth of the poorest 1 % of its population. By contrast, the neutral/max measure sums together the losses from as many bets as possible, while trying to maximize the agent’s guaranteed loss (and normalizing the result by dividing by the stakes of the largest bet). The neutral/max measure obviously avoids the problems with Zynda’s measure as well. It tracks numerical differences between credence functions, it can be applied to credence functions that aren’t defined over complete Boolean algebras, and it doesn’t render many credence functions incommensurable. In the next section, I will examine one more class of measures of incoherence, which rely on determining distance from coherence.

7 Distance measures

In a psychology article called “Coherent Probability From Incoherent Judgment”, Osherson, Lane, Hartley and Batsell (henceforth OLHB) discuss a study in which
they elicit judgments about the chances of meteorological events (Osherson et al. 2001). They show that while none of the subjects give completely coherent chance estimates, their estimates closely approximate coherent responses. In outlining their methodology, OLHB explain that they measure how closely a subject’s probability estimate approaches a coherent estimate by measuring the absolute distance between the incoherent estimate and the closest coherent estimate. In their words:12

We measure the distance between two assessments of chance by their absolute difference, because this is the simplest and most interpretable measure. Other potential measures include the squared difference, or some version of relative entropy (which is not, however, a true distance; Cover & Thomas, 1991). None of the results reported below are substantially affected by the use of these alternative measures. We thus have the following optimization problem […]: Let P map formulas ϕ_1 . . . ϕ_k, and pairs of formulas (χ_1, ψ_1) . . . (χ_j, ψ_j), into numbers. Find a map P* with the same domain as P, such that P* is coherent, and

Σ_{i≤k} |P(ϕ_i) − P*(ϕ_i)| + Σ_{i≤j} |P(χ_i|ψ_i) − P*(χ_i|ψ_i)|
is minimized. (Osherson et al. 2001, p. 5) As presented, the measure applies to chance estimates, and it measures incoherence of both unconditional and conditional probabilities. We can easily interpret P as a credence function, rather than a chance estimate, to compare it with the measures previously discussed. It is a desirable feature of the measure that it applies to conditional credences as well as unconditional credences, which it shares with SSK’s measures. The same measure is also used in two later articles on how to aggregate incoherent probabilistic forecasts, both of which have Osherson as a co-author (Osherson and Vardi 2006; Wang et al. 2011).13 The article by Wang et al. is particularly interesting, because it uses this distance measure of incoherence to demonstrate that in their study of election forecasts, the subjects whose forecasts are the least incoherent also turn out to be the most accurate predictors of the actual election results, where accuracy is measured with the Brier score. Also, in both the weather forecasting and the election studies, the aggregated forecasts that are adjusted to be coherent are more accurate than the original incoherent forecasts. I already mentioned this result in Sect. 2 as an interesting application of an incoherence measure in the literature. OLHB single out the absolute distance as being simplest, but there are of course many different distance measures that can be combined with their general proposal. Hence, there is a whole class of incoherence measures that can be generated in this way. However, the absolute distance measure is particularly interesting, because, as long as we are only considering unconditional probabilities, the measure is equivalent to SSK’s 12 The notation has been slightly altered from the original, which in no way affects the meaning. 13 The paper by Osherson and Vardi (2006) contains a more sophisticated, theoretical discussion of using
distance measures to measure incoherence. I’d like to thank an anonymous referee for making me aware of it.
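To make the optimization problem concrete for the unconditional case, here is a minimal sketch, written in Python with numpy and scipy (my own illustration, not OLHB's code), of how the coherent credence function closest in absolute distance can be found by linear programming. The four-world space, the proposition names, and the example credences (the incoherent function g from Sect. 6.2, with credence 0.6 in each of p, ∼p, q, and ∼q) are assumptions made for the illustration.

import numpy as np
from scipy.optimize import linprog

# Two binary issues p and q generate four possible worlds.
worlds = [(tp, tq) for tp in (True, False) for tq in (True, False)]

# Propositions as truth functions on worlds; the credences are the incoherent
# assignment g (0.6 in each of p, ~p, q, ~q). Both are illustrative choices.
props = {
    "p":  lambda w: w[0],
    "~p": lambda w: not w[0],
    "q":  lambda w: w[1],
    "~q": lambda w: not w[1],
}
cred = {"p": 0.6, "~p": 0.6, "q": 0.6, "~q": 0.6}

m, k = len(worlds), len(props)
A = np.array([[1.0 if f(w) else 0.0 for w in worlds] for f in props.values()])
c_vec = np.array([cred[name] for name in props])

# Minimize sum(t_i) subject to |c_i - A_i q| <= t_i, q >= 0, sum(q) = 1,
# where q is a probability distribution over worlds and the t_i are slacks.
obj = np.concatenate([np.zeros(m), np.ones(k)])
A_ub = np.block([[A, -np.eye(k)], [-A, -np.eye(k)]])
b_ub = np.concatenate([c_vec, -c_vec])
A_eq = np.concatenate([np.ones(m), np.zeros(k)]).reshape(1, -1)
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (m + k))

closest = A @ res.x[:m]   # the closest coherent credences on the four propositions
print("absolute distance to coherence:", res.fun)   # 0.4 (up to numerical tolerance)
print({name: round(float(v), 3) for name, v in zip(props, closest)})

On this toy input the minimized distance comes out as 0.4, which matches the neutral/max verdict for g reported earlier, as the equivalence result discussed below would lead us to expect.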
As De Bona and Finger (2014) prove, the following two ways of measuring the incoherence of someone's unconditional credences are equivalent: (i) determining the absolute distance to the closest coherent credence function, and (ii) determining the greatest guaranteed Dutch book loss that can be achieved by using each of the agent's credences for at most one bet, where the stakes of each bet are within the interval [−1, 1]. De Bona and Finger also discuss the relation between Dutch book measures and distance measures when conditional credences are taken into account, and they consider some of SSK's alternative Dutch book measures discussed earlier. Both distance and Dutch book measures can also be generalized to measure the incoherence of probability inequalities and intervals, as both SSK and De Bona and Finger point out. I invite the reader to consult their papers for more details.14

Given the equivalence just mentioned, the absolute distance measure obviously gives the same results as the neutral/max measure in the examples discussed in the previous section; hence, we can be sure that the absolute distance measure avoids the swamping problem. But it is also easy to see independently of these examples that swamping can't be a problem for incoherence measures that measure the distance to the closest coherent credence function: there is no subset of credences that plays a privileged role in determining which coherent credence function is the closest; rather, each of the agent's credences makes the same contribution to the measure. Hence, we need not worry about the problem that a non-representative sample of the agent's credences determines her degree of incoherence.

While the absolute distance measure is attractive because it is simple, and because it is independently motivated by its relationship to the neutral/max measure, its anti-swamping features are shared by other distance measures as well, since they retain the same general setup.15 I don't want to take a stand here on which distance measure is best, but I will briefly consider the question of whether we should give special attention to distance measures that correspond to strictly proper scoring rules.16

A bit of background information is needed to clarify why this might seem like an attractive proposal. Strictly proper scoring rules are a central element in epistemic utility theory, which is currently widely discussed. Epistemic utility theory seeks to establish epistemic norms by combining norms of decision theory with an account of the epistemic utility of agents' attitudes. The epistemic utility of an attitude towards a given proposition in a world depends on how closely it approximates the truth-value of the proposition at that world. Scoring rules are used to measure epistemic utility. For example, if p is true at w, having a credence of 1 in p at w has the most epistemic utility, whereas having a credence of 0 has the least. An example of an interesting result that has been established in this framework is the following: suppose you have incoherent credences. Then, assuming your epistemic utility function can be captured by a continuous strictly proper scoring rule, there is a coherent credence function that has better epistemic utility than your current credence function no matter which world turns out to be actual. By contrast, if your credences are coherent, they are not dominated in this way.17

14 To see how SSK's measures relate to metrics, the reader is also invited to consult Sect. 6 in Schervish et al. (2002a).
15 An interesting overview of the properties of different distance measures can be found in Cha (2007).
16 Thanks to Branden Fitelson and Richard Pettigrew for raising this question.
Scoring rules can also be used for very practical purposes, such as measuring the accuracy of weather or election forecasts, as in the studies cited earlier. What distinguishes strictly proper scoring rules from improper scoring rules, and makes them special? A scoring rule is strictly proper if, according to it, any probabilistic credence function expects itself to have a higher epistemic utility than any alternative credence function an agent might adopt. One example of such a scoring rule that is popular in the literature is the Brier score; the absolute distance measure, by contrast, is not a proper scoring rule. This means that an agent who has coherent credences judges her own credences to be best according to a strictly proper scoring rule, and is not motivated to switch to a different credence function in order to improve her epistemic utility. Why this is a desirable feature is especially obvious when we consider practical applications of scoring rules, such as measuring the accuracy of weather forecasters. If a forecaster's weather predictions are scored by a strictly proper scoring rule, the forecaster has no incentive to lie about her predictions—she expects her sincere forecast to get her the best score. By contrast, if an improper scoring rule is used—and the absolute distance measure is such an improper scoring rule—then it can happen that the forecaster expects that a prediction that doesn't reflect her sincere judgment will earn the best possible score. In other words, scoring weather forecasters with improper scoring rules does not encourage them to make honest predictions. Hence, we have very compelling reasons to only use strictly proper scoring rules for measuring accuracy, or distance from truth.

Do these reasons carry over to the case we're interested in, namely measuring the distance to the closest coherent credence function? I submit that they don't. First, it is worth pointing out that the motivation for using proper scoring rules—that they encourage honest reporting, because honest reports minimize expected inaccuracy—is on shaky grounds when incoherent agents are concerned. This is because for every incoherent agent, there is some coherent credence function that is more accurate in every possible world; hence the agent would benefit from lying, and reporting coherent credences. But, moreover, measuring distance to coherence is just a different matter than measuring accuracy, and as a result, we don't gain any special advantage from using proper scoring rules.18 To see why, consider an agent whose credences are coherent. The distance between her credence function and the closest coherent credence function is zero, regardless of whether we use a distance measure that is a proper scoring rule or not. Hence, this case provides no motivation to prefer proper scoring rules. Next, consider an agent who has incoherent credences. The distance between her credence function and the closest coherent credence function is obviously greater than zero, regardless of whether we use a distance measure that is a proper scoring rule or not. If we tried to elicit the agent's credences somehow, and she both knew what we were up to and was aware of her own incoherence, she'd be motivated to lie and report coherent credences.

17 For the Brier score, this was shown by de Finetti (1974). For strictly proper scoring rules, the result is sketched by Savage (1971). More detailed discussions can be found in Joyce (1998, 2009) and Predd et al. (2009). See Schervish et al. (2009) for a demonstration that this result does not hold for proper scoring rules that are not continuous.
18 Thanks to an anonymous reviewer for helping me clarify this section.
But of course this will be a feature of any measure that (i) measures deviation from an ideal, and (ii) gives the agent some way of manipulating the things to be measured. Hence, this case provides no motivation to prefer proper scoring rules as distance measures for measuring incoherence. I conclude that, while there are good reasons to privilege proper scoring rules in other contexts, such as epistemic utility theory and measuring accuracy, these reasons don't carry over to our endeavor of measuring the overall incoherence of a credence function. Hence, the absolute distance measure is adequate for measuring degrees of incoherence, despite not being a proper scoring rule.

8 Conclusion

I argued that there are multiple aims to be served by a measure of incoherence: capturing intuitive differences and similarities between incoherent credence functions, explaining how the ideal Bayesian rules can serve as norms for non-ideal agents, evaluating reasoning processes of incoherent agents, and helping to single out accurate forecasters. I argued that a measure that serves these purposes well should be a measure of the overall incoherence of a credence function, and I explained how to understand this pre-theoretical notion. I then proceeded to examine different measures of incoherence that have been proposed in the literature, using concrete examples as a guide to examine whether they are good candidates for capturing our notion of overall incoherence. I considered a qualitative measure proposed by Zynda, a class of Dutch book measures proposed by SSK, and a class of distance measures proposed by OLHB. I argued that the measures that best meet our criteria are SSK's neutral/max measure and, among the distance measures, the absolute distance measure, which agrees with neutral/max in the unconditional case. I also argued that there is no good reason to prefer distance measures that employ proper scoring rules.

Acknowledgments I would like to thank Horacio Arló-Costa, Brad Armendt, Glauber De Bona, Kenny Easwaran, Branden Fitelson, Alan Hájek, Adam Joel Keeney, Hanti Lin, Anya Plutynski, Jacob Ross, Mark Schroeder, Teddy Seidenfeld, Brian Talbot, Lyle Zynda, and two anonymous referees for helpful feedback and discussion. Part of this paper was written while I was a postdoctoral fellow at the Australian National University, thanks to the Australian Research Council Grant for the Discovery Project 'The Objects of Probabilities', DP 1097075.
Appendix: SSK's neutral/sum and neutral/max measures

In a series of papers, SSK have developed a class of measures of degrees of incoherence based on Dutch books (2000, 2002a, 2002b). In this appendix, I focus on the neutral/sum measure, to give a more detailed formal exposition of my arguments in Sect. 6.2. To see how this Dutch book measure works, suppose there is an agent who has a credence function P that assigns credences to a set of propositions {A1, . . . , An}.19

19 Their incoherence measure is defined in terms of upper and lower previsions and it uses random variables instead of propositions. The version I present here is somewhat simplified, because I use indicator functions instead of random variables, and I take credences to determine both the buying and selling price of a bet. That means that the measure I discuss is strictly speaking a special case of their more general measure. My criticisms of the measure are independent of these simplifying assumptions.
We can represent a bet on or against one of these propositions according to the agent's credences in the following way:

Bet: \alpha \big( I(A_i) - P(A_i) \big)

In this case, I(Ai) is the indicator function of Ai, which assigns a value of 1 if Ai is true and a value of 0 if Ai is false. P(Ai) is the credence the agent assigns to the proposition Ai. The coefficient α determines both the size of the bet, as well as whether the agent is betting on or against Ai. If α > 0, then the agent bets on the truth of Ai, whereas if α < 0, the agent bets against the truth of Ai. In the following, it will be assumed that an agent who assigns a precise credence to a proposition thereby evaluates as fair the bet on and the bet against that proposition at the price that is fixed by her credence. An agent is incoherent if there is a collection of gambles she evaluates as fair that together guarantee a loss.

Formally, we can represent this as follows: Let A1, . . . , An be the propositions that some agent assigns credences to, let Cr be the agent's credence function, which may or may not be probabilistically coherent, and let S be the set of possible world states. If there is some choice of coefficients α1, . . . , αn, such that the sum of the payoffs of the bets on or against A1, . . . , An is negative for every world state s ∈ S, then the agent is vulnerable to a Dutch book. Thus, there is a Dutch book iff20

\sup_{s \in S} \sum_{i=1}^{n} \alpha_i \big( I_{A_i}(s) - Cr(A_i) \big) < 0

This formula tells us how to determine whether a Dutch book can be made against an agent who has a given credence function in a given set of propositions. We can capture the guaranteed loss an agent faces from a collection of gambles of the form Y_i = \alpha_i \big( I(A_i) - Cr(A_i) \big) as follows:21

G(Y) = -\min \Big\{ 0, \ \sup_{s \in S} \sum_{i=1}^{n} \alpha_i \big( I_{A_i}(s) - Cr(A_i) \big) \Big\}
In order to normalize the guaranteed loss to be able to measure an agent's degree of incoherence, we can divide the guaranteed loss by the sum of the coefficients of the individual bets.

20 The function "sup" picks out the least upper bound of a set. In this context, it selects the highest value from the combined payoffs in all worlds in S. Thus, if the highest possible payoff is still negative, the agent can be Dutch-booked.
21 The "min" function is used here to select the smallest number of the numbers in a set. It ensures that if no Dutch book can be made against an agent, the guaranteed loss she faces is 0. If a Dutch book can be made, the "min" function selects it, and the negative sign in front guarantees that we end up with a positive number that indicates the agent's guaranteed loss.
This normalization is called the "neutral/sum" normalization by SSK. We can thus compute the rate of loss H(Y):

H(Y) = \frac{G(Y)}{\sum_{i=1}^{n} |\alpha_i|}
The degree of incoherence can be determined for a set of propositions and a credence function over these propositions by choosing the coefficients α1, . . . , αn in such a way that H(Y) is maximized. To maximize H(Y), it may be necessary not to include certain propositions in the Dutch book, which can be achieved by setting the relevant coefficients αi to 0.

We can illustrate how the measure works with an example. Suppose an agent has credences in two propositions, q and ∼q. Her credence assignment f is incoherent, since she assigns f(q) = 0.5 and f(∼q) = 0.6. In order to measure her rate of incoherence, we will first set up two bets with her, one for each proposition, and sum them in order to determine their combined payoff:

Y = \alpha_1 \big( I_q(s) - 0.5 \big) + \alpha_2 \big( I_{\sim q}(s) - 0.6 \big)

Since we can either be in a world where q is true or in a world where q is false, we can get two values for Y:

If q, then Y = 0.5\alpha_1 - 0.6\alpha_2
If ∼q, then Y = 0.4\alpha_2 - 0.5\alpha_1

Thus, we can calculate G(Y) as follows, where α1, α2 > 0:22 if α2 ≥ 1.25α1 or α1 ≥ 1.2α2, then the second term in the braces in the G(Y) equation23 is non-negative, which means that G(Y) = 0; otherwise,

G(Y) = -\sup_{s \in S} \big\{ \alpha_1 \big( I_q(s) - 0.5 \big) + \alpha_2 \big( I_{\sim q}(s) - 0.6 \big) \big\}

Thus, when a Dutch book can be made (i.e. when G(Y) > 0), we can measure the rate of incoherence by choosing the coefficients in such a way that H(Y) is maximized:

H(Y) = \frac{ -\sup_{s \in S} \big\{ \alpha_1 \big( I_q(s) - 0.5 \big) + \alpha_2 \big( I_{\sim q}(s) - 0.6 \big) \big\} }{ |\alpha_1| + |\alpha_2| }
The rate of incoherence is maximized in this example if we choose α1 = α2, which results in a rate of incoherence of 0.05.

22 If we set α1, α2 < 0, the combined payoff would be guaranteed to be positive.
23 Recall that this is the relevant equation:

G(Y) = -\min \Big\{ 0, \ \sup_{s \in S} \sum_{i=1}^{n} \alpha_i \big( I_{A_i}(s) - P(A_i) \big) \Big\}
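As a quick numerical check of this worked example, the following small sketch (my own illustration in Python, not SSK's code) grid-searches the coefficients and confirms that the neutral/sum rate of loss for f(q) = 0.5, f(∼q) = 0.6 peaks at 0.05; the function and variable names are mine.

import itertools

def rate_of_loss(a1, a2, cq=0.5, cnq=0.6):
    # Combined payoff of the two bets in the q-world and in the ~q-world.
    payoff_q = a1 * (1 - cq) + a2 * (0 - cnq)
    payoff_nq = a1 * (0 - cq) + a2 * (1 - cnq)
    guaranteed_loss = -min(0.0, max(payoff_q, payoff_nq))   # G(Y)
    stakes = abs(a1) + abs(a2)
    return guaranteed_loss / stakes if stakes > 0 else 0.0  # H(Y)

grid = [i / 100 for i in range(-100, 101)]   # coefficients from -1 to 1
best = max((rate_of_loss(a1, a2), a1, a2)
           for a1, a2 in itertools.product(grid, grid))
print(best)   # roughly (0.05, a1, a2) with a1 = a2 > 0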
I will now move on to the problems with the measure discussed in Sect. 6.2. The first example involved a comparison of the following two credence functions:

f(p) = 0.6        g(p) = 0.6
f(∼p) = 0.6       g(∼p) = 0.6
f(q) = 0.5        g(q) = 0.6
f(∼q) = 0.5       g(∼q) = 0.6
We noted that intuitively, g is overall more incoherent than f. However, this is not the result we get from the neutral/sum measure. According to this measure, the agent would be equally incoherent in both cases. Here's how that result comes about. First, consider the case in which the agent adopts f. The formula to calculate the degree of incoherence is the following (as the reader can easily verify, including bets on q and ∼q couldn't possibly lead to a higher guaranteed loss, so they can and should be left out):

H(Y) = \frac{ -\sup_{s \in S} \big\{ \alpha_1 \big( I_p(s) - 0.6 \big) + \alpha_2 \big( I_{\sim p}(s) - 0.6 \big) \big\} }{ |\alpha_1| + |\alpha_2| }
As before, the agent's rate of loss is maximized when we set α1 = α2 > 0. We can thus simplify the calculation as follows:

H(Y) = \frac{0.2\alpha_1}{2\alpha_1} = 0.1
Thus, if the agent adopts f as her credence function, her rate of loss is 0.1. Let us now compare this to what happens if the agent adopts g. If her credence function is g, we can calculate the rate of loss as follows:

H(Y) = \frac{ -\sup_{s \in S} \big\{ \alpha_1 \big( I_p(s) - 0.6 \big) + \alpha_2 \big( I_{\sim p}(s) - 0.6 \big) + \alpha_3 \big( I_q(s) - 0.6 \big) + \alpha_4 \big( I_{\sim q}(s) - 0.6 \big) \big\} }{ |\alpha_1| + |\alpha_2| + |\alpha_3| + |\alpha_4| }

If we try to find values for α1, . . . , α4 that maximize H(Y), we get an interesting result. There is no way of picking values for α1, . . . , α4 that leads to a higher rate of incoherence than 0.1. Rather, we get exactly the same rate of incoherence we had before as long as we choose α1 = α2 and α3 = α4, and we choose α1 > 0 and/or α3 > 0. However, this result is in tension with the intuition that an agent who adopts g is more incoherent than an agent who adopts f.

We can see more easily how this result arises if we strip our example down to the essential parts. In the calculation of the rate of loss of g, we are essentially combining two normalized Dutch books of the same kind into one, by adding the numerators and adding the denominators of each one. The two normalized Dutch books are the same as the one Dutch book we made against the agent who adopts f.
Thus, to make it simple, the calculation for the rate of loss of the agent who adopts g goes as follows:

H(Y) = \frac{0.2\alpha_1 + 0.2\alpha_3}{2\alpha_1 + 2\alpha_3}
If two fractions are combined in such a way that the numerators and the denominators are added, the value of the resulting fraction is always in between or equal to the two original fractions. Thus, if we combine two normalized Dutch books with the same rate of loss in the neutral/sum measure, the resulting rate of loss is the same as before.24

Moreover, it can even be beneficial when employing the neutral/sum measure not to make certain Dutch books at all in order to maximize the rate of loss. Remember that we are allowed to choose αi = 0 if necessary to maximize the rate of loss. In a case in which there are two Dutch books that can be made against an agent, but one of them leads to a greater rate of loss on its own, the total rate of loss can be maximized by setting the relevant coefficients to 0. For example, suppose an agent has the credence function f that is defined as follows: f(p) = 0.6, f(∼p) = 0.5, f(T) = 0.9. An agent who adopts f can be Dutch booked in two ways: on her incoherent credences in the partition {p, ∼p}, and on her less-than-certain credence in the tautology T. In this case, the rate of loss comes down to:

H(Y) = \frac{0.1\alpha_1 + 0.1\alpha_3}{2\alpha_1 + \alpha_3}
The rate of loss in this case reaches its maximum value of 0.1 if we set α1 = 0. This amounts to only Dutch booking the agent on her credence in the tautology, but refraining from Dutch booking her on her incoherent credences in the partition {p, ∼p}. This feature of the measure's normalization is the source of the swamping problem. Since only the worst Dutch book determines the agent's degree of incoherence, incoherencies in other parts of the credence function get swamped and are not reflected in the agent's degree of incoherence.

24 Here's a proof of the result (thanks to Kenny Easwaran for pointing this out to me): We want to prove that if we combine two fractions by adding their respective numerators and denominators, the resulting fraction is going to lie in between the two original fractions. Suppose that a/b < c/d, with a, b, c, d being positive. Then it is the case that ad < bc. First, we show that the combined fraction is greater than a/b:

a/b = (ab + ad) / (b(b + d))
(a + c)/(b + d) = (ab + cb) / (b(b + d))

If you compare the two terms on the right side of each equation, you notice that they are the same except for the right summand in the numerator. And since ad < bc, we can conclude that a/b < (a + c)/(b + d). Then we show that the combined fraction is smaller than c/d:

c/d = (cb + cd) / (d(b + d))
(a + c)/(b + d) = (ad + cd) / (d(b + d))

If you compare the two terms on the right side of each equation, you notice that they are the same except for the left summand in the numerator. And since ad < bc, we can conclude that c/d > (a + c)/(b + d).
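A tiny numerical check (again my own illustration in Python, with hypothetical variable names) confirms the swamping effect: the rate of loss just displayed never exceeds 0.1, and it attains 0.1 only when α1 = 0, that is, when the Dutch book on the partition {p, ∼p} is dropped altogether.

# Evaluate the rate of loss (0.1*a1 + 0.1*a3) / (2*a1 + a3) over a grid.
def rate(a1, a3):
    return (0.1 * a1 + 0.1 * a3) / (2 * a1 + a3)

pairs = [(a1 / 10, a3 / 10) for a1 in range(0, 11) for a3 in range(1, 11)]
best_a1, best_a3 = max(pairs, key=lambda p: rate(*p))
print(best_a1, best_a3, rate(best_a1, best_a3))   # best_a1 is 0.0, rate is about 0.1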
This also gives rise to the problematic example in Sect. 6.2, in which the measure orders two credence functions in a way that seems intuitively exactly backwards, which makes the measure unsuitable to evaluate reasoning processes.

In order to determine the degree of incoherence of a credence function according to the neutral/max measure instead, we must make the following adjustment: instead of normalizing the Dutch book loss by dividing the betting loss by the sum of the stakes of the individual bets, we must normalize it by dividing the Dutch book loss by the stakes of the largest bet included in the gamble. Hence, the rate of loss that must be maximized becomes:

H(Y) = \frac{G(Y)}{\max\{|\alpha_1|, \ldots, |\alpha_n|\}}
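For readers who want to verify the neutral/max numbers directly, here is a minimal sketch in Python with numpy and scipy (my own illustration, not SSK's code): it caps every stake at 1, so that the largest stake in the denominator equals 1, and maximizes the guaranteed loss with a linear program. The four-world space and the credence functions f and g from the example above are assumptions of the illustration; the expected outputs are 0.2 for f and 0.4 for g, matching the values reported in the main text.

import numpy as np
from scipy.optimize import linprog

def neutral_max(worlds, props, cred):
    # Payoff matrix: entry [s, i] = I_{A_i}(s) - cred(A_i).
    M = np.array([[(1.0 if props[name](w) else 0.0) - cred[name]
                   for name in props] for w in worlds])
    k = len(props)
    # Variables: alpha_1, ..., alpha_k, t; maximize t, i.e. minimize -t,
    # subject to: for every world s, sum_i alpha_i * M[s, i] + t <= 0.
    obj = np.concatenate([np.zeros(k), [-1.0]])
    A_ub = np.hstack([M, np.ones((len(worlds), 1))])
    b_ub = np.zeros(len(worlds))
    bounds = [(-1, 1)] * k + [(0, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[-1]   # guaranteed loss when the largest stake is capped at 1

worlds = [(tp, tq) for tp in (True, False) for tq in (True, False)]
props = {"p": lambda w: w[0], "~p": lambda w: not w[0],
         "q": lambda w: w[1], "~q": lambda w: not w[1]}
f = {"p": 0.6, "~p": 0.6, "q": 0.5, "~q": 0.5}
g = {"p": 0.6, "~p": 0.6, "q": 0.6, "~q": 0.6}
print(neutral_max(worlds, props, f))   # about 0.2
print(neutral_max(worlds, props, g))   # about 0.4

The two printed values also agree with the absolute distances from f and g to their closest coherent credence functions, as the De Bona and Finger equivalence discussed in Sect. 7 would predict.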
I hope this appendix helpfully supplements the informal presentation of the arguments in the main body of the paper, and I will leave it to the reader to verify the results in the remaining sections of the main text.

References

Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
Christensen, D. (2004). Putting logic in its place. Oxford: Oxford University Press.
De Bona, G., & Finger, M. (2014). Notes on measuring inconsistency in probabilistic logic. Technical report RT-MAC-2014-02, Department of Computer Science, IME/USP. http://www.ime.usp.br/~mfinger/www-home/papers/DBF2014-reltec.pdf.
De Finetti, B. (1974). Theory of probability (Vol. 1). New York: Wiley.
Earman, J. (1992). Bayes or bust? A critical examination of Bayesian confirmation theory. Cambridge: MIT Press.
Hacking, I. (1967). Slightly more realistic personal probability. Philosophy of Science, 34(4), 311–325.
Hájek, A. (2008). Dutch book arguments. In P. Anand, P. Pattanaik, & C. Puppe (Eds.), The Oxford handbook of rational and social choice (pp. 173–195). Oxford: Oxford University Press.
Harman, G. (1986). Change in view. Cambridge: MIT Press.
Joyce, J. M. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65(4), 575–603.
Joyce, J. M. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief (pp. 263–297). Dordrecht: Springer.
Osherson, D., Lane, D., Hartley, P., & Batsell, R. R. (2001). Coherent probability from incoherent judgment. Journal of Experimental Psychology: Applied, 7(1), 3–12.
Osherson, D., & Vardi, M. Y. (2006). Aggregating disparate estimates of chance. Games and Economic Behavior, 56(1), 148–173.
Predd, J., Seiringer, R., Lieb, E. H., Osherson, D., Poor, H. V., & Kulkarni, S. (2009). Probabilistic coherence and proper scoring rules. IEEE Transactions on Information Theory, 55(10), 4786–4792.
Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336), 783–801.
Schervish, M. J., Seidenfeld, T., & Kadane, J. B. (2000). How sets of coherent probabilities may serve as models for degrees of incoherence. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 8, 347–355.
Schervish, M. J., Seidenfeld, T., & Kadane, J. B. (2002a). Measuring incoherence. Sankhya: The Indian Journal of Statistics, 64(Series A, Pt. 3), 561–587.
Schervish, M. J., Seidenfeld, T., & Kadane, J. B. (2002b). Measures of incoherence: How not to gamble if you must. In J. M. Bernardo, et al. (Eds.), Bayesian statistics 7 (pp. 385–401). Oxford: Oxford University Press, 2003.
Schervish, M. J., Seidenfeld, T., & Kadane, J. B. (2009). Proper scoring rules, dominated forecasts, and coherence. Decision Analysis, 6(4), 202–221.
Titelbaum, M. (2013). Quitting certainties: A Bayesian framework modeling degrees of belief. Oxford: Oxford University Press.
Wang, G., Kulkarni, S. R., Poor, H. V., & Osherson, D. N. (2011). Aggregating large sets of probabilistic forecasts by weighted coherent adjustment. Decision Analysis, 8(2), 128–144.
Weisberg, J. (2011). Varieties of Bayesianism. In D. Gabbay, S. Hartmann, & J. Woods (Eds.), Handbook of the history of logic (Vol. 10). Amsterdam: Elsevier.
Zynda, L. (1996). Coherence as an ideal of rationality. Synthese, 109(2), 175–216.