Enough of Cato Unbound’s What’s Wrong With Expert Predictions debate has now unfolded that it makes sense for me to offer some commentary. The discussion encompasses many predictive and decision-making subject areas and institutions – politics, economics, business, media punditry – but for the purposes of Analyst First I’m primarily interested in prediction in the context of organisations.

All the discussants agree that expert predictive track records are terrible, but they diverge in the degree to which they see this as problematic and in their recommendations as to what to do about it. The debate so far:

In their Lead Essay, Dan Gardner and Philip Tetlock present a puzzle:

Every year, corporations and governments spend staggering amounts of money on forecasting and one might think they would be keenly interested in determining the worth of their purchases and ensuring they are the very best available. But most aren’t. They spend little or nothing analyzing the accuracy of forecasts and not much more on research to develop and compare forecasting methods.

They go on to provide an overview of Tetlock’s longitudinal study of experts, encompassing 28,000 predictions over a fifteen-year period, which found that eclectic foxes outperform dogmatic hedgehogs, but that both are outperformed by extrapolation algorithms. They argue that we need to get better at accepting our limitations and to “give greater consideration to living with failure, uncertainty, and surprise”. They accordingly call for “decentralized decision-making and a proliferation of small-scale experimentation”.

In the Reaction Essay section, Robin Hanson addresses the puzzle of why forecasting remains so immune to accountability via – presumably easy-to-assemble – track records: “[s]urprising disinterest [he means uninterest] in forecasting accuracy could be explained either by its costs being higher, or its benefits being lower, than we expect.” His conclusion is that, even in profit and loss settings such as organisations, the signalling value of forecasting must compete with its information value:

Even in business, champions need to assemble supporting political coalitions to create and sustain large projects. As such coalitions are not lightly disbanded, they are reluctant to allow last minute forecast changes to threaten project support. It is often more important to assemble crowds of supporting “yes-men” to signal sufficient support, than it is to get accurate feedback and updates on project success. Also, since project failures are often followed by a search for scapegoats, project managers are reluctant to allow the creation…

He points out that, while prediction markets are best able to incentivise information holders to provide accurate forecasts, institutional respect for accuracy is a necessary and thus far absent precondition to their widespread uptake.

John H. Cochrane turns the tables by arguing that unforecastability is a good sign as seen through the lens of economics:

In fact, many economic events should be unforecastable, and their unforecastability is a sign that the markets and our theories about them are working well.

This statement is clearest in the case of financial markets. If anyone could tell you with any sort of certainty that “the market will go up tomorrow,” you could use that information to buy today and make a fortune. So could everyone else. As we all try to buy, the market would go up today, right to the point that nobody can tell whether tomorrow’s value will be higher or lower.

An “efficient” market should be unpredictable. If markets went steadily up and delivered return without risk, then markets would not be working as they should.

Forecasting, in the sense of trying to predict the future accurately, is a “fool’s game”. But it does work as an input into risk management:

The good use of “forecasting” is to get a better handle on probabilities, so we focus our risk management resources on the most important events. But we must still pay attention to events, and buy insurance against them, based as much on the painfulness of the event as on its probability. (Note to economics techies: what matters is the risk-neutral probability, probability weighted by marginal utility.)

So it’s not really the forecast that’s wrong, it’s what people do with it. If we all understood the essential unpredictability of the world, especially of rare and very costly events, if we got rid of the habit of mind that asks for a forecast and then makes “plans” as if that were the only state of the world that could occur; if we instead focused on laying out all the bad things that could happen and made sure we had insurance or contingency plans, both personal and public policies might be a lot better.
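Cochrane’s parenthetical on risk-neutral probability can be made concrete with a standard asset-pricing sketch (my notation, not drawn from his essay): if state $i$ occurs with physical probability $p_i$ and marginal utility of consumption in that state is $u'(c_i)$, the risk-neutral weight is the physical probability scaled by marginal utility and renormalised:

$$ q_i = \frac{p_i \, u'(c_i)}{\sum_j p_j \, u'(c_j)} $$

A rare but very painful state (high marginal utility) therefore receives a large risk-neutral weight, which is why insurance should be bought “based as much on the painfulness of the event as on its probability”.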

Cochrane defends a hedgehog-like reversion to principles – basic economic principles like supply and demand – in order to build effective conditional forecasts which inform plans and provide decision support.

Bruce Bueno de Mesquita argues that expert prediction is, properly contextualised, a sideshow. Statistical methods are widely used, so much so that we’ve ceased to notice (e.g. in insurance pricing and political polling). Game theory is better still, and continues to make incremental progress:

Are these methods perfect or omniscient? Certainly not! Are the marginal returns to knowledge over naïve methods (expert opinion; predicting that tomorrow will be just like today) substantial? I believe the evidence warrants an enthusiastic “Yes!” Nevertheless, despite the numerous successes in designing predictive methods, we appropriately focus on failures. After all, by studying failure methodically we are likely to make progress in eliminating some errors in the future.

So why do we continue to focus on the poorly performing experts? De Mesquita’s view is that:

Unfortunately, government, business, and the media assume that expertise—knowing the history, culture, mores, and language of a place, for instance—is sufficient to anticipate the unfolding of events. Indeed, too often many of us dismiss approaches to prediction that require knowledge of statistical methods, mathematics, and systematic research design. We seem to prefer “wisdom” over science, even though the evidence shows that the application of the scientific method, with all of its demands, outperforms experts.

De Mesquita goes on to explain and advocate his own game theoretic (Expected Utility Model) approach:

Acting like a fox, I gather information from a wide variety of experts. They are asked only for specific current information (Who wants to influence a decision? What outcome do they currently advocate? How focused are they on the issue compared to other questions on their plate? How flexible are they about getting the outcome they advocate? And how much clout could they exert?). They are not asked to make judgments about what will happen. Then, acting as a hedgehog, I use that information as data with which to seed a dynamic applied game theory model. The model’s logic then produces not only specific predictions about the issues in question, but also a probability distribution around the predictions. The predictions are detailed and nuanced. They address not only what outcome is likely to arise, but also how each “player” will act, how they are likely to relate to other players over time, what they believe about each other, and much more.
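To show the shape of those inputs, here is a minimal, hypothetical Python sketch. It is emphatically not de Mesquita’s model: it only computes a clout- and salience-weighted mean of the positions the experts report, an illustrative baseline, to show how structured expert inputs can seed a formal calculation rather than a judgement call.

```python
from dataclasses import dataclass

@dataclass
class Player:
    """One stakeholder, described by the kinds of inputs de Mesquita lists."""
    name: str
    position: float   # outcome currently advocated, on an agreed 0-100 issue scale
    salience: float   # how focused the player is on this issue, 0-1
    clout: float      # how much influence the player could exert, 0-1

def weighted_mean_position(players: list[Player]) -> float:
    """Clout- and salience-weighted mean of advocated positions.

    An illustrative baseline only; the dynamic game theory model itself
    simulates rounds of proposals, coalitions and belief updating on top
    of inputs like these.
    """
    weights = [p.clout * p.salience for p in players]
    return sum(w * p.position for w, p in zip(weights, players)) / sum(weights)

# Hypothetical inputs for a single negotiation issue.
players = [
    Player("Ministry", position=80, salience=0.9, clout=0.7),
    Player("Industry lobby", position=30, salience=0.6, clout=0.5),
    Player("Regulator", position=60, salience=0.8, clout=0.9),
]
print(f"Baseline predicted outcome: {weighted_mean_position(players):.1f}")
```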

In the Conversation section, Robin Hanson challenges Cochrane and de Mesquita to produce conditional forecasts and submit them to systematic public measurement and verification. He is doubtful, however, that they will assent:

The sad fact is that the many research patrons eager to fund hedgehoggy research by folks like Cochrane and De Mesquita show little interest in funding forecasting competitions at the scale required to get public participation by such prestigious folks.

Forecasting, he contends, is a domain in which the rewards to affiliation with prominent expertise trump accuracy.

Bruce Bueno de Mesquita replies that the acceptance of his methods in journals, via peer review, is evidence of their having been sufficiently scrutinised; furthermore that no one has been willing to publicly compete with him; additionally that he has successfully beaten alternative approaches; and finally that he has made his methods available online.

Robin Hanson responds that more comprehensive standards of proof are required to settle the matter.

Gardner and Tetlock then provide an insightful running summary. In response to Hanson they speculate that admitting to poor forecasting performance would carry real costs for those currently enjoying public and organisational reputations their performance does not justify:

Open prediction contests will reveal how hard it is [for them] to outperform their junior assistants and secretaries. Insofar as technologies such as prediction markets make it easier to figure out who has better or worse performance over long stretches, prediction markets create exactly the sort of transparency that destabilizes status hierarchies… If these hypotheses are correct the prognosis for prediction markets—and transparent competitions of relative forecasting performance—is grim. Epistemic elites are smart enough to recognize a serious threat to their dominance.

In response to Cochrane they speak up for the value of hedgehogs – more compelling, more visionary, better at envisioning extreme events – but note that the cost of this is that they are more wrong, more often.

They close by welcoming de Mesquita’s willingness to be publicly scrutinised, note that the jury is still out on systematic and decomposed measurement of his methods, and caution that:

For many categories of forecasting problems, we are likely to bump into the optimal forecasting frontier quite quickly. There is an irreducible indeterminacy to history and no amount of ingenuity will allow us to predict beyond a certain point.

De Mesquita responds that he welcomes being assessed.

[Image: Target by Jasper Johns]

Although Cochrane comes close, none of the discussants explicitly recognises and makes central the difference between forecasting and other activities which organisations call forecasting (i.e. planning and goal setting). I explained this distinction in a previous post, namely:

  • Forecasting means objectively estimating the most likely future outcome: “what’s going to happen?”
  • Goal setting means putting a target in place, generally for motivational purposes: “what would we like to happen?”
  • Planning means establishing an intended course of action, usually to direct the allocation of resources: “what are we going to do?”

This distinction is key because, while all three activities are based on prediction, only in the case of forecasting is predictive accuracy the primary purpose. Organisations can improve all of these, but to do so they need to address three tiers of potential failure:

Paradigm Failure

All the Cato discussants take it as read that, in assessing predictions, they’re operating in an empirical paradigm. In organisations, however, this can’t be taken for granted. Many organisations place prediction either in the wrong paradigm, or no paradigm at all. It’s common for predictive activities and processes to be ritualised and adhered to, but without any systematic error measurement or validation. Gardner and Tetlock acknowledge the “widespread lack of curiosity—lack of interest in thinking about how we think about possible futures” as “a phenomenon worthy of investigation in its own right,” pointing out the wastefulness of remaining ignorant given the resources involved.

Category Failure

Systematic error measurement and validation can’t happen without the right categories being first recognised and agreed upon. Disambiguating forecasting from goal setting from planning is critical. Organisations don’t do this well. Loose language doesn’t help. The same Finance department will update a budget (a plan) and call it a “forecast”, oversee the revision of sales “forecasts” (goals), and publish revenue estimates for the scrutiny of stock market analysts (true forecasts). As an earlier Analyst First post pointed out, these activities, while all reliant on objective estimation, do not share the same benchmarks when it comes to assessing error and value. Forecast error makes sense for forecasting; execution error makes more sense for goal setting and planning.
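The different benchmarks can be illustrated with a deliberately simple, hypothetical sketch: forecast error compares an objective estimate with what actually happened, while execution error compares what actually happened with the target or plan.

```python
def forecast_error(forecast: float, actual: float) -> float:
    """Forecast error: how far the objective estimate was from the outcome."""
    return abs(actual - forecast) / actual

def execution_error(target: float, actual: float) -> float:
    """Execution error: how far the outcome fell short of (or exceeded) the target."""
    return (actual - target) / target

# Hypothetical quarterly revenue figures, in $m.
forecast, target, actual = 102.0, 110.0, 98.0

print(f"Forecast error:  {forecast_error(forecast, actual):.1%}")   # judge the forecaster
print(f"Execution error: {execution_error(target, actual):.1%}")    # judge delivery against the goal
```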

The Cato discussants all tacitly acknowledge these distinctions, but none recognises their implications when it comes to understanding the way organisations do prediction.

Tetlock’s experiment required that pundits’ anonymity be protected: participants knew to distance themselves from their projections once they were accountable for accuracy. The implication here is either that pundits are dishonest, or that they recognise that their projections serve a purpose other than informing people about the likelihood of future events. Gardner and Tetlock, and Hanson, acknowledge that punditry is a form of entertainment, that it has signalling value, and that it consequently trades off accuracy for clarity and narrative value. As Hanson puts it:

Media consumers can be educated and entertained by clever, witty, but accessible commentary, and can coordinate to signal that they are smart and well-read by quoting and discussing the words of the same few focal pundits. Also, impressive pundits with prestigious credentials and clear “philosophical” positions can let readers and viewers gain by affiliation with such impressiveness, credentials, and positions. Being easier to understand and classify helps “hedgehogs” to serve many of these functions.

Hanson recognises that affiliation with sophistication has signalling value within organisations too. He notes the multiple roles played by managers, including the requirement that they appear impressive enough to attract affiliation and inspire their subordinates:

[C]onsider next the many functions and roles of managers, both public and private. By being personally impressive, and by being identified with attractive philosophical positions, leaders can inspire people to work for and affiliate with their organizations. Such support can be threatened by clear tracking of leader forecasts, if that questions leader impressiveness.

He goes on to describe the motivational impact of managerial ‘overconfidence’:

Often, managers can increase project effort by getting participants to see an intermediate chance of the project making important deadlines—the project is both likely to succeed, and to fail. Accurate estimates of the chances of making deadlines can undermine this impression management. Similarly, overconfident managers who promise more than they can deliver are often preferred, as they push teams harder when they fall behind and deliver more overall.

Incentivising workers to “deliver more overall” is precisely the purpose of goal setting. Consistently producing overshooting projections in this context isn’t necessarily “forecast hypocrisy,” as Hanson characterises it. It may be effective stretch targeting.

Many of the discussants also acknowledge that planning is a different activity from forecasting (and goal setting), but don’t pursue the full implications of this in terms of error and value measurement. The Kenneth Arrow anecdote relayed by Gardner and Tetlock, for example, illustrates that plans are reliant on, but different from, forecasts:

Some [corporations and governments] even persist in using forecasts that are manifestly unreliable, an attitude encountered by the future Nobel laureate Kenneth Arrow when he was a young statistician during the Second World War. When Arrow discovered that month-long weather forecasts used by the army were worthless, he warned his superiors against using them. He was rebuffed. “The Commanding General is well aware the forecasts are no good,” he was told. “However, he needs them for planning purposes.”

Gardner and Tetlock also look at the role in preparedness planning of prediction that is aware of its own limitations, comparing the effectiveness of the recent New Zealand and Haiti earthquake responses:

Designing for resiliency is essential, as New Zealanders discovered in February when a major earthquake struck Christchurch. 181 people were killed. When a somewhat larger earthquake struck Haiti in 2010, it killed hundreds of thousands. The difference? New Zealand’s infrastructure was designed and constructed to withstand an earthquake, whenever it might come. Haiti’s wasn’t.

Cochrane seconds this, adding that predictions have scenario generation utility regardless of their accuracy:

Once we recognize that uncertainty will always remain, risk management rather than forecasting is much wiser. Just the step of naming the events that could happen is useful.

In these and other ways, the discussants acknowledge that accuracy isn’t the only purpose of prediction. It should therefore follow that forecast error might not be the only relevant measure.

Methodological Failure

Much of the discussion contrasts different predictive tools, techniques and approaches: expert judgement, statistical algorithms, prediction markets, game theory. Methodologies and expectations both need to be appropriately calibrated: simple statistical extrapolation works well in some settings, but in complex systems environments the best we can hope for may be a better feel for the probabilities involved.

There are a range of insights here for organisations. Individual human judgement on its own, it is unanimously acknowledged, performs poorly. Statistical algorithms consistently beat the experts. There is general agreement among the discussants that eclecticism is desirable. The clear implication is that organisations should adopt collective intelligence methods.
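A toy simulation (hypothetical numbers, not drawn from the debate) illustrates the collective intelligence point: averaging several noisy, independent forecasts cancels much of the idiosyncratic error, so the crowd mean usually beats the typical individual.

```python
import random

random.seed(1)
TRUTH = 100.0

def expert_forecast() -> float:
    """One noisy individual forecast: unbiased here, but with sizeable error."""
    return random.gauss(TRUTH, 15.0)

def trial(n_experts: int = 10) -> tuple[float, float]:
    """Average individual error vs. error of the crowd mean for one question."""
    forecasts = [expert_forecast() for _ in range(n_experts)]
    individual_error = sum(abs(f - TRUTH) for f in forecasts) / n_experts
    crowd_error = abs(sum(forecasts) / n_experts - TRUTH)
    return individual_error, crowd_error

# Repeat over many hypothetical questions and compare the two error rates.
results = [trial() for _ in range(1000)]
mean_individual = sum(r[0] for r in results) / len(results)
mean_crowd = sum(r[1] for r in results) / len(results)
print(f"Mean individual error: {mean_individual:.1f}")
print(f"Mean crowd error:      {mean_crowd:.1f}")
```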

Tetlock’s wider work on expert political judgement has implications for optimal forecasting team composition (use hedgehogs to generate possibilities and foxes to synthesise and calibrate probabilities). Gardner and Tetlock also call for what we term Decision Performance Management:

Imagine a system for recording and judging forecasts. Imagine running tallies of forecasters’ accuracy rates. Imagine advocates on either side of a policy debate specifying in advance precisely what outcomes their desired approach is expected to produce, the evidence that will settle whether it has done so, and the conditions under which participants would agree to say “I was wrong.” Imagine pundits being held to account.
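One way to picture such a system: record every probabilistic forecast alongside what eventually happened, and keep a running accuracy tally per forecaster, for example a mean Brier score (lower is better). A minimal, hypothetical sketch:

```python
from collections import defaultdict

# Each record: (forecaster, probability assigned to the event, did it happen?)
records = [
    ("Pundit A", 0.9, False),
    ("Pundit A", 0.8, True),
    ("Junior assistant", 0.6, True),
    ("Junior assistant", 0.4, False),
]

def brier(prob: float, occurred: bool) -> float:
    """Brier score for one binary forecast: squared distance from the outcome."""
    return (prob - (1.0 if occurred else 0.0)) ** 2

tallies: dict[str, list[float]] = defaultdict(list)
for forecaster, prob, occurred in records:
    tallies[forecaster].append(brier(prob, occurred))

for forecaster, scores in tallies.items():
    mean_score = sum(scores) / len(scores)
    print(f"{forecaster}: mean Brier score {mean_score:.2f} over {len(scores)} forecasts")
```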

It’s also worth imagining what sort of environment supports this, as Hanson does in his discussion of a different “social equilibrium”:

A track record tech must be combined with a social equilibrium that punishes those with poor records, and thus encourages rivals and victims to collect and report records. The lesson I take for forecast accuracy is that it isn’t enough to devise ways to record forecast accuracy—we also need a new matching social respect for such records.

He’s right. New ways to record accuracy aren’t enough. We also need to know whether accuracy is the real goal. On the subject of goals: organisational planning and goal setting may well be best understood in a game theoretic context.

Whatever the case, once forecasting, goal setting and planning are disambiguated – and because they remain related – empiricism means doing all three better.


