Developing expert political judgment: The impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments
Judgment and Decision Making, Vol. 11, No. 5, September 2016, pp. 509–526
Welton Chang*
Eva Chen†
Barbara Mellers†
Philip Tetlock†
Abstract
The heuristics-and-biases research program highlights reasons for expecting people to be poor intuitive forecasters. This
article tests the power of a cognitive-debiasing training module (“CHAMPS KNOW”) to improve probability judgments in a
four-year series of geopolitical forecasting tournaments sponsored by the U.S. intelligence community. Although the training
lasted less than one hour, it consistently improved accuracy (Brier scores) by 6 to 11% over the control condition. Cognitive
ability and practice also made largely independent contributions to predictive accuracy. Given the brevity of the training
tutorials and the heterogeneity of the problems posed, the observed effects are likely to be lower-bound estimates of what
could be achieved by more intensive interventions. Future work should isolate which prongs of the multipronged CHAMPS
KNOW training were most effective in improving judgment on which categories of problems.
Keywords: forecasting, probability judgment, training, practice, cognitive debiasing
1 Introduction
Research in judgment and choice has found numerous
flaws in people’s intuitive understanding of probability (Bar-Hillel, 1980; Kahneman & Tversky, 1973, 1984; Lichtenstein, Slovic, Fischhoff, Layman & Combs, 1978; Slovic
& Fischhoff, 1977; Tversky & Kahneman, 1974). We often make errors in prediction tasks by using effort-saving
heuristics that are either insensitive to factors that normative theories say we should take into account or sensitive to
factors that we should ignore (Kahneman & Tversky, 1977,
1982; Morewedge & Kahneman, 2010; Tversky & Kahneman, 1974). These results have sparked interest in interventions that can improve judgments (Arkes, 1991; Croskerry,
Singhal & Mamede, 2013a, 2013b; Fischhoff, 1982; Lilienfeld, Ammirati & Landfield, 2009; Miller, 1969), but it remains true that significantly less attention has been paid
to “debiasing” than to biases (Arkes, 1991; Graber et al.,
2012; Lilienfeld et al., 2009). Moreover, few organizations
have embraced the debiasing methods that have been developed (Croskerry, 2003; Graber et al., 2012; Lilienfeld et al.,
2009).
The authors thank Lyle Ungar and Angela Duckworth for their comments as well as Pavel Atanasov, Philip Rescober and Angela Minster for
their help with data analysis. Pavel Atanasov, Terry Murray and Katrina
Fincher were instrumental in helping us develop the training materials as
well. This research was supported by the Intelligence Advanced Research
Projects Activity (IARPA) via the Department of Interior National Business
Center contract number D11PC20061. The U.S. Government is authorized
to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions expressed herein are those of
the authors and should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or implied, of IARPA,
DoI/NBC, or the U.S. Government.
Copyright: © 2016. The authors license this article under the terms of
the Creative Commons Attribution 3.0 License.
* Department of Psychology, University of Pennsylvania, Philadelphia,
PA 19104.
† University of Pennsylvania
Accurate probability judgments are important in domains
such as law, finance, medicine and politics (Croskerry et al.,
2013b; Jolls & Sunstein, 2005). For example, the U.S. justification for invading Iraq in 2003 hinged on intelligence estimates that stated with high confidence that Iraq possessed
Weapons of Mass Destruction (WMD) (Director of Central
Intelligence, 2002). Two years later, a bipartisan commission determined that there were no WMD in Iraq. The commission called the pre-war intelligence “dead wrong” and placed the blame on
the intelligence community and on the politicization of the available information by a subset of policymakers (Commission
on the Intelligence Capabilities of the United States Regarding Weapons of Mass Destruction, 2005). The United States
would continue its involvement in the country for over a
decade at an estimated cost between $4 and $6 trillion and
thousands of casualties, numbers which underscore the dangers of over-confident “slam-dunk” assessments of ambiguous evidence (Bilmes, 2014).
The intelligence community responded, in part, by creating IARPA, a research division devoted to exploring methods of
improving intelligence analysis. The research reported here was part of four years of forecasting tournaments in which our team, the Good Judgment Project, was a
competitor. Five university-based teams competed to submit the most accurate daily probability forecasts possible
on a range of political and economic questions, an effort that included improving human judgments with algorithms. Additional details on the forecasting tournament, its competitors
and the Good Judgment Project’s winning methods were previously reported in Mellers et al. (2014) and Tetlock, Mellers,
Rohrbaugh and Chen (2014). We experimentally tested the
efficacy of a variety of tools for improving judgment, including a cognitive-debiasing and political knowledge training
regimen called “CHAMPS KNOW”.
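Accuracy in the tournaments is reported as Brier scores, and the training effect is expressed as a percentage improvement over the control condition. The following is a minimal sketch of how such a comparison could be computed. It assumes the original multi-category form of the Brier score (sum of squared errors across all answer options, ranging from 0 to 2), which this section does not spell out. The function names and the example probabilities are hypothetical and are not taken from the study's data or code.

```python
from typing import Sequence


def brier_score(forecast: Sequence[float], outcome_index: int) -> float:
    """Brier score for one forecast (assumed multi-category form).

    forecast: probabilities over mutually exclusive answer options.
    outcome_index: position of the option that actually occurred.
    Returns 0.0 for a perfect forecast and 2.0 for a maximally wrong one.
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def mean_brier(forecasts: Sequence[Sequence[float]],
               outcomes: Sequence[int]) -> float:
    """Average Brier score over a set of forecasts (lower is better)."""
    scores = [brier_score(f, o) for f, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)


def percent_improvement(control: float, treatment: float) -> float:
    """Relative reduction in Brier score versus control, in percent."""
    return 100.0 * (control - treatment) / control


# Hypothetical forecasts on two binary questions (illustration only).
outcomes = [0, 1]
control_forecasts = [[0.6, 0.4], [0.3, 0.7]]
trained_forecasts = [[0.7, 0.3], [0.2, 0.8]]

control_mean = mean_brier(control_forecasts, outcomes)
trained_mean = mean_brier(trained_forecasts, outcomes)
improvement = percent_improvement(control_mean, trained_mean)
```

Because lower Brier scores indicate better accuracy, a positive value from `percent_improvement` means the treatment group was more accurate than the control group.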
vestigated individual-difference moderators. Our study also
represents one of the most rigorous tests of debiasing methods to date. The open-ended experimental task, forecasting
a wide range of political and economic outcomes, is widely
recognized as difficult (Jervis, 2010; Tetlock, 2005). Some
political experts and commentators have portrayed it as impossible (Atkins, 2015; Taleb & Blyth, 2011). Our work
does not correct all of the aforementioned conceptual and
methodological problems, but we can address a significant
fraction of them.
The analysis reported here builds on Mellers et al. (2014).
The previous article examined the first two years of the forecasting tournament and discussed several drivers of performance. Here, we focus on the effects of training and include
a more in-depth analysis of all four years of the experiment.
We also examine mediational mechanisms and moderator
variables to understand individual differences.
1.1 Literature review
A number of studies have shed light on how probability estimates and judgments can be improved (Fischbein & Gazit,
1984; Fischhoff & Bar-Hillel, 1984; Stewart, 2001; Tetlock,
2005; Whitecotton, Sanders & Norris, 1998). However, past
work suffers from at least six sets of limitations: 1) overreliance on student subjects who are often neither intrinsically nor extrinsically motivated to master the task (Anderson, 1982; Petty & Cacioppo, 1984; Sears, 1986); 2) oneshot experimental ta (...truncated)