How generalizable is good judgment? A multi-task, multi-benchmark study

http://journal.sjdm.org/17/17408/jdm17408.pdf

Judgment and Decision Making, Vol. 12, No. 4, July 2017, pp. 369–381

Barbara A. Mellers∗, Joshua D. Baker†, Eva Chen†, David R. Mandel‡, Philip E. Tetlock†

Abstract

Good judgment is often gauged against two gold standards – coherence and correspondence. Judgments are coherent if they demonstrate consistency with the axioms of probability theory or propositional logic. Judgments are correspondent if they agree with ground truth. When gold standards are unavailable, silver standards such as consistency and discrimination can be used to evaluate judgment quality. Individuals are consistent if they assign similar judgments to comparable stimuli, and they discriminate if they assign different judgments to dissimilar stimuli. We ask whether “superforecasters”, individuals with noteworthy correspondence skills (see Mellers et al., 2014), show superior performance on laboratory tasks assessing other standards of good judgment. Results showed that superforecasters either tied or out-performed less correspondent forecasters and undergraduates with no forecasting experience on tests of consistency, discrimination, and coherence. While multifaceted, good judgment may be a more unified concept than previously thought.

Keywords:

1 Introduction

Social scientists and philosophers often evaluate judgments against two gold standards: correspondence and coherence (Hammond, 1996, 2007). Measures of correspondence capture the degree to which judgments agree with empirical observations (e.g., Cooksey, 1996; Hammond, 1996), and coherence criteria assess the degree to which judgments are consistent with logical or axiomatic principles. Both standards are widely accepted as important components of good judgment (Dunwoody & College, 2009; Hammond, 2000).

1.1 How Well do People Meet These Standards?

Previous research suggests that human judgment tends to fall short on coherence and correspondence.
Coherence violations range from base rate neglect and confirmation bias to overconfidence and framing effects (Gilovich, Griffin & Kahneman, 2002; Kahneman, Slovic & Tversky, 1982). Experts are not immune. Statisticians (Christensen-Szalanski & Bushyhead, 1981), doctors (Eddy, 1982), and nurses (Bennett, 1980) neglect base rates. Physicians and intelligence professionals are susceptible to framing effects (Aberegg, Arkes & Terry, 2006; Reyna, Chick, Corbin & Hsia, 2014), and financial investors are prone to overconfidence (Barber & Odean, 2001).

Research on correspondence tells a similar story. Numerous studies show that human predictions are frequently inaccurate and worse than simple linear models in many domains (e.g., Meehl, 1954; Dawes, Faust & Meehl, 1989). Once again, expertise doesn’t necessarily help. Inaccurate predictions have been found among parole officers (Carroll & Payne, 1977), court judges (Ebbesen & Konečni, 1975), investment managers in the US and Taiwan (Olsen, 1997), and politicians (Tetlock, 2005). However, expert predictions are better when the forecasting environment provides regular, clear feedback and there are repeated opportunities to learn (Kahneman & Klein, 2009; Shanteau, 1992). Examples include meteorologists (Murphy & Winkler, 1984), professional bridge players (Keren, 1987), and bookmakers at the racetrack (Bruce & Johnson, 2003), all of whom are well-calibrated in their own domains.

Copyright: © 2017. The authors license this article under the terms of the Creative Commons Attribution 3.0 License.
∗ Department of Psychology, Solomon Labs, 3720 Walnut St., University of Pennsylvania, Philadelphia, PA 19104. Email: .
† University of Pennsylvania
‡ DRDC and York University

1.2 Silver Standards

In many cases, judgment quality is important, but gold standards are unavailable. How “good” is a physician’s diagnosis, for example? Or an instructor’s grade, or a judge’s sentencing decision?
Einhorn (1972, 1974) and Weiss & Shanteau (2003) suggested that, at a minimum, good judges (i.e., domain experts) should demonstrate consistency and discrimination in their judgments. In other words, experts should make similar judgments when cases are alike, and dissimilar judgments when cases are unalike. Indeed, some would argue that these skills are essential to expertise.

In a test of consistency, Skånér, Strender & Bring (1998) asked 27 general practitioners (GPs) to estimate the probability of patient heart failure from a series of vignettes based on actual patients. In the vignettes, GPs were given diagnostic cues such as age, sex, and history of myocardial infarction, but the patients’ eventual survival status could not be obtained. GPs were presented with 45 cases, five of which were presented twice, and consistency was operationalized as the absolute difference in survival estimates on the five repeated cases. Individual consistency varied greatly. Absolute differences fell between 0 and 10% in 62% of cases, between 11% and 20% in 25% of cases, and above 20% in 13% of cases. In another test of consistency, Dhami & Ayton (2001) found inconsistency in magistrates’ decisions across repeated trials in a laboratory task of bail setting.

1.3 Connections Among Standards

We know of no studies that have examined good judgment using all four of the standards. One study examined three of them (Weiss, Brennan, Thomas, Kirlik & Miller, 2009). Using a golf putting task with experienced golfers, the authors found a strong correlation between accuracy and their combined measure of consistency and discrimination. In a handful of studies, researchers have investigated connections between the gold standards, but results have been mixed.
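As a brief aside on Section 1.2, the consistency measure reported for Skånér, Strender & Bring (1998) can be illustrated with a minimal Python sketch. The function names, the treatment of non-integer differences at the band boundaries, and the example estimates are all hypothetical and chosen only for illustration; they are not code or data from that study.

```python
from typing import Dict, List, Sequence


def repeat_differences(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Absolute difference, in percentage points, between the two estimates
    a judge gave for each repeated case."""
    if len(first) != len(second):
        raise ValueError("each repeated case needs a first and a second estimate")
    return [abs(a - b) for a, b in zip(first, second)]


def bin_shares(differences: Sequence[float]) -> Dict[str, float]:
    """Share of repeated cases per band: 0-10, 11-20, and above 20 points.

    Assumption (not from the source): a difference strictly between 10 and 11,
    or between 20 and 21, counts toward the higher band.
    """
    if not differences:
        raise ValueError("need at least one repeated case")
    counts = {"0-10": 0, "11-20": 0, ">20": 0}
    for d in differences:
        if d <= 10:
            counts["0-10"] += 1
        elif d <= 20:
            counts["11-20"] += 1
        else:
            counts[">20"] += 1
    return {band: n / len(differences) for band, n in counts.items()}


# Hypothetical estimates (percent) for five repeated vignettes; not study data.
first_pass = [30, 55, 80, 10, 45]
second_pass = [35, 40, 80, 35, 50]

diffs = repeat_differences(first_pass, second_pass)
print(diffs)              # [5, 15, 0, 25, 5]
print(bin_shares(diffs))  # {'0-10': 0.6, '11-20': 0.2, '>20': 0.2}
```

Smaller differences indicate greater consistency: a perfectly consistent judge would show a difference of zero on every repeated case.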
Most studies show weak connections (Wright & Ayton, 1987a; 1987b; Wright, Rowe, Bolger & Gammack, 1994; Adam & Reyna, 2005; Weaver & Stewart, 2012; Dunwoody et al., 2005), although one could argue that Weiss et al.’s results are an exception. For the most part, however, measures of coherence and correspondence are loosely coupled.

Yet under the right conditions, loose couplings may tighten. In research on forecasting, for example, some studies have given subjects a battery of coherence tasks prior to having them make forecasts about uncertain events. The results show that aggregate forecasts are more accurate (i.e., more correspondent) when subjects’ predictions are weighted by their scores on the coherence tasks, rather than combined with a simple unweighted average. Indeed, coherence-based weighting schemes have been shown to improve the accuracy of the aggregate by more than 30%, relative to a simple mean (Predd, Osherson, Kulkarni & Poor, 2008; Wang, Kulkarni, Poor & Osherson, 2011; Tsai & Kirlik, 2012; Karvetski, Olson, Mandel & Twardy, 2013). Perhaps correspondence and coherence are intertwined, but we’ve been looking in the wrong places.

1.4 Superforecasters

In previous research, we have examined a group of indivi (...truncated)