How generalizable is good judgment? A multi-task, multi-benchmark study
Judgment and Decision Making, Vol. 12, No. 4, July 2017, pp. 369–381
Barbara A. Mellers∗
Joshua D. Baker†
Eva Chen†
David R. Mandel‡
Philip E. Tetlock†
Abstract
Good judgment is often gauged against two gold standards – coherence and correspondence. Judgments are coherent if they
demonstrate consistency with the axioms of probability theory or propositional logic. Judgments are correspondent if they
agree with ground truth. When gold standards are unavailable, silver standards such as consistency and discrimination can
be used to evaluate judgment quality. Individuals are consistent if they assign similar judgments to comparable stimuli, and
they discriminate if they assign different judgments to dissimilar stimuli. We ask whether “superforecasters”, individuals with
noteworthy correspondence skills (see Mellers et al., 2014), show superior performance on laboratory tasks assessing other
standards of good judgment. Results showed that superforecasters either tied or out-performed less correspondent forecasters
and undergraduates with no forecasting experience on tests of consistency, discrimination, and coherence. While multifaceted,
good judgment may be a more unified concept than previously thought.
Keywords:
1 Introduction
Social scientists and philosophers often evaluate judgments
against two gold standards: correspondence and coherence
(Hammond, 1996, 2007). Measures of correspondence capture the degree to which judgments agree with empirical
observations (e.g., Cooksey, 1996; Hammond, 1996), and
coherence criteria assess the degree to which judgments are
consistent with logical or axiomatic principles. Both standards are widely accepted as important components of good
judgment (Dunwoody & College, 2009; Hammond, 2000).
1.1 How Well do People Meet These Standards?
Previous research suggests that human judgment tends to
fall short on coherence and correspondence. Coherence violations range from base rate neglect and confirmation bias
to overconfidence and framing effects (Gilovich, Griffin &
Kahneman, 2002; Kahneman, Slovic & Tversky, 1982). Experts are not immune. Statisticians (Christensen-Szalanski
& Bushyhead, 1981), doctors (Eddy, 1982), and nurses (Bennett, 1980) neglect base rates. Physicians and intelligence
professionals are susceptible to framing effects (Aberegg,
Arkes & Terry, 2006; Reyna, Chick, Corbin & Hsia, 2014),
and financial investors are prone to overconfidence (Barber
& Odean, 2001).
Copyright: © 2017. The authors license this article under the terms of
the Creative Commons Attribution 3.0 License.
∗ Department of Psychology, Solomon Labs, 3720 Walnut St.,
University of Pennsylvania, Philadelphia, PA 19104.
† University of Pennsylvania
‡ DRDC and York University
Research on correspondence tells a similar story. Numerous studies show that human predictions are frequently
inaccurate and worse than simple linear models in many domains (e.g., Meehl, 1954; Dawes, Faust & Meehl, 1989).
Once again, expertise doesn’t necessarily help. Inaccurate
predictions have been found in parole officers (Carroll &
Payne, 1977), court judges (Ebbesen & Konecki, 1975), investment managers in the US and Taiwan (Olsen, 1997),
and politicians (Tetlock, 2005). However, expert predictions are better when the forecasting environment provides
regular, clear feedback and there are repeated opportunities
to learn (Kahneman & Klein, 2009; Shanteau, 1992). Examples include meteorologists (Murphy & Winkler, 1984),
professional bridge players (Keren, 1987), and bookmakers
at the racetrack (Bruce & Johnson, 2003), all of whom are
well-calibrated in their own domains.
1.2 Silver Standards
In many cases, judgment quality is important, but gold standards are unavailable. How “good” is a physician’s diagnosis, for example? Or an instructor’s grade, or a judge’s
sentencing decision? Einhorn (1972, 1974) and Weiss &
Shanteau (2003) suggested that, at a minimum, good judges
(i.e., domain experts) should demonstrate consistency and
discrimination in their judgments. In other words, experts
should make similar judgments if cases are alike, and dissimilar judgments when cases are unalike. Indeed, some
would argue that these skills are essential to expertise.
In a test of consistency, Skånér, Strender & Bring
(1998) asked 27 general practitioners (GPs) to estimate the
probability of patient heart failure from a series of vignettes
based on actual patients. In the vignettes, GPs were given
diagnostic cues such as age, sex, and history of myocardial
infarction, but the patients’ eventual survival status could not
be obtained. GPs were presented with 45 cases, five of which
were presented twice, and consistency was operationalized
as the absolute difference in survival estimates on the five
repeated cases. Individual consistency varied greatly. Absolute differences fell between 0 and 10% in 62% of cases,
11% to 20% in 25% of cases, and greater than 20% in 13%
of cases. In another test of consistency, Dhami & Ayton
(2001) found inconsistency in magistrates’ decisions across
repeated trials in a laboratory task of bail setting.
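To make the repeated-case measure concrete, the following is a minimal sketch of how consistency might be scored as the absolute difference between two estimates of the same case, and then tallied into the three ranges reported above. The function names and the example estimates are hypothetical and chosen only for illustration; they are not data or code from Skånér et al. (1998).

```python
def repeat_differences(first, second):
    """Absolute difference, in percentage points, between paired estimates
    given to the same case on its first and second presentation."""
    if len(first) != len(second):
        raise ValueError("each repeated case needs exactly two estimates")
    return [abs(a - b) for a, b in zip(first, second)]


def bin_differences(diffs):
    """Count differences in the ranges 0-10, 11-20, and >20 points.
    Assumption: estimates are whole percentages, so the bins do not overlap."""
    bins = {"0-10": 0, "11-20": 0, ">20": 0}
    for d in diffs:
        if d <= 10:
            bins["0-10"] += 1
        elif d <= 20:
            bins["11-20"] += 1
        else:
            bins[">20"] += 1
    return bins


# Hypothetical estimates (in %) from one judge for five repeated cases.
first_pass = [30, 55, 80, 10, 65]
second_pass = [35, 40, 80, 35, 60]

diffs = repeat_differences(first_pass, second_pass)
bins = bin_differences(diffs)
print(diffs)
print(bins)
```

Smaller differences indicate greater consistency; a judge whose differences all fall in the lowest range would count as highly consistent under this operationalization.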
1.3 Connections Among Standards
We know of no studies that have examined good judgment
using all four of the standards. One study examined three
of them (Weiss, Brennan, Thomas, Kirlik & Miller, 2009).
Using a golf putting task with experienced golfers, they found
a strong correlation between accuracy and their combined
measure of consistency and discrimination.
In a handful of studies, researchers have investigated connections between the gold standards, but results have been
mixed. Most studies show weak connections (Wright &
Ayton, 1987a; 1987b; Wright, Rowe, Bolger & Gammack,
1994; Adam & Reyna, 2005; Weaver & Stewart, 2012; Dunwoody et al., 2005), although one could argue that Weiss
et al.’s results are an exception. For the most part, however, measures of coherence and correspondence are loosely
coupled.
Yet under the right conditions, loose couplings may
tighten. In research on forecasting, for example, some studies
have given subjects a battery of coherence tasks prior to having them make forecasts about uncertain events. The results
show that aggregate forecasts are more accurate (i.e., more
correspondent) when subjects’ predictions are weighted by
their scores on the coherence tasks, rather than combined
with a simple unweighted average. Indeed, coherence-based
weighting schemes have been shown to improve the accuracy
of the aggregate by more than 30%, relative to a simple mean
(Predd, Osherson, Kulkarni & Poor, 2008; Wang, Kulkarni,
Poor & Osherson, 2011; Tsai & Kirlik, 2012; Karvetski,
Olson, Mandel & Twardy, 2013). Perhaps correspondence
and coherence are intertwined, but we’ve been looking in the
wrong places.
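The contrast between a simple mean and a coherence-weighted aggregate can be illustrated with a short sketch. The weighting rule used here, a weight of 1 / (1 + incoherence score), is an assumption made purely for illustration; the cited studies define their own weighting schemes, and the forecasts and scores below are hypothetical.

```python
def simple_mean(forecasts):
    """Unweighted average of probability forecasts."""
    return sum(forecasts) / len(forecasts)


def coherence_weighted_mean(forecasts, incoherence):
    """Weighted average in which more coherent judges count for more.

    Assumption (illustrative only): each judge's weight is
    1 / (1 + incoherence score), so a perfectly coherent judge
    (score 0) gets weight 1 and weights shrink as incoherence grows.
    """
    if len(forecasts) != len(incoherence):
        raise ValueError("one incoherence score is needed per forecast")
    if any(s < 0 for s in incoherence):
        raise ValueError("incoherence scores must be non-negative")
    weights = [1.0 / (1.0 + s) for s in incoherence]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total


# Hypothetical probability forecasts for one event from three judges,
# with hypothetical incoherence scores from a prior battery of tasks.
forecasts = [0.9, 0.6, 0.2]
incoherence = [0.0, 1.0, 3.0]

unweighted = simple_mean(forecasts)
weighted = coherence_weighted_mean(forecasts, incoherence)
print(round(unweighted, 3), round(weighted, 3))
```

In this example the weighted aggregate moves toward the forecast of the most coherent judge. Whether such a shift improves accuracy is an empirical question that depends on how strongly coherence and correspondence are coupled in the task at hand.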
1.4 Superforecasters
In previous research, we have examined a group of indivi (...truncated)