An IRT forecasting model: linking proper scoring rules to item response theory
Judgment and Decision Making, Vol. 12, No. 2, March 2017, pp. 90–103
Yuanchao Emily Bo∗
David V. Budescu†
Charles Lewis†
Philip E. Tetlock‡
Barbara Mellers‡
Abstract
This article proposes an Item Response Theoretical (IRT) forecasting model that incorporates proper scoring rules and
provides evaluations of forecasters’ expertise in relation to the features of the specific questions they answer. We illustrate the
model using geopolitical forecasts obtained by the Good Judgment Project (GJP) (see Mellers, Ungar, Baron, Ramos, Gurcay,
Fincher, Scott, Moore, Atanasov, Swift, Murray, Stone & Tetlock, 2014). The expertise estimates from the IRT model, which
take into account variation in the difficulty and discrimination power of the events, capture the underlying construct being
measured and are highly correlated with the forecasters’ Brier scores. Furthermore, our expertise estimates based on the first
three years of the GJP data are better predictors of both the forecasters’ fourth-year Brier scores and their activity level than
the overall Brier scores obtained and Merkle’s (2016) predictions, based on the same period. Lastly, we discuss the benefits of
using event-characteristic information in forecasting.
Keywords: IRT, Forecasting, Brier scores, Proper Scoring Rules, Good Judgment Project, Gibbs sampling.
1 Introduction
Scoring rules are useful tools for evaluating probability
forecasters. These mechanisms assign numerical values
based on the proximity of the forecast to the event, or value,
when it materializes (e.g., Gneiting & Raftery, 2007). A
scoring rule is proper if it elicits a forecaster’s true belief as
a probabilistic forecast, and it is strictly proper if it uniquely
elicits an expert’s true beliefs. Winkler (1967), Winkler
& Murphy (1968), Murphy & Winkler (1970), and Bickel
(2007) discuss scoring rules and their properties.
Consider the assessment of a probability distribution by a
forecaster $i$ over a partition of $n$ mutually exclusive events,
where $n > 1$. Let $p_i = (p_{i1}, \ldots, p_{in})$ be a latent vector
of probabilities representing the forecaster’s private beliefs,
where $p_{ij}$ is the probability the $i$th forecaster assigns to event
$j$, and the sum of the probabilities is equal to 1. The forecaster’s overt (stated) probabilities for the $n$ events are represented by the vector $r_i = (r_{i1}, \ldots, r_{in})$, and their sum is
also equal to 1. The key feature of a strictly proper scoring
rule is that forecasters maximize their subjectively expected
scores if, and only if, they state their true probabilities such
that $r_i = p_i$.
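A quick numerical check of this property can be sketched in Python for a binary partition. The sketch assumes the quadratic score of Equation 1 with a = 0 and b = 1; the beliefs p = (0.7, 0.3), the search grid, and all function names are illustrative assumptions, not part of the original analysis.

```python
def quadratic_score(r, k):
    """Quadratic score 2*r_k - r.r earned when event k occurs (a = 0, b = 1)."""
    return 2.0 * r[k] - sum(x * x for x in r)


def expected_score(p, r):
    """Subjectively expected score of stated forecast r under private beliefs p."""
    return sum(p_k * quadratic_score(r, k) for k, p_k in enumerate(p))


# Hypothetical private beliefs over two mutually exclusive events.
p = [0.7, 0.3]

# Candidate stated probabilities for the first event, in steps of 0.01.
grid = [i / 100 for i in range(101)]
expected = [expected_score(p, [x, 1.0 - x]) for x in grid]

# The stated probability that maximizes the expected score.
best_index = max(range(len(grid)), key=expected.__getitem__)
best_r1 = grid[best_index]
print(best_r1)
```

Under these assumptions the maximizing stated probability coincides with the private belief for the first event, which is what strict propriety requires.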
Researchers have devised several families of proper scoring rules (Bickel, 2007; Merkle & Steyvers, 2013). The ones
most often employed in practice include the Brier/quadratic
score, the logarithmic score, and the spherical score, where:
Probabilistic forecasting is the process of making formal
statements about the likelihood of future events based on
what is known about antecedent conditions and the causal
and stochastic processes operating on them. Assessing the
accuracy of probabilistic forecasts is difficult for a variety of
reasons. First, such forecasts typically provide a probability
distribution with respect to a single outcome so, methodologically speaking, the outcome cannot falsify the forecast.
Second, some forecasts relate to outcomes of events whose
“ground truth” is hard to determine (e.g., Armstrong, 2001;
Lehner, Michelson, Adelman & Goodman, 2012; Mandel
& Barnes, 2014; Tetlock, 2005). Finally, forecasts often
address outcomes that will only be resolved in the distant
future (Mandel & Barnes, 2014).
The authors thank Drs. Edward Merkle, Michael Lee, Lyle Ungar and
one anonymous reviewer for their comments.
This research was supported by the Intelligence Advanced Research
Projects Activity (IARPA) via the Department of Interior National Business
Center contract number D11PC20061. The U.S. Government is authorized
to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
Data for the full study are available at https://dataverse.harvard.edu/dataverse/gjp.
Disclaimer: The views and conclusions expressed herein are those of the
authors and should not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of IARPA, DoI/NBC,
or the U.S. Government.
Copyright: © 2017. The authors license this article under the terms of
the Creative Commons Attribution 3.0 License.
∗ Northwest Evaluation Association (NWEA). Email:.
† Fordham University
‡ University of Pennsylvania
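$$
\begin{aligned}
\text{Brier score:} \quad & Q_i(\mathbf{r}) = a + b\,(2 r_i - \mathbf{r}\cdot\mathbf{r}) \\
\text{Logarithmic score:} \quad & L_i(\mathbf{r}) = a + b \ln(r_i) \\
\text{Spherical score:} \quad & S_i(\mathbf{r}) = a + b\,\frac{r_i}{(\mathbf{r}\cdot\mathbf{r})^{1/2}}
\end{aligned}
\tag{1}
$$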
where a and b (b > 0) are arbitrary constants (Toda, 1963).
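The three rules in Equation 1 can be written out directly. The following is a minimal Python sketch, assuming r is the vector of stated probabilities and i is the index of the event that occurred; the function names and the example forecast are illustrative assumptions.

```python
import math


def brier_score(r, i, a=0.0, b=1.0):
    """Quadratic score: a + b * (2 * r_i - r.r)."""
    return a + b * (2.0 * r[i] - sum(x * x for x in r))


def logarithmic_score(r, i, a=0.0, b=1.0):
    """Logarithmic score: a + b * ln(r_i). Undefined when r_i = 0."""
    return a + b * math.log(r[i])


def spherical_score(r, i, a=0.0, b=1.0):
    """Spherical score: a + b * r_i / sqrt(r.r)."""
    return a + b * r[i] / math.sqrt(sum(x * x for x in r))


# Hypothetical stated forecast over two events; event 0 occurs.
r = [0.8, 0.2]
for rule in (brier_score, logarithmic_score, spherical_score):
    print(rule.__name__, rule(r, 0))
```

In this form (with b > 0) all three scores are positively oriented: a forecaster who placed more probability on the event that occurred receives a higher score.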
developed as an alternative to classical test theory (Lord
& Novick, 1968). It describes how performance relates to
ability measured by the items on the test and features of
these items. In other words, it models the relation between
test takers’ abilities and psychometric properties of the items.
One of the most popular IRT models for binary items is the
2-parameter logistic model:
$P_j(\theta_i) =$

Figure 1. Relationship between probability predictions and
Brier scores in events with binary outcomes. Horizontal axis: probability prediction (0.0 to 1.0); vertical axis: Brier score (0.0 to 1.0); separate curves for Outcome = 0 and Outcome = 1.
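For readers unfamiliar with the 2-parameter logistic model mentioned above, one common textbook form gives the probability of a correct response as 1 / (1 + exp(−a_j(θ_i − b_j))), with a discrimination parameter a_j and a difficulty parameter b_j for item j. The Python sketch below assumes that form; the exact parameterization used in the article may differ, and the parameter names are assumptions.

```python
import math


def two_pl(theta, discrimination, difficulty):
    """Probability of a correct response under a textbook 2PL model.

    theta          -- ability of the test taker (assumed name)
    discrimination -- item slope, often written a_j (assumed name)
    difficulty     -- item location, often written b_j (assumed name)
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# When ability equals difficulty, the probability is exactly one half.
print(two_pl(0.0, 1.0, 0.0))

# A more discriminating item separates abilities above and below its
# difficulty more sharply.
print(two_pl(1.0, 0.5, 0.0), two_pl(1.0, 2.0, 0.0))
```

In this form, higher ability raises the probability of success on every item, while the discrimination parameter controls how quickly that probability changes around the item's difficulty.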
Without any loss of generality, we set a = 0 and b = 1
in all our analyses. Figure 1 illustrates the relationship between probability predictions and Brier scores for binary
cases (blue: outcome = 0, the event does not happen;
red: outcome = 1, the event does happen). Brier scores measure the
mean square difference between the predicted probability
assigned to the possible outcomes and the actual outcome.
Thus, lower Brier scores indicate better calibration of a set
of predictions.
In addition to motivating forecasters (Gneiting & Raftery,
2007), these scores provide a means of assessing relative
accuracy as they reflect the “quality” or “goodness” of the
probabilistic forecasts: The lower the mean Brier score is for
a set of predictions, the better the predictions.
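The mean Brier score for a set of binary forecasts can be sketched as follows. This assumes the squared-difference form (p − o)², where p is the probability assigned to the event and o is 1 if it occurred and 0 otherwise; the function names and the example forecasts are illustrative assumptions.

```python
def brier_binary(p, outcome):
    """Squared difference between a stated probability and a 0/1 outcome."""
    return (p - outcome) ** 2


def mean_brier(forecasts, outcomes):
    """Mean Brier score over a set of binary forecasts (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    total = sum(brier_binary(p, o) for p, o in zip(forecasts, outcomes))
    return total / len(forecasts)


# Hypothetical forecasts for three events and their resolved outcomes.
forecasts = [0.9, 0.2, 0.6]
outcomes = [1, 0, 0]
print(mean_brier(forecasts, outcomes))
```

Under this form a perfect forecaster scores 0, and a forecaster who always states 0.5 scores 0.25 regardless of the outcomes.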
Typically, scores do not take into account the characteristics of the events, or class of events, being forecast. Consider,
for example, a person predicting the results of games to be
played between teams in a sports league (e.g., National Football League, National Basketball Association). A probability
forecast, p, earns the same score if it refers to the outcome
of a game between the best and worst teams in the league (a
relatively easy prediction) or between two evenly matched
ones (a more difficult prediction). Similarly, they give equal
credit for assigning the same probabilities when predicting
political r (...truncated)