An IRT forecasting model: linking proper scoring rules to item response theory

Article PDF: http://journal.sjdm.org/16/16218/jdm16218.pdf

Judgment and Decision Making, Vol. 12, No. 2, March 2017, pp. 90–103

Yuanchao Emily Bo∗, David V. Budescu†, Charles Lewis†, Philip E. Tetlock‡, Barbara Mellers‡

∗ Northwest Evaluation Association (NWEA). † Fordham University. ‡ University of Pennsylvania.

Abstract

This article proposes an Item Response Theoretical (IRT) forecasting model that incorporates proper scoring rules and provides evaluations of forecasters’ expertise in relation to the features of the specific questions they answer. We illustrate the model using geopolitical forecasts obtained by the Good Judgment Project (GJP) (see Mellers, Ungar, Baron, Ramos, Gurcay, Fincher, Scott, Moore, Atanasov, Swift, Murray, Stone & Tetlock, 2014). The expertise estimates from the IRT model, which take into account variation in the difficulty and discrimination power of the events, capture the underlying construct being measured and are highly correlated with the forecasters’ Brier scores. Furthermore, our expertise estimates based on the first three years of the GJP data are better predictors of both the forecasters’ fourth year Brier scores and their activity level than the overall Brier scores obtained and Merkle’s (2016) predictions, based on the same period. Lastly, we discuss the benefits of using event-characteristic information in forecasting.

Keywords: IRT, Forecasting, Brier scores, Proper Scoring Rules, Good Judgment Project, Gibbs sampling.

The authors thank Drs. Edward Merkle, Michael Lee, Lyle Ungar and one anonymous reviewer for their comments. This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number D11PC20061. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon. Data for the full study are available at https://dataverse.harvard.edu/dataverse/gjp. Disclaimer: The views and conclusions expressed herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Copyright: © 2017. The authors license this article under the terms of the Creative Commons Attribution 3.0 License.

1 Introduction

Probabilistic forecasting is the process of making formal statements about the likelihood of future events based on what is known about antecedent conditions and the causal and stochastic processes operating on them. Assessing the accuracy of probabilistic forecasts is difficult for a variety of reasons. First, such forecasts typically provide a probability distribution with respect to a single outcome, so, methodologically speaking, the outcome cannot falsify the forecast. Second, some forecasts relate to outcomes of events whose “ground truth” is hard to determine (e.g., Armstrong, 2001; Lehner, Michelson, Adelman & Goodman, 2012; Mandel & Barnes, 2014; Tetlock, 2005). Finally, forecasts often address outcomes that will only be resolved in the distant future (Mandel & Barnes, 2014).

Scoring rules are useful tools for evaluating probability forecasters. These mechanisms assign numerical values based on the proximity of the forecast to the event, or value, when it materializes (e.g., Gneiting & Raftery, 2007). A scoring rule is proper if it elicits a forecaster’s true belief as a probabilistic forecast, and it is strictly proper if it uniquely elicits an expert’s true beliefs. Winkler (1967), Winkler & Murphy (1968), Murphy & Winkler (1970), and Bickel (2007) discuss scoring rules and their properties.

Consider the assessment of a probability distribution by a forecaster $i$ over a partition of $n$ mutually exclusive events, where $n > 1$. Let $\mathbf{p}_i = (p_{i1}, \ldots, p_{in})$ be a latent vector of probabilities representing the forecaster’s private beliefs, where $p_{ij}$ is the probability the $i$th forecaster assigns to event $j$, and the sum of the probabilities is equal to 1. The forecaster’s overt (stated) probabilities for the $n$ events are represented by the vector $\mathbf{r}_i = (r_{i1}, \ldots, r_{in})$, and their sum is also equal to 1. The key feature of a strictly proper scoring rule is that forecasters maximize their subjectively expected scores if, and only if, they state their true probabilities such that $\mathbf{r}_i = \mathbf{p}_i$. Researchers have devised several families of proper scoring rules (Bickel, 2007; Merkle & Steyvers, 2013). The ones most often employed in practice include the Brier/quadratic score, the logarithmic score, and the spherical score, where:

$$
\begin{aligned}
\text{Brier score:} \quad & Q_i(\mathbf{r}) = a + b\,(2r_i - \mathbf{r}\cdot\mathbf{r}) \\
\text{Logarithmic score:} \quad & L_i(\mathbf{r}) = a + b \ln(r_i) \\
\text{Spherical score:} \quad & S_i(\mathbf{r}) = a + b\,\frac{r_i}{(\mathbf{r}\cdot\mathbf{r})^{1/2}}
\end{aligned}
\tag{1}
$$

where $a$ and $b$ ($b > 0$) are arbitrary constants (Toda, 1963). Without any loss of generality, we set $a = 0$ and $b = 1$ in all our analyses.

[…] developed as an alternative to classical test theory (Lord & Novick, 1968). It describes how performance relates to ability measured by the items on the test and features of these items. In other words, it models the relation between test takers’ abilities and psychometric properties of the items. One of the most popular IRT models for binary items is the 2-parameter logistic model: $P_j(\theta_i) =$ […]

Figure 1. Relationship between probability predictions (horizontal axis) and Brier scores (vertical axis) in events with binary outcomes (Outcome = 0 and Outcome = 1).
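The three scoring rules in Equation 1, and the defining property of strict propriety, can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names, the zero-based index `i` (assumed here to be the position of the event that actually occurs), and the grid search in `best_binary_report` are all assumptions introduced for the example. In this positively oriented form (with $b > 0$) larger scores are better; the squared-error form of the Brier score, in which lower values are better, is not what the sketch computes.

```python
import math


def brier_score(r, i, a=0.0, b=1.0):
    """Quadratic (Brier) rule from Equation 1: a + b * (2 * r_i - r.r)."""
    return a + b * (2.0 * r[i] - sum(x * x for x in r))


def log_score(r, i, a=0.0, b=1.0):
    """Logarithmic rule from Equation 1: a + b * ln(r_i). Requires r[i] > 0."""
    return a + b * math.log(r[i])


def spherical_score(r, i, a=0.0, b=1.0):
    """Spherical rule from Equation 1: a + b * r_i / sqrt(r.r)."""
    return a + b * r[i] / math.sqrt(sum(x * x for x in r))


def expected_score(rule, p, r):
    """Subjectively expected score of stating r when the private beliefs are p."""
    return sum(p_k * rule(r, k) for k, p_k in enumerate(p))


def best_binary_report(rule, p1, steps=99):
    """Hypothetical helper: grid-search the stated probability for event 1
    (0.01, 0.02, ..., 0.99 by default) that maximizes the expected score."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return max(
        grid,
        key=lambda r1: expected_score(rule, [p1, 1.0 - p1], [r1, 1.0 - r1]),
    )


if __name__ == "__main__":
    # For a forecaster whose private belief is 0.7, print the grid point
    # that maximizes the expected score under each rule.
    for rule in (brier_score, log_score, spherical_score):
        print(rule.__name__, best_binary_report(rule, 0.7))
```

Under all three rules the grid search returns the private belief itself, which is the "if, and only if" property described above: a forecaster who shades the stated probability away from the true belief lowers the expected score.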
Figure 1 illustrates the relationship between probability predictions and Brier scores for binary cases (outcome = 0, the event does not happen, in blue; outcome = 1, the event happens, in red). Brier scores measure the mean squared difference between the predicted probability assigned to the possible outcomes and the actual outcome. Thus, lower Brier scores indicate better calibration of a set of predictions. In addition to motivating forecasters (Gneiting & Raftery, 2007), these scores provide a means of assessing relative accuracy as they reflect the “quality” or “goodness” of the probabilistic forecasts: The lower the mean Brier score is for a set of predictions, the better the predictions. Typically, scores do not take into account the characteristics of the events, or class of events, being forecast. Consider, for example, a person predicting the results of games to be played between teams in a sports league (e.g., National Football League, National Basketball Association). A probability forecast, p, earns the same score whether it refers to the outcome of a game between the best and worst teams in the league (a relatively easy prediction) or between two evenly matched ones (a more difficult prediction). Similarly, they give equal credit for assigning the same probabilities when predicting political r (...truncated)