Beyond statistical inference: A decision theory for science

Article PDF:

https://link.springer.com/content/pdf/10.3758%2FBF03193962.pdf

PETER R. KILLEEN
Arizona State University, Tempe, Arizona

Traditional null hypothesis significance testing does not yield the probability of the null or its alternative and, therefore, cannot logically ground scientific decisions. The decision theory proposed here calculates the expected utility of an effect on the basis of (1) the probability of replicating it and (2) a utility function on its size. It takes significance tests, which place all value on the replicability of an effect and none on its magnitude, as a special case, one in which the cost of a false positive is revealed to be an order of magnitude greater than the value of a true positive. More realistic utility functions credit both replicability and effect size, integrating them for a single index of merit. The analysis incorporates opportunity cost and is consistent with alternate measures of effect size, such as r² and information transmission, and with Bayesian model selection criteria. An alternate formulation is functionally equivalent to the formal theory, transparent, and easy to compute.

The research was supported by NSF Grant IBN 0236821 and NIMH Grant 1R01MH066860. I thank Rob Nosofsky and Michael Lee for [...]. Correspondence concerning this article should be addressed to P. R. Killeen, Department of Psychology, Arizona State University, Box 1104, Tempe, AZ 85287-1104.

Null Hypothesis Statistical Tests

The .05 yardstick of null hypothesis statistical tests (NHSTs) was based on a suggestion by Fisher and is typically implemented as the Neyman-Pearson criterion (NPc; see Gigerenzer, 1993, among many others). The NPc stipulates a criterion for the rejection of a null hypothesis that keeps the probability of incorrectly rejecting the null, a false positive or Type I error, no greater than α. To know whether this is a rational criterion requires an estimate of the expected costs and benefits it delivers.
Table 1 shows the situation for binary decisions, such as publication of research findings, with errors and successes of commission in the top row and successes and errors of omission in the bottom row. To calculate the expected utility of actions on the basis of the NPc, assign costs and benefits to each cell and multiply these by the probability of the null and its alternative (here, assumed to be complementary). The sums across rows give the expected utilities of action appropriate to the alternative and to the null. It is rational to act when the former is greater than the latter and, otherwise, to refrain from action. Alas, the NPc cannot be derived from such a canonical decision theory. There are two reasons for this.

1. NHST provides neither the probability of the alternative p(A) nor the probability of the null p(N): "Such a test of significance does not authorize us to make any statement about the hypothesis in question in terms of mathematical probability" (Fisher, 1959, p. 35). NHST gives the probability of a statistic x more extreme than the one obtained, D, under the assumption that the null is true, p(x ≥ D|N). A rational decision, however, requires the probability that the null is true in light of the statistic, p(N|D). Going from p(D|N) to p(N|D) is the inverse problem. The calculation of p(N|D) requires that we know the prior probability of the null and the prior probability of the statistic, and that we combine them according to Bayes's theorem. Those priors are difficult to estimate. Furthermore, many statisticians are loath to invoke Bayes for fear of rendering probabilities subjective, despite reassurances from Bayesians, M. D. Lee and Wagenmakers (2005) among the latest. The problem has roots in our use of an inferential calculus that is based on such parameters as the means of the hypothetical experimental and control populations, μ_E and μ_C, and their equality under the null (Geisser, 1992).
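The canonical calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure taken from the article: the names (Payoffs, posterior_null, expected_utilities, should_act) are hypothetical, the null and alternative are taken to be complementary, and the prior, the likelihoods, and the four payoffs are made-up numbers chosen to show the arithmetic. The sketch first inverts p(D|N) into p(N|D) with Bayes's theorem, then weights each cell of the decision matrix by the resulting probabilities and compares the two row sums.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Payoffs:
    """Hypothetical utilities for the four cells of a binary decision matrix."""
    act_given_alt: float    # true positive,  v(A|A)
    act_given_null: float   # false positive, v(A|N)
    balk_given_null: float  # true negative,  v(B|N)
    balk_given_alt: float   # false negative, v(B|A)


def posterior_null(prior_null: float, lik_null: float, lik_alt: float) -> float:
    """Bayes's theorem: p(N|D) from the prior p(N) and likelihoods p(D|N), p(D|A).

    Assumes the null and the alternative are complementary, so p(A) = 1 - p(N).
    """
    evidence = prior_null * lik_null + (1.0 - prior_null) * lik_alt
    return prior_null * lik_null / evidence


def expected_utilities(p_null: float, v: Payoffs) -> Tuple[float, float]:
    """Return (expected utility of acting, expected utility of balking)."""
    p_alt = 1.0 - p_null
    eu_act = p_alt * v.act_given_alt + p_null * v.act_given_null
    eu_balk = p_null * v.balk_given_null + p_alt * v.balk_given_alt
    return eu_act, eu_balk


def should_act(p_null: float, v: Payoffs) -> bool:
    """Act only when acting has the greater expected utility."""
    eu_act, eu_balk = expected_utilities(p_null, v)
    return eu_act > eu_balk


# Illustrative numbers only; none of these values come from the article.
payoffs = Payoffs(act_given_alt=1.0, act_given_null=-10.0,
                  balk_given_null=0.5, balk_given_alt=-0.5)
p_null = posterior_null(prior_null=0.5, lik_null=0.02, lik_alt=0.40)
eu_act, eu_balk = expected_utilities(p_null, payoffs)
print(f"p(N|D) = {p_null:.3f}, EU(act) = {eu_act:.3f}, EU(balk) = {eu_balk:.3f}")
print("act" if should_act(p_null, payoffs) else "balk")
```

Every input to this sketch (the prior, the likelihood under the alternative, and the payoffs) is a quantity that NHST alone does not supply, which is the difficulty the text goes on to describe.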
To make probability statements about parameters requires a solution to the inverse problem. Fisher invested decades searching for an alternative inferential calculus that required neither parameters nor prior distributions (Seidenfeld, 1979). Neyman and Pearson (1933) convinced a generation that they could avoid the inverse problem by behaving, when p < α, as though the null was false without changing their belief about the null; and by assuming that which needed proving: "It may often be proved that if we behave according to such a rule, then in the long run we shall reject H when it is true not more than, say, one in a hundred times" (Neyman, 1960, p. 290, emphasis added). When the null is false, inferences based on its truth are counterfactual conditionals from which anything follows, including psychologists' long, illicit relationship with NHST. The null has been recast as an interval estimate in more useful ways (e.g., Jones & Tukey, 2000), but little attention has been paid to the alternative hypothesis, generally treated as an anti-null (see Greenwald's [1975] seminal analyses). Despite these difficulties, the NPc constitutes the most common test for acceptability of research.

[Table 1 row labels: Act for the alternative (A); Balk (B), refrain from action]

2. If these tactics do not solve the problem of assigning probabilities to outcomes, they do not even address the problem of assigning utilities to the outcomes, an assignment at the core of a principled decision theory. Observation of practice permits us to rank the values implicit in scientific journals. Most journals will not publish results that the editor deems trivial, no matter how small the p value. This means that the value of a true positive, the value of an action given the truth of the alternative, v(A|A), must be substantially greater than zero. The small probability allowed a Type I error, p(A|N) < .05, reflects a substantial cost associated with false alarms, the onus of publishing a nonreplicable result.
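The long-run behavior that Neyman describes, and the small Type I error rate it licenses, can be illustrated with a short simulation. The sketch below is an assumption-laden illustration and not anything from the article: it uses a two-sided z test with known standard deviation, draws every sample from a population in which the null is true, and counts how often p falls below α. The function names, sample size, trial count, and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided p value of a z test of 'population mean is 0' with known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha=0.05, n=20, trials=20_000, seed=1):
    """Fraction of experiments that reject the null when the null is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # Every sample comes from a population whose true mean is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"long-run false-positive rate under the null: {rate:.3f}")
```

The simulated rate should sit close to α, but only because the null was true by construction. The simulation says nothing about p(N|D) for any single result, nor about what the four outcomes are worth.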
The remaining outcomes are of intermediate value. A finding of no effect is difficult to publish, so the value of a true negative, v(B|N), must be less than that of a true positive. v(B|N) must also be greater than the value of a Type II error (a false negative), v(B|A), which is primarily a matter of chagrin for the scientist. Thus, v(True Positive) > v(True Negative) > v(False Negative) > v(False Positive), with the last two being negative. But a mere ranking is inadequate for an informed decision on this most central issue: what research should get published, to become part of the canon.

BEYOND NHST: DTS

The decision theory for science (DTS) proposed here constitutes a well-defined alternative to NHST. DTS's probability module measures replicability, not the improbability of data. Its utility module is based on the information provided by a measurement or manipulation. Together these provide (1) a rational basis for action, (2) a demonstrated ability to recapture current standards, and (3) flexibility for applications in which the payoff matrix differs from the implicit matrices currently regnant. The exposition is couched in terms of editorial actions, since they play a central role in main (...truncated)