Beyond statistical inference: A decision theory for science
PETER R. KILLEEN
The research was supported by NSF Grant IBN 0236821 and NIMH Grant 1R01MH066860. I thank Rob Nosofsky and Michael Lee for ing this article should be addressed to P. R. Killeen, Department of Psychology, Arizona State University, Box 1104, Tempe, AZ 85287-1104.
Arizona State University, Tempe, Arizona
Traditional null hypothesis significance testing does not yield the probability of the null or its alternative and, therefore, cannot logically ground scientific decisions. The decision theory proposed here calculates the expected utility of an effect on the basis of (1) the probability of replicating it and (2) a utility function on its size. It takes significance tests (which place all value on the replicability of an effect and none on its magnitude) as a special case, one in which the cost of a false positive is revealed to be an order of magnitude greater than the value of a true positive. More realistic utility functions credit both replicability and effect size, integrating them for a single index of merit. The analysis incorporates opportunity cost and is consistent with alternate measures of effect size, such as r² and information transmission, and with Bayesian model selection criteria. An alternate formulation is functionally equivalent to the formal theory, transparent, and easy to compute.
Null Hypothesis Statistical Tests
The .05 yardstick of null hypothesis statistical tests
(NHSTs) was based on a suggestion by Fisher and is
typically implemented as the Neyman-Pearson criterion (NPc;
see Gigerenzer, 1993, among many others). The NPc
stipulates a criterion for the rejection of a null hypothesis that
keeps the probability of incorrectly rejecting the null, a
false positive or Type I error, no greater than α. To know
whether this is a rational criterion requires an estimate of
the expected costs and benefits it delivers. Table 1 shows
the situation for binary decisions, such as publication of
research findings, with errors and successes of
commission in the top row and successes and errors of omission
in the bottom row. To calculate the expected utility of
actions on the basis of the NPc, assign costs and benefits to
each cell and multiply these by the probability of the null
and its alternative (here, assumed to be complementary).
The sums across rows give the expected utilities of action
appropriate to the alternative and to the null. It is rational
to act when the former is greater than the latter and,
otherwise, to refrain from action.
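
The calculation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the payoff values are hypothetical and are not taken from the article, and the probability of the null is simply supplied as an input, as the canonical decision theory requires.

```python
# Expected utility for a binary decision with a 2 x 2 payoff matrix.
# Keys are (decision, state): decision "A" = act, "B" = balk;
# state "N" = null true, "A" = alternative true.
# All payoff values below are hypothetical, chosen only for illustration.
HYPOTHETICAL_PAYOFFS = {
    ("A", "A"): 1.0,    # true positive
    ("A", "N"): -10.0,  # false positive (Type I error)
    ("B", "N"): 0.5,    # true negative
    ("B", "A"): -0.5,   # false negative (Type II error)
}


def expected_utilities(p_null, payoffs):
    """Return (EU of acting, EU of balking) given p(N) and a payoff matrix."""
    p_alt = 1.0 - p_null  # null and alternative assumed complementary
    eu_act = p_null * payoffs[("A", "N")] + p_alt * payoffs[("A", "A")]
    eu_balk = p_null * payoffs[("B", "N")] + p_alt * payoffs[("B", "A")]
    return eu_act, eu_balk


def should_act(p_null, payoffs):
    """Act only when the expected utility of acting exceeds that of balking."""
    eu_act, eu_balk = expected_utilities(p_null, payoffs)
    return eu_act > eu_balk


if __name__ == "__main__":
    for p_null in (0.5, 0.1, 0.01):
        print(p_null, expected_utilities(p_null, HYPOTHETICAL_PAYOFFS),
              should_act(p_null, HYPOTHETICAL_PAYOFFS))
```

The sketch makes the dependence explicit: the decision cannot be computed without both a probability for the null and a utility for every cell of the matrix.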
Alas, the NPc cannot be derived from such a canonical
decision theory. There are two reasons for this.
1. NHST provides neither the probability of the alternative,
p(A), nor the probability of the null, p(N): "Such a test of
significance does not authorize us to make any statement
about the hypothesis in question in terms of mathematical
probability" (Fisher, 1959, p. 35). NHST gives the
probability of a statistic x more extreme than the one obtained,
D, under the assumption that the null is true, p(x ≥ D | N).
A rational decision, however, requires the probability that
the null is true in light of the statistic, p(N|D). Going from
p(D|N) to p(N|D) is the inverse problem. The calculation
of p(N|D) requires that we know the prior probability of
the null and the prior probability of the statistic, and that we
combine them according to Bayes's theorem. Those priors are
difficult to estimate. Furthermore, many statisticians are
loath to invoke Bayes for fear of rendering probabilities
subjective, despite reassurances from Bayesians, M. D.
Lee and Wagenmakers (2005) among the latest. The
problem has roots in our use of an inferential calculus that is
based on such parameters as the means of the hypothetical
experimental and control populations, μ_E and μ_C, and their
equality under the null (Geisser, 1992). To make
probability statements about parameters requires a solution to the
inverse problem. Fisher invested decades searching for an
alternative inferential calculus that required neither
parameters nor prior distributions (Seidenfeld, 1979).
Neyman and Pearson (1933) convinced a generation that they
could avoid the inverse problem by behaving, when p < α,
as though the null was false without changing their belief
about the null; and by assuming that which needed
proving: "It may often be proved that if we behave according to
such a rule, then in the long run we shall reject H when it is
true not more than, say, one in a hundred times" (Neyman,
1960, p. 290, emphasis added). When the null is false,
inferences based on its truth are counterfactual conditionals
from which anything follows, including psychologists'
long, illicit relationship with NHST.
The null has been recast as an interval estimate in more
useful ways (e.g., Jones & Tukey, 2000), but little
attention has been paid to the alternative hypothesis, generally
treated as an anti-null (see Greenwald's [1975] seminal
analyses). Despite these difficulties, the NPc constitutes
the most common test for acceptability of research.

Table 1
                                 Null true (N)                    Alternative true (A)
Act for the alternative (A)      False positive (Type I error)    True positive
Balk (B); refrain from action    True negative                    False negative (Type II error)

2. If these tactics do not solve the problem of assigning
probabilities to outcomes, they do not even address the
problem of assigning utilities to the outcomes, an
assignment at the core of a principled decision theory.
Observation of practice permits us to rank the values implicit in
scientific journals. Most journals will not publish results that
the editor deems trivial, no matter how small the p value.
This means that the value of a true positive (the value
of an action, given the truth of the alternative, v(A|A))
must be substantially greater than zero. The small
probability allowed a Type I error, p(A|N) < .05, reflects
a substantial cost associated with false alarms, the onus
of publishing a nonreplicable result. The remaining
outcomes are of intermediate value. "No effect" is difficult
to publish, so the value of a true negative, v(B|N), must
be less than that of a true positive. v(B|N) must also be
greater than the value of a Type II error (a false negative,
v(B|A)), which is primarily a matter of chagrin for the
scientist. Thus, v(True Positive) > v(True Negative) > v(False
Negative) > v(False Positive), with the last two being
negative. But a mere ranking is inadequate for an informed
decision on this most central issue: what research should get
published, to become part of the canon.
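
The inverse problem raised in the first point above can be illustrated with a short Python sketch of Bayes's theorem. It is a simplification under stated assumptions, not the article's procedure: the function name is hypothetical, the null and alternative are treated as two simple complementary hypotheses, the probability of the data under the alternative is an extra input that NHST does not supply, and the numerical values are invented for illustration.

```python
def posterior_null(p_d_given_null, p_d_given_alt, prior_null):
    """Return p(N|D) by Bayes's theorem for complementary N and A.

    p_d_given_null: p(D|N), probability of the data if the null is true
    p_d_given_alt:  p(D|A), probability of the data if the alternative is true
                    (an assumed input; NHST does not provide it)
    prior_null:     p(N), prior probability of the null
    """
    prior_alt = 1.0 - prior_null
    # Prior probability of the statistic, p(D), by the law of total probability.
    p_d = p_d_given_null * prior_null + p_d_given_alt * prior_alt
    return p_d_given_null * prior_null / p_d


if __name__ == "__main__":
    # Same data, different priors: the posterior for the null shifts with p(N).
    for prior in (0.1, 0.5, 0.9):
        print(prior, posterior_null(0.05, 0.5, prior))
```

Holding p(D|N) fixed at a hypothetical .05, the posterior probability of the null still varies widely with the prior, which is why p(D|N) alone cannot stand in for p(N|D).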
BEYOND NHST: DTS
The decision theory for science (DTS) proposed here
constitutes a well-defined alternative to NHST. DTS's
probability module measures replicability, not the
improbability of data. Its utility module is based on the
information provided by a measurement or manipulation.
Together these provide (1) a rational basis for action, (2) a
demonstrated ability to recapture current standards, and
(3) flexibility for applications in which the payoff matrix
differs from the implicit matrices currently regnant. The
exposition is couched in terms of editorial actions, since
they play a central role in main (...truncated)