prep misestimates the probability of replication
ERIC-JAN WAGENMAKERS
University of Amsterdam, Amsterdam, The Netherlands
The probability of replication, prep, has been proposed as a means of identifying replicable and reliable effects in the psychological sciences. We conduct a basic test of prep that reveals that it misestimates the true probability of replication, especially for small effects. We show how these general problems with prep play out in practice, when it is applied to predict the replicability of observed effects over a series of experiments. Our results show that, over any plausible series of experiments, the true probabilities of replication will be very different from those predicted by prep. We discuss some basic problems in the formulation of prep that are responsible for its poor performance, and conclude that prep is not a useful statistic for psychological science.
ple sizes are very large. For example, when n = 10 and the
underlying true effect size is 1, the actual
probability of replication is about .95, and prep on average gives a
value of about .90; but the variability is large, with one SD
around the mean extending from about .80 to 1.00.
The results in Figure 1 have serious consequences for
the performance of prep. For small effect sizes, where
much of the psychological interest lies, and where new
experimental findings can make the biggest contribution
to the psychological sciences, prep is highly variable and
exaggerates the probability of replication. Only for very
large effect sizes does prep work (approximately) as
advertised. Figure 1 suggests that, unless we are willing to
believe that most experiments have very large effects, prep
will on average lead us to overestimate the probability of
replication, and will do so with undesirably high
variability. Figure 1 also shows that we cannot safely use prep to
identify replicable or reliable small effects.
The Practical Consequences of Misestimation for prep
Killeen (2005a, p. 351), in his closing statements,
conceived of prep allowing the management of risk in a
research setting:
But editors may lower the hurdle for potentially
important research that comes with so precise a
warning label as prep. When replicability becomes the
criterion, researchers can gauge the risks they face in
pursuing a line of study: An assistant professor may
choose paradigms in which prep is typically greater
than .8, whereas a tenured risk taker may [pursue] a
line of research having preps around .6.
Of course, only clairvoyants can identify those
experiments that will give them prep values of exactly .6 or .8.
This means we cannot simply use the analysis in Figure 1
to look up how prep will misestimate in practice. Whereas
our analysis shows that prep has general problems, it
does not make explicit how those problems will play out
in practice when prep is used to make predictions about
replicability for observed effect sizes, as Killeen (2005a)
proposed. In this section, we address the problem of
misestimation in practice directly.
Research Strategies
Under Killeen's (2005a) risk management conception,
researchers do a series of experiments, hunting for replicable
and reliable effects according to some risk management
strategy. The more aggressive tenured researchers might choose
experiments they believe might have small effects, and avoid
doing less interesting experiments whose effect is obvious
from the outset. The more conservative untenured
researchers might spread their net wider, being happy to do
experiments with large underlying effect sizes, but inevitably also
doing experiments with small underlying effect sizes.
A sensible way to think about these different
risk-seeking profiles is to imagine each attempted experiment
having a true but unknown effect size drawn from a
distribution of possible experiments. The distribution used
corresponds to the risk management strategy. Four
possible strategies are shown in Figure 2. The top panel shows
riskier strategies for tenured researchers, focused on small
effect sizes. Strategy A assumes that the distribution has
its mode at 0, whereas Strategy B makes the more
optimistic assumption that researchers are astute enough to
be able to place modes on small but genuine effects, and
to then try to control the variance of their distribution to
focus on these effect sizes. Strategies C and D in the
bottom panel, for the untenured researcher, follow the same
pattern, except now the distributions have greater
variance, so that experiments with larger underlying effects
are also included in the mix.
All of the strategy distributions are symmetric about 0,
because of the nature of effect size measures (i.e., the
magnitude of an effect size carries information, but the
sign is arbitrary). This symmetry requires, for example,
that observed effects of d = 2 and d = -2 be equally
likely for any given strategy. For this reason, it is possible
to formulate any strategy more succinctly as a distribution
over absolute effect size, in which case the strategies in
Figure 2 would become truncated normal distributions.
In these terms, the means for Strategies A and C are 0,
and the means for Strategies B and D are 0.2. The SDs for
Strategies A and B are 0.3, and the SDs for Strategies C
and D are 0.8.
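A minimal sketch of how one might draw true effect sizes under the four strategies, assuming that the stated means and SDs are the parameters of a normal distribution over absolute effect size truncated at 0, with the sign attached at random to preserve symmetry. The names `STRATEGIES` and `sample_true_effect` are hypothetical, and the truncation is done here by simple rejection sampling; this is one reading of the description above, not necessarily the procedure the authors used.

```python
import random

# Hypothetical encoding of the four strategies: (mean, SD) of a normal
# distribution over absolute effect size, truncated at 0 (an assumption).
STRATEGIES = {
    "A": (0.0, 0.3),
    "B": (0.2, 0.3),
    "C": (0.0, 0.8),
    "D": (0.2, 0.8),
}


def sample_true_effect(strategy, rng):
    """Draw one true effect size for the named strategy."""
    mean, sd = STRATEGIES[strategy]
    # Rejection sampling: redraw until the magnitude is non-negative.
    while True:
        magnitude = rng.gauss(mean, sd)
        if magnitude >= 0:
            break
    # The sign of an effect size is arbitrary, so choose it at random.
    sign = 1 if rng.random() < 0.5 else -1
    return sign * magnitude


if __name__ == "__main__":
    rng = random.Random(0)
    for name in STRATEGIES:
        draws = [sample_true_effect(name, rng) for _ in range(5)]
        print(name, [round(x, 2) for x in draws])
```

Under this reading, Strategy A reduces to an ordinary normal distribution with mean 0 and SD 0.3, because a zero-mean normal truncated at 0 and given a random sign is the original normal again.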
The Performance of prep
Whatever strategy researchers use, prep is supposed to
give them the probability that effects they observe for each
experiment will be replicated in sign. A prep value of .85
claims an 85% probability that the next effect will have the
same sign, and a 15% chance that it will not. It is easy to test
the usefulness of prep as an estimator of these probabilities
by simulation. We examined the four strategies shown in
Figure 2, and focused on a standard root mean square error
(RMSE) measure of the difference between the true
probability of replication and the estimate provided by prep.
Our simulation test used the following seven steps.
1. Choose an experiment by sampling from the distribution
defined by the risk strategy. Call the true underlying
effect size for the particular experiment sampled $\delta$.
2. Generate the observed effect size from an experiment (which
involves experimental and control groups, both with $n$ subjects)
from the normal distribution with mean $\delta$ and variance $2/n$.
Call this $d$.
3. Calculate the true probability of replication, which is
given by $p^{*}_{rep} = \Phi\left[\mathrm{sgn}(d)\,\delta / \sqrt{2/n}\right]$.
4. Calculate $p_{rep} = \Phi\left(|d| \sqrt{n/4}\right)$.
5. Calculate the mean squared error (MSe) between the
true probability of replication, $p^{*}_{rep}$, and the estimate $p_{rep}$.
For the $t$th trial, this is $\mathrm{MSe}(t) = (p^{*}_{rep} - p_{rep})^{2}$.
6. Go back to Step 1 to conduct the next experiment,
until a total of $T$ have been completed.
7. When all $T$ experiments are completed, average the
MSes over all the experiments, and take the square root
of this average to get the final RMSE. That is, calculate
$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t} \mathrm{MSe}(t)}$.
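The seven steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, the standard normal distribution function is taken from `statistics.NormalDist`, and the sampler `draw_strategy_a` assumes that Strategy A can be read as a normal distribution with mean 0 and SD 0.3.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def p_rep_estimate(d, n):
    """Step 4: the estimate, Phi(|d| * sqrt(n / 4))."""
    return PHI(abs(d) * math.sqrt(n / 4))


def p_rep_true(d, delta, n):
    """Step 3: the true probability, Phi(sgn(d) * delta / sqrt(2 / n))."""
    sign = (d > 0) - (d < 0)
    return PHI(sign * delta / math.sqrt(2 / n))


def draw_strategy_a(rng):
    """Hypothetical sampler: Strategy A read as Normal(0, 0.3)."""
    return rng.gauss(0.0, 0.3)


def simulate_rmse(draw_delta, n, trials, seed=0):
    """Run the simulation for `trials` experiments and return the RMSE."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):  # Step 6: repeat until T experiments are done
        delta = draw_delta(rng)  # Step 1: true effect size
        d = rng.gauss(delta, math.sqrt(2 / n))  # Step 2: observed effect
        error = p_rep_true(d, delta, n) - p_rep_estimate(d, n)
        total += error ** 2  # Step 5: squared error for this trial
    return math.sqrt(total / trials)  # Step 7: root of the mean


if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        print(n, round(simulate_rmse(draw_strategy_a, n, trials=10000), 3))
```

Passing the sampler as an argument keeps the loop independent of the risk strategy, so any of the four strategy distributions can be substituted without changing the rest of the sketch.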
To make the process of the simulation test concrete, the
first (...truncated)