prep misestimates the probability of replication (pdf)

http://link.springer.com/content/pdf/10.3758%2FPBR.16.2.424.pdf

ERIC-JAN WAGENMAKERS
University of Amsterdam, Amsterdam, The Netherlands

The probability of replication, prep, has been proposed as a means of identifying replicable and reliable effects in the psychological sciences. We conduct a basic test of prep that reveals that it misestimates the true probability of replication, especially for small effects. We show how these general problems with prep play out in practice, when it is applied to predict the replicability of observed effects over a series of experiments. Our results show that, over any plausible series of experiments, the true probabilities of replication will be very different from those predicted by prep. We discuss some basic problems in the formulation of prep that are responsible for its poor performance, and conclude that prep is not a useful statistic for psychological science.

[...]ple sizes are very large. For example, when n = 10 and the underlying true effect size is 1, the actual probability of replication is about .95, and prep on average gives a value of about .90; but the variability is large, with one SD around the mean extending from about .80 to 1.00.

The results in Figure 1 have serious consequences for the performance of prep. For small effect sizes, where much of the psychological interest lies, and where new experimental findings can make the biggest contribution to the psychological sciences, prep is highly variable and exaggerates the probability of replication. Only for very large effect sizes does prep work (approximately) as advertised. Figure 1 suggests that, unless we are willing to believe that most experiments have very large effects, prep will on average lead us to overestimate the probability of replication, and will do so with undesirably high variability. Figure 1 also shows that we cannot safely use prep to identify replicable or reliable small effects.
The Practical Consequences of Misestimation for prep

Killeen (2005a, p. 351), in his closing statements, conceived of prep as allowing the management of risk in a research setting:

"But editors may lower the hurdle for potentially important research that comes with so precise a warning label as prep. When replicability becomes the criterion, researchers can gauge the risks they face in pursuing a line of study: An assistant professor may choose paradigms in which prep is typically greater than .8, whereas a tenured risk taker may [pursue] a line of research having preps around .6."

Of course, only clairvoyants can identify those experiments that will give them prep values of exactly .6 or .8. This means we cannot simply use the analysis in Figure 1 to look up how prep will misestimate in practice. Whereas our analysis shows that prep has general problems, it does not make explicit how those problems will play out in practice when prep is used to make predictions about replicability for observed effect sizes, as Killeen (2005a) proposed. In this section, we address the problem of misestimation in practice directly.

Research Strategies

Under Killeen's (2005a) risk management conception, researchers do a series of experiments, hunting for replicable and reliable effects according to some risk management strategy. The more aggressive tenured researchers might choose experiments they believe might have small effects, and avoid doing less interesting experiments whose effect is obvious from the outset. The more conservative untenured researchers might spread their net wider, being happy to do experiments with large underlying effect sizes, but inevitably also doing experiments with small underlying effect sizes. A sensible way to think about these different risk-seeking profiles is to imagine each attempted experiment having a true but unknown effect size drawn from a distribution of possible experiments.
The distribution used corresponds to the risk management strategy. Four possible strategies are shown in Figure 2. The top panel shows riskier strategies for tenured researchers, focused on small effect sizes. Strategy A assumes that the distribution has its mode at 0, whereas Strategy B makes the more optimistic assumption that researchers are astute enough to be able to place modes on small but genuine effects, and to then try to control the variance of their distribution to focus on these effect sizes. Strategies C and D in the bottom panel, for the untenured researcher, follow the same pattern, except now the distributions have greater variance, so that experiments with larger underlying effects are also included in the mix.

All of the strategy distributions are symmetric about 0, because of the nature of effect size measures (i.e., the magnitude of an effect size carries information, but the sign is arbitrary). This symmetry requires, for example, that observed effects of d = 2 and d = −2 be equally likely for any given strategy. For this reason, it is possible to formulate any strategy more succinctly as a distribution over absolute effect size, in which case the strategies in Figure 2 would become truncated normal distributions. In these terms, the means for Strategies A and C are 0, and the means for Strategies B and D are 0.2. The SDs for Strategies A and B are 0.3, and the SDs for Strategies C and D are 0.8.

The Performance of prep

Whatever strategy researchers use, prep is supposed to give them the probability that effects they observe for each experiment will be replicated in sign. A prep value of .85 claims an 85% probability that the next effect will have the same sign, and a 15% chance that it will not. It is easy to test the usefulness of prep as an estimator of these probabilities by simulation.
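A minimal Python sketch of such a simulation is given here as an illustration, not as the authors' code. It rests on several assumptions: each strategy is read as a normal distribution over absolute effect size truncated at 0, with the stated means taken as location parameters of the untruncated normal; the sign of the true effect is assigned at random; and the true and estimated replication probabilities use the formulas listed in the seven-step procedure described next. The names `STRATEGIES`, `sample_true_effect`, and `rmse_of_prep` are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

# (location, SD) of the truncated normal over |delta| for each strategy,
# using the values stated in the text. Reading "mean" as the location of
# the untruncated normal is an assumption of this sketch.
STRATEGIES = {"A": (0.0, 0.3), "B": (0.2, 0.3), "C": (0.0, 0.8), "D": (0.2, 0.8)}


def sample_true_effect(strategy, rng):
    """Draw a true effect size: |delta| truncated at 0, sign chosen at random."""
    loc, sd = STRATEGIES[strategy]
    while True:  # rejection sampling enforces the truncation at 0
        magnitude = rng.gauss(loc, sd)
        if magnitude >= 0:
            break
    return magnitude if rng.random() < 0.5 else -magnitude


def rmse_of_prep(strategy, n, num_experiments, seed=0):
    """RMSE between the true replication probability and prep over many experiments."""
    rng = random.Random(seed)
    sd_d = math.sqrt(2.0 / n)  # SD of the observed effect, two groups of n subjects
    total_sq_err = 0.0
    for _ in range(num_experiments):
        delta = sample_true_effect(strategy, rng)   # true underlying effect size
        d = rng.gauss(delta, sd_d)                  # observed effect size
        sign = 1.0 if d >= 0 else -1.0
        p_true = PHI(sign * delta / sd_d)           # true probability of same-sign replication
        p_rep = PHI(abs(d) * math.sqrt(n / 4.0))    # prep computed from the observed effect
        total_sq_err += (p_true - p_rep) ** 2       # squared error for this experiment
    return math.sqrt(total_sq_err / num_experiments)
```

A call such as `rmse_of_prep("A", n=20, num_experiments=10000)` returns one RMSE value for one strategy and sample size. The result depends on the seed and on the assumptions above, so it should not be expected to match the paper's reported figures exactly.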
We examined the four strategies shown in Figure 2, and focused on a standard root mean square error (RMSE) measure of the difference between the true probability of replication and the estimate provided by prep. Our simulation test used the following seven steps.

1. Choose an experiment by sampling from the distribution defined by the risk strategy. Call the true underlying effect size for the particular experiment sampled $\delta$.
2. Generate the observed effect size from an experiment (which involves experimental and control groups, both with $n$ subjects) from the normal distribution with mean $\delta$ and variance $2/n$. Call this $d$.
3. Calculate the true probability of replication, which is given by $p^*_{\mathrm{rep}} = \Phi\left[\mathrm{sgn}(d)\,\delta / \sqrt{2/n}\right]$ (see Note 2).
4. Calculate $p_{\mathrm{rep}} = \Phi\left(|d|\sqrt{n/4}\right)$.
5. Calculate the mean squared error (MSe) between the true probability of replication, $p^*_{\mathrm{rep}}$, and the estimate $p_{\mathrm{rep}}$. For the $t$th trial, this is $\mathrm{MSe}(t) = (p^*_{\mathrm{rep}} - p_{\mathrm{rep}})^2$.
6. Go back to Step 1 to conduct the next experiment, until a total of $T$ have been completed.
7. When all $T$ experiments are completed, average the MSes over all the experiments, and take the square root of this average to get the final RMSE. That is, calculate $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t}\mathrm{MSe}(t)}$.

To make the process of the simulation test concrete, the first (...truncated)