A Bayesian Perspective on the Reproducibility Project: Psychology

PLOS ONE, Feb 2016

We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors—a quantity that can be used to express comparative evidence for an hypothesis but also for the null hypothesis—for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.


RESEARCH ARTICLE

Alexander Etz (1), Joachim Vandekerckhove (2, 3)*

(1) Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands
(2) Department of Cognitive Sciences, University of California, Irvine, Irvine, CA, United States of America
(3) Department of Statistics, University of California, Irvine, Irvine, CA, United States of America

Citation: Etz A, Vandekerckhove J (2016) A Bayesian Perspective on the Reproducibility Project: Psychology. PLoS ONE 11(2): e0149794. doi:10.1371/journal.pone.0149794

Editor: Daniele Marinazzo, Universiteit Gent, BELGIUM

Received: December 16, 2015
Accepted: February 4, 2016
Published: February 26, 2016

Copyright: © 2016 Etz, Vandekerckhove. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All relevant data are within the paper and its Supporting Information files.

Funding: This work was partly funded by the National Science Foundation grants #1230118 and #1534472 from the Methods, Measurements, and Statistics panel (www.nsf.gov) and the John Templeton Foundation grant #48192 (www.templeton.org). This publication was made possible through the support of a grant from the John Templeton Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

1 Introduction

The summer of 2015 saw the first published results of the long-awaited Reproducibility Project: Psychology by the Open Science Collaboration [1] (henceforth OSC). Of the 100 studies from leading journals that the project attempted to replicate closely, fewer than half were judged to have replicated successfully. The replications were pre-registered in order to avoid selection and publication bias and were evaluated using multiple criteria. When success was defined as reaching statistical significance (i.e., p < .05), only 39% were judged to have been successfully reproduced. Nevertheless, the paper reports a .51 correlation between original and replication effect sizes, indicating some degree of robustness of results (see their Fig 3).

Much like the results of the project, the reactions in media and social media have been mixed. In a first wave of reactions, headlines ranged from the dryly descriptive “Scientists replicated 100 psychology studies, and fewer than half got the same results” [2] and “More than half of psychology papers are not reproducible” [3] to the crass “Study reveals that a lot of psychology research really is just ‘psycho-babble’” [4]. A second wave of reactions shortly followed. Editorials with titles such as “Psychology is not in crisis” [5] and a statement by the American Psychological Association [6] were quick to emphasize the possibility of many hidden moderators that rendered the replications ineffective. OSC acknowledges this: “unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes,” but it is unclear what, if any, bearing this has on the robustness of the theories that the original publications supported.

In addition to the unresolved possibility of hidden moderators, there is the issue of lacking statistical power. The statistical power of an experiment is the frequency with which it will yield a statistically significant effect in repeated sampling, assuming that the underlying effect is of a given size. All other things (such as the design of the study and the true size of the effect) being equal, statistical power is determined by an experiment’s sample size. Low-powered research designs undermine the credibility of statistically significant results in addition to increasing the probability of nonsignificant ones (see [7] and the references therein for a detailed argument); furthermore, low-powered studies generally provide only small amounts of evidence (in the form of weak Bayes factors; see below).

Among the insights reported in OSC is that “low-power research designs combined with publication bias favoring positive results together produce a literature with upwardly biased effect sizes,” and that this may explain why replications, which are unaffected by publication bias, show smaller effect sizes. Here, we formally evaluate that insight, and use the results of the Reproducibility Project: Psychology to conclude that publication bias and low-powered designs indeed contribute to the poor reproducibility, but also that many of the replication attempts in OSC were themselves underpowered. While the OSC aimed for a minimum of 80% power (with an average of 92%) in all replications, this estimate was based on the observed effect size in the original studies. In the likely event that these observed effect sizes were inflated (see next section), the sample size recommendations (...truncated)
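
The definition of statistical power given above, and the quoted point that low power combined with a significance filter inflates published effect sizes, can be illustrated with a small simulation. The sketch below is not the authors' analysis. The true effect size (d = 0.3), the sample size (20 per group), the use of a normal approximation in place of an exact t-test, and all function names are assumptions chosen purely for illustration.

```python
import math
import random


def mean_sd(xs):
    """Return the sample mean and the (n - 1) standard deviation."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, math.sqrt(var)


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (observed d, significant?)."""
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    m_t, sd_t = mean_sd(treatment)
    m_c, sd_c = mean_sd(control)
    pooled_sd = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)
    d = (m_t - m_c) / pooled_sd
    # Assumption: the large-sample cutoff 1.96 stands in for the exact
    # t critical value of a two-sided test at alpha = .05.
    z = d * math.sqrt(n_per_group / 2)
    return d, abs(z) > 1.96


def power_and_inflation(true_d, n_per_group, n_sims=5000, seed=1):
    """Estimate power, plus the mean d over all studies and over significant ones."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_sims)]
    significant = [d for d, sig in results if sig]
    power = len(significant) / n_sims  # share of repeated samples reaching p < .05
    mean_all = sum(d for d, _ in results) / n_sims
    mean_significant = sum(significant) / len(significant)
    return power, mean_all, mean_significant


def approx_power(d, n_per_group):
    """Normal-approximation power of the same two-sided test."""
    def phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    shift = d * math.sqrt(n_per_group / 2)
    return phi(shift - 1.96) + phi(-shift - 1.96)


def n_for_power(d, target=0.80):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    true_d = 0.3  # hypothetical true effect size
    power, mean_all, mean_sig = power_and_inflation(true_d, n_per_group=20)
    print(f"estimated power of the original design: {power:.2f}")
    print(f"mean observed d, all studies:           {mean_all:.2f}")
    print(f"mean observed d, significant only:      {mean_sig:.2f}")
    # Plan a replication for 80% power using the inflated 'published' estimate.
    n_rep = n_for_power(mean_sig)
    print(f"replication n per group planned at 80%: {n_rep}")
    print(f"power of that replication at the true d: {approx_power(true_d, n_rep):.2f}")
```

Under these assumptions, the studies that pass the significance filter report a larger average effect than the full set of studies, and a replication sized to reach 80% power at that filtered estimate has considerably less power against the smaller true effect. This is the mechanism the passage describes, shown on made-up numbers rather than on the OSC data.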


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0149794&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794

Alexander Etz, Joachim Vandekerckhove. A Bayesian Perspective on the Reproducibility Project: Psychology, PLOS ONE, 2016, Volume 11, Issue 2, DOI: 10.1371/journal.pone.0149794