A Bayesian Perspective on the Reproducibility Project: Psychology
RESEARCH ARTICLE
Alexander Etz1, Joachim Vandekerckhove2,3*
1 Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands, 2 Department of
Cognitive Sciences, University of California, Irvine, Irvine, CA, United States of America, 3 Department of
Statistics, University of California, Irvine, Irvine, CA, United States of America
OPEN ACCESS
Citation: Etz A, Vandekerckhove J (2016) A
Bayesian Perspective on the Reproducibility Project:
Psychology. PLoS ONE 11(2): e0149794.
doi:10.1371/journal.pone.0149794
Editor: Daniele Marinazzo, Universiteit Gent,
BELGIUM
Received: December 16, 2015
Accepted: February 4, 2016
Published: February 26, 2016
Copyright: © 2016 Etz, Vandekerckhove. This is an
open access article distributed under the terms of the
Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any
medium, provided the original author and source are
credited.
Abstract
We revisit the results of the recent Reproducibility Project: Psychology by the Open Science
Collaboration. We compute Bayes factors (a quantity that can express comparative evidence for a hypothesis but also for the null hypothesis) for a large subset (N = 72)
of the original papers and their corresponding replication attempts. In our computation, we
take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the
amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor <
10). The majority of the studies (64%) did not provide strong evidence for either the null or
the alternative hypothesis in either the original or the replication, and no replication attempts
provided strong evidence in favor of the null. In all cases where the original paper provided
strong evidence but the replication did not (15%), the sample size in the replication was
smaller than the original. Where the replication provided strong evidence but the original did
not (10%), the replication sample size was larger. We conclude that the apparent failure of
the Reproducibility Project to replicate many target effects can be adequately explained by
overestimation of effect sizes (or overestimation of evidence against the null hypothesis)
due to small sample sizes and publication bias in the psychological literature. We further
conclude that traditional sample sizes are insufficient and that a more widespread adoption
of Bayesian methods is desirable.
Data Availability Statement: All relevant data are
within the paper and its Supporting Information files.
Funding: This work was partly funded by the
National Science Foundation grants #1230118 and
#1534472 from the Methods, Measurements,
and Statistics panel (www.nsf.gov) and the John
Templeton Foundation grant #48192 (www.templeton.
org). This publication was made possible through the
support of a grant from the John Templeton
Foundation. The opinions expressed in this
publication are those of the authors and do not
necessarily reflect the views of the John Templeton
Foundation. The funders had no role in study design,
data collection and analysis, decision to publish, or
preparation of the manuscript.
Competing Interests: The authors have declared
that no competing interests exist.
1 Introduction
The summer of 2015 saw the first published results of the long-awaited Reproducibility Project:
Psychology by the Open Science Collaboration [1] (henceforth OSC). Of the 100 studies from
leading journals that the project attempted to replicate closely, fewer than half were judged to
have replicated successfully. The replications were pre-registered in order to avoid selection and
publication bias and were evaluated using multiple criteria. Under the criterion that a replication
succeeds if it reaches statistical significance (i.e., p < .05), only 39% of the studies were judged
to have been successfully reproduced. Nevertheless, the paper reports a .51 correlation between
original and replication effect sizes, indicating some degree of robustness of results (see their Fig 3).
Much like the results of the project, the reactions in media and social media have been
mixed. In a first wave of reactions, headlines ranged from the dryly descriptive “Scientists replicated 100 psychology studies, and fewer than half got the same results” [2] and “More than half
of psychology papers are not reproducible” [3] to the crass “Study reveals that a lot of psychology research really is just ‘psycho-babble’” [4]. A second wave of reactions shortly followed.
Editorials with titles such as “Psychology is not in crisis” [5] and a statement by the American
Psychological Association [6] were quick to emphasize the possibility of many hidden moderators that rendered the replications ineffective. OSC acknowledges this: “unanticipated factors
in the sample, setting, or procedure could still have altered the observed effect magnitudes,”
but it is unclear what, if any, bearing this has on the robustness of the theories that the original
publications supported.
In addition to the unresolved possibility of hidden moderators, there is the issue of lacking
statistical power. The statistical power of an experiment is the frequency with which it will
yield a statistically significant effect in repeated sampling, assuming that the underlying effect
is of a given size. All other things—such as the design of the study and the true size of the effect
—being equal, statistical power is determined by an experiment’s sample size. Low-powered
research designs undermine the credibility of statistically significant results in addition to
increasing the probability of nonsignificant ones (see [7] and the references therein for a
detailed argument); furthermore, low-powered studies generally provide only small amounts of
evidence (in the form of weak Bayes factors; see below).
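As a rough illustration of how sample size drives power when everything else is held fixed, the following Python sketch uses a normal approximation to a two-sided, two-sample test. The function name approx_power, the effect size of 0.3, and the listed sample sizes are illustrative assumptions and are not values taken from OSC.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the assumed true standardized mean difference
    (Cohen's d); n_per_group is the size of each of two equal groups.
    This is a normal approximation, not an exact t-test calculation.
    """
    std_normal = NormalDist()
    # Critical value that the test statistic must exceed in absolute value.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed true effect.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in the upper or the lower rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical scenario: a modest true effect at several sample sizes.
    for n in (20, 50, 100, 200):
        print(n, round(approx_power(0.3, n), 3))
```

With the effect size held fixed, the computed power rises with the per-group sample size. Conversely, if the assumed effect size is larger than the true one, the sample size chosen to reach a target power will be too small.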
Among the insights reported in OSC is that “low-power research designs combined with
publication bias favoring positive results together produce a literature with upwardly biased
effect sizes,” and that this may explain why replications—unaffected by publication bias—show
smaller effect sizes. Here, we formally evaluate that insight, and use the results of the Reproducibility Project: Psychology to conclude that publication bias and low-powered designs indeed
contribute to the poor reproducibility, but also that many of the replication attempts in OSC
were themselves underpowered. While the OSC aimed for a minimum of 80% power (with an
average of 92%) in all replications, this estimate was based on the observed effect size in the
original studies. In the likely event that these observed effect sizes were inflated (see next section), the sample size recommendations (...truncated)