Sharing Detailed Research Data Is Associated with Increased Citation Rate
Citation: Piwowar HA, Day RS, Fridsma DB (
Heather A. Piwowar, Roger S. Day, Douglas B. Fridsma
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
Academic Editor: John Ioannidis, University of Ioannina School of Medicine, Greece
Background. Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available.
Principal Findings. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. In a linear regression, publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independent of journal impact factor, date of publication, and author country of origin.
Significance. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
INTRODUCTION
Sharing information facilitates science. Publicly sharing detailed
research data (sample attributes, clinical factors, patient outcomes,
DNA sequences, raw mRNA microarray measurements) with
other researchers allows these valuable resources to contribute far
beyond their original analysis[1]. In addition to being used to
confirm original results, raw data can be used to explore related or
new hypotheses, particularly when combined with other publicly
available data sets. Real data is indispensable when investigating
and developing study methods, analysis techniques, and software
implementations. The larger scientific community also benefits:
sharing data encourages multiple perspectives, helps to identify
errors, discourages fraud, is useful for training new researchers,
and increases efficient use of funding and patient population
resources by avoiding duplicate data collection.
Believing that these benefits outweigh the costs of sharing
research data, many initiatives actively encourage investigators to
make their data available. Some journals, including the PLoS
family, require the submission of detailed biomedical data to
publicly available databases as a condition of publication[2-4].
Since 2003, the NIH has required a data sharing plan for all large
funding grants. The growing open-access publishing movement
will perhaps increase peer pressure to share data.
However, while the general research community benefits from
shared data, much of the burden for sharing the data falls to the study
investigator. Are there benefits for the investigators themselves?
A currency of value to many investigators is the number of times
their publications are cited. Although limited as a proxy for the
scientific contribution of a paper[5], citation counts are often used
in research funding and promotion decisions and have even been
assigned a salary-increase dollar value[6]. Boosting citation rate is
thus a potentially important motivator for publication authors.
In this study, we explored the relationship between the citation
rate of a publication and whether its data was made publicly
available. Using cancer microarray clinical trials, we addressed the
following questions: Do trials which share their microarray data
receive more citations? Is this true even within lower profile trials?
What other data-sharing variables are associated with an increased
citation rate? While this study is not able to investigate causation,
quantifying associations is a valuable first step in understanding
these relationships. Clinical microarray data provides a useful
environment for the investigation: despite being valuable for reuse
and extremely costly to collect, it is not yet universally shared.
RESULTS
We studied the citations of 85 cancer microarray clinical trials
published between January 1999 and April 2003, as identified in
a systematic review by Ntzani and Ioannidis[7] and listed in
Supplementary Text S1. We found 41 of the 85 clinical trials
(48%) made their microarray data publicly available on the
internet. Most data sets were located on lab websites (28), with
a few found on publisher websites (4), or within public databases (6
in the Stanford Microarray Database (SMD)[8], 6 in Gene
Expression Omnibus (GEO)[9], 2 in ArrayExpress[10], 2 in the
NCI GeneExpression Data Portal (GEDP) (gedp.nci.nih.gov); some
datasets were in more than one location). The internet locations of the
datasets are listed in Supplementary Text S2. The majority of
datasets were made available concurrently with the trial
publication, as evidenced by the WayBackMachine internet
archives (www.archive.org/web/web.php) for 25 of the datasets
and by mention of supplementary data within the trial publication
itself for 10 of the remaining 16 datasets. As seen in Table 1, trials
published in high impact journals, prior to 2001, or with US
authors were more likely to share their data.
The cohort of 85 trials was cited an aggregate of 6239 times in
2004-2005 by 3133 distinct articles (median of 1.0 cohort citation
per article, range 1-23). The 48% of trials which shared their data
received a total of 5334 citations (85% of aggregate), distributed as
shown in Figure 1.
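The two headline percentages can be checked directly from the counts reported above. The following is a minimal sketch of that arithmetic only; the variable and function names are illustrative and are not taken from the study.

```python
# Counts as reported in the Results text.
total_trials = 85
trials_sharing = 41
total_citations = 6239
citations_to_sharing = 5334


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage, rounded to the nearest integer."""
    return round(100 * part / whole)


# Share of trials that made data available, and share of citations they received.
share_of_trials = percent(trials_sharing, total_trials)
share_of_citations = percent(citations_to_sharing, total_citations)

print(f"{share_of_trials}% of trials received {share_of_citations}% of citations")
```

These are raw, unadjusted proportions; they do not account for journal impact factor, publication date, or author country.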
Funding: HAP was supported by NLM Training Grant Number 5T15-LM007059-19.
The NIH had no role in study design, data collection or analysis, writing the paper,
or the decision to submit it for publication. The publication contents are solely
the responsibility of the authors and do not necessarily represent the official
views of the NIH.
Table 1. Characteristics of Eligible Trials by Data Sharing.

                            Number of Articles   Data Shared   Data Not Shared   Odds Ratio (95% confidence interval)
TOTAL                       85                   41 (48%)      44 (52%)
High Impact Journal (≥25)   12                   12 (100%)     0 (0%)            ∞ (3.8 to ∞)
Low Impact (<25)            73                   29 (40%)      44 (60%)
Published 1999-2000         6                    5 (83%)       1 (17%)           6.0 (0.6 to 288.5)
Published 2001-2003         79                   36 (46%)      43 (54%)
Include a US Author         56                   35 (63%)      21 (38%)          6.4 (2.0 to 21.9)
No US Authors               29                   6 (21%)       23 (79%)

doi:10.1371/journal.pone.0000308.t001
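As an illustration of how the odds ratios in Table 1 relate to the underlying counts, the sketch below recomputes the point estimate for US authorship from the Table 1 counts (35 shared and 21 not shared among trials with a US author; 6 shared and 23 not shared among trials without one). The function name is hypothetical, and the sketch covers the point estimate only: this section does not state how the confidence intervals were derived, so they are not reproduced here.

```python
def odds_ratio(shared_a: int, not_shared_a: int,
               shared_b: int, not_shared_b: int) -> float:
    """Sample odds ratio of sharing data for group A relative to group B."""
    odds_a = shared_a / not_shared_a
    odds_b = shared_b / not_shared_b
    return odds_a / odds_b


# Group A: trials that include a US author; group B: trials with no US authors.
us_author_or = odds_ratio(35, 21, 6, 23)

print(round(us_author_or, 1))
```

The same function would fail with a division by zero for any row where a group has no non-sharing trials, which is the situation in which a sample odds ratio is unbounded.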
Whether a trial's dataset was made publicly available was
significantly associated with the log of its 2004-2005 citation rate
(69% increase in citation count; 95% confidence interval: 18 to
143%, p = 0.006), independent of journal impact factor, date of
publication, and US authorship. Detailed results of this
multivariate linear regression are given in Table 2. A similar result was
found when we regressed on the number of ci (...truncated)