Open Data for Global Science
Data Science Journal, Volume 6, Open Data Issue, 17 June 2007
Paul F. Uhlir1* and Peter Schröder2
*1 National Research Council, 2101 Constitution Avenue NW, Washington, DC 20418, USA. The views expressed in
this paper are those of the authors and not necessarily those of their institutions of employment.
2 Data Archiving and Networked Services (DANS), Anna van Saksenlaan 51, 2593 HW Den Haag, The Netherlands
ABSTRACT
The digital revolution has transformed the accumulation of properly curated public research data into an essential
upstream resource whose value increases with use.1 The potential contributions of such data to the creation of new
knowledge and downstream economic and social goods can in many cases be multiplied exponentially when the
data are made openly available on digital networks. Most developed countries spend large amounts of public
resources on research and related scientific facilities and instruments that generate massive amounts of data. Yet
precious little of that investment is devoted to promoting the value of the resulting data by preserving and making
them broadly available. The largely ad hoc approach to managing such data, however, is now beginning to be
understood as inadequate to meet the exigencies of the national and international research enterprise. The time has
thus come for the research community to establish explicit responsibilities for these digital resources. This article
reviews the opportunities and challenges to the global science system associated with establishing an open data
policy.
Keywords: Scientific data, Science policy, Information policy, Open access, Data management, Data licensing,
International scientific cooperation, Cyberinfrastructure, e-Science, Internet
1 INTRODUCTION
The global science system stands at a critical juncture. On the one hand, it is overwhelmed by a hidden avalanche of
ephemeral bits that are central components of modern research and of the emerging “cyberinfrastructure”2 for
e-science3. The rational management and exploitation of this cascade of digital assets offer boundless opportunities
for research and applications. On the other hand, the ability to access and use this rising flood of data seems to lag
behind, despite the rapidly growing capabilities of information and communication technologies (ICTs) to make
much more effective use of those data. As long as the attention paid to data policies and data management by
researchers, their organisations, and their funders fails to keep pace with the rapidly changing research environment,
the research policy and funding entities will in many cases perpetuate the systemic inefficiencies and the resulting
loss or underutilization of valuable data resources derived from public investments. There is thus an urgent need for
rationalized national strategies and more coherent international arrangements for sustainable access to public
research data, both to data produced directly by government entities and to data generated in academic and
not-for-profit institutions with public funding.

1 See generally, National Research Council (1997), Bits of Power: Issues in Global Access to Scientific Data,
National Academy Press, Washington, DC. “Data” may be defined as “facts, numbers, letters, and symbols that
describe an object, idea, condition, situation, or other factors”, National Research Council (1999), A Question of
Balance: Private Rights and the Public Interest in Scientific Databases, National Academy Press, Washington, DC,
p. 15. We define “public research data” as data that are generated through research within government organizations,
or by academic or other not-for-profit entities, as well as public data used for research purposes, but not necessarily
produced primarily for research (e.g., geographic or meteorological data, or socioeconomic statistics produced by or
for government organizations).

2 The U.S. Blue Ribbon Advisory Panel on Cyberinfrastructure anticipated an information and communication
technology (ICT) infrastructure of “…digital environments that become interactive and functionally complete for
research communities in terms of people, data, information, tools and instruments and that operate at unprecedented
levels of computational, storage and data transfer capacity…” in (2003) Revolutionizing Science and Engineering
Through Cyberinfrastructure: Report of the National Science Foundation Blue Ribbon Advisory Panel on
Cyberinfrastructure, National Science Foundation, available at: http://www.communitytechnology.org/nsf_ci_report/.
We use the terms cyberinfrastructure and ICT infrastructure interchangeably in this paper.

In this paper, we examine some of the implications of “data-driven” research and possible ways to overcome
existing barriers to the accessibility of public research data. Our perspective is framed in the context of the
predominantly publicly funded global science system. We begin by reviewing the growing role of digital data in
research and outlining the roles of stakeholders in the research community in developing data access regimes. We
then discuss the hidden costs of closed data systems, the benefits and limitations of openness as the default principle
for data access, and the emerging open access models that are beginning to form digitally networked commons. We
conclude by examining the rationale and requirements for developing overarching international principles from the
top down, as well as flexible, common-use contractual templates from the bottom up, to establish data access
regimes founded on a presumption of openness, with the goal of better capturing the benefits from the existing and
future scientific data assets. The “Principles and Guidelines for Access to Research Data from Public Funding” from
the Organisation for Economic Cooperation and Development (OECD), reported on in another article by Pilat and
Fukasaku in this special issue of the CODATA Data Science Journal, are the most important recent example of the
high-level (inter)governmental approach. The common-use licenses promoted by the Science Commons are a
leading example of flexible arrangements originating within the community. Finally, we should emphasize that we
focus almost exclusively on policy (the institutional, socioeconomic, and legal aspects of data access) rather
than on the technical and management practicalities, which are also important but beyond the scope of this article.
2 THE GROWING ROLE OF DIGITAL DATA IN THE RESEARCH PROCESS
The evolution of scientific research may be characterized by an accelerating growth in scale, scope, and complexity.
These developments in scientific research have been accompanied by a substantial rise in costs. Overall
expenditures on research and development (R&D) in the OECD countries increased from $163.2 billion in 1981 to
$679.8 in 2003 (in constant prices, 2000 dollars (...truncated)