SciData: a data model and ontology for semantic representation of scientific data
Chalk J Cheminform
Stuart J. Chalk
Department of Chemistry, University of North Florida, Jacksonville, FL 32224, USA
With the move toward global, Internet-enabled science, there is an inherent need to capture, store, aggregate and search scientific data across a large corpus of heterogeneous data silos. As a result, standards development is needed to create an infrastructure capable of representing the diverse nature of scientific data. This paper describes a fundamental data model for scientific data that can be applied to data currently stored in any format, and an associated ontology that affords semantic representation of the structure of scientific data (and its metadata), upon which discipline-specific semantics can be applied. Applications of this data model to experimental and computational chemistry data are presented, implemented using JavaScript Object Notation for Linked Data (JSON-LD). Full examples are available at the project website (Chalk in SciData: a scientific data model. http://stuchalk.github.io/scidata/, 2016).
Science data; Semantic annotation; Ontology; JSON-LD; RDF; Scientific data model
Background
For almost 40 years, scientists have been storing
scientific data on computers. With the advent of the Internet,
research data could be shared between scientists, first via
email and later using web pages, FTP sites, and online
databases. With the advancement of Internet
technologies and online and local storage capabilities, the options
for collecting and storing scientific information have
become unlimited.
Yet, with all these advancements, science faces an
increasingly important issue of interoperability. Data are
commonly stored in different formats, organized in
different ways, and available via different tools/services,
severely impacting curation [2]. In addition, data often
lack context (no metadata describing them), and where
metadata exist they are often minimal and not based on
standards. Though the Internet has promoted the creation of
open standards in many areas, scientific data has, in a
sense, been left behind because of its inherent
complexity. The strange part about this scenario is that scientific
data itself is not the biggest problem. The problem is the
contextualization of the scientific data: the metadata
that describes the system it applies to, the way it was
investigated, the scientists who determined it, and the
quality of the measurements.
So, what is scientific data and where is the metadata?
Peter Murray-Rust grappled with these questions in
2010 and concluded that it is “factual data that shows up
in research papers” [3]. When writing scientific articles,
researchers include most, though usually not all, of the
valuable metadata in the description of the research they have
performed. The motivation, of course, is open sharing of
knowledge for the advancement of science, with
appropriate attribution and provenance of research work. As
we move toward the fourth paradigm [4], where large
aggregations of data are the key to discovery, it is
imperative that the context of the data is articulated completely
(or as completely as possible), not only to identify its
origin and authenticity, but more importantly to allow the
data to be located correctly on the “scientific data map”.
© 2016 The Author(s). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/
publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
To address these issues, this paper describes a generic
scientific data model (SDM)/framework for scientific
data derived from (1) the common structure of scientific
articles, (2) the needs of electronic notebooks to
capture scientific research data and metadata, and (3) the
clear need to organize scientific data and its contextual
descriptors (metadata). The SDM is intended to be data
format/software agnostic and extremely flexible, so that
it can be implemented as the scientific research dictates.
While the SDM is abstract in nature, it defines a concrete
framework that can be easily implemented in any
database and does not constrain the data and metadata that
can be stored. It therefore serves as a backbone upon
which data and its associated metadata can be ‘attached’.
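The 'backbone' idea can be illustrated with a minimal sketch in Python. This is not the SciData specification or its JSON-LD serialization: the section names (`system`, `methodology`, `authors`, `quality`, `dataset`), the helper functions, the placeholder identifier, and all values are assumptions chosen only to mirror the kinds of context listed earlier in this section. The point it shows is that the structure is fixed while the content attached to each part is left unconstrained.

```python
import json

# Hypothetical section names (an assumption, not taken from the SDM itself).
# They mirror the context described above: the system studied, how it was
# investigated, who measured it, the quality of the measurement, and the data.
SECTIONS = ("system", "methodology", "authors", "quality", "dataset")


def new_record(record_id):
    """Create an empty backbone: fixed sections, no constraint on contents."""
    record = {"@id": record_id}
    for name in SECTIONS:
        record[name] = []
    return record


def attach(record, section, item):
    """Attach a free-form data or metadata item to one backbone section."""
    if section not in SECTIONS:
        raise KeyError(f"unknown section: {section}")
    record[section].append(item)
    return record


# Made-up, illustrative values; the identifier is a placeholder.
record = new_record("https://example.org/record/1")
attach(record, "system", {"name": "aqueous acetic acid", "temperature_K": 298.15})
attach(record, "methodology", {"technique": "potentiometric titration"})
attach(record, "dataset", {"quantity": "pH", "value": 2.88})

# The record is an ordinary dictionary, so it can be serialized to JSON
# or mapped onto database tables without changing its shape.
serialized = json.dumps(record, indent=2)
```

Because each section holds arbitrary items, a different discipline could attach entirely different descriptors to the same backbone without altering the surrounding structure.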
In addition, this paper describes an ontology that
defines the terms in the SDM, which can be used to
semantically annotate the structure of the data reported.
In this way, scientific data can be integrated by
storage in Resource Description Fr (...truncated)