A journey to Semantic Web query federation in the life sciences
Kei-Hoi Cheung (2), H Robert Frost (1), M Scott Marshall (0), Eric Prud'hommeaux (6), Matthias Samwald (4, 5), Jun Zhao (3), Adrian Paschke (7)
(0) Informatics Institute, University of Amsterdam, The Netherlands
(1) VectorC, LLC, Hanover, NH 03755, USA
(2) Center for Medical Informatics, Yale University School of Medicine, New Haven, CT 06511, USA
(3) Department of Zoology, University of Oxford, Oxford, OX1 3PS, UK
(4) Konrad Lorenz Institute for Evolution and Cognition Research, Altenberg, Austria
(5) Digital Enterprise Research Institute, National University of Ireland Galway, IDA Business Park, Lower Dangan, Galway, Ireland
(6) World Wide Web Consortium, Massachusetts Institute of Technology, Massachusetts, MA 02139, USA
(7) Freie Universität Berlin, Germany
Background: As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches, including triplestore technologies, SPARQL endpoints, Linked Data, and the Vocabulary of Interlinked Datasets, have emerged in recent years. In addition to data warehouse construction, these approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be used to execute distributed queries across different neuroscience data sources.
Methods and results: We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. In particular, we have created a prototype receptor explorer that uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called FeDeRate, which enables a global SPARQL query to be decomposed into subqueries against remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the Vocabulary of Interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints.
Conclusion: We have demonstrated the use of a set of novel and state-of-the-art Semantic Web
technologies in support of a neuroscience query federation scenario. We have identified both the
strengths and weaknesses of these technologies. While the Semantic Web offers a global data model
including the use of Uniform Resource Identifiers (URIs), the proliferation of
semantically equivalent URIs hinders large-scale data integration. Our work helps direct research and tool
development, which will be of benefit to this community.
Background
As the number, size, and complexity of life science
databases continue to grow, data integration remains a
prominent problem in the life sciences. These disparate
databases feature diverse types of data including
sequences, genes, proteins, pathways, and drugs
produced by different kinds of experiments, including those
that involve high-throughput technologies such as DNA
microarray, mass spectrometry, and next generation
sequencing. The challenges involved in integrating such
data include inconsistency in naming, diversity of data
models, and heterogeneous data formats. The benefits of
integrating these disparate sources of data include
discovery of new associations/relationships between
the data and validation of existing hypotheses.
Numerous life science databases can be accessed publicly
via the Web. The data retrieved from different databases
are displayed using the HyperText Markup Language
(HTML) and rendered by Web browsers (e.g., Internet
Explorer and Firefox). Hypertext links are used to
connect data items between different Web pages. Data
integration using hypertext links, however, is
burdensome to the user [1]. HTML works well to expose the
results of scripted (canned) queries but does not expose
the database structure to users who wish to
construct their own queries. To automate integration of
data in HTML format, we need to rely on methods such
as screen scraping to extract the data from the HTML
documents and integrate the extracted data by custom
scripts. This approach is vulnerable to changes in display
and location of Web pages. Such changes, together with
changes in database structure, significantly increase the
code complexity of data integration. To address this
problem, approaches have been developed to facilitate
data integration on a larger scale. Representative
approaches include EBI SRS [2], Atlas [3], DiscoveryLink
[4], Biokleisli [5], and Biozon [6]. In general, these
approaches fall into two categories: data warehouse and
federated database. The data warehouse approach relies
on data translation in which data from different
databases are re-expressed in a common data model on
a central repository. The federated approach features
query translation in which data are kept in their local
databases and a global query can be translated into a set
of local database subqueries whose results are unified
and presented to the user. There are pros and cons for
each approach. Data warehouses typically wrestle with
the data currency issue (keeping the data up-to-date with
respect to a data source). Each time a member database
is changed, the data translation code will need to be
modified and/or re-executed, depending on the nature of
the change. On the other hand, data warehouse query
performance is good because queries are run locally. In
the federated approach, data currency is not an issue,
but queries may be slow, especially when large
amounts of data are transferred over the network.
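The query translation step of the federated approach can be illustrated with a minimal sketch. It assumes two toy in-memory tables standing in for remote databases; the table contents, function names, and the join on receptor name are hypothetical and chosen only for illustration, and are not taken from any of the systems cited above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-ins for two independent member databases. In a real
# federation each would be queried over the network via its own interface.
NEURON_DB: List[Dict[str, str]] = [
    {"neuron": "Purkinje cell", "receptor": "GABA-A"},
    {"neuron": "CA1 pyramidal cell", "receptor": "NMDA"},
    {"neuron": "CA1 pyramidal cell", "receptor": "GABA-A"},
]
RECEPTOR_DB: List[Dict[str, str]] = [
    {"receptor": "GABA-A", "gene": "GABRA1"},
    {"receptor": "NMDA", "gene": "GRIN1"},
]


def query_neuron_db(neuron: str) -> List[Dict[str, str]]:
    """Local subquery 1: rows of the neuron source for one neuron."""
    return [row for row in NEURON_DB if row["neuron"] == neuron]


def query_receptor_db(receptors: Set[str]) -> List[Dict[str, str]]:
    """Local subquery 2: rows of the receptor source for the given receptors."""
    return [row for row in RECEPTOR_DB if row["receptor"] in receptors]


def genes_for_neuron(neuron: str) -> List[Tuple[str, str]]:
    """Global query: which receptor genes are associated with a neuron?

    The global query is split into one subquery per source, and the
    partial results are unified by joining on the receptor name.
    """
    neuron_rows = query_neuron_db(neuron)
    receptor_names = {row["receptor"] for row in neuron_rows}
    receptor_rows = query_receptor_db(receptor_names)
    gene_by_receptor = {row["receptor"]: row["gene"] for row in receptor_rows}
    return sorted(
        (name, gene_by_receptor[name])
        for name in receptor_names
        if name in gene_by_receptor
    )


if __name__ == "__main__":
    for receptor, gene in genes_for_neuron("CA1 pyramidal cell"):
        print(receptor, gene)
```

In this sketch the data never leave their sources until query time, so the results are always current, but every global query pays the cost of the subqueries. A warehouse would instead copy both tables into one store ahead of time and run the join locally.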
The Semantic Web [7] transforms the Web into a global
database or knowledge base by providing: i) globally
unique names through the Uniform Resource Identifiers
(URIs), ii) standard languages including the Resource
Description Framework (RDF), RDF Schema (RDFS),
and the Web Ontology Language (OWL) for modeling
data and creating ontologies, and iii) a standard query
language SPARQL [8]. Enabling technologies such as
ontology editors (e.g., Protégé), OWL reasoners (e.g.,
Pellet and FaCT++) and triplestores with SPARQL
endpoints (e.g., Virtuoso, AllegroGraph and Sesame) help
make the Semantic Web vision a reality. While these core
and enabling technologies are maturing, there are new
technological developments that can help push the
Semantic Web to a new level of data interoperability.
For example, Linked Data [9] is (...truncated)