Developing a kidney and urinary pathway knowledge base
Jupp et al. Journal of Biomedical Semantics 2011, 2(Suppl 2):S7
http://www.jbiomedsem.com/content/2/S2/S7
JOURNAL OF
BIOMEDICAL SEMANTICS
PROCEEDINGS
Open Access
Simon Jupp1*, Julie Klein2,3, Joost Schanstra2,3, Robert Stevens1*
From Bio-Ontologies 2010: Semantic Applications in Life Sciences
Boston, MA, USA. 9-10 July 2010
* Correspondence: simon.
; robert.
1 School of Computer Science, University of Manchester, UK
Abstract
Background: Chronic renal disease is a global health problem. The identification of
suitable biomarkers could facilitate early detection and diagnosis and allow better
understanding of the underlying pathology. One of the challenges in meeting this
goal is the necessary integration of experimental results from multiple biological
levels for further analysis by data mining. Data integration in the life sciences is still a
struggle, and many groups are looking to the benefits promised by the Semantic
Web for data integration.
Results: We present a Semantic Web approach to developing a knowledge base
that integrates data from high-throughput experiments on kidney and urine. A
specialised KUP ontology is used to tie the various layers together, whilst
background knowledge from external databases is incorporated by conversion into
RDF. Using SPARQL as a query mechanism, we are able to query for proteins
expressed in urine and place these back into the context of genes expressed in
regions of the kidney.
Conclusions: The KUPKB gives KUP biologists the means to ask queries across many
resources in order to aggregate knowledge that is necessary for answering biological
questions. The Semantic Web technologies we use, together with the background
knowledge from the domain’s ontologies, allow both rapid conversion and
integration of this knowledge base. The KUPKB is still relatively small, and questions
remain about scalability, maintenance and availability of the knowledge itself.
Availability: The KUPKB may be accessed via http://www.e-lico.eu/kupkb.
Introduction
The early detection and better understanding of (chronic) renal disease are important, as
the disease will reach pandemic proportions over the next few decades [1]. The biologist’s goal
in renal disease is to understand the pathological processes and identify disease biomarkers. This requires the analysis of experimental data from multiple biological levels
(e.g. genes, proteins and metabolites). These data need to be integrated with existing
knowledge from databases and the scientific literature to connect the different levels.
In addition, the kidney field is peculiar for at least two reasons:
1. the kidney is highly cellular and compartmentalised, and each compartment is
involved in many different functions, and
© 2011 Jupp et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
2. most of the large-scale data come from the analysis of urine and need to be put
into the ‘kidney’ context.
Altogether, this makes data analysis, for which data integration is a prerequisite, problematic. This paper presents a case study for developing a knowledge
base around a focused domain in the life sciences, namely the kidney and urinary
pathway (KUP). The KUP Knowledge Base (KUPKB) is being developed as part of the
e-LICO project [2]. e-LICO is developing a data mining platform that supports the
semi-automated construction of data mining workflows for data intensive sciences [3].
The e-LICO platform is to be demonstrated with a systems biology use case that uses
real data encountered in the KUP domain. The data spans multiple -omic levels and is
collected from different tissues and from different species. For example, most of the
human -omics data originates from urine [4] and needs to be related back to the kidney and its parts. In contrast, multilevel -omics data from animal models is more regularly available. e-LICO aims to develop tools that will mine these large scale disparate
experimental findings, link those to existing data and build new predictive models for
renal disease.
The KUPKB is built using a Semantic Web approach in order to assess the benefits
and feasibility of creating such a resource with this technology. The methodology section guides the reader through the creation of a Kidney and Urinary Pathway Ontology
(KUPO), which provides a specialised application ontology for the KUP domain. The
KUPO provides the schema for the data held in the KUPKB. Within this methodology
we explore the requirements for tools that help engage the biologists in the design and
construction of such an ontology. The results section describes the KUPKB with
examples of the kinds of queries that can be asked across multiple biological levels.
We conclude by discussing the merits and limitations of our approach.
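To give a flavour of what a query across biological levels looks like, the following is a minimal sketch in plain Python. It is not the KUPKB implementation: it holds a handful of invented triples and joins them in the way a SPARQL basic graph pattern would. Every identifier in it (the `ex:` names, the predicates, and the helper functions `match` and `urine_proteins_in_kidney_context`) is hypothetical and chosen only for illustration.

```python
# Toy triple store: every identifier below is hypothetical, not KUPKB data.
TRIPLES = {
    ("ex:proteinA", "ex:detectedIn", "ex:urine"),
    ("ex:proteinB", "ex:detectedIn", "ex:urine"),
    ("ex:proteinA", "ex:encodedBy", "ex:geneA"),
    ("ex:proteinB", "ex:encodedBy", "ex:geneB"),
    ("ex:geneA", "ex:expressedIn", "ex:glomerulus"),
    ("ex:geneB", "ex:expressedIn", "ex:proximal_tubule"),
    ("ex:geneC", "ex:expressedIn", "ex:glomerulus"),
}


def match(pattern, triples=TRIPLES):
    """Yield triples matching a (subject, predicate, object) pattern.

    None acts as a wildcard in any position.
    """
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


def urine_proteins_in_kidney_context(triples=TRIPLES):
    """Join three triple patterns, roughly equivalent to:

    ?protein ex:detectedIn ex:urine .
    ?protein ex:encodedBy  ?gene .
    ?gene    ex:expressedIn ?region .
    """
    rows = []
    for protein, _, _ in match((None, "ex:detectedIn", "ex:urine"), triples):
        for _, _, gene in match((protein, "ex:encodedBy", None), triples):
            for _, _, region in match((gene, "ex:expressedIn", None), triples):
                rows.append((protein, gene, region))
    return sorted(rows)


if __name__ == "__main__":
    for row in urine_proteins_in_kidney_context():
        print(row)
```

The nested loops stand in for the join that a SPARQL engine would perform over the same three patterns; a real store would also handle namespaces, typed literals and indexing, none of which is attempted here.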
Background
Data integration in the life sciences is an ongoing challenge in bioinformatics; problems
arise because standards for data formats, identifiers, common vocabularies and agreed
semantics between databases are lacking [5,6]. Data in the life sciences are complex
and volatile, which, taken together with the issues outlined, makes the necessary integration
of life sciences data hard work. Another factor is the large number of data resources published by independent groups, which adds to the heterogeneities that are
rife in life science data [7].
Developing new resources that integrate existing data typically involves centralising
the external data within new bespoke schemas. This ‘warehousing’ approach is common in the life sciences and over time leads to an increasing number of resources,
each with their own schema [7]. The situation with respect to accessing these data is,
however, improving with data providers often offering programmatic access to the data
via Web Services or database exports [8,9]. This access affords easier integration
opportunities, despite the semantic heterogeneities and the problem of identity of entities within life science data. The ‘identity crisis’ [10] is being addressed through
efforts such as shared names [11] and services such as BridgeDB [12], but widespread
compliance has yet to be realised. The adoption of ontologies for the annotation of
data is providing new possibilities for data integration that go beyond using primary
database entry identifiers alone.
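The point about ontology annotations can be illustrated with a small sketch, assuming two sources whose record identifiers cannot be mapped to each other but whose records are annotated with terms from a shared ontology. The sources, record identifiers, term identifiers and the `link_by_annotation` helper are all invented for this example and do not correspond to any real database.

```python
from collections import defaultdict

# Two hypothetical sources with unrelated record identifiers. Both annotate
# their records with terms from the same (made-up) ontology.
SOURCE_A = [
    {"id": "A-001", "annotation": "TERM:0001"},
    {"id": "A-002", "annotation": "TERM:0002"},
]
SOURCE_B = [
    {"id": "B-9", "annotation": "TERM:0001"},
    {"id": "B-7", "annotation": "TERM:0003"},
]


def link_by_annotation(left, right):
    """Pair records from two sources that share an ontology term."""
    # Index the right-hand source by ontology term.
    by_term = defaultdict(list)
    for record in right:
        by_term[record["annotation"]].append(record["id"])

    # Look up each left-hand record's term in that index.
    links = []
    for record in left:
        for other_id in by_term.get(record["annotation"], []):
            links.append((record["id"], record["annotation"], other_id))
    return links


if __name__ == "__main__":
    print(link_by_annotation(SOURCE_A, SOURCE_B))
```

Here the shared term, rather than a primary database identifier, is the join key; records annotated with terms that appear in only one source simply produce no link.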
The problem is exacerbated in the life sciences due to the nature of the data being
captured. Biological data are complex, heavily inter-related and also often irregular or
i (...truncated)