Developing a kidney and urinary pathway knowledge base

https://jbiomedsem.biomedcentral.com/track/pdf/10.1186/2041-1480-2-S2-S7

Jupp et al. Journal of Biomedical Semantics 2011, 2(Suppl 2):S7
http://www.jbiomedsem.com/content/2/S2/S7

JOURNAL OF BIOMEDICAL SEMANTICS
PROCEEDINGS Open Access

Simon Jupp1*, Julie Klein2,3, Joost Schanstra2,3, Robert Stevens1*

From Bio-Ontologies 2010: Semantic Applications in Life Sciences. Boston, MA, USA. 9-10 July 2010

* Correspondence: simon. ; robert.
1 School of Computer Science, University of Manchester, UK

Abstract

Background: Chronic renal disease is a global health problem. The identification of suitable biomarkers could facilitate early detection and diagnosis and allow better understanding of the underlying pathology. One of the challenges in meeting this goal is the necessary integration of experimental results from multiple biological levels for further analysis by data mining. Data integration in the life sciences is still a struggle, and many groups are looking to the benefits promised by the Semantic Web for data integration.

Results: We present a Semantic Web approach to developing a knowledge base that integrates data from high-throughput experiments on kidney and urine. A specialised KUP ontology is used to tie the various layers together, whilst background knowledge from external databases is incorporated by conversion into RDF. Using SPARQL as a query mechanism, we are able to query for proteins expressed in urine and place these back into the context of genes expressed in regions of the kidney.

Conclusions: The KUPKB gives KUP biologists the means to ask queries across many resources in order to aggregate the knowledge that is necessary for answering biological questions. The Semantic Web technologies we use, together with the background knowledge from the domain’s ontologies, allow both rapid conversion and integration of this knowledge base. The KUPKB is still relatively small, and questions remain about scalability, maintenance and availability of the knowledge itself.
Availability: The KUPKB may be accessed via http://www.e-lico.eu/kupkb.

© 2011 Jupp et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The early detection and better understanding of (chronic) renal disease is important because renal disease is expected to reach pandemic proportions over the next few decades [1]. The biologist’s goal in renal disease is to understand the pathological processes and identify disease biomarkers. This requires the analysis of experimental data from multiple biological levels (e.g. genes, proteins and metabolites). These data need to be integrated with existing knowledge from databases and the scientific literature to connect the different levels. In addition, the kidney field is peculiar for at least two reasons:

1. the kidney is highly cellular and compartmentalised, and each compartment is involved in many different functions; and
2. most of the large-scale data comes from analysis of urine, which needs to be put into the ‘kidney’ context.

Altogether this makes the analysis of data, for which integration of data is a prerequisite, problematic.

This paper presents a case study for developing a knowledge base around a focused domain in the life sciences, namely the kidney and urinary pathway (KUP). The KUP Knowledge Base (KUPKB) is being developed as part of the e-LICO project [2]. e-LICO is developing a data mining platform that supports the semi-automated construction of data mining workflows for data-intensive sciences [3]. The e-LICO platform is to be demonstrated with a systems biology use case that uses real data encountered in the KUP domain.
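
To make the integration requirement concrete, relating a urine-level observation back to a kidney compartment can be viewed as a join over subject-predicate-object statements. The following is a minimal illustrative sketch in plain Python, not the KUPKB implementation: the identifiers (protein:P1, gene:G1, kidney:glomerulus and so on), the predicate names and the data values are all hypothetical placeholders, and a real system would hold such statements as RDF and query them with SPARQL.

```python
# Toy in-memory triple store. Every identifier, predicate and value below is
# a hypothetical placeholder chosen for illustration, not real KUPKB data.
TRIPLES = [
    ("protein:P1", "detected_in", "sample:urine"),
    ("protein:P2", "detected_in", "sample:urine"),
    ("protein:P1", "encoded_by", "gene:G1"),
    ("protein:P2", "encoded_by", "gene:G2"),
    ("gene:G1", "expressed_in", "kidney:glomerulus"),
    ("gene:G2", "expressed_in", "kidney:proximal_tubule"),
    ("gene:G3", "expressed_in", "kidney:glomerulus"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def urine_proteins_in_kidney_context(triples):
    """Join three patterns: urine protein -> encoding gene -> kidney region."""
    results = []
    for protein, _, _ in match(triples, p="detected_in", o="sample:urine"):
        for _, _, gene in match(triples, s=protein, p="encoded_by"):
            for _, _, region in match(triples, s=gene, p="expressed_in"):
                results.append((protein, gene, region))
    return sorted(results)


if __name__ == "__main__":
    for row in urine_proteins_in_kidney_context(TRIPLES):
        print(row)
```

The join only succeeds because the protein, gene and anatomy statements share identifiers; when sources use different identifiers or vocabularies for the same entity, the chain breaks.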
The data in the KUP use case span multiple -omic levels and are collected from different tissues and from different species. For example, most of the human -omics data originates from urine [4] and needs to be related back to the kidney and its parts. In contrast, multilevel -omics data from animal models is more regularly available. e-LICO aims to develop tools that will mine these large-scale, disparate experimental findings, link them to existing data and build new predictive models for renal disease.

The KUPKB is built using a Semantic Web approach in order to assess the benefits and feasibility of creating such a resource with this technology. The methodology section guides the reader through the creation of a Kidney and Urinary Pathway Ontology (KUPO), which provides a specialised application ontology for the KUP domain. The KUPO provides the schema for the data held in the KUPKB. Within this methodology we explore the requirements for tools that help engage the biologists in the design and construction of such an ontology. The results section describes the KUPKB with examples of the kinds of queries that can be asked across multiple biological levels. We conclude by discussing the merits and limitations of our approach.

Background

Data integration in the life sciences is an ongoing challenge in bioinformatics; problems arise because standards for data formats, identifiers, common vocabularies and agreed semantics between databases are lacking [5,6]. Data in the life sciences are complex and volatile, which, taken together with the issues outlined above, makes the necessary integration of life sciences data hard work. Another factor is the large number of data resources published by independent groups, which adds to the heterogeneities that are rife in life science data [7]. Developing new resources that integrate existing data typically involves centralising the external data within new bespoke schemas.
This ‘warehousing’ approach is common in the life sciences and over time leads to an increasing number of resources, each with its own schema [7]. The situation with respect to accessing these data is, however, improving, with data providers often offering programmatic access to the data via Web Services or database exports [8,9]. This access affords easier integration opportunities, despite the semantic heterogeneities and the problem of identity of entities within life sciences data. The ‘identity crisis’ [10] is being addressed through efforts such as shared names [11] and services such as BridgeDB [12], but widespread compliance has yet to be realised. The adoption of ontologies for the annotation of data is providing new possibilities for data integration that go beyond using primary database entry identifiers alone.

The problem is exacerbated in the life sciences due to the nature of the data being captured. Biological data are complex, heavily inter-related and also often irregular or i (...truncated)