Life sciences on the Semantic Web: the Neurocommons and beyond
Abstract
Translational research, the effort to couple the results of basic research to clinical applications, depends on the ability to effectively answer questions using information that spans multiple disciplines. The Semantic Web, with its emphasis on combining information using standard representation languages, access to that information via standard web protocols, and technologies to leverage computation, such as inference and distributable query, offers a social and technological basis for assembling, integrating and making available biomedical knowledge at Web scale. In this article, we discuss the use of Semantic Web technology for assembling and querying biomedical knowledge from multiple sources and disciplines. We present the Neurocommons prototype knowledge base, a demonstration intended to show the feasibility and benefits of using these technologies. The prototype knowledge base can be used to experiment with and assess the scalability of current tools and methods for creating such a resource, and to elicit issues that will need to be addressed in order to expand its scope and use. We demonstrate the utility of the knowledge base by reviewing a few example queries that provide answers to precise questions relevant to the understanding of disease. All components of the knowledge base are freely available at http://neurocommons.org/, enabling readers to reconstruct the knowledge base and experiment with this new technology.
Keywords: Semantic Web, ontology, data integration, life science, medicine, neuroscience
INTRODUCTION
Understanding complex biological systems is a crucial challenge for modern biomedical science and informatics. In order to answer questions that might accelerate translational medicine, knowledge from different disciplines, research methodologies and repositories must be collected and integrated. However, the data and knowledge that measure and describe biomedical phenomena are scattered across numerous information systems, each with its own terminologies, identifier schemes, and data formats. One collation counts more than 1000 publicly accessible molecular biology databases [1]. There is little schema or ontology reuse between these. Beyond these lies a bulk of biomedical knowledge published in journals, monographs, and textbooks. Making effective computational use of all this knowledge is an important contemporary challenge.
Given this situation, it is difficult for researchers to find all available information about a subject of interest, and to organize it so that it can be found and understood. Scientists who attempt to form a comprehensive view of a biological phenomenon face tedious and error-prone computing tasks such as converting data formats and information schemas, querying different databases and combining the results of these queries, wrestling with a variety of uncoordinated application interfaces, and reading articles to extract and integrate relevant facts from them. Most of such a scientist's resources are spent working through the complexities of information systems instead of understanding the complexities of biological reality, which is the actual goal of biomedical research [2].
Instead of ushering in a new era of biomedical insight, the growing abundance of data on the web has intensified the need to develop new approaches to manage and integrate it. If we fail to do so, knowledge will remain fractured—encoded in a myriad of representational dialects—and effectively inaccessible to the majority of researchers.
As a means to change this situation, we have become interested in helping establish a Semantic Web for science [3,4]. By our assessment, the Semantic Web adds to existing Web standards a set of practices that encourage clearly specified names for things, classes, and relationships, organized and documented in ontologies, with data expressed in standardized, well-specified knowledge representation languages. Such a combination could enable computationally assisted management of information, ease the integration of different sources into a coherent system, and make knowledge more widely and easily accessible. As with the existing synergy between Internet and intranet, these technologies continue to enhance the ability to work with knowledge that spans public and organizational boundaries, an essential capability in an ecosystem of biomedical research that includes academia, pharmaceutical companies, medical clinics and government agencies.
A number of recent Semantic Web standards provide a part of the technical basis for such a vision, building on existing Web practices such as the ubiquitous use of Uniform Resource Identifiers (URIs) as globally unique names and documentation locators. The Resource Description Framework (RDF) [5], RDF Schema (RDFS) and the Web Ontology Language (OWL) [6] are standards for knowledge representation. RDF(S), by which we refer to both RDF and RDF Schema, provides a basic syntax, datatypes and the ability to use classes and instances. OWL goes beyond RDF(S) in offering more expressive ways of specifying classes, relations between classes, properties and relationships between instances. Unlike RDF(S), OWL is expressive enough to state inconsistent assertions, which enables tools that can profitably check consistency in the service of improving data quality.
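To make the point about consistency checking concrete, the following is a minimal Python sketch, not part of the Neurocommons code base. It represents statements as (subject, predicate, object) triples named by URIs and looks for one kind of contradiction that an OWL disjointness axiom makes detectable. The `rdf:type` and `owl:disjointWith` URIs are the standard ones; the `http://example.org/` namespace, the class and instance names, and the function `find_disjointness_violations` are hypothetical and chosen only for illustration. A real reasoner such as Pellet covers far more of OWL than this single check.

```python
# Toy illustration only: triples are plain tuples, not a real RDF/OWL toolkit.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_DISJOINT_WITH = "http://www.w3.org/2002/07/owl#disjointWith"
EX = "http://example.org/"  # hypothetical namespace, not from the article

# Each statement is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "Neuron", OWL_DISJOINT_WITH, EX + "Protein"),
    (EX + "thing1", RDF_TYPE, EX + "Neuron"),
    (EX + "thing1", RDF_TYPE, EX + "Protein"),  # contradicts the disjointness axiom
    (EX + "thing2", RDF_TYPE, EX + "Protein"),
}


def find_disjointness_violations(graph):
    """Return sorted (instance, class_a, class_b) tuples for every instance
    that is asserted to belong to two classes declared disjoint."""
    # Collect the asserted classes of each instance.
    types = {}
    for subject, predicate, obj in graph:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)

    violations = []
    for class_a, predicate, class_b in graph:
        if predicate != OWL_DISJOINT_WITH:
            continue
        for instance, classes in types.items():
            if class_a in classes and class_b in classes:
                violations.append((instance, class_a, class_b))
    return sorted(violations)


violations = find_disjointness_violations(triples)
for instance, class_a, class_b in violations:
    print(f"{instance} is asserted to be both {class_a} and {class_b}")
```

In plain RDF(S) there is no way to declare two classes disjoint, so these four triples could not contradict one another; it is the added expressiveness of OWL that gives a tool something to check.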
The query language SPARQL [7] is a first standard for posing queries against repositories of knowledge expressed in these languages. Reasoners such as Pellet [8] are able to compute implications of statements made in OWL, as well as perform consistency checking.
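The core idea behind a SPARQL query, matching a set of triple patterns with shared variables against a graph, can be sketched in a few lines of Python. This is a toy in-memory matcher under stated assumptions, not a SPARQL engine and not the way the knowledge base is queried. The `http://example.org/` namespace, the sample data, and the helper names `match_pattern` and `query` are all hypothetical.

```python
# Toy illustration only: a tiny in-memory pattern matcher, not a SPARQL engine.
EX = "http://example.org/"  # hypothetical namespace and data

triples = [
    (EX + "geneA", EX + "encodes", EX + "proteinA"),
    (EX + "geneB", EX + "encodes", EX + "proteinB"),
    (EX + "proteinA", EX + "participatesIn", EX + "signalTransduction"),
    (EX + "proteinB", EX + "participatesIn", EX + "dnaRepair"),
]


def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Try to extend `bindings` so that `pattern` matches `triple`.
    Return the extended bindings, or None if the match fails."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, graph):
    """Return every variable binding that satisfies all patterns at once."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Roughly equivalent to the SPARQL query:
#   SELECT ?gene WHERE { ?gene ex:encodes ?protein .
#                        ?protein ex:participatesIn ex:signalTransduction }
results = query(
    [
        ("?gene", EX + "encodes", "?protein"),
        ("?protein", EX + "participatesIn", EX + "signalTransduction"),
    ],
    triples,
)
genes = sorted(bindings["?gene"] for bindings in results)
print(genes)
```

The shared variable `?protein` is what joins the two patterns. This is how a single query can span facts that originated in different sources, provided those sources use the same URIs for the same things.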
The Neurocommons prototype is a knowledge base built as a first step towards Web scale integration of scientific knowledge. With it, we are already able to demonstrate how Semantic Web technologies can be applied in biomedical research, for instance by helping scientists more easily answer questions about background science and connections between different research disciplines. The prototype serves as one test bed for exploring the technical, social and legal processes that will be needed to achieve a future in which the results of research are placed seamlessly into the Web of science. It also demonstrates the productive use of existing ontologies and exposes the need for their augmentation and future development. Through our experience working with the SenseLab project [9], the OBO Foundry [10], and with members of the W3C Semantic Web for Health Care and Life Sciences Interest Group [11], we can report insights on methods of collaboration that can work in practice. The prototype is based on the Virtuoso open source triple store (http://virtuoso.openlinksw.com/) as an OWL and RDF repository, and comes with open access data. The knowledge base has been released with the express purpose of allowing others to replicate, experiment with and extend it.
We see this prototype as a step towards the Semantic Web for science. Below w (...truncated)