The UMLS Knowledge Source Server:
An Object Model For Delivering UMLS Data
Anantha Bangalore, Karen E. Thorn, Carolyn Tilley, Lee Peters
US National Library of Medicine, Bethesda, Maryland
The Unified Medical Language System® (UMLS®),
a project of the National Library of Medicine (NLM),
regularly distributes a set of knowledge sources to
the research community. These data are made
available over the Internet through the UMLS
Knowledge Source Server (UMLSKS). The new
version of the UMLSKS is a complete redesign of the
original system using Java and the Extensible
Markup Language (XML) technologies to implement
a fast, reliable, flexible, and extensible UMLS data
retrieval system that includes an Application
Programmer’s Interface (API) and an Object Model
of each of the Knowledge Sources: the UMLS
Metathesaurus, the Semantic Network, and the
SPECIALIST Lexicon. In this paper we present the
design of the new system, outline each of the system
design goals, describe the UMLS Object Model, and
report statistics showing the usage of the new
UMLSKS and associated data. We conclude with
implications for future work.
INTRODUCTION
The Unified Medical Language System® (UMLS®)
approach involves the development of a set of widely
distributed Knowledge Sources (Metathesaurus®,
Semantic Network, and SPECIALIST Lexicon). These
Knowledge Sources can be used by a variety of
computerized applications to compensate for
differences in the way concepts are expressed in a
variety of biomedical vocabularies [1]. Currently, over
1900 individuals and institutions have signed the
UMLS License Agreement, enabling them to receive
the UMLS data either on CD-ROM or through the
UMLS Knowledge Source Server (UMLSKS). A
smaller number of licensees (approximately 1200) have
registered for access to the UMLSKS.
The UMLS is large and complex and presents
significant challenges in retrieving information in a
comprehensive way. The centrally managed UMLSKS
provides system developers with UMLS information
remotely and on demand. The advantage of such an
approach is that it makes the Knowledge Sources
readily available and, perhaps more importantly,
spares developers from investing time and effort in
understanding the structure of the data files and other
details before using the UMLS data in their applications. In
1995, the UMLS data were made available for the first
time through the Internet-based UMLSKS [2]. Since
then there have been significant improvements to the
software and hardware components of the UMLSKS
resulting in enhanced performance, increased
flexibility, extensibility, and scalability, and better
software developer access to UMLS data.
Functionally, the UMLSKS is similar to previous
versions in that it lets remote users, individuals
as well as computer programs, send requests to a
server at the National Library of Medicine (NLM)
through multiple channels. The similarity ends there.
The old system ran as a single server using a
flag-based command-line Application Programmer’s
Interface (API) that was written in the “C”
programming language. The new Java-based system
was designed with the following tenets in mind:
• Extensibility for ease of new feature integration
• Flexibility by providing a rich API set to allow
system developers access to all UMLS data
elements
• Access to data through multiple channels
(web, XML/socket API, and Java API)
• Provision of a unified data model for the
Knowledge Sources for use by application
developers
• Scalability in handling ever increasing user
loads and increasing numbers of UMLS source
vocabularies
• Performance enhancement to provide faster
access to UMLS data
• Ease of administration by NLM staff and
contractors
The UMLSKS Object Model for each of the
Knowledge Sources allows users to ingest XML
documents produced by the UMLSKS and to
manipulate those data in an object-oriented fashion
within their own programs. The load on the new
system is spread across multiple machines to achieve
load balance and fault tolerance.
AMIA 2003 Symposium Proceedings − Page 51
UMLSKS API
The API provides a number of functions for querying
UMLS Knowledge Source information from the
UMLSKS. Two programming interfaces are available
to developers wishing to use the UMLSKS to retrieve
UMLS data content – a Java Remote Method
Invocation (RMI)-based mechanism and a TCP/IP
socket-based mechanism. The first scheme utilizes the
Java RMI package to establish a connection to the
UMLSKS that allows client applications to make
method calls from directly within their Java programs.
The underlying communications mechanism is hidden,
which frees the user from needing to manage the
communications with the UMLSKS server directly. The
second scheme is a lower-level mechanism that can be
used with any programming language. The
socket-based scheme includes a TCP/IP server running on
the UMLSKS server that accepts socket connections
from remote clients. Clients establish a connection to
this server socket, compose a UMLSKS API request
in XML format to send over this connection, and then
await receipt of the XML response from the server.
Client programs may be written in any language that
supports TCP/IP socket communication. Java
programmers can take further advantage of the API
by using the Object Model to interpret the returned
XML.
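The socket-based exchange described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the UMLSKS protocol itself: the request element names, the `findConcept` request type, and the message framing (the client half-closes the socket after writing, and the server closes the connection after responding) are all hypothetical, since the paper does not show the request schema. Host and port are supplied by the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class SocketClientSketch {

    // Escapes the characters that would otherwise break XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Composes a request document. The element and attribute names are
    // hypothetical; the real UMLSKS request schema is not shown in the paper.
    static String buildRequest(String term) {
        return "<request type=\"findConcept\"><term>" + escape(term) + "</term></request>";
    }

    // Opens a connection, sends one request, and reads the reply.
    // Assumed framing: the client half-closes the socket after writing,
    // and the server closes the connection after its response.
    static String exchange(String host, int port, String requestXml) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
            out.write(requestXml);
            out.flush();
            socket.shutdownOutput();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String request = buildRequest("heart attack");
        if (args.length == 2) {
            // Usage: java SocketClientSketch <host> <port>
            System.out.println(exchange(args[0], Integer.parseInt(args[1]), request));
        } else {
            // Without a server address, just show the composed request.
            System.out.println(request);
        }
    }
}
```

Because the client only needs a socket and a string of XML, the same three steps (connect, send, read) carry over to any language with TCP/IP support.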
The API is built on the premise that not every
developer requires all of the Metathesaurus.
Many applications require only a fraction
of the information available. With this in mind, the
API was developed to slice the Metathesaurus into
subsets of data. This results in a reduction of the total
amount of information traveling between the
UMLSKS and client applications and also provides
applications with fine-grained control over the data
they wish to receive. These modifications to the
software yield significant performance improvements
over the previous version.
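One way to picture this slicing is a request that names only the subsets the client wants returned. The sketch below is illustrative only: the `Subset` names, the `getConcept` and `include` elements, and the identifier `EXAMPLE-ID` are hypothetical and are not taken from the UMLSKS API.

```java
import java.util.EnumSet;
import java.util.Set;

class SubsetRequestSketch {

    // Hypothetical slices of a Metathesaurus concept record.
    enum Subset { TERMS, DEFINITIONS, SEMANTIC_TYPES, RELATIONS, SOURCES }

    // Builds a request that asks only for the listed subsets, so the
    // server can omit everything else from its response.
    static String buildConceptRequest(String conceptId, Set<Subset> subsets) {
        StringBuilder xml = new StringBuilder();
        xml.append("<getConcept id=\"").append(conceptId).append("\">");
        for (Subset subset : subsets) {
            xml.append("<include subset=\"").append(subset.name()).append("\"/>");
        }
        xml.append("</getConcept>");
        return xml.toString();
    }

    public static void main(String[] args) {
        // A client that only displays names and definitions asks for two slices.
        Set<Subset> wanted = EnumSet.of(Subset.TERMS, Subset.DEFINITIONS);
        System.out.println(buildConceptRequest("EXAMPLE-ID", wanted));
    }
}
```

Under this assumed design, the size of the response grows with what the client asks for, which is the source of the reduced traffic the text describes.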
The API exclusively uses XML for describing data for
each of the Knowledge Sources. As an industry
standard means of structuring information, XML
provides a platform-independent form for
representing hierarchical data like those of the
Knowledge Sources. XML is essentially plain text that
is self-describing through the use of descriptive data
tags. Many tools exist for parsing, manipulating, and
displaying XML; they make the developer’s job easier
by relieving them of that work and letting them focus
on application details. The use of XML gives the
system its extensibility and flexibility: proprietary
formats are dropped in favor of a more widely
available and accepted form, and new elements can be
added to XML documents without breaking clients that
ignore tags they do not recognize.
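As a small illustration of how standard tooling turns such XML into program objects, the sketch below parses a response with the DOM parser bundled with Java. The `concept` and `term` elements, the `id` attribute, and the `Concept` class are hypothetical stand-ins; they are not the UMLSKS response schema or the UMLS Object Model classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ConceptXmlSketch {

    // Hypothetical application-level object built from the XML.
    static final class Concept {
        final String id;
        final List<String> terms;

        Concept(String id, List<String> terms) {
            this.id = id;
            this.terms = terms;
        }
    }

    // Parses one <concept> document into a Concept object.
    static Concept parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Collect the text of every <term> element; unknown elements are ignored.
            NodeList termNodes = root.getElementsByTagName("term");
            List<String> terms = new ArrayList<>();
            for (int i = 0; i < termNodes.getLength(); i++) {
                terms.add(termNodes.item(i).getTextContent().trim());
            }
            return new Concept(root.getAttribute("id"), terms);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse concept XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<concept id=\"EXAMPLE-1\">"
                + "<term>Heart attack</term><term>Myocardial infarction</term>"
                + "<note>ignored by this client</note></concept>";
        Concept concept = parse(xml);
        System.out.println(concept.id + " " + concept.terms);
    }
}
```

The `note` element in the sample input is skipped without error, which is the kind of tolerance to added elements that lets a response format grow without breaking existing clients.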
UMLS OBJECT MODEL
Previously, the onus has been on application
developers to create their own usable data model for
the Knowledge Sources. Each developer needed to
understand the relational data representation
delivered by the UMLS development group in order
to abstract the Knowledge Source contents into
application level components. Competing UMLS
object models existed but without a cons (...truncated)