Strength in Numbers: Exploring Redundancy in Hierarchical Relations across Biomedical Terminologies
Olivier Bodenreider, M.D., Ph.D.
U.S. National Library of Medicine, Bethesda, Maryland
National Institutes of Health, Department of Health & Human Services
Objectives: To investigate three aspects of the redundancy of hierarchical relations across biomedical
terminologies: 1) What proportion of the relations is
redundant? 2) Which terminologies tend to overlap
with other terminologies? 3) Is there a link
between redundancy and semantic consistency?
Methods: Hierarchical relations are counted in the
various families of terminologies integrated into the
UMLS and an index of redundancy is computed for
each relation. Similarity among sources is computed
using the classical cosine method. Semantic consistency is evaluated by reference to the UMLS Semantic Network. Results: Overall, 29% of the 1,128,261
relations examined exhibit redundancy. The most similar
sources include consecutive versions of terminologies. The link between redundancy and semantic
consistency is weak. Discussion: Applications of
these findings are discussed, including selecting
sources, selecting useful relations, and auditing the
categorization of UMLS concepts.
INTRODUCTION
Redundancy in biomedical terminologies has been
considered mainly from the perspective of
concepts [1, 2]. Providing multiple names for a concept (i.e., synonymy) is generally considered a valuable feature [3], while multiple ways of representing a
concept (e.g., through compositionality) should be
avoided (unless the system allows equivalent expressions to be recognized as such at the application level)
[4]. At the same time, most authors favor multiple
levels of granularity for concepts and multiple categorization of concepts (resulting in multiple hierarchies) [1, 2].
In practice, these two features contribute to creating
multiple paths between two concepts. For example,
one path from Pulmonary tuberculosis to Disease
may include Lung disease, while another includes
Infectious disease (multiple inheritance). Moreover,
the concept Mycobacterium infection may intervene
between Pulmonary tuberculosis and Infectious
disease in a terminology providing a higher level of
granularity. The existence of multiple paths between
two concepts is of course compounded when several
terminologies are merged to form a broad terminological
system such as the Unified Medical Language
System® (UMLS®) Metathesaurus®.
From the perspective of relations, the existence of
multiple paths between two concepts can be regarded
as a different form of redundancy. The relation (C1,
parent of, C2) may be considered redundant if it is
found in several terminologies or if it can be inferred
by combining several other relations, e.g., (C1, parent
of, C3) and (C3, parent of, C2).
The objective of this experiment is to explore the
redundancy of hierarchical relations in an inherently
redundant terminological system: the UMLS
Metathesaurus. More precisely, we want to address
the following three aspects of redundancy in hierarchical relations in the Metathesaurus:
1) What proportion of the relations is redundant?
2) Which terminologies tend to overlap with other
terminologies? (in terms of relations)
3) Is there a link between redundancy and semantic
consistency?
We show that knowledge about redundancy in hierarchical relations may help customize a terminological
system for various kinds of applications.
MATERIALS
The terminological system evaluated in this study is
the Unified Medical Language System (UMLS),
developed and maintained by the National Library of
Medicine. The UMLS Metathesaurus¹ (13th edition,
2002AA) contains over 775,000 concepts from some
sixty families of biomedical terminologies and over
ten million relations (i.e., pairs of related concepts).
As with concepts, each relation may come from one
or more sources. Nearly 1.2 million of these relations
are hierarchical relations contributed by
the constituent terminologies or added by the
Metathesaurus editors, namely (C1, parent of, C2)
and (C1, broader than, C2) in Metathesaurus parlance. In order to benefit from the properties of directed acyclic graphs, we used a slightly modified
version of the Metathesaurus from which the circular
hierarchical relations have been removed [5].
¹ umlsinfo.nlm.nih.gov
AMIA 2003 Symposium Proceedings − Page 101
After this process was applied, 1,155,673 hierarchical relations remained.
process was applied. In the Metathesaurus, each concept is categorized by means of semantic types from
the Semantic Network. As mentioned in several studies [e.g., 6], this feature makes it possible to check the
semantic validity of a hierarchical relationship between two concepts by comparing it to the relationships represented between the semantic types of the
two concepts in the Semantic Network.
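A minimal sketch of this kind of check follows. It assumes one plausible reading of "semantic validity": a hierarchical relation between two concepts is treated as valid when some semantic type of the child concept is identical to, or a descendant of, some semantic type of the parent concept. The function names and the toy type names (`T_general`, `T_specific`, `T_unrelated`) are hypothetical and are not taken from the Semantic Network.

```python
def ancestors_or_self(sem_type, isa):
    """Return sem_type plus every type reachable by following isa links upward."""
    seen = set()
    stack = [sem_type]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(isa.get(current, ()))
    return seen


def is_semantically_consistent(parent_types, child_types, isa):
    """True if some type of the child equals or descends from some type of the parent.

    This criterion is an assumption made for illustration only.
    """
    return any(
        parent_type in ancestors_or_self(child_type, isa)
        for child_type in child_types
        for parent_type in parent_types
    )


# Hypothetical isa links among semantic types: child type -> list of parent types.
toy_isa = {"T_specific": ["T_general"]}

consistent = is_semantically_consistent({"T_general"}, {"T_specific"}, toy_isa)
inconsistent = is_semantically_consistent({"T_unrelated"}, {"T_specific"}, toy_isa)
```

Here `consistent` is true because `T_specific` descends from `T_general`, while `inconsistent` is false because no isa path links the two types.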
METHODS
Before investigating the three questions asked in the
introduction, we present the criteria we used
to define families of terminologies, redundancy,
and semantic consistency.
Definitions
Families of terminologies. In the UMLS, the constituent vocabularies are grouped by family². For
example, all translations of MeSH are part of the
“MeSH family”, identified by ‘MSH’. Except for
minor differences, we used the same grouping in this
study and we refer the reader to the UMLS documentation for the full name of the source vocabularies.
Forty-three families of terminologies contribute relations to the Metathesaurus.
Redundancy. The intuitive notion of redundancy for
a relation is that of a relation shared by several
sources. The redundancy for a given relation would
thus be proportional to the number of sources providing this relation. However, this definition does not
account for differences in granularity across terminologies or multiple categorization. Indeed, the pairs
of hierarchically related concepts (C1, C2) and (C2,
C3) can be seen as redundant with the pair (C1, C3).
Moreover, the pairs (C1, C4) and (C4, C3) would also
be redundant with (C1, C3). Thus, redundancy for (Ci,
Cj) is better defined in terms of the number of paths
between Ci and Cj.
The index of redundancy for a given pair (Ci, Cj) is
defined as the sum of the indexes of redundancy of
all paths between Ci and Cj. The index of redundancy of a given path is the smallest number of
sources found for any pair of adjacent concepts along the path
(“weakest link” approach). As illustrated in Figure 1,
the index of redundancy for (A, D) may be significantly higher than the number of sources for the direct relation between the two concepts.
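The definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the study: the function name `redundancy_index`, the edge dictionary, and the source counts are hypothetical and are not the values shown in Figure 1. The graph is assumed to be acyclic, as in the modified Metathesaurus described in Materials.

```python
from collections import defaultdict


def redundancy_index(edges, start, end):
    """Sum, over every path from start to end, the smallest source count on that path.

    edges maps (parent, child) to the number of sources asserting that pair.
    The graph is assumed to be a directed acyclic graph.
    """
    if start == end:
        return 0
    children = defaultdict(list)
    for (parent, child), n_sources in edges.items():
        children[parent].append((child, n_sources))

    def walk(node, weakest):
        # weakest holds the smallest source count seen so far on the current path.
        if node == end:
            return weakest
        return sum(walk(child, min(weakest, n)) for child, n in children[node])

    return walk(start, float("inf"))


# Hypothetical source counts for a small graph with three paths from A to D.
toy_edges = {
    ("A", "D"): 1,  # direct relation, one source
    ("A", "B"): 3,
    ("B", "D"): 2,  # path A-B-D contributes min(3, 2) = 2
    ("A", "C"): 2,
    ("C", "D"): 4,  # path A-C-D contributes min(2, 4) = 2
}

index_a_d = redundancy_index(toy_edges, "A", "D")  # 1 + 2 + 2 = 5
```

With these toy counts the direct relation (A, D) has a single source, yet its index of redundancy is 5. Enumerating every path is exponential in the worst case, so this recursive form is only suitable for small examples.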
In this experiment, we do not distinguish among the
different types of hierarchical relationships in the
Metathesaurus: a hierarchical relation is considered either present or absent in a source, regardless of
its type in the source (parent, broader than, or both).
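As a rough illustration of this simplification, typed relations can be collapsed into one entry per concept pair, recording only which sources assert the pair. The row layout, the function name `collapse_relations`, and the source label `SRC_X` are hypothetical; `MSH` is the MeSH family identifier mentioned above.

```python
from collections import defaultdict


def collapse_relations(rows):
    """Map each (parent, child) pair to the set of sources asserting it.

    The relation type is deliberately ignored: a pair is either present
    or absent in a source.
    """
    sources_by_pair = defaultdict(set)
    for parent, _relation_type, child, source in rows:
        sources_by_pair[(parent, child)].add(source)
    return dict(sources_by_pair)


# Hypothetical rows: (parent concept, relation type, child concept, source family).
toy_rows = [
    ("C1", "parent of", "C2", "MSH"),
    ("C1", "broader than", "C2", "MSH"),  # same pair, same source, other type
    ("C1", "parent of", "C2", "SRC_X"),
]

collapsed = collapse_relations(toy_rows)
n_sources = len(collapsed[("C1", "C2")])  # MSH is counted once, so 2 sources
```

Using a set per pair means a source that asserts both "parent of" and "broader than" for the same pair is counted once.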
² Family information is located in the column SF of the table
MRSAB, recently added to the UMLS distribution
1
A
2
(...truncated)