Strength in numbers: exploring redundancy in hierarchical relations across biomedical terminologies.

AMIA Annual Symposium Proceedings, Aug 2024




Strength in Numbers: Exploring Redundancy in Hierarchical Relations across Biomedical Terminologies

Olivier Bodenreider, M.D., Ph.D.
U.S. National Library of Medicine, Bethesda, Maryland
National Institutes of Health, Department of Health & Human Services

Objectives: To investigate three aspects of the redundancy of hierarchical relations across biomedical terminologies: 1) What proportion of the relations is redundant? 2) Which terminologies tend to overlap with other terminologies? 3) Is there a link between redundancy and semantic consistency?

Methods: Hierarchical relations are counted in the various families of terminologies integrated into the UMLS and an index of redundancy is computed for each relation. Similarity among sources is computed using the classical cosine method. Semantic consistency is evaluated by reference to the UMLS Semantic Network.

Results: Overall, 29% of the 1,128,261 relations examined exhibit redundancy. Most similar sources include consecutive versions of terminologies. The link between redundancy and semantic consistency is weak.

Discussion: Applications of these findings are discussed, including selecting sources, selecting useful relations, and auditing the categorization of UMLS concepts.

INTRODUCTION

Redundancy in biomedical terminologies has been considered essentially from the perspective of the concepts [1, 2]. Providing multiple names for a concept (i.e., synonymy) is generally considered a valuable feature [3], while multiple ways of representing a concept (e.g., through compositionality) should be avoided (unless the system allows equivalent expressions to be recognized as such at the application level) [4]. At the same time, most authors favor multiple levels of granularity for concepts and multiple categorization of the concepts (resulting in multiple hierarchies) [1, 2]. In practice, these two features contribute to creating multiple paths between two concepts.
For example, one path from Pulmonary tuberculosis to Disease may include Lung disease, while another includes Infectious disease (multiple inheritance). Moreover, the concept Mycobacterium infection may intervene between Pulmonary tuberculosis and Infectious disease in a terminology providing a higher level of granularity. The existence of multiple paths between two concepts is of course compounded when several terminologies are merged to form a broad terminological system such as the Unified Medical Language System® (UMLS®) Metathesaurus®.

From the perspective of relations, the existence of multiple paths between two concepts can be regarded as a different form of redundancy. The relation (C1, parent of, C2) may be considered redundant if it is found in several terminologies or if it can be inferred by combining several other relations, e.g., (C1, parent of, C3) and (C3, parent of, C2).

The objective of this experiment is to explore the redundancy of hierarchical relations in an inherently redundant terminological system: the UMLS Metathesaurus. More precisely, we want to address the following three aspects of redundancy in hierarchical relations in the Metathesaurus:

1) What proportion of the relations is redundant?
2) Which terminologies tend to overlap with other terminologies? (in terms of relations)
3) Is there a link between redundancy and semantic consistency?

We show that knowledge about redundancy in hierarchical relations may help customize a terminological system for various kinds of applications.

MATERIALS

The terminological system evaluated in this study is the Unified Medical Language System (UMLS), developed and maintained by the National Library of Medicine. The UMLS Metathesaurus¹ (13th edition, 2002AA) contains over 775,000 concepts from some sixty families of biomedical terminologies and over ten million relations (i.e., pairs of related concepts). As for the concepts, each relation may come from one or more sources.
Nearly 1.2 million of these relations correspond to hierarchical relations contributed by the constituent terminologies or added by the Metathesaurus editors, namely (C1, parent of, C2) and (C1, broader than, C2) in Metathesaurus parlance. In order to benefit from the properties of directed acyclic graphs, we used a slightly modified version of the Metathesaurus from which the circular hierarchical relations have been removed [5]. After this process was applied, 1,155,673 hierarchical relations remained.

In the Metathesaurus, each concept is categorized by means of semantic types from the Semantic Network. As mentioned in several studies [e.g., 6], this feature makes it possible to check the semantic validity of a hierarchical relationship between two concepts by comparing it to the relationships represented between the semantic types of the two concepts in the Semantic Network.

¹ umlsinfo.nlm.nih.gov

AMIA 2003 Symposium Proceedings, page 101

METHODS

Prior to investigating the three questions asked in the introduction, we present the criteria we used for defining families of terminologies, redundancy, and semantic consistency.

Definitions

Families of terminologies. In the UMLS, the constituent vocabularies are grouped by family². For example, all translations of MeSH are part of the “MeSH family”, identified by ‘MSH’. Except for minor differences, we used the same grouping in this study and we refer the reader to the UMLS documentation for the full name of the source vocabularies. Forty-three families of terminologies contribute relations to the Metathesaurus.

Redundancy. The intuitive notion of redundancy for a relation is that of a relation shared by several sources. The redundancy for a given relation would thus be proportional to the number of sources providing this relation. However, this definition does not account for differences in granularity across terminologies or multiple categorization.
Indeed, the pairs of hierarchically related concepts (C1, C2) and (C2, C3) can be seen as redundant with the pair (C1, C3). Moreover, the pairs (C1, C4) and (C4, C3) would also be redundant with (C1, C3). Thus, redundancy for (Ci, Cj) can rather be defined in terms of number of paths between Ci and Cj. The index of redundancy for a given pair (Ci, Cj) is defined as the sum of the indexes of redundancy for each path between Ci and Cj. The index of redundancy for a given path is the minimum number of sources for each pair of concepts along the path (“weakest link” approach). As illustrated in Figure 1, the index of redundancy for (A, D) may be significantly higher than the number of sources for the direct relation between the two concepts.

In this experiment, we do not distinguish between the several types of hierarchical relationships in the Metathesaurus and a hierarchical relation is considered either present or absent in a source, regardless of its type in the source (parent, broader than, or both).

² Family information is located in the column SF of the table MRSAB, recently added to the UMLS distribution.

[Figure 1]

(...truncated)
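The index of redundancy defined in the text (sum over all paths of the minimum number of sources along each path) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the toy graph, the source names S1 to S4, and the function names are all hypothetical, and the sketch assumes the relations form a directed acyclic graph, as in the modified Metathesaurus described above.

```python
# Hypothetical toy data: each hierarchical relation (parent, child) maps to
# the set of source families asserting it. Names and counts are illustrative
# only and do not reproduce Figure 1 of the paper.
RELATIONS = {
    ("A", "D"): {"S1"},
    ("A", "B"): {"S1", "S2"},
    ("B", "D"): {"S2", "S3", "S4"},
    ("A", "C"): {"S3"},
    ("C", "D"): {"S3", "S4"},
}


def build_children(relations):
    """Map each parent to {child: number of sources asserting the relation}."""
    children = {}
    for (parent, child), sources in relations.items():
        children.setdefault(parent, {})[child] = len(sources)
    return children


def path_indexes(children, start, goal):
    """Yield the "weakest link" index of every path from start to goal.

    Assumes the graph is acyclic, so the recursion always terminates.
    """
    def walk(node, weakest):
        if node == goal:
            yield weakest
            return
        for child, n_sources in children.get(node, {}).items():
            yield from walk(child, min(weakest, n_sources))

    for child, n_sources in children.get(start, {}).items():
        yield from walk(child, n_sources)


def redundancy_index(relations, start, goal):
    """Sum the per-path indexes over all paths between two concepts."""
    return sum(path_indexes(build_children(relations), start, goal))


if __name__ == "__main__":
    print(redundancy_index(RELATIONS, "A", "D"))
```

In this toy graph the direct relation (A, D) comes from a single source, while the paths through B and C contribute min(2, 3) = 2 and min(1, 2) = 1 respectively, so the pair (A, D) gets an index higher than its direct source count. Enumerating every path is exponential in the worst case; a graph of the Metathesaurus's size would likely need a more economical strategy, which the excerpt does not describe.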


This is a preview of a remote PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480297/pdf/
Article home page: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480297

O. Bodenreider. Strength in numbers: exploring redundancy in hierarchical relations across biomedical terminologies. AMIA Annual Symposium Proceedings, pp. 101,