DNorm: disease name normalization with pairwise learning to rank
Robert Leaman [0, 1], Rezarta Islamaj Doğan [1], Zhiyong Lu [1]
Associate Editor: Jonathan Wren
[0] Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA
[1] National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA
Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than for other normalization tasks in biomedical text mining research.
Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.
Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
© The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
Diseases are central to many lines of biomedical research, and
enabling access to disease information is the goal of many
information extraction and text mining efforts (Islamaj Doğan and
Lu, 2012b; Kang et al., 2012; Névéol et al., 2012; Wiegers et al.,
2012). The task of disease normalization consists of finding
disease mentions and assigning a unique identifier to each. This task
is important in many lines of inquiry involving disease, including
etiology (e.g. gene-disease relationships) and clinical aspects
(e.g. diagnosis, prevention and treatment).
Disease may be defined broadly as any impairment of normal
biological function (Hunter, 2009). Given the wide range of
concepts that may thus be categorized as diseases, with their respective
etiologies, clinical presentations and their various histories of
diagnosis and treatment, disease names naturally exhibit
considerable variation. This variation presents not only in
synonymous terms for the same disease, but also in the diverse logic used
to create the disease names themselves.
Disease names are often created by combining roots and
affixes from Greek or Latin (e.g. hemochromatosis). A
particularly flexible way to create disease names is to combine a disease
category with a short descriptive modifier, which may take many
forms, including anatomical locations (breast cancer),
symptoms (cat-eye syndrome), treatment (Dopa-responsive
dystonia), causative agent (staph infection), biomolecular
etiology (G6PD deficiency), heredity (X-linked
agammaglobulinemia) or eponyms (Schwartz-Jampel syndrome). Modifiers
are also frequently used to provide description not part of the
name (e.g. severe malaria).
When diseases are mentioned in text, they are frequently also
abbreviated, exhibit morphological or orthographical variations,
use different word orderings or use synonyms. These variations
may involve more than single word substitutions. For example,
because affixes are often composed, a single word
(oculocerebrorenal) may correspond to multiple words (eye,
brain and kidney) in another form.
The disease normalization task is further complicated by the
overlap between disease concepts, forcing systems that locate and
normalize diseases in natural language text to balance handling
name variations with differentiating between concepts to achieve
good performance. Previous works addressing disease name
normalization (DNorm) typically use a hybrid of lexical and
linguistic approaches (Islamaj Doğan and Lu, 2012b; Jimeno
et al., 2008; Kang et al., 2012). While string normalization
techniques (e.g. case folding, stemming) do allow some
generalization, the name variations in the lexicon always impose some
limitation. Machine learning may enable higher performance
by modeling the language that authors use to describe diseases
in text; however, there have been relatively few attempts to use
machine learning in normalization, and none for disease names.
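The limitation of purely lexical matching can be illustrated with a small sketch. It assumes a toy lexicon with placeholder concept identifiers (not real MEDIC entries), and a deliberately crude plural-stripping rule standing in for a real stemmer; the function names `normalize` and `lookup` are hypothetical and do not come from any of the systems cited above.

```python
import re

def normalize(name: str) -> str:
    """Case-fold, drop punctuation, and strip a trailing 's' from longer tokens.

    The plural stripping is a crude stand-in for a real stemmer.
    """
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    stems = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(stems)

# Toy lexicon: identifiers are placeholders, not real vocabulary IDs.
LEXICON = {
    "CONCEPT:1": ["Breast Neoplasms", "breast cancer"],
    "CONCEPT:2": ["Oculocerebrorenal syndrome"],
}

# Map each normalized name to its concept identifier.
INDEX = {normalize(n): cid for cid, names in LEXICON.items() for n in names}

def lookup(mention: str):
    """Return the concept whose normalized name equals the normalized mention, or None."""
    return INDEX.get(normalize(mention))
```

Under these assumptions, `lookup("Breast cancers")` succeeds because case folding and stemming reduce it to a listed name, whereas `lookup("cancer of the breast")` returns `None`: the word order differs from every name in the lexicon, so no amount of per-token normalization recovers the match.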
In this work we use the NCBI disease corpus (Islamaj Doğan
and Lu, 2012a), which has recently been updated to include
concept annotations (Islamaj Doğan et al., unpublished data),
to consider the task of disease normalization. We describe the
task as follows: given an abstract, return the set of disease
concepts mentioned. Our current purpose is to support
entity-specific semantic search of the biomedical literature (Lu, 2011)
and computer-assisted biocuration, especially document triage
(Kim et al., 2012).
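Because the task is defined at the abstract level, a system's output can be scored by comparing, for each abstract, the set of predicted concepts with the set of annotated concepts. The sketch below shows one common way to compute micro- and macro-averaged F-measure over such sets. The exact averaging convention is an assumption here (macro is taken as the mean of per-abstract F-measures), and the function names and concept identifiers are hypothetical.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold: dict, predicted: dict):
    """Score abstract-level concept sets.

    gold, predicted: dict mapping an abstract id to a set of concept ids.
    Only abstracts present in `gold` are scored.
    Returns (micro_f, macro_f).
    """
    tp_total = fp_total = fn_total = 0
    per_abstract_f = []
    for doc_id, gold_set in gold.items():
        pred_set = predicted.get(doc_id, set())
        tp = len(gold_set & pred_set)
        fp = len(pred_set - gold_set)
        fn = len(gold_set - pred_set)
        tp_total += tp
        fp_total += fp
        fn_total += fn
        per_abstract_f.append(prf(tp, fp, fn)[2])
    micro_f = prf(tp_total, fp_total, fn_total)[2]           # pool counts, then score
    macro_f = sum(per_abstract_f) / len(per_abstract_f)      # score, then average
    return micro_f, macro_f

# Toy data with placeholder identifiers.
gold = {"abstract-a": {"C1", "C2", "C3"}, "abstract-b": {"C4"}}
predicted = {"abstract-a": {"C1", "C2", "C3"}, "abstract-b": {"C5"}}
micro_f, macro_f = evaluate(gold, predicted)
```

Micro-averaging pools the counts over all abstracts, so abstracts with many concepts weigh more; macro-averaging gives every abstract equal weight, which is why the two figures can differ on the same output.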
In this article we introduce DNorm, the first machine learning
method to normalize disease names in biomedical text. Our
technique learns the similarity between mentions and concept names
directly from the training data, thereby focusing on the candidate
generation phase of normalization. Our technique can learn
arbitrary mappings between mentions and names, including
synonymy, polysemy and relationships that are not 1-to-1.
Moreover, our method specifically handles abbreviations and
word order variations. Our method is based on pairwise learning
to rank (pLTR), which has been successfully applied to large
optimization problems in information retrieval (Bai et al.,
2010), but to the best of our knowledge has not previously
been used for concept normalization.
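To make the idea of pairwise learning to rank concrete, the following is a minimal sketch and not the DNorm implementation. It assumes a bilinear scoring function between bag-of-words vectors, a weight matrix initialized to the identity, a margin of 1 and a fixed learning rate; none of these details are stated in this section. The class `PairwiseRanker`, the helper `vectorize` and the toy names are hypothetical.

```python
from itertools import product

def vectorize(text: str) -> dict:
    """L2-normalized bag-of-words vector (a simplification; no TF-IDF weighting)."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {t: v / norm for t, v in counts.items()}

class PairwiseRanker:
    """Assumed model: score(m, n) = sum_ij m_i * W_ij * n_j, with W starting at identity."""

    def __init__(self, lr: float = 0.5):
        self.lr = lr
        self.delta = {}  # learned deviation of W from the identity matrix

    def weight(self, i: str, j: str) -> float:
        return (1.0 if i == j else 0.0) + self.delta.get((i, j), 0.0)

    def score(self, m: dict, n: dict) -> float:
        return sum(mv * self.weight(i, j) * nv
                   for (i, mv), (j, nv) in product(m.items(), n.items()))

    def update(self, mention: str, good: str, bad: str) -> bool:
        """One pairwise step: raise `good` above `bad` if the margin is below 1."""
        m, pos, neg = vectorize(mention), vectorize(good), vectorize(bad)
        if self.score(m, pos) - self.score(m, neg) >= 1.0:
            return False  # pair already ranked correctly with sufficient margin
        for i, mv in m.items():
            for j, pv in pos.items():
                self.delta[(i, j)] = self.delta.get((i, j), 0.0) + self.lr * mv * pv
            for j, nv in neg.items():
                self.delta[(i, j)] = self.delta.get((i, j), 0.0) - self.lr * mv * nv
        return True

    def rank(self, mention: str, names: list) -> list:
        m = vectorize(mention)
        return sorted(names, key=lambda n: self.score(m, vectorize(n)), reverse=True)

# Toy training pair: the mention shares no token with its correct name.
names = ["renal carcinoma", "kidney stone"]
ranker = PairwiseRanker()
before = ranker.rank("kidney cancer", names)
while ranker.update("kidney cancer", "renal carcinoma", "kidney stone"):
    pass
after = ranker.rank("kidney cancer", names)
```

With the identity initialization the ranker starts out as plain token overlap, so the name sharing a token with the mention ranks first. The pairwise updates then shift weight toward token pairs that co-occur in correct mention-name pairs, which is how such a model could in principle learn synonymy that no string normalization would capture.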
Related work
Biomedical named entity recognition (NER) research has
received increased attention recently, partly owing to
BioCreative (Hirschman et al., 2005b) and BioNLP (Kim
et al., 2009) challenges on recognition of genes, proteins and
biological events in the scientific literature, as well as TREC
(Voorhees and Tong, 2011) and i2b2 (Uzuner et al., 2011)
challenges on identification of drugs, diseases and medical tests in
electronic patient records.
The problem of concept normalization has seen substantial
work for genes and proteins, as a result of a series of tasks
that were part of the BioCreative competitions (Hirschman
et al., 20 (...truncated)