DNorm: disease name normalization with pairwise learning to rank
BIOINFORMATICS
ORIGINAL PAPER
Data and text mining
Vol. 29 no. 22 2013, pages 2909–2917
doi:10.1093/bioinformatics/btt474
Advance Access publication August 21, 2013
Robert Leaman (1,2), Rezarta Islamaj Doğan (1) and Zhiyong Lu (1,*)
(1) National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA
(2) Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA
Associate Editor: Jonathan Wren
ABSTRACT
Motivation: Despite the central role of diseases in biomedical
research, there have been far fewer attempts to automatically
determine which diseases are mentioned in a text—the task of disease
name normalization (DNorm)—compared with other normalization
tasks in biomedical text mining research.
Methods: In this article we introduce the first machine learning
approach for DNorm, using the NCBI disease corpus and the
MEDIC vocabulary, which combines MeSH® and OMIM. Our
method is a high-performing and mathematically principled framework
for learning similarities between mentions and concept names directly
from training data. The technique is based on pairwise learning to
rank, which has not previously been applied to the normalization
task but has proven successful in large optimization problems for
information retrieval.
Results: We compare our method with several techniques based on
lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase of 0.121 and 0.098, respectively, over the highest-performing baseline
method.
Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results
on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
Contact:
Received on March 13, 2013; revised on August 8, 2013; accepted on
August 9, 2013
1 INTRODUCTION
Diseases are central to many lines of biomedical research, and
enabling access to disease information is the goal of many information extraction and text mining efforts (Islamaj Doğan and
Lu, 2012b; Kang et al., 2012; Névéol et al., 2012; Wiegers et al.,
2012). The task of disease normalization consists of finding disease mentions and assigning a unique identifier to each. This task
is important in many lines of inquiry involving disease, including
etiology (e.g. gene–disease relationships) and clinical aspects
(e.g. diagnosis, prevention and treatment).
Disease may be defined broadly as ‘any impairment of normal
biological function’ (Hunter, 2009). Given the wide range of concepts that may thus be categorized as diseases, each with its own etiology, clinical presentation and history of diagnosis and treatment, disease names naturally exhibit considerable variation. This variation appears not only in synonymous terms for the same disease, but also in the diverse logic used
to create the disease names themselves.
*To whom correspondence should be addressed.
Disease names are often created by combining roots and
affixes from Greek or Latin (e.g. ‘hemochromatosis’). A particularly flexible way to create disease names is to combine a disease
category with a short descriptive modifier, which may take many
forms, including anatomical locations (‘breast cancer’), symptoms (‘cat-eye syndrome’), treatment (‘Dopa-responsive dystonia’), causative agent (‘staph infection’), biomolecular
etiology (‘G6PD deficiency’), heredity (‘X-linked agammaglobulinemia’) or eponyms (‘Schwartz-Jampel syndrome’). Modifiers
are also frequently used to provide description that is not part of the
name (e.g. ‘severe malaria’).
When diseases are mentioned in text, they are frequently also
abbreviated, exhibit morphological or orthographical variations,
use different word orderings or use synonyms. These variations
may involve more than single word substitutions. For example,
because affixes are often composed, a single word
(‘oculocerebrorenal’) may correspond to multiple words (‘eye,
brain and kidney’) in another form.
The disease normalization task is further complicated by the
overlap between disease concepts, forcing systems that locate and
normalize diseases in natural language text to balance handling
name variations with differentiating between concepts to achieve
good performance. Previous work on disease name
normalization (DNorm) typically uses a hybrid of lexical and
linguistic approaches (Islamaj Doğan and Lu, 2012b; Jimeno
et al., 2008; Kang et al., 2012). While string normalization techniques (e.g. case folding, stemming) do allow some generalization, the name variations in the lexicon always impose some
limitation. Machine learning may enable higher performance
by modeling the language that authors use to describe diseases
in text; however, there have been relatively few attempts to use
machine learning in normalization, and none for disease names.
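The limitation of lexicon-based matching described above can be illustrated with a minimal sketch. It is not the method of any system cited here: the lexicon entries, the placeholder concept identifiers and the helper names (`crude_stem`, `normalize`, `lookup`) are all assumptions made for illustration, and the plural stripping is only a rough stand-in for a real stemmer.

```python
import re

# Toy lexicon: concept name -> concept identifier.
# The identifiers are placeholders, not real MeSH or OMIM identifiers.
LEXICON_ENTRIES = {
    "Breast Neoplasms": "CONCEPT:1",
    "Breast Cancer": "CONCEPT:1",
    "Oculocerebrorenal Syndrome": "CONCEPT:2",
}


def crude_stem(token):
    """Strip a trailing plural 's'; a rough stand-in for a real stemmer."""
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(name):
    """Case-fold, drop punctuation and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(crude_stem(t) for t in tokens)


# Index the lexicon by normalized name so lookup is an exact string match.
LEXICON = {normalize(name): cid for name, cid in LEXICON_ENTRIES.items()}


def lookup(mention):
    """Return the concept identifier for a mention, or None if no name matches."""
    return LEXICON.get(normalize(mention))


if __name__ == "__main__":
    for mention in ["breast cancers", "BREAST NEOPLASM",
                    "cancer of the breast", "eye, brain and kidney syndrome"]:
        print(mention, "->", lookup(mention))
```

Case and plural variants resolve because they collapse to a name already in the lexicon, whereas a reordered mention or a multi-word paraphrase of a composed term finds no entry. This is the kind of gap that a similarity function learned from training data is intended to close.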
In this work we use the NCBI disease corpus (Islamaj Doğan
and Lu, 2012a), which has recently been updated to include
concept annotations (Islamaj Doğan et al., unpublished data),
to consider the task of disease normalization. We describe the
task as follows: given an abstract, return the set of disease concepts mentioned. Our current purpose is to support entity-specific semantic search of the biomedical literature (Lu, 2011)
and computer-assisted biocuration, especially document triage
(Kim et al., 2012).
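Because the task is defined at the abstract level, system output can be scored by comparing the predicted set of concepts with the annotated set for each abstract. The sketch below assumes the usual definitions of micro-averaging (pool the counts over all abstracts, then compute one F-measure) and macro-averaging (compute an F-measure per abstract, then take the mean). The function names, the dictionary layout and the choice to score an undefined precision or recall as 0.0 are assumptions for illustration, not conventions confirmed by this section.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted):
    """Score abstract-level concept sets.

    gold, predicted: dict mapping abstract id -> set of concept identifiers.
    Returns (micro_f, macro_f).
    """
    total_tp = total_fp = total_fn = 0
    per_abstract_f = []
    for doc_id, gold_set in gold.items():
        pred_set = predicted.get(doc_id, set())
        tp = len(gold_set & pred_set)   # concepts correctly returned
        fp = len(pred_set - gold_set)   # concepts returned but not annotated
        fn = len(gold_set - pred_set)   # annotated concepts that were missed
        total_tp += tp
        total_fp += fp
        total_fn += fn
        per_abstract_f.append(prf(tp, fp, fn)[2])
    micro_f = prf(total_tp, total_fp, total_fn)[2]
    macro_f = sum(per_abstract_f) / len(per_abstract_f) if per_abstract_f else 0.0
    return micro_f, macro_f


if __name__ == "__main__":
    # Hypothetical abstracts and placeholder concept identifiers.
    gold = {"a1": {"C1", "C2"}, "a2": {"C3"}, "a3": {"C5"}}
    predicted = {"a1": {"C1"}, "a2": {"C3", "C4"}, "a3": set()}
    micro_f, macro_f = evaluate(gold, predicted)
    print(f"micro F = {micro_f:.3f}, macro F = {macro_f:.3f}")
```

The two averages can diverge: micro-averaging weights every concept equally, so abstracts with many disease mentions dominate, while macro-averaging weights every abstract equally.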
© The Author 2013. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
1.1 Related work
Biomedical named entity recognition (NER) research has
received increased attention recently, partly owing to
BioCreative (Hirschman et al., 2005b) and BioNLP (Kim
et al., 2009) challenges on recognition of genes, proteins and
biological events in the scientific literature, as well as TREC
(Voorhees and Tong, 2011) and i2b2 (Uzuner et al., 2011) challenges on identification of drugs, diseases and medical tests in
electronic patient records.
The problem of concept normalization has seen substantial
work for genes and proteins, as a result of a series of tasks
that were part of the BioCreative competitions (Hirschman
et al., 2005a; Lu et al., 2011; Morgan et al., 2008). A variety of
methods including pattern matching, dictionary lookup, machine
learning and heuristic rules were described for the systems participating in these challenges. Articles have also discussed the
problem of abbreviation definition and expansion, rule-based
procedures to resolve conjunctions of gene names, lexical rules
to address (...truncated)