DNorm: disease name normalization with pairwise learning to rank

Full text (PDF): https://bioinformatics.oxfordjournals.org/content/29/22/2909.full.pdf

Robert Leaman (0, 1), Rezarta Islamaj Doğan (1), Zhiyong Lu (1)

Associate Editor: Jonathan Wren

0 Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA
1 National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA

Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than for other normalization tasks in biomedical text mining research.

Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.

Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.

Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator

© The Author 2013. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1 INTRODUCTION

Diseases are central to many lines of biomedical research, and enabling access to disease information is the goal of many information extraction and text mining efforts (Islamaj Doğan and Lu, 2012b; Kang et al., 2012; Névéol et al., 2012; Wiegers et al., 2012). The task of disease normalization consists of finding disease mentions and assigning a unique identifier to each. This task is important in many lines of inquiry involving disease, including etiology (e.g. gene-disease relationships) and clinical aspects (e.g. diagnosis, prevention and treatment).

Disease may be defined broadly as any impairment of normal biological function (Hunter, 2009). Given the wide range of concepts that may thus be categorized as diseases (their respective etiologies, clinical presentations and their various histories of diagnosis and treatment), disease names naturally exhibit considerable variation. This variation presents not only in synonymous terms for the same disease, but also in the diverse logic used to create the disease names themselves. Disease names are often created by combining roots and affixes from Greek or Latin (e.g. hemochromatosis). A particularly flexible way to create disease names is to combine a disease category with a short descriptive modifier, which may take many forms, including anatomical locations (breast cancer), symptoms (cat-eye syndrome), treatment (Dopa-responsive dystonia), causative agent (staph infection), biomolecular etiology (G6PD deficiency), heredity (X-linked agammaglobulinemia) or eponyms (Schwartz-Jampel syndrome). Modifiers are also frequently used to provide description not part of the name (e.g. severe malaria).

When diseases are mentioned in text, they are frequently also abbreviated, exhibit morphological or orthographical variations, use different word orderings or use synonyms. These variations may involve more than single word substitutions. For example, because affixes are often composed, a single word (oculocerebrorenal) may correspond to multiple words (eye, brain and kidney) in another form. The disease normalization task is further complicated by the overlap between disease concepts, forcing systems that locate and normalize diseases in natural language text to balance handling name variations with differentiating between concepts to achieve good performance.

Previous works addressing disease name normalization (DNorm) typically use a hybrid of lexical and linguistic approaches (Islamaj Doğan and Lu, 2012b; Jimeno et al., 2008; Kang et al., 2012). While string normalization techniques (e.g. case folding, stemming) do allow some generalization, the name variations in the lexicon always impose some limitation. Machine learning may enable higher performance by modeling the language that authors use to describe diseases in text; however, there have been relatively few attempts to use machine learning in normalization, and none for disease names.

In this work we use the NCBI disease corpus (Islamaj Doğan and Lu, 2012a), which has recently been updated to include concept annotations (Islamaj Doğan et al., unpublished data), to consider the task of disease normalization. We describe the task as follows: given an abstract, return the set of disease concepts mentioned. Our current purpose is to support entity-specific semantic search of the biomedical literature (Lu, 2011) and computer-assisted biocuration, especially document triage (Kim et al., 2012).

In this article we introduce DNorm, the first machine learning method to normalize disease names in biomedical text.
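To make the idea of learning mention-name similarity by pairwise ranking concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions and not the DNorm implementation: the bag-of-words vectors, the bilinear score m^T W n, the identity initialization of W, the margin of 1, the learning rate, and all function names (build_vocab, vectorize, train, rank) are assumed here for the example, and the toy lexicon is invented.

```python
from collections import Counter


def build_vocab(texts):
    """Sorted list of all lowercased whitespace tokens seen in the texts."""
    return sorted({tok for t in texts for tok in t.lower().split()})


def vectorize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary (an assumption)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def identity(d):
    """d x d identity matrix: with this W the score is plain token overlap."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def score(W, m, n):
    """Bilinear similarity m^T W n between mention vector m and name vector n."""
    return sum(m[i] * W[i][j] * n[j]
               for i in range(len(m)) for j in range(len(n)))


def train(pairs, names, vocab, epochs=20, lr=0.1, margin=1.0):
    """Pairwise ranking: push the correct name above every other name.

    pairs is a list of (mention, correct_name); hyperparameters are
    illustrative assumptions, not values from the article.
    """
    d = len(vocab)
    W = identity(d)
    name_vecs = {n: vectorize(n, vocab) for n in names}
    for _ in range(epochs):
        for mention, pos in pairs:
            m = vectorize(mention, vocab)
            p = name_vecs[pos]
            for neg in names:
                if neg == pos:
                    continue
                q = name_vecs[neg]
                # Margin ranking loss: update only when the pair is misordered
                # or separated by less than the margin.
                if margin - score(W, m, p) + score(W, m, q) > 0:
                    for i in range(d):
                        if m[i] == 0.0:
                            continue
                        for j in range(d):
                            W[i][j] += lr * m[i] * (p[j] - q[j])
    return W


def rank(W, mention, names, vocab):
    """Concept names sorted from highest to lowest score for the mention."""
    m = vectorize(mention, vocab)
    return sorted(names, key=lambda n: score(W, m, vectorize(n, vocab)),
                  reverse=True)


if __name__ == "__main__":
    names = ["breast neoplasms", "breast diseases",
             "lung neoplasms", "lung diseases"]
    pairs = [("breast cancer", "breast neoplasms")]
    vocab = build_vocab(names + [m for m, _ in pairs])
    W = train(pairs, names, vocab)
    print(rank(W, "lung cancer", names, vocab))
```

In this toy setting, plain token overlap (W equal to the identity) cannot separate "lung neoplasms" from "lung diseases" for the mention "lung cancer", because only the token "lung" matches. After training on the single pair ("breast cancer", "breast neoplasms"), the weight linking "cancer" to "neoplasms" is positive and the weight linking "cancer" to "diseases" is negative, so "lung neoplasms" ranks first even though "lung cancer" never appeared in training.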
Our technique learns the similarity between mentions and concept names directly from the training data, thereby focusing on the candidate generation phase of normalization. Our technique can learn arbitrary mappings between mentions and names, including synonymy, polysemy and relationships that are not 1-to-1. Moreover, our method specifically handles abbreviations and word order variations. Our method is based on pairwise learning to rank (pLTR), which has been successfully applied to large optimization problems in information retrieval (Bai et al., 2010), but to the best of our knowledge has not previously been used for concept normalization.

Related work

Biomedical named entity recognition (NER) research has received increased attention recently, partly owing to BioCreative (Hirschman et al., 2005b) and BioNLP (Kim et al., 2009) challenges on recognition of genes, proteins and biological events in the scientific literature, as well as TREC (Voorhees and Tong, 2011) and i2b2 (Uzuner et al., 2011) challenges on identification of drugs, diseases and medical tests in electronic patient records. The problem of concept normalization has seen substantial work for genes and proteins, as a result of a series of tasks that were part of the BioCreative competitions (Hirschman et al., 20 (...truncated)