DNorm: disease name normalization with pairwise learning to rank

BIOINFORMATICS ORIGINAL PAPER
Data and text mining
Vol. 29 no. 22 2013, pages 2909–2917
doi:10.1093/bioinformatics/btt474
Advance Access publication August 21, 2013

Robert Leaman1,2, Rezarta Islamaj Doğan1 and Zhiyong Lu1,*
1National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA and 2Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA
*To whom correspondence should be addressed.

Associate Editor: Jonathan Wren

Received on March 13, 2013; revised on August 8, 2013; accepted on August 9, 2013

ABSTRACT

Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text, the task of disease name normalization (DNorm), than there have been for other normalization tasks in biomedical text mining research.

Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.

Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.

Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator

1 INTRODUCTION

Diseases are central to many lines of biomedical research, and enabling access to disease information is the goal of many information extraction and text mining efforts (Islamaj Doğan and Lu, 2012b; Kang et al., 2012; Névéol et al., 2012; Wiegers et al., 2012). The task of disease normalization consists of finding disease mentions and assigning a unique identifier to each. This task is important in many lines of inquiry involving disease, including etiology (e.g. gene–disease relationships) and clinical aspects (e.g. diagnosis, prevention and treatment).

Disease may be defined broadly as ‘any impairment of normal biological function’ (Hunter, 2009). Given the wide range of concepts that may thus be categorized as diseases, with their respective etiologies, clinical presentations and various histories of diagnosis and treatment, disease names naturally exhibit considerable variation. This variation appears not only in synonymous terms for the same disease, but also in the diverse logic used to create the disease names themselves. Disease names are often created by combining roots and affixes from Greek or Latin (e.g. ‘hemochromatosis’). A particularly flexible way to create disease names is to combine a disease category with a short descriptive modifier, which may take many forms, including anatomical locations (‘breast cancer’), symptoms (‘cat-eye syndrome’), treatment (‘Dopa-responsive dystonia’), causative agent (‘staph infection’), biomolecular etiology (‘G6PD deficiency’), heredity (‘X-linked agammaglobulinemia’) or eponyms (‘Schwartz-Jampel syndrome’). Modifiers are also frequently used to provide description that is not part of the name (e.g. ‘severe malaria’).
When diseases are mentioned in text, their names are also frequently abbreviated, exhibit morphological or orthographic variations, use different word orderings or use synonyms. These variations may involve more than single word substitutions. For example, because affixes are often composed, a single word (‘oculocerebrorenal’) may correspond to multiple words (‘eye, brain and kidney’) in another form. The disease normalization task is further complicated by the overlap between disease concepts. To achieve good performance, systems that locate and normalize diseases in natural language text must balance handling name variations against differentiating between concepts.

Previous work addressing disease name normalization (DNorm) typically uses a hybrid of lexical and linguistic approaches (Islamaj Doğan and Lu, 2012b; Jimeno et al., 2008; Kang et al., 2012). While string normalization techniques (e.g. case folding, stemming) do allow some generalization, the name variations listed in the lexicon always impose some limitation. Machine learning may enable higher performance by modeling the language that authors use to describe diseases in text; however, there have been relatively few attempts to use machine learning in normalization, and none for disease names.

In this work we use the NCBI disease corpus (Islamaj Doğan and Lu, 2012a), which has recently been updated to include concept annotations (Islamaj Doğan et al., unpublished data), to consider the task of disease normalization. We describe the task as follows: given an abstract, return the set of disease concepts mentioned. Our current purpose is to support entity-specific semantic search of the biomedical literature (Lu, 2011) and computer-assisted biocuration, especially document triage (Kim et al., 2012).

© The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1.1 Related work

Biomedical named entity recognition (NER) research has received increased attention recently, partly owing to the BioCreative (Hirschman et al., 2005b) and BioNLP (Kim et al., 2009) challenges on recognition of genes, proteins and biological events in the scientific literature, as well as the TREC (Voorhees and Tong, 2011) and i2b2 (Uzuner et al., 2011) challenges on identification of drugs, diseases and medical tests in electronic patient records. The problem of concept normalization has seen substantial work for genes and proteins, as a result of a series of tasks that were part of the BioCreative competitions (Hirschman et al., 2005a; Lu et al., 2011; Morgan et al., 2008). A variety of methods including pattern matching, dictionary lookup, machine learning and heuristic rules were described for the systems participating in these challenges. Articles have also discussed the problem of abbreviation definition and expansion, rule-based procedures to resolve conjunctions of gene names, lexical rules to address (...truncated)