NOBLE – Flexible concept recognition for large-scale biomedical natural language processing

Tseytlin et al. BMC Bioinformatics (2016) 17:32
DOI 10.1186/s12859-015-0871-y

SOFTWARE Open Access

Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan and Rebecca S. Jacobson*

Abstract

Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system’s matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator.

Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE’s performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems.

Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.

Keywords: Natural language processing, Text-processing, Named Entity Recognition, Concept recognition, Biomedical terminologies, Auto-coding, System evaluation

* Correspondence: Department of Biomedical Informatics, University of Pittsburgh School of Medicine, The Offices at Baum, 5607 Baum Boulevard, BAUM 423, Rm 523, Pittsburgh, PA 15206-3701, USA

© 2016 Tseytlin et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Natural Language Processing (NLP) methods are increasingly used to accomplish information retrieval and information extraction in biomedical systems [1]. A critical component of NLP pipelines is the matching of terms in the text to concepts or entities in the controlled vocabulary or ontology. This task is best described as ‘Concept Recognition’ although the labels ‘Entity Mention Extraction’ and ‘Named Entity Recognition’ are sometimes also used, especially among clinical NLP researchers. Ideally, such concept recognition systems produce annotations of mentions where the annotated term in the text may be a synonym, abbreviation, lexical variant, or partial match of the concept in the controlled vocabulary, or of the entity in the ontology. For example, given the concept “atrial fibrillation” in the terminology, we expect a concept recognition component to annotate mentions for all four of the following phrases in a text passage: ‘auricular fibrillation’ (synonym), ‘a-fib’ (abbreviation), ‘atrial fibrillations’ (lexical variant), and potentially ‘fibrillation’ (partial match). Definitions of key terms used throughout this paper are provided in Table 1.

Table 1 Key terms and definitions
Term: Definition
Abbreviation: A shortened form of a word, name, or phrase.
Annotation: The tagging of words comprising a mention to assign them to a concept or text feature [41].
Auto coder: A computer-based system that automatically matches text terms to a code or concept.
Concept: A “cognitive construct” that is built on our perception or understanding of something [42]; delineates a specific entity embodying a particular meaning [43].
Controlled vocabulary: A vocabulary that reduces ambiguity and establishes relationships by linking each concept to a term and its synonyms [43, 44].
Entity: An “object of interest.” [41]; the referent in the semiotic triangle.
Gazetteer: A list or dictionary of entities [45].
Lexical variant: Different forms of the same term that occur due to variations in spelling, grammar, etc. [44].
Mention: One or more words and/or punctuation within a text which refer to a specific entity.
Named entity: A specific word or phrase referring to an object of interest [41, 46].
Ontology: A defined group of terms and their relationships to each other, within the context of a particular domain [47].
Semantic type: A logical category of related terms [48].
Stop word: A word of high frequency but limited information value (e.g. determiners) that is excluded from a vocabulary to improve results of a subsequent task [49].
Synonym: A term with the same meaning as another term; terms that describe the same concept [48, 50].
Term: One or more words including punctuation that represent a concept; there may be multiple terms associated with one concept [42, 49].
Terminology: A catalog of terms related to a specific domain [42]. Subsumes a variety of formalisms such as lexicons and ontologies [43].
Vocabulary: A terminology where the terms and concepts are defined [42, 44].
Word: A linguistic unit that has a definable meaning and/or function [51].

Two general approaches have been used for biomedical concept recognition [2]. Term-to-concept matching algorithms (previously called ‘auto coders’) are general-purpose algorithms for matching parsed text to a terminology. They typically require expert selection of vocabularies, semantic types, stop word lists, and other gazetteers, but they do not require training data produced through human annotation. In contrast, machine learning NLP methods are used to produce annotators for specific well-defined purposes such as annotating drug mentions [3, 4] an (...truncated)