NOBLE – Flexible concept recognition for large-scale biomedical natural language processing
Tseytlin et al. BMC Bioinformatics (2016) 17:32
DOI 10.1186/s12859-015-0871-y
SOFTWARE
Open Access
Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan and Rebecca S. Jacobson*
Abstract
Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis,
knowledge engineering, and decision support. Concept recognition is an important component task for NLP
pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose
concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly
used alternatives on both a biological and a clinical corpus.
Methods: NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The
system’s matching options can be configured individually or in combination to yield specific system behavior for a
variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We
benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and
compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary
Lookup Annotator.
Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm,
configurable matching strategies, and multiple terminology input formats. These features provide unique functionality
when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE’s
performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error
analysis revealed differences in error profiles among systems.
Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of
accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of
configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching
system suitable for general concept recognition in biomedical NLP pipelines.
Keywords: Natural language processing, Text-processing, Named Entity Recognition, Concept recognition,
Biomedical terminologies, Auto-coding, System evaluation
* Correspondence:
Department of Biomedical Informatics, University of Pittsburgh School of
Medicine, The Offices at Baum, 5607 Baum Boulevard, BAUM 423, Rm 523,
Pittsburgh, PA 15206-3701, USA
© 2016 Tseytlin et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
Natural Language Processing (NLP) methods are increasingly used to accomplish information retrieval and
information extraction in biomedical systems [1]. A critical component of NLP pipelines is the matching of
terms in the text to concepts or entities in the controlled
vocabulary or ontology. This task is best described as
‘Concept Recognition’ although the labels ‘Entity
Mention Extraction’ and ‘Named Entity Recognition’
are sometimes also used, especially among clinical
NLP researchers. Ideally, such concept recognition
systems produce annotations of mentions where the
annotated term in the text may be a synonym, abbreviation, lexical variant, or partial match of the concept in the controlled vocabulary, or of the entity in
the ontology. For example, given the concept “atrial
fibrillation” in the terminology, we expect a concept
recognition component to annotate mentions for all
four of the following phrases in a text passage: ‘auricular
fibrillation’ (synonym), ‘a-fib’ (abbreviation), ‘atrial fibrillations’ (lexical variant), and potentially ‘fibrillation’ (partial
match). Definitions of key terms used throughout this
paper are provided in Table 1.
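The four kinds of match in the atrial fibrillation example can be made concrete with a small sketch. The code below is only an illustration under stated assumptions and is not the NOBLE Coder algorithm: the concept identifier `C_AFIB`, the toy terminology, the plural-stripping `normalize` helper, and the `classify_mention` function are all hypothetical names introduced here.

```python
# Toy terminology with a single concept. The identifier "C_AFIB" is a
# made-up placeholder, not a code from any real vocabulary.
TERMINOLOGY = {
    "C_AFIB": {
        "preferred": "atrial fibrillation",
        "synonyms": {"auricular fibrillation"},
        "abbreviations": {"a-fib"},
    }
}


def normalize(word):
    """Crude lexical normalization: lowercase and drop a trailing plural 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def classify_mention(phrase):
    """Return (concept_id, match_type) for a phrase, or None if nothing matches."""
    lowered = phrase.lower().strip()
    words = [normalize(w) for w in lowered.split()]
    for concept_id, entry in TERMINOLOGY.items():
        preferred_words = [normalize(w) for w in entry["preferred"].split()]
        if lowered == entry["preferred"]:
            return concept_id, "exact"
        if lowered in entry["synonyms"]:
            return concept_id, "synonym"
        if lowered in entry["abbreviations"]:
            return concept_id, "abbreviation"
        if words == preferred_words:
            return concept_id, "lexical variant"
        # A non-empty proper subset of the preferred term's words.
        if words and set(words) < set(preferred_words):
            return concept_id, "partial match"
    return None


for phrase in ["auricular fibrillation", "a-fib", "atrial fibrillations", "fibrillation"]:
    print(phrase, "->", classify_mention(phrase))
```

In this sketch each phrase resolves to the same concept but with a different match type, which is the behavior the example in the text describes. A real system would rely on a proper stemmer and a full terminology rather than the single hard-coded entry assumed here.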
Table 1 Key terms and definitions
Term | Definition
Abbreviation | A shortened form of a word, name, or phrase.
Annotation | The tagging of words comprising a mention to assign them to a concept or text feature [41].
Auto coder | A computer-based system that automatically matches text terms to a code or concept.
Concept | A “cognitive construct” that is built on our perception or understanding of something [42]; delineates a specific entity embodying a particular meaning [43].
Controlled vocabulary | A vocabulary that reduces ambiguity and establishes relationships by linking each concept to a term and its synonyms [43, 44].
Entity | An “object of interest” [41]; the referent in the semiotic triangle.
Gazetteer | A list or dictionary of entities [45].
Lexical variant | Different forms of the same term that occur due to variations in spelling, grammar, etc. [44].
Mention | One or more words and/or punctuation within a text which refer to a specific entity.
Named entity | A specific word or phrase referring to an object of interest [41, 46].
Ontology | A defined group of terms and their relationships to each other, within the context of a particular domain [47].
Semantic type | A logical category of related terms [48].
Stop word | A word of high frequency but limited information value (e.g., determiners) that is excluded from a vocabulary to improve results of a subsequent task [49].
Synonym | A term with the same meaning as another term; terms that describe the same concept [48, 50].
Term | One or more words including punctuation that represent a concept; there may be multiple terms associated with one concept [42, 49].
Terminology | A catalog of terms related to a specific domain [42]. Subsumes a variety of formalisms such as lexicons and ontologies [43].
Vocabulary | A terminology where the terms and concepts are defined [42, 44].
Word | A linguistic unit that has a definable meaning and/or function [51].
Two general approaches have been used for biomedical concept recognition [2]. Term-to-concept matching
algorithms (previously called ‘auto coders’) are general-purpose algorithms for matching parsed text to a terminology. They typically require expert selection of
vocabularies, semantic types, stop word lists, and other
gazetteers, but they do not require training data produced through human annotation.
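To illustrate why term-to-concept matching needs curated resources but no annotated training data, the following minimal sketch performs a generic longest-match dictionary lookup. It is an assumption-laden illustration, not the algorithm of NOBLE Coder or of any system compared in this paper: the stop word list, the vocabulary entries, the concept identifiers, and the `annotate` function are all hypothetical.

```python
import re

# Hypothetical, hand-curated resources; no annotated training data is involved.
STOP_WORDS = {"the", "of", "a", "an", "with", "and"}
VOCABULARY = {
    ("atrial", "fibrillation"): "C_AFIB",
    ("heart", "failure"): "C_HF",
    ("fibrillation",): "C_FIB",
}
MAX_TERM_LEN = max(len(term) for term in VOCABULARY)


def annotate(text):
    """Return (matched term, concept id) pairs using longest-match lookup."""
    # Tokenize, lowercase, and drop stop words before matching.
    tokens = [t for t in re.findall(r"[a-z0-9-]+", text.lower())
              if t not in STOP_WORDS]
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, then shrink it.
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in VOCABULARY:
                annotations.append((" ".join(key), VOCABULARY[key]))
                i += n
                break
        else:
            i += 1  # No term starts at this token; move on.
    return annotations


print(annotate("History of atrial fibrillation and heart failure."))
```

All of the system's knowledge in this sketch lives in the two hand-built resources, so adapting it to a new domain means editing the vocabulary and stop word list rather than annotating a corpus.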
In contrast, machine learning NLP methods are used
to produce annotators for specific well-defined purposes
such as annotating drug mentions [3, 4] an (...truncated)