TaggerOne: joint named entity recognition and normalization with semi-Markov Models

Bioinformatics, 32(18), 2016, 2839–2846
doi: 10.1093/bioinformatics/btw343
Advance Access Publication Date: 9 June 2016
Original Paper

Data and text mining

Robert Leaman and Zhiyong Lu*

National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA

*To whom correspondence should be addressed.

Associate Editor: Jonathan Wren

Received on February 16, 2016; revised on May 2, 2016; accepted on May 26, 2016

Abstract

Motivation: Text mining is increasingly used to manage the accelerating pace of the biomedical literature. Many text mining applications depend on accurate named entity recognition (NER) and normalization (grounding). While high performing machine learning methods trainable for many entity types exist for NER, normalization methods are usually specialized to a single entity type. NER and normalization systems are also typically used in a serial pipeline, causing cascading errors and limiting the ability of the NER system to directly exploit the lexical information provided by the normalization.

Methods: We propose the first machine learning model for joint NER and normalization during both training and prediction. The model is trainable for arbitrary entity types and consists of a semi-Markov structured linear classifier, with a rich feature approach for NER and supervised semantic indexing for normalization. We also introduce TaggerOne, a Java implementation of our model as a general toolkit for joint NER and normalization. TaggerOne is not specific to any entity type, requiring only annotated training data and a corresponding lexicon, and has been optimized for high throughput.

Results: We validated TaggerOne with multiple gold-standard corpora containing both mention- and concept-level annotations.
Benchmarking results show that TaggerOne achieves high performance on diseases (NCBI Disease corpus, NER f-score: 0.829, normalization f-score: 0.807) and chemicals (BioCreative 5 CDR corpus, NER f-score: 0.914, normalization f-score: 0.895). These results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. We conclude that jointly modeling NER and normalization greatly improves performance.

Availability and Implementation: The TaggerOne source code and an online demonstration are available at: http://www.ncbi.nlm.nih.gov/bionlp/taggerone

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Many tasks in biomedical information extraction rely on accurate named entity recognition (NER), the identification of text spans mentioning a concept of a specific class, such as disease or chemical. Recent research has demonstrated that a particular NER approach, namely conditional random fields with a rich feature set, consistently achieves high performance on a variety of NER tasks when provided with an appropriate training corpus and a relatively small investment in feature engineering. This approach has been used to identify a wide variety of entities, including genes and proteins (Leaman and Gonzalez, 2008; Wei et al., 2015a), diseases (Chowdhury and Lavelli, 2010; Leaman et al., 2013), chemicals (Leaman et al., 2015b; Rocktäschel et al., 2012) and anatomic entities (Pyysalo and Ananiadou, 2014). However many end-user tasks

Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.

1.1 Related work

Named entity recognition (NER) and normalization have long been recognized as important tasks within biomedical text mining. Both tasks have been the subject of community challenges (Hirschman et al., 2005; Kim et al., 2009; Krallinger et al., 2015a,b; Morgan et al., 2008).

The development of NER and normalization systems for diseases lagged behind genes and proteins for some time, primarily due to the lack of annotated corpora. Jimeno et al. (2008) created a corpus of sentences that was expanded by Leaman et al. (2009); this was further expanded to become the NCBI Disease Corpus (Doğan et al., 2014). Diseases were also included in the set of entities annotated in the CALBC silver standard corpus (Rebholz-Schuhmann et al., 2010). Several rule- or dictionary-based systems have used these disease corpora for evaluation of NER (Campos et al., 2013; Song et al., 2015) or normalization (Kang et al., 2012). Our previous work, DNorm, demonstrated significantly higher normalization performance when using a machine learning model (supervised semantic indexing) trained with pairwise learning to rank (Leaman et al., 2013). Most recently, the Chemical Disease Relation task at the BioCreative V community challenge included disease normalization as a subtask (Li et al., 2015; Wei et al., 2015a,c).

Fig. 1. Example text with chemical and disease entity annotations, adapted from PMID 7420681. The outer boxes specify the annotated term and MeSH identifier.

The development of chemical NER and normalization systems was initially enabled by rigorous standards for the chemical nomenclature. The OSCAR system normalizes many varieties of chemical mentions, and is intended for mining chemistry publications (Jessop et al., 2011). Kolarik et al. (2008) created the SCAI corpus of chemical mentions; Klinger et al. (2008) used this to train and evaluate a machine learning approach for chemical NER. Rocktäschel et al. (2012) expanded the machine learning approach with extensive lexical resources. Chemicals were also included in the CALBC silver standard corpus (Rebholz-Schuhmann et al., 2010).
The CHEMDNER task at BioCreative IV addressed chemical NER, releasing a large corpus of chemical mentions in PubMed abstracts (Krallinger et al., 2015a), where our submission tmChem achieved the highest performance out of 27 teams (Leaman et al., 2015b). The CHEMDNER task at BioCreative V also addressed chemical NER, but changed the domain to patents (Krallinger et al., 2015b). Two recent surveys of the field are Vazquez et al. (2011) and Eltyeb and Salim (2014).

Our method builds successfully on previous work in NER and normalization. Cohen and Sarawagi (2004) were the first to apply semi-Markov models to NER, motivated by a need to integrate soft-match dictionary features. Okanohara et al. (2006) later applied semi-Markov models to the biomedical domain. Tsuruoka et al. (2007) described a method for learning term variation, trained directly from a lexicon using similarity measures as features. DNorm instead learned the similarity between individual tokens directly from training data (Leaman et al., 2013).

The advantage of joint learning has been demonstrated for many tasks. For example, Finkel and Manning (2009) learned a joint model for parsing and NER in newswire text, while Durrett and Klein (2014) learned a model for joint coreference resolution, named entity classification and entity linking (disambiguation) wh (...truncated)