ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text

Article PDF: https://bioinformatics.oxfordjournals.org/content/21/14/3191.full.pdf

Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA

Summary: ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.

Availability: ABNER is available as an executable Java archive and source code from http://www.cs.wisc.edu/bsettles/abner/

© The Author 2005. Published by Oxford University Press. All rights reserved.

1 INTRODUCTION

Interest in developing effective tools for natural language processing (NLP) tasks in biomedical literature has been increasing in recent years. The tasks offer scientific challenges (established NLP techniques do not port easily to the biomedical domain), but there is also a practical need to effectively curate, organize and retrieve information automatically from textual sources. Named entity recognition, the NLP task of identifying words and phrases belonging to certain classes (e.g. protein and cell line), is an important first step for many larger information management goals.

The current state of the art yields F1 scores with exact boundary matching around 70 (Kim et al., 2004; Yeh et al., 2004), but few systems with published results in this range are freely available. ABNER (A Biomedical Named Entity Recognizer) version 1.0 was released in July 2004 as a free, user-friendly interface to a high-performing system developed for the NLPBA 2004 Shared Task (Settles, 2004). Version 1.5 was released open source in March 2005 with some performance improvements and a customizable application programming interface (API).

ABNER has an intuitive graphical user interface where text can be typed in manually or loaded from a file and automatically tagged for multiple named entities in real time. A screenshot of the interface is shown in Figure 1. Each entity is highlighted with a unique color (yellow = protein, green = DNA, etc.) for easy visual reference, and tagged documents can be saved in a variety of file formats. The software can also annotate plain text files in batch mode. Users can pre-tokenize input text, or make use of ABNER's built-in tokenization, which is quite robust to wrapped lines and biomedical abbreviations. The bundled ABNER application is platform-independent and has been tested on Linux, Windows XP, Solaris and Mac OS X.

Fig. 1. A screenshot of ABNER's graphical user interface.

The distribution includes two built-in entity tagging modules that are trained and evaluated on the standard NLPBA (Kim et al., 2004) and BioCreative¹ (Yeh et al., 2004) corpora. Performance details for both modules are presented in Section 4.

The Java API allows users to write custom interfaces to ABNER modules or incorporate them into larger biomedical NLP systems. The API also includes routines for training new modules on other corpora. (This may be necessary for tasks that are organism-specific or require tagging conventions not reflected by the built-in modules.) The source code is also available under the terms of the Common Public License.

¹ This was previously distributed as part of a command-line tool called YAGI (Yet Another Gene Identifier), which has been deprecated.
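The exact-boundary F1 measure cited in the introduction can be illustrated with a small sketch. This is not the official NLPBA or BioCreative scoring script: the span representation `(start, end, label)` and the function name `exact_match_f1` are assumptions made for illustration. A predicted entity counts as correct only if its boundaries and label both match a gold entity exactly.

```python
def exact_match_f1(gold, predicted):
    """Precision, recall and F1 under exact boundary matching.

    gold, predicted: iterables of (start, end, label) spans.
    (Hypothetical representation, chosen for this sketch.)
    """
    gold, predicted = set(gold), set(predicted)
    # A prediction is a true positive only if start, end and label all match.
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: the DNA span is off by one token, so it counts as an error
# for both precision and recall under exact matching.
gold = {(0, 2, "protein"), (5, 6, "DNA"), (9, 11, "cell_line")}
pred = {(0, 2, "protein"), (5, 7, "DNA"), (9, 11, "cell_line"), (12, 13, "protein")}
precision, recall, f1 = exact_match_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The off-by-one DNA span shows why exact matching is a strict criterion: a nearly correct boundary is penalized twice, once as a false positive and once as a false negative.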
ALGORITHMS AND IMPLEMENTATION

Conditional random fields (CRFs) are undirected statistical graphical models, a special case of which corresponds to conditionally trained finite-state machines well suited for labeling and segmenting sequence data (Lafferty et al., 2001). Named entity recognition can be framed as a sequence labeling problem: words in a sentence are tokens to be assigned labels by states in the CRF framework.

Let $o = \langle o_1, o_2, \ldots, o_n \rangle$ be a sequence of observed words of length $n$. Let $L$ be a set of labels (protein, DNA, other, etc.) corresponding to states in a finite-state machine. Then $l = \langle l_1, l_2, \ldots, l_n \rangle$ is a sequence of labels from $L$ assigned to words in the input sequence $o$. A first-order linear-chain CRF defines the conditional probability of a label sequence given an input sequence to be:

$$P(l \mid o) = \frac{1}{Z_o} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{k} \lambda_j f_j(l_{i-1}, l_i, o, i) \right)$$

where $Z_o$ is a normalization factor over all possible label sequences, $f_j$ is one of the $k$ binary functions describing a feature at position $i$ in sequence $o$ and $\lambda_j$ is a weight for that feature. For example, given the text "... the ATPase ...", $f_j$ might be the feature Word=ATPase and have value 1 along the transition where $l_{i-1}$ is the label state other ("the" is a non-entity) and $l_i$ is the label state protein. Other features with value 1 along this transition are Capitalized, MixedCase and Suffix=ase. The learned weight $\lambda_j$ should be positive for a feature correlated with the target label, negative for a feature that is anti-correlated and near zero for a relatively uninformative feature.

The weights are set to maximize the conditional log-likelihood of $m$ labeled sequences in a training set $D = \{ \langle o, l \rangle^{(1)}, \ldots, \langle o, l \rangle^{(m)} \}$:

$$LL(D) = \sum_{i=1}^{m} \log P(l^{(i)} \mid o^{(i)}) - \sum_{j=1}^{k} \frac{\lambda_j^2}{2\sigma^2}$$

where the second sum is a Gaussian prior (with variance $\sigma^2$) over feature weights to help prevent overfitting due to sparsity in $D$. If training sequences are fully labeled, the optimization problem is convex and training is guaranteed to converge to a global optimum. New sequences can then be labeled with the Viterbi algorithm.
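The model and decoding step described above can be sketched on the "the ATPase" example. This is a toy illustration, not ABNER's or MALLET's implementation: the weights in `WEIGHTS`, the two-label set, the `START` pseudo-label for the first position and all function names are assumptions. $Z_o$ is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math
import re

LABELS = ["other", "protein"]   # reduced label set, for illustration only
START = "START"                 # assumed pseudo-label preceding the first word


def observation_features(words, i):
    """Names of the binary orthographic features that fire for words[i]."""
    word = words[i]
    feats = {f"Word={word}"}
    if re.match(r"[A-Z]", word):
        feats.add("Capitalized")
    if re.search(r"[a-z]", word) and re.search(r"[A-Z]", word):
        feats.add("MixedCase")
    if word.endswith("ase"):
        feats.add("Suffix=ase")
    return feats


# Hypothetical weights lambda_j, keyed by (l_{i-1}, l_i, observation feature).
# Features without an entry implicitly have weight 0.
WEIGHTS = {
    (START, "other", "Word=the"): 1.0,
    ("other", "protein", "Word=ATPase"): 1.5,
    ("other", "protein", "Capitalized"): 0.5,
    ("other", "protein", "MixedCase"): 0.8,
    ("other", "protein", "Suffix=ase"): 1.0,
}


def transition_score(prev, label, words, i):
    """sum_j lambda_j * f_j(l_{i-1}, l_i, o, i) for a single position i."""
    return sum(WEIGHTS.get((prev, label, f), 0.0)
               for f in observation_features(words, i))


def sequence_score(labels, words):
    """Unnormalized log-score of a complete label sequence."""
    prev, total = START, 0.0
    for i, label in enumerate(labels):
        total += transition_score(prev, label, words, i)
        prev = label
    return total


def probability(labels, words):
    """P(l | o), with Z_o obtained by enumerating every label sequence."""
    z = sum(math.exp(sequence_score(seq, words))
            for seq in itertools.product(LABELS, repeat=len(words)))
    return math.exp(sequence_score(labels, words)) / z


def viterbi(words):
    """Highest-scoring label sequence, found by dynamic programming."""
    if not words:
        return []
    # best[i][label] = (score of best path ending in label at i, backpointer)
    best = [{lab: (transition_score(START, lab, words, 0), None)
             for lab in LABELS}]
    for i in range(1, len(words)):
        best.append({
            lab: max((best[i - 1][prev][0]
                      + transition_score(prev, lab, words, i), prev)
                     for prev in LABELS)
            for lab in LABELS
        })
    # Follow backpointers from the best final label.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for i in range(len(words) - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return path[::-1]


words = ["the", "ATPase"]
print(viterbi(words), probability(viterbi(words), words))
```

Viterbi decoding never needs $Z_o$, because the normalizer is the same for every label sequence of a given input; only training and probability estimates require it, and practical implementations compute it with the forward algorithm rather than by enumeration.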
For more details, see Lafferty et al. (2001).

ABNER's default feature set comprises orthographic and contextual features, mostly based on regular expressions and neighboring tokens. The feature set is slightly modified from previous work (Settles, 2004) for improved performance, and can be viewed or modified in the source code distribution. Note that ABNER currently does not use syntactic or semantic features. Research indicates that such features can improve performance slightly, but presently they are not dynamically generated by ABNER.

The system is written entirely in Java using graphical window objects from the Swing library. The CRF models are implemented with the MALLET toolkit (http://mallet.cs.umass.edu/), which uses a quasi-Newton method called L-BFGS (Nocedal and Wright, 1999) to find the optimal feature weights efficiently. Tokenization is performed by a deterministic finite-state scanner built with the JLex tool (http://www.cs.princeton.edu/appel/modern/java/JLex/).

The NLPBA corpus is a modified version of the GENIA corpus (Kim et al., 2003), containing five entities labeled for 18,546 training sentences and 3,856 evaluation sentences. The BioCreative (S F1) Recall, precision and F1 reflect exact boundary matching. S F1 is a soft F1 score wh (...truncated)