ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text
Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
Summary: ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.
Availability: ABNER is available as an executable Java archive and source code from http://www.cs.wisc.edu/bsettles/abner/
Contact:
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email:
1 INTRODUCTION
Interest in developing effective tools for natural language processing
(NLP) tasks in biomedical literature has been increasing in recent
years. The tasks offer scientific challenges (established NLP
techniques do not port easily to the biomedical domain), but there is also
a practical need to effectively curate, organize and retrieve
information automatically from textual sources. Named entity recognition,
the NLP task of identifying words and phrases belonging to certain
classes (e.g. protein and cell line), is an important first step for
many larger information management goals. The current state of the
art yields F1 scores of around 70 under exact boundary matching (Kim
et al., 2004; Yeh et al., 2004), but few systems with published results
in this range are freely available.
ABNER (A Biomedical Named Entity Recognizer) version 1.0
was released in July 2004 as a free, user-friendly interface to a
high-performing system developed for the NLPBA 2004 Shared Task
(Settles, 2004). Version 1.5 was released open source in March
2005 with some performance improvements and a customizable
application programming interface (API).
Fig. 1. A screenshot of ABNER's graphical user interface.
ABNER has an intuitive graphical user interface where text can be
typed in manually or loaded from a file and automatically tagged for
multiple named entities in real time. A screenshot of the interface is
shown in Figure 1. Each entity is highlighted with a unique color
(yellow = protein, green = DNA, etc.) for easy
visual reference, and tagged documents can be saved in a
variety of file formats. The software can also annotate plain text files
in batch mode. Users can pre-tokenize input text, or make use of
ABNER's built-in tokenization, which is quite robust to wrapped
lines and biomedical abbreviations. The bundled ABNER
application is platform-independent and has been tested on Linux, Windows
XP, Solaris and Mac OS X. The distribution includes two built-in
entity tagging modules that are trained and evaluated on the
standard NLPBA (Kim et al., 2004) and BioCreative¹ (Yeh et al., 2004)
corpora. Performance details for both modules are presented in
Section 4.
The Java API allows users to write custom interfaces to ABNER
modules or incorporate them into larger biomedical NLP systems.
The API also includes routines for training new modules on other
corpora. (This may be necessary for tasks that are organism-specific
or require tagging conventions not reflected by the built-in modules.)
The source code is also available under the terms of the Common
Public License.
¹ This was previously distributed as part of a command-line tool called YAGI
(Yet Another Gene Identifier), which has been deprecated.
ALGORITHMS AND IMPLEMENTATION
Conditional random fields (CRFs) are undirected statistical
graphical models, a special case of which corresponds to conditionally
trained finite-state machines well suited for labeling and segmenting
sequence data (Lafferty et al., 2001). Named entity recognition can
be framed as a sequence labeling problem: words in a sentence are
tokens to be assigned labels by states in the CRF framework.
Let $o = \langle o_1, o_2, \ldots, o_n \rangle$ be a sequence of observed words of
length $n$. Let $L$ be a set of labels (protein, DNA, other, etc.)
corresponding to states in a finite-state machine. Then $l = \langle l_1, l_2, \ldots, l_n \rangle$
is a sequence of labels from $L$ assigned to words in the input
sequence $o$. A first-order linear-chain CRF defines the conditional
probability of a label sequence given an input sequence to be:
$$P(l \mid o) = \frac{1}{Z_o} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{k} \lambda_j f_j(l_{i-1}, l_i, o, i) \right)$$
where $Z_o$ is a normalization factor over all possible label sequences,
$f_j$ is one of the $k$ binary functions describing a feature at position $i$
in sequence $o$ and $\lambda_j$ is a weight for that feature. For example, given
the text "... the ATPase ...", $f_j$ might be the feature Word=ATPase
and have value 1 along the transition where $l_{i-1}$ is the label state
other ("the" is a non-entity) and $l_i$ is the label state protein. Other
features with value 1 along this transition are Capitalized,
MixedCase and Suffix=ase. The learned weight $\lambda_j$ should be positive for
a feature correlated with the target label, negative for a feature that is
anti-correlated and near zero for a relatively uninformative feature.
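The definition above can be made concrete with a small sketch. The code below is an illustration only, not ABNER's implementation: the weights are invented rather than learned, the `start` label is a hypothetical placeholder for the state before the first token, and the normalization factor is computed by brute-force enumeration, which is feasible only for very short sequences.

```python
import itertools
import math
import re

LABELS = ("other", "protein")
START = "start"  # hypothetical dummy label preceding the first token

def predicates(tokens, i):
    """Observation tests that hold for the token at position i."""
    word = tokens[i]
    active = {"Word=" + word}
    if word[0].isupper():
        active.add("Capitalized")
    if re.search(r"[a-z]", word) and re.search(r"[A-Z]", word):
        active.add("MixedCase")
    if word.endswith("ase"):
        active.add("Suffix=ase")
    return active

# Illustrative weights lambda_j, invented for this sketch (not learned values).
# Each key is (observation test, previous label, current label).
WEIGHTS = {
    ("Word=the", "start", "other"): 2.0,
    ("Word=ATPase", "other", "protein"): 1.5,
    ("Capitalized", "other", "protein"): 0.5,
    ("MixedCase", "other", "protein"): 0.8,
    ("Suffix=ase", "other", "protein"): 1.2,
}

def score(labels, tokens):
    """Sum of lambda_j * f_j(l_{i-1}, l_i, o, i) over all positions and features."""
    total = 0.0
    prev = START
    for i, label in enumerate(labels):
        for p in predicates(tokens, i):
            total += WEIGHTS.get((p, prev, label), 0.0)
        prev = label
    return total

def probability(labels, tokens):
    """P(l | o) = exp(score) / Z_o, with Z_o summed over every label sequence."""
    z = sum(math.exp(score(seq, tokens))
            for seq in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens)) / z

tokens = ["the", "ATPase"]
for seq in itertools.product(LABELS, repeat=len(tokens)):
    print(seq, round(probability(seq, tokens), 3))
```

With these toy weights, the labeling (other, protein) receives most of the probability mass, because all four features fire on the other-to-protein transition at "ATPase". Practical CRF implementations compute $Z_o$ with dynamic programming (the forward algorithm) rather than enumeration.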
The weights are set to maximize the conditional log-likelihood of
$m$ labeled sequences in a training set $D = \{\langle o, l \rangle^{(1)}, \ldots, \langle o, l \rangle^{(m)}\}$:
$$LL(D) = \sum_{i=1}^{m} \log P(l^{(i)} \mid o^{(i)}) - \sum_{j=1}^{k} \frac{\lambda_j^2}{2\sigma^2}$$
where the second sum corresponds to a zero-mean Gaussian prior (with
variance $\sigma^2$) over feature weights, which helps to prevent
overfitting due to sparsity in $D$. If training sequences
are fully labeled, the optimization problem is convex and training is
guaranteed to converge to the global optimum. New sequences can then be
labeled with the Viterbi algorithm. For more details, see Lafferty et al. (2001).
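As an illustration of the decoding step, the following is a minimal sketch of Viterbi search over a first-order linear chain. It is not taken from ABNER or MALLET: the function names, the `start` placeholder label and the toy weights are all assumptions made for the example.

```python
def viterbi(n, labels, local_score, start="start"):
    """Return the highest-scoring label sequence of length n.

    local_score(prev_label, label, i) is the summed weight of the features
    that fire on the transition prev_label -> label at position i.
    """
    if n == 0:
        return []
    # best[i][label] = (score of the best path ending in label at i, backpointer)
    best = [{lab: (local_score(start, lab, 0), None) for lab in labels}]
    for i in range(1, n):
        column = {}
        for lab in labels:
            column[lab] = max(
                (best[i - 1][prev][0] + local_score(prev, lab, i), prev)
                for prev in labels
            )
        best.append(column)
    # Trace back from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy example with invented weights keyed by (word, previous label, label).
tokens = ["the", "ATPase", "binds"]
TOY_WEIGHTS = {
    ("the", "start", "other"): 2.0,
    ("ATPase", "other", "protein"): 3.0,
    ("binds", "protein", "other"): 1.0,
    ("binds", "other", "other"): 0.5,
}

def toy_score(prev, label, i):
    return TOY_WEIGHTS.get((tokens[i], prev, label), 0.0)

print(viterbi(len(tokens), ["other", "protein"], toy_score))
```

Because the normalization factor $Z_o$ is the same for every label sequence, the search only needs unnormalized scores. Its cost grows linearly with sentence length and quadratically with the number of labels.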
ABNER's default feature set comprises orthographic and
contextual features, mostly based on regular expressions and
neighboring tokens. The feature set is slightly modified from
previous work (Settles, 2004) for improved performance, and can be
viewed/modified in the source code distribution. Note that ABNER
currently does not use syntactic or semantic features. Research
indicates that such features can improve performance slightly, but
presently they are not dynamically generated by ABNER.
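To illustrate the kind of features described here, the sketch below derives orthographic features from regular expressions and contextual features from neighboring tokens. The pattern names beyond those mentioned in the text (HasDigit, HasDash, AllCaps), the three-character suffix rule and the `Word@offset` naming are hypothetical; ABNER's actual feature definitions are in its source code distribution.

```python
import re

# Hypothetical orthographic tests, chosen for illustration only.
PATTERNS = {
    "Capitalized": re.compile(r"^[A-Z]"),
    "MixedCase": re.compile(r"^(?=.*[a-z])(?=.*[A-Z])"),
    "HasDigit": re.compile(r"\d"),
    "HasDash": re.compile(r"-"),
    "AllCaps": re.compile(r"^[A-Z]+$"),
}

def token_features(tokens, i, window=1):
    """Return the names of the binary features that fire for tokens[i]."""
    word = tokens[i]
    feats = ["Word=" + word]
    # Orthographic features: one per regular expression that matches.
    feats += [name for name, pat in PATTERNS.items() if pat.search(word)]
    if len(word) > 3:
        feats.append("Suffix=" + word[-3:])
    # Contextual features: identities of neighboring tokens within the window.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.append(f"Word@{offset:+d}={tokens[j]}")
    return feats

print(token_features(["the", "ATPase", "binds"], 1))
```

Each returned name would correspond to one observation test that, combined with a label transition, forms a binary feature function $f_j$ in the CRF.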
The system is written entirely in Java using graphical window
objects from the Swing library. The CRF models are implemented
with the MALLET toolkit (http://mallet.cs.umass.edu/), which uses
a quasi-Newton method called L-BFGS (Nocedal and Wright, 1999)
to find the optimal feature weights efficiently. Tokenization is
performed by a deterministic finite-state scanner built with the JLex tool
(http://www.cs.princeton.edu/appel/modern/java/JLex/).
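A regular-expression tokenizer gives a rough idea of what a deterministic scanner does, although the rules below are an assumption for illustration and are not ABNER's JLex specification. The sketch treats any whitespace, including line wraps, as a separator and emits each punctuation character as its own token.

```python
import re

# Hypothetical token rule: a run of letters/digits, or one non-space symbol.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def tokenize(text):
    """Split text into tokens; newlines are treated like any other whitespace."""
    return TOKEN.findall(text)

print(tokenize("the IL-2\nreceptor (IL-2R)."))
```

Because the pattern never matches whitespace, a line break inside a sentence does not change the token sequence, which is one simple way to be insensitive to wrapped lines.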
The NLPBA corpus is a modified version of the GENIA
corpus (Kim et al., 2003), containing five entities labeled for 18 546
training sentences and 3856 evaluation sentences. The BioCreative
(S F1)
Recall, precision and F1 reflect exact boundary matching. S F1 is a soft F1 score
wh (...truncated)