Harmonization of gene/protein annotations: towards a gold standard MEDLINE
BIOINFORMATICS
ORIGINAL PAPER
Data and text mining
Vol. 28 no. 9 2012, pages 1253–1261
doi:10.1093/bioinformatics/bts125
Advance Access publication March 13, 2012
David Campos1,∗, Sérgio Matos1, Ian Lewin2, José Luís Oliveira1 and
Dietrich Rebholz-Schuhmann2,∗
1 University of Aveiro, IEETA/DETI, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal and
2 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Associate Editor: Jonathan Wren
annotation of large amounts of data has become a very demanding
and expensive task. This situation naturally led to the development
of computerized systems to perform these steps automatically.
The goal of information extraction (IE) is to extract structured
and unambiguous information from unstructured data (e.g. natural
language texts). Named entity recognition (NER) is a crucial initial
task of biomedical IE, which aims to extract chunks of text that
refer to specific entities of interest. It is one of the most important
tasks, as the identified entities are used as input to the subsequent
steps in the IE pipeline. However, gene and protein names have
several characteristics that make their identification in texts difficult
(Zhou et al., 2004).
Received on October 1, 2011; revised on March 7, 2012; accepted
on March 8, 2012
Various systems were developed using different approaches and
techniques, which can be categorized as being based on rules,
dictionaries or machine learning. However, the most recent results
clearly indicate that better performance can be achieved by using
an ensemble of NER systems. As an example, the top five systems
of the BioCreative II gene mention challenge (Smith et al., 2008)
used ensembles of NER solutions. In these systems, each approach
identifies entity mentions with different characteristics and based
on different knowledge. Moreover, most NER solutions are
trained and/or tested on only one corpus, which usually focuses
on a specific biomedical domain and provides specific gene/protein
names and contexts. As a consequence, when the system is applied
to a corpus from a different domain, the overall performance
drops significantly. This occurs with machine learning
approaches, but it also affects dictionary-based solutions, depending
on the specificity of the lexical resource used. This is not only
a consequence of the different domains, but also a result of the
different annotation guidelines and their interpretation by human
annotators. For instance, Colosimo et al. (2005) present a study
1 INTRODUCTION
In the last decades, we have witnessed an explosion of
publicly available data, a consequence of the deep integration
of computerized solutions in society. This rapid growth was
also observed in biomedicine, with an overwhelming amount of
data resulting from high-throughput methods, accompanied by
a corresponding increase in textual information. For instance,
MEDLINE contains over 18 million references to journal papers
covering various biomedical fields (e.g. medicine and dentistry).
MEDLINE and other biomedical resources are manually curated
by expert annotators, in order to correctly identify biological
entities (e.g. genes and proteins) and the relations between them
(e.g. protein–protein interactions) from texts. However, manual
∗ To whom correspondence should be addressed.
• many entity names are descriptive (e.g. ‘normal thymic
epithelial cells’);
• two or more entity names may share one head noun (e.g. ‘91
and 84 kDa proteins’ refers to ‘91 kDa protein’ and ‘84 kDa
protein’);
• one entity name may have several spelling forms (e.g. ‘N-acetylcysteine’,
‘N-acetyl-cysteine’ and ‘NAcetylCysteine’);
• ambiguous abbreviations are frequently used (e.g. ‘TCF’ may
refer to ‘T cell factor’ or to ‘Tissue Culture Fluid’).
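Two of the characteristics above, spelling variation and shared head nouns, can be illustrated with a minimal Python sketch. It is not taken from the paper or from any of the systems it discusses: the function names `normalize_variant` and `expand_shared_head` and the narrow regular expression are hypothetical choices made only for illustration.

```python
import re
from typing import List


def normalize_variant(name: str) -> str:
    """Drop hyphens and whitespace and lowercase the name, so that
    spelling variants of the same entity compare equal."""
    return re.sub(r"[\s\-]+", "", name).lower()


def expand_shared_head(mention: str) -> List[str]:
    """Expand '<num> and <num> <unit> <head>' into one name per number.

    Assumption: a deliberately narrow pattern covering only two numeric
    modifiers that share a unit and a (possibly plural) head noun.
    Mentions that do not match are returned unchanged.
    """
    match = re.fullmatch(r"(\d+) and (\d+) (\w+) (\w+?)s?", mention)
    if match is None:
        return [mention]
    first, second, unit, head = match.groups()
    return [f"{first} {unit} {head}", f"{second} {unit} {head}"]


VARIANTS = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]

if __name__ == "__main__":
    # All three spelling forms collapse to a single normalized key.
    print({normalize_variant(v) for v in VARIANTS})
    # The coordinated mention is split into two separate entity names.
    print(expand_shared_head("91 and 84 kDa proteins"))
```

Such surface rules only cover the simplest cases. Descriptive names and ambiguous abbreviations such as ‘TCF’ require contextual information that a normalization function of this kind does not have.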
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email:
ABSTRACT
Motivation: The recognition of named entities (NER) is an elementary
task in biomedical text mining. A number of NER solutions have been
proposed in recent years, taking advantage of available annotated
corpora, terminological resources and machine-learning techniques.
Currently, the best performing solutions combine the outputs from
selected annotation solutions measured against a single corpus.
However, little effort has been spent on a systematic analysis of
methods harmonizing the annotation results and measuring against
a combination of Gold Standard Corpora (GSCs).
Results: We present Totum, a machine learning solution that
harmonizes gene/protein annotations provided by heterogeneous
NER solutions. It has been optimized and measured against a
combination of manually curated GSCs. The experiments
show that our approach improves the F-measure of state-of-the-art
solutions by up to 10% (achieving ≈70%) in exact alignment and
22% (achieving ≈82%) in nested alignment. We demonstrate that
our solution delivers reliable annotation results across the GSCs and
is an important contribution towards a homogeneous annotation of
MEDLINE abstracts.
Availability and implementation: Totum is implemented in Java and
its resources are available at http://bioinformatics.ua.pt/totum
Supplementary information: Supplementary data are available at
Bioinformatics online.
2 BACKGROUND
Nowadays, the annotation of biomedical documents is mainly
performed manually by domain experts. Consequently, only small
sets of documents have been manually annotated and made publicly
available. The CALBC (Collaborative Annotation of a Large
Biomedical Corpus) project aims to mitigate this problem by
providing a large-scale biomedical text corpus automatically
annotated through the harmonization of several NER systems. This
large corpus will contain annotations of several biological semantic
groups, such as diseases, species, chemicals and genes/proteins
(Rebholz-Schuhmann et al., 2010).
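The general idea of harmonizing the outputs of several NER systems can be illustrated with simple majority voting over exact spans. The sketch below is an assumption-laden illustration only: it is neither the procedure used by CALBC nor the approach of Totum, and the function name `harmonize_by_vote`, the `(start, end)` span representation and the `min_votes` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A mention is a (start, end) pair of character offsets, end exclusive.
Span = Tuple[int, int]


def harmonize_by_vote(system_outputs: Iterable[Set[Span]],
                      min_votes: int = 2) -> List[Span]:
    """Keep the spans proposed by at least `min_votes` systems.

    Each system contributes a set of spans, so it can vote at most once
    per span. Agreement is counted on exact boundaries only.
    """
    votes = Counter(span for spans in system_outputs for span in spans)
    return sorted(span for span, count in votes.items() if count >= min_votes)


if __name__ == "__main__":
    # Hypothetical outputs of four systems on the same sentence.
    outputs = [
        {(0, 3), (10, 18)},
        {(0, 3), (12, 18)},
        {(0, 3), (10, 18)},
        {(25, 30)},
    ]
    print(harmonize_by_vote(outputs, min_votes=2))
```

Counting only exact boundary agreement discards partially overlapping mentions such as `(12, 18)` in the usage example; a more tolerant scheme would have to decide how to merge overlapping or nested spans.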
The CALBC corpus focuses on the immunology biomedical
sub-domain; its abstracts were collected from MEDLINE using
the query ‘immunol*’. To generate the first version of this corpus,
four different NER and normalization systems were used:
• System 1: implements a dictionary-based approach that takes
morphological variability into consideration. It uses several
publicly available resources, such as Swiss-Prot (Boutet et al.,
2007) and ChEBI (Degtyarenko et al., 2008);
• System 2: applies a dictionary-based approach using Entrez
Gene (Maglott et al., 2005), Swiss-Prot, Genew (Wain et al.,
2004), GDB (Letovsky et al., 1998) and OMIM (Hamosh
et al., 2005) as terminological resources;
• System 3: implements a machine learning approach u (...truncated)