Harmonization of gene/protein annotations: towards a gold standard MEDLINE

Copyedited by: TRJ. Manuscript category: Original Paper.

BIOINFORMATICS, ORIGINAL PAPER. Data and text mining. Vol. 28 no. 9 2012, pages 1253–1261. doi:10.1093/bioinformatics/bts125. Advance Access publication March 13, 2012.

David Campos1,∗, Sérgio Matos1, Ian Lewin2, José Luís Oliveira1 and Dietrich Rebholz-Schuhmann2,∗

1 University of Aveiro, IEETA/DETI, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal and 2 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

∗ To whom correspondence should be addressed.

Associate Editor: Jonathan Wren

ABSTRACT

Motivation: The recognition of named entities (NER) is an elementary task in biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of available annotated corpora, terminological resources and machine-learning techniques. Currently, the best performing solutions combine the outputs of selected annotation solutions and are measured against a single corpus. However, little effort has been spent on a systematic analysis of methods that harmonize the annotation results and measure them against a combination of Gold Standard Corpora (GSCs).

Results: We present Totum, a machine learning solution that harmonizes gene/protein annotations provided by heterogeneous NER solutions. It has been optimized and measured against a combination of manually curated GSCs. Our experiments show that this approach improves the F-measure of state-of-the-art solutions by up to 10% (achieving ≈70%) in exact alignment and 22% (achieving ≈82%) in nested alignment. We demonstrate that our solution delivers reliable annotation results across the GSCs and that it is an important contribution towards a homogeneous annotation of MEDLINE abstracts.

Availability and implementation: Totum is implemented in Java and its resources are available at http://bioinformatics.ua.pt/totum

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on October 1, 2011; revised on March 7, 2012; accepted on March 8, 2012

© The Author 2012. Published by Oxford University Press. All rights reserved.

1 INTRODUCTION

In the last decades, we have witnessed an explosion of publicly available data, a consequence of the deep integration of computerized solutions in society. This rapid growth was also observed in biomedicine, with an overwhelming amount of data resulting from high-throughput methods, accompanied by a corresponding increase in textual information. For instance, MEDLINE contains over 18 million references to journal papers covering various biomedical fields (e.g. medicine and dentistry). MEDLINE and other biomedical resources are manually curated by expert annotators, in order to correctly identify biological entities (e.g. genes and proteins) and the relations between them (e.g. protein–protein interactions) in texts. However, manual annotation of large amounts of data has become a very demanding and expensive task. This situation naturally led to the development of computerized systems that perform these steps automatically.

The goal of information extraction (IE) is to extract structured and unambiguous information from unstructured data (e.g. natural language texts). Named entity recognition (NER) is a crucial initial task of biomedical IE, which aims to extract chunks of text that refer to specific entities of interest. It is one of the most important tasks, as the identified entities are used as input to the subsequent steps of the IE pipeline. However, gene and protein names have several characteristics that make them difficult to identify in texts (Zhou et al., 2004):

• many entity names are descriptive (e.g. ‘normal thymic epithelial cells’);
• two or more entity names may share one head noun (e.g. ‘91 and 84 kDa proteins’ refers to ‘91 kDa protein’ and ‘84 kDa protein’);
• one entity name may have several spelling forms (e.g. ‘N-acetylcysteine’, ‘N-acetyl-cysteine’ and ‘NAcetylCysteine’);
• ambiguous abbreviations are frequently used (e.g. ‘TCF’ may refer to ‘T cell factor’ or to ‘Tissue Culture Fluid’).

Various systems have been developed using different approaches and techniques, which can be categorized as being based on rules, dictionaries or machine learning. However, the most recent results clearly indicate that better performance can be achieved by using an ensemble of NER systems. As an example, the top five systems of the BioCreative II gene mention challenge (Smith et al., 2008) used ensembles of NER solutions. In these systems, each approach identifies entity mentions with different characteristics and based on different knowledge.

Moreover, most NER solutions are trained and/or tested on only one corpus, which is usually focused on a specific biomedical domain and provides specific gene/protein names and contexts. As a consequence, when the system is applied to a corpus from a different domain, the global performance drops significantly. This occurs with machine learning approaches, but it also affects dictionary-based solutions, depending on the specificity of the lexical resource used. The drop is not only a consequence of the different domains, but also a result of the different annotation guidelines and their interpretation by human annotators. For instance, Colosimo et al. (2005) presents a study

2 BACKGROUND

Nowadays, the annotation of biomedical documents is mainly performed manually by domain experts. Consequently, only small sets of documents have been manually annotated and made publicly available.
The CALBC (Collaborative Annotation of a Large Biomedical Corpus) project aims to mitigate this problem by providing a large-scale biomedical text corpus that is annotated automatically through the harmonization of several NER systems. This large corpus will contain annotations of several biological semantic groups, such as diseases, species, chemicals and genes/proteins (Rebholz-Schuhmann et al., 2010). The CALBC corpus focuses on the immunology sub-domain; its abstracts were collected from MEDLINE using the query ‘immunol*’. To generate the first version of this corpus, four different NER and normalization systems were used:

• System 1: implements a dictionary-based approach that takes morphological variability into consideration. It uses several publicly available resources, such as Swiss-Prot (Boutet et al., 2007) and ChEBI (Degtyarenko et al., 2008);
• System 2: applies a dictionary-based approach using Entrez Gene (Maglott et al., 2005), Swiss-Prot, Genew (Wain et al., 2004), GDB (Letovsky et al., 1998) and OMIM (Hamosh et al., 2005) as terminological resources;
• System 3: implements a machine learning approach u (...truncated)