Concept annotation in the CRAFT corpus

Bada et al. BMC Bioinformatics 2012, 13:161
http://www.biomedcentral.com/1471-2105/13/161

RESEARCH ARTICLE (Open Access)

Concept annotation in the CRAFT corpus

Michael Bada1*, Miriam Eckert2, Donald Evans1, Kristin Garcia1, Krista Shipley1, Dmitry Sitnikov3, William A Baumgartner Jr1, K Bretonnel Cohen1, Karin Verspoor1,4, Judith A Blake3 and Lawrence E Hunter1

Abstract

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety.
Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

* Correspondence: 1 Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Full list of author information is available at the end of the article

Background

With the digitalization of much of the biomedical literature, automated processing of journal publications has become increasingly important in biomedical research. Biomedical researchers struggle to keep abreast of the exponentially growing literature, owing not only to its sheer scale but also to the expanding range of disciplines and journals relevant to a typical research question. Biomedical publications, like most texts, are fraught with synonymy, polysemy, ambiguity, and complexity. Transformation of these texts into formal representations of the contained knowledge makes possible the application of sophisticated computational methods that assist researchers and advance science. Substantial progress in biomedical natural-language processing (NLP), particularly in the tasks of information retrieval, concept recognition, and information extraction [1-5], raises the possibility of creating formal representations for the entire biomedical literature.

Development of formal ontologies for the representation of domain-specific knowledge has also made substantial progress [6].
Among the most ambitious of these efforts are the Open Biomedical Ontologies (OBOs), a set of ontologies whose domains include anatomy, biological processes and functions, cells and cellular components, chemicals, phenotypes and diseases, and experiments and procedures. These ontologies are largely constructed in a community-driven approach, and their developers commit to a common set of attributes including openness, shared syntax, clear versioning, demarcated content, and clear definition [7]. Millions of genes, gene products, and biomedical data sets have been annotated with ontological terms, and these annotations are widely used as the basis for high-throughput data analysis. In particular, calculations of enrichment of Gene Ontology (GO) terms in sets of differentially expressed genes are widely used [8-10], and more sophisticated uses of formal knowledge representations in data analysis are beginning to be published (e.g., [11]).

© 2012 Bada et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Manually annotated, or “gold-standard”, corpora are increasingly important for the development of sophisticated NLP systems, both as training data and for evaluative purposes. Use of manually annotated biomedical corpora in NLP research has consistently led to improved results. In a study by Tomanek et al., the accuracy of tokenization of a test set of biomedical text increased from 71.5% when their tool was trained on a corpus that was tokenized using newspaper language patterns to 95.9% when their tool was trained on a corpus whose tokenization was biomedically motivated [12]. Kulick et al.
showed that accuracy of part-of-speech annotation of biomedical text increased from 88.53% to 97.33% on test abstracts when their tagger was retrained after the training corpus was manually checked and corrected [13], and Coden et al. found that adding a small biomedical annotated corpus to a large general-English one increased accuracy of part-of-speech tagging of biomedical text from 87% to 92% [14]. Lease and Charniak demonstrated large reductions in unknown word rates and large increases in accuracy of part-of-speech tagging and parsing when their systems were trained with a biomedical corpus as compared to only general-English and/or business texts [15]. Roberts et al. showed that the best results in recognition of clinical concepts (e.g., conditions, drugs, devices, interventions) in biomedical text, ranging from 10% below to 11% above the interannotator-agreement scores for the gold-standard test set, were obtained with the inclusion of statistical models trained on a manually annotated corpus, as compared to dictionary-based concept recognition alone [16]. Craven and Kumlien found generally higher levels of precision of extracted biomedical assertions (e.g., protein-disease associations and subcellular, cell-type, and tissue localizations of proteins) for Naïve-Bayes-model-based (...truncated)