Concept annotation in the CRAFT corpus
Bada et al. BMC Bioinformatics 2012, 13:161
http://www.biomedcentral.com/1471-2105/13/161
RESEARCH ARTICLE
Open Access
Michael Bada1*, Miriam Eckert2, Donald Evans1, Kristin Garcia1, Krista Shipley1, Dmitry Sitnikov3,
William A Baumgartner Jr1, K Bretonnel Cohen1, Karin Verspoor1,4, Judith A Blake3 and Lawrence E Hunter1
Abstract
Background: Manually annotated corpora are critical for the training and evaluation of automated methods to
identify concepts in biomedical text.
Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a
collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and
syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community.
CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies:
the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein
Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the
Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15
articles for future text-mining competitions (after which these too will be released). Concept annotations were
created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator
agreement.
Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000
tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the
journal articles that make up the corpus are drawn from diverse biomedical disciplines and are marked up in their
entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than
140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The
concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by
providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated
resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Background
With the digitalization of much of the biomedical literature, automated processing of journal publications has
become increasingly important in biomedical research.
Biomedical researchers struggle to keep abreast of the
exponentially growing literature, due not only to its
sheer scale but also to the expanding range of disciplines
and journals relevant to a typical research question. Biomedical publications, like most texts, are fraught with
synonymy, polysemy, ambiguity, and complexity. Transformation of these texts into formal representations of
the contained knowledge makes possible the application
of sophisticated computational methods that assist
researchers and advance science. Substantial progress in
biomedical natural-language processing (NLP), particularly in the tasks of information retrieval, concept recognition, and information extraction [1-5], raises the possibility
of creating formal representations for the entire biomedical literature.
* Correspondence:
1 Department of Pharmacology, University of Colorado Anschutz Medical
Campus, Aurora, CO, USA
Full list of author information is available at the end of the article
© 2012 Bada et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Development of formal ontologies for the representation
of domain-specific knowledge has also made substantial
progress [6]. Among the most ambitious of these efforts
are the Open Biomedical Ontologies (OBOs), a set of
ontologies whose domains include anatomy, biological
processes and functions, cells and cellular components,
chemicals, phenotypes and diseases, and experiments and
procedures. These ontologies are largely constructed in a
community-driven approach, and their developers commit
to a common set of attributes including openness, shared
syntax, clear versioning, demarcated content, and clear
definition [7]. Millions of genes, gene products, and biomedical data sets have been annotated with ontological
terms, and these annotations are widely used as the basis
for high-throughput data analysis. In particular, calculations of enrichment of Gene Ontology (GO) terms in sets
of differentially expressed genes are widely used [8-10],
and more sophisticated uses of formal knowledge representations in data analysis are beginning to be published
(e.g., [11]).
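As an illustration of the GO-term enrichment calculations mentioned above, the following is a minimal sketch of one common formulation, a one-sided hypergeometric test. It is not taken from the cited studies [8-10]; the function name, parameter names, and the gene counts in the usage example are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(population_size, annotated_in_population,
                       sample_size, annotated_in_sample):
    """One-sided hypergeometric p-value for over-representation of a term.

    Returns the probability of drawing at least `annotated_in_sample`
    genes carrying the term when `sample_size` genes are drawn without
    replacement from `population_size` genes, of which
    `annotated_in_population` carry the term.
    """
    # The number of annotated genes in the sample cannot exceed either bound.
    max_hits = min(annotated_in_population, sample_size)
    # Number of equally likely ways to choose the sample.
    total = comb(population_size, sample_size)
    # Sum the upper tail; math.comb returns 0 when k exceeds n, so
    # impossible combinations contribute nothing.
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, sample_size - k)
        for k in range(annotated_in_sample, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes in the background, 400 annotated
    # with the term, 200 differentially expressed, 12 of those annotated.
    p = enrichment_p_value(20000, 400, 200, 12)
    print(f"enrichment p-value: {p:.3g}")
```

In practice such a test is repeated for many GO terms, so the resulting p-values would normally be corrected for multiple testing; that step is omitted here for brevity.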
Manually annotated, or “gold-standard”, corpora are
increasingly important for the development of sophisticated NLP systems, both as training data and for evaluative purposes. Use of manually annotated biomedical
corpora in NLP research has consistently led to improved
results. In a study by Tomanek et al., the accuracy of tokenization of a test set of biomedical text increased from
71.5% when their tool was trained on a corpus that was
tokenized using newspaper language patterns to 95.9%
when their tool was trained on a corpus whose tokenization was biomedically motivated [12]. Kulick et al. showed
that accuracy of part-of-speech annotation of biomedical
text increased from 88.53% to 97.33% on test abstracts
when their tagger was retrained after the training corpus
was manually checked and corrected [13], and Coden
et al. found that adding a small biomedical annotated corpus to a large general-English one increased accuracy of
part-of-speech tagging of biomedical text from 87% to
92% [14]. Lease and Charniak demonstrated large reductions in unknown word rates and large increases in accuracy of part-of-speech tagging and parsing when their
systems were trained with a biomedical corpus as compared to only general-English and/or business texts [15].
Roberts et al. showed that the best results in recognition of clinical concepts (e.g., conditions, drugs,
devices, interventions) in biomedical text, ranging from
10% below to 11% above the interannotator-agreement
scores for the gold-standard test set, were obtained with
the inclusion of statistical models trained on a manually
annotated corpus, as compared to dictionary-based concept recognition alone [16]. Craven and Kumlien found
generally higher levels of precision of extracted biomedical
assertions (e.g., protein-disease associations and subcellular, cell-type, and tissue localizations of proteins) for
Naïve-Bayes-model-based (...truncated)