Gold-standard ontology-based anatomical annotation in the CRAFT Corpus
Database, 2017, 1–13
doi: 10.1093/database/bax087
Original article
Michael Bada1,*, Nicole Vasilevsky2, William A. Baumgartner Jr.1,
Melissa Haendel2 and Lawrence E. Hunter1
1 School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus,
12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA and 2 Ontology Development Group,
Library, Oregon Health & Science University, 318 SW Sam Jackson, Park Road, Portland, OR 97239, USA
*Corresponding author: Tel: +1-303-724-3292; Email:
Bada,M., Vasilevsky,N., Baumgartner Jr,W.A. et al. Gold-standard ontology-based anatomical annotation in the CRAFT
Corpus. Database (2017) Vol. 2017: article ID bax087; doi:10.1093/database/bax087
Received 28 July 2017; Revised 25 October 2017; Accepted 27 October 2017
Abstract
Gold-standard annotated corpora have become important resources for the training and
testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic
integration across disparate resources. Bringing together the respective power of these,
the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting
and semantic markup, was previously created and released. This initial public release has
already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on
its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually
created semantic annotations relying on Uberon, an ontology representing anatomical
entities and life-cycle stages of multicellular organisms across species as well as types of
multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This
newly created set of annotations, which has been added for v2.1 of the corpus, is by far the
largest publicly available collection of gold-standard anatomical markup and is the first
large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical
categories, as performed in previous corpora. In addition to presenting and discussing this
newly available resource, we apply it to provide a performance baseline for the automatic
annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from
http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml
© The Author(s) 2017. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
evaluation of prominent concept-recognition systems (13),
and it has already been used in multiple projects to drive
development of systems for a variety of syntactic and semantic NLP tasks including lemmatization (14), coordination resolution (15) and concept recognition and mapping
(16–21). It has additionally been used in the development
of more expansive systems focused on tasks such as curation (22, 23), information extraction and discovery (24,
25), function prediction (26), querying and search (27),
summarization (28) and visualization (17).
Motivated by the considerable recent interest in the
automatic identification of anatomical entities in text and,
beyond that, extraction and curation of assertions involving anatomical entities (29–37), we have expanded the semantic markup of the CRAFT Corpus with a large set of
manually created concept annotations using the classes of
the Uberon ontology, a widely used OBO centered on the
representation of anatomical entities and life-cycle stages
of multicellular organisms as well as multicellular organisms defined in terms of life-cycle stage and sexual characteristics (38). This newly created set of over 16 000
anatomical annotations in the public release, which has
been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical
markup and is, to our knowledge, the first large-scale effort
to manually mark up biomedical text using the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical
categories, as performed in previous corpora. In addition
to presenting and discussing this newly available resource,
we apply it to provide performance baselines for the automatic annotation of anatomical concepts in biomedical
text using a prominent concept recognition system.
Methods
The 2015/04/23 release of the Uberon ontology in OBO format
(http://www.geneontology.org/faq/what-obo-file-format), i.e. with a specified data version of
uberon/releases/2015-04-23/basic.owl in the .obo file,
which was the version current when markup of the CRAFT Corpus
with this ontology began, was
downloaded, and a Protégé-Frames (39) ontology project
was programmatically created by parsing the .obo ontology file and making use of the Protégé-Frames Java API.
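
The parsing step described above can be illustrated with a minimal sketch. It is not the authors' code: they used the Protégé-Frames Java API, whereas this is a standalone Python illustration. It assumes only the basic OBO flat-file layout (header tag-value pairs followed by bracketed stanzas). The function name `parse_obo`, the constant names and the toy sample stanza are hypothetical; only the data-version string is taken from the text.

```python
from typing import Dict, List, Optional, Tuple

# Data version quoted in the text; everything else below is illustrative.
EXPECTED_VERSION = "uberon/releases/2015-04-23/basic.owl"

# Toy OBO excerpt (an assumption for illustration, not the real Uberon file).
SAMPLE = """format-version: 1.2
data-version: uberon/releases/2015-04-23/basic.owl

[Term]
id: UBERON:0000955
name: brain
is_a: UBERON:0000062 ! organ

[Typedef]
id: part_of
name: part of
"""


def parse_obo(text: str) -> Tuple[Dict[str, str], List[Dict[str, List[str]]]]:
    """Return (header tags, list of [Term] stanzas) from OBO-format text.

    Simplification: a trailing ' ! comment' is dropped by a plain split,
    which ignores OBO quoting and escaping rules.
    """
    header: Dict[str, str] = {}
    terms: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None  # stanza being filled
    in_header = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            in_header = False
            # Only [Term] stanzas are collected; others are skipped.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a tag-value line
        value = value.split(" ! ")[0].strip()
        if in_header:
            header[tag] = value
        elif current is not None:
            current.setdefault(tag, []).append(value)
    return header, terms


header, terms = parse_obo(SAMPLE)
version_ok = header.get("data-version") == EXPECTED_VERSION
```

Checking `data-version` before building an annotation project is one way to guard against accidentally mixing ontology releases during a long annotation effort.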
Although subsequent versions of Uberon were released, this starting version of
Uberon was used throughout the annotation process for
consistency (as was done for the previous annotation
passes with the ontologies used to create v1.0 of the corpus). As annotation was to be performed in Knowtator [a
tab plugin to Protégé-Frames designed to enable markup of
With the ever-rising amount of biomedical literature, it is
increasingly difficult for scientists to keep up with the published work in their fields of research, much less related
ones. With the digitalization of much of the literature,
natural-language processing (NLP) and mining of publications have become increasingly important in biomedical research and curation (1–5). So too have biomedical
ontologies, whose use facilitates curational consistency and
furthers semantic integration across disparate resources,
and millions of biomedical entities have been annotated
with them (6, 7). Particularly relevant to biomedicine are
the Open Biomedical Ontologies (OBOs), a set of open, orthogonal, interoperable ontologies formally representing
knowledge over a wide range of biology, medicine and
related disciplines (8).
Manually annotated document corpora are critical
gold-standard resources for the t (...truncated)