Can we replace curation with information extraction software?
Database, 2016, 1–4
doi: 10.1093/database/baw150
Perspective
Peter D. Karp
Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA.
Tel: 650-859-4358; Fax: 650-859-3735; E-mail:
Citation details: Karp,P.D. Can we replace curation with information extraction software?. Database (2016) Vol. 2016:
article ID baw150; doi:10.1093/database/baw150
Accepted 19 October 2016
Abstract
Can we use programs for automated or semi-automated information extraction from scientific texts as practical alternatives to professional curation? I show that error rates of
current information extraction programs are too high to replace professional curation
today. Furthermore, current information extraction programs extract single narrow slivers of information,
such as individual protein interactions; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. They also cannot
arbitrate among conflicting statements in the literature as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60
years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential based on a review of recent tools that enhance curator productivity. But a full cost-benefit analysis for these tools is lacking. Without such analysis it is
possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease
in curation costs.
Introduction
Bourne et al. recently proposed (1) that to improve efficiency
and decrease costs, biomedical databases must explore new
business models and methodologies. They suggest three alternatives to traditional literature-based curation by professional curators that they presumably believe will decrease
the costs of curation: ‘complete and accurate automated or
semi-automated extraction of literature’, crowd sourcing of
curation, and curation by authors of publications.
© The Author(s) 2016. Published by Oxford University Press.
Although the costs of professional curation are surprisingly low (2) (on average the cost of curating one article for
the EcoCyc database is roughly 10% of the open-access
publication fee for publishing a biomedical article), here we
consider the first alternative to professional curation. What
progress has been made, what challenges remain, and how
practical an alternative is automated or semi-automated information extraction? We will consider the other alternatives in a future perspective.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Text mining as an alternative to professional
curation
databases such as EcoCyc, one database entry corresponds
to one biological entity, and curators seek to integrate
many published findings about that entity. For example,
EcoCyc curators synthesize multi-paragraph mini-review
summaries for protein and pathway pages; they follow
changes in the names of genes, proteins, and metabolites;
and they summarize and resolve disagreements and conflicts in the literature—capabilities that far exceed what
IEPs or other Artificial Intelligence techniques can do.
To generalize, the difficulty of automating curation (or,
for that matter, of crowd-sourcing curation) will depend
on the complexity of that curation. Different databases employ curation processes of varying complexity depending
on the number of types of data they extract, the number of
database fields that are populated by the curation effort,
the amount of meta-data extracted (e.g. is extracted information annotated with evidence codes?), the amount of
knowledge integration (interpretation and synthesis) that
the curators perform, whether curators author minireviews, and the end uses to which the data will be put (curation of knowledge to form an executable metabolic model
will be more difficult than curation of knowledge to create
a web page that will be read by scientists).
Semi-automated extraction as an alternative
to professional curation
In my opinion, there is much more near-term potential for
semi-automated text-mining approaches to accelerate curation work. But to date, results have been very limited. One
success story is software developed by WormBase to perform article triage—categorizing the type of information
contained in articles for assignment to an appropriate member of the curation staff (8). WormBase also developed software that identifies sentences within publications that
contain words likely to state the cellular compartments
in which proteins are localized, analyzes those sentences,
and pre-fills a curation form that could then be approved or
modified by a curator (9). An evaluation found the tool to
be moderately accurate (F-score of 0.509 for dictyBase and
0.547 for TAIR). The tool was found to increase curator efficiency 2.5-fold for dictyBase and 10-fold for TAIR; an earlier study by these authors found that the time to curate
cellular compartment information could be decreased by a
factor of 8–15 (10).
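The reported numbers can be put in perspective with a small calculation. The sketch below is illustrative only: it converts the reported F-scores into error rates using the 1 − F-score convention this article applies to named-entity recognition, and it estimates the speedup of a whole curation workflow when only one task is accelerated (an Amdahl-style estimate). The function names and the 5% time share are assumptions made for illustration; the cited studies do not report what fraction of total curation time is spent on cellular compartment curation.

```python
def error_rate(f_score: float) -> float:
    """Error rate under the convention used in this article: 1 - F-score."""
    if not 0.0 <= f_score <= 1.0:
        raise ValueError("F-score must lie in [0, 1]")
    return 1.0 - f_score


def overall_speedup(task_fraction: float, task_speedup: float) -> float:
    """Amdahl-style speedup of the full workflow when one task is accelerated.

    task_fraction: share of total curation time spent on the accelerated task
                   (HYPOTHETICAL input; not reported in the cited studies).
    task_speedup:  factor by which the tool speeds up that one task.
    """
    if not 0.0 <= task_fraction <= 1.0:
        raise ValueError("task_fraction must lie in [0, 1]")
    if task_speedup <= 0.0:
        raise ValueError("task_speedup must be positive")
    return 1.0 / ((1.0 - task_fraction) + task_fraction / task_speedup)


# F-scores and task-level speedups reported in the evaluation cited as (9).
reported = {
    "dictyBase": {"f_score": 0.509, "task_speedup": 2.5},
    "TAIR": {"f_score": 0.547, "task_speedup": 10.0},
}

# ASSUMPTION for illustration only: the accelerated task takes 5% of curator time.
ASSUMED_TASK_FRACTION = 0.05

for database, values in reported.items():
    err = error_rate(values["f_score"])
    gain = overall_speedup(ASSUMED_TASK_FRACTION, values["task_speedup"])
    print(f"{database}: error rate {err:.3f}, overall speedup {gain:.3f}x")
```

Under the assumed 5% share, even a 10-fold task-level speedup yields an overall speedup of only about 1.05x. The real figure depends entirely on how much curator time the automated task consumes, which is the kind of measurement a cost-benefit analysis would need.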
At first glance these results seem quite significant, but
the accuracy of these tools is limited, and a full cost-benefit
analysis for these tools is lacking. As a database Principal
Investigator, to decide whether to adopt a given new semi-automated extraction tool in the EcoCyc curation
Extracting information from written texts is a form of the
natural-language understanding problem, an Artificial
Intelligence problem that has remained unsolved for 60+
years. Although significant progress has been made in this
field, information-extraction programs (IEPs) are not accurate or comprehensive enough to replace manual curation.
One of the simpler IEP tasks involves recognizing the names
of entities in biomedical texts, which is called the named-entity recognition problem. Error rates (computed as 1 − F-score) for six state-of-the-art named-entity recognition tools
for recognizing the names of genes, diseases, organisms,
chemicals, and mutations in text (one object type per program) range from 6 to 46% (mean is 18%) (3). Other recent results on named-entity recognition come from the
BioCreative V competition, involving recognition of chemical names and disease names; Table 2 of (4) lists results
from 16 teams where the error rates range from 13 to 48%
(mean is 24%). Recognizing named entities in biomedical
texts is the first step in extracting more complex relationships among those entities. Anania (...truncated)