Can we replace curation with information extraction software?


Database, 2016, 1–4
doi: 10.1093/database/baw150
Perspective

Peter D. Karp
Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA. Tel: 650-859-4358; Fax: 650-859-3735

Citation details: Karp, P.D. Can we replace curation with information extraction software? Database (2016) Vol. 2016: article ID baw150; doi:10.1093/database/baw150

Accepted 19 October 2016

Abstract

Can we use programs for automated or semi-automated information extraction from scientific texts as practical alternatives to professional curation? I show that error rates of current information extraction programs are too high to replace professional curation today. Furthermore, current information extraction programs extract single narrow slivers of information, such as individual protein interactions; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. They also cannot arbitrate among conflicting statements in the literature as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60 years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential based on a review of recent tools that enhance curator productivity. But a full cost-benefit analysis for these tools is lacking. Without such analysis it is possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease in curation costs.

Introduction

Bourne et al. recently proposed (1) that to improve efficiency and decrease costs, biomedical databases must explore new business models and methodologies.
They suggest three alternatives to traditional literature-based curation by professional curators that they presumably believe will decrease the costs of curation: ‘complete and accurate automated or semi-automated extraction of literature’, crowd sourcing of curation, and curation by authors of publications. Although the costs of professional curation are surprisingly low (2) (on average the cost of curating one article for the EcoCyc database is roughly 10% of the open-access publication fee for publishing a biomedical article), here we consider the first alternative to professional curation. What progress has been made, what challenges remain, and how practical an alternative is automated or semi-automated information extraction? We will consider the other alternatives in a future perspective.

© The Author(s) 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Text mining as an alternative to professional curation

databases such as EcoCyc, one database entry corresponds to one biological entity, and curators seek to integrate many published findings about that entity. For example, EcoCyc curators synthesize multi-paragraph mini-review summaries for protein and pathway pages; they follow changes in the names of genes, proteins, and metabolites; and they summarize and resolve disagreements and conflicts in the literature, capabilities that far exceed what IEPs or other Artificial Intelligence techniques can do. To generalize, the difficulty of automating curation (or, for that matter, of crowd-sourcing curation) will depend on the complexity of that curation.
Different databases employ curation processes of varying complexity depending on the number of types of data they extract, the number of database fields that are populated by the curation effort, the amount of meta-data extracted (e.g. is extracted information annotated with evidence codes?), the amount of knowledge integration (interpretation and synthesis) that the curators perform, whether curators author mini-reviews, and the end uses to which the data will be put (curation of knowledge to form an executable metabolic model will be more difficult than curation of knowledge to create a web page that will be read by scientists).

Semi-automated extraction as an alternative to professional curation

In my opinion, there is much more near-term potential for semi-automated text-mining approaches to accelerate curation work. But to date, results have been very limited. One success story is software developed by WormBase to perform article triage: categorizing the type of information contained in articles for assignment to an appropriate member of the curation staff (8). WormBase also developed software that identifies sentences within publications that contain words likely to be stating the cellular compartments in which proteins are localized, analyzes those sentences, and pre-fills a curation form that could then be approved or modified by a curator (9). An evaluation found the tool to be moderately accurate (F-score of 0.509 for dictyBase and 0.547 for TAIR). The tool was found to increase curator efficiency 2.5-fold for dictyBase and 10-fold for TAIR; an earlier study by these authors found that the time to curate cellular compartment information could be decreased by a factor of 8–15 (10). At first glance these results seem quite significant, but the accuracy of these tools is limited, and a full cost-benefit analysis for these tools is lacking.
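
To make the cost-benefit concern concrete, here is a minimal sketch in Python of how a per-task speed-up translates into an overall saving of curator time. It uses the 2.5-fold and 10-fold figures reported above, but the fraction of total curator time spent on the accelerated task (`task_fraction`) is a purely hypothetical value chosen for illustration; the article does not report it. The function name and the simple Amdahl-style model are likewise assumptions of this sketch, not part of any cited tool or study.

```python
def fraction_of_time_saved(task_fraction: float, task_speedup: float) -> float:
    """Fraction of total curator time saved when one task gets faster.

    task_fraction: share of total curation time spent on the task (0..1).
                   Hypothetical input; not reported in the article.
    task_speedup:  how many times faster the task becomes (e.g. 2.5 or 10).
    """
    if not 0.0 <= task_fraction <= 1.0:
        raise ValueError("task_fraction must be between 0 and 1")
    if task_speedup <= 0.0:
        raise ValueError("task_speedup must be positive")
    # Time left on the task after the speed-up is task_fraction / task_speedup,
    # so the saving is the difference from the original task_fraction.
    return task_fraction * (1.0 - 1.0 / task_speedup)


if __name__ == "__main__":
    hypothetical_fraction = 0.10  # assumed: task takes 10% of curator time
    for speedup in (2.5, 10.0):   # speed-ups reported for dictyBase and TAIR
        saved = fraction_of_time_saved(hypothetical_fraction, speedup)
        print(f"{speedup}-fold speed-up on {hypothetical_fraction:.0%} "
              f"of the work saves {saved:.1%} of total time")
```

Under this assumed 10% share, even a 10-fold speed-up removes only about 9% of total curation time, and a 2.5-fold speed-up about 6%. The real share for any given database would have to be measured, which is the kind of analysis the text describes as lacking.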
As a database Principal Investigator, to decide whether to adopt a given new semi-automated extraction tool in the EcoCyc curation

Extracting information from written texts is a form of the natural-language understanding problem, an Artificial Intelligence problem that has remained unsolved for 60+ years. Although significant progress has been made in this field, information-extraction programs (IEPs) are not accurate or comprehensive enough to replace manual curation. One of the simpler IEP tasks involves recognizing the names of entities in biomedical texts, which is called the named-entity recognition problem. Error rates (computed as 1 − F-score) for six state-of-the-art named-entity recognition tools for recognizing the names of genes, diseases, organisms, chemicals, and mutations in text (one object type per program) range from 6 to 46% (mean is 18%) (3). Other recent results on named-entity recognition come from the BioCreative V competition, involving recognition of chemical names and disease names; Table 2 of (4) lists results from 16 teams where the error rates range from 13 to 48% (mean is 24%). Recognizing named entities in biomedical texts is the first step in extracting more complex relationships among those entities. Anania (...truncated)