HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

Bioinformatics, Oct 2024

With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied “in the wild,” i.e. on application-dependent text collections that differ moderately to extremely from those used for training, varying, e.g. in focus, genre or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.

Bioinformatics, 2024, 40(10), btae564
https://doi.org/10.1093/bioinformatics/btae564
Advance Access Publication Date: 20 September 2024
Original Paper

Data and text mining

Mario Sänger 1,*,†, Samuele Garda 1,†, Xing David Wang 1,†, Leon Weber-Genzel 2, Pia Droop 1, Benedikt Fuchs 3, Alan Akbik 1, Ulf Leser 1,*

1 Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
2 Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
3 Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria

* Corresponding authors. Department of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany. E-mails: (M.S.) and (U.L.)
† = equal contribution.
Associate Editor: Jonathan Wren

Received: 3 April 2024; Revised: 23 August 2024; Editorial Decision: 13 September 2024; Accepted: 17 September 2024

© The Author(s) 2024. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied “in the wild,” i.e. on application-dependent text collections that differ moderately to extremely from those used for training, varying, e.g. in focus, genre or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.

Results: Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools “in the wild” and show that further research is necessary for more robust BTM tools.

Availability and implementation: All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.

1 Introduction

The volume of biomedical literature is expanding at a rapid pace, with public repositories like PubMed housing over 30 million publication abstracts. A major challenge lies in the high-quality extraction of relevant information from this ever-growing body of literature, a task that no human can feasibly accomplish, thus requiring support from computer-assisted methods.

A crucial step in such pipelines is the extraction of biomedical entities (such as genes/proteins and diseases), as it is a prerequisite for further processing steps, like relation extraction (Weber et al. 2022), knowledge base (KB) completion (Sänger and Leser 2021) or pathway curation (Weber et al. 2020). As shown in Fig. 1, this typically involves two steps: (i) named entity recognition (NER) and (ii) named entity normalization (NEN), a.k.a. entity linking or entity disambiguation. We refer to their combination as extraction. NER identifies and classifies entities discussed in a given document. However, different documents may use different names (synonyms) to refer to the same biomedical concept. For instance, “tumor protein p53” and “tumor suppressor p53” are both valid names for the gene “TP53” (NCBI Gene: 7157). The same mention can also refer to different entities (homonyms), e.g. “RSV” can be “Rous-Sarcoma-Virus” or “Respiratory syncytial virus” depending on context. Entity normalization addresses the issues of synonyms and homonyms by mapping mentions found by NER to a KB identifier. This process ensures that all entity mentions are recognized as referring to the same concept, regardless of how they are expressed in the text, which allows information to be aggregated and compared across different documents.

Over the last two decades, several studies investigated biomedical NER and NEN. Of the many research prototypes, some have been consolidated into mature tools that are easy to install and use, and that end users can apply directly for their specific needs (Wei et al. 2019, Weber et al. 2021, Zhang et al. 2021, inter alia). These tools are commonly deployed “in the wild,” i.e. to custom text collections with specific focus (e.g. cancer or genetic disorders), entity distribution (gene-focused molecular biology or disease-focused clinical trials), genre (publication, patent, report) and text type (abstract, full text, user-generated content). The tools, however, were originally trained and evaluated on a single or few gold standard corpora, each having its own specific characteristics (focus, entity distribution, etc.). The mismatch between these two settings, i.e. training/evaluation versus downstream deployment, raises the question of whether the performance in the first can be trusted to estimate the one achievable in the second. As named entity extraction is the cornerstone of several applications, e.g. for relation and event extraction (Wang et al. 2020), the issue has a direct and critical impact on downstream information extraction pipelines.

To better quantify the impact of this issue, previous work proposed to use a cross-corpus evaluation, i.e. training models on one corpus and evaluating on a different one (Galea et al. 2018). For instance, Giorgi and Bader (2020) show that the performance of neural networks for NER drops by an average of 31.16% F1 when tested on a corpus different from the one used for training. Previous studies, however, present a few limitations: First, (...truncated)
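
To make the two-step extraction from the Introduction concrete (NER finds mentions, NEN maps them to a KB identifier), the following is a minimal dictionary-based sketch in Python. It is an illustration only and not how HunFlair2 or any of the evaluated tools work. The TP53 identifier (NCBI Gene: 7157) is taken from the text; the virus identifiers, the context cue words, and all function names are hypothetical placeholders chosen for this example.

```python
import re
from dataclasses import dataclass

# Synonyms: several surface forms map to one KB identifier.
# "NCBI Gene:7157" is from the text; everything else here is an assumption.
SYNONYMS = {
    "tumor protein p53": "NCBI Gene:7157",
    "tumor suppressor p53": "NCBI Gene:7157",
    "tp53": "NCBI Gene:7157",
}

# Homonyms: one surface form with several candidate entities.
# Identifiers and cue words are hypothetical placeholders.
HOMONYMS = {
    "rsv": [
        ("PLACEHOLDER:rous-sarcoma-virus", {"sarcoma", "chicken"}),
        ("PLACEHOLDER:respiratory-syncytial-virus", {"respiratory", "infant", "bronchiolitis"}),
    ],
}


@dataclass
class Mention:
    text: str
    start: int
    end: int


def recognize(document: str) -> list:
    """NER step (toy version): find known names by case-insensitive matching."""
    names = sorted(list(SYNONYMS) + list(HOMONYMS), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    return [Mention(m.group(0), m.start(), m.end()) for m in pattern.finditer(document)]


def normalize(mention: Mention, document: str):
    """NEN step (toy version): map a mention to a KB identifier, or None."""
    key = mention.text.lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Homonym: pick the candidate whose cue words overlap most with the document.
    context = set(re.findall(r"[a-z]+", document.lower()))
    best_id, best_overlap = None, 0
    for kb_id, cues in HOMONYMS.get(key, []):
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_id, best_overlap = kb_id, overlap
    return best_id


def extract(document: str) -> list:
    """Extraction = NER followed by NEN."""
    return [(m.text, normalize(m, document)) for m in recognize(document)]
```

Calling `extract("Both tumor protein p53 and TP53 refer to one gene.")` maps both mentions to the same identifier, which is what allows information to be aggregated across documents. A homonym such as "RSV" stays unresolved (`None`) when the document offers no matching context cue.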
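
The cross-corpus protocol described above (score a tool only on corpora it was not trained on) can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the benchmark code of the paper: annotations are assumed to be `(doc_id, start, end, entity_type)` tuples, a prediction counts as correct only on an exact match of span and type, and the names `f1_score`, `cross_corpus_scores`, `tools`, and `corpora` are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

# (doc_id, start offset, end offset, entity type); an assumed annotation format.
Annotation = Tuple[str, int, int, str]


def f1_score(gold: Set[Annotation], predicted: Set[Annotation]) -> float:
    """Entity-level F1 with exact span-and-type matching (an assumed criterion)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cross_corpus_scores(
    tools: Dict[str, Tuple[Callable[[Dict[str, str]], Set[Annotation]], Set[str]]],
    corpora: Dict[str, Tuple[Dict[str, str], Set[Annotation]]],
) -> Dict[Tuple[str, str], float]:
    """Score every tool on every corpus that was NOT part of its training data.

    tools:   tool name -> (predict function, names of corpora used for training)
    corpora: corpus name -> (documents as doc_id -> text, gold annotations)
    """
    scores = {}
    for tool_name, (predict, trained_on) in tools.items():
        for corpus_name, (documents, gold) in corpora.items():
            if corpus_name in trained_on:
                continue  # in-corpus setting, excluded from the cross-corpus benchmark
            scores[(tool_name, corpus_name)] = f1_score(gold, set(predict(documents)))
    return scores


# Toy usage: one tool, trained on "corpus_a", evaluated on the unseen "corpus_b".
def toy_predict(documents: Dict[str, str]) -> Set[Annotation]:
    # Stand-in for a real tagger: returns one fixed span per document.
    return {(doc_id, 0, 4, "Gene") for doc_id in documents}


toy_corpora = {
    "corpus_a": ({"a1": "TP53 is a gene."}, {("a1", 0, 4, "Gene")}),
    "corpus_b": (
        {"b1": "BRCA is linked to cancer."},
        {("b1", 0, 4, "Gene"), ("b1", 18, 24, "Disease")},
    ),
}
toy_tools = {"toy_tool": (toy_predict, {"corpus_a"})}
toy_scores = cross_corpus_scores(toy_tools, toy_corpora)
```

Keeping the set of training corpora next to each tool makes the exclusion explicit, so an in-corpus score cannot slip into the cross-corpus results by accident.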


This is a preview of a remote PDF: https://academic.oup.com/bioinformatics/article-pdf/40/10/btae564/59604134/btae564.pdf
Article home page: https://academic.oup.com/bioinformatics/article/40/10/btae564/7762634

Sänger, Mario, Garda, Samuele, Wang, Xing David, Weber-Genzel, Leon, Droop, Pia, Fuchs, Benedikt, Akbik, Alan, Leser, Ulf. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools, Bioinformatics, 2024, Volume 40, Issue 10, DOI: 10.1093/bioinformatics/btae564