HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools
Bioinformatics, 2024, 40(10), btae564
https://doi.org/10.1093/bioinformatics/btae564
Advance Access Publication Date: 20 September 2024
Original Paper
Data and text mining
1 Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
2 Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
3 Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria
*Corresponding authors. Department of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany.
E-mails: (M.S.) and (U.L.)
† = equal contribution.
Associate Editor: Jonathan Wren
Abstract
Motivation: With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for
accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization,
i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents.
However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied “in the wild,”
i.e. on application-dependent text collections that differ moderately to extremely from those used for training, varying, e.g. in focus, genre
or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of
the same corpus, can be trusted for downstream applications.
Results: Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools
were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on
predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our
results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the
redesigned and extended successor of the HunFlair tool, showed the best performance on average, closely followed by PubTator Central.
Our results indicate that users of BTM tools should expect lower performance than originally published when applying tools “in the
wild” and show that further research is necessary for more robust BTM tools.
Availability and implementation: All our models are integrated into the Natural Language Processing (NLP) framework flair:
https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
1 Introduction
The volume of biomedical literature is expanding at a rapid
pace, with public repositories like PubMed housing over 30
million publication abstracts. A major challenge lies in the
high-quality extraction of relevant information from this
ever-growing body of literature, a task that no human can
feasibly accomplish, thus requiring support from computer-assisted methods.
A crucial step in such pipelines is the extraction of biomedical
entities (such as genes/proteins and diseases), as it is a prerequisite
for further processing steps, like relation extraction
(Weber et al. 2022), knowledge base (KB) completion
(Sänger and Leser 2021) or pathway curation (Weber et al.
2020). As shown in Fig. 1, this typically involves two steps:
(i) named entity recognition (NER) and (ii) named entity
normalization (NEN), also known as entity linking or entity
disambiguation. We refer to their combination as extraction. NER
identifies and classifies entities discussed in a given document.
However, different documents may use different names
(synonyms) to refer to the same biomedical concept. For
instance, “tumor protein p53” and “tumor suppressor p53” are
both valid names for the gene “TP53” (NCBI Gene: 7157).
The same mention can also refer to different entities
(homonyms), e.g. “RSV” can be “Rous-Sarcoma-Virus” or
“Respiratory syncytial virus” depending on context. Entity
normalization addresses the issues of synonyms and homonyms
by mapping mentions found by NER to a KB identifier.
This process ensures that all entity mentions are recognized
as referring to the same concept, regardless of how they are
expressed in the text, which allows information to be aggregated
and compared across different documents.
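The synonym and homonym cases described above can be illustrated with a minimal dictionary-lookup sketch. It is not the method of any tool evaluated here: the toy KB, the function name `normalize` and the lookup strategy are hypothetical, and the identifiers for the two “RSV” readings are placeholders. Only the NCBI Gene identifier 7157 is taken from the text.

```python
# Hypothetical toy KB: lower-cased surface form -> candidate KB identifiers.
# Only "NCBI Gene:7157" comes from the text; the RSV entries are placeholders.
TOY_KB: dict[str, list[str]] = {
    "tp53": ["NCBI Gene:7157"],
    "tumor protein p53": ["NCBI Gene:7157"],
    "tumor suppressor p53": ["NCBI Gene:7157"],
    "rsv": [
        "PLACEHOLDER:rous-sarcoma-virus",
        "PLACEHOLDER:respiratory-syncytial-virus",
    ],
}


def normalize(mention: str) -> list[str]:
    """Return all candidate KB identifiers for a mention (empty if unknown)."""
    return TOY_KB.get(mention.strip().lower(), [])


# Synonyms: different surface forms collapse onto one identifier.
synonym_ids = {
    tuple(normalize(m))
    for m in ("tumor protein p53", "Tumor suppressor p53", "TP53")
}

# Homonym: one surface form yields several candidates.
homonym_candidates = normalize("RSV")

print(synonym_ids)
print(homonym_candidates)
```

A plain lookup leaves the “RSV” case unresolved, since it returns both candidates. Choosing between them requires the surrounding context, which is what makes normalization harder than string matching.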
Mario Sänger 1,*,†, Samuele Garda 1,†, Xing David Wang 1,†, Leon Weber-Genzel 2, Pia Droop 1, Benedikt Fuchs 3, Alan Akbik 1, Ulf Leser 1,*
Received: 3 April 2024; Revised: 23 August 2024; Editorial Decision: 13 September 2024; Accepted: 17 September 2024
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Over the last two decades, several studies have investigated
biomedical NER and NEN. Of the many research prototypes,
some have been consolidated into mature tools that are easy to
install and use and that end users can apply directly to their specific
needs (Wei et al. 2019, Weber et al. 2021, Zhang et al. 2021,
inter alia). These tools are commonly deployed “in the wild,”
i.e. on custom text collections with a specific focus (e.g. cancer
or genetic disorders), entity distribution (gene-focused
molecular biology or disease-focused clinical trials), genre
(publication, patent, report) and text type (abstract, full text,
user-generated content). The tools, however, were originally
trained and evaluated on a single or a few gold standard corpora,
each having its own specific characteristics (focus, entity
distribution, etc.). The mismatch between these two
settings, i.e. training/evaluation versus downstream deployment,
raises the question of whether the performance in the first
can be trusted to estimate the one achievable in the second.
As named entity extraction is the cornerstone of several applications,
e.g. relation and event extraction (Wang et al.
2020), the issue has a direct and critical impact on downstream
information extraction pipelines.
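To make the comparison between the two settings concrete, the following minimal sketch computes entity-level precision, recall and F1 from sets of predicted and gold mentions. It assumes exact matching of document, span boundaries and entity type, which is one common convention and not necessarily the protocol used in this study. The `Mention` type, the helper `prf1` and the toy annotations are hypothetical.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """A hypothetical exact-match key for one annotated entity mention."""
    doc_id: str
    start: int
    end: int
    entity_type: str


def prf1(gold: set[Mention], predicted: set[Mention]) -> tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy gold annotations of an unseen corpus (made up for illustration).
gold = {
    Mention("d1", 0, 4, "Gene"),
    Mention("d1", 10, 16, "Disease"),
    Mention("d2", 5, 8, "Gene"),
    Mention("d2", 20, 29, "Chemical"),
}

# Toy predictions: two exact matches, one boundary error, one missed mention.
predicted = {
    Mention("d1", 0, 4, "Gene"),
    Mention("d1", 10, 15, "Disease"),  # end offset differs, so no match
    Mention("d2", 5, 8, "Gene"),
}

p, r, f = prf1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Running the same function once on a held-out split of the training corpus and once on a corpus the tool never saw gives two F1 values whose difference is the gap at issue here.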
To better quantify the impact of this issue, previous work
proposed cross-corpus evaluation, i.e. training models
on one corpus and evaluating them on a different one (Galea
et al. 2018). For instance, Giorgi and Bader (2020) show that
the performance of neural networks for NER drops by an average
of 31.16% F1 when tested on a corpus different from
the one used for training. Previous studies, however, present
a few limitations: First, (...truncated)