Towards robust foundation models for digital pathology
Article
https://doi.org/10.1038/s41467-026-73923-2
Received: 24 July 2025
Accepted: 18 May 2026
Jonah Kömen1,2,11, Edwin D. de Jong3,11, Julius Hense1,2,11,
Hannah Marienwald1,2, Jonas Dippel1,2,3, Philip Naumann1,2, Eric Marcus4,
Lukas Ruff3, Maximilian Alber3,5, Jonas Teuwen4,
Frederick Klauschen1,5,6,7,8 & Klaus-Robert Müller1,2,9,10
Biomedical Foundation Models (FMs) are transforming AI-enabled healthcare
research and entering clinical validation. However, their susceptibility to
learning non-biological features, including variations in laboratory procedures and scanner hardware, poses risks for clinical deployment. We introduce PathoROB, a public benchmark quantifying FM robustness to non-biological features. Representation-level robustness is assessed using the
robustness index, while output-level robustness is evaluated across clinically
relevant settings, including patch- and slide-level prediction, case retrieval,
and clustering tasks. Our experiments reveal robustness deficits across all 20
evaluated FMs, with substantial differences between them. We find that non-robust FM representations can cause major diagnostic downstream errors
preventing safe clinical adoption. Using more robust FMs, vision-language
alignment, and post-hoc robustification reduces (but does not yet eliminate)
the risk of such errors. This work establishes that robustness evaluation is
essential for validating pathology FMs before clinical adoption and provides a
blueprint for assessing and improving robustness across biomedical domains.
Biomedical Foundation Models (FMs) are large-scale AI models pretrained on increasingly large unlabeled biomedical datasets1–4. They
drastically improved performance and generalization capabilities over
standalone supervised models and non-biomedical pre-training across
domains5–12. In digital pathology, FM pre-training has been scaled up to
millions of Whole Slide Images (WSIs) and billions of model
parameters13,14. Some of the resulting models demonstrate remarkable
capabilities at a wide range of diagnostic tasks, such as pan-cancer
classification or rare cancer detection6,15–17. They further advance the
prediction of clinically relevant biomarkers from histology that typically require additional molecular or immunohistochemical testing —
such as MSI, HER2, and EGFR6,18–21 — and enable real-world clinical
utility of ML-based biomarkers21.
As the development of pathology FMs is progressing rapidly,
measuring their capabilities and differences becomes increasingly
challenging22. To this end, many recent efforts have focused on contributing pathology benchmarks to assess the performance potential
of foundation models in various clinically relevant settings6,20,23–30.
However, one major issue that deserves systematic analysis is the
apparent lack of robustness of FMs to technical variability across
medical centers (hospitals, laboratories, biobanks, etc.). Such variability is caused by numerous factors, including biopsy acquisition
technique, tissue preparation and sectioning, staining protocols, and
whole slide scanning. These differences reflect
neither medical nor biological tissue characteristics. Nevertheless,
machine learning models can be negatively influenced by these types
of variation31,32. Note that such systematic technical data biases, also
known as batch effects33–35, are not limited to digital pathology, but
pose a fundamental issue across biomedical disciplines, e.g., in
radiology36,37 or molecular biology34,38–40.
1 Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany.
2 Machine Learning Group, Technische Universität Berlin, Berlin, Germany.
3 Aignostics GmbH, Berlin, Germany.
4 The Netherlands Cancer Institute Amsterdam (NKI), Antoni van Leeuwenhoek Hospital (AvL), Amsterdam, Netherlands.
5 Institute of Pathology, Charité Universitätsmedizin Berlin, Berlin, Germany.
6 Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany.
7 German Cancer Research Center, Heidelberg, and German Cancer Consortium, Munich, Germany.
8 Bavarian Center for Cancer Research (BZKF), Munich, Germany.
9 Department of Artificial Intelligence, Korea University, Seoul, Korea.
10 Max-Planck Institute for Informatics, Saarbrücken, Germany.
11 These authors contributed equally: Jonah Kömen, Edwin D. de Jong, Julius Hense.
Nature Communications | (2026)17:5218
Foundation models might be expected to provide more robust
information thanks to their large and diverse pre-training datasets.
However, the self-supervised learning methods applied to pre-train
pathology FMs are designed to capture any differences in the data37,
including technical variation. In fact, recent work suggests that
pathology FMs encode technical/medical center information in their
representations41–46. For example, Filiot et al.43 considered different
stainings and scanners applied to the same slides and observed substantial variations in the resulting FM representations. However, other factors
prevalent in real-world diagnostic slides, such as differences in tissue
fixation, section thickness, and quality, were not considered in that
study43.
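To make concrete what it means for representations to encode medical-center information, the following is a minimal sketch of a neighbor-based probe. It is an illustration under stated assumptions, not the benchmark's definition of the robustness index: the function name `neighbor_ratio`, the Euclidean distance, the choice of k, and the toy embeddings are all hypothetical. For each embedding, it inspects the k nearest neighbors and compares how often a neighbor shares the biological class but comes from another center against how often it shares the center but has another biological class.

```python
import math
from typing import Hashable, Sequence


def neighbor_ratio(
    embeddings: Sequence[Sequence[float]],
    bio_labels: Sequence[Hashable],
    center_labels: Sequence[Hashable],
    k: int = 1,
) -> float:
    """Return SO / (SO + OS) over the k nearest neighbors of every sample.

    SO counts neighbors with the same biological class from another center.
    OS counts neighbors with another biological class from the same center.
    Values near 1 suggest biology dominates the neighborhood structure,
    values near 0 suggest the medical center dominates.
    (Illustrative assumption, not the authors' exact metric.)
    """
    n = len(embeddings)
    so = 0
    os_ = 0
    for i in range(n):
        # Brute-force Euclidean distances to all other samples.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n)
            if j != i
        )
        for _, j in dists[:k]:
            same_bio = bio_labels[i] == bio_labels[j]
            same_center = center_labels[i] == center_labels[j]
            if same_bio and not same_center:
                so += 1
            elif same_center and not same_bio:
                os_ += 1
    if so + os_ == 0:
        raise ValueError("no informative neighbors found")
    return so / (so + os_)


# Hypothetical toy data: two tissue classes, each observed at two centers.
bio = ["tumor", "tumor", "normal", "normal"]
centers = ["c1", "c2", "c1", "c2"]

# Embeddings where the biological class separates samples most strongly.
biology_dominated = [(0.0, 0.0), (0.0, 0.1), (1.0, 0.0), (1.0, 0.1)]
# Embeddings where the medical center separates samples most strongly.
center_dominated = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.1), (1.0, 0.1)]

ratio_bio = neighbor_ratio(biology_dominated, bio, centers, k=1)
ratio_center = neighbor_ratio(center_dominated, bio, centers, k=1)
```

In this toy setup, `ratio_bio` is 1.0 and `ratio_center` is 0.0, the two extremes of the probe. Real FM embeddings would be expected to fall somewhere in between, and a brute-force neighbor search like this one only scales to small sample counts.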
With this work, we address the above-described
challenge by thoroughly investigating FM robustness, its medical
consequences, and strategies for improving it. As a part
of this endeavor, we constructed PathoROB, a benchmark for systematically measuring pathology foundation model robustness
against non-biological variation across medical centers. It consists of
four multi-class multi-medical center datasets from three public
sources that facilitate comparisons between biological and non-biological signals present in FM representations of histopathology
images. We present three metrics for assessing FM robustness and its
implications: the performance drop in downstream tasks, a clustering
score reflecting the global organization of the embedding space, and
the robustness index: a metric measuring the degree to which foundation embeddings represent biological features rather than technical
ones. Furthermore, we describe a framework to make foundation
models more robust without retraining them and compare different
ways to achieve this. Applying PathoROB to 20 current pathology FMs
exposed substantial performance differences related to pre-training
scale and objective, but also revealed considerable robustness deficits
in the representation spaces of all FMs. A low robustness index was
correlated with diminished generalization performance and potentially dangerous failures in clinically relevant FM applications, including supervised downstream models, image clustering, and diagnostic
case search. Using post-training robustification methods like image-space stain normalization47 and representation-space batch
correction48–50 consi (...truncated)