Towards robust foundation models for digital pathology (pdf)

https://www.nature.com/articles/s41467-026-73923-2.pdf

Article https://doi.org/10.1038/s41467-026-73923-2

Nature Communications | (2026)17:5218

Received: 24 July 2025
Accepted: 18 May 2026

Jonah Kömen 1,2,11, Edwin D. de Jong 3,11, Julius Hense 1,2,11, Hannah Marienwald 1,2, Jonas Dippel 1,2,3, Philip Naumann 1,2, Eric Marcus 4, Lukas Ruff 3, Maximilian Alber 3,5, Jonas Teuwen 4, Frederick Klauschen 1,5,6,7,8 & Klaus-Robert Müller 1,2,9,10

1 Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany.
2 Machine Learning Group, Technische Universität Berlin, Berlin, Germany.
3 Aignostics GmbH, Berlin, Germany.
4 The Netherlands Cancer Institute Amsterdam (NKI), Antoni van Leeuwenhoek Hospital (AvL), Amsterdam, Netherlands.
5 Institute of Pathology, Charité Universitätsmedizin Berlin, Berlin, Germany.
6 Institute of Pathology, Ludwig-Maximilians-Universität München, Munich, Germany.
7 German Cancer Research Center, Heidelberg, and German Cancer Consortium, Munich, Germany.
8 Bavarian Center for Cancer Research (BZKF), Munich, Germany.
9 Department of Artificial Intelligence, Korea University, Seoul, Korea.
10 Max-Planck Institute for Informatics, Saarbrücken, Germany.
11 These authors contributed equally: Jonah Kömen, Edwin D. de Jong, Julius Hense.

Biomedical Foundation Models (FMs) are transforming AI-enabled healthcare research and entering clinical validation. However, their susceptibility to learning non-biological features, including variations in laboratory procedures and scanner hardware, poses risks for clinical deployment. We introduce PathoROB, a public benchmark quantifying FM robustness to non-biological features. Representation-level robustness is assessed using the robustness index, while output-level robustness is evaluated across clinically relevant settings, including patch- and slide-level prediction, case retrieval, and clustering tasks. Our experiments reveal robustness deficits across all 20 evaluated FMs, with substantial differences between them. We find that non-robust FM representations can cause major diagnostic downstream errors, preventing safe clinical adoption. Using more robust FMs, vision-language alignment, and post-hoc robustification reduces (but does not yet eliminate) the risk of such errors. This work establishes that robustness evaluation is essential for validating pathology FMs before clinical adoption and provides a blueprint for assessing and improving robustness across biomedical domains.

Biomedical Foundation Models (FMs) are large-scale AI models pretrained on increasingly large unlabeled biomedical datasets [1–4]. They drastically improved performance and generalization capabilities over standalone supervised models and non-biomedical pre-training across domains [5–12]. In digital pathology, FM pre-training has been scaled up to millions of Whole Slide Images (WSIs) and billions of model parameters [13,14]. Some of the resulting models demonstrate remarkable capabilities at a wide range of diagnostic tasks, such as pan-cancer classification or rare cancer detection [6,15–17]. They further advance the prediction of clinically relevant biomarkers from histology that typically require additional molecular or immunohistochemical testing, such as MSI, HER2, and EGFR [6,18–21], and enable real-world clinical utility of ML-based biomarkers [21].

As the development of pathology FMs is progressing rapidly, measuring their capabilities and differences becomes increasingly challenging [22]. To this end, many recent efforts have focused on contributing pathology benchmarks to assess the performance potential of foundation models in various clinically relevant settings [6,20,23–30].

However, one major issue that deserves systematic analysis is the apparent lack of robustness of FMs to technical variability across medical centers (hospitals, laboratories, biobanks, etc.). Such variability is caused by numerous factors, including biopsy acquisition technique, tissue preparation and sectioning, staining protocols, and whole slide scanning. These differences reflect neither medical nor biological tissue characteristics. Nevertheless, machine learning models can be negatively influenced by these types of variation [31,32]. Note that such systematic technical data biases, also known as batch effects [33–35], are not limited to digital pathology, but pose a fundamental issue across biomedical disciplines, e.g., in radiology [36,37] or molecular biology [34,38–40].

Foundation models might be expected to provide more robust information thanks to their large and diverse pre-training datasets. However, the self-supervised learning methods applied to pre-train pathology FMs are designed to capture any differences in the data [37], which includes technical variation. In fact, recent work suggests that pathology FMs encode technical/medical center information in their representations [41–46]. For example, Filiot et al. [43] considered different stainings and scanners applied to the same slides and observed substantial variations in the resulting FM representations. However, other factors prevalent in real-world diagnostic slides, such as differences in tissue fixation, section thickness, and quality, were not considered in that study [43].

With this work, we intend to contribute to the above-described challenge by thoroughly investigating FM robustness, its medical consequences, and strategies for improving FM robustness. As a part of this endeavor, we constructed PathoROB, a benchmark for systematically measuring pathology foundation model robustness against non-biological variation across medical centers. It consists of four multi-class multi-medical center datasets from three public sources that facilitate comparisons between biological and non-biological signals present in FM representations of histopathology images.
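To make the comparison between biological and non-biological signals more concrete, the following is a minimal illustrative sketch in Python. It is not the PathoROB implementation and does not reproduce any of the metrics defined in this work. It assumes that each embedding comes with a biological class label and a medical center label, and it reports, over the k nearest neighbors of every sample, the fraction that share the sample's class and the fraction that share its center. The function names (`make_toy_embeddings`, `neighbor_label_agreement`), the synthetic two-dimensional data, and the value of k are assumptions made purely for illustration.

```python
import numpy as np


def make_toy_embeddings(bio_scale, center_scale, n_per_group=50, seed=0):
    """Hypothetical toy data: 2 biological classes x 2 medical centers.

    Dimension 0 carries the class signal, dimension 1 the center signal.
    """
    rng = np.random.default_rng(seed)
    X, bio, center = [], [], []
    for b in (0, 1):
        for c in (0, 1):
            mean = np.array([b * bio_scale, c * center_scale])
            X.append(rng.normal(loc=mean, scale=1.0, size=(n_per_group, 2)))
            bio += [b] * n_per_group
            center += [c] * n_per_group
    return np.vstack(X), np.array(bio), np.array(center)


def neighbor_label_agreement(embeddings, bio_labels, center_labels, k=5):
    """Return (same-class fraction, same-center fraction) among k nearest neighbors."""
    X = np.asarray(embeddings, dtype=float)
    bio = np.asarray(bio_labels)
    center = np.asarray(center_labels)

    # Pairwise squared Euclidean distances; a sample is never its own neighbor.
    sq_norms = (X ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)

    # Indices of the k closest samples for every row.
    nn = np.argsort(dist, axis=1)[:, :k]

    same_bio = float((bio[nn] == bio[:, None]).mean())
    same_center = float((center[nn] == center[:, None]).mean())
    return same_bio, same_center


# Embedding space dominated by the biological signal.
X, bio, center = make_toy_embeddings(bio_scale=10.0, center_scale=0.0)
print("class-dominated: ", neighbor_label_agreement(X, bio, center))

# Embedding space dominated by the medical center signal.
X, bio, center = make_toy_embeddings(bio_scale=0.0, center_scale=10.0)
print("center-dominated:", neighbor_label_agreement(X, bio, center))
```

In this toy setting, an embedding space dominated by the class signal should show high class agreement and roughly chance-level center agreement among neighbors, while a space dominated by the center signal should show the opposite pattern.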
We present three metrics for assessing FM robustness and its implications: the performance drop in downstream tasks, a clustering score reflecting the global organization of the embedding space, and the robustness index, a metric measuring the degree to which foundation embeddings represent biological features rather than technical ones. Furthermore, we describe a framework to make foundation models more robust without retraining them and compare different ways to achieve this.

Applying PathoROB to 20 current pathology FMs exposed substantial performance differences related to pre-training scale and objective, but also revealed considerable robustness deficits in the representation spaces of all FMs. A low robustness index was correlated with diminished generalization performance and potentially dangerous failures in clinically relevant FM applications, including supervised downstream models, image clustering, and diagnostic case search. Using post-training robustification methods like image-space stain normalization [47] and representation-space batch correction [48–50] consi (...truncated)