A dataset for evaluating clinical research claims in large language models

www.nature.com/scientificdata | OPEN | Data Descriptor

Boya Zhang¹✉, Alban Bornet¹, Anthony Yazdani¹, Philipp Khlebnikov², Marija Milutinovic¹, Hossein Rouhizadeh¹, Poorya Amini² & Douglas Teodoro¹✉

Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and comprehension of logical statements require these models to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis testing results in clinical research, covering 992 unique interventions for 22 disease categories. The dataset used study arms and interventions, primary outcome measures, and results from clinical trials to derive and label clinical research claims. These claims were then linked to supporting information describing clinical trial results in scientific publications. CliniFact contains 1,970 instances from 992 unique clinical trials related to 1,540 unique publications. When evaluating LLMs against CliniFact, discriminative models, such as BioBERT with an accuracy of 80.2%, outperformed generative counterparts, such as Llama3-70B, which reached 53.6% accuracy (p-value < 0.001). Our results demonstrate the potential of CliniFact as a benchmark for evaluating LLM performance in clinical research claim verification.

Background & Summary

Large language models (LLMs) have demonstrated remarkable success in several natural language processing tasks in the health and life sciences domain [1]. Due to parameter scaling, access to specialized corpora, and better human alignment techniques, performance has significantly improved in recent years [2]. Yet these models still struggle with factual accuracy in various domains [3]. LLMs may produce factual errors that contradict established knowledge available at the time [4].
These inaccuracies and errors are particularly concerning in critical fields like healthcare, where incorrect information can have severe consequences [5]. To mitigate issues with factual accuracy and vulnerability to hallucinations, the incorporation of domain-specific knowledge when evaluating LLMs has been proposed [6]. This proposal stems from the fact that factual accuracy [7] and vulnerability to hallucinations [8] in LLMs can vary significantly across domains [9]. Models fine-tuned for a general purpose tend to perform better in the general domain [6], while models fine-tuned for specific domains, such as medicine (e.g., Meditron [10], Med-PaLM [11]), often outperform general-purpose models in those areas.

Another critical challenge for LLMs is their ability to perform logical reasoning [12]. This is particularly important in clinical research, where scientific claims are posed as logical statements, such as 'the intervention X is more effective than placebo for a specific outcome' [13], that are either true or false. Evaluating these claims requires a strong understanding of hypothesis testing and causal inference [14]. However, because LLMs are trained to predict tokens within a context [15], they struggle with complex logical statements [16] and can even produce unfaithful reasoning [17]. Research has shown that LLMs can be easily misled by irrelevant information [18]. Chain-of-thought (CoT) prompting can improve multi-step reasoning by providing intermediate rationales [19]. However, concerns remain regarding the faithfulness and reliability of these explanations, as they can often be biased or misleading [20]. Furthermore, while methodologies such as self-correction can improve reasoning accuracy, current models still struggle to correct their errors autonomously without external feedback [21]. In some cases, their performance degrades after self-correction [21].
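To make the link between a clinical claim and hypothesis testing concrete, the following is a minimal sketch of how a claim such as 'the intervention X is more effective than placebo for a specific outcome' could be checked against trial counts. It is an illustration only, not the labeling procedure used for CliniFact: the `Arm` structure, the pooled two-proportion z-test, the 0.05 significance level, and the counts are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """Outcome counts for one study arm (hypothetical structure)."""
    name: str
    responders: int
    participants: int

    @property
    def rate(self) -> float:
        return self.responders / self.participants


def one_sided_p_value(treatment: Arm, control: Arm) -> float:
    """P-value of a pooled two-proportion z-test for H1: treatment rate > control rate.

    Assumes the pooled rate is strictly between 0 and 1, otherwise the
    standard error is zero and the test is undefined.
    """
    pooled = (treatment.responders + control.responders) / (
        treatment.participants + control.participants
    )
    std_err = math.sqrt(
        pooled * (1 - pooled)
        * (1 / treatment.participants + 1 / control.participants)
    )
    z = (treatment.rate - control.rate) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def claim_is_supported(treatment: Arm, control: Arm, alpha: float = 0.05) -> bool:
    """True when the data reject the null hypothesis at level alpha."""
    return one_sided_p_value(treatment, control) < alpha


# Made-up counts, for illustration only.
drug = Arm("intervention X", responders=60, participants=100)
placebo = Arm("placebo", responders=40, participants=100)
print(claim_is_supported(drug, placebo))
```

The claim is treated as true or false depending on whether the test rejects the null hypothesis, which mirrors the binary framing of clinical research claims described above. A real trial would use the statistical method specified in its protocol.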
Integrating LLMs with symbolic solvers for logical reasoning [22] and hypothesis testing prompting for improved deductive reasoning [23] have been proposed to address these limitations.

¹Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland. ²Risklick AG, Bern, Switzerland. ✉e-mail: ;

Scientific Data | (2025) 12:86 | https://doi.org/10.1038/s41597-025-04417-x

Claim verification datasets play a crucial role in assessing the factual accuracy of LLMs across various domains [24]. FEVER [25], a general-domain dataset, was created by rewriting Wikipedia sentences into atomic claims, which are then verified using Wikipedia's textual knowledge base. FEVER also introduces a three-step fact verification process: document retrieval, evidence selection, and stance detection. In the political domain, the UKP Snopes corpus [26], derived from the Snopes fact-checking website, includes 6,422 validated claims paired with evidence text snippets. For the scientific domain, SciFact [27] includes 1.4K expert-written biomedical scientific claims paired with evidence-containing abstracts annotated with labels and rationales, while Climate-FEVER [28] contains 1,535 claims sourced from web searches, with corresponding evidence from Wikipedia. In the health and life science domains specifically, PUBHEALTH [29] gathers public health claims from fact-checking websites and verifies them against news articles. ManConCorpus [30] contains claims and sentences from 259 abstracts linked to 24 systematic reviews on cardiovascular disease. The COVID-19 pandemic and its infodemic effect [7,31] have further motivated the development of specialized datasets. HealthVer [32] is a medical-domain dataset derived by rewriting responses to questions from TREC-COVID [33], verified against the CORD-19 corpus [34].
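The three-step verification process attributed to FEVER above (document retrieval, evidence selection, stance detection) can be sketched as a small pipeline. This is a toy illustration under strong assumptions: token overlap stands in for a real retriever, a negation check stands in for a trained stance classifier, and the corpus, function names, and label strings are hypothetical rather than taken from any of the cited systems.

```python
import re

NEGATIONS = {"not", "no", "never"}


def tokens(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim, text):
    """Number of distinct tokens shared by the claim and the text."""
    return len(tokens(claim) & tokens(text))


def retrieve_documents(claim, corpus, k=1):
    """Step 1: rank documents by token overlap with the claim."""
    ranked = sorted(corpus, key=lambda doc_id: overlap(claim, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]


def select_evidence(claim, document):
    """Step 2: pick the sentence sharing the most tokens with the claim."""
    # Naive sentence split on periods; enough for this toy corpus.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return max(sentences, key=lambda s: overlap(claim, s))


def detect_stance(claim, evidence):
    """Step 3: toy stance rule; real systems use a trained classifier."""
    if overlap(claim, evidence) < 2:
        return "NOT ENOUGH INFO"
    claim_negated = bool(tokens(claim) & NEGATIONS)
    evidence_negated = bool(tokens(evidence) & NEGATIONS)
    return "SUPPORTS" if claim_negated == evidence_negated else "REFUTES"


def verify(claim, corpus):
    """Run the three steps and return (document id, evidence, label)."""
    doc_id = retrieve_documents(claim, corpus)[0]
    evidence = select_evidence(claim, corpus[doc_id])
    return doc_id, evidence, detect_stance(claim, evidence)


# Hypothetical two-document corpus, for illustration only.
corpus = {
    "doc_a": "Aspirin reduces fever in adults. It is sold without prescription.",
    "doc_b": "Paris is the capital of France. The city hosts many museums.",
}
print(verify("Aspirin does not reduce fever in adults", corpus))
```

Keeping the three steps as separate functions reflects how such pipelines are usually evaluated: retrieval, evidence selection, and stance detection can each be swapped out or scored on their own.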
COVID-Fact [35] likewise targets COVID-19 claims by scraping content from Reddit and verifying them against scientific papers and documents retrieved via Google search. CoVERt [36] enhances claim verification in the clinical domain by providing a new COVID verification dataset containing 15 PICO-encoded drug claims and 96 abstracts, each accompanied by one evidence sentence as rationale.

These datasets are either focused on lay claims [29,32,35], which require simpler reasoning skills, or, when focused on complex clinical research claims, they are disease-specific, e.g., COVID-19 [36] or cardiovascular disease [30], and of reduced scale (O(10^1) claims) [30,36]. Thus, they are limited in their ability to evaluate the factuality of complex clinical research claims by LLMs. To reduce this gap, we propose CliniFact [37], a large-scale claim dataset to evaluate the generalizability of LLMs in comprehending factuality and logical statements in clinical research. CliniFact [37] claims were automatically extracted from clinical trial protocols and results available from ClinicalTrials.gov. The claims were linked to supporting information in scientific publications available in Medline, with evidence provided at the abstract level. The resulting dataset contains O(10^3) claims spanning 20 disease classes. This new benchmark offers a novel approach to evaluating LLMs in the health and life (...truncated)