A dataset for evaluating clinical research claims in large language models
www.nature.com/scientificdata
OPEN
Data Descriptor
Boya Zhang1 ✉, Alban Bornet1, Anthony Yazdani1, Philipp Khlebnikov2,
Marija Milutinovic1, Hossein Rouhizadeh1, Poorya Amini2 & Douglas Teodoro1 ✉
Large language models (LLMs) have the potential to enhance the verification of health claims.
However, issues with hallucination and comprehension of logical statements require these models
to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset
created from hypothesis testing results in clinical research, covering 992 unique interventions for 22
disease categories. The dataset used study arms and interventions, primary outcome measures, and
results from clinical trials to derive and label clinical research claims. These claims were then linked to
supporting information describing clinical trial results in scientific publications. CliniFact contains 1,970
instances from 992 unique clinical trials related to 1,540 unique publications. When evaluating LLMs
against CliniFact, discriminative models, such as BioBERT with an accuracy of 80.2%, outperformed
generative counterparts, such as Llama3-70B, which reached 53.6% accuracy (p-value < 0.001). Our
results demonstrate the potential of CliniFact as a benchmark for evaluating LLM performance in clinical
research claim verification.
Background & Summary
Large language models (LLMs) have demonstrated remarkable success in several natural language processing
tasks in the health and life sciences domain1. Due to parameter scaling, access to specialized corpora, and better
human alignment techniques, performance has significantly improved in recent years2. Yet, they still struggle
with factual accuracy in various domains3. LLMs may produce factual errors that contradict established knowledge available at the time4. These inaccuracies and errors are particularly concerning in critical fields like healthcare, where incorrect information can have severe consequences5.
To mitigate issues with factual accuracy and vulnerability to hallucinations, the incorporation of
domain-specific knowledge when evaluating LLMs has been proposed6. This stems from the fact that factual
accuracy7 and vulnerability to hallucinations8 in LLMs can vary significantly across domains9. Models fine-tuned
for general purposes tend to perform better in the general domain6, while models fine-tuned for specific domains,
such as medicine (e.g., Meditron10, Med-PaLM11), often outperform general-purpose models in those areas.
Another critical challenge for LLMs is their ability to perform logical reasoning12. This is particularly important in clinical research, where scientific claims are posed as logical statements, such as ‘the intervention X is
more effective than placebo for a specific outcome'13, that are either true or false. Evaluating these claims requires
a strong understanding of hypothesis testing and causal inference14. However, because LLMs are
trained to predict tokens within a context15, they struggle with complex logical statements16 and can even produce unfaithful reasoning17.
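To make the link between a claim and a hypothesis test concrete, the following is a minimal sketch, not the labeling procedure used for CliniFact. It assumes a binary outcome, a one-sided two-proportion z-test, and a significance level of 0.05; the function names and the trial counts are hypothetical.

```python
import math


def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """One-sided p-value for H1: response rate in arm A > response rate in arm B.

    Uses the pooled two-proportion z-test (normal approximation).
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("test undefined when the pooled rate is 0 or 1")
    z = (p_a - p_b) / math.sqrt(variance)
    # Survival function of the standard normal distribution: P(Z > z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def claim_holds(p_value: float, alpha: float = 0.05) -> bool:
    """Treat the claim 'A is more effective than B' as true if p < alpha (assumed rule)."""
    return p_value < alpha


# Hypothetical trial: 60/100 responders on intervention X, 40/100 on placebo.
p = two_proportion_p_value(60, 100, 40, 100)
print(claim_holds(p))
```

Under these assumptions, the truth value of the claim reduces to a comparison between a p-value and a threshold, which is the kind of logical statement a model must interpret correctly.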
Research has shown that LLMs can be easily misled by irrelevant information18. Chain-of-thought (CoT)
prompting can improve multi-step reasoning by providing intermediate rationales19. However, concerns remain regarding
the faithfulness and reliability of these explanations, as they can often be biased or misleading20. Furthermore,
while methodologies such as self-correction can improve reasoning accuracy, current models still struggle to
correct their errors autonomously without external feedback21. In some cases, their performance degrades after
self-correction21. Integrating LLMs with symbolic solvers for logical reasoning22 and hypothesis-testing prompting for improved deductive reasoning23 have been proposed to address these limitations.
1Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland.
2Risklick AG, Bern, Switzerland. ✉e-mail: ;
Scientific Data | (2025) 12:86 | https://doi.org/10.1038/s41597-025-04417-x
Claim verification datasets play a crucial role in assessing the factual accuracy of LLMs across various
domains24. FEVER25, a general-domain dataset, was created by rewriting Wikipedia sentences into atomic claims,
which are then verified using Wikipedia’s textual knowledge base. FEVER also introduces a three-step fact verification process: document retrieval, evidence selection, and stance detection. In the political domain, the UKP
Snopes corpus26, derived from the Snopes fact-checking website, includes 6,422 validated claims paired with
evidence text snippets. For the scientific domain, SciFact27 includes 1.4K expert-written biomedical scientific
claims paired with evidence-containing abstracts annotated with labels and rationales, while Climate-FEVER28
contains 1,535 claims sourced from web searches, with corresponding evidence from Wikipedia.
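The three-step process described above (document retrieval, evidence selection, stance detection) can be sketched as follows. This is a toy illustration under stated assumptions, not the FEVER baseline: token overlap stands in for a real retriever, a negation-cue heuristic stands in for a trained stance classifier, and all function names, the threshold, and the example corpus are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

NEGATION_CUES = {"not", "no", "never"}  # assumed, deliberately tiny cue list


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim: str, text: str) -> float:
    """Fraction of claim tokens that also occur in the text."""
    claim_tokens = tokens(claim)
    return len(claim_tokens & tokens(text)) / len(claim_tokens) if claim_tokens else 0.0


def retrieve_documents(claim: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Step 1: rank documents by token overlap with the claim."""
    ranked = sorted(corpus, key=lambda doc_id: overlap(claim, corpus[doc_id]), reverse=True)
    return ranked[:k]


def select_evidence(claim: str, document: str) -> str:
    """Step 2: pick the sentence that overlaps most with the claim."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return max(sentences, key=lambda s: overlap(claim, s))


def detect_stance(claim: str, evidence: str, threshold: float = 0.5) -> str:
    """Step 3: heuristic stance label based on overlap and negation mismatch."""
    if overlap(claim, evidence) < threshold:
        return "NOT ENOUGH INFO"
    claim_negated = bool(tokens(claim) & NEGATION_CUES)
    evidence_negated = bool(tokens(evidence) & NEGATION_CUES)
    return "SUPPORTS" if claim_negated == evidence_negated else "REFUTES"


def verify(claim: str, corpus: Dict[str, str]) -> Tuple[str, str, str]:
    """Run the three steps and return (document id, evidence sentence, label)."""
    doc_id = retrieve_documents(claim, corpus, k=1)[0]
    evidence = select_evidence(claim, corpus[doc_id])
    return doc_id, evidence, detect_stance(claim, evidence)


corpus = {
    "d1": "Aspirin is a common drug. Aspirin does not cure diabetes.",
    "d2": "Paris is the capital of France. It lies on the Seine.",
}
print(verify("Paris is the capital of France", corpus))
```

In practice each step would be replaced by a learned component; the sketch only shows how the output of one step feeds the next.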
Specific to the health and life sciences domain, PUBHEALTH29 gathers public health claims from
fact-checking websites and verifies them against news articles. ManConCorpus30 contains claims and sentences from 259 abstracts linked to 24 systematic reviews on cardiovascular disease. The COVID-19 pandemic
and its infodemic effect7,31 have further motivated the development of specialized datasets. HealthVer32 is a
medical-domain dataset derived by rewriting responses to questions from TREC-COVID33, verified against
the CORD19 corpus34. Similarly, COVID-Fact35 targets COVID-19 claims by scraping content from Reddit and
verifying them against scientific papers and documents retrieved via Google search. CoVERt36 enhances claim
verification in the clinical domain by providing a new COVID verification dataset containing 15 PICO-encoded
drug claims and 96 abstracts, each accompanied by one evidence sentence as rationale. These datasets
either focus on lay claims29,32,35, which require simpler reasoning skills, or, when they focus on complex clinical
research claims, are disease-specific (e.g., COVID-1936 or cardiovascular disease30) and of reduced scale (O(10^1)
claims)30,36. Thus, they are of limited use for evaluating the factuality of complex clinical research claims by LLMs.
To reduce this gap, we propose CliniFact37, a large-scale claim dataset to evaluate the generalizability of LLMs
in comprehending factuality and logical statements in clinical research. CliniFact37 claims were automatically
extracted from clinical trial protocols and results available from ClinicalTrials.gov. The claims were linked to
supporting information in scientific publications available in Medline, with evidence provided at the abstract
level. The resulting dataset contains O(10^3) claims spanning 20 disease classes. This new benchmark offers
a novel approach to evaluating LLMs in the health and life (...truncated)