A collaborative large language model for drug analysis
nature biomedical engineering
Article
https://doi.org/10.1038/s41551-025-01471-z
Received: 4 October 2023
Accepted: 25 June 2025
Published online: xx xx xxxx
Hongjian Zhou1,9, Fenglin Liu1,9, Jinge Wu2, Wenjun Zhang1, Guowei Huang1,
Lei Clifton3, David Eyre4, Haochen Luo1, Fengyuan Liu1, Kim Branson5,
Patrick Schwab5, Xian Wu6, Yefeng Zheng6,7, Anshul Thakur1 &
David A. Clifton1,8
Large language models (LLMs), such as ChatGPT, have substantially
helped in understanding human inquiries and generating textual
content with human-level fluency. However, directly using LLMs in
healthcare applications faces several problems. LLMs are prone to
produce hallucinations, or fluent content that appears reasonable and
genuine but that is factually incorrect. Ideally, the source of the generated
content should be easily traced for clinicians to evaluate. We propose a
knowledge-grounded collaborative large language model, DrugGPT, to
make accurate, evidence-based and faithful recommendations that can be
used for clinical decisions. DrugGPT incorporates diverse clinical-standard
knowledge bases and introduces a collaborative mechanism that adaptively
analyses inquiries, captures relevant knowledge sources and aligns these
inquiries and knowledge sources when dealing with different drugs.
We evaluate the proposed DrugGPT on drug recommendation, dosage
recommendation, identification of adverse reactions, identification of
potential drug–drug interactions and answering general pharmacology
questions. DrugGPT outperforms a wide range of existing LLMs and
achieves state-of-the-art performance across all metrics with fewer
parameters than generic LLMs.
Large language models (LLMs), such as ChatGPT (GPT-3.5), GPT-4 (refs. 1–3) and
PaLM (ref. 4), have attracted extensive attention owing to their impressive
capabilities in understanding and generating human language. The
adoption of LLMs to assist physicians in clinical decision-making has
received growing research interest in the communities of both artificial
intelligence and clinical medicine5–13.
However, applying LLMs in healthcare remains a challenging
undertaking owing to the lack of faithfulness and evidence in the
generated content9–12. The first challenge, faithfulness, means that
LLMs tend to produce hallucinations12,14, which refer to generated
content that looks reasonable but is not based on factual information
and knowledge. Given an inquiry that asks for “the medication that is
most likely to improve joint pain symptoms (not the first-line treatment for the disease)”, both ChatGPT (that is, GPT-3.5) and GPT-4 not
only give a wrong choice ‘C’ but also give wrong explanations without
any grounding in accurate knowledge (as shown by the red-coloured
text)15–17 (Fig. 1). In detail, these hallucinations, for example, “all of the
options provided are used in rheumatoid arthritis treatment”, include
factually incorrect information that could be explicitly harmful in
healthcare decision-making scenarios. The second challenge is the
1Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK. 2Institute of Health Informatics, University
College London, London, UK. 3Applied Digital Health, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK. 4Big Data
Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK. 5GlaxoSmithKline, London, UK. 6Tencent Jarvis Lab, Beijing, China.
7Medical Artificial Intelligence Laboratory, Westlake University, Hangzhou, China. 8Oxford Suzhou Centre for Advanced Research, Suzhou, China.
9These authors contributed equally: Hongjian Zhou, Fenglin Liu.
User inquiry (question): A 29-year-old woman develops painful swelling of both hands. She is also very stiff in the morning. Physical examination reveals involvement of the proximal
interphalangeal joints and metacarpophalangeal (MCP) joints. Her RF is positive and ANA is negative. Which of the following medications is most likely to improve her joint pain
symptoms? A: D-penicillamine; B: an anti-malarial; C: methotrexate; D: NSAID or aspirin
Correct answer: D
ChatGPT
Answer: C
Analysis:
1. Although all of the options provided are used in rheumatoid arthritis treatment [Error1], Methotrexate is often the first line treatment for this condition
2. Methotrexate helps to reduce the pain, swelling, and stiffness of joints [Error2].
GPT-4
Answer: C
Key analysis:
1. Methotrexate helps to reduce joint inflammation, prevent joint damage, and improve symptoms such as pain and stiffness of joints [Error2].
2. Methotrexate is the most likely medication to improve the joint pain symptoms [Error3] in this patient with rheumatoid arthritis.
[Error1]: This is false, as D-penicillamine is not a common treatment for rheumatoid arthritis (it is used only when other treatments fail). It is not approved for juvenile rheumatoid arthritis.
[Error2]: Methotrexate is primarily used as a disease-modifying drug to slow the progression of rheumatoid arthritis and prevent joint damage rather than providing immediate pain relief.
[Error3]: Methotrexate does not have an immediate impact on pain relief. Many patients will start to experience an improvement in symptoms after six to eight weeks on methotrexate. It is
not the most likely medication to improve joint pain symptoms compared to NSAIDs.
DrugGPT
Answer: D
Key analysis: These symptoms are consistent with rheumatoid arthritis (RA). The RF is positive, which further supports the diagnosis of RA.
1. Methotrexate and anti-malarial are treatments for rheumatoid arthritis, but it is used to slow the progression of the disease, not immediate pain relief.
2. D-penicillamine is used as a second-line treatment for rheumatoid arthritis.
3. NSAIDs are commonly used to provide immediate relief from pain, inflammation, and fever, making them suitable for relieving joint pain symptoms in rheumatoid arthritis.
Fig. 1 | Comparative medical reasoning analysis of LLMs in rheumatoid arthritis treatment. An example drug recommendation provided by DrugGPT and the state-of-the-art LLMs ChatGPT (GPT-3.5; ref. 1) and GPT-4 (ref. 2). The red- and blue-coloured texts denote emphasized wrong and accurate explanations, respectively.
traceability of evidence. In healthcare, it is crucial for LLMs to show the
source of generated content (evidence), which can be used to explain
why the LLMs give such recommendations. However, existing LLMs
often lack the capability to provide clear evidence. Both ChatGPT and
GPT-4 give inappropriate evidence (Fig. 1). Such challenges hinder
the adoption of LLMs in healthcare settings where transparency, reliability and trustworthiness are critical in achieving trust from clinical
users10,11, and in regulatory approvals that are essential for deployment
in clinical practice.
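To make the traceability requirement concrete, the following minimal Python sketch pairs each statement of an answer with the source it relies on and flags statements that cite none. It is an illustration only and not the DrugGPT implementation: the names Statement, TracedAnswer, untraced and is_fully_traced are hypothetical, and the source labels are placeholders rather than real citations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Statement:
    """One sentence of a model answer, optionally tied to a source."""
    text: str
    source: Optional[str] = None  # hypothetical label for a knowledge-base entry


@dataclass
class TracedAnswer:
    """A multiple-choice answer together with its supporting statements."""
    choice: str
    statements: List[Statement] = field(default_factory=list)

    def untraced(self) -> List[Statement]:
        """Return the statements that cite no source."""
        return [s for s in self.statements if not s.source]

    def is_fully_traced(self) -> bool:
        """True only if there is at least one statement and each has a source."""
        return bool(self.statements) and not self.untraced()


# Illustrative data only: the source label below is a placeholder.
answer = TracedAnswer(
    choice="D",
    statements=[
        Statement(
            "NSAIDs are commonly used to provide immediate relief from pain.",
            source="placeholder: drug monograph",
        ),
        Statement("D-penicillamine is used as a second-line treatment."),
    ],
)

# List every statement a clinician could not trace back to evidence.
for s in answer.untraced():
    print("No source given for:", s.text)
```

Under this assumed structure, a clinician reviewing the answer can see at a glance which statements are backed by a named source and which would need to be checked independently.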
In thi (...truncated)