A collaborative large language model for drug analysis

Nature Biomedical Engineering, Sep 2025

Large language models (LLMs), such as ChatGPT, have substantially helped in understanding human inquiries and generating textual content with human-level fluency. However, directly using LLMs in healthcare applications faces several problems. LLMs are prone to produce hallucinations, or fluent content that appears reasonable and genuine but that is factually incorrect. Ideally, the source of the generated content should be easily traced for clinicians to evaluate. We propose a knowledge-grounded collaborative large language model, DrugGPT, to make accurate, evidence-based and faithful recommendations that can be used for clinical decisions. DrugGPT incorporates diverse clinical-standard knowledge bases and introduces a collaborative mechanism that adaptively analyses inquiries, captures relevant knowledge sources and aligns these inquiries and knowledge sources when dealing with different drugs. We evaluate the proposed DrugGPT on drug recommendation, dosage recommendation, identification of adverse reactions, identification of potential drug–drug interactions and answering general pharmacology questions. DrugGPT outperforms a wide range of existing LLMs and achieves state-of-the-art performance across all metrics with fewer parameters than generic LLMs.
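
The abstract describes a three-part flow: analyse the inquiry, capture the relevant knowledge sources, and align the two so that the answer can be traced back to evidence. The sketch below illustrates that general idea only; it is not the authors' implementation. The function names, the `Evidence` type and the `toy-kb/...` source labels are hypothetical, the knowledge-base statements paraphrase the rheumatoid arthritis example in Fig. 1, and the call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    drug: str
    statement: str
    source: str  # hypothetical placeholder label, not a real knowledge base


# Toy knowledge base (assumption): statements paraphrase the Fig. 1 example.
KNOWLEDGE_BASE = [
    Evidence("methotrexate",
             "slows progression of rheumatoid arthritis; not immediate pain relief",
             "toy-kb/methotrexate"),
    Evidence("d-penicillamine",
             "second-line treatment for rheumatoid arthritis",
             "toy-kb/d-penicillamine"),
    Evidence("nsaid",
             "provides immediate relief from pain and inflammation",
             "toy-kb/nsaid"),
]


def analyse_inquiry(inquiry: str) -> list[str]:
    """Step 1: find which known drugs the inquiry mentions (naive substring match)."""
    text = inquiry.lower()
    return sorted({e.drug for e in KNOWLEDGE_BASE if e.drug in text})


def retrieve(drugs: list[str]) -> list[Evidence]:
    """Step 2: collect the knowledge-base entries for those drugs."""
    return [e for e in KNOWLEDGE_BASE if e.drug in drugs]


def build_grounded_prompt(inquiry: str, evidence: list[Evidence]) -> str:
    """Step 3: pair the inquiry with numbered, source-labelled evidence."""
    lines = ["Answer using only the numbered evidence and cite it.", ""]
    for i, e in enumerate(evidence, start=1):
        lines.append(f"[{i}] {e.drug}: {e.statement} (source: {e.source})")
    lines += ["", f"Question: {inquiry}"]
    return "\n".join(lines)


inquiry = "Which is more likely to relieve joint pain quickly: methotrexate or an NSAID?"
drugs = analyse_inquiry(inquiry)
evidence = retrieve(drugs)
prompt = build_grounded_prompt(inquiry, evidence)
print(prompt)
```

Numbering each evidence line and carrying its source label into the prompt is what would let a clinician check a cited claim against its origin, which is the traceability property the abstract asks for.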

Nature Biomedical Engineering | Article | https://doi.org/10.1038/s41551-025-01471-z

Received: 4 October 2023
Accepted: 25 June 2025
Published online: xx xx xxxx

Hongjian Zhou¹,⁹, Fenglin Liu¹,⁹, Jinge Wu², Wenjun Zhang¹, Guowei Huang¹, Lei Clifton³, David Eyre⁴, Haochen Luo¹, Fengyuan Liu¹, Kim Branson⁵, Patrick Schwab⁵, Xian Wu⁶, Yefeng Zheng⁶,⁷, Anshul Thakur¹ & David A. Clifton¹,⁸

¹Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK. ²Institute of Health Informatics, University College London, London, UK. ³Applied Digital Health, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK. ⁴Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK. ⁵GlaxoSmithKline, London, UK. ⁶Tencent Jarvis Lab, Beijing, China. ⁷Medical Artificial Intelligence Laboratory, Westlake University, Hangzhou, China. ⁸Oxford Suzhou Centre for Advanced Research, Suzhou, China. ⁹These authors contributed equally: Hongjian Zhou, Fenglin Liu.

Large language models (LLMs), such as ChatGPT (GPT-3.5), GPT-4¹⁻³ and PaLM⁴, have attracted extensive attention owing to their impressive capabilities in understanding and generating human language. The adoption of LLMs to assist physicians in clinical decision-making has received growing research interest in the communities of both artificial intelligence and clinical medicine⁵⁻¹³. However, applying LLMs in healthcare remains a challenging undertaking owing to the lack of faithfulness and evidence in the generated content⁹⁻¹².

The first challenge, faithfulness, means that LLMs tend to produce hallucinations¹²,¹⁴, which refer to generated content that looks reasonable but is not based on factual information and knowledge. Given an inquiry that asks for “the medication that is most likely to improve joint pain symptoms (not the first-line treatment for the disease)”, both ChatGPT (that is, GPT-3.5) and GPT-4 not only give a wrong choice ‘C’ but also give wrong explanations without any grounding in accurate knowledge (as shown by the red-coloured text)¹⁵⁻¹⁷ (Fig. 1). In detail, these hallucinations, for example, “all of the options provided are used in rheumatoid arthritis treatment”, include factually incorrect information that could be explicitly harmful in healthcare decision-making scenarios.

User inquiry (question): A 29-year-old woman develops painful swelling of both hands. She is also very stiff in the morning. Physical examination reveals involvement of the proximal interphalangeal joints and metacarpophalangeal (MCP) joints. Her RF is positive and ANA is negative. Which of the following medications is most likely to improve her joint pain symptoms?
A: D-penicillamine; B: an anti-malarial; C: methotrexate; D: NSAID or aspirin
Correct answer: D

ChatGPT
Answer: C
Analysis:
1. Although all of the options provided are used in rheumatoid arthritis treatment [Error1], Methotrexate is often the first line treatment for this condition.
2. Methotrexate helps to reduce the pain, swelling, and stiffness of joints [Error2].

GPT-4
Answer: C
Key analysis:
1. Methotrexate helps to reduce joint inflammation, prevent joint damage, and improve symptoms such as pain and stiffness of joints [Error2].
2. Methotrexate is the most likely medication to improve the joint pain symptoms [Error3] in this patient with rheumatoid arthritis.

[Error1]: This is false as D-penicillamine is not used as a common treatment for rheumatoid arthritis treatment (only when others fail). It is not approved for juvenile rheumatoid arthritis.
[Error2]: Methotrexate is primarily used as a disease-modifying drug to slow the progression of rheumatoid arthritis and prevent joint damage rather than providing immediate pain relief.
[Error3]: Methotrexate does not have an immediate impact on pain relief. Many patients will start to experience an improvement in symptoms after six to eight weeks on methotrexate. It is not the most likely medication to improve joint pain symptoms compared to NSAIDs.

DrugGPT
Answer: D
Key analysis: These symptoms are consistent with rheumatoid arthritis (RA). The RF is positive, which further supports the diagnosis of RA.
1. Methotrexate and anti-malarial are treatments for rheumatoid arthritis, but it is used to slow the progression of the disease, not immediate pain relief.
2. D-penicillamine is used as a second-line treatment for rheumatoid arthritis.
3. NSAIDs are commonly used to provide immediate relief from pain, inflammation, and fever, making them suitable for relieving joint pain symptoms in rheumatoid arthritis.

Fig. 1 | Comparative medical reasoning analysis of LLMs in rheumatoid arthritis treatment. An example drug recommendation provided by DrugGPT and the state-of-the-art LLMs ChatGPT¹ (GPT-3.5) and GPT-4². The red- and blue-coloured texts denote emphasized wrong and accurate explanations, respectively.

The second challenge is the traceability of evidence. In healthcare, it is crucial for LLMs to show the source of generated content (evidence), which can be used to explain why the LLMs give such recommendations. However, existing LLMs often lack the capability to provide clear evidence. Both ChatGPT and GPT-4 give inappropriate evidence (Fig. 1). Such challenges hinder the adoption of LLMs in healthcare settings where transparency, reliability and trustworthiness are critical in achieving trust from clinical users¹⁰,¹¹, and in regulatory approvals that are essential for deployment in clinical practice. In thi (...truncated)


This is a preview of a remote PDF: https://www.nature.com/articles/s41551-025-01471-z.pdf
Article home page: https://www.nature.com/articles/s41551-025-01471-z

Zhou, Hongjian, Liu, Fenglin, Wu, Jinge, Zhang, Wenjun, Huang, Guowei, Clifton, Lei, Eyre, David, Luo, Haochen, Liu, Fengyuan, Branson, Kim, Schwab, Patrick, Wu, Xian, Zheng, Yefeng, Thakur, Anshul, Clifton, David A. A collaborative large language model for drug analysis, Nature Biomedical Engineering, 2025, DOI: 10.1038/s41551-025-01471-z