Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher

BMC Medical Education, Mar 2025

Background: The use of generative AI in medical education is increasing at an unprecedented rate, and the accuracy of these models needs to be assessed to ensure patient safety. This study assesses the accuracy of ChatGPT, Gemini, and Copilot in answering multiple-choice questions (MCQs) compared to a qualified medical teacher.

Methods: Forty MCQs were randomly selected from past editions of the United States Medical Licensing Examination (USMLE) and posed to three LLMs: ChatGPT, Gemini, and Copilot. Each LLM’s answers were then compared with those of a qualified medical teacher and with the responses of the other LLMs. Fleiss’ Kappa was used to determine the concordance among the four responders (3 LLMs + 1 medical teacher). Where agreement among responders was poor, Cohen’s Kappa was computed to assess pairwise agreement between responders.

Results: ChatGPT demonstrated the highest accuracy (70%, Cohen’s Kappa = 0.84), followed by Copilot (60%, Cohen’s Kappa = 0.69), while Gemini showed the lowest accuracy (50%, Cohen’s Kappa = 0.53). The Fleiss’ Kappa value of -0.056 indicated significant disagreement among all four responders.

Conclusion: The study provides an approach for assessing the accuracy of different LLMs. It concludes that ChatGPT (70%) was far superior to the other LLMs when asked medical questions across different specialties, while, contrary to expectations, Gemini (50%) performed poorly. Compared with the medical teacher, the low accuracy of the LLMs suggests that general-purpose LLMs should be used with caution in medical education.

Practice points
• This study evaluates Large Language Models (LLMs) compared to experienced medical teachers.
• The analysis examines the performance of three prominent LLMs: ChatGPT, Gemini, and Copilot.
• The study employs Fleiss’ Kappa to statistically analyze the concordance between LLM and human responses.
• Where responses were discordant, Cohen’s Kappa was used to measure agreement between the three Gen AI tools and a medical teacher.
• Results reveal a significant difference in performance between LLMs and medical teachers, highlighting potential limitations in using AI alone for medical education.
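The two agreement statistics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the standard Cohen's and Fleiss' kappa formulas, not the authors' analysis code; the function names `cohen_kappa` and `fleiss_kappa` and the answer lists in the demo are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items where the two labels match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    labels = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in labels) / n**2
    if p_e == 1.0:
        return float("nan")  # undefined: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels all raters gave item i."""
    if not ratings:
        raise ValueError("at least one item is required")
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return float("nan")  # undefined: a single label was used throughout
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical answers to six MCQs (options A-D); NOT data from the study.
    teacher = ["A", "C", "B", "D", "A", "B"]
    llm_1 = ["A", "C", "D", "D", "B", "B"]
    llm_2 = ["A", "B", "B", "D", "C", "A"]
    llm_3 = ["D", "C", "A", "B", "A", "C"]

    print("Cohen's kappa, teacher vs llm_1:", cohen_kappa(teacher, llm_1))
    per_question = list(zip(teacher, llm_1, llm_2, llm_3))
    print("Fleiss' kappa, all four responders:", fleiss_kappa(per_question))
```

Both statistics equal 1 for perfect agreement and 0 for agreement no better than chance. A negative value, such as the Fleiss’ Kappa of -0.056 reported above, means the responders agreed slightly less often than chance alone would predict.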


Mishra et al. BMC Medical Education (2025) 25:443
https://doi.org/10.1186/s12909-025-07009-w

RESEARCH, Open Access

Vinaytosh Mishra*, Yotam Lurie and Shlomo Mark
*Correspondence: Vinaytosh Mishra

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Keywords: Generative AI, LLM, Machine learning, Medical education

Introduction

The use of computers in medical education has significantly increased in the last few decades. Integration of technologies such as virtual reality (VR), augmented reality (AR), and computer-assisted learning (CAL) has been instrumental in imparting various aspects of training [1].
Integrating these technologies has transformed traditional pedagogical methods, enhancing the learning experience for medical students and professionals alike. This amalgamation facilitates the acquisition of complex skills and addresses the limitations of conventional training methods [2]. E-learning tools, including three-dimensional resources, offer advantages over traditional training by enabling greater accessibility and flexibility [3]. By using virtual space, students can more readily comprehend complex anatomical structures that are typically difficult to understand through textbooks or static representations. Computer-assisted learning programs have been successfully used in different medical fields, including dentistry, alongside surgical training. Karemore et al. stated that CAL could replicate genuine patient encounters, allowing dental students to gain useful experience without any potential risks [4]. Their work also highlighted the significance of incorporating digital technology into dentistry education. It was observed that students prefer computer-assisted learning tools if they offer concrete educational advantages [5]. Recent advancements in generative artificial intelligence (AI) have opened new opportunities in medical education.

Generative AI in medical education

Generative artificial intelligence refers to the ability of artificial intelligence systems to produce text, images, videos, or other types of data using generative models, typically in response to specific prompts or inputs. Generative AI models learn the patterns and structure of their training data, enabling them to produce novel data with comparable attributes [6]. Generative AI has been significantly influenced by methods such as Generative Adversarial Networks (GANs) and large language models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
These techniques have facilitated the practical application of generative AI in various fields, including medical education [7]. OpenAI developed GPT, while BERT was developed by Google AI. Both fall under a wider umbrella of natural language processing (NLP) models known as Large Language Models (LLMs). LLMs are machine learning models that can understand and generate human-language text [8]. BERT aims mainly to comprehend text by considering the context both before and after the target word. It is extensively used for tasks requiring deep text comprehension, such as classification and question answering. GPT is specifically engineered to generate text by predicting the next word in a sequence. It is mostly used for text generation tasks, such as creating dialogues or summarizing content. Thus, GPT is expected to outperform other models in generating coherent and creative text, making it a powerful tool for content creation. On the other hand, BERT, with its deep understanding of context, excels in tasks requiring nuanced comprehension and accurate informati (...truncated)


This is a preview of a remote PDF: https://bmcmededuc.biomedcentral.com/counter/pdf/10.1186/s12909-025-07009-w
Article home page: https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-025-07009-w

Mishra, Vinaytosh, Lurie, Yotam, Mark, Shlomo. Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher, BMC Medical Education, 2025, pp. 1-8, Volume 25, Issue 1, DOI: 10.1186/s12909-025-07009-w