Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher
Mishra et al. BMC Medical Education
(2025) 25:443
https://doi.org/10.1186/s12909-025-07009-w
BMC Medical Education
Open Access
RESEARCH
Vinaytosh Mishra1,2*, Yotam Lurie3 and Shlomo Mark4
Abstract
Background There is an unprecedented increase in the use of Generative AI in medical education. There is a need
to assess these models’ accuracy to ensure patient safety. This study assesses the accuracy of ChatGPT, Gemini, and
Copilot in answering multiple-choice questions (MCQs) compared to a qualified medical teacher.
Methods This study randomly selected 40 MCQs from past United States Medical
Licensing Examination (USMLE) papers and posed them to three LLMs: ChatGPT, Gemini, and Copilot. Each LLM’s
responses were then compared with those of a qualified medical teacher and with the responses of the other LLMs. The Fleiss’
Kappa test was used to determine the concordance among the four responders (3 LLMs + 1 medical teacher). Where
agreement among the responders was poor, Cohen’s Kappa test was performed to assess pairwise agreement between
responders.
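The two agreement statistics named in the Methods can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names (`accuracy`, `cohen_kappa`, `fleiss_kappa`), the responder labels, and the six-question answer set are hypothetical and chosen only to show the calculation. It assumes every responder answers every question with a single option label.

```python
from collections import Counter


def accuracy(responses, reference):
    """Fraction of items on which a responder matches the reference answers."""
    return sum(r == t for r, t in zip(responses, reference)) / len(reference)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the label each rater gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]
    # Per-item agreement among all pairs of raters.
    p_i = [
        (sum(cnt[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall share of each category.
    p_j = [sum(cnt[c] for cnt in counts) / (n_items * n_raters) for c in categories]
    p_e = sum(p ** 2 for p in p_j)
    return (p_bar - p_e) / (1 - p_e)  # undefined when p_e == 1


# Hypothetical answers to six questions (NOT the study's data).
answers = {
    "teacher": ["A", "C", "B", "D", "A", "B"],
    "llm_1": ["A", "C", "B", "D", "B", "B"],
    "llm_2": ["A", "B", "B", "D", "C", "A"],
    "llm_3": ["D", "C", "A", "D", "A", "C"],
}

# Fleiss' kappa takes one row per question, with all four responders' labels.
per_item = [list(row) for row in zip(*answers.values())]
print("Fleiss' kappa:", round(fleiss_kappa(per_item), 3))

# Pairwise comparison of each LLM against the teacher.
for name in ("llm_1", "llm_2", "llm_3"):
    acc = accuracy(answers[name], answers["teacher"])
    kappa = cohen_kappa(answers[name], answers["teacher"])
    print(name, "accuracy:", round(acc, 2), "Cohen's kappa:", round(kappa, 3))
```

Both statistics correct the raw agreement rate for the agreement expected by chance, so a value near zero or below means the responders agree no more often than chance would predict.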
Results ChatGPT demonstrated the highest accuracy (70%, Cohen’s Kappa = 0.84), followed by Copilot (60%, Cohen’s
Kappa = 0.69), while Gemini showed the lowest accuracy (50%, Cohen’s Kappa = 0.53). The Fleiss’ Kappa value of -0.056
indicated poor agreement, no better than chance, among all four responders.
Conclusion The study provides an approach for assessing the accuracy of different LLMs. It finds that
ChatGPT (70%) was the most accurate of the three LLMs when asked medical questions across different specialties, while, contrary
to expectations, Gemini (50%) performed poorly. The low accuracy of LLMs relative to a medical teacher
suggests that general-purpose LLMs should be used with caution in medical education.
Practice points
• This study evaluates Large Language Models (LLMs) compared to experienced medical teachers.
• The analysis examines the performance of three prominent LLMs—ChatGPT, Gemini, and Copilot.
• The study employs Fleiss’ Kappa Test to statistically analyze the concordance between LLMs and human
responses.
• Where responders disagreed, Cohen’s Kappa test was used to assess agreement between each of the three generative AI tools and the medical
teacher.
*Correspondence:
Vinaytosh Mishra
Full list of author information is available at the end of the article
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,
sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included
in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
• Results reveal a significant difference in the performance between LLMs and medical teachers, highlighting
potential limitations in using AI alone for medical education.
Keywords Generative AI, LLM, Machine learning, Medical education
Introduction
The use of computers in medical education has significantly increased in the last few decades. Integration of
technologies such as virtual reality (VR), augmented
reality (AR), and computer-assisted learning (CAL) has
been instrumental in imparting various aspects of training [1]. Integrating these technologies has transformed
traditional pedagogical methods, enhancing the learning
experience for medical students and professionals alike.
This amalgamation facilitates the acquisition of complex
skills and addresses the limitations of conventional training methods [2]. E-learning tools, including three-dimensional resources, offer advantages over traditional
training by enabling greater accessibility and flexibility
[3]. By utilizing virtual space, students may effectively
comprehend complex anatomical structures that are typically difficult to understand through textbooks or static
representations. Computer-assisted learning programs
have been successfully utilized in different medical fields,
including dentistry and surgical training. Karemore
et al. stated that CAL could replicate genuine patient
encounters, allowing dental students to gain useful experience without any potential risks [4].
Moreover, that study highlighted the significance of incorporating digital technology into dental education. Students were
observed to prefer computer-assisted learning tools when these offer concrete educational advantages
[5]. Recent advancements in generative artificial intelligence (AI) have opened new opportunities in medical
education.
Generative AI in medical education
Generative artificial intelligence refers to the ability of
artificial intelligence systems to produce text, photos,
videos, or other types of data using generative models.
This is typically done in response to specific prompts
or inputs. Generative AI models acquire knowledge
of the patterns and organization of their input training data, enabling them to produce novel data with
comparable attributes [6]. Generative AI has been significantly influenced by methods such as Generative Adversarial Networks (GANs) and large language models
such as BERT (Bidirectional Encoder Representations from
Transformers) and GPT (Generative Pre-trained Transformer). These techniques have facilitated the practical
application of generative AI in various fields, including
medical education [7].
OpenAI developed GPT, while Google AI developed BERT. Both fall under a wider umbrella of
natural language processing (NLP) models known as
Large Language Models (LLMs). LLMs are machine
learning models that can understand and generate
human language text [8].
BERT mainly aims to comprehend text by considering the
context before and after the target word. It is extensively
used for tasks requiring deep text comprehension, such as classification and question answering. GPT
is specifically designed to generate text by
predicting the next word in a sequence. It is
mostly used for text generation tasks, such as creating dialogues or summarizing content. Thus, GPT is expected to
outperform other models in generating coherent and creative text,
making it a powerful tool for content creation.
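The difference in context described above can be shown with a toy sketch. It deliberately simplifies: real models operate on subword tokens and attention masks, and the helper `visible_context` and the example sentence are hypothetical, introduced only to show which words each style of model may condition on when predicting a target word.

```python
def visible_context(tokens, target, bidirectional):
    """Return the tokens a model may condition on when predicting tokens[target].

    bidirectional=True  mimics BERT-style masked prediction (both sides visible).
    bidirectional=False mimics GPT-style next-word prediction (left side only).
    """
    left = tokens[:target]
    right = tokens[target + 1:] if bidirectional else []
    return left + right


# Hypothetical example sentence; the target word is "has" (index 2).
sentence = ["the", "patient", "has", "a", "fever"]
print("BERT-style context:", visible_context(sentence, 2, bidirectional=True))
print("GPT-style context: ", visible_context(sentence, 2, bidirectional=False))
```

The left-to-right restriction is what lets a GPT-style model generate text one word at a time, whereas access to both sides suits comprehension tasks in which the whole input is available up front.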
On the other hand, BERT, with its deep understanding of context, excels in tasks requiring nuanced comprehension and accurate informati (...truncated)