Large Language Models in Physics: Analysis of Accuracy and Teacher Perception
Interdisciplinary Description of Complex Systems 23(6), 668-684, 2025
LARGE LANGUAGE MODELS IN PHYSICS:
ANALYSIS OF ACCURACY
AND TEACHER PERCEPTION
Boško Lišnić and Marija Gaurina*
University of Split, Faculty of Science
Split, Croatia
DOI: 10.7906/indecs.23.6.5
Regular article
Received: 29 July 2025.
Accepted: 12 November 2025.
ABSTRACT
This article explores the educational potential of large language models (LLMs) in physics teaching for
secondary schools. The aim is to examine the accuracy, consistency and execution time of LLMs in
solving tasks from national physics exams for secondary schools in the Republic of Croatia. In addition,
we collect teachers’ perceptions of the use of LLMs in teaching and as a tool in preparation for the state
matura. The quantitative part analysed the responses of 15 models from four
platforms (OpenAI, Perplexity, Microsoft Copilot, DeepSeek) to tasks from national physics
exams. In the qualitative part, a focus group was conducted with physics teachers. The results show a
high level of accuracy for individual models but reveal problems in solving tasks that include a graphical
display. Teachers recognised the potential of LLMs as auxiliary tools but emphasised the
necessity of students’ prior knowledge in physics and teachers’ need for critical literacy and control.
The research provides insight into the possibilities and limitations of LLMs in STEM
education and opens space for further research into the application of artificial intelligence tools in
education.
KEY WORDS
LLM, education, physics, STEM, AI literacy, evaluation
CLASSIFICATION
APA: 3580, 3600
JEL: I21, I25
*Corresponding author: Marija Gaurina
INTRODUCTION
The emergence of large language models (LLMs) has brought new opportunities to the field of
education [1]. LLMs have the potential to provide a wide range of benefits and opportunities
for students at all stages of education [2]. Today, students actively use various LLM tools, such
as ChatGPT, in class or at home. According to a 2023 survey, 89% of the American students
surveyed already used ChatGPT to do homework [3], and the share is likely even higher
today. Artificial intelligence tools provide numerous advantages.
Some findings highlight the positive impact of AI on advancing conceptual
understanding, providing personalised learning, and facilitating social interaction and assessment
methods [4]. Special emphasis is placed on learning supports that are based on the
personal needs of students [5]. More efficient learning and increased
autonomous learning have also been recorded [6]. LLMs offer interdisciplinary opportunities,
allowing students to engage in integrated learning and develop interdisciplinary thinking
skills [5]. With the aim of improving learning methods, LLMs are increasingly used as tools
that reach student achievement levels in subjects such as mathematics, computer science,
and physics [1]. It is therefore not surprising that such tools are increasingly used in solving
factual, as well as mathematical-logical, tasks. Although students appreciate the explanations
provided, they may also lose confidence in the tools due to inaccuracies [7]. The use of these tools
in teaching also brings challenges, which have been identified in the areas of
technical infrastructure, training data, and data privacy [4]. These tools are limited to the data
they are trained on and can give fictional answers in a convincing tone [8].
Therefore, we aim to examine how useful LLMs can be in preparing students for the high
school physics exam. First, we want to determine the accuracy of the solutions obtained from
such models. We also need to check how consistent the models are in their answers, whether
they differ in execution time, and what teachers’ attitudes towards their use are. The research
question is: How accurate, consistent, and time-efficient are LLMs in solving tasks for the high
school physics exam, and how are they used, perceived and interpreted by physics teachers?
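To make the three quantities in the research question concrete, the following is a minimal sketch of how accuracy, consistency and execution time could be computed for one model. It is an illustration under stated assumptions, not the procedure used in this study: the answer key, the repeated runs, the timings, and the function names `accuracy` and `consistency` are all hypothetical, and consistency is assumed here to mean the share of repeated runs that agree with the most common answer to each task.

```python
from collections import Counter
from statistics import mean


def accuracy(answers, key):
    """Share of tasks answered correctly in a single run."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def consistency(runs):
    """Mean, over tasks, of the share of runs that gave the most common answer.

    Assumed definition for illustration only; 1.0 means every run
    returned the same answer for every task.
    """
    per_task = []
    for task_answers in zip(*runs):  # regroup answers task by task
        top_count = Counter(task_answers).most_common(1)[0][1]
        per_task.append(top_count / len(task_answers))
    return mean(per_task)


# Hypothetical data: three repeated runs of one model on four
# multiple-choice tasks, with the time each run took in seconds.
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],
    ["A", "C", "D", "A"],
    ["A", "B", "B", "A"],
]
times_s = [12.0, 15.0, 9.0]

mean_accuracy = mean(accuracy(r, key) for r in runs)
model_consistency = consistency(runs)
mean_time_s = mean(times_s)

print(f"accuracy={mean_accuracy:.3f}, "
      f"consistency={model_consistency:.3f}, "
      f"time={mean_time_s:.1f} s")
```

In this made-up example the last task shows why the two measures are kept apart: all three runs agree on the answer, so the model is fully consistent on that task, yet the agreed answer is wrong.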
RELATED WORKS
According to the SafetyBench study [9], LLMs show safety deficiencies, especially in situations
involving unclear, ambiguous and ethical tasks. Although the performance is formally
good, there is a tendency to hallucinate facts, which in an educational context can seriously
undermine students’ trust in the systems [9]. Maitland et al. [10] highlighted a high proportion
of factual and conceptual errors that are unacceptable given the reliability required of clinical
decision-making systems. A parallel can be drawn between the clinical context and the
educational context: models make false claims in a convincing tone, which can cause misconceptions
in students. In physics, such mistakes reinforce misconceptions students already have,
which [11] also highlights as a critical risk: “LLMs often reinforce misconceptions because they
respond with great certainty and authority”. Rong et al. [12] introduce the concept of
exclusionary reasoning, which refers to the model’s ability to know when not to intervene,
depending on the situation. This plays an especially important role in education, where unwanted
corrections of student answers can reduce students’ self-confidence. Wu et al. [13] point out that
multi-step explanations (e.g. chain-of-thought prompting) can reduce accuracy because hallucinations
in the intermediate steps can lead to incorrect answers. Sonkar et al. [14] define the Student Data Paradox, a
concept related to model training on student-tutor dialogue, which results in better imitation of
misconceptions but weaker reasoning. They offer a solution in the form of hallucination
tokens, which could also be used in physics because they allow a model to distinguish between
situations, i.e. when to “act” and when to give the correct answer. Although LLMs show good
performance in factual responses, they can also show deficiencies in responses that involve
visual elements, understanding symbols, or that require deep reasoning [15, 16].
Careful pedagogical integration is needed, given that uncritical and unmoderated use of LLMs
in education can lead to over-reliance on models and reduce student engagement. On the other
hand, some research highlights the potential of LLMs to support reflective learning if they are used
appropriately, within a structured framework and with teacher guidance [11, 17]. Regarding
evaluation, Chang et al. [15] propose a multidimensional framework for evaluating LLMs,
addressing the questions of what to evaluate (type of task), where (benchmark tasks), and how
(methods and metrics). This and similar frameworks allow us to assess the appropriateness of LLMs
in fields such as education [15].
Given that AI tools surround students, the importance of AI literacy is emphasised. This
concept primarily refers to a cr (...truncated)