Large Language Models in Physics: Analysis of Accuracy and Teacher Perception

https://indecs.eu/2025/indecs2025-pp668-684.pdf

Interdisciplinary Description of Complex Systems 23(6), 668-684, 2025

LARGE LANGUAGE MODELS IN PHYSICS: ANALYSIS OF ACCURACY AND TEACHER PERCEPTION

Boško Lišnić and Marija Gaurina*
University of Split, Faculty of Science
Split, Croatia

DOI: 10.7906/indecs.23.6.5
Regular article
Received: 29 July 2025. Accepted: 12 November 2025.

*Corresponding author

ABSTRACT

This article explores the educational potential of large language models (LLMs) in physics teaching for secondary schools. The aim is to examine the accuracy, consistency and execution time of LLMs in solving tasks from national physics exams for secondary schools in the Republic of Croatia. In addition, we collect teachers’ perceptions of the use of LLMs in teaching and as a tool in preparation for the state matura. The quantitative part analysed the responses of 15 models from four different platforms (OpenAI, Perplexity, Microsoft Copilot, DeepSeek) to tasks from national physics exams. In the qualitative part, a focus group was conducted with physics teachers. The results show a high level of accuracy for individual models but reveal problems in solving tasks that include a graphical display. Teachers recognised the potential of LLMs as auxiliary tools but emphasised the necessity of students’ prior knowledge in physics and teachers’ need for critical literacy and control. The research provides insight into the possibilities and limitations that LLMs bring to STEM education and opens space for further research into the application of artificial intelligence tools in education.

KEY WORDS
LLM, education, physics, STEM, AI literacy, evaluation

CLASSIFICATION
APA: 3580, 3600
JEL: I21, I25

INTRODUCTION

The emergence of large language models (LLMs) has brought new opportunities to the field of education [1].
LLMs have the potential to provide a wide range of benefits and opportunities for students at all stages of education [2]. Today, students actively use various LLM tools, such as ChatGPT, in class or at home. According to research from 2023, 89% of surveyed American students were already using ChatGPT to do homework [3], and the share is likely even higher today.

Artificial intelligence tools provide numerous advantages. Some findings highlight the positive impact of AI on advancing conceptual understanding, providing personalised learning, and facilitating social interaction and assessment methods [4]. Special emphasis is placed on learning supports that are based on the personal needs of students [5]. More efficient learning and increased autonomous learning have also been recorded [6]. LLMs offer excellent interdisciplinary opportunities, allowing students to engage in integrated learning and develop interdisciplinary thinking skills [5]. With the aim of improving learning methods, the use of LLMs is on the rise as a tool that reaches student achievement levels in subjects such as mathematics, computer science, and physics [1]. It is therefore not surprising that such tools are increasingly used to solve factual as well as mathematical-logical tasks.

Although students appreciate the explanations provided, they may also lose confidence in the tools due to inaccuracies [7]. These tools also bring challenges when used in teaching. Challenges have been identified in technical infrastructure, training data, and data privacy [4]. The tools are limited to the data they were trained on and can give fictional answers in a convincing tone [8].

Therefore, we aim to examine how useful LLMs can be in preparing students for the high school physics exam. First, we want to determine the accuracy of the solutions obtained from such models.
There is also a need to check how consistent they are in their answers, whether they differ in execution time, and what teachers’ attitudes towards their use are. The research question is: How accurate, consistent, and time-efficient are LLMs in solving tasks for the high school physics exam, and how are they used, perceived and interpreted by physics teachers?

RELATED WORKS

LLMs show safety deficiencies, especially in situations involving unclear, ambiguous and ethical tasks, according to the SafetyBench study [9]. Although their performance is formally good, they tend to hallucinate facts, which in an educational context can seriously undermine students’ trust in the systems [9]. Maitland et al. [10] highlighted a high proportion of factual and conceptual errors that are unacceptable in the context of the reliability of clinical decision-making systems. A parallel can be drawn between the clinical and the educational context: models make false claims in a convincing tone, which can cause misconceptions in students. In physics, such mistakes reinforce misconceptions students already have, which [11] also highlights as a critical risk: “LLMs often reinforce misconceptions because they respond with great certainty and authority”.

Rong et al. [12] introduce the concept of exclusionary reasoning, which refers to the model’s ability to know when not to intervene, depending on the situation. This is especially important in education, where unwanted corrections of student answers can reduce self-confidence. Wu et al. [13] point out that multistep explanations (e.g. chain-of-thought prompting) can reduce accuracy because model hallucinations can lead to incorrect answers. Sonkar et al. [14] define the Student Data Paradox, a concept related to training models on student-tutor dialogue, which results in better imitation of misconceptions but weaker reasoning.
Thus, they offer a solution in the form of hallucination tokens, which can also be applied in physics because they let a model distinguish between situations, i.e. when to “act” and when to give the correct answer. Although LLMs perform well in factual responses, they can show deficiencies in tasks that involve visual elements or symbol understanding, or that require deep reasoning [15, 16].

Careful pedagogical integration is needed, given that uncritical and unmoderated use of LLMs in education can lead to over-reliance on models and reduce student engagement. On the other hand, some research highlights the potential of LLMs in supporting reflective learning if they are used appropriately, through a structured framework and teacher guidance [11, 17].

Regarding evaluation, Chang et al. [15] propose a multidimensional framework for evaluating LLMs, addressing the questions of what to evaluate (type of task), where (benchmark tasks), and how (methods and metrics). This and similar frameworks allow us to assess the appropriateness of LLMs in fields such as education [15].

Given that AI tools surround students, the importance of AI literacy is emphasised. This concept primarily refers to a cr (...truncated)