A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists

Nature Chemistry, May 2025

Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.

Nature Chemistry | Article | https://doi.org/10.1038/s41557-025-01815-x | Volume 17, July 2025, 1027–1034
Received: 1 April 2024 | Accepted: 26 March 2025 | Published online: 20 May 2025
A list of authors and their affiliations appears at the end of the paper.

Large language models (LLMs) are machine learning (ML) models trained on massive amounts of text to complete sentences. Aggressive scaling of these models has led to a rapid increase in their capabilities [1,2], with the leading models now being able to pass the US Medical Licensing Examination [3] or other professional licensing exams. They have also been shown to design and autonomously perform chemical reactions when augmented with external tools such as web search and synthesis planners [4–7].

While some see ‘sparks of artificial general intelligence (AGI)’ in them [8], others see them as ‘stochastic parrots’, that is, systems that only regurgitate what they have been trained on [9] and that show inherent limitations owing to the way they are trained [10]. Nevertheless, the promise of these models is that they have shown the ability to solve a wide variety of tasks they have not been explicitly trained on [11–13].

Chemists and materials scientists have quickly caught on to the mounting attention given to LLMs, with some voices even suggesting that ‘the future of chemistry is language’ [14]. This statement is motivated by a growing number of reports that use LLMs to predict properties of molecules or materials [2,15–19], optimize reactions [20,21], generate materials [22–25], extract information [26–33] or even to prototype systems that can autonomously perform experiments in the physical world based on commands provided in natural language [5–7].

In addition, since a lot, if not most, of the information about chemistry is currently stored and communicated in text, there is a strong reason to believe that there is still a lot of untapped potential in LLMs for chemistry and materials science [34]. For instance, most insights in chemical research do not directly originate from data stored in databases but rather from the scientists interpreting the data. Many of these insights are in the form of text in scientific publications. Thus, operating on such texts might be our best way of unlocking these insights and learning from them. This might ultimately lead to general copilot systems for chemists that can provide answers to questions or even suggest new experiments on the basis of vastly more information than a human could ever read.

Fig. 1 | Overview of the ChemBench framework. The different components of the ChemBench framework. The framework’s foundation is the benchmark corpus comprising thousands of questions and answers that we manually or semi-automatically compiled from various sources in a format based on the one introduced in the BIG-bench benchmark (Extended Data Fig. 1). Questions are classified on the basis of topics, required skills (reasoning, calculation, knowledge and intuition) and difficulty levels. We then used this corpus to evaluate the performance of various models and tool-augmented systems using a custom framework. To provide a baseline, we built a web application that we used to survey experts in chemistry. The results of the evaluations are then compiled in publicly accessible leaderboards (Supplementary Note 15), which we propose as a foundation for evaluating future models.
[Figure 1 panel labels: Data preparation (>2,800 total questions); Knowledge, Reasoning, Intuition; 19 respondents; 251 diverse questions; Semantic annotation; curation; chembench.org; Corpus in BIG-bench format; Models; Humans; Closed-source models; Open-weight models; Diverse settings; Automatically updated; Peer-reviewed; Leaderboard; Topic leaders; Overall leaders. Example question shown in the figure: "What is the number of signals in the 1H NMR spectrum of a molecule with the SMILES [START_SMILES] OCC1C2CC1(O)C2=O[END_SMILES]?"]

However, the rapid increase in capabilities of chemical ML models led (even before the recent interest in LLMs) to concerns about the potential for the dual use of these technologies, for example, for the design of chemical weapons [35–40]. To some extent, this is not surprising as any technology that, for instance, is used to design non-toxic molecules can also be used inversely to predict toxic ones (even though the synthesis would still require access to controlled physical resources and facilities).

Still, it is essential to realize that the user base of LLMs is broader than that of chemistry and materials science experts who can critically reflect on every output these models produce. For example, many students frequently consult these tools, perhaps even to prepare chemical experiments [41]. This also applies to users from the general public, who might consider using LLMs to answer questions about the safety of chemicals. Thus, for some users, misleading information, especially about safety-related aspects, might lead to harmful outcomes.

However, even for experts, the chemical knowledge and reasoning capabilities of these models are essential, as they determine what the models can and cannot do in the experts’ work, for example, in copilot systems for chemists. Unfortunately, apart from exploratory reports, such as those that prompt leading models with various scientific questions [13], there is little systematic evidence on how LLMs perform compared with expert (human) chemists. Thus, to better understand what LLMs can do for the chemical sciences and where they might be improved with further developments, evaluation frameworks are needed to allow us to measur (...truncated)
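
The Fig. 1 caption describes questions that are tagged by topic, required skills and difficulty, and the example question in the figure wraps a SMILES string in [START_SMILES] ... [END_SMILES] markers. The Python sketch below illustrates that idea only; it is not the ChemBench code or its file format. The `Question` class, its field names, the `tag_smiles` and `fraction_correct` helpers, the tag values, the exact-match scoring rule and the two toy questions are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass

# Markers as they appear in the example question in Fig. 1.
START, END = "[START_SMILES]", "[END_SMILES]"


def tag_smiles(smiles: str) -> str:
    """Wrap a SMILES string in start/end markers (hypothetical helper)."""
    return f"{START}{smiles}{END}"


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one benchmark item; field names are assumptions."""
    prompt: str
    target: str
    topic: str
    skills: tuple[str, ...]
    difficulty: str


def fraction_correct(questions, answer_fn) -> float:
    """Share of questions whose answer matches the target exactly.

    Exact string matching is an assumption made for this sketch,
    not the scoring rule used in the paper.
    """
    if not questions:
        return 0.0
    hits = sum(answer_fn(q.prompt).strip() == q.target for q in questions)
    return hits / len(questions)


# Toy items written for this sketch; they are not taken from the corpus.
toy_questions = [
    Question(
        prompt=f"How many carbon atoms are in {tag_smiles('CCO')}?",
        target="2",
        topic="general chemistry",
        skills=("knowledge",),
        difficulty="basic",
    ),
    Question(
        prompt=f"How many hydrogen atoms are in {tag_smiles('c1ccccc1')}?",
        target="6",
        topic="organic chemistry",
        skills=("knowledge", "reasoning"),
        difficulty="basic",
    ),
]

if __name__ == "__main__":
    # A stand-in "model" that always answers "2" gets one of two items right.
    print(fraction_correct(toy_questions, lambda prompt: "2"))  # 0.5
```

Keeping the tags on each record would let the same scoring function be applied to any subset, for example all items with a given topic, which is the kind of per-topic comparison the leaderboard labels in Fig. 1 suggest.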


This is a preview of a remote PDF: https://www.nature.com/articles/s41557-025-01815-x.pdf
Article home page: https://www.nature.com/articles/s41557-025-01815-x

Mirza, Adrian; Alampara, Nawaf; Kunchapu, Sreekanth; Ríos-García, Martiño; Emoekabu, Benedict; Krishnan, Aswanth; Gupta, Tanya; Schilling-Wilhelmi, Mara; Okereke, Macjonathan; Aneesh, Anagha; Asgari, Mehrdad; Eberhardt, Juliane; Elahi, Amir Mohammad; Elbeheiry, Hani M.; Gil, María Victoria; Glaubitz, Christina; Greiner, Maximilian; Holick, Caroline T.; Hoffmann, Tim; Ibrahim, Abdelrahman; Klepsch, Lea C.; Köster, Yannik; Kreth, Fabian Alexander; Meyer, Jakob; Miret, Santiago; Peschel, Jan Matthias; Ringleb, Michael; Roesner, Nicole C.; Schreiber, Johanna; Schubert, Ulrich S.; Stafast, Leanne M.; Wonanke, A. D. Dinga; Pieler, Michael; Schwaller, Philippe; Jablonka, Kevin Maik. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nature Chemistry, 2025. DOI: 10.1038/s41557-025-01815-x