A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists
Nature Chemistry
May 2025
Mirza, Adrian; Alampara, Nawaf; Kunchapu, Sreekanth; Ríos-García, Martiño; Emoekabu, Benedict; Krishnan, Aswanth; Gupta, Tanya; et al.
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
Article
https://doi.org/10.1038/s41557-025-01815-x
Received: 1 April 2024
Accepted: 26 March 2025
Published online: 20 May 2025
A list of authors and their affiliations appears at the end of the paper
Large language models (LLMs) are machine learning (ML) models
trained on massive amounts of text to complete sentences. Aggressive
scaling of these models has led to a rapid increase in their capabilities1,2,
with the leading models now able to pass the US Medical Licensing Examination3 or other professional licensing exams. They have also
been shown to design and autonomously perform chemical reactions
when augmented with external tools such as web search and synthesis
planners4–7. While some see ‘sparks of artificial general intelligence
(AGI)’ in them8, others see them as ‘stochastic parrots’—that is, systems
that only regurgitate what they have been trained on9 and that show
inherent limitations owing to the way they are trained10. Nevertheless,
these models are promising because they have shown the ability to solve
a wide variety of tasks on which they have not been explicitly trained11–13.
Chemists and materials scientists have quickly caught on to the
mounting attention given to LLMs, with some voices even suggesting
that ‘the future of chemistry is language’14. This statement is motivated by a growing number of reports that use LLMs to predict properties of molecules or materials2,15–19, optimize reactions20,21, generate
materials22–25, extract information26–33 or even to prototype systems
that can autonomously perform experiments in the physical world
based on commands provided in natural language5–7.
In addition, since much, if not most, of the information about
chemistry is currently stored and communicated in text, there is
strong reason to believe that LLMs still hold considerable untapped
potential for chemistry and materials science34. For instance, most insights
in chemical research do not directly originate from data stored in
databases but rather from the scientists interpreting the data. Many
of these insights are in the form of text in scientific publications. Thus,
operating on such texts might be our best way of unlocking these
insights and learning from them. This might ultimately lead to general
copilot systems for chemists that can provide answers to questions or
even suggest new experiments on the basis of vastly more information
than a human could ever read.
[Figure 1 (image not reproduced). Labels recoverable from the figure: Data preparation (>2,800 total questions); Semantic annotation; curation; Knowledge, Reasoning, Intuition; Corpus in BIG-bench format; 19 respondents; 251 diverse questions; chembench.org; Humans; Closed-source models; Open-weight models; Diverse settings; Leaderboard; Automatically updated; Peer-reviewed; Topic leaders; Overall leaders. Example question shown: "What is the number of signals in the 1H NMR spectrum of a molecule with the SMILES [START_SMILES]OCC1C2CC1(O)C2=O[END_SMILES]?"]
Fig. 1 | Overview of the ChemBench framework. The framework’s foundation is the benchmark
corpus comprising thousands of questions and answers that we manually or
semi-automatically compiled from various sources in a format based on the
one introduced in the BIG-bench benchmark (Extended Data Fig. 1). Questions
are classified on the basis of topics, required skills (reasoning, calculation,
knowledge and intuition) and difficulty levels. We then used this corpus to
evaluate the performance of various models and tool-augmented systems using a
custom framework. To provide a baseline, we built a web application that we used
to survey experts in chemistry. The results of the evaluations are then compiled
in publicly accessible leaderboards (Supplementary Note 15), which we propose
as a foundation for evaluating future models.
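The question classification described in the Fig. 1 caption (topic, required skills, difficulty) can be made concrete with a small sketch. The Python below is a minimal illustration under stated assumptions, not ChemBench's actual schema or API: the `Question` fields, the `tag_smiles` and `fraction_correct_by_topic` helpers, the topic labels, the toy questions and the exact-match scoring rule are all hypothetical. Only the `[START_SMILES]`/`[END_SMILES]` delimiters are taken from the example question shown in Fig. 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """Hypothetical question record (field names are assumptions)."""
    prompt: str
    target: str               # expected answer, stored as text
    topic: str                # illustrative topic label
    skills: Tuple[str, ...]   # e.g. reasoning, calculation, knowledge, intuition
    difficulty: str           # illustrative difficulty label


def tag_smiles(smiles: str) -> str:
    """Wrap a SMILES string in the delimiters seen in the Fig. 1 example."""
    return f"[START_SMILES]{smiles}[END_SMILES]"


def fraction_correct_by_topic(
    questions: Sequence[Question], answers: Sequence[str]
) -> Dict[str, float]:
    """Score answers by exact string match (an assumed rule) per topic."""
    if len(questions) != len(answers):
        raise ValueError("need exactly one answer per question")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question, answer in zip(questions, answers):
        total[question.topic] += 1
        if answer.strip() == question.target:
            correct[question.topic] += 1
    return {topic: correct[topic] / total[topic] for topic in total}


# Toy questions with well-known answers, for illustration only.
EXAMPLE_QUESTIONS = (
    Question("What is the atomic number of carbon?", "6",
             "general chemistry", ("knowledge",), "basic"),
    Question("What is the number of signals in the 1H NMR spectrum of "
             f"a molecule with the SMILES {tag_smiles('c1ccccc1')}?", "1",
             "analytical chemistry", ("reasoning", "knowledge"), "basic"),
    Question("How many valence electrons does a nitrogen atom have?", "5",
             "general chemistry", ("knowledge",), "basic"),
)
```

Calling `fraction_correct_by_topic(EXAMPLE_QUESTIONS, ["6", "1", "3"])` gives 0.5 for the two general chemistry questions and 1.0 for the analytical one. Grouping scores by topic mirrors the "topic leaders" and "overall leaders" labels in the figure, though how the actual leaderboards aggregate results is not specified in this excerpt.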
However, the rapid increase in capabilities of chemical ML models
led (even before the recent interest in LLMs) to concerns about the
potential for the dual use of these technologies, for example, for the
design of chemical weapons35–40. To some extent, this is not surprising
as any technology that, for instance, is used to design non-toxic molecules can also be used inversely to predict toxic ones (even though the
synthesis would still require access to controlled physical resources
and facilities). Still, it is essential to realize that the user base of LLMs
is broader than that of chemistry and materials science experts who
can critically reflect on every output these models produce. For example, many students frequently consult these tools—perhaps even
to prepare chemical experiments41. This also applies to users from
the general public, who might consider using LLMs to answer questions about the safety of chemicals. Thus, for some users, misleading
information—especially about safety-related aspects—might lead to
harmful outcomes. However, even for experts, the chemical knowledge
and reasoning capabilities of LLMs are essential, as they determine the
capabilities and limitations of these models in the experts’ work, for example,
in copilot systems for chemists. Unfortunately, apart from exploratory
reports, such as those that prompt leading models with various scientific
questions13, there is little systematic evidence on how LLMs perform
compared with expert (human) chemists.
Thus, to better understand what LLMs can do for the chemical
sciences and where they might be improved with further developments, evaluation frameworks are needed to allow us to measur (...truncated)
Mirza, Adrian, Alampara, Nawaf, Kunchapu, Sreekanth, Ríos-García, Martiño, Emoekabu, Benedict, Krishnan, Aswanth, Gupta, Tanya, Schilling-Wilhelmi, Mara, Okereke, Macjonathan, Aneesh, Anagha, Asgari, Mehrdad, Eberhardt, Juliane, Elahi, Amir Mohammad, Elbeheiry, Hani M., Gil, María Victoria, Glaubitz, Christina, Greiner, Maximilian, Holick, Caroline T., Hoffmann, Tim, Ibrahim, Abdelrahman, Klepsch, Lea C., Köster, Yannik, Kreth, Fabian Alexander, Meyer, Jakob, Miret, Santiago, Peschel, Jan Matthias, Ringleb, Michael, Roesner, Nicole C., Schreiber, Johanna, Schubert, Ulrich S., Stafast, Leanne M., Wonanke, A. D. Dinga, Pieler, Michael, Schwaller, Philippe, Jablonka, Kevin Maik.
A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists,
Nature Chemistry,
2025, DOI: 10.1038/s41557-025-01815-x