1.5 million materials narratives generated by chatbots
www.nature.com/scientificdata
OPEN
Data Descriptor
Yang Jeong Park1,2,3, Sung Eun Jerng4, Sungroh Yoon3,5 ✉ & Ju Li1,2,6 ✉
The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for
various applications. However, AI models often prioritize frequently encountered material examples
in the scientific literature, limiting the selection of suitable candidates based on inherent physical and
chemical attributes. To address this imbalance, we generated a dataset consisting of 1,453,493 natural
language-material narratives from OQMD, Materials Project, JARVIS, and AFLOW2 databases based
on ab initio calculation results that are more evenly distributed across the periodic table. The generated
text narratives were then scored by both human experts and GPT-4 on three rubrics: technical
accuracy, language and structure, and relevance and depth of content. The two sets of scores were similar,
with human-scored depth of content lagging the most. The integration of multimodal data sources and
large language models holds immense potential for AI frameworks to aid the exploration and discovery
of solid-state materials for specific applications of interest.
Background & Summary
Materials are of such significance in human history that the designations assigned to each era of civilization are
predicated upon the prevalent materials of the time. With the emergence of the climate crisis, the 21st century
has presented humanity with a multitude of challenges, prompting the exploration of novel materials for diverse
new applications (solar cells1,2, batteries3–5, catalysts6–8, etc.) in as short a time as possible in order to wean the entire
economy off burning fossil fuels. The expeditious discovery of materials possessing desirable attributes for specific
applications garners considerable attention; however, it is impeded by the lack of digestible information about
materials (digestible to a mechanical or electrical engineer, for example). When asked about a specific material
such as “Li4Mn5Ni(PO4)6”, even a materials expert would usually turn to a Google search, and the outcome would likely
be dense and varied literature with no guarantee of finding what one wants. Parsing it can take hours or days,
which is too slow, especially if all one needs is an initial screening. Oftentimes, it is also hard to
present aggregated information, as properties are spread over multiple experimental and ab initio databases.
The desired attributes (figure-of-merit) required to realize a given specific device may be known, while the
specific materials embodying superior figure-of-merit are generally unknown and more difficult to identify.
Throughout history, materials with technological functionalities have frequently been discovered through a
combination of intuition, trial and error, and fortuitous circumstances. Today, the prevailing paradigm has
transitioned towards a more comprehensive exploration of the vast space of potential materials. This endeavor
is facilitated by the applications of first-principles calculations and artificial intelligence (AI). Notably, the
advent of generative AI models has spurred a surge of research into the realm of inverse material design9–11.
Through the utilization of generative AI techniques, researchers have been able to accelerate the process of
materials discovery and design, offering promising opportunities for breakthroughs in the figure-of-merit for
specific applications. Some of the authors have also examined the utilization of automated systems capable of
generating scientific hypotheses in their recent work12. These systems, based on large language models (LLMs),
including chatbots such as ChatGPT13, possess an inherent probabilistic nature that enables them to generate
intriguing hypotheses, thereby expediting scientific advancements akin to human researchers. However, the
examples presented in the Supplementary Information section 1 also demonstrate certain challenges with the
1Massachusetts Institute of Technology, Department of Nuclear Science and Engineering, Cambridge, 02139, USA.
2Massachusetts Institute of Technology, Department of Materials Science and Engineering, Cambridge, 02139, USA.
3Seoul National University, Department of Electrical and Computer Engineering, Seoul, 08826, Republic of Korea.
4The University of Suwon, Department of Environmental and Energy Engineering, Hwaseong-si, 18323, Republic of Korea.
5Seoul National University, Interdisciplinary Program in Artificial Intelligence, Seoul, 08826, Republic of Korea.
6Massachusetts Institute of Technology, MIT-IBM Watson AI Lab, Cambridge, 02142, USA.
✉e-mail: sryoon@snu.ac.kr;
Scientific Data | (2024) 11:1060 | https://doi.org/10.1038/s41597-024-03886-w
“common-core” LLMs such as the standard ChatGPT, including bias toward “hot materials” and “hot topics”,
whereas true ground-breaking innovations may spring from “cold topics” or less well-known materials12. The
“common-core” LLMs, owing to their learning process based on the probabilistic distribution of tokens, tend
to prioritize the presentation of materials frequently encountered on the web and in scientific literature and
publications14–18, rather than “comprehending” the inherent properties and structures of materials and selecting
suitable candidates more rationally. This is because the “common-core” text corpora found on the web are highly
tilted toward materials already studied by human researchers, which can be rather limited, as researchers tend to
flock toward “hot materials”. This may limit the inventiveness of the narratives and inferences generated directly
with “common-core” ChatGPT12. The present work aims to generate more balanced plain-language materials
narratives that can supplement the common corpus and be used to further train more specialized LLMs, so that
their inferences will be less biased toward a narrow set of “hot” materials.
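As a rough illustration of what "balanced" could mean here, the sketch below measures how evenly chemical elements are represented across a list of formulas, using a normalized Shannon entropy. This is not the authors' method: the function names, the naive regular-expression parsing, and the two toy formula lists are assumptions made for illustration, and the formulas are not drawn from the dataset.

```python
import math
import re
from collections import Counter

# Naive assumption: an element symbol is one uppercase letter optionally
# followed by one lowercase letter. Symbols are not validated against
# the periodic table.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def element_counts(formulas):
    """Count, for each element symbol, how many formulas contain it."""
    counts = Counter()
    for formula in formulas:
        counts.update(set(ELEMENT_PATTERN.findall(formula)))
    return counts


def normalized_entropy(counts):
    """Shannon entropy of the element distribution, scaled to [0, 1].

    The scale is relative to the elements actually observed, not to the
    full periodic table. A value of 1 means every observed element
    appears equally often.
    """
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


# Hypothetical toy corpora: one dominated by Li oxides, one more varied.
skewed = ["LiCoO2", "LiFePO4", "LiMn2O4", "LiNiO2"]
varied = ["LiCoO2", "BaTiO3", "ZnS", "GaAs"]

print(normalized_entropy(element_counts(skewed)))
print(normalized_entropy(element_counts(varied)))
```

Under these assumptions the varied list scores closer to 1 than the Li-dominated list, which is the kind of shift a more evenly distributed corpus would be expected to show.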
In recent years, substantial progress has been made in the realm of multimodal learning across diverse
domains. The amalgamation and integration of information from various modalities, encompassing text,
images, audio, and video, have facilitated breakthroughs in comprehending intricate data. This interdisciplinary
approach has yielded remarkable applications in computer vision, natural language processing (NLP), and audio
analysis, thus empowering the development of more comprehensive and resilient learning systems. However,
the field of materials research has yet to embrace multimodal learning. To address these challenges, our research team has generated and shared a dataset of 1,453,493 natural language-material pairs using
publicly available material databases and chatbots. This is a fairly large number, considering that the number of
training images in ImageNet is 1,281,167.
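The size comparison can be checked with a few lines of arithmetic. The two counts are the ones quoted in the text; the variable names are only illustrative.

```python
# Counts quoted in the text.
N_NARRATIVES = 1_453_493      # natural language-material pairs
N_IMAGENET_TRAIN = 1_281_167  # ImageNet training images

difference = N_NARRATIVES - N_IMAGENET_TRAIN
ratio = N_NARRATIVES / N_IMAGENET_TRAIN

print(f"{difference:,} more entries than ImageNet training images")
print(f"size ratio: {ratio:.2f}")
```

The dataset is therefore roughly 13% larger than the ImageNet training set by entry count, although a text narrative and an image are not directly comparable units.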
The fusion and (...truncated)