1.5 million materials narratives generated by chatbots

www.nature.com/scientificdata

OPEN Data Descriptor

Yang Jeong Park 1,2,3, Sung Eun Jerng 4, Sungroh Yoon 3,5 ✉ & Ju Li 1,2,6 ✉

The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for various applications. However, AI models often prioritize frequently encountered material examples in the scientific literature, limiting the selection of suitable candidates based on inherent physical and chemical attributes. To address this imbalance, we generated a dataset consisting of 1,453,493 natural language-material narratives from OQMD, Materials Project, JARVIS, and AFLOW2 databases based on ab initio calculation results that are more evenly distributed across the periodic table. The generated text narratives were then scored by both human experts and GPT-4 on three rubrics: technical accuracy, language and structure, and relevance and depth of content. The two sets of scores were similar, with human-scored depth of content lagging the most. The integration of multimodal data sources and large language models holds immense potential for AI frameworks to aid the exploration and discovery of solid-state materials for specific applications of interest.

Background & Summary

Materials are of such significance in human history that the designations assigned to each era of civilization are predicated upon the prevalent materials of the time. With the emergence of the climate crisis, the 21st century has presented humanity with a multitude of challenges, prompting the exploration of novel materials for diverse new applications (solar cells1,2, batteries3–5, catalysts6–8, etc.) in as short a time as possible, in order to wean the entire economy off burning fossil fuels.
The expeditious discovery of materials possessing desirable attributes for specific applications garners considerable attention; however, it is impeded by the lack of information about materials that is digestible to, say, a mechanical or electrical engineer. For example, when asked about a specific material such as “Li4Mn5Ni(PO4)6”, even a materials expert would usually turn to a Google search. The outcome would likely be dense and varied literature that can take hours or days to parse, with no guarantee of finding what one wants, which is too slow when all one needs is an initial screening. Oftentimes, it is hard to present aggregated information, as properties are spread over multiple experimental and ab initio databases. The desired attributes (figure-of-merit) required to realize a given device may be known, while the specific materials embodying a superior figure-of-merit are generally unknown and more difficult to identify. Throughout history, materials with technological functionalities have frequently been discovered through a combination of intuition, trial and error, and fortuitous circumstances. Today, the prevailing paradigm has transitioned towards a more comprehensive exploration of the vast space of potential materials. This endeavor is facilitated by the application of first-principles calculations and artificial intelligence (AI). Notably, the advent of generative AI models has spurred a surge of research into the realm of inverse material design9–11. Through the utilization of generative AI techniques, researchers have been able to accelerate the process of materials discovery and design, offering promising opportunities for breakthroughs in the figure-of-merit for specific applications. Some of the authors have also examined the utilization of automated systems capable of generating scientific hypotheses in their recent work12.
1 Massachusetts Institute of Technology, Department of Nuclear Science and Engineering, Cambridge, 02139, USA. 2 Massachusetts Institute of Technology, Department of Materials Science and Engineering, Cambridge, 02139, USA. 3 Seoul National University, Department of Electrical and Computer Engineering, Seoul, 08826, Republic of Korea. 4 The University of Suwon, Department of Environmental and Energy Engineering, Hwaseong-si, 18323, Republic of Korea. 5 Seoul National University, Interdisciplinary Program in Artificial Intelligence, Seoul, 08826, Republic of Korea. 6 Massachusetts Institute of Technology, MIT-IBM Watson AI Lab, Cambridge, 02142, USA. ✉e-mail: sryoon@snu.ac.kr

Scientific Data | (2024) 11:1060 | https://doi.org/10.1038/s41597-024-03886-w

These systems, based on large language models (LLMs) and including chatbots such as ChatGPT13, possess an inherent probabilistic nature that enables them to generate intriguing hypotheses, thereby expediting scientific advancement much as human researchers do. However, the examples presented in the Supplementary Information section 1 also demonstrate certain challenges with the “common-core” LLMs such as the standard ChatGPT, including bias toward “hot materials” and “hot topics”, whereas true ground-breaking innovations may spring from “cold topics” or less well-known materials12. The “common-core” LLMs, owing to their learning process based on the probabilistic distribution of tokens, tend to prioritize the presentation of materials frequently encountered on the web and in scientific literature and publications14–18, rather than “comprehending” the inherent properties and structures of materials and selecting suitable candidates more rationally. This is because the “common-core” text corpora found on the web are highly tilted toward materials already studied by human researchers, which can be rather limited, as researchers tend to flock toward “hot materials”.
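As a toy illustration of the frequency bias described above, the following Python sketch compares drawing material names in proportion to how often they are mentioned in a corpus with drawing them uniformly from a database. The mention counts, the placeholder names `cold_material_A` and `cold_material_B`, and the function names are assumptions made up for illustration. This is not how an LLM is implemented; it is only a minimal analogy for sampling from a skewed distribution.

```python
import random
from collections import Counter

# Hypothetical mention counts in a web-scale corpus (illustrative numbers only,
# not measured from any real corpus).
corpus_mentions = {
    "LiFePO4": 9000,          # assumed "hot" material
    "graphene": 8000,         # assumed "hot" material
    "Li4Mn5Ni(PO4)6": 5,      # assumed "cold" material
    "cold_material_A": 2,     # placeholder name
    "cold_material_B": 1,     # placeholder name
}
COLD = {"Li4Mn5Ni(PO4)6", "cold_material_A", "cold_material_B"}


def sample_by_frequency(mentions, n, seed=0):
    """Draw n names with probability proportional to their mention count."""
    rng = random.Random(seed)
    names = list(mentions)
    weights = [mentions[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


def sample_uniform(mentions, n, seed=0):
    """Draw n names with equal probability, as if iterating over database entries."""
    rng = random.Random(seed)
    return Counter(rng.choices(list(mentions), k=n))


def cold_share(counts):
    """Fraction of draws that landed on a 'cold' material."""
    return sum(counts[name] for name in COLD) / sum(counts.values())


freq_counts = sample_by_frequency(corpus_mentions, 10_000)
unif_counts = sample_uniform(corpus_mentions, 10_000)

print("cold share, frequency-weighted:", cold_share(freq_counts))
print("cold share, uniform:", cold_share(unif_counts))
```

With these made-up counts, frequency-weighted sampling should almost never return a cold material, whereas uniform sampling over the entries returns one most of the time.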
This bias may limit the inventiveness of the narratives and inferences generated directly with “common-core” ChatGPT12. The present work aims to generate more balanced plain-language materials narratives that can supplement the common corpus and be used to further train more specialized LLMs, so that their inferences will be less biased toward “hot” but narrow-based materials.

In recent years, substantial progress has been made in the realm of multimodal learning across diverse domains. The amalgamation and integration of information from various modalities, encompassing text, images, audio, and video, have facilitated breakthroughs in comprehending intricate data. This interdisciplinary approach has yielded remarkable applications in computer vision, natural language processing (NLP), and audio analysis, thus empowering the development of more comprehensive and resilient learning systems. However, the field of materials research has yet to embrace the endeavor of multimodal learning. To surmount these challenges, our research team has generated and shared a dataset of 1,453,493 natural language-material pairs utilizing publicly available material databases and chatbots. This is a fairly large number considering that the number of training images in ImageNet is 1,281,167. The fusion and (...truncated)