Naturally-meaningful and efficient descriptors: machine learning of material properties based on robust one-shot ab initio descriptors

Tawfik and Russo, Journal of Cheminformatics (2022) 14:78
https://doi.org/10.1186/s13321-022-00658-9
Open Access
RESEARCH

Sherif Abdulkader Tawfik¹,²* and Salvy P. Russo¹,³*

Abstract

Establishing a data-driven pipeline for the discovery of novel materials requires the engineering of material features that can be feasibly calculated and can be applied to predict a material’s target properties. Here we propose a new class of descriptors for describing crystal structures, which we term Robust One-Shot Ab initio (ROSA) descriptors. ROSA descriptors are computationally cheap and are shown to accurately predict a range of material properties. This simple and intuitive class of descriptors is generated from the energetics of a material at a low level of theory using an incomplete ab initio calculation. We demonstrate how the incorporation of ROSA descriptors in ML-based property prediction leads to accurate predictions over a wide range of crystals, amorphized crystals, metal–organic frameworks and molecules. We believe that the low computational cost and ease of use of these descriptors will significantly improve ML-based predictions.

Introduction

A major objective in material science is to generate machine learning (ML) models that can accurately and rapidly predict a property of a given material using only information derived from the material’s structure [1, 2]. Predicting material properties such as the energy bandgap would then take only a few seconds, or a fraction of a second, using an ML model, instead of consuming several hours or even days on a supercomputer to perform a first-principles calculation, such as density functional theory (DFT).
With the availability of massive materials datasets such as MaterialsProject [3] and AFLOW [4], which host more than 3.5 million materials, it is becoming increasingly possible to screen materials for their properties [2]. To achieve such an objective, one must find features that can map a material structure against the highly nonlinear material properties. The vector of feature descriptors (the individual quantities that constitute the feature; see footnote 1) must be unique to each material and feasible to calculate. An ML model can subsequently be trained to translate descriptors into properties, i.e. to perform the mapping of structure against property. No matter how sophisticated or “deep” the ML models are, they will fail as long as the descriptors are poorly chosen.

Footnote 1: The terms “features” and “descriptors” are used interchangeably in the literature. Here we refer to a “feature” as a group of “descriptors”. Note that other terms, including “attributes” and “fingerprints”, are also frequently used in the literature, and they have the same meaning as “descriptors”.

*Correspondence: ;
¹ ARC Centre of Excellence in Exciton Science, School of Science, RMIT University, Melbourne, VIC 3001, Australia
Full list of author information is available at the end of the article.

© The Author(s) 2022. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

The quality of descriptors is usually appraised by the ability of the descriptors to train predictive ML models. However, we emphasize the importance of three other key elements for judging the quality of descriptors, which are as important as their predictive power:

1. Meaningfulness of features: the term “meaningful features” appears frequently in the broad ML literature, such as in Ref. [5]. In the context of material science, the term loosely means that the features are related to a physical and/or chemical principle. An example is Ref. [6].
2. Calculation efficiency: the cost of computing the feature should be much less than that of calculating the target property.
3. Number of descriptors within a feature: expressing a material structure with a relatively small number of descriptors (i.e. a few hundred) can ensure the simplicity of the ML model. Features that require the calculation of thousands of descriptors entail costly storage requirements for datasets, higher processing requirements for the utilization of the trained ML models, and non-transparent, or “black-box”, ML models [5].

Together with predictive accuracy, these elements give four criteria for ML features for materials, which we call the MENA criteria (Meaningful, Efficient, small Number of descriptors, Accurate).
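To give a rough sense of scale for the storage cost raised under the third criterion, the sketch below estimates the size of a dense table of descriptors. The figure of 3.5 million materials is the one quoted earlier in this article. Everything else is an assumption made purely for illustration: the descriptor counts (300 and 5000), 8-byte floating-point values, dense uncompressed storage, and the hypothetical function name `dense_storage_gb`.

```python
def dense_storage_gb(n_materials: int, n_descriptors: int,
                     bytes_per_value: int = 8) -> float:
    """Size in gigabytes (10**9 bytes) of a dense descriptor table.

    Assumes one fixed-width value per descriptor per material and no
    compression; both are illustrative assumptions, not from the article.
    """
    return n_materials * n_descriptors * bytes_per_value / 1e9


# Dataset size quoted in the introduction (MaterialsProject and AFLOW).
N_MATERIALS = 3_500_000

# "A few hundred" descriptors; 300 is an assumed example value.
compact = dense_storage_gb(N_MATERIALS, 300)

# "Thousands" of descriptors; 5000 is an assumed example value.
large = dense_storage_gb(N_MATERIALS, 5000)

print(f"300 descriptors per material:  {compact:.1f} GB")
print(f"5000 descriptors per material: {large:.1f} GB")
```

Under these assumptions the compact feature needs about 8.4 GB and the larger one about 140 GB, so the storage cost grows in direct proportion to the number of descriptors per material.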
A number of ML features have recently been proposed in the literature for predicting the various DFT-calculated properties of materials, but they differ in how they satisfy the MENA criteria. We classify these features into the following four classes:

1. Elemental features: this is the simplest type of feature, and the quickest to calculate. The descriptor values within these features are directly related to a property of the elements within the crystal structure or molecule, and therefore are physically and chemically meaningful. For example, for a crystal structure, possible elemental descriptors are the mean atomic number and the mean elemental melting point of the atoms within the crystal unit cell. However, these features are nonunique: two materials with equal composition, but different structural phases, will have the same elemental features. They are thus not accurate. It was also reported that, in some cases, the most significant features for predicting a property seem to be counter-intuitive [6]. Using those features alone might work in limited cases, such as when using a small dataset (such as the ~300-material dataset in Ref. [7]), but cannot be generalized to the broader set of materials. The reason these features work is related to the distribution of polymorphs in present-day materials databases: in MaterialsProject, for example, the average number of polymorphs for each material is ~1.4. If there were more polymorphs, elemental features (...truncated)