Naturally-meaningful and efficient descriptors: machine learning of material properties based on robust one-shot ab initio descriptors
Tawfik and Russo Journal of Cheminformatics
https://doi.org/10.1186/s13321-022-00658-9
Journal of Cheminformatics
(2022) 14:78
Open Access
RESEARCH
Sherif Abdulkader Tawfik¹,²* and Salvy P. Russo¹,³*
Abstract
Establishing a data-driven pipeline for the discovery of novel materials requires the engineering of material features
that can be feasibly calculated and can be applied to predict a material’s target properties. Here we propose a new
class of descriptors for describing crystal structures, which we term Robust One-Shot Ab initio (ROSA) descriptors.
ROSA is computationally cheap and is shown to accurately predict a range of material properties. This simple and
intuitive class of descriptors is generated from the energetics of a material at a low level of theory using an incomplete ab initio calculation. We demonstrate how the incorporation of ROSA descriptors in ML-based property prediction leads to accurate predictions over a wide range of crystals, amorphized crystals, metal–organic frameworks and
molecules. We believe that the low computational cost and ease of use of these descriptors will significantly improve
ML-based predictions.
Introduction
A major objective in materials science is to generate
machine learning (ML) models that can accurately, and
rapidly, predict a property for a given material by using
information derived from the material’s structure only
[1, 2]. Predicting material properties such as the energy
bandgap would then take only a few seconds, or a fraction of a second, using an ML model, instead of consuming several hours, or even days, on a supercomputer to
perform a first-principles calculation, such as density
functional theory (DFT). With the availability of massive materials datasets such as MaterialsProject [3] and
AFLOW [4], which host more than 3.5 million materials,
it is becoming increasingly possible to screen materials
for their properties [2]. To achieve such an objective,
one must find features that can map a material structure
to the highly nonlinear material properties. The vector of feature descriptors (the individual quantities that
constitute the feature)¹ must be unique to each material
and feasible to calculate. An ML model can subsequently
be trained to translate descriptors into properties, i.e. to perform the mapping from structure to property. No matter how sophisticated or “deep” the ML models are, they
will fail as long as the descriptors are poorly chosen.
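The uniqueness requirement can be made concrete with a toy example. The minimal Python sketch below is not taken from the paper: it computes a single composition-only descriptor (the mean atomic number) for two TiO2 polymorphs, rutile and anatase. The function name `mean_atomic_number` and the simplified atom lists are assumptions introduced here purely for illustration. Because both phases have the same composition, this descriptor cannot tell them apart.

```python
import math

# Atomic numbers of the two elements needed for this toy example.
ATOMIC_NUMBER = {"Ti": 22, "O": 8}


def mean_atomic_number(species):
    """Composition-only descriptor: mean atomic number over the atoms in a cell.

    `species` is a plain list of element symbols; no structural information
    (positions, lattice) enters the calculation.
    """
    return sum(ATOMIC_NUMBER[s] for s in species) / len(species)


# Simplified, illustrative records: only the atoms in each cell are listed.
# The cells differ in size, but the Ti:O ratio is 1:2 in both.
rutile = ["Ti"] * 2 + ["O"] * 4
anatase = ["Ti"] * 4 + ["O"] * 8

d_rutile = mean_atomic_number(rutile)
d_anatase = mean_atomic_number(anatase)

# The two polymorphs are indistinguishable under this descriptor.
same_descriptor = math.isclose(d_rutile, d_anatase)
print(d_rutile, d_anatase, same_descriptor)
```

Any property that differs between the two phases cannot be learned from this descriptor alone; a descriptor that also encodes structural information would be needed to separate them.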
¹ The terms “features” and “descriptors” are used interchangeably in the literature. Here we refer to a “feature” as a group of “descriptors”. Note that other
terms, including “attributes” and “fingerprints”, are also frequently used in the
literature, and they have the same meaning as “descriptors”.
*Corresponding authors
¹ ARC Centre of Excellence in Exciton Science, School of Science, RMIT
University, Melbourne, VIC 3001, Australia
Full list of author information is available at the end of the article
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/)
applies to the data made available in this article, unless otherwise stated in a credit line to the data.
The quality of descriptors is usually appraised by their
ability to train predictive ML models.
However, we emphasize the importance of three other
key elements for judging the quality of descriptors, which
are as important as their predictive power:
1. Meaningfulness of features: the term “meaningful
features” appears frequently in the broad ML literature, such as in Ref. [5]. In the context of materials
science, the term loosely means that the features are
related to a physical and/or chemical principle. An
example is Ref. [6].
2. Calculation efficiency: the cost of computing the feature should be much less than that of calculating the
target property.
3. Number of descriptors within a feature: expressing a material structure as a relatively small
number of descriptors (i.e. a few hundred) can ensure
the simplicity of the ML model. Features that require
the calculation of thousands of descriptors entail costly
storage requirements for datasets, higher processing requirements for the utilization of the trained
ML models, and non-transparent, or “black-box”, ML
models [5].
We call these four criteria for ML features of materials (the three elements above together with predictive accuracy)
the MENA criteria (Meaningful, Efficient, small Number
of descriptors, Accurate). A number of ML features have
recently been proposed in the literature for predicting
various DFT-calculated properties of materials, but they
differ in how well they satisfy the MENA criteria. We classify
these features into the following four classes:
1. Elemental features: this is the simplest type of feature, and the quickest to calculate. The descriptor
values within these features are directly related to a
property of the elements within the crystal structure
or molecule, and are therefore physically and chemically meaningful. For example, for a crystal structure,
a possible elemental feature consists of the mean atomic number and the mean elemental melting point of the atoms
within the crystal unit cell. However, these features
are nonunique: two materials with equal composition, but different structural phases, will have the
same elemental features. They are thus not accurate.
It was also reported that, in some cases, the most
significant features for predicting a property seem to
be counter-intuitive [6]. Using those features alone
might work in limited cases, such as when using a
small dataset (such as the ~300-material dataset in
Ref. [7]), but cannot be generalized to the broader
set of materials. The reason these features work is
related to the distribution of polymorphs in present-day
materials databases: in MaterialsProject, for
example, the average number of polymorphs for each
material is ~1.4. If there were more polymorphs,
elemental features (...truncated)