Material representations play a crucial role in machine learning applications for materials science. The way materials are represented as input data to artificial intelligence (AI) methods has a significant impact on the accuracy and efficiency of the machine learning algorithms. The choice of material representation depends on the type of problem being addressed and the learning algorithm being used.
In popular databases, materials are typically represented by their chemical formula, their space group, or a CIF file. CIF, or Crystallographic Information File, is a standardized format for storing information about crystal structures (Hall et al., 1991). A CIF contains the atomic positions and lattice parameters of a crystal structure, its symmetry operations, and more. This type of representation is commonly used in materials science and provides a detailed description of the 3D structure of a material. However, chemical formulas contain limited information about a material, and raw CIF files are difficult for AI methods to learn from directly, so a wide array of alternative material representations has emerged. To keep things simple, in this post we consider three broad groups of material representations: composition-based, structure-based, and property-based.
Composition-based representations capture the essential details of a material's composition: the elements that make it up and the proportions in which they are present. Two data types are popular for this purpose: strings of characters and 1D vectors. For example, the mineral quartz (SiO2) could be represented as the string “SiO2”, its chemical formula. Another straightforward representation of quartz is a vector of 118 numbers, one per chemical element, all of which are zero except for position 14 (silicon’s atomic number), which holds a value of 1 (the relative amount of silicon in quartz), and position 8 (oxygen’s atomic number), which holds a value of 2 (the relative amount of oxygen in quartz). A widely used 1D vector representation in inorganic materials science is the Magpie attribute set, which encodes information such as stoichiometry, elemental and ionic properties, and electronic structure into a sequence of 145 numbers (Ward et al., 2016).
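To make the 118-number vector concrete, here is a minimal Python sketch. The names `parse_formula`, `composition_vector`, and the abbreviated `ATOMIC_NUMBER` table are illustrative assumptions, not part of any particular library, and the parser only handles simple formulas without parentheses or charges.

```python
import re

# Abbreviated lookup table for illustration; a real one would list all 118 elements.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "Na": 11, "Si": 14, "Cl": 17, "Fe": 26}
N_ELEMENTS = 118


def parse_formula(formula):
    """Parse a simple formula such as 'SiO2' into {'Si': 1.0, 'O': 2.0}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        # A missing number after an element symbol means an amount of 1.
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_vector(formula, normalize=False):
    """Return a length-118 vector indexed by atomic number (Z - 1 in Python)."""
    vector = [0.0] * N_ELEMENTS
    for symbol, amount in parse_formula(formula).items():
        vector[ATOMIC_NUMBER[symbol] - 1] = amount
    if normalize:
        # Optionally convert raw amounts into fractions that sum to 1.
        total = sum(vector)
        vector = [v / total for v in vector]
    return vector


quartz = composition_vector("SiO2")
print(quartz[13], quartz[7])  # silicon (Z = 14) and oxygen (Z = 8)
```

Because Python lists are zero-indexed, "position 14" in the text corresponds to index 13 in the code. With `normalize=True` the entries become atomic fractions (1/3 and 2/3 for quartz), which is often a more convenient input scale for a learning algorithm.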
Structure-based material representations take into account the arrangement of atoms in a material, providing information about its internal organization. Because they encode information about bonding and interatomic interactions, they can provide a deeper understanding of the material’s properties and behavior. They take various forms, such as 1D vectors, 2D matrices, 3D voxels, and graphs, each of which offers a different view of the material’s structure. In addition to composition-based 1D vectors, there are also 1D representations that encode structural information; popular examples are local environment fingerprints such as Atom-Centered Symmetry Functions (ACSF) or the Smooth Overlap of Atomic Positions (SOAP), descriptors that encode the local atomic environment around each atom in the structure (Behler, 2011; Bartók et al., 2013). The 2D matrix representation of a material contains, like the CIF, information about the arrangement of atoms in a material, but in a simplified, mathematical form (see the figure below, left). 3D voxels represent a material’s three-dimensional structure as a discrete, regularly spaced grid of cube-shaped cells. In this representation, the value of each voxel can indicate the presence or absence of an atom at that point in space: in the figure below (center), a yellow-green color indicates the presence of a given chemical element at that position (electron density), while purple indicates the absence of atoms in that region of the grid.
Finally, graphs are expressive data structures that can capture the material’s atomic structure. In this representation, each atom is a node in the graph, and chemical bonds between neighboring atoms are edges between nodes (figure below, right). This representation can capture the arrangement of atoms or molecules in space, as well as the chemical bonds between them.
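As a rough illustration of the graph idea, the sketch below builds nodes and edges from a handful of atoms using a simple distance cutoff. The function name `build_graph`, the cutoff value, and the four-atom NaCl fragment are assumptions made for this example; it also ignores periodic boundary conditions, which a real crystal graph would need to handle.

```python
import math
from itertools import combinations


def build_graph(atoms, cutoff):
    """Build a graph from (symbol, (x, y, z)) tuples.

    Nodes are the element symbols; an edge (i, j, distance) is added whenever
    two atoms are closer than `cutoff` (in the same units as the coordinates).
    """
    nodes = [symbol for symbol, _ in atoms]
    edges = []
    for (i, (_, pos_i)), (j, (_, pos_j)) in combinations(enumerate(atoms), 2):
        distance = math.dist(pos_i, pos_j)
        if distance <= cutoff:
            edges.append((i, j, distance))
    return nodes, edges


# Illustrative, non-periodic fragment of rock salt (coordinates in angstroms).
fragment = [
    ("Na", (0.00, 0.00, 0.00)),
    ("Cl", (2.82, 0.00, 0.00)),
    ("Cl", (0.00, 2.82, 0.00)),
    ("Na", (2.82, 2.82, 0.00)),
]

nodes, edges = build_graph(fragment, cutoff=3.0)
for i, j, distance in edges:
    print(f"{nodes[i]}{i} - {nodes[j]}{j}: {distance:.2f}")
```

Here the cutoff stands in for "chemical bond": only nearest Na-Cl neighbors are connected, while the longer Na-Na and Cl-Cl diagonals are not. Graph-based models typically attach feature vectors to each node and edge (for example, the bond length stored above) rather than using bare symbols.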
2D matrix representation (Ren et al., 2021). | Voxel representation (Long et al., 2020). Left: 3D voxel for each element. Center: chemical elements in 3D structure. Right: 3D structure. | Graph representation (Xie & Grossman, 2018). Left: 3D structure. Center, from top to bottom: local environment of Na in structure, local environment of Cl in structure, vector describing Na-Cl bond. Right: graph. |
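The voxel representation described above can be sketched in a few lines as well. This is a simplified assumption-laden version: it stores a binary occupancy value (1 or 0) rather than an electron density, uses one grid per element, and the function name `voxelize` and the grid size are chosen only for illustration.

```python
def voxelize(frac_coords, grid_size):
    """Mark the voxels that contain an atom.

    `frac_coords` are fractional coordinates in the unit cell; the result is a
    grid_size x grid_size x grid_size nested list of 0s and 1s.
    """
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for coords in frac_coords:
        # Wrap each coordinate into [0, 1) and convert it to a voxel index.
        i, j, k = (min(int((c % 1.0) * grid_size), grid_size - 1) for c in coords)
        grid[i][j][k] = 1
    return grid


# Fractional coordinates of the conventional rock-salt (NaCl) cell.
na_sites = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)]
cl_sites = [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5), (0.5, 0.5, 0.5)]

# One occupancy grid per element, similar in spirit to one channel per element.
na_grid = voxelize(na_sites, grid_size=4)
cl_grid = voxelize(cl_sites, grid_size=4)

occupied = sum(value for plane in na_grid for row in plane for value in row)
print(occupied)  # number of voxels occupied by Na
```

A finer grid resolves atomic positions more precisely but grows cubically in size, which is one of the accuracy versus cost trade-offs of voxel representations.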
The third broad category of material representations is property-based. These representations, as the name suggests, contain information about the properties of the material, such as electrical conductivity, optical properties, mechanical properties, and more. Selecting the right material properties can ease and enhance the learning process of AI models. This information can be encoded in a variety of formats, such as scalar values, 1D vectors, or 2D matrices, depending on the specific use case and the AI method being used.
In conclusion, the choice of material representation is a crucial aspect of using AI methods for materials science, as it can have a significant impact on the performance of the models and their ability to learn and make predictions. It is important to carefully consider the trade-offs between accuracy and computational efficiency when selecting a representation, and to choose one that suits both the specific application under study and the AI method that has to learn from it.