A Machine Learning and Explainable AI Framework Tailored for Unbalanced Experimental Catalyst Discovery

Parastoo Semnani
Mihail Bogojeski
Florian Bley
Zizheng Zhang
Qiong Wu
Thomas Kneib
Jan Herrmann
Christoph Weisser
Florina Patcas
Klaus-Robert Müller

July 10, 2024

The successful application of machine learning in catalyst design depends on high-quality, diverse data to ensure effective generalization to novel compositions, thereby aiding catalyst discovery. However, due to the complex interactions of catalyst components, the design of novel catalysts has long relied on trial and error, a costly and labor-intensive process that yields scarce data heavily biased towards undesired, low-yield catalysts. Despite the increasing popularity of machine learning applications in this field, most efforts so far have not focused on the challenges posed by such experimental data. To address these challenges, we introduce a robust machine learning and explainable AI framework to accurately classify the catalytic yield of various compositions and identify the contributions of individual components to the yield. This framework combines a series of ML practices designed to handle the scarcity and imbalance of catalyst data. We apply the framework to the task of determining the yield of different catalyst combinations in oxidative methane coupling, and use it to evaluate the performance of a range of ML models: tree-based models (such as decision trees, random forests, and gradient boosted trees), logistic regression, support vector machines, and neural networks. These experiments demonstrate that the methods used in our framework lead to a significant improvement in the performance of all but one of the evaluated models. Additionally, the decision-making process of each ML model is analyzed by identifying the most important features for predicting catalyst performance using explainable AI (XAI) methods. Our analysis found that XAI methods that provide class-aware explanations, such as Layer-wise Relevance Propagation, identified key components that contribute specifically to high-yield catalysts.
These findings align with chemical intuition and existing literature, reinforcing their validity. We believe that such insights can assist chemists in the development and identification of novel catalysts with superior performance. An illustration of the abstract is shown in Figure 1.
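
To make the imbalance problem concrete, the following is a minimal sketch, not the authors' implementation, of two common remedies for skewed labels: inverse-frequency class weights and balanced accuracy as an evaluation metric. The 90/10 label split, the function names, and the choice of these two practices are illustrative assumptions; the abstract does not state which specific practices the framework combines.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 90 low-yield (0) and 10 high-yield (1) catalysts.
y_true = [0] * 90 + [1] * 10
# A trivial model that always predicts the majority (low-yield) class.
y_pred = [0] * 100

weights = balanced_class_weights(y_true)
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print("class weights:", weights)
print("plain accuracy:", plain_acc)
print("balanced accuracy:", bal_acc)
```

In this toy setting the majority-class predictor reaches a plain accuracy of 0.9 but a balanced accuracy of only 0.5, which shows why a metric that ignores class frequencies is a common choice when the data are skewed towards low-yield catalysts. The weights (about 0.56 for the low-yield class, 5.0 for the high-yield class) could be passed to a classifier's loss so that errors on the rare class cost more.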