
Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Anna Hedström
Philine Lou Bommer
Thomas F Burns
Sebastian Lapuschkin
Wojciech Samek
Marina MC Höhne

February 10, 2025

Abstract: Interpretability researchers face a universal question: without access to ground-truth labels, how can the faithfulness of an explanation to its model be determined? Despite immense efforts to develop new evaluation methods, current approaches remain in a pre-paradigmatic state: fragmented, difficult to calibrate, and lacking cohesive theoretical grounding. Observing the lack of a unifying theory, we propose a novel evaluative criterion entitled Generalised Explanation Faithfulness (GEF), which is centered on explanation-to-model alignment and integrates existing perturbation-based evaluations to eliminate the need for singular, task-specific evaluations. Complementing this unifying perspective, from a geometric point of view, we reveal a prevalent yet critical oversight in current evaluation practice: the failure to account for the learned geometry and non-linear mappings present in the model and explanation spaces. To address this, we propose GEF, a general-purpose, threshold-free faithfulness evaluator that incorporates principles from differential geometry and facilitates evaluation agnostically across tasks and interpretability approaches. Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-their-kind insights into the comparative performance of various interpretable methods, including local linear approximators, global feature-visualisation methods, large language models as post-hoc explainers, and sparse autoencoders. Our contributions are important to the interpretability and AI safety communities, offering a principled, unified approach to evaluation.