QUICK REVIEW

[Paper Review] GraphFramEx: Towards Systematic Evaluation of Explainability Methods for Graph Neural Networks

Kenza Amara, Rex Ying|arXiv (Cornell University)|Jun 20, 2022

Explainable Artificial Intelligence (XAI) | Cited by 25

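One-Sentence Summary

GraphFramEx proposes a systematic framework for evaluating GNN explainability with respect to user needs, introducing a fidelity-based characterization score and a top-k masking protocol for evaluation on both real and synthetic datasets.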

ABSTRACT

As one of the most popular machine learning models today, graph neural networks (GNNs) have attracted intense interest recently, and so does their explainability. Users are increasingly interested in a better understanding of GNN models and their outcomes. Unfortunately, today's evaluation frameworks for GNN explainability often rely on few inadequate synthetic datasets, leading to conclusions of limited scope due to a lack of complexity in the problem instances. As GNN models are deployed to more mission-critical applications, we are in dire need for a common evaluation protocol of explainability methods of GNNs. In this paper, we propose, to our best knowledge, the first systematic evaluation framework for GNN explainability, considering explainability on three different "user needs". We propose a unique metric that combines the fidelity measures and classifies explanations based on their quality of being sufficient or necessary. We scope ourselves to node classification tasks and compare the most representative techniques in the field of input-level explainability for GNNs. For the inadequate but widely used synthetic benchmarks, surprisingly shallow techniques such as personalized PageRank have the best performance for a minimum computation time. But when the graph structure is more complex and nodes have meaningful features, gradient-based methods are the best according to our evaluation criteria. However, none dominates the others on all evaluation dimensions and there is always a trade-off. We further apply our evaluation protocol in a case study for frauds explanation on eBay transaction graphs to reflect the production environment.

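Motivation and Objectives

Motivates a common evaluation protocol for GNN explainability, since existing benchmarks are varied yet limited.
Defines several user-oriented objectives (explaining the phenomenon vs. explaining the model) and explanation mask types (hard vs. soft masks).
Introduces a ground-truth-free evaluation that uses fidelity measures to distinguish necessary explanations from sufficient ones.
Proposes a characterization score that jointly evaluates necessity and sufficiency.
Applies the framework to real and synthetic datasets, and demonstrates its relevance to production settings through an eBay fraud case study.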

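Proposed Method

Model-agnostic and model-aware post-hoc explanation methods are evaluated on node classification tasks.
Explanatory subgraphs are defined by edge masks and node-feature masks applied element-wise (A_S = M_E ⊙ A, X_S = M_NF ⊙ X).
Explanations are classified as necessary, sufficient, or both, based on two fidelity measures (Fid+ and Fid−).
The two fidelity measures are combined into a single characterization score (charact), the weighted harmonic mean of Fid+ and (1 − Fid−).
A top-k masking strategy fixes the explanation size at k edges so that methods can be compared fairly.
A decision-tree-style guide (GraphFramEx) helps choose an explanation method based on the user's objective and the model's accuracy.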
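
The masking and scoring steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default weights of 0.5, and the accuracy-based form of the fidelity measure are assumptions made for illustration. The sketch assumes Fid+ is computed from predictions made after the explanation is removed from the graph, and Fid− from predictions made on the explanation alone, so a high Fid+ suggests a necessary explanation and a low Fid− a sufficient one.

```python
from typing import List, Sequence


def top_k_hard_mask(scores: Sequence[float], k: int) -> List[int]:
    """Keep the k highest-scoring edges (1) and drop the rest (0)."""
    if k <= 0:
        return [0] * len(scores)
    # Rank edge indices by descending importance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def apply_mask(values: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Element-wise product, mirroring A_S = M_E ⊙ A on a flat edge list."""
    return [v * m for v, m in zip(values, mask)]


def fidelity(labels: Sequence[int],
             full_preds: Sequence[int],
             masked_preds: Sequence[int]) -> float:
    """Assumed accuracy-based form: mean |1(full == y) - 1(masked == y)|."""
    diffs = [
        abs(int(f == y) - int(m == y))
        for y, f, m in zip(labels, full_preds, masked_preds)
    ]
    return sum(diffs) / len(diffs)


def characterization(fid_plus: float, fid_minus: float,
                     w_plus: float = 0.5, w_minus: float = 0.5) -> float:
    """Weighted harmonic mean of Fid+ and (1 - Fid-)."""
    sufficiency = 1.0 - fid_minus
    # A harmonic mean is zero as soon as either term is zero.
    if fid_plus <= 0.0 or sufficiency <= 0.0:
        return 0.0
    return (w_plus + w_minus) / (w_plus / fid_plus + w_minus / sufficiency)


if __name__ == "__main__":
    edge_scores = [0.1, 0.9, 0.4, 0.7]        # hypothetical soft mask
    mask = top_k_hard_mask(edge_scores, k=2)  # fixed explanation size
    print(mask, apply_mask([1.0, 1.0, 1.0, 1.0], mask))
    print(characterization(fid_plus=0.8, fid_minus=0.2))
```

Because the score is a harmonic mean, an explanation has to do well on both necessity and sufficiency to score highly; a method that is strong on only one of the two is pulled toward the weaker value.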

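Experimental Results

Research Questions

RQ1: How do existing GNN explanation methods compare under a unified, ground-truth-free evaluation framework?
RQ2: How do the explanation focus (phenomenon vs. model) and the mask type (hard vs. soft) affect the evaluation results?
RQ3: Can a single characterization score meaningfully balance necessary and sufficient explanations across datasets?
RQ4: How do explanation methods perform in a production-like setting (such as the eBay fraud graph) compared with synthetic benchmarks?
RQ5: What is the trade-off between explanation quality and computation time across methods?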

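Key Findings

No single explanation method dominates on all evaluation dimensions; there is always a trade-off.
Shallow methods such as PageRank perform best, at minimal computation time, on the inadequate type-1 synthetic benchmarks, whereas gradient-based methods excel when graphs are more complex and node features are meaningful.
Saliency gives the strongest overall characterization on real datasets, particularly for necessary explanations, while Occlusion, Grad-CAM, and PageRank are better suited to sufficient explanations.
Most methods provide good sufficient explanations, but few provide strong necessary explanations; Saliency, Distance, and Occlusion stand out in this respect.
GNNExplainer performs well at explaining fraudulent nodes on the production eBay graph, where perturbation-based methods often outperform the others.
Method rankings are inconsistent between synthetic benchmarks and real data, highlighting the limitations of type-1 synthetic datasets.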

This review was generated by AI and reviewed by human editors.