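[Paper Digest] Can I Trust the Explainer? Verifying Post-hoc Explanatory Methods
This paper distinguishes feature-additive from feature-selection explanations, proposes an automatic verification framework for the feature-selection perspective built on a non-trivial neural model, demonstrates failure modes of popular explainers, and provides an open evaluation test.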
For AI systems to garner widespread public acceptance, we must develop methods capable of explaining the decisions of black-box models such as neural networks. In this work, we identify two issues of current explanatory methods. First, we show that two prevalent perspectives on explanations, feature-additivity and feature-selection, lead to fundamentally different instance-wise explanations. In the literature, explainers from different perspectives are currently being directly compared, despite their distinct explanation goals. The second issue is that current post-hoc explainers are either validated under simplistic scenarios (on simple models such as linear regression, or on models trained on synthetic datasets), or, when applied to real-world neural networks, explainers are commonly validated under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. Our framework is based on a non-trivial neural network architecture trained on a real-world task, and for which we are able to provide guarantees on its inner workings. We validate the efficacy of our evaluation by showing the failure modes of current explainers. We aim for this framework to provide a publicly available, off-the-shelf evaluation when the feature-selection perspective on explanations is needed.
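Research Motivation and Goals
- Highlight the fundamental difference between the feature-additive and feature-selection perspectives on instance-wise explanations.
- Propose an automatic verification framework that evaluates feature-selection explainers on a non-trivial neural model while providing guarantees about the target model's behavior.
- Demonstrate the failure modes of popular explainers (LIME, SHAP, L2X) on a real-world task and provide an open-source evaluation test.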
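To make the gap between the two perspectives concrete, here is a minimal sketch on a hypothetical two-token model (the model, its scores, and all function names are illustrative assumptions, not taken from the paper). It takes "feature-additive" to mean exact Shapley values and "feature-selection" to mean the smallest token subset that reproduces the full prediction; the digest does not spell out either definition, so treat both as assumptions.

```python
from itertools import combinations, permutations
from math import isclose


def toy_model(present: frozenset) -> float:
    """Hypothetical sentiment scorer: 'nice' dominates 'good' when both appear."""
    if "nice" in present:
        return 0.7
    if "good" in present:
        return 0.6
    return 0.0


def shapley_values(tokens, model):
    """Exact Shapley values: average marginal contribution over all token orders."""
    totals = {t: 0.0 for t in tokens}
    orders = list(permutations(tokens))
    for order in orders:
        seen = set()
        for t in order:
            before = model(frozenset(seen))
            seen.add(t)
            totals[t] += model(frozenset(seen)) - before
    return {t: total / len(orders) for t, total in totals.items()}


def minimal_sufficient_subset(tokens, model):
    """Smallest subset of tokens whose prediction matches the full-input prediction."""
    full = model(frozenset(tokens))
    for size in range(len(tokens) + 1):
        for subset in combinations(tokens, size):
            if isclose(model(frozenset(subset)), full):
                return set(subset)
    return set(tokens)


tokens = ["good", "nice"]
attributions = shapley_values(tokens, toy_model)   # "good" gets about 0.3, "nice" about 0.4
selected = minimal_sufficient_subset(tokens, toy_model)  # only "nice" is needed
print(attributions, selected)
```

Under these assumptions, the additive view gives "good" a sizeable share of the prediction, while the selection view drops it entirely because "nice" alone reproduces the output. Neither answer is wrong; they answer different questions, which is why comparing explainers across perspectives can mislead.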
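Proposed Method
- Use an RCNN-based model trained on a real-world beer review task (BeerAdvocate) to create a dataset in which zero-contribution tokens and clearly relevant tokens can be identified.
- Prune the data to eliminate handshake cases and ensure every instance has at least one clearly relevant token, yielding the partition S_x = SR_x ∪ SDK_x, with N_x as the zero-contribution tokens.
- Define evaluation metrics that penalize ranking zero-contribution tokens above clearly relevant tokens: %_first, %_misrnk, and avg_misrnk.
- Evaluate three explainers (LIME and SHAP, feature-additive; L2X, feature-selection) on these metrics across the three aspects (appearance, aroma, palate).
- Analyze why LIME and SHAP perform better than L2X, and discuss the framework's limitations (it is not a universally applicable ground-truth test).
- Release an off-the-shelf evaluation test and discuss how it generalizes to other tasks, such as computer vision.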
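The three metrics above can be sketched as follows. The digest only says they penalize zero-contribution tokens ranked above clearly relevant ones, so the exact definitions here are assumptions: %_first is read as the share of instances whose top-ranked token is zero-contribution, %_misrnk as the share of instances with at least one such misranking, and avg_misrnk as the mean number of zero-contribution tokens ranked above at least one clearly relevant token. The `Instance` container and function names are hypothetical, and the sketch assumes the explainer ranks every token.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Set


@dataclass
class Instance:
    ranking: List[int]     # token indices ordered by the explainer, most important first
    relevant: Set[int]     # indices of clearly relevant tokens (SR_x)
    irrelevant: Set[int]   # indices of zero-contribution tokens (N_x)


def misranked_count(inst: Instance) -> int:
    """Count zero-contribution tokens ranked above at least one clearly relevant token."""
    relevant_positions = [i for i, tok in enumerate(inst.ranking) if tok in inst.relevant]
    if not relevant_positions:
        # No relevant token was ranked at all: every ranked zero-contribution token counts.
        return sum(1 for tok in inst.ranking if tok in inst.irrelevant)
    lowest_relevant = max(relevant_positions)
    return sum(1 for tok in inst.ranking[:lowest_relevant] if tok in inst.irrelevant)


def evaluate(instances: List[Instance]) -> Dict[str, float]:
    """Compute the three metrics (percentages in 0-100) over a list of instances."""
    n = len(instances)
    first_errors = sum(1 for inst in instances if inst.ranking[0] in inst.irrelevant)
    counts = [misranked_count(inst) for inst in instances]
    return {
        "%_first": 100.0 * first_errors / n,
        "%_misrnk": 100.0 * sum(1 for c in counts if c > 0) / n,
        "avg_misrnk": mean(counts),
    }


# Two made-up instances: the first puts a zero-contribution token on top, the second does not.
demo = [
    Instance(ranking=[2, 0, 1, 3], relevant={0}, irrelevant={2, 3}),
    Instance(ranking=[0, 1, 2], relevant={0}, irrelevant={2}),
]
scores = evaluate(demo)
print(scores)
```

Tokens in SDK_x are deliberately ignored in this sketch: they are neither rewarded nor penalized, which matches the idea that only zero-contribution tokens outranking clearly relevant ones count as provable errors.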
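Experimental Results
Research Questions
- RQ1: How do feature-additive and feature-selection explanations differ in the instance-wise explanations they produce and in how they behave under evaluation?
- RQ2: Can we use a non-trivial neural network, with guarantees about the target model's behavior, to automatically verify the trustworthiness of feature-selection explanations?
- RQ3: What failure modes do popular explainers (LIME, SHAP, L2X) exhibit when evaluated under a strict feature-selection framework on a real-world task?
Key Findings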
| Aspect | Explainer | %_first | %_misrnk | avg_misrnk |
|---|---|---|---|---|
| APPEARANCE | LIME | 4.24 | 24.39 | 7.02 (24.12) |
| APPEARANCE | SHAP | 4.74 | 16.81 | 1.16 (7.75) |
| APPEARANCE | L2X | 6.58 | 28.85 | 3.54 (12.66) |
| AROMA | LIME | 14.79 | 32.08 | 12.74 (33.54) |
| AROMA | SHAP | 4.24 | 13.53 | 0.83 (7.10) |
| AROMA | L2X | 12.95 | 31.61 | 4.41 (16.25) |
| PALATE | LIME | 2.92 | 13.93 | 3.48 (17.38) |
| PALATE | SHAP | 2.65 | 9.20 | 9.25 (9.70) |
| PALATE | L2X | 12.77 | 29.83 | 3.70 (13.05) |
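- LIME and SHAP often rank zero-contribution tokens above clearly relevant tokens, which indicates failure modes under the feature-selection perspective.
- L2X often ranks several zero-contribution tokens above clearly relevant ones, especially when K (the preset number of features) does not match the task; performance degrades when K is set to the average taken from human annotations.
- Overall, across the tested aspects, LIME and SHAP achieve lower error rates on most metrics, while L2X shows more ranking errors in several settings (in the table, LIME on aroma is an exception, scoring worse than L2X on all three metrics).
- The evaluation framework automatically exposes key failures without assuming how the model actually reasons, and it can be adapted to other domains.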
This digest was generated by AI and reviewed by human editors.