QUICK REVIEW

[Paper Review] The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

Peter Hase, Harry Xie|arXiv (Cornell University)|Jun 1, 2021

Explainable Artificial Intelligence (XAI)|Cited by 23

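One-Sentence Summary

This paper argues that feature importance (FI) explanations in NLP are socially misaligned: the counterfactual inputs created by removing or replacing features are out-of-distribution (OOD) for the model, so the model prior and random weight initialization influence the explanations in unintended ways. To address this, the authors propose training models on counterfactual inputs so that the inputs seen at explanation time match the training distribution. They also introduce a Parallel Local Search (PLS) method that, across six text classification datasets, improves on the second-best method by up to 5.4 points on Sufficiency and up to 17 points on Comprehensiveness.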

ABSTRACT

Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time. For example, in the standard Sufficiency metric, only the top-k most important tokens are kept. In this paper, we study several under-explored dimensions of FI explanations, providing conceptual and empirical improvements for this form of explanation. First, we advance a new argument for why it can be problematic to remove features from an input when creating or evaluating explanations: the fact that these counterfactual inputs are out-of-distribution (OOD) to models implies that the resulting explanations are socially misaligned. The crux of the problem is that the model prior and random weight initialization influence the explanations (and explanation metrics) in unintended ways. To resolve this issue, we propose a simple alteration to the model training process, which results in more socially aligned explanations and metrics. Second, we compare among five approaches for removing features from model inputs. We find that some methods produce more OOD counterfactuals than others, and we make recommendations for selecting a feature-replacement function. Finally, we introduce four search-based methods for identifying FI explanations and compare them to strong baselines, including LIME, Anchors, and Integrated Gradients. Through experiments with six diverse text classification datasets, we find that the only method that consistently outperforms random search is a Parallel Local Search (PLS) that we introduce. Improvements over the second-best method are as large as 5.4 points for Sufficiency and 17 points for Comprehensiveness. All supporting code for experiments in this paper is publicly available at https://github.com/peterbhase/ExplanationSearch.

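Research Motivation and Objectives

Identify and address the social misalignment of feature importance (FI) explanations that arises because the counterfactual inputs used to create and evaluate explanations are out-of-distribution (OOD).
Evaluate and compare feature-replacement functions (Replace functions) by how OOD their counterfactual inputs are to the model, and by their effect on explanation quality and metric reliability.
Design and evaluate new search-based explanation methods that identify high-quality FI explanations and outperform existing baselines such as LIME, Anchors, and Integrated Gradients.
Propose a training-time intervention, exposing the model during training to the counterfactual inputs used for explanation, so that counterfactual inputs at inference time are in-distribution and explanations depend less on the model prior and weight initialization.
Empirically validate the proposed methods on six diverse text classification datasets using standard metrics such as Sufficiency and Comprehensiveness.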

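Proposed Method

Propose a new training procedure: fine-tune the model on counterfactual inputs of the kind produced when explaining it (for example, inputs in which the top-k features are replaced with a special token), so that counterfactual inputs at inference time are in-distribution.
Introduce a new search-based explanation method, Parallel Local Search (PLS), which uses a local search heuristic to explore multiple candidate explanations in parallel and optimize the explanation objective (for example, Sufficiency).
Systematically compare five Replace functions: (1) removing tokens entirely, (2) replacing them with zero embeddings, (3) replacing them with a special [Mask] token, (4) marginalizing over counterfactuals, and (5) modifying the attention mask instead of the input text.
Define the Sufficiency metric as the drop in model confidence after features are replaced: $\textrm{Suff}(f,x,e) = f(x)_{\hat{y}} - f(\texttt{Replace}(x,e))_{\hat{y}}$, where $\hat{y}$ is the model's original prediction and $\texttt{Replace}(x,e)$ keeps only the top-k tokens selected by the explanation $e$.
Use the number of forward and backward passes as the compute budget for a fair comparison among search methods, with wall-clock time as a secondary benchmark.
Evaluate all methods on six text classification datasets (including FEVER and SNLI) using both Sufficiency and Comprehensiveness, with ablations over the Replace function and the training-time intervention.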
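The Sufficiency formula above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `toy_model` is a hypothetical bag-of-words classifier rather than a model from the paper, `replace_with_mask` and `slice_out` are simplified stand-ins for two of the five Replace functions, and Comprehensiveness (not defined in this summary) follows the standard convention of removing the selected tokens instead of keeping them.

```python
import math
from typing import Callable, List, Sequence

MASK = "[MASK]"  # placeholder; the real mask token depends on the tokenizer


def replace_with_mask(tokens: Sequence[str], keep: Sequence[bool]) -> List[str]:
    """Keep tokens where keep[i] is True; replace the rest with MASK."""
    return [t if k else MASK for t, k in zip(tokens, keep)]


def slice_out(tokens: Sequence[str], keep: Sequence[bool]) -> List[str]:
    """Alternative Replace function: drop unselected tokens entirely."""
    return [t for t, k in zip(tokens, keep) if k]


def sufficiency(model: Callable, tokens, e, label, replace=replace_with_mask) -> float:
    """Suff = f(x)_y - f(Replace(x, e))_y, keeping only tokens selected by e."""
    return model(tokens)[label] - model(replace(tokens, e))[label]


def comprehensiveness(model: Callable, tokens, e, label, replace=replace_with_mask) -> float:
    """Same difference, but the tokens selected by e are the ones removed."""
    removed = [not k for k in e]
    return model(tokens)[label] - model(replace(tokens, removed))[label]


def toy_model(tokens: Sequence[str]) -> List[float]:
    """Hypothetical sentiment classifier returning [p_negative, p_positive]."""
    weights = {"great": 2.0, "awful": -2.0, "fine": 0.5}
    score = sum(weights.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


tokens = ["the", "movie", "was", "great"]
explanation = [False, False, False, True]  # selects only "great"
label = 1  # the toy model's predicted class for this input

# Keeping only "great" preserves the confidence, so Sufficiency is 0.0 here.
print(sufficiency(toy_model, tokens, explanation, label))
# Removing "great" drops the confidence to 0.5, so this prints about 0.38.
print(comprehensiveness(toy_model, tokens, explanation, label))
```

Under this definition a lower Sufficiency value means the kept tokens alone preserve the model's confidence, while a higher Comprehensiveness value means the removed tokens mattered. Passing `replace=slice_out` swaps in the removal-based Replace function without changing the metric code.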

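Experimental Results

Research Questions

RQ1: How does the out-of-distribution (OOD) nature of the counterfactual inputs used in FI explanation evaluation lead to socially misaligned explanations that are influenced by the model prior and random weight initialization?
RQ2: Which feature-replacement functions (Replace functions) produce counterfactual inputs that are least OOD to the model, and how do they affect explanation quality and metric reliability?
RQ3: Can search-based methods identify higher-quality feature importance explanations than established baselines such as LIME, Anchors, and Integrated Gradients?
RQ4: Does fine-tuning the model on counterfactual inputs during training improve the robustness and social alignment of explanations at inference time?
RQ5: How do the search algorithms compare with one another, and does the proposed Parallel Local Search (PLS) consistently outperform the other methods across datasets and metrics?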

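Key Findings

Training the model on counterfactual inputs substantially reduces how OOD those inputs are at explanation time, yielding more robust and socially aligned explanations.
Parallel Local Search (PLS) is the only method that consistently outperforms random search; across six text classification datasets, its improvement over the second-best method is as large as 5.4 points on Sufficiency and 17 points on Comprehensiveness.
Among the five Replace functions evaluated, replacing with a special [Mask] token or modifying the attention mask produces counterfactual inputs that are less OOD than zero embeddings or complete removal, so these two are recommended for more reliable metrics.
The choice of Replace function significantly affects explanation quality: methods that better preserve input structure and semantic consistency yield more reliable and interpretable explanations.
Models fine-tuned on counterfactual inputs are more robust to feature removal and depend less on the model prior and weight initialization, which supports the training-time intervention.
Despite theoretical concerns about the optimization objective, PLS does not perform better because it solves a different problem; it performs better because it explores the search space effectively under the correct metric.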
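To make the idea of exploring the search space with several local searches more concrete, here is a minimal Python sketch of a parallel hill-climbing search over fixed-size token masks. It is a generic illustration under stated assumptions, not the authors' PLS implementation: this summary does not give the algorithm's details, and the function names, the swap-based neighbourhood, and the toy objective are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = token selected by the explanation, 0 = not selected


def random_mask(n: int, k: int, rng: random.Random) -> Mask:
    """Random mask of length n with exactly k selected positions."""
    chosen = set(rng.sample(range(n), k))
    return tuple(1 if i in chosen else 0 for i in range(n))


def neighbor(mask: Mask, rng: random.Random) -> Mask:
    """Swap one selected position with one unselected position (k stays fixed)."""
    ones = [i for i, m in enumerate(mask) if m]
    zeros = [i for i, m in enumerate(mask) if not m]
    if not ones or not zeros:
        return mask
    out = list(mask)
    out[rng.choice(ones)] = 0
    out[rng.choice(zeros)] = 1
    return tuple(out)


def parallel_local_search(
    objective: Callable[[Mask], float],
    n: int,
    k: int,
    n_chains: int = 4,
    n_steps: int = 200,
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Minimize objective over masks with k ones using n_chains hill climbers."""
    rng = random.Random(seed)
    chains: List[Mask] = [random_mask(n, k, rng) for _ in range(n_chains)]
    scores = [objective(m) for m in chains]
    for _ in range(n_steps):
        proposals = [neighbor(m, rng) for m in chains]
        # With a neural model, these proposals could be scored in one batch.
        new_scores = [objective(p) for p in proposals]
        for i, (p, s) in enumerate(zip(proposals, new_scores)):
            if s < scores[i]:  # accept only strict improvements
                chains[i], scores[i] = p, s
    best = min(range(n_chains), key=lambda i: scores[i])
    return chains[best], scores[best]


# Toy objective (assumption): each token has a fixed importance, and a lower
# score is better, mimicking a metric such as Sufficiency that is minimized.
importance = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3]


def toy_objective(mask: Mask) -> float:
    return -sum(w * m for w, m in zip(importance, mask))


best_mask, best_score = parallel_local_search(toy_objective, n=len(importance), k=2)
print(best_mask, best_score)
```

To maximize a metric such as Comprehensiveness instead, the objective can simply be negated. In the sketch, each objective call stands in for one forward pass, so `n_chains * (n_steps + 1)` corresponds to the kind of compute budget the summary says is used to compare search methods.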


This review was generated by AI and checked by a human editor.