QUICK REVIEW

[Paper Review] Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

Yaru Hao, Li Dong | arXiv (Cornell University) | Apr 23, 2020

Adversarial Robustness in Machine Learning | References: 41 | Cited by: 23

One-Sentence Summary

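This paper proposes self-attention attribution (AttAttr), a method based on integrated gradients for interpreting how information interacts inside Transformer models such as BERT. The method identifies salient attention heads, builds hierarchical attribution trees to visualize compositional dependencies, enables effective head pruning, and generates adversarial triggers that severely degrade model performance, revealing the model's over-reliance on spurious patterns.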

ABSTRACT

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but they fail to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside Transformer. Finally, we show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.

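Research Motivation and Goals

To address the lack of interpretability in how input tokens interact through the self-attention mechanism in Transformer models.
To develop a method that explains not only the importance of individual tokens but also the compositional interactions between tokens.
To support model pruning by using attribution scores to identify the most influential attention heads.
To build hierarchical attribution trees that visualize how information flows across layers.
To discover and exploit adversarial patterns in the attribution scores in order to test model robustness.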

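Proposed Method

Proposes AttAttr, a self-attention attribution method based on integrated gradients that computes each attention head's contribution to the final prediction.
Applies integrated gradients to the attention weights to compute attribution scores that reflect how much each attention connection matters to the model's decision.
Uses the attribution scores to identify the most important attention heads in each layer, enabling structured pruning with minimal performance loss.
Designs a heuristic algorithm that extracts the most salient dependencies and builds an attribution tree to visualize the hierarchy of information flow across layers.
Uses the highest attribution scores to extract adversarial triggers: specific word patterns that, when inserted into an input, markedly reduce model accuracy.
Validates the method on BERT across several NLP datasets, with a quantitative analysis of how much the edges in the attribution tree contribute.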
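
As a rough illustration of the attribution step described above, the sketch below approximates integrated gradients over attention weights on a toy model. It assumes the usual integrated-gradients formulation with an all-zero attention baseline, Attr_h(A) = A_h ⊙ ∫₀¹ ∂F(αA)/∂A_h dα, approximated by a Riemann sum with m steps. All names and values here (toy_model, scaled, numerical_grad, attention_attribution, head_importance, A, V, w) are hypothetical and are not taken from the paper's code. A real implementation would run BERT with automatic differentiation rather than finite differences, and scoring a head by its largest attribution value is an assumption made for illustration.

```python
import copy
import math


def toy_model(A, V, w):
    """Hypothetical stand-in for a classifier: maps attention A[h][i][j] to a probability."""
    n = len(A[0])
    logit = 0.0
    for h, head in enumerate(A):
        # Context of token i is the attention-weighted sum of scalar values V[h][j].
        ctx = [sum(head[i][j] * V[h][j] for j in range(n)) for i in range(n)]
        # Mean-pool over tokens, squash, and weight each head by w[h].
        logit += w[h] * math.tanh(sum(ctx) / n)
    return 1.0 / (1.0 + math.exp(-logit))


def scaled(A, alpha):
    """Return alpha * A for a nested-list attention tensor."""
    return [[[alpha * a for a in row] for row in head] for head in A]


def numerical_grad(f, A, eps=1e-5):
    """Central-difference gradient of f with respect to every attention entry."""
    grad = [[[0.0] * len(row) for row in head] for head in A]
    for h, head in enumerate(A):
        for i, row in enumerate(head):
            for j in range(len(row)):
                plus, minus = copy.deepcopy(A), copy.deepcopy(A)
                plus[h][i][j] += eps
                minus[h][i][j] -= eps
                grad[h][i][j] = (f(plus) - f(minus)) / (2 * eps)
    return grad


def attention_attribution(f, A, m=20):
    """Riemann-sum integrated gradients from the all-zero baseline up to A."""
    attr = [[[0.0] * len(row) for row in head] for head in A]
    for k in range(1, m + 1):
        grad = numerical_grad(f, scaled(A, k / m))
        for h, head in enumerate(A):
            for i, row in enumerate(head):
                for j, a in enumerate(row):
                    attr[h][i][j] += a * grad[h][i][j] / m
    return attr


def head_importance(attr):
    """Assumed scoring rule: a head is as important as its largest attribution entry."""
    return [max(max(row) for row in head) for head in attr]


# Made-up inputs: 2 heads, 3 tokens; each attention row sums to 1.
A = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25], [0.2, 0.2, 0.6]],
]
V = [[1.0, -0.5, 0.3], [0.05, 0.02, -0.03]]
w = [2.0, 1.0]


def f(att):
    return toy_model(att, V, w)


attr = attention_attribution(f, A, m=20)
print(head_importance(attr))
```

Because integrated gradients satisfy completeness, the attribution entries should sum approximately to F(A) - F(0), which gives a quick sanity check on any implementation of this kind.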

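Experimental Results

Research Questions

RQ1: How can the interactions between input tokens in the Transformer self-attention mechanism be explained, beyond the saliency of individual tokens?
RQ2: To what extent do attention weights correlate with actual contributions to the model's prediction?
RQ3: Can attribution scores be used to identify and prune unimportant attention heads without a significant drop in performance?
RQ4: Can a hierarchical dependency structure (an attribution tree) be reconstructed that reflects the model's compositional reasoning?
RQ5: Can the most salient interaction patterns found by the attribution method be used to construct effective non-targeted adversarial attacks?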

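Key Findings

On MNLI, inserting the most damaging adversarial triggers ('with' and 'math') into the premise drops entailment accuracy from 82.87% to 0.8%, showing that the model is extremely fragile.
The top 3 adversarial triggers reduce average accuracy across all classes of MNLI and SST-2 by more than 40 percentage points, indicating that the model relies broadly on spurious patterns.
AttAttr-based pruning performs on par with the Taylor-expansion-based method, with only a minimal accuracy drop after pruning.
Across homogeneous tasks and datasets, the important attention heads in BERT are highly consistent, suggesting that they play stable functional roles.
Attribution trees built from AttAttr scores reveal a hierarchical information flow, showing how dependencies are composed across layers.
The method shows that attention weights alone do not reliably reflect contribution, because some high-weight connections contribute very little to the prediction.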
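
The pruning finding above can be pictured with a small sketch that turns per-head importance scores into a binary keep/drop mask. This is only an assumed procedure for illustration: the function name build_head_mask, the keep_ratio parameter, the global ranking across layers, and the example scores are all hypothetical, and the paper's exact pruning schedule may differ.

```python
def build_head_mask(importance, keep_ratio):
    """Map importance[layer][head] to a 0/1 mask that keeps the top-scoring heads.

    Hypothetical helper: heads are ranked globally across layers, and at
    least one head is always kept.
    """
    ranked = sorted(
        (
            (score, layer, head)
            for layer, row in enumerate(importance)
            for head, score in enumerate(row)
        ),
        reverse=True,
    )
    n_keep = max(1, round(keep_ratio * len(ranked)))
    mask = [[0] * len(row) for row in importance]
    for _, layer, head in ranked[:n_keep]:
        mask[layer][head] = 1
    return mask


# Made-up importance scores for 2 layers with 3 heads each.
importance = [[0.30, 0.02, 0.11], [0.05, 0.41, 0.01]]
mask = build_head_mask(importance, keep_ratio=0.5)
print(mask)  # [[1, 0, 1], [0, 1, 0]]
```

A mask of this shape (layers by heads) can then be applied to zero out the pruned heads when evaluating the model, so accuracy before and after pruning can be compared.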

This review was generated by AI and checked by human editors.