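[Paper Explained] Self-Explaining Structures Improve NLP Models
This paper proposes a self-explaining neural framework that improves both the interpretability and the performance of NLP models by adding an interpretation layer on top of any existing model. The layer assigns a learnable weight to every text span (such as a phrase or sentence), which yields direct, high-level saliency scores without an external probing model. The model reports a new SOTA result of 59.1 on SST-5 and a new SOTA result of 92.3 on SNLI.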
Existing approaches to explaining deep learning models in NLP usually suffer from two major drawbacks: (1) the main model and the explaining model are decoupled: an additional probing or surrogate model is used to interpret an existing model, and thus existing explaining tools are not self-explainable; (2) the probing model can explain a model's predictions only by operating on low-level features, computing saliency scores for individual words, and is clumsy at high-level text units such as phrases, sentences, or paragraphs. To deal with these two issues, in this paper, we propose a simple yet general and effective self-explaining framework for deep learning models in NLP. The key point of the proposed framework is to put an additional layer, called the interpretation layer, on top of any existing NLP model. This layer aggregates the information for each text span, which is then associated with a specific weight, and their weighted combination is fed to the softmax function for the final prediction. The proposed model comes with the following merits: (1) span weights make the model self-explainable and do not require an additional probing model for interpretation; (2) the proposed model is general and can be adapted to any existing deep learning structures in NLP; (3) the weight associated with each text span provides direct importance scores for higher-level text units such as phrases and sentences. We show for the first time that interpretability does not come at the cost of performance: a neural model with self-explaining features obtains better performance than its counterpart without the self-explaining nature, achieving a new SOTA performance of 59.1 on SST-5 and a new SOTA performance of 92.3 on SNLI.
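Research Motivation and Goals
- Address the lack of self-explainability in existing NLP models, which rely on a separate probing or surrogate model for interpretation.
- Overcome the limits of word-level saliency methods, which cannot capture the semantic composition of higher-level text units such as phrases and sentences.
- Develop a general framework that delivers precise, interpretable span-level explanations while also improving model performance.
- Show that interpretability and performance are not mutually exclusive and can be improved together through architectural design.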
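Proposed Method
- Add an interpretation layer on top of any existing pretrained NLP model; the layer computes attention weights over all possible text spans (O(n²) spans for n tokens).
- Associate each text span with a learnable weight that reflects its contribution to the final prediction, which makes the explanation direct.
- Feed the weighted sum of span representations to a softmax layer for the final classification, so that interpretation is part of the main prediction path.
- Train the interpretation layer jointly with the main model end to end, with no additional probing model.
- Use the span-level attention weights to produce saliency scores for phrases, sentences, or paragraphs, giving high-level interpretability.
- Apply the framework to adversarial example generation by replacing the most salient span with a paraphrase, which yields effective attacks.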
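The steps listed above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the span representation used here (the mean of the two endpoint vectors), the scoring vector `score_vec`, the classifier matrix `class_weights`, and the random "hidden states" are all assumptions made for illustration. The sketch only shows the flow the summary describes: enumerate all spans, score each one, normalize the scores into span weights, and classify from the weighted sum.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def enumerate_spans(n):
    """All inclusive spans (i, j) with 0 <= i <= j < n; there are n(n+1)/2."""
    return [(i, j) for i in range(n) for j in range(i, n)]

def span_representation(hidden, i, j):
    # ASSUMPTION: mean of the two endpoint vectors. This is a placeholder,
    # not the span function used in the paper.
    return [(a + b) / 2.0 for a, b in zip(hidden[i], hidden[j])]

def interpretation_layer(hidden, score_vec, class_weights):
    """Return (spans, span weights, class probabilities)."""
    spans = enumerate_spans(len(hidden))
    reps = [span_representation(hidden, i, j) for i, j in spans]
    # One scalar score per span, normalized into weights that sum to 1.
    alphas = softmax([dot(score_vec, r) for r in reps])
    # Weighted sum of span representations feeds the final softmax classifier.
    dim = len(hidden[0])
    pooled = [sum(a * r[d] for a, r in zip(alphas, reps)) for d in range(dim)]
    probs = softmax([dot(w, pooled) for w in class_weights])
    return spans, alphas, probs

# Toy inputs: random vectors stand in for the encoder's token outputs.
random.seed(0)
tokens = ["not", "a", "bad", "movie"]
dim, num_classes = 8, 5
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in tokens]
score_vec = [random.gauss(0, 1) for _ in range(dim)]
class_weights = [[random.gauss(0, 1) for _ in range(dim)]
                 for _ in range(num_classes)]

spans, alphas, probs = interpretation_layer(hidden, score_vec, class_weights)
top = max(range(len(spans)), key=lambda k: alphas[k])
i, j = spans[top]
print("most salient span:", " ".join(tokens[i:j + 1]))
```

In a trained model, `score_vec` and `class_weights` would be learned jointly with the encoder, and the span weights `alphas` could then be read off directly as saliency scores. Because the number of spans grows as n(n+1)/2, a 4-token input yields 10 spans and a 100-token input yields 5,050.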
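Experimental Results
Research Questions
- RQ1: Can a self-explaining NLP model be designed without relying on an external probing model?
- RQ2: Can interpretability at the phrase and sentence level be achieved more effectively than with word-level saliency methods?
- RQ3: Does introducing a self-explaining mechanism lower or raise model performance?
- RQ4: Can span-level attention be used to generate more effective adversarial examples for NLP?
- RQ5: How does a self-explaining model reveal failure modes in its predictions, such as attending to irrelevant clauses, missing sentiment shifts, or misreading irony?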
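Key Findings
- The proposed self-explaining framework reaches a new SOTA score of 59.1 on the SST-5 sentiment classification benchmark.
- The model also reaches a new SOTA score of 92.3 on the SNLI natural language inference dataset, indicating improved generalization.
- The interpretation layer provides direct, high-level saliency scores for phrases and sentences, which supports clearer error analysis than word-level methods.
- Replacing the most salient span with a paraphrase produces effective adversarial examples, lowering model accuracy by 84% on IMDB and by 48.86% on Yahoo! Answers.
- Error analysis shows that the model often attends to the irrelevant clause in a contrastive construction, fails to detect sentiment shifts, and misreads irony or analogy.
- The self-explaining mechanism does not hurt performance and in fact improves it, showing that interpretability and accuracy can coexist in NLP models.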
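The span-replacement attack mentioned in the findings can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the span weights and the paraphrase table are hypothetical hand-written values, whereas a real attack would take the weights from the trained interpretation layer, draw paraphrases from a paraphrase model or lexicon, and check that the victim model's prediction actually changes.

```python
def most_salient_span(span_weights):
    """Return the (start, end) span with the largest weight."""
    return max(span_weights, key=span_weights.get)

def replace_span(tokens, span, paraphrase_tokens):
    """Swap the inclusive token span (start, end) for the paraphrase tokens."""
    start, end = span
    return tokens[:start] + paraphrase_tokens + tokens[end + 1:]

tokens = ["the", "plot", "is", "utterly", "brilliant"]

# HYPOTHETICAL span weights; in practice these come from the interpretation layer.
span_weights = {(0, 1): 0.10, (2, 2): 0.20, (3, 4): 0.70}

# HYPOTHETICAL paraphrase table keyed by the original span text.
paraphrases = {("utterly", "brilliant"): ["really", "quite", "clever"]}

span = most_salient_span(span_weights)
key = tuple(tokens[span[0]:span[1] + 1])
adversarial = replace_span(tokens, span, paraphrases[key])
print(" ".join(adversarial))
```

Targeting the highest-weight span concentrates the edit where the model's decision is most sensitive, which is why span-level weights are a natural guide for this kind of attack.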
This explanation was generated by AI and reviewed by a human editor.