QUICK REVIEW

[Paper Review] Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy, Ellie Pavlick|arXiv (Cornell University)|Feb 4, 2019

Topic Modeling | References: 47 | Cited by: 88

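One-Sentence Summary

This paper introduces the HANS dataset for diagnosing syntactic heuristics in NLI, shows that state-of-the-art models rely on these fallible heuristics and perform poorly on HANS, and demonstrates that augmenting the training data with HANS-like examples reduces reliance on the heuristics.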

ABSTRACT

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

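Research Motivation and Objectives

Motivate and diagnose the use of shallow syntactic heuristics in natural language inference (NLI).
Introduce HANS (Heuristic Analysis for NLI Systems) as a targeted test of these heuristics.
Evaluate leading NLI models on HANS to assess how much they rely on the heuristics.
Show that augmenting training with HANS-like examples reduces heuristic-induced failures.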

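Proposed Method

Define three fallible syntactic heuristics: lexical overlap, subsequence, and constituent.
Construct HANS by generating 10,000 examples for each heuristic (from 30 templates in total across the heuristics) while controlling for plausibility.
Evaluate four popular NLI models trained on MNLI (DA, ESIM, SPINN, BERT) on HANS.
Label every HANS example as entailment or non-entailment, so that heuristic-driven predictions can be tested.
Assess whether augmenting MNLI with HANS-like examples improves performance on HANS and on related structure-dependent tasks.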
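
The two heuristics that depend only on token identity and order can be sketched as simple predicates, together with a toy template in the spirit of the HANS construction. This is a minimal illustration, not the authors' code: the function names, the crude tokenizer, and the passive-voice template are assumptions made for this sketch, and the constituent heuristic is left out because it requires a parse tree.

```python
import itertools
import re


def tokenize(sentence):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())


def lexical_overlap(premise, hypothesis):
    """True if every hypothesis word also occurs somewhere in the premise."""
    return set(tokenize(hypothesis)) <= set(tokenize(premise))


def subsequence(premise, hypothesis):
    """True if the hypothesis is a contiguous token span of the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))


def fill_passive_template(nouns, verbs):
    """Yield (premise, hypothesis, label) triples from one hypothetical template.

    Premise:    "The N1 was V by the N2."
    Hypothesis: "The N1 V the N2."
    The agent and patient are swapped, so the gold label is non-entailment,
    even though the lexical overlap heuristic would predict entailment.
    """
    for n1, n2 in itertools.permutations(nouns, 2):
        for v in verbs:
            premise = f"The {n1} was {v} by the {n2}."
            hypothesis = f"The {n1} {v} the {n2}."
            yield premise, hypothesis, "non-entailment"


if __name__ == "__main__":
    for premise, hypothesis, label in fill_passive_template(
        ["doctor", "actor"], ["paid", "helped"]
    ):
        # Each generated pair triggers lexical overlap but not subsequence.
        print(premise, "|", hypothesis, "|", label,
              "| overlap:", lexical_overlap(premise, hypothesis),
              "| subsequence:", subsequence(premise, hypothesis))
```

Every pair produced by this template satisfies the lexical overlap predicate while its gold label is non-entailment, which is the kind of case where a model that has adopted the heuristic would be expected to fail.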

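Experimental Results

Research Questions

RQ1: Do NLI models adopt the proposed syntactic heuristics in practice?
RQ2: How do popular models perform on the HANS subsets designed to test each heuristic?
RQ3: Does training on HANS-like examples reduce reliance on these heuristics without hurting MNLI performance?
RQ4: What are the relative contributions of model architecture and training data to heuristic susceptibility?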

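Main Findings

All four models perform well on MNLI but fail on HANS, because the heuristics lead them to predict entailment incorrectly (accuracy on non-entailment cases is near or below chance level).
DA and ESIM achieve almost no accuracy on any heuristic subset, suggesting that they rely on lexical overlap regardless of word order.
SPINN does relatively better on the subsequence and constituent cases, suggesting some structural benefit from tree-based representations, though it is not robust across the board.
BERT outperforms the other models on the constituent and lexical overlap cases but is still far from perfect on HANS.
Augmenting MNLI with HANS-like examples substantially improves every model's performance on HANS, although the effect varies by architecture; the impact on MNLI performance is mixed and depends on the model.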
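
A per-heuristic, per-label accuracy breakdown of the kind these findings describe could be computed along the following lines. This is a sketch under stated assumptions: the record fields (`heuristic`, `gold`, `pred`) and the function names are hypothetical, and the code assumes that a three-way MNLI prediction is collapsed to HANS's two labels by treating both neutral and contradiction as non-entailment. The records in the usage section are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def to_two_way(label):
    """Collapse a three-way NLI label into entailment vs. non-entailment."""
    return "entailment" if label == "entailment" else "non-entailment"


def accuracy_breakdown(records):
    """Return accuracy keyed by (heuristic, gold label).

    Each record is a dict with hypothetical keys:
      "heuristic": which heuristic the example targets
      "gold":      "entailment" or "non-entailment"
      "pred":      the model's three-way (or two-way) prediction
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["heuristic"], record["gold"])
        total[key] += 1
        correct[key] += int(to_two_way(record["pred"]) == record["gold"])
    return {key: correct[key] / total[key] for key in total}


if __name__ == "__main__":
    # Made-up predictions from an imaginary model that almost always says "entailment".
    toy_records = [
        {"heuristic": "lexical_overlap", "gold": "entailment", "pred": "entailment"},
        {"heuristic": "lexical_overlap", "gold": "non-entailment", "pred": "entailment"},
        {"heuristic": "lexical_overlap", "gold": "non-entailment", "pred": "neutral"},
        {"heuristic": "subsequence", "gold": "non-entailment", "pred": "entailment"},
    ]
    for key, acc in sorted(accuracy_breakdown(toy_records).items()):
        print(key, f"{acc:.2f}")
```

Splitting accuracy by gold label matters here: a model that always predicts entailment would look perfect on the entailment half and score zero on the non-entailment half, a pattern that a single aggregate number would hide.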


This review was generated by AI and checked by human editors.