QUICK REVIEW

[Paper Review] Know What You Don't Know: Unanswerable Questions for SQuAD

Pranav Rajpurkar, Robin Jia|arXiv (Cornell University)|Jun 11, 2018

Topic Modeling | Cited by 211

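One-Sentence Summary

This paper introduces SQuAD 2.0, a dataset that combines the answerable questions of SQuAD 1.1 with 53,775 crowdworker-written unanswerable questions, forcing models to abstain when the paragraph does not support an answer.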

ABSTRACT

Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.

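Motivation and Goals

Argue that unanswerable questions are needed to test genuine understanding, rather than just the ability to pick out a sentence within the paragraph.
Create a large-scale, high-quality dataset of unanswerable questions that are relevant to the paragraph and have plausible answers.
Evaluate existing models and establish a challenging benchmark that exposes the gap between machines and humans.
Show that automatically generated negatives are easier for models than human-written adversarial unanswerable questions.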

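Proposed Method

Crowdworkers write up to five unanswerable questions per paragraph; each question references entities in the paragraph and has a plausible answer there.
The answerable questions of SQuAD 1.1 are combined with 53,775 unanswerable questions to form SQuAD 2.0.
The data is split into train/dev/test sets, with answerable and unanswerable questions roughly balanced in dev/test; the training set consists mostly of answerable examples.
Existing models (BiDAF-No-Answer, and DocQA with and without ELMo) are evaluated by having them predict unanswerability and abstain when that prediction exceeds a threshold.
Human-written questions are compared with automatically generated negatives (TfIdf and rule-based methods) to assess difficulty.
Human performance and the plausibility of the crowdworker-written distractors are analyzed through human evaluation.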
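The threshold-based abstention step described above can be illustrated with a minimal sketch. It assumes each model prediction carries a best answer span and a probability that the question is unanswerable; the names `Prediction`, `decide`, and `tune_threshold` are hypothetical, and picking the threshold by exact-match accuracy on held-out data is an illustrative assumption, not a detail confirmed by this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    best_span: str         # highest-scoring answer span (hypothetical field)
    no_answer_prob: float  # predicted probability that the question is unanswerable


def decide(pred: Prediction, threshold: float) -> str:
    """Return the span, or "" (abstain) when no_answer_prob exceeds the threshold."""
    return "" if pred.no_answer_prob > threshold else pred.best_span


def exact_match_accuracy(preds: List[Prediction], golds: List[str], threshold: float) -> float:
    """Fraction of questions answered exactly right; gold "" marks an unanswerable question."""
    correct = sum(decide(p, threshold) == g for p, g in zip(preds, golds))
    return correct / len(golds)


def tune_threshold(preds: List[Prediction], golds: List[str], candidates: List[float]) -> float:
    """Pick the candidate threshold with the highest exact-match accuracy (assumed criterion)."""
    return max(candidates, key=lambda t: exact_match_accuracy(preds, golds, t))
```

A low threshold makes the system abstain often, which helps on unanswerable questions but discards correct answers; a high threshold does the opposite. Tuning on held-out data is one way to balance the two.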

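Experimental Results

Research Questions

RQ1: Can reading comprehension models determine when a paragraph does not entail an answer to a question?
RQ2: How does model performance change relative to SQuAD 1.1 once adversarial unanswerable questions are added?
RQ3: Are human-written unanswerable questions harder than automatically generated negatives?
RQ4: Do plausible distractors effectively mislead both models and humans?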

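Key Findings

SQuAD 2.0 remains substantially harder than SQuAD 1.1 for state-of-the-art models (the best model reaches ~66.3 F1 on the test set, versus 89.5 F1 for humans).
Automatically generated negatives (TfIdf/RuleBased) are easier for models than human-written unanswerable questions, so models score higher F1 on them, and the resulting F1 gap is visible on the dev/test sets.
Plausible but incorrect answers account for roughly half of the false positives made by both machines and humans, which validates their effectiveness as distractors.
Human performance on the SQuAD 2.0 dev/test sets is 89.0/89.5 F1, and models trail by about 23 points (a wider gap than on SQuAD 1.1).
The negatives are diverse and go beyond negation and antonym substitution; 93% of the sampled negatives are genuinely unanswerable.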
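To make the F1 figures above concrete, here is a minimal sketch of SQuAD-style token-overlap F1 extended to abstention. It assumes an empty string stands for "no answer" and that each question has a single gold answer. It is modeled on common SQuAD-style scoring, not a copy of the official evaluation script, and the function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1; "" means abstain (prediction) or unanswerable (gold)."""
    pred_toks, gold_toks = normalize(prediction), normalize(gold)
    if not pred_toks or not gold_toks:
        # If either side is empty, credit is given only when both are empty.
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: List[str], golds: List[str]) -> float:
    """Average F1 over a set of questions."""
    return sum(f1_score(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

Under this scoring, guessing a span on an unanswerable question earns 0, while correctly abstaining earns 1, so a model that always answers is penalized on roughly half of a balanced dev or test set.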

This review was generated by AI and reviewed by a human editor.