QUICK REVIEW

[Paper Digest] CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig|arXiv (Cornell University)|Nov 2, 2018

Topic Modeling|References: 41|Cited by: 310

ONE-SENTENCE SUMMARY

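CommonsenseQA introduces a large-scale commonsense question answering dataset built from ConceptNet, evaluates a range of baselines, and shows that humans substantially outperform current models (about 88.9% human accuracy versus about 55.9% for the best model).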

ABSTRACT

When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.

RESEARCH MOTIVATION AND GOALS

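Introduce a dataset for commonsense question answering that tests background knowledge beyond the given context.
Propose a scalable method for generating questions from ConceptNet with the help of crowd workers.
Evaluate state-of-the-art natural language understanding models and expose the gap between machine and human performance.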

PROPOSED METHOD

Build question sets from ConceptNet by selecting a source concept and three target concepts that share the same relation to it.
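Crowd workers write three questions per set, one with each target concept as the correct answer. The other two target concepts serve as distractors, joined by one additional distractor picked from ConceptNet and one written by the worker, for five answer choices in total.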
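Independent workers verify question quality, and only questions with at least one correct verification are kept.
Textual context is attached by retrieving the top 100 web snippets for each answer candidate, so that reading comprehension models can be studied with external context.
Evaluation covers fine-tuned pre-trained language models (BERT, GPT), conventional QA models, and reading comprehension models with web context, reporting accuracy on both a random split and a question-concept split.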
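
The generation steps above can be pictured with a minimal Python sketch. It assumes ConceptNet is available as (source, relation, target) triples. The toy triples, the function names `build_question_sets` and `assemble_choices`, and the placeholder distractors are all hypothetical and are not the authors' code. It also assumes the five-choice format of the released dataset, in which the two remaining targets are joined by one extra ConceptNet distractor and one worker-written distractor.

```python
import random
from collections import defaultdict

# Hypothetical toy triples in (source, relation, target) form.
# A real pipeline would read these from a ConceptNet dump instead.
TRIPLES = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "AtLocation", "canyon"),
    ("river", "RelatedTo", "water"),
]


def build_question_sets(triples, n_targets=3, seed=0):
    """Group targets by (source, relation) and sample n_targets per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for source, relation, target in triples:
        groups[(source, relation)].append(target)

    question_sets = []
    for (source, relation), targets in sorted(groups.items()):
        # Groups with too few targets cannot form a question set.
        if len(targets) >= n_targets:
            question_sets.append({
                "source": source,
                "relation": relation,
                "targets": rng.sample(targets, n_targets),
            })
    return question_sets


def assemble_choices(question_set, answer, extra_conceptnet, authored):
    """Return five choices: the answer, the two other targets, two extras."""
    others = [t for t in question_set["targets"] if t != answer]
    return [answer] + others + [extra_conceptnet, authored]


if __name__ == "__main__":
    question_set = build_question_sets(TRIPLES)[0]
    answer = question_set["targets"][0]
    # "mountain" and "pebble" stand in for the extra ConceptNet distractor
    # and the worker-written distractor; both are placeholders.
    print(assemble_choices(question_set, answer, "mountain", "pebble"))
```

The sketch leaves the correct answer in the first position for readability; a real pipeline would presumably shuffle the choices before showing them to a model or annotator.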

EXPERIMENTAL RESULTS

Research Questions

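RQ1: How do current NLU models perform on a large-scale commonsense question answering dataset?
RQ2: Does grounding questions in ConceptNet and using varied distractor strategies make the task hard to solve from surface cues alone?
RQ3: What are the limits of pre-trained language models (such as BERT and GPT) on commonsense reasoning tasks?
RQ4: How does using web snippets as context affect model performance on commonsense questions?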

Key Findings

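12,247 commonsense questions were collected, and human accuracy on them is high (about 88.9%).
The best model (BERT-large) reaches 55.9% accuracy on the random split, far below human performance.
Baselines such as GPT underperform BERT-large, and BiDAF++ gains little from web context.
The SANITY distractor control underscores the importance of challenging distractors for model robustness.
Learning curves suggest that more data yields only modest gains: even with 100k examples, BERT-large would likely reach only about 75% accuracy, still below human performance.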
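
The 55.9% figure is reported on the random split. The stricter question-concept split is understood here as keeping source concepts disjoint across partitions, so a model cannot score well by memorizing a concept seen in training. The sketch below illustrates that idea under stated assumptions: the `question_concept` field name, the toy examples, and the function `question_concept_split` are hypothetical, and the authors' actual split procedure and proportions may differ.

```python
import random
from collections import defaultdict

# Hypothetical toy examples; question text is reduced to placeholders.
EXAMPLES = [
    {"question_concept": "river", "question": "q1"},
    {"question_concept": "river", "question": "q2"},
    {"question_concept": "pen", "question": "q3"},
    {"question_concept": "school", "question": "q4"},
    {"question_concept": "school", "question": "q5"},
    {"question_concept": "kitchen", "question": "q6"},
    {"question_concept": "dog", "question": "q7"},
]


def question_concept_split(examples, test_fraction=0.2, seed=0):
    """Split so that every question concept lands in exactly one partition."""
    by_concept = defaultdict(list)
    for example in examples:
        by_concept[example["question_concept"]].append(example)

    # Shuffle concepts, not individual questions, to avoid concept leakage.
    concepts = sorted(by_concept)
    random.Random(seed).shuffle(concepts)
    n_test = max(1, round(len(concepts) * test_fraction))
    test_concepts = set(concepts[:n_test])

    train = [ex for c in concepts if c not in test_concepts
             for ex in by_concept[c]]
    test = [ex for c in concepts if c in test_concepts
            for ex in by_concept[c]]
    return train, test


if __name__ == "__main__":
    train, test = question_concept_split(EXAMPLES, test_fraction=0.4)
    print(len(train), "train examples,", len(test), "test examples")
```

A random split, by contrast, would shuffle the individual questions, so the same source concept could appear in both training and test data.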

This digest was generated by AI and reviewed by human editors.