[Paper Walkthrough] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Introduces ARC, a large set of grade-school science questions originally authored for human tests, split into a Challenge Set and an Easy Set, together with a 14M-sentence ARC Corpus and three neural baselines. None of the tested models significantly outperforms random guessing on the Challenge Set, which highlights the need for deeper reasoning.
We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.
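Motivation and Goals
- Stimulate AI research in advanced question answering by focusing on questions that require reasoning beyond surface-level cues.
- Provide a large, public dataset (ARC) with a clearly defined Challenge Set designed to defeat simple information retrieval (IR) and co-occurrence baselines.
- Release a supporting science corpus (the ARC Corpus) and baseline neural models to give the research community a starting point.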
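Proposed Approach
- Partition ARC into a Challenge Set (hard) and an Easy Set (easier) using a retrieval-based baseline and a co-occurrence baseline to define difficulty.
- Provide the ARC Corpus of 14M science sentences to support knowledge-based reasoning.
- Adapt three neural QA models (DecompAttn, BiDAF, DGEM) to multiple-choice question answering with retrieval-augmented input.
- Compare baselines, including IR, PMI, and the neural models, on both the Challenge Set and the Easy Set to assess difficulty and knowledge requirements.
- Release code and a leaderboard to encourage community participation.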
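The partitioning rule in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `overlap_solver` and `cooccurrence_solver` are hypothetical toy stand-ins for the retrieval-based and word co-occurrence (PMI) algorithms, the question dictionaries use an assumed layout, and the two-sentence corpus is made up. Only the rule itself comes from the paper: a question goes to the Challenge Set only if both baseline solvers answer it incorrectly.

```python
from typing import Callable, Dict, List, Tuple

# Assumed layout: {"stem": str, "choices": List[str], "answer": int}
Question = Dict[str, object]
# A solver returns the index of the option it picks.
Solver = Callable[[str, List[str]], int]


def overlap_solver(corpus: List[str]) -> Solver:
    """Toy retrieval stand-in: score each option by the best word overlap
    between (stem + option) and any single corpus sentence."""
    sentences = [set(s.lower().split()) for s in corpus]

    def solve(stem: str, choices: List[str]) -> int:
        scores = []
        for choice in choices:
            query = set((stem + " " + choice).lower().split())
            scores.append(max(len(query & sent) for sent in sentences))
        return scores.index(max(scores))  # ties go to the first option

    return solve


def cooccurrence_solver(corpus: List[str]) -> Solver:
    """Toy co-occurrence stand-in (raw counts, not true PMI): score each
    option by how often its words share a sentence with stem words."""
    sentences = [set(s.lower().split()) for s in corpus]

    def solve(stem: str, choices: List[str]) -> int:
        stem_words = set(stem.lower().split())
        scores = []
        for choice in choices:
            choice_words = set(choice.lower().split())
            scores.append(
                sum(len(stem_words & s) * len(choice_words & s) for s in sentences)
            )
        return scores.index(max(scores))  # ties go to the first option

    return solve


def partition(
    questions: List[Question], solvers: List[Solver]
) -> Tuple[List[Question], List[Question]]:
    """Challenge Set = questions that every baseline solver gets wrong."""
    challenge, easy = [], []
    for q in questions:
        all_wrong = all(
            solve(q["stem"], q["choices"]) != q["answer"] for solve in solvers
        )
        (challenge if all_wrong else easy).append(q)
    return challenge, easy


# Made-up corpus and questions, for illustration only.
corpus = ["plants make food using sunlight", "ice melts when heated"]
q1 = {"stem": "what do plants use to make food",
      "choices": ["sunlight", "rocks"], "answer": 0}
q2 = {"stem": "which object will sink in water",
      "choices": ["a cork", "a steel nail"], "answer": 1}

challenge, easy = partition(
    [q1, q2], [overlap_solver(corpus), cooccurrence_solver(corpus)]
)
```

In this toy run, `q1` is answered correctly because the corpus states the fact almost verbatim, so it lands in the easy list. Neither solver finds any support for `q2`, both fall back to the first option, and the question lands in the challenge list.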
Experimental Results
Research Questions
- RQ1: Can standard IR/PMI baselines and leading neural QA models outperform random guessing on the ARC Challenge Set?
- RQ2: To what extent does the ARC Corpus assist retrieval-based baselines in answering Challenge questions?
- RQ3: Do neural models that perform well on SNLI/SQuAD significantly improve over random on the ARC Challenge Set?
- RQ4: Which knowledge and reasoning types are most critical for answering ARC Challenge questions?
- RQ5: How do performance patterns differ between the ARC Challenge Set and Easy Set?
Key Findings
- No baseline model significantly outperforms random chance on the ARC Challenge Set (within tight confidence bounds).
- On the Easy Set, baselines generally achieve 55–65% accuracy, while Challenge Set performance remains near random, highlighting the difficulty.
- IR and PMI baselines perform poorly on the Challenge Set but can improve with the ARC Corpus for some questions, indicating knowledge is present but not easily exploitable by simple retrieval.
- Neural baselines (DecompAttn, BiDAF, DGEM) score well above random on the Easy Set but fail to surpass random on the Challenge Set, suggesting the need for more advanced retrieval and multi-hop reasoning strategies.
- ARC Corpus contains knowledge relevant to approximately 95% of Challenge questions, yet simple retrieval over this corpus is insufficient for the hardest questions.
- A notable gap exists in retrieval strategies that can combine multiple facts and perform multi-fact reasoning (chaining).
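To make "does not significantly outperform random chance" concrete, the sketch below checks an accuracy against a 95% interval around the chance rate, using the normal approximation to the binomial. This summary does not say which statistical procedure the paper used, so the test here is an assumption. The function names, the 4-option default, and the question count of 1,000 in the usage lines are likewise hypothetical.

```python
import math
from typing import Tuple


def chance_interval(
    n_questions: int, n_choices: int = 4, z: float = 1.96
) -> Tuple[float, float]:
    """Approximate 95% interval for the accuracy of uniform random guessing
    (normal approximation to the binomial; n_choices=4 is an assumption)."""
    p = 1.0 / n_choices
    half_width = z * math.sqrt(p * (1.0 - p) / n_questions)
    return p - half_width, p + half_width


def beats_chance(accuracy: float, n_questions: int, n_choices: int = 4) -> bool:
    """True if the accuracy lies above the upper end of the chance interval."""
    _, upper = chance_interval(n_questions, n_choices)
    return accuracy > upper


# Hypothetical test-set size, for illustration only.
low, high = chance_interval(1000)
near_random = beats_chance(0.27, 1000)   # a score just above 25%
clearly_above = beats_chance(0.60, 1000)  # a score in the Easy Set range
```

With 1,000 four-option questions the interval spans roughly 22.3% to 27.7%, so a score of 27% is indistinguishable from guessing under this check, while a score of 60% is clearly above it.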
This walkthrough was generated by AI and reviewed by a human editor.