QUICK REVIEW

[Paper Review] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter E. Clark, Isaac Cowhey|arXiv (Cornell University)|Mar 14, 2018

Topic Modeling|References: 23|Citations: 488

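One-line summary

Introduces ARC, a large dataset of human-authored grade-school science questions, split into a hard Challenge Set and an easier Easy Set, and accompanied by the 14M-sentence ARC Corpus and several neural baselines. The results show that current models struggle on the Challenge Set, underscoring the need for deeper reasoning.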

ABSTRACT

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.

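Motivation and objectives

Motivate AI research in advanced question answering by emphasizing questions that require reasoning beyond surface cues.
Provide a large public dataset (ARC) with a clearly defined Challenge Set, designed to defeat simple IR and co-occurrence baselines.
Release a supporting science corpus (ARC Corpus) and reference neural models to give the research community a starting point.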

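Proposed method

Define difficulty with a retrieval baseline and a co-occurrence baseline, and split ARC into a Challenge Set (hard) and an Easy Set (easy).
Provide the ARC Corpus of 14M science sentences to support knowledge-based reasoning.
Adapt three neural models (DecompAttn, BiDAF, DGEM) to multiple-choice QA using retrieval-augmented input.
Compare baselines, including IR, PMI, and the neural models, on both the Challenge Set and the Easy Set to assess difficulty and knowledge requirements.
Release code and a leaderboard to encourage community participation.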
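
The partitioning rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Question`, `partition`, `make_overlap_solver`, and `first_option_solver` are hypothetical names, and the two toy solvers merely stand in for the retrieval-based and word co-occurrence solvers that define the real split.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    stem: str
    options: Tuple[str, ...]
    answer: int  # index of the correct option


Solver = Callable[[Question], int]  # returns the index of the chosen option


def partition(questions: Sequence[Question], solvers: Sequence[Solver]):
    """Challenge = every solver answers incorrectly; Easy = everything else."""
    challenge: List[Question] = []
    easy: List[Question] = []
    for q in questions:
        if all(solve(q) != q.answer for solve in solvers):
            challenge.append(q)
        else:
            easy.append(q)
    return challenge, easy


def make_overlap_solver(corpus: Sequence[str]) -> Solver:
    """Toy stand-in for a retrieval solver (assumption, not the paper's IR):
    pick the option whose words, together with the question's words,
    overlap most with any single corpus sentence."""
    sentences = [set(s.lower().split()) for s in corpus]

    def solve(q: Question) -> int:
        stem = set(q.stem.lower().split())
        scores = [
            max(len((stem | set(opt.lower().split())) & sent) for sent in sentences)
            for opt in q.options
        ]
        return scores.index(max(scores))  # ties resolve to the first option

    return solve


def first_option_solver(q: Question) -> int:
    """Toy stand-in for a second solver: always choose the first option."""
    return 0


corpus = ["plants make food using sunlight", "ice melts when heated"]
q1 = Question("what do plants use to make food", ("sunlight", "rocks"), 0)
q2 = Question("which object conducts heat best", ("wooden spoon", "metal spoon"), 1)

challenge, easy = partition([q1, q2], [make_overlap_solver(corpus), first_option_solver])
print(len(challenge), len(easy))
```

Here `q1` is answered by simple word overlap and lands in the Easy Set, while `q2` has no supporting sentence with matching surface words, so both toy solvers miss it and it lands in the Challenge Set.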

Experimental results

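Research questions

RQ1: Can standard IR/PMI baselines and state-of-the-art neural QA models beat random guessing on the ARC Challenge Set?
RQ2: How much does the ARC Corpus help retrieval-based baselines on Challenge questions?
RQ3: Do neural models that perform well on SNLI/SQuAD score significantly above random on the ARC Challenge Set?
RQ4: Which types of knowledge and reasoning matter most for answering ARC Challenge questions?
RQ5: How do performance patterns differ between the ARC Challenge Set and the Easy Set?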

Key findings

Solver	Challenge Set (%)	Easy Set (%)
IR (dataset defn)	1.02	74.48
PMI (dataset defn)	2.03	77.82
IR (using ARC Corpus)	20.26	62.55
TupleInference	23.83	60.81
DecompAttn	24.34	58.27
Guess-all ("random")	25.02	25.02
DGEM-OpenIE	26.41	57.45
BiDAF	26.54	50.11
TableILP	26.97	36.15
DGEM	27.11	58.97
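
As a quick check on the table above, the following sketch computes each solver's margin over the Guess-all baseline on the Challenge Set. The scores are copied from the table; the variable names are illustrative, and the raw margins say nothing by themselves about statistical significance.

```python
# (Challenge Set, Easy Set) scores in percent, copied from the table above.
scores = {
    "IR (dataset defn)": (1.02, 74.48),
    "PMI (dataset defn)": (2.03, 77.82),
    "IR (using ARC Corpus)": (20.26, 62.55),
    "TupleInference": (23.83, 60.81),
    "DecompAttn": (24.34, 58.27),
    'Guess-all ("random")': (25.02, 25.02),
    "DGEM-OpenIE": (26.41, 57.45),
    "BiDAF": (26.54, 50.11),
    "TableILP": (26.97, 36.15),
    "DGEM": (27.11, 58.97),
}

RANDOM = 'Guess-all ("random")'
random_challenge = scores[RANDOM][0]

# Margin over random on the Challenge Set, in percentage points.
margins = {
    name: round(challenge - random_challenge, 2)
    for name, (challenge, _easy) in scores.items()
    if name != RANDOM
}

above_random = [name for name, margin in margins.items() if margin > 0]
best = max(margins, key=margins.get)

for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {margin:+7.2f}")
print("best:", best, margins[best])
```

Only four solvers land above the random baseline, and the largest margin (DGEM) is about two percentage points, consistent with the review's statement that none of the baselines significantly outperforms random.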

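On the ARC Challenge Set, no baseline model scores significantly above random chance: the best score is 27.11 (DGEM) against 25.02 for random guessing, a difference that stays within the confidence interval.
On the Easy Set, the baselines generally reach 55–65% accuracy, while Challenge Set performance stays close to random, which highlights the difficulty of the Challenge Set.
The IR and PMI solvers used to define the dataset score near zero on the Challenge Set (1.02 and 2.03). IR over the ARC Corpus answers some of these questions (20.26) but still falls below random, suggesting that the knowledge is present but not readily exploited by simple retrieval.
The neural baselines (DecompAttn, BiDAF, DGEM) do better on the Easy Set but fail to beat random on the Challenge Set, suggesting the need for more advanced retrieval and multi-hop reasoning strategies.
The ARC Corpus contains knowledge relevant to about 95% of the Challenge questions, yet simple retrieval over this corpus is not enough for the hardest questions.
A notable gap remains in retrieval strategies that can combine several facts for multi-fact (chained) reasoning.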

This review was generated by AI and checked by a human editor.