QUICK REVIEW

[Paper Review] Quasar: Datasets for Question Answering by Search and Reading

Bhuwan Dhingra, Kathryn Mazaitis|arXiv (Cornell University)|Jul 12, 2017

Topic Modeling | References: 2 | Citations: 139

One-line summary

Introduces two large-scale QA datasets, Quasar-S and Quasar-T, for evaluating end-to-end QA over large text corpora through two subtasks: retrieval (search) and reading (extractive QA).

ABSTRACT

We present two new large-scale datasets aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. The Quasar-S dataset consists of 37000 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The Quasar-T dataset consists of 43000 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. We pose these datasets as a challenge for two related subtasks of factoid Question Answering: (1) searching for relevant pieces of text that include the correct answer to a query, and (2) reading the retrieved text to answer the query. We also describe a retrieval system for extracting relevant sentences and documents from the corpus given a query, and include these in the release for researchers wishing to only focus on (2). We evaluate several baselines on both datasets, ranging from simple heuristics to powerful neural models, and show that these lag behind human performance by 16.4% and 32.1% for Quasar-S and -T respectively. The datasets are available at https://github.com/bdhingra/quasar .

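Motivation and objectives

Provide datasets for studying large-scale open-domain factoid QA that requires both searching and reading over a very large text corpus.
Evaluate end-to-end QA systems and baselines on the retrieval and reading tasks.
Encourage research on combining search and reading to improve end-task performance over unstructured text.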

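Proposed method

Create two datasets: Quasar-S (37,000+ cloze-style questions built from Stack Overflow tag definitions) and Quasar-T (43,000+ open-domain trivia questions).
Build large background corpora: Stack Overflow threads (posts and comments) for Quasar-S, and ClueWeb09 for Quasar-T.
Quasar-S poses questions over a fixed answer vocabulary, while Quasar-T expects free-form answer spans.
Develop a two-stage retrieval pipeline: collect loosely related pseudo-documents, build a Lucene index over them, and retrieve the top documents conditioned on the question and its head tag (Quasar-S) or on the question text alone (Quasar-T).
Assemble candidate answer lists: Quasar-S uses a closed vocabulary of 4,874 entries; Quasar-T derives noun-phrase candidates from the context via POS tagging.
Evaluate baseline models: heuristics, conventional language models, and reading-comprehension architectures (GA Reader, BiDAF).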
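
The retrieve-then-read flow described above can be illustrated with a toy sketch. This is not the authors' pipeline: Lucene is replaced here by a simple TF-IDF-weighted term overlap, head-tag conditioning is not modeled, and the reader is a frequency heuristic over a closed vocabulary. The documents, query, vocabulary, and all function names are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, docs, k=2):
    """Rank docs by TF-IDF-weighted overlap with the query and return the top k.

    Stand-in for a Lucene index lookup (assumption, not the paper's scorer).
    """
    n = len(docs)
    doc_tokens = [tokenize(d) for d in docs]
    # Document frequency of each term across the toy corpus.
    df = Counter(t for toks in doc_tokens for t in set(toks))
    q_terms = set(tokenize(query))

    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * math.log(1 + n / df[t]) for t in q_terms if t in tf)

    ranked = sorted(range(n), key=lambda i: score(doc_tokens[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]


def read(context_docs, vocabulary):
    """Heuristic reader: return the closed-vocabulary candidate seen most often."""
    counts = Counter(
        t for d in context_docs for t in tokenize(d) if t in vocabulary
    )
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical toy data, loosely in the spirit of a Quasar-S cloze query.
docs = [
    "numpy arrays support vectorized math in python",
    "use numpy for fast array math and linear algebra in python",
    "django web framework with orm support",
    "pandas builds dataframes on top of numpy arrays",
]
query = "@placeholder is a python library for fast array math"
vocabulary = {"numpy", "django", "pandas"}

context = retrieve(query, docs, k=2)
answer = read(context, vocabulary)
print(answer)
```

Keeping the two stages as separate functions mirrors the split between subtask (1), search, and subtask (2), reading, so either stage can be swapped out independently.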

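Experimental results

Research questions

RQ1: Can end-to-end QA systems effectively combine search and reading over large, unstructured corpora?
RQ2: How do retrieval-augmented QA baselines compare with human performance on the domain-specific (Quasar-S) and open-domain (Quasar-T) datasets?
RQ3: How does increasing or decreasing the number of retrieved documents affect retrieval and reading performance?
RQ4: Do neural readers outperform heuristic baselines on noisy or large background corpora?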

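Key findings

The BiRNN language model reaches 33.6% accuracy on Quasar-S, the best among the baselines.
GA Reader reaches 48.3% accuracy on the Quasar-S subset where the answer appears in the context, but its overall performance is capped by retrieval quality (65% retrieval accuracy).
On Quasar-T, BiDAF achieves the best F1 among the baselines at 28.5%, leaving a substantial gap to human performance (about 32.1% according to the abstract).
Neural models clearly outperform the heuristic baselines but still fall short of humans, highlighting the room for improvement in systems that combine search and reading.
Retrieving more documents improves retrieval coverage, but the longer, noisier passages can lower reading accuracy.
Open-book non-experts who are given background search can sometimes match or exceed experts, underscoring the role of accessible search in QA performance.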
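
To make the interaction between retrieval quality and reading accuracy concrete, here is a minimal sketch of the arithmetic. It assumes the reader gets no credit when the answer is missing from the retrieved context, which is a simplification and not something the review states. The 0.65 and 0.483 inputs are the figures quoted in the findings above; the function names and the toy contexts are hypothetical.

```python
def answer_coverage(contexts, answers):
    """Fraction of questions whose retrieved context contains the answer string."""
    hits = sum(ans.lower() in ctx.lower() for ctx, ans in zip(contexts, answers))
    return hits / len(answers)


def end_to_end_bound(retrieval_acc, reader_acc_given_answer):
    """End-to-end accuracy if the reader can only succeed on retrieved answers."""
    return retrieval_acc * reader_acc_given_answer


# Toy coverage check on two hypothetical questions: one hit, one miss.
coverage = answer_coverage(
    ["numpy is fast for array math", "a web framework with an orm"],
    ["numpy", "django"],
)

# Figures quoted in the findings: 65% retrieval accuracy and 48.3% reader
# accuracy on the subset where the answer is in the context.
bound = end_to_end_bound(0.65, 0.483)
print(f"coverage={coverage:.2f}, end-to-end bound={bound:.3f}")
```

Under this assumption, improving either factor alone helps only multiplicatively, which is one way to read the trade-off between retrieving more documents and keeping the context clean enough for the reader.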


This review was generated by AI and checked by a human editor.