QUICK REVIEW

[Paper Review] Quasar: Datasets for Question Answering by Search and Reading

Bhuwan Dhingra, Kathryn Mazaitis|arXiv (Cornell University)|Jul 12, 2017

Topic Modeling|References: 2|Cited by: 139

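One-Sentence Summary

Introduces two large QA datasets (Quasar-S and Quasar-T) for evaluating end-to-end QA over large text corpora through two subtasks: retrieval (search) and reading (extractive QA).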

ABSTRACT

We present two new large-scale datasets aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. The Quasar-S dataset consists of 37000 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The Quasar-T dataset consists of 43000 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. We pose these datasets as a challenge for two related subtasks of factoid Question Answering: (1) searching for relevant pieces of text that include the correct answer to a query, and (2) reading the retrieved text to answer the query. We also describe a retrieval system for extracting relevant sentences and documents from the corpus given a query, and include these in the release for researchers wishing to only focus on (2). We evaluate several baselines on both datasets, ranging from simple heuristics to powerful neural models, and show that these lag behind human performance by 16.4% and 32.1% for Quasar-S and -T respectively. The datasets are available at https://github.com/bdhingra/quasar .

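Research Motivation and Goals

Provide large-scale datasets for studying open-domain factoid question answering that requires both retrieval and reading over a large text corpus.
Evaluate end-to-end QA systems and baselines on the retrieval and reading tasks.
Encourage joint research on retrieval and reading to improve end-task performance on unstructured text.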

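Proposed Method

Create two datasets: Quasar-S (more than 37,000 cloze-style questions built from Stack Overflow tag definitions) and Quasar-T (more than 43,000 open-domain trivia questions).
Build large background corpora: Stack Overflow posts and comments for Quasar-S, and ClueWeb09 for Quasar-T.
Frame Quasar-S questions over a fixed answer vocabulary; Quasar-T answers are free-form text spans.
Develop a two-stage retrieval pipeline: collect semi-relevant pseudo-documents, build a Lucene index over them, and retrieve the top-ranked documents given the question and its head tag (Quasar-S) or the question text alone (Quasar-T).
Assemble candidate answer lists: Quasar-S uses a closed vocabulary of 4,874 entities; Quasar-T derives noun-phrase candidates from the retrieved context using part-of-speech tagging.
Evaluate baselines spanning heuristics, traditional language models, and reading comprehension architectures (GA Reader, BiDAF).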
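
The retrieval and candidate-generation steps above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain TF-IDF overlap score stands in for the Lucene index, the part-of-speech tags are assumed to be supplied already (Penn Treebank style), and the names rank_documents and noun_phrase_candidates are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def rank_documents(query, documents, top_k=3):
    """Return indices of the top_k documents by a simple TF-IDF overlap score.

    Assumption: this stands in for a Lucene index; it is not Lucene's scoring.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    query_terms = set(tokenize(query))
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += tf[term] * idf
        scored.append((score, idx))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored[:top_k] if score > 0]


def noun_phrase_candidates(tagged_tokens):
    """Group consecutive noun-tagged tokens (tags starting with 'NN') into candidates.

    Assumption: tagged_tokens is a list of (token, tag) pairs from any POS tagger.
    """
    candidates, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(token)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates
```

In this sketch a reader model would then score each candidate against the retrieved context; a closed vocabulary, as used for Quasar-S, would replace noun_phrase_candidates with a fixed list of entities.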

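Experimental Results

Research Questions

RQ1: Can end-to-end QA systems effectively combine retrieval and reading over a large, unstructured corpus?
RQ2: How do retrieval-augmented QA baselines perform on the domain-specific (Quasar-S) and open-domain (Quasar-T) datasets relative to human performance?
RQ3: How does the number of retrieved documents affect retrieval and reading performance?
RQ4: Do neural readers outperform heuristic baselines when the background corpus is noisy or very large?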

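Key Findings

The BiRNN language model reaches 33.6% accuracy on Quasar-S, the best among the baselines.
GA Reader reaches 48.3% accuracy on the subset of Quasar-S where the answer appears in the retrieved context, but its overall performance is capped by retrieval quality (65% retrieval accuracy).
On Quasar-T, BiDAF achieves the highest F1 among the baselines at 28.5%, leaving a substantial gap of about 32.1% to human performance.
Neural models clearly outperform heuristic baselines but still trail humans, which highlights room for improvement in joint retrieval and reading systems.
Retrieving more documents raises retrieval coverage but can lower reading accuracy, because the context becomes longer and noisier.
Open-book respondents given background retrieval can match or even exceed experts, which highlights the role of accessible retrieval in QA performance.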
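
The measurements behind these findings can be sketched as follows, under stated assumptions. The function names are hypothetical, the F1 is the common token-overlap definition rather than the paper's exact evaluation script, and the product in pipeline_accuracy_estimate is only a rough illustration of how retrieval accuracy caps end-to-end accuracy, not a figure reported in the paper.

```python
from collections import Counter


def answer_coverage(contexts_per_question, answers, k):
    """Fraction of questions whose answer string occurs in the top-k retrieved contexts."""
    hits = 0
    for contexts, answer in zip(contexts_per_question, answers):
        if any(answer.lower() in c.lower() for c in contexts[:k]):
            hits += 1
    return hits / len(answers)


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted span and a gold answer."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pipeline_accuracy_estimate(retrieval_accuracy, reader_accuracy_given_answer):
    """Rough end-to-end estimate.

    Assumption: the reader can only be correct when the answer was retrieved,
    and both inputs were measured under the same retrieval setting.
    """
    return retrieval_accuracy * reader_accuracy_given_answer


# Illustrative arithmetic with the figures quoted above (65% and 48.3%).
estimate = pipeline_accuracy_estimate(0.65, 0.483)
```

Computing answer_coverage for increasing k shows the trade-off noted above: coverage can only grow with k, while the reader has to cope with longer and noisier input.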


This review was generated by AI and checked by a human editor.