QUICK REVIEW

[Paper Review] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dheeru Dua, Yizhong Wang|arXiv (Cornell University)|Mar 1, 2019
Natural Language Processing Techniques|Cited by 178
One-Sentence Summary

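DROP is a challenging reading comprehension benchmark that requires discrete numerical and logical reasoning over paragraph content; state-of-the-art models fall far short of human performance, which motivates neural-symbolic approaches.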

ABSTRACT

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.

Research Motivation and Goals

  • Introduce DROP, a crowdsourced benchmark evaluating discrete reasoning over paragraph content.
  • Push toward models combining neural representations with symbolic, discrete reasoning.
  • Characterize dataset properties and challenge existing QA systems with numeracy-focused tasks.

Proposed Methods

  • Crowdsourced creation of 96.6k questions from Wikipedia passages with adversarial targeting to require discrete reasoning.
  • Semantic parsing baselines using a tabular representation of predicate-argument structures and a rule-driven logical form language.
  • SQuAD-style reading comprehension baselines (BiDAF, QANet, QANet+ELMo, BERT) adapted to evaluate non-span answers.
  • Introduction of NAQANet, a numerically aware QA model that extends QANet with counting and simple arithmetic over numbers.
  • Weakly supervised training that marginalizes over executions producing correct answers, enabling neural-symbolic integration.
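The last two bullets can be illustrated with a small sketch. It assumes, as a simplification, that arithmetic answers are produced by assigning each passage number a sign (+1, -1, or 0) and summing the signed numbers, and that training maximizes the total probability of every assignment that reaches the gold answer. The function names, the `max_terms` limit, and the per-number log-probability layout are hypothetical choices for illustration, not the authors' implementation.

```python
import itertools
import math

SIGNS = (-1, 0, 1)  # subtract, ignore, add


def correct_assignments(numbers, target, max_terms=2):
    """Enumerate sign assignments whose signed sum equals the gold answer.

    max_terms caps how many numbers may be used (assumed limit, keeps the search small).
    """
    found = []
    for signs in itertools.product(SIGNS, repeat=len(numbers)):
        used = sum(1 for s in signs if s != 0)
        if used == 0 or used > max_terms:
            continue
        total = sum(s * n for s, n in zip(signs, numbers))
        if math.isclose(total, target, abs_tol=1e-6):
            found.append(signs)
    return found


def marginal_log_likelihood(sign_log_probs, assignments):
    """Log of the summed probability of all correct assignments.

    sign_log_probs[i][s] is the (hypothetical) model log-probability
    that number i receives sign s; signs are treated as independent.
    """
    if not assignments:
        return float("-inf")
    scores = [
        sum(sign_log_probs[i][s] for i, s in enumerate(signs))
        for signs in assignments
    ]
    top = max(scores)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(x - top) for x in scores))


# Passage numbers 24, 17, 7 and gold answer 7: both "7" and "24 - 17" are correct.
executions = correct_assignments([24, 17, 7], 7)
uniform = [{s: math.log(1 / 3) for s in SIGNS} for _ in range(3)]
loss = -marginal_log_likelihood(uniform, executions)
```

Because only the final answer is supervised, the sketch cannot tell which of the two executions the question intended; marginalizing lets the model spread probability over both and sort it out from other training examples.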

Experimental Results

Research Questions

  • RQ1: How difficult is paragraph-level QA that requires discrete reasoning compared to existing QA datasets?
  • RQ2: Can neural models be augmented with symbolic numeric reasoning to handle counting and arithmetic in passages?
  • RQ3: What are the main challenges for semantic parsing approaches when applied to DROP's paragraph-based questions?
  • RQ4: What is the performance gap between human experts and current models on DROP, and which phenomena drive errors?

Key Findings

  • The best baseline (BERT) achieves 32.70 F1 on the DROP test set, far below the human score of 96.42 F1, which demonstrates the dataset's difficulty.
  • NAQANet with complete arithmetic capabilities achieves 47.01 F1 on the test set, a substantial improvement over prior baselines but still below human performance.
  • Semantic parsing baselines perform poorly due to reliance on information extraction quality and weakly supervised training; only a subset of questions yields valid logical forms.
  • Counting and arithmetic questions dominate model errors, with arithmetic contributing to 51% of analyzed errors in NAQANet’s error analysis.
  • Complete model variants integrating numerical reasoning (Add/Sub) yield the strongest gains among tested approaches.
  • Heuristic baselines show near-zero performance, indicating limited dataset biases that could be exploited by simple tricks.
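The F1 scores above measure overlap between predicted and gold answers. The following is a minimal sketch of a plain token-overlap F1, assuming simple lowercasing and regex tokenization; it is not the official DROP evaluation script, which the abstract describes only as a "generalized accuracy metric", and the function names here are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens, keeping decimals such as 3.5 whole."""
    return re.findall(r"\d+(?:\.\d+)?|\w+", text.lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred, ref = tokenize(prediction), tokenize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs):
    """Average F1 over (prediction, gold) pairs, reported on a 0-100 scale."""
    return 100.0 * sum(token_f1(p, g) for p, g in pairs) / len(pairs)
```

Under this sketch a prediction of "Tom Brady" against the gold answer "Brady" earns partial credit, while a wrong number such as "3" against "7" earns none, which is one reason arithmetic errors weigh so heavily on the scores.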

This review was generated by AI and reviewed by a human editor.