QUICK REVIEW

[Paper Review] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dheeru Dua, Yizhong Wang|arXiv (Cornell University)|Mar 1, 2019

Natural Language Processing Techniques | Cited by 178

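One-Sentence Summary

DROP is a challenging reading comprehension benchmark that requires discrete numerical and logical reasoning over paragraph content; state-of-the-art models fall far short of human performance, which motivates neural-symbolic approaches.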

ABSTRACT

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
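To make the "discrete operations" in the abstract concrete, here is a minimal Python sketch of counting, addition, and sorting over facts that a system would first have to locate in a paragraph. The `events` records and the helper names are hypothetical and invented for illustration; they are not drawn from DROP itself, and the hard part of the benchmark (resolving the question's references to these facts) is assumed to have been done already.

```python
# Hypothetical facts a system might have resolved from a sports passage.
# These records are invented for illustration and are not DROP data.
events = [
    {"player": "Smith", "type": "field goal", "yards": 23},
    {"player": "Jones", "type": "touchdown", "yards": 45},
    {"player": "Smith", "type": "field goal", "yards": 38},
]


def count_events(events, event_type):
    """Counting: how many events of the given type occurred?"""
    return sum(1 for e in events if e["type"] == event_type)


def total_yards(events, event_type):
    """Addition: total yards over all events of the given type."""
    return sum(e["yards"] for e in events if e["type"] == event_type)


def players_by_longest_play(events):
    """Sorting: players ordered from longest to shortest play."""
    ordered = sorted(events, key=lambda e: e["yards"], reverse=True)
    return [e["player"] for e in ordered]


print(count_events(events, "field goal"))   # 2
print(total_yards(events, "field goal"))    # 61
print(players_by_longest_play(events))      # ['Jones', 'Smith', 'Smith']
```

Each answer here depends on several positions in the passage at once, which is why selecting a single span is not enough for questions of this kind.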

Research Motivation and Objectives

Introduce DROP, a crowdsourced benchmark evaluating discrete reasoning over paragraph content.
Push toward models combining neural representations with symbolic, discrete reasoning.
Characterize dataset properties and challenge existing QA systems with numeracy-focused tasks.

Proposed Methods

Crowdsourced creation of 96.6k questions over Wikipedia passages, collected adversarially so that questions demand discrete reasoning rather than simple span matching.
Semantic parsing baselines using a tabular representation of predicate-argument structures and a rule-driven logical form language.
SQuAD-style reading comprehension baselines (BiDAF, QANet, QANet+ELMo, BERT) adapted to evaluate non-span answers.
Introduction of NAQANet, a numerically aware QA model that extends QANet with counting and simple arithmetic over numbers.
Weakly supervised training that marginalizes over executions producing correct answers, enabling neural-symbolic integration.
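The arithmetic and weak-supervision ideas in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes each number in the passage is given a sign of +1, -1, or 0, that the answer is the signed sum, and that training maximizes the total probability of all sign assignments that reach the gold answer. The function names, the regex-based number extraction, and the per-number probability tables are hypothetical.

```python
import itertools
import math
import re


def extract_numbers(passage):
    """Pull plain numeric tokens out of a passage (assumption: no
    thousands separators or spelled-out numbers)."""
    return [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", passage)]


def consistent_sign_assignments(numbers, answer, tol=1e-6):
    """Enumerate every assignment of a sign in {-1, 0, +1} to each number
    whose signed sum equals the gold answer. Assignments that use no
    number at all are skipped."""
    matches = []
    for combo in itertools.product((-1, 0, 1), repeat=len(numbers)):
        if not any(combo):
            continue
        value = sum(s * n for s, n in zip(combo, numbers))
        if abs(value - answer) < tol:
            matches.append(combo)
    return matches


def marginal_log_likelihood(sign_probs, assignments):
    """Log of the summed probability of all answer-consistent assignments.
    sign_probs[i][s] is the (hypothetical) model probability of giving
    sign s to the i-th number; signs are treated as independent."""
    total = 0.0
    for combo in assignments:
        prob = 1.0
        for i, sign in enumerate(combo):
            prob *= sign_probs[i][sign]
        total += prob
    return math.log(total)


numbers = extract_numbers("The team scored 10 points, then 7, then 3.")
matches = consistent_sign_assignments(numbers, answer=3.0)
print(matches)  # [(0, 0, 1), (1, -1, 0)]

# A uniform distribution over signs stands in for real model output.
uniform = [{-1: 1 / 3, 0: 1 / 3, 1: 1 / 3} for _ in numbers]
print(marginal_log_likelihood(uniform, matches))
```

The example shows why supervision is weak: the answer 3 is reachable both by selecting the literal 3 and by computing 10 - 7, and only the answer, not the derivation, is annotated. Exhaustive enumeration grows as 3^n in the number of passage numbers, so this sketch is only practical for short passages.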

Experimental Results

Research Questions

RQ1: How difficult is paragraph-level QA that requires discrete reasoning compared to existing QA datasets?
RQ2: Can neural models be augmented with symbolic numeric reasoning to handle counting and arithmetic in passages?
RQ3: What are the main challenges for semantic parsing approaches when applied to DROP's paragraph-based questions?
RQ4: What is the performance gap between human experts and current models on DROP, and which phenomena drive errors?

Key Findings

The best baseline (BERT) achieves 32.70 F1 on the DROP test set, far below the human score of 96.42 F1, which demonstrates the dataset's difficulty.
NAQANet with complete arithmetic capabilities achieves 47.01 F1 on the test set, a substantial improvement over prior baselines but still below human performance.
Semantic parsing baselines perform poorly due to reliance on information extraction quality and weakly supervised training; only a subset of questions yields valid logical forms.
Counting and arithmetic questions dominate model errors, with arithmetic contributing to 51% of analyzed errors in NAQANet’s error analysis.
The complete model variant, which integrates numerical reasoning (Add/Sub), yields the strongest gains among the tested approaches.
Heuristic baselines show near-zero performance, indicating limited dataset biases that could be exploited by simple tricks.
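The F1 figures above come from a numeracy-aware variant of token-overlap F1. The Python sketch below is a simplified, single-span approximation written for illustration: it assumes bag-of-tokens overlap plus a rule that a prediction scores zero when the gold answer contains a number that the prediction lacks. The tokenizer and function names are hypothetical, and the official evaluation script differs in detail (for example, in text normalization and in how multi-span answers are aligned).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into number and word tokens (simplified)."""
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())


def numbers_in(tokens):
    """Return the set of numeric tokens."""
    return {t for t in tokens if t[0].isdigit()}


def simple_f1(prediction, gold):
    """Bag-of-tokens F1 with a number-match requirement."""
    pred, ref = tokenize(prediction), tokenize(gold)
    if not pred or not ref:
        return float(pred == ref)
    gold_numbers = numbers_in(ref)
    # Assumed rule: if the gold answer has numbers, at least one must match.
    if gold_numbers and not (gold_numbers & numbers_in(pred)):
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(simple_f1("17", "17"))              # 1.0
print(simple_f1("18 yards", "17 yards"))  # 0.0
print(round(simple_f1("17 yards", "17"), 3))  # 0.667
```

The second call shows the effect of the number rule: the shared word "yards" earns no partial credit once the number is wrong, which makes the metric stricter on arithmetic and counting questions than plain token overlap.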


This review was generated by AI and reviewed by human editors.