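[Paper Explained] WinoGrande: An Adversarial Winograd Schema Challenge at Scale
WinoGrande scales Winograd-style pronoun resolution to 44k problems, debiases the data with AfLite, exposes a substantial gap between human and model performance, and supports transfer learning to related benchmarks.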
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. Furthermore, we establish new state-of-the-art results on five related benchmarks - WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
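Research Motivation and Goals
- Investigate whether large models have truly acquired commonsense reasoning or instead rely on dataset biases.
- Build a larger and harder WSC-inspired dataset that challenges current models.
- Develop and apply a bias-reduction method (AfLite) to mitigate dataset-specific artifacts.
- Evaluate transfer learning from WinoGrande to other commonsense benchmarks.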
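Proposed Method
- Crowdsource twin-sentence pronoun-resolution problems, with workers guided by topic anchor words to improve diversity.
- AfLite: a lightweight adversarial filtering algorithm that uses RoBERTa embeddings and an ensemble of linear classifiers to remove heavily biased instances.
- Compare the debiased and full-dataset settings to assess the effect of bias, using KL divergence and PCA visualization.
- Evaluate baselines and state-of-the-art models (WKH, Ensemble LMs, BERT, RoBERTa, with and without fine-tuning on DPR) on both the debiased and the full WinoGrande data.
- Run transfer-learning experiments that fine-tune RoBERTa on WinoGrande and measure the gains on WSC, DPR, COPA, KnowRef, and Winogender.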
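The filtering idea behind AfLite can be illustrated with a small sketch. It assumes precomputed instance embeddings (random toy vectors stand in for RoBERTa embeddings here) and uses a hand-written logistic-regression model as the linear classifier. The scoring rule, the stopping condition, the function names (`train_linear`, `aflite_sketch`, `make_toy_data`), and every parameter value are assumptions chosen for illustration, not details taken from the paper or from this summary.

```python
import math
import random


def train_linear(xs, ys, epochs=5, lr=0.1):
    """Fit a logistic-regression classifier with plain SGD (labels are 0/1)."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def aflite_sketch(embeddings, labels, n_models=20, train_frac=0.6,
                  cutoff=20, threshold=0.75, min_size=50, seed=0):
    """Iteratively drop instances that held-out linear models predict too easily."""
    rng = random.Random(seed)
    keep = list(range(len(labels)))
    while len(keep) > min_size:
        correct = dict.fromkeys(keep, 0)
        seen = dict.fromkeys(keep, 0)
        for _ in range(n_models):
            shuffled = keep[:]
            rng.shuffle(shuffled)
            split = int(train_frac * len(shuffled))
            train, held_out = shuffled[:split], shuffled[split:]
            model = train_linear([embeddings[i] for i in train],
                                 [labels[i] for i in train])
            for i in held_out:
                seen[i] += 1
                correct[i] += int(predict(model, embeddings[i]) == labels[i])
        # Predictability score: fraction of held-out predictions that were right.
        scores = {i: correct[i] / seen[i] for i in keep if seen[i]}
        flagged = sorted((i for i in scores if scores[i] >= threshold),
                         key=scores.get, reverse=True)[:cutoff]
        removed = set(flagged)
        keep = [i for i in keep if i not in removed]
        if len(flagged) < cutoff:  # few easy instances left: stop filtering
            break
    return keep


def make_toy_data(n_biased=100, n_clean=100, dim=4, seed=0):
    """Toy stand-in for embeddings: feature 0 leaks the label in 'biased' rows."""
    rng = random.Random(seed)
    embeddings, labels = [], []
    for k in range(n_biased + n_clean):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if k < n_biased:
            x[0] = 3.0 if y == 1 else -3.0
        embeddings.append(x)
        labels.append(y)
    return embeddings, labels


embeddings, labels = make_toy_data()
kept = aflite_sketch(embeddings, labels)
biased_left = sum(1 for i in kept if i < 100)
clean_left = sum(1 for i in kept if i >= 100)
print(f"kept {len(kept)} of {len(labels)}: {biased_left} biased, {clean_left} clean")
```

In this toy setup the instances whose first feature leaks the label are the ones a linear model predicts reliably when held out, so they are the ones the loop removes first. Real embeddings carry far subtler cues, which is why the paper relies on a pretrained encoder rather than hand-built features.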
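Experimental Results
Research Questions
- RQ1: Can WSC-inspired crowdsourced problems be scaled to tens of thousands of items while remaining hard for AI and solvable by humans?
- RQ2: Do dataset-specific biases inflate model performance on WSC-style tasks, and can AfLite mitigate them?
- RQ3: How does debiasing WinoGrande affect model performance and transfer learning to related benchmarks?
- RQ4: To what extent do models trained on WinoGrande transfer to other commonsense reasoning datasets?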
Key Findings
- The best RoBERTa model reaches 79.1% on the debiased WinoGrande test set (79.3% on dev).
- Human accuracy on debiased WinoGrande is 94.0% on test (94.1% on dev), far above every model.
- AfLite debiasing sharply reduces the KL divergence between label distributions, indicating less dataset-specific bias.
- RoBERTa fine-tuned on WinoGrande sets new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%).
- The results suggest substantial biases in existing benchmarks and the need for algorithmic bias reduction to gauge true commonsense capability.
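The summary does not spell out how the KL divergence is computed. As a minimal sketch, assume the comparison is between histograms of a one-dimensional projection of the embeddings (for example a PCA component), one histogram per answer label. The helper names `histogram` and `kl_divergence`, the smoothing constant, and the toy values below are all made up for illustration.

```python
import math


def histogram(values, bins, lo, hi):
    """Count values into equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms, smoothed so empty bins do not divide by zero."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy projected values per answer label (hypothetical numbers).
# Before filtering: the two labels occupy separate regions, a sign of bias.
before_a = [-2.0, -1.8, -1.5, -1.2, -1.0, -0.8]
before_b = [0.8, 1.0, 1.2, 1.5, 1.8, 2.0]
# After filtering: the two labels overlap almost completely.
after_a = [-1.0, -0.3, 0.2, 0.6, -0.6, 1.1]
after_b = [-0.9, -0.2, 0.3, 0.7, 1.4, 1.0]

kl_before = kl_divergence(histogram(before_a, 4, -2.0, 2.0),
                          histogram(before_b, 4, -2.0, 2.0))
kl_after = kl_divergence(histogram(after_a, 4, -2.0, 2.0),
                         histogram(after_b, 4, -2.0, 2.0))
print(f"KL before: {kl_before:.3f}, KL after: {kl_after:.3f}")
```

A large divergence means the label can be guessed from the projected feature alone. A value near zero means the two labels look alike along that dimension, which is the direction the reported reduction points in.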
Main results: dev/test accuracy (%) on WinoGrande-debiased

| Model | Dev | Test |
| --- | --- | --- |
| RoBERTa | 79.3 | 79.1 |
| BERT | 65.8 | 64.9 |
| Ensemble LMs | 53.0 | 50.9 |
| WKH | 49.4 | 49.6 |
| RoBERTa (local context) | 52.1 | 50.0 |
| BERT (local context) | 52.5 | 51.9 |
| BERT-DPR Star | 50.2 | 51.0 |
| RoBERTa-DPR Star | 59.4 | 58.9 |
| Human Perf. | 94.1 | 94.0 |
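The human-model gap quoted in the abstract can be checked directly from the reported numbers. The short script below copies the dev/test accuracies from the main results and subtracts each test score from the human test accuracy of 94.0%. The variable and function names are illustrative only.

```python
# Dev/test accuracy (%) copied from the main results on WinoGrande-debiased.
results = {
    "RoBERTa": (79.3, 79.1),
    "BERT": (65.8, 64.9),
    "Ensemble LMs": (53.0, 50.9),
    "WKH": (49.4, 49.6),
    "RoBERTa (local context)": (52.1, 50.0),
    "BERT (local context)": (52.5, 51.9),
    "BERT-DPR Star": (50.2, 51.0),
    "RoBERTa-DPR Star": (59.4, 58.9),
}
HUMAN_TEST = 94.0


def gap_to_human(test_acc, human=HUMAN_TEST):
    """Absolute gap in percentage points between human and model test accuracy."""
    return round(human - test_acc, 1)


gaps = {name: gap_to_human(test) for name, (_dev, test) in results.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {gap:5.1f} points below human")

# The abstract's lower bound of 59.4% (least training data) gives the wide end of the gap.
low_data_gap = gap_to_human(59.4)
print(f"Gap at 59.4% accuracy: {low_data_gap} points")
```

The best model's gap (94.0 - 79.1) and the low-data gap (94.0 - 59.4) bracket the "15-35% below human performance" range stated in the abstract, read as absolute percentage points.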
This explainer was generated by AI and reviewed by a human editor.