QUICK REVIEW

[Paper Review] A Simple Method for Commonsense Reasoning

Trieu H. Trinh, Quoc V. Le|arXiv (Cornell University)|Jun 7, 2018

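Natural Language Processing Techniques|References: 38|Cited by: 312

One-Sentence Summary

The authors show that large language models trained without supervision on diverse unlabeled corpora can solve the Winograd Schema Challenge and Pronoun Disambiguation Problems by scoring candidate substitutions, achieving state-of-the-art accuracy without hand-engineered features or annotated knowledge bases.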

ABSTRACT

Commonsense reasoning is a long-standing challenge for deep learning. For example, it is difficult to use neural networks to tackle the Winograd Schema dataset (Levesque et al., 2011). In this paper, we present a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to our method is the use of language models, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests. On both Pronoun Disambiguation and Winograd Schema challenges, our models outperform previous state-of-the-art methods by a large margin, without using expensive annotated knowledge bases or hand-engineered features. We train an array of large RNN language models that operate at word or character level on LM-1-Billion, CommonCrawl, SQuAD, Gutenberg Books, and a customized corpus for this task and show that diversity of training data plays an important role in test performance. Further analysis also shows that our system successfully discovers important features of the context that decide the correct answer, indicating a good grasp of commonsense knowledge.

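Motivation and Objectives

Frame commonsense reasoning as a low-supervision problem in which labeled data is scarce.
Propose a simple method that uses language models to score candidate substitutions in Winograd Schema and PDP tasks.
Show that an ensemble of models trained on diverse corpora outperforms previous state-of-the-art methods.
Analyze how the scoring strategy and training data diversity affect performance on reasoning benchmarks.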

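Proposed Method

Substitute each candidate referent for the pronoun in the sentence and score the resulting sentence with a language model.
Compare the probability of the full sentence (Score_full) with the conditional probability of the tail, i.e. the words after the substituted candidate, given everything up to and including the substitution (Score_partial).
Train word-level and character-level language models on large unlabeled corpora (LM-1-Billion, CommonCrawl, SQuAD, Gutenberg, STORIES) and ensemble their outputs.
Evaluate on PDP-60 and WSC-273 to measure reasoning ability without annotated knowledge bases.
Explore a customized STORIES corpus derived from CommonCrawl to further improve performance on the Winograd Schema task.
Analyze keyword-level features by inspecting per-token probability ratios to identify the words that are decisive for the prediction.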
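
The substitution-and-scoring steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the paper uses large RNN language models, whereas the `BigramLM` class here is a hypothetical add-one-smoothed bigram model trained on a made-up three-sentence corpus. The function names (`substitute`, `score_full`, `score_partial`, `resolve`) and the whitespace tokenization are likewise assumptions made for this example.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram LM (a stand-in for the paper's RNN LMs)."""

    def __init__(self, corpus):
        self.context_counts = Counter()  # how often a word appears as context
        self.bigram_counts = Counter()   # how often (prev, cur) appears
        self.vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, prev, cur):
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        size = len(self.vocab) + 1
        count = self.bigram_counts[(prev, cur)]
        return math.log((count + 1) / (self.context_counts[prev] + size))


def token_log_probs(lm, tokens):
    """log P(token_i | token_{i-1}) for every token in the sentence."""
    padded = ["<s>"] + tokens
    return [lm.log_prob(p, c) for p, c in zip(padded, padded[1:])]


def substitute(tokens, pronoun_index, candidate):
    """Replace the pronoun at pronoun_index with the candidate's tokens."""
    cand = candidate.lower().split()
    return tokens[:pronoun_index] + cand + tokens[pronoun_index + 1:]


def score_full(lm, tokens, pronoun_index, candidate):
    """Log-probability of the whole substituted sentence."""
    return sum(token_log_probs(lm, substitute(tokens, pronoun_index, candidate)))


def score_partial(lm, tokens, pronoun_index, candidate):
    """Log-probability of the tail after the candidate, given the prefix."""
    new_tokens = substitute(tokens, pronoun_index, candidate)
    tail_start = pronoun_index + len(candidate.split())
    return sum(token_log_probs(lm, new_tokens)[tail_start:])


def resolve(lm, sentence, pronoun_index, candidates, scorer):
    """Pick the candidate whose substitution the LM scores highest."""
    tokens = sentence.lower().split()
    return max(candidates, key=lambda c: scorer(lm, tokens, pronoun_index, c))


# Made-up training corpus and test sentence, for illustration only.
corpus = ["the trophy is too big", "the trophy is big", "the suitcase was small"]
lm = BigramLM(corpus)
sentence = "the trophy does not fit in the suitcase because it is too big"
pronoun_index = 9  # position of "it" after whitespace tokenization
candidates = ["the trophy", "the suitcase"]

choice_full = resolve(lm, sentence, pronoun_index, candidates, score_full)
choice_partial = resolve(lm, sentence, pronoun_index, candidates, score_partial)
print(choice_full, "|", choice_partial)
```

A bigram model only sees one word of context, so in this toy setup the two candidates differ only through how often "is" follows each noun in the made-up corpus. It demonstrates the scoring mechanics, not commonsense knowledge; the long-range context that the paper relies on requires a much stronger language model.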

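Experimental Results

Research Questions

RQ1: Can unsupervised language models learn enough commonsense reasoning from large unlabeled corpora to solve the Winograd Schema and pronoun disambiguation tasks?
RQ2: Does the scoring method (full vs. partial) affect reasoning performance, and how does training data diversity affect the results?
RQ3: How does corpus choice affect LM performance on commonsense tasks, and do story-like corpora bring additional gains?
RQ4: Does an ensemble of language models trained on different corpora outperform single models or models that use knowledge bases?
RQ5: Can the model identify the keywords, or special words, that drive Winograd Schema decisions?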

Key Findings

Method	PDP-60 accuracy	WSC-273 accuracy
Unsupervised Semantic Similarity Method (USSM)	48.3%	N/A
USSM + Cause-Effect Knowledge Base	55.0%	N/A
USSM + Cause-Effect + WordNet + ConceptNet	56.7%	N/A
Char-LM - partial	45.0%	N/A
Char-LM - full	53.3%	N/A
Word-LM - partial	53.3%	56.4%
Word-LM - full	60.0%	53.8%
Ensemble of 5 Unsupervised LMs - full	70.0%	N/A
Ensemble of 10 Unsupervised LMs - partial	N/A	61.5%
Word-LM - STORIES (single model)	N/A	62.6%
Ensemble of 14 LMs - STORIES + others	N/A	63.7%

With full scoring, a single language model already outperforms the USSM baselines on PDP-60 (48.3% to 56.7%): Word-LM-full reaches 60.0% accuracy.
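The ensemble of unsupervised language models exceeds the previous best PDP-60 result (66.7%), reaching 70.0% accuracy.
On WSC-273, Word-LM-full reaches 53.8% accuracy and Word-LM-partial reaches 56.4%.
An ensemble of 10 language models trained on diverse corpora reaches 61.5% accuracy on WSC-273, and the extended ensemble that includes STORIES models reaches 63.7%.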
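Partial scoring outperforms full scoring on WSC-273 (56.4% vs. 53.8% for Word-LM), whereas the single-model PDP-60 rows in the table favor full scoring (60.0% vs. 53.3% for Word-LM); normalizing Score_full helps on PDP-122.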
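Training on the STORIES corpus yields strong single-model performance (62.6%), and adding the STORIES-based models to the ensemble raises the final WSC-273 accuracy to 63.7%.
Training data diversity is beneficial: ensembles trained on a diverse set of corpora outperform ensembles trained on a single corpus.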
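
To make the ensemble findings concrete, the sketch below shows one simple way several language models could be combined. It is an assumption made for illustration: this summary does not state the combination rule, so the example simply sums each candidate's log-score across models and picks the highest total. The function name `ensemble_choice` and all score values are hypothetical.

```python
def ensemble_choice(per_model_scores):
    """Sum per-candidate log-scores across models and pick the best total.

    per_model_scores: list of dicts mapping candidate -> log-score,
    one dict per language model (assumed combination rule: summation).
    """
    totals = {}
    for scores in per_model_scores:
        for candidate, log_score in scores.items():
            totals[candidate] = totals.get(candidate, 0.0) + log_score
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical log-scores from three models trained on different corpora.
per_model_scores = [
    {"the trophy": -1.0, "the suitcase": -2.0},
    {"the trophy": -3.0, "the suitcase": -1.5},  # this model disagrees
    {"the trophy": -1.0, "the suitcase": -2.5},
]

best, totals = ensemble_choice(per_model_scores)
print(best, totals)
```

In this made-up case the second model alone would pick the wrong candidate, but the summed scores still favor the candidate that the other two models prefer, which is the kind of error-cancelling effect an ensemble of diversely trained models is meant to provide.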


This review was generated by AI and reviewed by a human editor.