QUICK REVIEW

[Paper Review] Adversarial Learning for Neural Dialogue Generation

Jiwei Li, Will Monroe|arXiv (Cornell University)|Jan 23, 2017

Topic Modeling|References: 45|Cited by: 221

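One-Sentence Summary

The paper trains a dialogue generation model with adversarial reinforcement learning against a discriminator so that it produces human-like open-domain responses, and it proposes adversarial evaluation as a metric. The resulting model improves on a standard Seq2Seq baseline across several metrics.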

ABSTRACT

In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning (RL) problem where we jointly train two systems, a generative model to produce response sequences, and a discriminator (analogous to the human evaluator in the Turing test) to distinguish between the human-generated dialogues and the machine-generated ones. The outputs from the discriminator are then used as rewards for the generative model, pushing the system to generate dialogues that mostly resemble human dialogues. In addition to adversarial training we describe a model for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

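Motivation and Objectives

Move open-domain dialogue generation beyond maximum-likelihood training, which tends to produce dull, repetitive responses.
Propose an adversarial training framework in which the generator, rewarded by a discriminator, learns to produce dialogue that is hard to distinguish from human dialogue.
Develop and analyze strategies for assigning a reward at every generation step, along with reliable methods for evaluating adversarially trained dialogue systems.
Examine whether adversarial training improves interaction quality and how such models can be evaluated robustly.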

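Proposed Method

Formulate dialogue generation as a reinforcement learning problem with a generator G and a discriminator D.
Encode the dialogue history with a hierarchical encoder, and generate responses with a Seq2Seq-style generator.
Train with policy gradient (REINFORCE), using the discriminator score Q+({x, y}) as the reward for generated utterances.
Introduce Reward for Every Generation Step (REGS), which assigns intermediate rewards either through Monte Carlo search or through a discriminator trained on partial sequences.
Add teacher forcing and alternative reward strategies to stabilize training, including mixing adversarial updates with maximum-likelihood updates.
Pre-train the generator on the standard Seq2Seq objective, and pre-train the discriminator to separate real data from generated data.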
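
The REINFORCE and per-step reward steps above can be illustrated with a toy sketch. This is not the paper's implementation: the "generator" is a position-independent softmax policy over six tokens, `toy_discriminator` is a hypothetical stand-in for D that merely rewards non-repeating sequences, and the learning rate, baseline, and rollout count are arbitrary assumptions. It only shows how Monte Carlo rollouts can turn a full-sequence discriminator score into one reward per generated token.

```python
import math
import random

VOCAB = ["i", "don't", "know", "yes", "maybe", "tomorrow"]  # toy vocabulary (assumption)
MAX_LEN = 4  # fixed response length for the sketch


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Sample one token index from the softmax policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def toy_discriminator(tokens):
    """Hypothetical stand-in for D: a score in (0, 1], higher = more 'human-like'.
    It simply rewards sequences without repeated tokens."""
    return len(set(tokens)) / len(tokens)


def per_step_rewards(tokens, logits, rng, n_rollouts=5):
    """Estimate a reward for every prefix by completing it with policy rollouts."""
    rewards = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        if t == len(tokens):
            # The full sequence is scored directly, no rollout needed.
            rewards.append(toy_discriminator(prefix))
            continue
        total = 0.0
        for _ in range(n_rollouts):
            tail = [sample_token(logits, rng) for _ in range(len(tokens) - t)]
            total += toy_discriminator(prefix + tail)
        rewards.append(total / n_rollouts)
    return rewards


def reinforce_step(logits, rng, lr=0.5, baseline=0.5):
    """One policy-gradient update using per-step rewards minus a constant baseline."""
    tokens = [sample_token(logits, rng) for _ in range(MAX_LEN)]
    rewards = per_step_rewards(tokens, logits, rng)
    probs = softmax(logits)
    new_logits = list(logits)
    for tok, reward in zip(tokens, rewards):
        advantage = reward - baseline
        for k in range(len(logits)):
            # Gradient of log softmax(logits)[tok] with respect to logits[k].
            grad_logp = (1.0 if k == tok else 0.0) - probs[k]
            new_logits[k] += lr * advantage * grad_logp
    return new_logits, tokens, rewards
```

Calling `reinforce_step` repeatedly with a `random.Random` instance plays the role of the adversarial update. In a fuller setup these updates would be interleaved with maximum-likelihood (teacher-forced) updates on human responses and with discriminator retraining, neither of which this sketch covers.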

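Experimental Results

Research Questions

RQ1: Does adversarial reinforcement learning produce higher-quality open-domain dialogue responses than standard Seq2Seq training?
RQ2: How can automatic evaluators and adversarial metrics be used to evaluate adversarially trained dialogue systems reliably?
RQ3: Which reward structures (per generation step versus full sequence) and training-stabilization strategies improve dialogue quality the most?
RQ4: How does adversarial training compare with strong baselines (MLE, beam search with MI reranking) in single-turn and multi-turn evaluation?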

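Key Findings

Adversarially trained models produce higher-quality dialogue responses than the standard Seq2Seq baseline in the reported evaluations.
Adversarial success (AdverSuc), the adversarial-evaluation measure of how often a model fools the evaluator, shows the proposed models outperforming the baselines, with REGS performing best among the proposed methods.
Human evaluation indicates significant quality gains under the adversarial framework in both single-turn and multi-turn dialogue.
REGS with Monte Carlo intermediate rewards outperforms vanilla REINFORCE in the AdverSuc experiments.
Sampling-based decoding raises AdverSuc but may lower the evaluator's machine-vs-random accuracy, which highlights caveats in evaluation.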
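
To make the AdverSuc numbers above easier to read, here is a minimal sketch of how such a score could be computed. It assumes AdverSuc is taken as one minus the evaluator's classification accuracy on a mix of human and machine responses; the summary itself does not spell out the formula, and the function name and label encoding are hypothetical.

```python
def adversarial_success(evaluator_predictions, true_labels):
    """Return 1 - evaluator accuracy (assumed definition of AdverSuc).

    Labels: 1 = human-generated, 0 = machine-generated.
    """
    if len(evaluator_predictions) != len(true_labels) or not true_labels:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == y for p, y in zip(evaluator_predictions, true_labels))
    return 1.0 - correct / len(true_labels)


# Toy data: the evaluator is right on 2 of 4 responses.
true_labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
score = adversarial_success(predictions, true_labels)  # 0.5
```

Under this assumed definition, an evaluator guessing at chance on a balanced set scores about 0.5, so values near 0.5 mean the machine responses are hard to tell apart from human ones.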

This review was generated by AI and checked by a human editor.