QUICK REVIEW

[Paper Review] Deep Reinforcement Learning for Dialogue Generation

Jiwei Li, Will Monroe|arXiv (Cornell University)|Jun 5, 2016

Topic Modeling | References: 45 | Cited by: 421

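One-Sentence Summary

This paper combines deep reinforcement learning with a Seq2Seq dialogue model by simulating conversations between two virtual agents, optimizing long-term reward for informativity, coherence, and ease of answering in order to produce more interactive and sustained dialogues.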

ABSTRACT

Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be shortsighted, predicting utterances one at a time while ignoring their influence on future outcomes. Modeling the future direction of a dialogue is crucial to generating coherent, interesting dialogues, a need which led traditional NLP models of dialogue to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering (related to forward-looking function). We evaluate our model on diversity, length as well as with human judges, showing that the proposed algorithm generates more interactive responses and manages to foster a more sustained conversation in dialogue simulation. This work marks a first step towards learning a neural conversational model based on the long-term success of dialogues.

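Research Motivation and Goals

Motivate the need to move Seq2Seq dialogue models beyond single-turn MLE training toward long-term conversational success.
Propose a neural reinforcement learning generation framework that uses policy gradients to maximize future reward in simulated dialogues.
Define reward components that capture forward-looking, informative, and coherent conversational properties.
Use two-agent dialogue simulation to learn a policy that produces more engaging and sustained conversations.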

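Proposed Method

Treat each utterance as an action generated by an encoder-decoder policy over an infinite action space.
Simulate dialogues between two virtual agents to explore the state-action space and learn a policy p_RL(p_{i+1}|p_i,q_i).
Define a reward r(a,[p_i,q_i]) that combines three terms: ease of answering (r1), information flow (r2), and semantic coherence (r3).
Train with policy gradients using a curriculum learning strategy that starts with MLE-style training on the initial tokens and gradually shifts to RL updates.
Initialize the RL policy from a model trained with the mutual information objective, then optimize it with policy gradients, using a baseline to reduce variance.
Pretrain on supervised data and then fine-tune through dialogue simulation, following an AlphaGo-style initialization.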
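
The reward combination and the baseline-corrected policy gradient update listed above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the three-action categorical policy stands in for the encoder-decoder policy over utterances, the weights in `DEFAULT_WEIGHTS`, the reward values in `TOY_TERMS`, the moving-average baseline, and all function names are assumptions made for the example.

```python
import math
import random

# Illustrative weights for (r1, r2, r3); assumed here, not taken from the summary.
DEFAULT_WEIGHTS = (0.25, 0.25, 0.5)


def combined_reward(r1, r2, r3, weights=DEFAULT_WEIGHTS):
    """Weighted sum of ease of answering (r1), information flow (r2),
    and semantic coherence (r3)."""
    w1, w2, w3 = weights
    return w1 * r1 + w2 * r2 + w3 * r3


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, baseline, rng, lr=0.1, decay=0.9):
    """One REINFORCE update with a baseline on a toy categorical policy."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # Subtracting the baseline reduces variance without biasing the gradient.
    advantage = reward - baseline
    # d log pi(action) / d logit_i = 1[i == action] - probs[i]
    new_logits = [
        x + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]
    # Assumed baseline: exponential moving average of observed rewards.
    new_baseline = decay * baseline + (1.0 - decay) * reward
    return new_logits, new_baseline


# Hypothetical (r1, r2, r3) values for three canned replies.
TOY_TERMS = [
    (-2.0, 0.5, 0.2),  # dull reply that is hard to answer
    (0.5, -2.0, 0.3),  # reply that repeats the previous turn
    (0.8, 0.9, 1.0),   # informative follow-up question
]


def train_toy_policy(steps=500, seed=0):
    """Run repeated updates and return the final action probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(TOY_TERMS)
    baseline = 0.0
    for _ in range(steps):
        logits, baseline = reinforce_step(
            logits, lambda a: combined_reward(*TOY_TERMS[a]), baseline, rng
        )
    return softmax(logits)
```

Calling `train_toy_policy()` returns a probability list in which the third action, the one with the highest combined reward, should dominate. In the real setting the action is a whole generated utterance, so the log-probability gradient would flow through the decoder rather than through a three-entry logit vector.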

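Experimental Results

Research Questions

RQ1: Can deep reinforcement learning with long-term reward outperform standard Seq2Seq training in open-domain dialogue generation?
RQ2: Do reward components for forward-looking behavior, informativity, and coherence lead to longer, more interactive dialogues?
RQ3: Does the two-agent dialogue simulation framework produce more diverse and more sustained responses than conventional methods?
RQ4: How do mutual information initialization and curriculum learning affect RL performance?
RQ5: How do automatic and human evaluations reflect improvements in long-term dialogue quality?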

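Key Findings

The RL model produces longer simulated dialogues than the Seq2Seq and mutual information baselines.
RL-generated responses are more interactive and more often end with a question, which encourages turn-taking.
In human evaluation, the RL model improves multi-turn dialogue quality, and its responses are easier to answer than those of the baselines.
Responses generated under the RL framework are more diverse than those of the standard Seq2Seq and mutual information models.
Mutual information initialization combined with RL achieves the best performance on dialogue sustainability.
BLEU and perplexity do not correlate with long-term dialogue success; the gains shown by RL are not captured by these metrics.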
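
The diversity finding above can be made concrete with a distinct-n ratio: the number of distinct n-grams in a set of responses divided by the total number of generated tokens. The summary does not state which diversity metric was used, so treating distinct-n as the measure is an assumption, and `distinct_n` is a hypothetical helper written for this sketch.

```python
from typing import Iterable, List


def distinct_n(responses: Iterable[List[str]], n: int) -> float:
    """Distinct n-grams across all responses, scaled by total token count."""
    total_tokens = 0
    seen = set()
    for tokens in responses:
        total_tokens += len(tokens)
        # Slide a window of size n over the tokenized response.
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return len(seen) / total_tokens if total_tokens else 0.0


# Two identical dull replies versus two varied replies (toy data).
dull = [["i", "don't", "know"], ["i", "don't", "know"]]
varied = [["where", "are", "you", "from"], ["i", "am", "from", "paris"]]

print(distinct_n(dull, 1))    # 3 distinct unigrams / 6 tokens = 0.5
print(distinct_n(varied, 1))  # 7 distinct unigrams / 8 tokens = 0.875
```

A model that keeps emitting the same generic reply scores low on this ratio even if each individual reply is fluent, which is one reason a likelihood-based metric such as perplexity can miss the difference.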


This review was generated by AI and reviewed by a human editor.