QUICK REVIEW

[Paper Review] An Actor-Critic Algorithm for Sequence Prediction

Dzmitry Bahdanau, Philémon Brakel|arXiv (Cornell University)|Jul 24, 2016

Multimodal Machine Learning Applications | References: 40 | Cited by: 224

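One-Sentence Summary

This paper proposes an actor-critic framework for training sequence generation models, in which a critic predicts token values so that training directly targets the test-time metric (such as BLEU); it outperforms MLE and REINFORCE on spelling correction and machine translation.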

ABSTRACT

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a critic network that is trained to predict the value of an output token, given the policy of an actor network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.

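Research Motivation and Goals

Train sequence models to optimize a task-specific score rather than log-likelihood alone.
Address the train-test mismatch by training on prefixes generated by the model itself.
Introduce a critic network that predicts the value of each token under the current policy.
Demonstrate improvements over standard MLE and REINFORCE on spelling correction and machine translation.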

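Proposed Method

Formulate sequence generation as a stochastic policy, with an actor (the decoder) and a critic.
Define value functions V and Q over partial sequences and candidate actions (tokens).
Train the critic on temporal-difference targets, stabilized by a target network and a delayed actor.
Estimate the policy gradient from sampled prefixes, substituting the critic's Q estimates for the true values (which introduces some bias but lowers variance), optionally combined with a log-likelihood gradient term.
Shape the reward to provide intermediate feedback and reduce reward sparsity.
Pre-train the actor and the critic before joint actor-critic training to bootstrap learning.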
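
The reward shaping and temporal-difference steps above can be illustrated with a minimal NumPy sketch. It assumes that the shaped reward at each step is the increase in the score of the generated prefix, and that the critic's target is the immediate reward plus the expected next-step value under the delayed actor and the target critic. The function names `shaped_rewards` and `td_targets` and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def shaped_rewards(prefix_scores):
    """Convert scores of successive prefixes into per-step rewards.

    Assumes r_t = R(y_1..t) - R(y_1..t-1) with R(empty prefix) = 0,
    so the rewards sum to the score of the full sequence.
    """
    scores = np.asarray(prefix_scores, dtype=float)
    return np.diff(scores, prepend=0.0)


def td_targets(rewards, next_probs, next_q):
    """One-step TD targets for the critic.

    rewards:    shape (T,)   shaped reward for each generated token
    next_probs: shape (T, V) delayed actor's distribution over the next token
    next_q:     shape (T, V) target critic's value for each next token
    The last row of next_probs and next_q is expected to be all zeros,
    because nothing follows the final token.
    """
    expected_next_value = np.sum(next_probs * next_q, axis=1)
    return np.asarray(rewards, dtype=float) + expected_next_value


# Toy example: a 3-token output over a 2-token vocabulary.
rewards = shaped_rewards([0.1, 0.4, 1.0])
next_probs = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]])
next_q = np.array([[1.0, 3.0], [2.0, 5.0], [0.0, 0.0]])
targets = td_targets(rewards, next_probs, next_q)
```

In a full system the critic would be regressed toward `targets` with a squared-error loss, and the actor would be updated with the critic's value estimates weighting the gradient of each token probability. Both networks are omitted here to keep the sketch self-contained.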

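Experimental Results

Research Questions

RQ1: Does actor-critic training improve task-specific sequence scores (such as BLEU) compared with MLE and REINFORCE?
RQ2: Does giving the critic access to the ground-truth output during training help, given that this information is not used at test time?
RQ3: Which training techniques (target networks, reward shaping, value penalty) are necessary for stability and performance in sequence prediction?
RQ4: How does the method compare with baselines on synthetic spelling correction and on real MT datasets (IWSLT, WMT)?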

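Key Findings

Actor-critic training improves over log-likelihood training on spelling correction across a range of settings.
On the IWSLT 2014 and WMT14 MT tasks, the actor-critic method improves BLEU over the baselines; the gain is most pronounced with greedy decoding, and the method remains competitive with beam search.
A target network and a penalty on the variance of the critic's outputs are essential for stable learning and better performance.
Reward shaping and the delayed actor bring additional gains.
Compared with earlier RL-based methods such as MIXER, the method achieves competitive or better results against stronger or comparable baselines.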
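
The two stabilization techniques named in the findings can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the variance penalty is taken to be the sum of squared deviations of the critic's outputs from their mean over the vocabulary, and the target network is assumed to track the online network through a slow interpolation. The names `variance_penalty` and `soft_update` and the `rate` parameter are hypothetical.

```python
import numpy as np


def variance_penalty(q_values):
    """Sum of squared deviations of critic outputs from their per-step mean.

    q_values: shape (T, V), one critic value per candidate token per step.
    Returns an array of shape (T,), one penalty per time step.
    """
    q = np.asarray(q_values, dtype=float)
    centered = q - q.mean(axis=-1, keepdims=True)
    return np.sum(centered ** 2, axis=-1)


def soft_update(target_params, online_params, rate=0.001):
    """Move each target parameter a small step toward its online counterpart."""
    return [
        (1.0 - rate) * target + rate * online
        for target, online in zip(target_params, online_params)
    ]


# Toy example: identical values incur no penalty, spread-out values do.
penalty = variance_penalty([[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]])
new_target = soft_update([np.zeros(2)], [np.ones(2)], rate=0.1)
```

The penalty would be added to the critic's loss with a small coefficient. It discourages the critic from assigning extreme values to tokens that are rarely sampled and therefore rarely corrected by the TD targets.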

This review was generated by AI and reviewed by a human editor.