QUICK REVIEW

[Paper Review] Generating Text with Deep Reinforcement Learning

Hongyu Guo|arXiv (Cornell University)|Oct 30, 2015

Topic Modeling|References: 30|Cited by: 41

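One-Sentence Summary

This paper proposes a novel deep reinforcement learning approach to sequence-to-sequence text generation built on a Deep Q-Network (DQN), which refines the decoded output iteratively. An encoder-decoder LSTM produces the state representation and the candidate actions, and exploration is biased toward parts of the sequence that were previously hard to decode. On unseen sentences the method clearly outperforms a left-to-right greedy beam search LSTM decoder, more than doubling the BLEU score on test data the model has not seen.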

ABSTRACT

We introduce a novel schema for sequence to sequence learning with a Deep Q-Network (DQN), which decodes the output sequence iteratively. The aim here is to enable the decoder to first tackle easier portions of the sequences, and then turn to cope with difficult parts. Specifically, in each iteration, an encoder-decoder Long Short-Term Memory (LSTM) network is employed to, from the input sequence, automatically create features to represent the internal states of and formulate a list of potential actions for the DQN. Take rephrasing a natural sentence as an example. This list can contain ranked potential words. Next, the DQN learns to make decision on which action (e.g., word) will be selected from the list to modify the current decoded sequence. The newly modified output sequence is subsequently used as the input to the DQN for the next decoding iteration. In each iteration, we also bias the reinforcement learning's attention to explore sequence portions which are previously difficult to be decoded. For evaluation, the proposed strategy was trained to decode ten thousands natural sentences. Our experiments indicate that, when compared to a left-to-right greedy beam search LSTM decoder, the proposed method performed competitively well when decoding sentences from the training set, but significantly outperformed the baseline when decoding unseen sentences, in terms of BLEU score obtained.

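Motivation and Objectives

Address the challenge of generating variable-length sequences in NLP tasks such as text rephrasing and machine translation.
Improve generalization to unseen sequences by replacing left-to-right greedy decoding with an iterative refinement strategy driven by reinforcement learning.
Guide exploration toward portions that were previously hard to decode, so the agent can focus on the difficult regions of a sequence.
Explore the feasibility of end-to-end text generation with a DQN, using an LSTM to represent states and actions.
Evaluate model performance on unseen data, where generalization matters most for practical deployment.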

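Proposed Method

In each decoding iteration, an encoder-decoder LSTM processes the input sequence and produces a fixed-dimensional context vector together with the list of candidate words the DQN chooses from.
The DQN selects an action (a word) from the candidate list to modify the current decoded sequence; the updated sequence is fed back to the DQN for the next iteration.
The DQN learns through Q-learning to maximize cumulative reward, with experience replay and a target network to stabilize training.
Exploration is biased so that the DQN's attention goes first to the portions of the sequence that were previously hard to decode.
An $\epsilon$-greedy policy is used during both training and testing to balance exploration and exploitation.
The final output is the decoded sequence from the last iteration, and performance is measured with smoothed BLEU.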
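
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `iterative_decode`, `epsilon_greedy`, `td_target`, and `toy_q` are hypothetical names, the candidate list is hand-written rather than produced by an LSTM, and `toy_q` is an oracle that peeks at a reference sentence purely so the example runs without a trained network. The discount factor `gamma=0.9` is likewise an arbitrary placeholder.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q_target(s', a')."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def iterative_decode(initial, candidates, q_fn, n_iters, epsilon=0.0, seed=0):
    """Refine `initial` by applying one (position, word) edit per iteration.

    candidates: list of (position, word) actions, a stand-in for the ranked
                word list an encoder-decoder LSTM would propose.
    q_fn:       hypothetical callable (tokens, action) -> estimated value.
    """
    rng = random.Random(seed)
    tokens = list(initial)
    for _ in range(n_iters):
        q_values = [q_fn(tokens, action) for action in candidates]
        pos, word = candidates[epsilon_greedy(q_values, epsilon, rng)]
        tokens[pos] = word  # the modified sequence feeds the next iteration
    return tokens


# Toy setup (assumption): the Q-function counts matches against a reference.
reference = "click here to read more from the new york times".split()


def toy_q(tokens, action):
    pos, word = action
    edited = tokens[:pos] + [word] + tokens[pos + 1:]
    return sum(a == b for a, b in zip(edited, reference))


initial = "click here to read more than the new york times".split()
candidates = [(5, "than"), (5, "from"), (5, "of")]

# epsilon=0.0 disables exploration, so the highest-valued edit is applied.
result = iterative_decode(initial, candidates, toy_q, n_iters=1, epsilon=0.0)
print(" ".join(result))
```

In a real system the role of `toy_q` would be played by the trained DQN, and `td_target` would be evaluated with a separate target network on transitions sampled from a replay buffer.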

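Experimental Results

Research Questions

RQ1: Can a Deep Q-Network (DQN) learn to refine text sequences iteratively and thereby outperform standard left-to-right decoding?
RQ2: How does biasing exploration toward previously hard-to-decode portions of the sequence affect the DQN's generalization to unseen data?
RQ3: How much does DQN-based text generation gain from LSTM-generated state and action representations, compared with modeling the state-action space directly?
RQ4: Does the DQN-based decoding strategy generalize better than greedy beam search on out-of-distribution test sentences?
RQ5: How does exploration at inference (test) time affect the final BLEU score of the generated sequences?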

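Key Findings

On seen sentences from the training set, the DQN decoder reaches a smoothed BLEU of 0.494, somewhat above the 0.425 of the baseline LSTM beam search.
On unseen sentences, the DQN decoder clearly outperforms the baseline, with a BLEU score of 0.228 versus 0.107, a relative improvement of about 113% by these figures.
Exploration during training helps the DQN generalize to unseen data, because it learns from the broader distribution of noisy, synthetic sequences generated while exploring.
Enabling exploration at inference (test) time with the $\epsilon$-greedy policy lowers performance, which suggests exploration should be disabled when testing.
DQN training converges in about 6 epochs, indicating that the state and action representation functions are effective and trainable.
In a single iteration, the method corrects the wrongly decoded sentence 'Click here to read more than the New York Times' to 'Click here to read more from the New York Times', illustrating its ability to refine its own output.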
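
To make the numbers above easier to check, the following sketch computes a relative improvement from two BLEU scores and a sentence-level smoothed BLEU for the example sentence pair. It rests on assumptions: `relative_improvement`, `ngrams`, and `smoothed_bleu` are hypothetical helper names, and the add-one smoothing on higher-order n-gram precisions is one common variant, which may differ from the exact smoothing used in the paper.

```python
import math
from collections import Counter


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n > 1 (an assumed variant)."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if n > 1:  # add-one smoothing keeps higher-order precisions non-zero
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0
        log_precision += math.log(matches / total) / max_n
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)


# Scores quoted in the findings: unseen sentences, then training sentences.
gain_unseen = relative_improvement(0.107, 0.228)  # about 1.13, i.e. roughly 113%
gain_seen = relative_improvement(0.425, 0.494)    # about 0.16, i.e. roughly 16%

reference = "Click here to read more from the New York Times".split()
before = "Click here to read more than the New York Times".split()
after = "Click here to read more from the New York Times".split()

score_before = smoothed_bleu(before, reference)
score_after = smoothed_bleu(after, reference)
print(f"{gain_unseen:.3f} {gain_seen:.3f} {score_before:.3f} {score_after:.3f}")
```

A single wrong word lowers every n-gram precision that spans it, which is why the one-word fix from "than" to "from" moves the sentence-level score noticeably.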


This review was generated by AI and checked by human editors.