QUICK REVIEW

[Paper Review] RUDDER: Return Decomposition for Delayed Rewards

Jose A. Arjona-Medina, Michael Gillhofer | arXiv (Cornell University) | Jun 20, 2018

Reinforcement Learning in Robotics | References: 121 | Cited by: 59

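One-Sentence Summary

RUDDER introduces reward redistribution and return decomposition to address delayed rewards. It turns the reinforcement learning task into a regression task solved with LSTM-based return decomposition, which yields substantial speed-ups on artificial tasks and improved scores on Atari games.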

ABSTRACT

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards. Source code is available at https://github.com/ml-jku/rudder and demonstration videos at https://goo.gl/EQerZV.
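
The decomposition the abstract describes can be written out explicitly. The notation below is an assumption made for illustration (undiscounted episodes of length T, with reward R_{t+1} received after acting at time t) and is not copied from the paper.

```latex
Q^{\pi}(s,a)
  = \underbrace{\mathbb{E}_{\pi}\!\left[ R_{t+1} \mid s_t = s,\, a_t = a \right]}_{\text{expected immediate reward}}
  + \underbrace{\mathbb{E}_{\pi}\!\left[ \sum_{\tau=1}^{T-t-1} R_{t+1+\tau} \,\middle|\, s_t = s,\, a_t = a \right]}_{\text{expected future rewards}}
```

If a reward redistribution drives the second term to zero, the Q-value reduces to the first term, which can be estimated by averaging observed immediate rewards.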

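Motivation and Objectives

Address long-term credit assignment with delayed rewards in finite Markov decision processes (MDPs).
Introduce reward redistribution to create return-equivalent sequence-Markov decision processes (SDPs) in which the expected future rewards are zero.
Develop return decomposition, which turns reinforcement learning into a regression task that can be learned efficiently.
Use LSTM-based return decomposition to identify how much each state-action pair contributes to the return.
Demonstrate speed-ups over TD, MC, MCTS, and reward shaping on synthetic tasks, and score improvements on Atari games.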

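Proposed Method

Define return-equivalent sequence-Markov decision processes (SDPs) through reward redistribution.
Aim for an optimal redistribution in which the expected future rewards are zero, so that Q-values can be estimated as the mean of the immediate reward.
Use return decomposition to identify the contribution of each state-action pair to the return of the sequence.
Train an LSTM-based model to predict the return of the whole sequence, and derive the redistributed rewards from differences between its predictions.
Learn in phases: safe exploration, an experience replay buffer, then LSTM-based return decomposition.
Plug the redistributed rewards into Q-learning, policy gradient, or PPO-based frameworks (for example, PPO with redistributed rewards).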
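
The step of deriving redistributed rewards from prediction differences can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `redistribute_rewards`, the hand-written prediction list standing in for LSTM outputs, the assumed prediction of zero before the first step, and the choice to add the residual prediction error to the final step are all assumptions made here.

```python
from typing import List, Sequence


def redistribute_rewards(predictions: Sequence[float], episode_return: float) -> List[float]:
    """Turn per-step return predictions into per-step rewards.

    predictions[t] is a model's estimate of the final episode return after
    seeing the sequence up to step t (a stand-in for LSTM outputs).
    """
    if len(predictions) == 0:
        raise ValueError("predictions must contain at least one time step")

    redistributed: List[float] = []
    previous = 0.0  # assumption: the prediction before the first step is zero
    for g in predictions:
        # Reward at step t is the change in the predicted return at that step.
        redistributed.append(g - previous)
        previous = g

    # Assumption: add the residual prediction error to the last step so the
    # redistributed rewards sum exactly to the observed episode return.
    redistributed[-1] += episode_return - predictions[-1]
    return redistributed


# Hypothetical episode: the prediction jumps at step 2, where the decisive
# action happened, although the environment pays the reward only at the end.
example_predictions = [0.0, 0.0, 1.0, 1.0, 1.0]
example_rewards = redistribute_rewards(example_predictions, episode_return=1.0)
print(example_rewards)
```

In this example the whole reward moves to step 2, and the redistributed rewards still sum to the original return of 1.0. A downstream learner such as PPO or Q-learning would then train on these per-step rewards instead of the delayed one.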

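Experimental Results

Research Questions

RQ1: Can reward redistribution produce return-equivalent SDPs with zero expected future rewards while preserving the optimal policies?
RQ2: Can return decomposition via contribution analysis enable efficient learning from delayed rewards through regression on complete episodes?
RQ3: How does RUDDER perform compared with TD, MC, MCTS, and reward shaping on synthetic delayed-reward tasks and on Atari games?
RQ4: What practical advantages does an LSTM offer over feed-forward models for return decomposition in this setting?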

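Key Findings

On artificial delayed-reward tasks, RUDDER is significantly faster than MC and exponentially faster than MCTS, TD(λ), and reward shaping approaches.
In theory, an optimal reward redistribution yields zero expected future rewards, which reduces Q-value estimation to computing the mean of the immediate reward.
Return decomposition identifies the state-action pairs that contribute most to the return, which makes an efficient redistribution of rewards possible.
On Atari games, RUDDER improves the scores of a PPO baseline, and the gains are largest in games with delayed rewards.
The experiments indicate that the proposed LSTM-based approach improves performance in finite MDPs with delayed rewards.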
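
The second finding can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not an experiment from the paper: the task (`toy_episode`), the constant `HORIZON`, and the hand-written redistribution are all hypothetical. Only the first action matters and the environment pays at the last step. Once the reward is moved to the first step, averaging the immediate reward per action recovers the Q-values at that step.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

HORIZON = 10  # hypothetical episode length


def toy_episode(first_action: int) -> Tuple[List[float], List[float]]:
    """Return (delayed rewards, redistributed rewards) for one episode."""
    # Original task: reward 1.0 at the final step if the first action was 1.
    delayed = [0.0] * HORIZON
    delayed[-1] = 1.0 if first_action == 1 else 0.0
    # Hand-written redistribution for this toy task: pay the reward at the
    # step where the decisive action was taken, so nothing is left for later.
    redistributed = [0.0] * HORIZON
    redistributed[0] = delayed[-1]
    return delayed, redistributed


def estimate_q_from_immediate_reward(num_episodes: int = 100) -> Dict[int, float]:
    """Estimate first-step Q-values as the mean immediate reward per action."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for episode in range(num_episodes):
        first_action = episode % 2  # alternate actions to keep the sketch deterministic
        delayed, redistributed = toy_episode(first_action)
        # Return equivalence: both reward sequences have the same return.
        assert sum(delayed) == sum(redistributed)
        totals[first_action] += redistributed[0]
        counts[first_action] += 1
    return {action: totals[action] / counts[action] for action in sorted(counts)}


q_first_step = estimate_q_from_immediate_reward()
print(q_first_step)
```

With the original delayed rewards, the immediate reward at the first step is always zero, so the same average would carry no information and the learner would have to propagate the final reward back over all ten steps.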


This review was generated by AI and checked by human editors.