QUICK REVIEW

[Paper Review] Learning Self-Imitating Diverse Policies

Tanmay Gangwani, Qiang Liu|arXiv (Cornell University)|May 25, 2018

Reinforcement Learning in Robotics | References: 49 | Cited by: 26

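One-Sentence Summary

This paper proposes a self-imitation learning algorithm that improves deep reinforcement learning in sparse-reward and episodic-reward environments by minimizing the Jensen-Shannon divergence between the policy's state-action visitation distribution and that of the high-return trajectories stored in its own experience replay buffer. The method derives dense rewards from these self-generated demonstrations, which makes credit assignment more efficient. It further combines Stein variational policy gradient with a JS kernel to learn diverse policies, and it significantly outperforms baselines on challenging MuJoCo locomotion tasks with sparse rewards.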

ABSTRACT

The success of popular algorithms for deep reinforcement learning, such as policy-gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or a rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced into a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.
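
For readers who want the divergence-minimization view from the abstract in symbols, the standard definition of the Jensen-Shannon divergence can be written out as below. The notation is an assumption made for this note and may differ from the paper's: \rho_\pi stands for the state-action visitation distribution of the current policy and \rho_E for the distribution induced by the high-return trajectories in the replay buffer.

```latex
D_{\mathrm{JS}}(\rho_\pi, \rho_E)
  = \tfrac{1}{2}\, D_{\mathrm{KL}}\!\left(\rho_\pi \,\middle\|\, \tfrac{\rho_\pi + \rho_E}{2}\right)
  + \tfrac{1}{2}\, D_{\mathrm{KL}}\!\left(\rho_E \,\middle\|\, \tfrac{\rho_\pi + \rho_E}{2}\right),
\qquad
\pi^{*} \in \arg\min_{\pi} \; D_{\mathrm{JS}}(\rho_\pi, \rho_E)
```

The divergence is symmetric, is zero exactly when the two distributions coincide, and is bounded above by \log 2 when natural logarithms are used.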

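Research Motivation and Goals

Address the low sample efficiency and difficult credit assignment of deep reinforcement learning in sparse-reward and episodic-reward environments.
Improve policy-gradient methods in sparse-reward settings by treating self-generated high-return trajectories as implicit demonstrations.
Overcome the limitations of single-policy self-imitation by promoting diversity across policies, which improves exploration and helps avoid local optima.
Develop a scalable, population-based algorithm that combines self-imitation with diversity regularization for continuous-control tasks.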

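Proposed Method

Formulate policy optimization as minimizing the Jensen-Shannon divergence between the state-action visitation distribution of the current policy and that of the high-return trajectories in the experience replay buffer.
Reduce this divergence minimization problem to a policy-gradient update that uses dense rewards derived from the self-generated expert trajectories.
Introduce a self-imitation mechanism in which the agent imitates its own past high-performing trajectories, producing an intrinsic dense supervision signal.
Use Stein variational policy gradient (SVPG) with a Jensen-Shannon kernel to explicitly encourage diversity among the policies in an ensemble.
Design a repulsion term based on the JS divergence between policy visitation distributions to promote exploration across distinct behavior modes.
Apply the method in a multi-agent ensemble setting, where each agent learns from the collective experience and diversity of the population.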
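
The first three steps above can be illustrated with a small tabular sketch. It is not the authors' implementation: the class `TopKTrajectoryBuffer` and the functions `visitation` and `shaped_rewards` are hypothetical names, the visitation distributions are plain undiscounted counts over discrete state-action pairs, and the reward convention `-log D(s, a)` with `D = rho_pi / (rho_pi + rho_E)` is an assumed discriminator-style form rather than a formula quoted from the paper.

```python
import heapq
import itertools
import math
from collections import Counter


class TopKTrajectoryBuffer:
    """Hypothetical replay buffer that keeps the k highest-return trajectories."""

    def __init__(self, k):
        self.k = k
        self._heap = []                      # min-heap ordered by episode return
        self._tiebreak = itertools.count()   # avoids comparing trajectories on ties

    def add(self, trajectory, episode_return):
        entry = (episode_return, next(self._tiebreak), trajectory)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif episode_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the lowest-return entry

    def trajectories(self):
        return [traj for _, _, traj in self._heap]


def visitation(trajectories):
    """Empirical (undiscounted) state-action visitation distribution."""
    counts = Counter(pair for traj in trajectories for pair in traj)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: c / total for pair, c in counts.items()}


def shaped_rewards(policy_trajs, buffer_trajs):
    """Dense reward -log D(s, a) for every pair the current policy visits.

    D = rho_pi / (rho_pi + rho_E) is an assumed density-ratio form; the reward
    is larger where the buffer visits a pair relatively more than the policy.
    """
    rho_pi = visitation(policy_trajs)
    rho_e = visitation(buffer_trajs)
    rewards = {}
    for pair, p in rho_pi.items():
        d = p / (p + rho_e.get(pair, 0.0))   # p > 0, so d lies in (0, 1]
        rewards[pair] = -math.log(d)
    return rewards


# Toy usage: three past episodes, buffer capacity two.
buffer = TopKTrajectoryBuffer(k=2)
buffer.add([("s0", "right"), ("s1", "right")], episode_return=1.0)
buffer.add([("s0", "left")], episode_return=0.0)
buffer.add([("s0", "right"), ("s1", "up")], episode_return=2.0)

current = [[("s0", "left"), ("s1", "right")]]
rewards = shaped_rewards(current, buffer.trajectories())
```

In this toy run the zero-return episode is evicted, so the pair `("s1", "right")`, which also appears in the buffer, receives a higher shaped reward than `("s0", "left")`, which does not. A practical version would replace the count-based ratio with a learned estimator over continuous states and actions.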

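Experimental Results

Research Questions

RQ1: Can self-imitation with self-generated high-return trajectories improve the sample efficiency of deep reinforcement learning in sparse-reward environments?
RQ2: How does self-imitation with dense shaped rewards compare with standard policy-gradient methods in dense-reward and sparse-reward environments?
RQ3: Can a kernel-based repulsion term in policy space effectively induce diversity among policies?
RQ4: Does combining self-imitation with diversity learning lead to faster convergence and better performance on challenging exploration tasks?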

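Key Findings

The proposed self-imitation algorithm performs comparably to standard policy-gradient methods in dense-reward environments and significantly better in sparse and episodic environments.
In the Maze environment, SI-interact-JS enables multiple agents to explore and reach the high-reward green zone, whereas SI-independent and PPO-independent agents fail to find the goal.
On SparseHopper and SparseHalfCheetah, SI-interact-JS discovers hopping and forward-locomotion behaviors faster than SI-independent, which relies solely on action-space noise and therefore explores inefficiently.
The JS kernel in SI-interact-JS markedly increases policy diversity, visible as lighter cells in the kernel matrix, which indicate higher JS divergence between pairs of policies.
SI-interact-RBF, which uses an RBF kernel, performs worse, suggesting that the JS kernel is better suited to promoting meaningful diversity in policy visitation distributions.
PPO-independent agents tend to get stuck in local optima (for example, standing still to avoid the energy penalty), whereas SI-interact-JS avoids this by actively exploring diverse behaviors.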
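
The kernel matrix mentioned in the findings can be illustrated with a toy computation. This is a minimal sketch under stated assumptions, not the paper's code: each policy is represented by a hand-written discrete visitation distribution, the kernel is taken to be `exp(-JS / temperature)`, and `js_divergence`, `js_kernel_matrix`, and the `temperature` parameter are hypothetical names introduced here.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute zero; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_kernel_matrix(distributions, temperature=1.0):
    """Pairwise kernel k[i][j] = exp(-JS(rho_i, rho_j) / temperature)."""
    n = len(distributions)
    kernel = [[1.0] * n for _ in range(n)]   # JS(rho, rho) = 0, so the diagonal is 1
    for i in range(n):
        for j in range(i + 1, n):
            js = js_divergence(distributions[i], distributions[j])
            kernel[i][j] = kernel[j][i] = math.exp(-js / temperature)
    return kernel


# Toy visitation distributions: policies 0 and 1 behave identically,
# policy 2 concentrates on a different state-action pair.
rhos = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
K = js_kernel_matrix(rhos)
```

Identical policies give a kernel entry of 1, while the dissimilar pair gives a smaller entry, so low off-diagonal values correspond to high pairwise JS divergence. In a Stein variational update the kernel's gradient would supply the repulsion between policies; that step is omitted here because it needs differentiable visitation estimates.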

This review was generated by AI and checked by a human editor.