QUICK REVIEW

[Paper Review] Diversity-Driven Exploration Strategy for Deep Reinforcement Learning

Zhang-Wei Hong, Tzu-Yun Shann|arXiv (Cornell University)|Feb 13, 2018

Reinforcement Learning in Robotics | References: 26 | Cited by: 50

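One-Sentence Summary

This paper proposes a diversity-driven exploration method that adds a distance-based regularization term to the loss to promote policy diversity and better exploration. The method includes adaptive scaling, applies to both off-policy and on-policy DRL, and is evaluated on gridworlds, Atari, and MuJoCo.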

ABSTRACT

Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces, deceptive local optima, or sparse rewards. To tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behaviors, and thus preventing the policy from being trapped in local optima. We further propose an adaptive scaling method for stabilizing the learning process. Our experimental results in Atari 2600 show that our method outperforms baseline approaches in several tasks in terms of mean scores and exploration efficiency.

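Motivation and Objectives

Encourage robust exploration in DRL to overcome deceptive and sparse rewards.
Develop a loss-function augmentation that encourages divergence from recent policies.
Make the method compatible with both off-policy and on-policy algorithms.
Propose an adaptive scaling strategy to balance exploration and exploitation.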

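Proposed Method

Define the loss L_D = L - E_{pi' in Pi'}[ alpha D(pi, pi') ] to promote policy diversity.
Use a distance measure D (KL divergence, L2, or MSE) between the current policy pi and the set of recent policies Pi'.
Obtain Div-DQN and Div-DDPG by adding the distance term to the respective loss functions.
Obtain Div-A2C by maintaining a set of recent policies from which the distance term is computed.
Introduce adaptive scaling of alpha through a distance-based method and a performance-based method.
Clip the distance measure D to stabilize training.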
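
The loss modification listed above can be illustrated with a minimal NumPy sketch. It assumes discrete action distributions, uses KL divergence as D, treats alpha as a single scalar, and clips D at a bound called `d_max`. The function names, default values, and the clipping bound are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) per row for arrays of shape (batch, n_actions)."""
    p = np.clip(p, eps, 1.0)  # avoid log(0) and division by zero
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)


def diversity_loss(base_loss, current_probs, prior_probs_list,
                   alpha=0.1, d_max=1.0):
    """Sketch of L_D = L - E_{pi' in Pi'}[alpha * D(pi, pi')].

    base_loss        -- scalar loss L of the underlying algorithm
    current_probs    -- action distributions of the current policy pi
    prior_probs_list -- one array per recent policy pi' in Pi'
    alpha, d_max     -- assumed scalar weight and assumed clipping bound
    """
    distances = []
    for prior_probs in prior_probs_list:
        d = kl_divergence(current_probs, prior_probs)  # distance per state
        d = np.minimum(d, d_max)                       # clip D for stability
        distances.append(d.mean())                     # average over the batch
    # Subtracting the term rewards policies that differ from recent ones.
    return base_loss - alpha * float(np.mean(distances))
```

With identical current and prior policies the distance term vanishes and the loss is unchanged. The more the current policy differs from the prior ones, the lower the loss becomes, up to the clipping bound.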

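Experimental Results

Research Questions

RQ1: Does diversity-driven exploration improve learning in environments with large state spaces, deceptive rewards, or sparse rewards?
RQ2: Can the distance-based loss term be integrated effectively into both off-policy and on-policy DRL algorithms?
RQ3: How do the adaptive scaling strategies affect learning stability and performance?
RQ4: How much do different distance measures (KL, L2, MSE) affect exploration efficiency and final performance?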

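Key Findings

Diversity-driven exploration achieves better exploration and policy performance in large gridworlds with deceptive or sparse rewards.
Div-DQN and Div-A2C match or outperform the baselines on Atari 2600 games and MuJoCo tasks, and in many cases learn faster.
The adaptive scaling strategies (distance-based and performance-based) improve stability and final performance, especially for on-policy methods.
Exploration improves because the agent is encouraged to try policies that differ from recent ones, which increases visits to novel states.
Compared with standard exploration methods, the proposed method avoids local optima and deceptive rewards more effectively on several benchmarks.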
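
The findings above credit adaptive scaling of alpha with better stability, but this summary does not give the update rules. The sketch below shows one plausible form of each variant, purely as an assumption: a multiplicative distance-based rule that raises alpha when the current policy is too close to recent ones, and a performance-based rule that linearly maps each recent policy's return to a weight in [-1, 1]. The function names, the `target` threshold, and the `factor` value are hypothetical.

```python
import numpy as np


def adapt_alpha_by_distance(alpha, mean_distance, target, factor=1.01):
    """Assumed distance-based rule (illustrative, not the paper's exact form).

    If the policy is closer to recent policies than `target`, push harder
    for diversity by growing alpha; otherwise shrink it.
    """
    if mean_distance < target:
        return alpha * factor
    return alpha / factor


def alpha_by_performance(returns):
    """Assumed performance-based rule (illustrative).

    Maps each recent policy's return linearly onto [-1, 1]: the worst
    policy gets +1 (move away from it), the best gets -1 (move toward it).
    """
    returns = np.asarray(returns, dtype=float)
    low, high = returns.min(), returns.max()
    if high == low:
        # No performance signal to distinguish the policies.
        return np.zeros_like(returns)
    normalized = (returns - low) / (high - low)  # in [0, 1]
    return -(2.0 * normalized - 1.0)             # in [-1, 1]
```

Under these assumptions, the per-policy weights from `alpha_by_performance` would replace the single scalar alpha inside the expectation, so that the distance term repels the policy from poorly performing recent policies and attracts it toward well performing ones.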

This review was generated by AI and checked by a human editor.