QUICK REVIEW

[Paper Review] Softmax Deep Double Deterministic Policy Gradients

Ling Pan, Qingpeng Cai|arXiv (Cornell University)|Oct 19, 2020

Reinforcement Learning in Robotics|References: 33|Cited by: 45

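One-Sentence Summary

The paper applies the Boltzmann softmax operator to value updates in continuous control and proposes Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), which reduce estimation bias and outperform DDPG, TD3, and SAC.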

ABSTRACT

A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect the performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively improve the overestimation and underestimation bias. We conduct extensive experiments on challenging continuous control tasks, and results show that SD3 outperforms state-of-the-art methods.

Motivation and Objectives

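Address the overestimation and underestimation biases that arise in actor-critic methods for continuous control.
Analyze the Boltzmann softmax operator theoretically in continuous action spaces.
Develop single-estimator (SD2) and double-estimator (SD3) variants to improve value estimation.
Demonstrate the smoothing effect on the optimization landscape and the resulting empirical performance gains.
Compare against state-of-the-art methods and evaluate sample efficiency.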
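For reference, the Boltzmann softmax operator over a continuous action space is commonly written as below. The notation (action space \mathcal{A}, inverse temperature \beta) is assumed here for illustration and is not quoted from the paper.

```latex
\operatorname{softmax}_{\beta}\bigl(Q(s,\cdot)\bigr)
  = \int_{a \in \mathcal{A}}
      \frac{\exp\bigl(\beta Q(s,a)\bigr)}
           {\int_{a' \in \mathcal{A}} \exp\bigl(\beta Q(s,a')\bigr)\,da'}
      \, Q(s,a)\,da ,
\qquad \beta \ge 0 .
```

With \beta = 0 the operator reduces to the uniform average of Q over the action space, and as \beta grows it approaches \max_a Q(s,a), so \beta trades off between the mean and the max.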

Proposed Method

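Define the softmax operator over Q-values in continuous action spaces and derive error bounds (Theorems 1 and 2).
Incorporate softmax into the single-critic DDPG framework, using importance sampling to obtain an unbiased Q-value estimate (Eq. 3).
Prove that SD2 smooths the optimization landscape and reduces overestimation (Theorem 3).
Extend to a TD3-inspired double-critic framework to create SD3, which addresses underestimation bias by applying softmax to the minimum of the two Q-functions (Eq. 5).
Provide algorithmic details and a practical implementation (Appendix C), in which actions are sampled around the target policy and clipped to control variance.
Compare SD2/SD3 empirically with DDPG, TD3, and SAC on MuJoCo/OpenAI Gym tasks, including ablation studies (Section 5).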
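The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `softmax_value` and `sd3_style_target`, the toy critics, and all hyperparameter values (`K`, `sigma`, `c`, `beta`) are hypothetical, and the Gaussian log-density is used for the clipped noise as an approximation.

```python
import numpy as np


def softmax_value(q_values, log_probs, beta):
    """Importance-sampled estimate of the softmax operator from K sampled actions.

    q_values:  shape (K,), Q(s', a_k) for the sampled actions a_k
    log_probs: shape (K,), log-density of each a_k under the sampling distribution
    beta:      inverse temperature (>= 0)
    """
    q_values = np.asarray(q_values, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    # Log of the unnormalised importance weights exp(beta * Q) / p(a).
    log_w = beta * q_values - log_probs
    log_w = log_w - log_w.max()  # shift for numerical stability
    w = np.exp(log_w)
    return float(np.sum(w * q_values) / np.sum(w))


def sd3_style_target(reward, done, gamma, q1, q2, log_probs, beta):
    """Bootstrapped target: softmax applied to the element-wise min of two critics."""
    q_min = np.minimum(q1, q2)
    return reward + gamma * (1.0 - done) * softmax_value(q_min, log_probs, beta)


# Toy usage with a 1-D action space in [-1, 1] (all values are illustrative).
rng = np.random.default_rng(0)
K, sigma, c = 50, 0.2, 0.5
target_action = np.array([0.3])  # hypothetical target-policy output for s'
noise = np.clip(rng.normal(0.0, sigma, size=(K, 1)), -c, c)  # clipped noise
actions = np.clip(target_action + noise, -1.0, 1.0)
# Gaussian log-density of the noise, summed over action dimensions.
log_probs = (-0.5 * (noise / sigma) ** 2
             - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=1)
q1 = -(actions[:, 0] - 0.5) ** 2  # toy critic 1
q2 = q1 - 0.05                    # toy critic 2, slightly more pessimistic
y = sd3_style_target(1.0, 0.0, 0.99, q1, q2, log_probs, beta=10.0)
print(y)
```

Dividing by the sampling density is what makes this an importance-sampled estimate of the two integrals in the softmax, and clipping the noise keeps sampled actions close to the target action, which limits the variance of the weights.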

Experimental Results

Research Questions

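RQ1: In continuous action spaces, is the error of the softmax operator bounded relative to the optimal value function?
RQ2: Do softmax-based updates reduce overestimation bias in the single-critic method (SD2)?
RQ3: Compared with TD3, do softmax-based updates improve the underestimation bias in the double-critic method (SD3)?
RQ4: On standard continuous control benchmarks, do SD2/SD3 achieve better sample efficiency and final performance than state-of-the-art baselines (TD3 and SAC)?
RQ5: How does the softmax operator affect the optimization landscape of actor-critic learning?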

Key Findings

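SD3 outperforms TD3 and SAC on standard continuous control tasks, with higher final performance and greater stability.
SD2 reduces overestimation bias and improves sample efficiency relative to DDPG.
The softmax operator smooths the actor's optimization landscape, which facilitates learning.
SD3 mitigates the underestimation bias present in TD3 by applying softmax in the double-estimator setting.
The theoretical results (Theorems 1 to 4) bound the error of the softmax operator and compare the bias of SD2/SD3 with that of the baseline methods.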
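The intuition behind the bias findings can be illustrated with a small toy experiment. It is not taken from the paper: it uses a hypothetical discrete set of 10 actions whose true values are all zero, adds unit Gaussian noise to mimic critic error, and compares the average of a hard max over the noisy estimates with the average of a Boltzmann softmax over them (the name `boltzmann_softmax` and `beta=1.0` are assumptions for this sketch).

```python
import numpy as np


def boltzmann_softmax(q, beta):
    """Softmax-weighted average of q over a finite set of actions."""
    q = np.asarray(q, dtype=float)
    w = np.exp(beta * (q - q.max()))  # shift by the max for stability
    return float(np.sum(w * q) / np.sum(w))


rng = np.random.default_rng(0)
true_q = np.zeros(10)  # every action is truly worth 0, so the true max is 0
trials = 2000
max_est, soft_est = [], []
for _ in range(trials):
    noisy = true_q + rng.normal(0.0, 1.0, size=true_q.shape)  # noisy critic
    max_est.append(noisy.max())
    soft_est.append(boltzmann_softmax(noisy, beta=1.0))

bias_max = float(np.mean(max_est))    # average overestimation of the hard max
bias_soft = float(np.mean(soft_est))  # average overestimation of the softmax
print(f"max bias: {bias_max:.3f}, softmax bias: {bias_soft:.3f}")
```

Because the softmax is a weighted average of the noisy values, it can never exceed their maximum, so its upward bias in this toy setting is smaller than that of the hard max while remaining above the plain mean.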


This review was generated by AI and reviewed by human editors.