QUICK REVIEW

[Paper Review] Continuous Deep Q-Learning with Model-based Acceleration

Shixiang Gu, Timothy Lillicrap|arXiv (Cornell University)|Mar 2, 2016

Reinforcement Learning in Robotics|References: 39|Cited by: 336

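One-sentence summary

The paper derives continuous Q-learning with normalized advantage functions (NAF) for efficient off-policy learning in continuous action spaces, and improves sample efficiency further with imagination rollouts generated from locally fitted linear dynamics.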

ABSTRACT

Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

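Motivation and goals

Reduce the sample complexity of deep reinforcement learning on continuous control tasks.
Develop a Q-learning variant for continuous actions that avoids the complexity of training separate actor and critic networks.
Investigate model-based acceleration techniques that preserve the advantages of model-free learning.
Evaluate the proposed methods on simulated robotic control benchmarks.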

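Proposed method

Propose a continuous Q-learning variant (NAF) that decomposes Q(x,u) into V(x) + A(x,u), where A is quadratic in (u - mu(x)).
Parameterize the Q-function so that the maximizing action mu(x) is available analytically.
Use a deep network to output V, mu, and the positive-definite matrix P(x) that defines A, where A(x,u) = -1/2 (u - mu(x))^T P(x) (u - mu(x)).
Train with the standard deep Q-learning tools: experience replay, target networks, and Bellman backups.
Introduce imagination rollouts: augment real experience with synthetic on-policy rollouts from a learned local linear dynamics model to accelerate learning (similar to Dyna).
Fit the dynamics locally as time-varying linear models, and generate additional training data with short rollouts around sampled states.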
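The NAF decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `naf_q_value` and `td_target` are hypothetical, the network outputs are stood in for by plain arrays, and building P(x) as L L^T from a lower-triangular L with an exponentiated diagonal is an assumption made here to keep P positive definite, since the summary does not say how P is constructed.

```python
import numpy as np


def naf_q_value(v, mu, l_entries, u):
    """Evaluate Q(x,u) = V(x) + A(x,u) for one state.

    v         : scalar V(x)
    mu        : array (d,), the greedy action mu(x)
    l_entries : array (d*(d+1)/2,), raw lower-triangular entries (assumed layout)
    u         : array (d,), the action to score
    """
    dim = mu.shape[0]
    lower = np.zeros((dim, dim))
    lower[np.tril_indices(dim)] = l_entries
    # Exponentiate the diagonal so that P = L L^T is strictly positive definite.
    diag = np.diag_indices(dim)
    lower[diag] = np.exp(lower[diag])
    p_matrix = lower @ lower.T
    delta = u - mu
    advantage = -0.5 * delta @ p_matrix @ delta  # always <= 0, zero at u = mu
    return v + advantage


def td_target(reward, gamma, v_next):
    """Bellman target r + gamma * max_u Q(x', u); under NAF the max is V(x')."""
    return reward + gamma * v_next
```

Because the advantage term is a negative quadratic centered at mu(x), the maximum of Q over actions is simply V(x), which is why the Bellman target needs no inner optimization over u.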

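Experimental results

Research questions

RQ1: Do normalized advantage functions (NAF) provide more sample-efficient Q-learning in continuous action spaces than actor-critic methods such as DDPG?
RQ2: Can model-based imagination rollouts with locally fitted dynamics substantially accelerate model-free Q-learning without hurting final performance?
RQ3: How does using the true dynamics versus learned dynamics affect the benefit of imagination rollouts?
RQ4: How do off-policy planning signals (such as iLQG trajectories) compare with on-policy imagination rollouts for accelerating learning?
RQ5: What are the limitations of imagination rollouts, and how sensitive are they to imperfect dynamics models?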

Key findings

Domains	Random agent reward	DDPG reward	DDPG episodes	NAF reward	NAF episodes
Cartpole	-2.1	-0.601	420	-0.604	190
Reacher	-2.3	-0.509	1370	-0.331	1260
Peg	-11	-0.950	690	-0.438	130
Gripper	-29	1.03	2420	1.81	1920
GripperM	-90	-20.2	1350	-12.4	730
Canada2d	-12	-4.64	1040	-4.21	900
Cheetah	-0.3	8.23	1590	7.91	2390
Swimmer6	-325	-174	220	-172	190
Ant	-4.8	-2.54	2450	-2.58	1350
Walker2d	0.3	2.96	850	1.85	1530
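The comparison in the table can be tallied with a short script. The sketch below copies the last four values of each row (DDPG reward, DDPG episodes, NAF reward, NAF episodes) and counts the domains where NAF reaches a higher reward or needs fewer episodes. The variable names are illustrative only.

```python
# (domain, DDPG reward, DDPG episodes, NAF reward, NAF episodes), copied from the table.
RESULTS = [
    ("Cartpole", -0.601, 420, -0.604, 190),
    ("Reacher", -0.509, 1370, -0.331, 1260),
    ("Peg", -0.950, 690, -0.438, 130),
    ("Gripper", 1.03, 2420, 1.81, 1920),
    ("GripperM", -20.2, 1350, -12.4, 730),
    ("Canada2d", -4.64, 1040, -4.21, 900),
    ("Cheetah", 8.23, 1590, 7.91, 2390),
    ("Swimmer6", -174, 220, -172, 190),
    ("Ant", -2.54, 2450, -2.58, 1350),
    ("Walker2d", 2.96, 850, 1.85, 1530),
]

# Domains where NAF ends with the higher reward.
naf_higher_reward = [d for d, dr, _, nr, _ in RESULTS if nr > dr]
# Domains where NAF needs fewer episodes than DDPG.
naf_fewer_episodes = [d for d, _, de, _, ne in RESULTS if ne < de]
# Ratio of DDPG episodes to NAF episodes (above 1 means NAF was faster).
episode_ratio = {d: de / ne for d, _, de, _, ne in RESULTS}

print(len(naf_higher_reward), "of", len(RESULTS), "domains: NAF reward higher")
print(len(naf_fewer_episodes), "of", len(RESULTS), "domains: NAF needs fewer episodes")
```

On these numbers NAF has the higher reward in 6 of 10 domains and needs fewer episodes in 8 of 10, with Cheetah and Walker2d as the two domains where DDPG is ahead on both measures.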

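On many manipulation tasks, NAF generally outperforms DDPG, converging faster and reaching the goal state more precisely.
On locomotion tasks, NAF and DDPG perform more comparably, with NAF slightly better or slightly worse depending on the domain.
On manipulation tasks such as reacher and gripper, imagination rollouts with iteratively refitted time-varying linear dynamics improve data efficiency substantially (2 to 5 times).
Imagination rollouts with the true dynamics yield strong gains, whereas learned neural network dynamics can cancel those gains; locally fitted linear models are preferable.
Off-policy iLQG exploration gives limited or inconsistent improvement over imagination rollouts alone; on-policy imagination rollouts are consistently beneficial.
Imagination rollouts help most early in learning; the gains can diminish as the Q-function becomes more accurate, which supports a hybrid scheme that finishes with purely model-free learning.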
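The imagination rollouts discussed in these findings can be illustrated with a deliberately simplified sketch. It fits a single linear model x' = F [x; u] + f by least squares and rolls the current policy forward through it for a few steps. The function names, the single time-invariant fit (the paper uses time-varying local models), and the Gaussian exploration noise are all assumptions made for illustration. Reward labelling of the synthetic transitions is left out.

```python
import numpy as np


def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of x' = F [x; u] + f from observed transitions."""
    ones = np.ones((len(states), 1))
    inputs = np.hstack([states, actions, ones])  # shape (N, dx + du + 1)
    coeffs, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)
    dynamics_matrix = coeffs[:-1].T  # F, shape (dx, dx + du)
    offset = coeffs[-1]              # f, shape (dx,)
    return dynamics_matrix, offset


def imagination_rollout(x0, policy, dynamics_matrix, offset, horizon, noise_std, rng):
    """Roll the policy through the fitted model and collect synthetic transitions."""
    x = np.asarray(x0, dtype=float)
    transitions = []
    for _ in range(horizon):
        u = policy(x)
        u = u + noise_std * rng.normal(size=u.shape)  # exploration noise (assumed)
        x_next = dynamics_matrix @ np.concatenate([x, u]) + offset
        transitions.append((x, u, x_next))
        x = x_next
    return transitions
```

In a Dyna-style loop, transitions like these would be added to a separate replay buffer alongside real experience, and the model would be refitted as new real data arrives.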


This review was generated by AI and reviewed by human editors.