QUICK REVIEW

[Paper Review] Exploring Model-based Planning with Policy Networks

Tingwu Wang, Jimmy Ba|arXiv (Cornell University)|Jun 20, 2019

Reinforcement Learning in Robotics|References: 39|Cited by: 76

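One-Sentence Summary

POPLIN (model-based policy planning) combines a policy network with online planning: it optimizes either action sequences initialized from the policy network or the policy network's parameters directly, and achieves state-of-the-art sample efficiency on MuJoCo tasks.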

ABSTRACT

Model-based reinforcement learning (MBRL) with model-predictive control or online planning has shown great potential for locomotion control tasks in terms of both sample efficiency and asymptotic performance. Despite their initial successes, the existing planning methods search from candidate sequences randomly generated in the action space, which is inefficient in complex high-dimensional environments. In this paper, we propose a novel MBRL algorithm, model-based policy planning (POPLIN), that combines policy networks with online planning. More specifically, we formulate action planning at each time-step as an optimization problem using neural networks. We experiment with both optimization w.r.t. the action sequences initialized from the policy network, and also online optimization directly w.r.t. the parameters of the policy network. We show that POPLIN obtains state-of-the-art performance in the MuJoCo benchmarking environments, being about 3x more sample efficient than the state-of-the-art algorithms, such as PETS, TD3 and SAC. To explain the effectiveness of our algorithm, we show that the optimization surface in parameter space is smoother than in action space. Furthermore, we found the distilled policy network can be effectively applied without the expensive model predictive control during test time for some environments such as Cheetah. Code is released at https://github.com/WilsonWangTHU/POPLIN.
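
The two optimization problems named in the abstract can be written out roughly as follows. This is a sketch in assumed notation, not the paper's own equations: f_\phi is the learned dynamics model, r the reward, H the planning horizon, \pi_\theta the policy network, \hat a_k the action the policy network proposes at step k, \delta an action-space perturbation, and \omega a parameter-space perturbation.

```latex
% Assumed notation (not copied from the paper); t is the current time-step.
\begin{aligned}
\text{action space:}\quad
  & \max_{\delta_{t:t+H-1}} \;
    \mathbb{E}\Big[\sum_{k=t}^{t+H-1} r\big(s_k,\ \hat a_k + \delta_k\big)\Big],
  && s_{k+1} \sim f_\phi\big(\cdot \mid s_k,\ \hat a_k + \delta_k\big), \\
\text{parameter space:}\quad
  & \max_{\omega_{t:t+H-1}} \;
    \mathbb{E}\Big[\sum_{k=t}^{t+H-1} r\big(s_k,\ \pi_{\theta+\omega_k}(s_k)\big)\Big],
  && s_{k+1} \sim f_\phi\big(\cdot \mid s_k,\ \pi_{\theta+\omega_k}(s_k)\big).
\end{aligned}
```

In both cases only the first action of the optimized plan is executed before replanning. Sharing a single \omega across the horizon versus using a separate \omega_k per step appears to be what distinguishes the Uni and Sep variants mentioned in the findings, though that reading is an assumption here.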

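Motivation and Objectives

Improve the sample efficiency of model-based reinforcement learning on high-dimensional locomotion tasks.
Propose a planning framework in which a policy network generates good proposals for online planning.
Show that planning in policy parameter space yields a smoother optimization surface and a more efficient search.
Demonstrate state-of-the-art performance on the MuJoCo benchmarks, with a substantial gain in sample efficiency.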

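Proposed Method

Formulate planning at each time-step as an optimization over action sequences or over policy parameters.
POPLIN-A: use the policy network to propose an action sequence, then refine that sequence in action space with the cross-entropy method (CEM).
POPLIN-P: plan in policy parameter space by perturbing the network parameters and evaluating the resulting policies.
Two distillation routes: policy distillation (behavior cloning (BC), GAN) and AVG-based updates that accumulate planning experience in the policy network.
Compare MPC control (plan, then execute the first planned action) with direct policy control (execute the policy network's output).
Provide an empirical analysis of optimization-surface smoothness and of the advantages of planning in parameter space.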
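
The difference between the two planners listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released POPLIN code: the point-mass dynamics stand in for a learned model, the policy is a single tanh layer, and the names `rollout_return`, `cem`, `plan_action_space`, and `plan_parameter_space` are hypothetical. The parameter-space planner uses one perturbation shared across the horizon.

```python
import numpy as np

STEP = 0.1  # step size of the toy model (assumption)

def policy(W, s):
    """Tiny stand-in policy network: one tanh layer."""
    return np.tanh(W @ s)

def rollout_return(s0, action_fn, horizon):
    """Roll the toy model forward and sum rewards.
    action_fn(t, s) gives the action at step t in state s."""
    s, total = s0.copy(), 0.0
    for t in range(horizon):
        a = np.clip(action_fn(t, s), -1.0, 1.0)
        s = s + STEP * a  # hypothetical stand-in for a learned dynamics model
        total += -float(s @ s) - 0.01 * float(a @ a)  # reward: stay near the origin
    return total

def cem(objective, dim, rng, iters=5, pop=64, n_elite=8, init_std=0.5):
    """Cross-entropy method over a perturbation vector, starting from zero."""
    mean, std = np.zeros(dim), np.full(dim, init_std)
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, dim))
        scores = np.array([objective(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # highest-return samples
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def plan_action_space(W, s0, horizon, rng):
    """POPLIN-A style: the policy proposes actions, CEM refines them."""
    proposal, s = [], s0.copy()
    for _ in range(horizon):
        a = policy(W, s)
        proposal.append(a)
        s = s + STEP * a
    proposal = np.array(proposal)  # shape (horizon, action_dim)

    def objective(delta):
        actions = proposal + delta.reshape(proposal.shape)
        return rollout_return(s0, lambda t, _s: actions[t], horizon)

    delta = cem(objective, proposal.size, rng)
    refined = proposal + delta.reshape(proposal.shape)
    return np.clip(refined[0], -1.0, 1.0)  # MPC: execute only the first action

def plan_parameter_space(W, s0, horizon, rng):
    """POPLIN-P style: CEM perturbs the policy weights instead of the actions."""
    def objective(omega):
        W_pert = W + omega.reshape(W.shape)
        return rollout_return(s0, lambda _t, s: policy(W_pert, s), horizon)

    omega = cem(objective, W.size, rng, init_std=0.3)
    return policy(W + omega.reshape(W.shape), s0)

rng = np.random.default_rng(0)
W0 = -0.5 * np.eye(2)          # arbitrary initial policy weights
s0 = np.array([1.0, -0.5])
a_action = plan_action_space(W0, s0, horizon=10, rng=rng)
p_action = plan_parameter_space(W0, s0, horizon=10, rng=rng)
print("action-space plan, first action:   ", a_action)
print("parameter-space plan, first action:", p_action)
```

Both planners reuse the same `cem` routine; only the search variable changes. In action space the search dimension grows with the horizon, while in this parameter-space variant it is fixed by the size of the policy.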

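Experimental Results

Research Questions

RQ1: Does combining a policy network with online planning beat conventional random-sampling MPC methods such as PETS in sample efficiency?
RQ2: Is planning in policy parameter space easier than adding noise in action space, because the optimization landscape is smoother?
RQ3: How do the different policy distillation schemes affect final task performance and the feasibility of real-time control?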

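Main Findings

POPLIN achieves state-of-the-art performance on the MuJoCo benchmarks and is about 3x more sample efficient than PETS, TD3 and SAC.
Planning in parameter space (POPLIN-P) yields a smoother optimization surface than planning in action space, which makes the search more effective.
In some environments (for example Cheetah), the distilled policy network performs well at test time without expensive online planning.
POPLIN-A does well on simpler tasks (Pendulum, Inverted Pendulum, Swimmer) but falls behind POPLIN-P on more complex ones (Ant, Cheetah, Hopper).
The POPLIN-P variants (Uni, Sep, Avg, GAN, BC) show different strengths across environments; POPLIN-P-Sep often beats POPLIN-P-Uni in planning efficiency.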
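
The behavior-cloning route to a planner-free test-time policy can be illustrated with a toy fit. This is a sketch under loud assumptions: the "planner" is a synthetic linear map with noise, the distilled policy is linear and fitted by least squares, and `W_planner`, `W_policy`, and `direct_policy_control` are hypothetical names. The paper distills into neural networks trained on data gathered by the real planner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic planning data (assumption): states visited during training and
# the first actions the planner chose in those states.
states = rng.standard_normal((500, 3))
W_planner = np.array([[0.5, -0.2, 0.1],
                      [0.0, 0.3, -0.4]])  # stand-in for what the planner would do
planned_actions = states @ W_planner.T + 0.01 * rng.standard_normal((500, 2))

# Behavior cloning with a linear policy: regress planned actions on states.
W_fit, *_ = np.linalg.lstsq(states, planned_actions, rcond=None)
W_policy = W_fit.T  # shape (action_dim, state_dim)

def direct_policy_control(s):
    """Test-time control without planning: one forward pass of the cloned policy."""
    return W_policy @ s

s_test = np.array([0.2, -1.0, 0.5])
print("cloned policy action:", direct_policy_control(s_test))
print("planner-like action: ", W_planner @ s_test)
```

At test time the cloned policy costs one matrix product per step, whereas MPC control runs a full search over rollouts at every step. Whether the cloned policy matches the planner closely enough depends on the environment.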

This review was generated by AI and checked by a human editor.