QUICK REVIEW

[Paper Review] V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

Hao Song, Abbas Abdolmaleki|arXiv (Cornell University)|Sep 26, 2019

Reinforcement Learning in Robotics | References: 33 | Cited by: 39

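One-Sentence Summary

V-MPO is an on-policy adaptation of MPO, proposed as an alternative to policy gradient methods, that performs policy iteration using a learned state-value function and achieves strong results in discrete and continuous control without entropy regularization or population-based tuning.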

ABSTRACT

Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported.

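Motivation and Objectives

Reduce the variance and instability associated with policy gradient methods in on-policy reinforcement learning.
Develop an on-policy algorithm based on MPO that performs policy iteration using a learned state-value function.
Demonstrate strong performance on discrete and continuous control benchmarks without additional regularization or population-based hyperparameter tuning.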

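Proposed Method

Introduces V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO).
Performs policy iteration guided by a learned state-value function.
Avoids entropy regularization and importance weighting while keeping learning stable.
Shows that the method applies to both discrete and continuous action spaces, including high-dimensional tasks.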
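
This summary does not spell out the update rule itself. As a rough illustration of how a value-based, MPO-style policy improvement step can look, the following is a minimal sketch in plain Python. It assumes advantages have already been computed from the learned state-value function, that only the top half of samples by advantage is kept, and that those samples are reweighted by a softmax with temperature `eta`. The function names, the top-half selection rule, and the bound `eps_eta` are assumptions for illustration and are not stated in this summary; the KL trust-region term is omitted entirely.

```python
import math

def top_half(advantages):
    """Indices of the highest-advantage half of the batch (assumed selection rule)."""
    k = max(1, len(advantages) // 2)
    order = sorted(range(len(advantages)), key=lambda i: advantages[i], reverse=True)
    return order[:k]

def softmax_weights(advantages, eta):
    """Softmax of advantage / eta over the selected samples, keyed by sample index."""
    idx = top_half(advantages)
    m = max(advantages[i] for i in idx)  # subtract the max for numerical stability
    exps = [math.exp((advantages[i] - m) / eta) for i in idx]
    z = sum(exps)
    return {i: e / z for i, e in zip(idx, exps)}

def policy_loss(log_probs, weights):
    """Weighted negative log-likelihood of the selected actions."""
    return -sum(w * log_probs[i] for i, w in weights.items())

def temperature_loss(advantages, eta, eps_eta):
    """eta * eps_eta + eta * log(mean(exp(A / eta))) over the selected samples."""
    idx = top_half(advantages)
    m = max(advantages[i] for i in idx)
    mean_exp = sum(math.exp((advantages[i] - m) / eta) for i in idx) / len(idx)
    return eta * eps_eta + m + eta * math.log(mean_exp)

# Toy batch: four samples with hypothetical advantages and a uniform two-action policy.
advantages = [1.0, -0.5, 0.2, -1.0]
log_probs = [math.log(0.5)] * 4
weights = softmax_weights(advantages, eta=1.0)
loss = policy_loss(log_probs, weights)
temp = temperature_loss(advantages, eta=1.0, eps_eta=0.01)
```

In this toy batch only samples 0 and 2 receive weight, and sample 0 receives the larger share because its advantage is higher. The weights come from the value-based advantages alone, so the sketch needs neither an importance ratio nor an entropy bonus, which is consistent with what the summary says the method avoids.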

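Experimental Results

Research Questions

RQ1: How does V-MPO perform on discrete and continuous control benchmarks relative to prior on-policy methods?
RQ2: Can V-MPO achieve strong performance without entropy regularization, importance weighting, or population-based hyperparameter tuning?
RQ3: How well does V-MPO scale to high-dimensional action spaces and pixel-based observations?
RQ4: What are the empirical gains in the multi-task and single-task settings on Atari-57, DMLab-30, and OpenAI Gym tasks?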

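Key Findings

In the multi-task setting, V-MPO surpasses previously reported scores on both Atari-57 and DMLab-30.
It achieves these results without importance weighting, entropy regularization, or population-based hyperparameter tuning.
On individual DMLab and Atari levels, scores are substantially higher than previously reported.
V-MPO applies to high-dimensional continuous action spaces, as demonstrated on simulated humanoid control with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations.
On example OpenAI Gym tasks, asymptotic scores are substantially higher than previously reported.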

This review was generated by AI and reviewed by human editors.