QUICK REVIEW

[Paper Review] Mirror Descent Policy Optimization

Manan Tomar, Lior Shani|arXiv (Cornell University)|May 20, 2020

Reinforcement Learning in Robotics|References 37|Cited by 24

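One-Sentence Summary

This paper proposes Mirror Descent Policy Optimization (MDPO), a unified reinforcement learning algorithm derived from mirror descent principles that approximates trust-region policy updates by taking multiple gradient steps. On continuous control tasks, MDPO performs on par with or better than TRPO, PPO, and SAC, indicating that an explicit trust-region constraint is not necessary for high performance.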

ABSTRACT

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive on-policy and off-policy variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on-par with TRPO, PPO, and SAC in a number of continuous control tasks. Code is available at https://github.com/manantomar/Mirror-Descent-Policy-Optimization.
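
The two-term objective described in the abstract can be written out as follows. This is a sketch of the on-policy, KL-based case in notation assumed here rather than taken from the text above: $\pi_\theta$ is the parameterized policy, $\theta_k$ the parameters at iteration $k$, $\rho_{\theta_k}$ the state distribution under the current policy, $A^{\theta_k}$ its advantage function, and $t_k$ a step size.

```latex
\theta_{k+1} \leftarrow \arg\max_{\theta} \;
  \mathbb{E}_{s \sim \rho_{\theta_k}} \Big[
    \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \big[ A^{\theta_k}(s, a) \big]
    - \frac{1}{t_k} \, \mathrm{KL}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \big)
  \Big]
```

The first term is the linearized RL objective and the second is the proximity term that keeps consecutive policies close. MDPO approximates the arg max with several gradient steps on $\theta$ rather than solving it exactly.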

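Motivation and Objectives

Bridge the gap between theoretically grounded trust-region RL algorithms and practical deep RL methods such as TRPO and PPO.
Develop a scalable, practical policy optimization algorithm for continuous control based on mirror descent (MD) principles.
Show that an explicit trust-region constraint is not necessary for high performance, by deriving a method that solves an unconstrained problem with gradient steps.
Unify existing algorithms (TRPO, PPO, and SAC) under a single MD-based framework and expose the connections among them.
Empirically verify on MuJoCo benchmark environments that MDPO performs better than or on par with state-of-the-art algorithms.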

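Proposed Method

MDPO casts each policy update as a trust-region subproblem consisting of a linearized RL objective and a proximity term based on a Bregman divergence (such as the KL or Tsallis divergence).
Rather than solving the trust-region problem exactly, it approximates the solution by taking multiple gradient steps on the objective function.
On-policy MDPO uses the previous policy as the reference in the proximity term and is related to TRPO and PPO through the choice of divergence and update mechanism.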
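Off-policy MDPO keeps the previous policy as the reference in the proximity term; swapping in the uniform policy as the reference and modifying the divergence and update rule yields SAC.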
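The method supports both KL and Tsallis divergences; the latter introduces a tunable hyperparameter $ q \in [1.0, 2.0] $ that can improve performance.
The method is implemented in on-policy and off-policy variants, and the code is publicly available to support reproducibility and comparison.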
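
The update described above (a linearized objective plus a KL proximity term, optimized with a few gradient steps) can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a single state with three actions, a softmax policy over logits, and fixed advantage values. The names `mdpo_update`, `mdpo_objective`, `step_size`, and `lr` are hypothetical and chosen only for this example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def mdpo_objective(pi, pi_old, adv, step_size):
    # Linearized objective minus (1 / step_size) * KL(pi || pi_old).
    kl = np.sum(pi * (np.log(pi) - np.log(pi_old)))
    return float(np.dot(pi, adv) - kl / step_size)

def mdpo_update(pi_old, adv, step_size, num_grad_steps=10, lr=0.5):
    # Start from the old policy and take several gradient-ascent steps
    # on the objective with respect to the softmax logits.
    logits = np.log(pi_old)
    for _ in range(num_grad_steps):
        pi = softmax(logits)
        # Per-action "score": advantage minus the scaled log-ratio.
        g = adv - (np.log(pi) - np.log(pi_old)) / step_size
        # Gradient of the objective through the softmax parameterization.
        grad = pi * (g - np.dot(pi, g))
        logits = logits + lr * grad
    return softmax(logits)

# Toy inputs (assumed values, for illustration only).
pi_old = np.array([1 / 3, 1 / 3, 1 / 3])
adv = np.array([1.0, 0.0, -1.0])

# Approximate solution after a handful of gradient steps.
pi_new = mdpo_update(pi_old, adv, step_size=1.0, num_grad_steps=10)

# In this tabular case the exact maximizer is known in closed form:
# pi_old * exp(step_size * adv), renormalized.
pi_exact = pi_old * np.exp(1.0 * adv)
pi_exact = pi_exact / pi_exact.sum()

print("approximate:", pi_new)
print("exact:      ", pi_exact)
```

In this toy setting the exact solution is available, so the sketch only shows how the multi-step gradient update moves toward it. With neural-network policies no closed form exists, which is why the method relies on gradient steps.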

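Experimental Results

Research Questions

RQ1: Can mirror descent principles yield a practical, scalable RL algorithm that unifies TRPO, PPO, and SAC?
RQ2: Can deep RL achieve high performance without explicitly enforcing a trust-region constraint, as TRPO does?
RQ3: How do MDPO's design choices, such as multiple gradient steps per update and the choice of divergence, affect performance relative to state-of-the-art algorithms?
RQ4: Can off-policy MDPO with Tsallis entropy outperform SAC, and what role does the hyperparameter $ q $ play?
RQ5: What explains the performance differences among TRPO, PPO, and SAC under vanilla and optimized implementations?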

Key Findings

On several continuous control tasks from the MuJoCo benchmark suite, MDPO performs better than or on par with TRPO, PPO, and SAC.
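TRPO consistently outperforms PPO under both vanilla and optimized configurations, challenging the common belief that PPO is superior.
MDPO achieves strong performance without an explicit trust-region constraint because it relies on a gradient-based approximation of the trust-region objective.
Off-policy MDPO with Tsallis entropy ($ q \in [1.0, 2.0] $) outperforms SAC on all tasks, with the best value of $ q $ varying by environment.
The off-policy variant of MDPO outperforms the on-policy variant in both sample efficiency and final performance, consistent with the general advantages of off-policy learning.
By modifying the divergence and the reference policy, SAC can be viewed as a special case of off-policy MDPO, which gives a new optimization perspective on SAC.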
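
To make the role of the hyperparameter $ q $ in the Tsallis findings above more concrete, here is a small sketch of Tsallis entropy for a discrete distribution. It assumes one common convention for the q-logarithm, `log_q(x) = (x**(q-1) - 1) / (q-1)`, which reduces to the natural logarithm at `q = 1`; the review does not state which convention is used, so treat this as an assumption. The function names are hypothetical.

```python
import numpy as np

def log_q(x, q):
    # q-logarithm under the convention assumed here; equals log(x) at q = 1.
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)

def tsallis_entropy(probs, q):
    # Tsallis entropy as the negative expected q-logarithm of the
    # probabilities. Assumes all probabilities are strictly positive.
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * log_q(probs, q)))

# Uniform distribution over four actions (illustrative values).
uniform = np.full(4, 0.25)
for q in (1.0, 1.5, 2.0):
    print(q, tsallis_entropy(uniform, q))
```

Under this convention, `q = 1` gives the Shannon entropy, and larger `q` changes how strongly low-probability actions contribute to the regularizer.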

This review was generated by AI and reviewed by human editors.