QUICK REVIEW

[Paper Review] Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

Cathy Wu, Aravind Rajeswaran|arXiv (Cornell University)|Mar 20, 2018

Reinforcement Learning in Robotics | References: 15 | Cited by: 72

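One-Sentence Summary

The paper introduces an unbiased action-dependent baseline for policy gradient methods that exploits factorized policies to reduce variance. It demonstrates the variance reduction both theoretically and empirically, scales to high-dimensional action spaces, and extends to partially observable (POMDP) and multi-agent settings.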

ABSTRACT

Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exasperated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis as well as numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. The result is a computationally efficient policy gradient algorithm, which scales to high-dimensional control problems, as demonstrated by a synthetic 2000-dimensional target matching task. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. Finally, we show that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.

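Research Motivation and Objectives

Address the high variance of policy gradient estimates, particularly in problems with long horizons or high-dimensional action spaces.
Develop an unbiased, action-dependent baseline that exploits the factorization of the policy to improve variance reduction.
Provide a theoretical analysis that characterizes the optimal action-dependent baseline and the suboptimality of state-only baselines.
Propose practical baselines and an algorithm that scales to high-dimensional control tasks.
Demonstrate applicability to partially observed and multi-agent settings.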

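Proposed Method

Derive an unbiased action-dependent baseline for factorized policy distributions, in which the action components are conditionally independent given the state.
Show how to compute a baseline b_i(s_t, a_t^{-i}) for each action factor i that reduces variance without introducing bias.
Derive the optimal action-dependent baseline b_i^*(s_t, a_t^{-i}) under the conditional independence assumption.
Compare action-dependent baselines with state-only baselines and analyze the suboptimality of state-dependent baselines.
Propose practical baselines (marginalized Q, Monte Carlo estimates, mean-action baseline) and integrate them into the policy gradient update.
Present an algorithm for fully factorized policies and discuss extensions to general policies and to multi-agent and partially observable settings.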
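The reason a baseline b_i(s_t, a_t^{-i}) can depend on the other action components without biasing the gradient can be sketched in two lines. The notation below is an assumption made for illustration and is not taken from this summary: m is the number of action factors, \eta is the expected return, and \hat{Q} is a return estimate. Only b_i(s_t, a_t^{-i}) and the conditional independence of the factors come from the text above.

```latex
% Assumed factorized policy: the action components are conditionally
% independent given the state.
\pi_\theta(a_t \mid s_t) = \prod_{i=1}^{m} \pi_\theta(a_t^i \mid s_t)

% Policy gradient with one baseline per action factor i.
\nabla_\theta \eta(\pi_\theta)
  = \mathbb{E}\left[ \sum_{i=1}^{m}
      \nabla_\theta \log \pi_\theta(a_t^i \mid s_t)
      \left( \hat{Q}(s_t, a_t) - b_i(s_t, a_t^{-i}) \right) \right]

% Unbiasedness: b_i does not depend on a_t^i, so it factors out of the
% inner expectation over a_t^i, and the remaining score term integrates to zero.
\mathbb{E}_{a_t^i}\left[
    \nabla_\theta \log \pi_\theta(a_t^i \mid s_t)\, b_i(s_t, a_t^{-i}) \right]
  = b_i(s_t, a_t^{-i})\, \nabla_\theta \int \pi_\theta(a_t^i \mid s_t)\, da_t^i
  = b_i(s_t, a_t^{-i})\, \nabla_\theta 1
  = 0
```

The last step is where conditional independence matters: if a_t^i depended on a_t^{-i} given s_t, the baseline could not be pulled outside the expectation over a_t^i in this way.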

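Experimental Results

Research Questions

RQ1: Do action-dependent, factorized baselines consistently outperform state-only baselines at reducing the variance of policy gradient estimates?
RQ2: When the action factors are conditionally independent, what form does the optimal action-dependent baseline take, and what benefit does it provide?
RQ3: How do practical baselines (marginalized Q, mean-action, Monte Carlo estimates) perform in high-dimensional action spaces?
RQ4: Can action-dependent baselines be extended to partially observable and multi-agent reinforcement learning settings?
RQ5: How does the variance reduction compare with conventional baselines on standard benchmarks and high-dimensional tasks?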

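Key Findings

Action-dependent baselines consistently outperform state-only baselines on continuous control tasks and in high-dimensional settings, improving policy gradient performance.
The optimal action-dependent baseline b_i^*(s_t, a_t^{-i}) differs for each action coordinate and achieves variance reduction without bias, without collapsing to a state-only baseline.
The variance reduction from action-dependent baselines grows with the action dimension, as demonstrated on a synthetic high-dimensional target matching task.
Practical baselines (marginalized Q, mean-action) provide scalable variance reduction at moderate computational cost.
Extensions to partially observed and multi-agent tasks show that incorporating additional information into the baseline can accelerate learning.
Empirical results show faster learning on high-dimensional hand manipulation and multi-agent communication tasks.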
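The effect of action dimension on variance can be illustrated with a toy experiment. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: a single-state problem, a factorized Gaussian policy with unit standard deviation, and a return Q(a) = -||a - target||^2 chosen because its marginal expectations are available in closed form. All names (`gradient_samples`, `compare`, `SIGMA`) are hypothetical, and the "mean_action" branch reflects one plausible reading of the mean-action baseline (evaluating Q with a_i replaced by the policy mean).

```python
import numpy as np

SIGMA = 1.0  # fixed policy standard deviation (illustrative assumption)


def gradient_samples(mu, target, n_samples, baseline, rng):
    """Return per-sample gradient estimates w.r.t. mu, shape (n_samples, m)."""
    actions = mu + SIGMA * rng.standard_normal((n_samples, mu.size))
    score = (actions - mu) / SIGMA**2            # d/d mu_i of log pi_i(a_i)
    sq_err = (actions - target) ** 2             # per-dimension squared error
    q = -sq_err.sum(axis=1, keepdims=True)       # Q(a) = -||a - target||^2
    expected_sq_err = (mu - target) ** 2 + SIGMA**2  # E[(a_i - target_i)^2]

    if baseline == "state":
        # Action-independent baseline: b = E_a[Q(a)], one scalar for all factors.
        b = -expected_sq_err.sum()
    elif baseline == "marginalized_q":
        # b_i(a^{-i}) = E_{a_i}[Q(a)]: keep the sampled error of the other
        # dimensions and replace dimension i's error by its expectation.
        b = q + sq_err - expected_sq_err
    elif baseline == "mean_action":
        # Assumed form: Q(a) with a_i replaced by the policy mean mu_i.
        b = q + sq_err - (mu - target) ** 2
    else:
        raise ValueError(f"unknown baseline: {baseline}")
    return score * (q - b)


def total_variance(samples):
    """Sum over coordinates of the per-coordinate sample variance."""
    return float(samples.var(axis=0).sum())


def compare(m, n_samples=100_000, seed=0):
    """Mean estimate and total variance for each baseline at action dimension m."""
    mu, target = np.zeros(m), np.ones(m)
    results = {}
    for name in ("state", "marginalized_q", "mean_action"):
        rng = np.random.default_rng(seed)  # same action samples for every baseline
        g = gradient_samples(mu, target, n_samples, name, rng)
        results[name] = (g.mean(axis=0), total_variance(g))
    return results


if __name__ == "__main__":
    for m in (5, 20):
        res = compare(m)
        print(m, {name: round(var, 1) for name, (_, var) in res.items()})
```

In this toy problem all three estimators should agree on the mean gradient, while the state-only baseline leaves the sampling noise of every other action dimension inside each coordinate's estimate. The two action-dependent variants remove that noise, so the gap is expected to widen as m grows.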

This review was generated by AI and reviewed by a human editor.