QUICK REVIEW

[Paper Review] AlgaeDICE: Policy Gradient from Arbitrary Experience

Ofir Nachum, Bo Dai | arXiv (Cornell University) | Dec 4, 2019
Reinforcement Learning in Robotics | References: 52 | Cited by: 82
One-Sentence Summary

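AlgaeDICE proposes an off-policy policy gradient method that uses density regularization and a dual function to recover the on-policy gradient from arbitrary off-policy data, without importance sampling.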

ABSTRACT

In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.

Research Motivation and Objectives

  • Motivate learning from costly or limited environment interactions by enabling off-policy policy optimization.
  • Reformulate max-return optimization as an off-policy problem using density regularization over state-action occupancies.
  • Derive a saddle-point objective linking a policy (actor) with a dual function (critic) that can be optimized from arbitrary data.
  • Show that optimizing the dual yields the on-policy policy gradient with a regularized reward.
  • Provide theoretical guarantees and empirical validation of the approach.

Proposed Method

  • Start from the dual formulation of the max-return objective expressed in terms of normalized state-action occupancies.
  • Introduce a regularizer using an f-divergence between the on-policy and off-policy occupancies to enable off-policy data usage.
  • Apply a change of variables to obtain a purely off-policy objective J_{D,f}(π,ν) that is optimized over the policy and a dual function ν.
  • Use a variational form of the f-divergence and a dual embedding to handle the double-sampling problem.
  • Demonstrate that, when the dual ν is optimized, the gradient w.r.t. policy parameters matches the on-policy policy gradient with a modified reward \tilde{r}(s,a) = r(s,a) - α f'(w_{π/D}(s,a)), where w_{π/D} denotes the ratio of the on-policy occupancy to the off-policy data occupancy.
  • Discuss the Lagrangian/LP perspective that yields a single unified objective for policy and value learning and enables behavior-agnostic off-policy optimization.
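
The steps above can be written out as a short chain of objectives. The block below is a reconstruction from the description in this review and from our reading of the paper, not a verbatim quote. The symbols \mu_0 (initial-state distribution), T (transition kernel), \mathcal{B}^{\pi} (policy Bellman operator), f^* (convex conjugate of f), and \zeta (second dual function) do not appear in the summary above and are assumed notation.

```latex
% 1. Regularized max-return objective over normalized occupancies
\max_{\pi}\;
  \mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[ r(s,a) \right]
  - \alpha\, D_f\!\left( d^{\pi} \,\|\, d^{\mathcal{D}} \right)

% 2. Policy Bellman operator used in the change of variables
(\mathcal{B}^{\pi}\nu)(s,a)
  = r(s,a) + \gamma\, \mathbb{E}_{s'\sim T(s,a),\, a'\sim\pi(s')}\!\left[ \nu(s',a') \right]

% 3. Off-policy objective after the variational form of D_f
%    and the change of variables
\max_{\pi}\,\min_{\nu}\; J_{D,f}(\pi,\nu)
  = (1-\gamma)\, \mathbb{E}_{s_0\sim\mu_0,\, a_0\sim\pi(s_0)}\!\left[ \nu(s_0,a_0) \right]
  + \alpha\, \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[
      f^{*}\!\left( \frac{(\mathcal{B}^{\pi}\nu - \nu)(s,a)}{\alpha} \right) \right]

% 4. Dual embedding f^*(y) = \max_{\zeta} \zeta y - f(\zeta),
%    which moves the Bellman residual outside the nonlinearity
\max_{\pi}\,\min_{\nu}\,\max_{\zeta}\;
  (1-\gamma)\, \mathbb{E}_{s_0\sim\mu_0,\, a_0\sim\pi(s_0)}\!\left[ \nu(s_0,a_0) \right]
  + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[
      (\mathcal{B}^{\pi}\nu - \nu)(s,a)\,\zeta(s,a) - \alpha\, f\!\left( \zeta(s,a) \right) \right]
```

In step 4 the Bellman residual enters linearly, so a single sampled next state gives an unbiased estimate of the inner expectation. This is how the dual embedding sidesteps the double-sampling problem mentioned above.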

Experimental Results

Research Questions

  • RQ1: Can we express max-return optimization as an off-policy problem without importance weighting?
  • RQ2: Does dual optimization yield the on-policy policy gradient when learning from arbitrary off-policy data?
  • RQ3: How can f-divergence regularization and dual embeddings enable stable off-policy policy optimization?
  • RQ4: What are the theoretical guarantees and practical implications for policy learning using AlgaeDICE under off-policy data?

Key Findings

  • The off-policy objective with a dual function reproduces the on-policy policy gradient when the dual is optimized.
  • The regularized dual formulation yields a unified objective that trains policy and critic from off-policy data without importance weights.
  • Choosing a quadratic f leads to an actor-critic-like objective, but with a principled behavior-agnostic off-policy basis.
  • The Lagrangian/LP view provides strong duality and allows recovering a Fenchel AlgaeDICE objective that matches the regularized max-return objective.
  • Empirical results show AlgaeDICE can perform well on benchmark tasks, including offline Four Rooms and continuous control suites.
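
To make the quadratic-f finding concrete, here is a minimal sketch in plain Python of how the objective could be estimated on a batch of transitions. It assumes f(x) = x²/2 (so f* is also quadratic) and assumes that the values of ν at initial, current, and next state-action pairs have already been computed. All function and argument names are hypothetical and are not taken from the authors' code.

```python
from statistics import fmean


def bellman_residuals(nu, nu_next, rewards, gamma):
    """Single-sample estimates of (B^pi nu - nu)(s, a).

    nu[i]      : nu(s_i, a_i) for a transition drawn from the dataset
    nu_next[i] : nu(s'_i, a'_i) with a'_i sampled from the current policy
    """
    return [r + gamma * vn - v for v, vn, r in zip(nu, nu_next, rewards)]


def quadratic_objective(nu0, nu, nu_next, rewards, gamma, alpha):
    """(1 - gamma) * E[nu(s0, a0)] + E_D[residual^2] / (2 * alpha).

    The first term resembles an actor objective and the second a squared
    Bellman error critic loss. Squaring a single-sample residual gives a
    biased estimate when transitions are stochastic (double sampling).
    """
    deltas = bellman_residuals(nu, nu_next, rewards, gamma)
    critic_term = fmean([d * d for d in deltas]) / (2.0 * alpha)
    return (1.0 - gamma) * fmean(nu0) + critic_term


def fenchel_objective(nu0, nu, nu_next, rewards, zeta, gamma, alpha):
    """Dual-embedded form: the residual enters linearly, weighted by zeta."""
    deltas = bellman_residuals(nu, nu_next, rewards, gamma)
    inner = [d * z - alpha * 0.5 * z * z for d, z in zip(deltas, zeta)]
    return (1.0 - gamma) * fmean(nu0) + fmean(inner)


if __name__ == "__main__":
    # Tiny hand-made batch, purely illustrative.
    nu0, nu, nu_next, rewards = [1.0, 3.0], [1.0, 0.0], [2.0, 2.0], [1.0, 0.0]
    gamma, alpha = 0.5, 2.0
    deltas = bellman_residuals(nu, nu_next, rewards, gamma)
    best_zeta = [d / alpha for d in deltas]  # maximizer of the inner problem
    print(quadratic_objective(nu0, nu, nu_next, rewards, gamma, alpha))
    print(fenchel_objective(nu0, nu, nu_next, rewards, best_zeta, gamma, alpha))
```

For quadratic f the inner maximization over ζ has the closed form ζ = residual / α, so the two functions agree at that point. In practice ν, ζ, and the policy would each be parameterized (for example by neural networks) and updated by stochastic gradient steps; that training loop is outside the scope of this sketch.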

This review was generated by AI and reviewed by human editors.