QUICK REVIEW

[Paper Review] AlgaeDICE: Policy Gradient from Arbitrary Experience

Ofir Nachum, Bo Dai | arXiv (Cornell University) | Dec 4, 2019

Reinforcement Learning in Robotics | References: 52 | Cited by: 82

ONE-SENTENCE SUMMARY

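AlgaeDICE proposes an off-policy policy gradient method that uses density regularization and a dual function to recover the on-policy gradient from arbitrary off-policy data, without importance sampling.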

ABSTRACT

In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.

MOTIVATION AND OBJECTIVES

Enable policy learning when environment interactions are costly or limited by making policy optimization fully off-policy.
Reformulate max-return optimization as an off-policy problem using density regularization over state-action occupancies.
Derive a saddle-point objective linking a policy (actor) with a dual function (critic) that can be optimized from arbitrary data.
Show that optimizing the dual yields the on-policy policy gradient with a regularized reward.
Provide theoretical guarantees and empirical validation of the approach.

PROPOSED METHOD

Start from the dual formulation of the max-return objective expressed in terms of normalized state-action occupancies.
Introduce a regularizer using an f-divergence between the on-policy and off-policy occupancies to enable off-policy data usage.
Apply a change of variables to obtain a purely off-policy objective J_{D,f}(π,ν) that is optimized over the policy and a dual function ν.
Use a variational form of the f-divergence and a dual embedding to handle the double-sampling problem.
Demonstrate that, when the dual ν is optimized, the gradient w.r.t. the policy parameters matches the on-policy policy gradient with a modified reward \tilde{r}(s,a) = r(s,a) - α f'(w_{π/D}(s,a)), where w_{π/D} is the density ratio between the on-policy and off-policy occupancies.
Discuss the Lagrangian/LP perspective that yields a single unified objective for policy and value learning and enables behavior-agnostic off-policy optimization.
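
The steps above can be written out compactly. The block below is a sketch rather than a quotation from the paper: it assumes normalized occupancies d^π and d^D, a discount γ, an initial state distribution μ_0, a regularization weight α, the convex conjugate f^* of f, and a policy transition operator P^π. Symbols that do not appear in this summary (μ_0, γ, f^*, P^π, ν^*) are assumptions chosen to agree with the modified reward given in the list.

```latex
\begin{align*}
% Regularized max-return objective over occupancies
J_{\mathcal{D},f}(\pi)
  &:= \mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
     - \alpha\, D_f\big(d^{\pi}\,\|\,d^{\mathcal{D}}\big), \\
% Off-policy saddle-point form after the change of variables
J_{\mathcal{D},f}(\pi,\nu)
  &:= (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\;a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big]
     + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[
         f^{*}\!\left(\frac{r(s,a)+\gamma\,\mathcal{P}^{\pi}\nu(s,a)-\nu(s,a)}{\alpha}\right)\right], \\
J_{\mathcal{D},f}(\pi) &= \min_{\nu}\; J_{\mathcal{D},f}(\pi,\nu), \\
% Optimality condition for the dual, with w_{\pi/\mathcal{D}} = d^{\pi}/d^{\mathcal{D}}
r(s,a)+\gamma\,\mathcal{P}^{\pi}\nu^{*}(s,a)-\nu^{*}(s,a)
  &= \alpha\, f'\big(w_{\pi/\mathcal{D}}(s,a)\big), \\
\mathcal{P}^{\pi}\nu(s,a)
  &:= \mathbb{E}_{s'\sim T(s,a),\;a'\sim\pi(s')}\big[\nu(s',a')\big].
\end{align*}
```

Rearranging the optimality condition gives ν^* = \tilde{r} + γ P^π ν^*, so under these assumptions the optimal dual is the Q-function of π for the modified reward \tilde{r}(s,a) = r(s,a) - α f'(w_{π/D}(s,a)). The first expectation only needs initial states and the second only needs samples from d^D, which is why no importance weights appear.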

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can we express max-return optimization as an off-policy problem without importance weighting?
RQ2: Does dual optimization yield the on-policy policy gradient when learning from arbitrary off-policy data?
RQ3: How can f-divergence regularization and dual embeddings enable stable off-policy policy optimization?
RQ4: What are the theoretical guarantees and practical implications for policy learning using AlgaeDICE under off-policy data?

KEY FINDINGS

The off-policy objective with a dual function reproduces the on-policy policy gradient when the dual is optimized.
The regularized dual formulation yields a unified objective that trains policy and critic from off-policy data without importance weights.
Choosing a quadratic f leads to an actor-critic-like objective, but with a principled behavior-agnostic off-policy basis.
The Lagrangian/LP view provides strong duality and allows recovering a Fenchel AlgaeDICE objective that matches the regularized max-return objective.
Empirical results show AlgaeDICE can perform well on benchmark tasks, including offline Four Rooms and continuous control suites.
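
The first three findings can be checked numerically on a small tabular problem. The NumPy sketch below is illustrative only and is not the authors' implementation. It assumes the saddle-point objective has the form (1-γ) E[ν(s_0,a_0)] + α E_{d^D}[f^*((r + γ P^π ν - ν)/α)] with the quadratic choice f(x) = x²/2, so that f^*(y) = y²/2 and f'(w) = w. The random MDP, the softmax policy, the data distribution `d_D`, and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha = 4, 2, 0.9, 1.0
n = S * A                                    # state-action pairs, flat index s * A + a

# Hypothetical random MDP and data distribution (illustrative assumptions).
T = rng.dirichlet(np.ones(S), size=n)        # T[s*A+a, s'] = P(s' | s, a)
r = rng.uniform(size=n)                      # rewards r(s, a)
mu0 = rng.dirichlet(np.ones(S))              # initial state distribution
d_D = 0.5 * rng.dirichlet(np.ones(n)) + 0.5 / n   # arbitrary d^D with full support
I = np.eye(n)


def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]


def solve_dual(theta):
    """Minimize the quadratic-f objective over nu in closed form."""
    pi = softmax_policy(theta)
    P = (T[:, :, None] * pi[None, :, :]).reshape(n, n)  # operator P^pi
    m = (mu0[:, None] * pi).reshape(n)                   # initial (s, a) distribution
    B = gamma * P - I                                    # residual is r + B @ nu
    BtD = B.T * d_D                                      # B^T diag(d_D)
    # Stationarity in nu: alpha (1 - gamma) m + B^T D (r + B nu) = 0
    nu = np.linalg.solve(BtD @ B, -alpha * (1 - gamma) * m - BtD @ r)
    delta = r + B @ nu
    J = (1 - gamma) * m @ nu + d_D @ delta**2 / (2 * alpha)
    return pi, P, m, delta, J


theta = rng.normal(size=(S, A))
pi, P, m, delta, J = solve_dual(theta)

# Quantities computed independently from the on-policy side.
d_pi = np.linalg.solve(I - gamma * P.T, (1 - gamma) * m)  # normalized occupancy d^pi
w = d_pi / d_D                                            # density ratio w_{pi/D}
w_from_dual = delta / alpha                               # uses f'(w) = w
primal = d_pi @ r - alpha * d_D @ (0.5 * w**2)            # E_{d^pi}[r] - alpha E_D[f(w)]

# On-policy policy gradient with the modified reward r~ = r - alpha f'(w).
r_tilde = r - alpha * w
Q_tilde = np.linalg.solve(I - gamma * P, r_tilde)
dQ = (d_pi * Q_tilde).reshape(S, A)
pg = dQ - pi * dQ.sum(axis=1, keepdims=True)              # softmax score function applied

# Central finite differences of min_nu J(pi_theta, nu) with respect to theta.
eps = 1e-5
fd = np.zeros_like(theta)
for s in range(S):
    for a in range(A):
        e = np.zeros_like(theta)
        e[s, a] = eps
        fd[s, a] = (solve_dual(theta + e)[-1] - solve_dual(theta - e)[-1]) / (2 * eps)

print("max |w_dual - w|   :", np.abs(w_from_dual - w).max())
print("|J_min - primal|   :", abs(J - primal))
print("max |fd - pg|      :", np.abs(fd - pg).max())
```

In this sketch the dual is solved using only `d_D`, the initial distribution, and the policy, yet its residual recovers the density ratio and its gradient agrees with the on-policy policy gradient for the modified reward. With the quadratic f, the second term of the objective is a squared Bellman residual under `d_D`, which is the actor-critic-like structure mentioned in the findings.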

This review was generated by AI and reviewed by human editors.