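[Paper Review] AlgaeDICE: Policy Gradient from Arbitrary Experience
AlgaeDICE proposes an off-policy policy gradient method that uses density regularization and a dual function to recover the on-policy gradient from arbitrary off-policy data, without importance sampling.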
In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.
Research Motivation and Goals
- Motivate learning from costly or limited environment interactions by enabling off-policy policy optimization.
- Reformulate max-return optimization as an off-policy problem using density regularization over state-action occupancies.
- Derive a saddle-point objective linking a policy (actor) with a dual function (critic) that can be optimized from arbitrary data.
- Show that optimizing the dual yields the on-policy policy gradient with a regularized reward.
- Provide theoretical guarantees and empirical validation of the approach.
Proposed Method
- Start from the dual formulation of the max-return objective expressed in terms of normalized state-action occupancies.
- Introduce a regularizer using an f-divergence between the on-policy and off-policy occupancies to enable off-policy data usage.
- Apply a change of variables to obtain a purely off-policy objective J_{D,f}(π,ν) that is optimized over the policy and a dual function ν.
- Use a variational form of the f-divergence and a dual embedding to handle the double-sampling problem.
- Demonstrate that, when the dual ν is optimized, the gradient w.r.t. the policy parameters matches the on-policy policy gradient computed with a modified reward \tilde{r}(s,a) = r(s,a) - α f'(w_{π/D}(s,a)), where w_{π/D} is the density ratio between the on-policy and off-policy occupancies.
- Discuss the Lagrangian/LP perspective that yields a single unified objective for policy and value learning and enables behavior-agnostic off-policy optimization.
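The steps above can be written out as a short derivation sketch. The symbols d^π and d^D (normalized on-policy and off-policy occupancies), μ_0 (initial-state distribution), P^π (policy transition operator), and f^* (convex conjugate of f) are not defined in this summary; they are assumed here to follow the usual DICE-style notation, so treat the block as a reconstruction rather than a verbatim quote of the paper.

```latex
% Regularized max-return objective (on-policy return minus an f-divergence penalty)
J_{D,f}(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
             - \alpha\, D_f\big(d^{\pi} \,\|\, d^{D}\big)

% Off-policy saddle-point form after the change of variables
\max_{\pi}\,\min_{\nu}\; J_{D,f}(\pi,\nu)
  = (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\; a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big]
  + \alpha\,\mathbb{E}_{(s,a)\sim d^{D}}\!\left[
      f^{*}\!\left(\frac{r(s,a) + \gamma\,P^{\pi}\nu(s,a) - \nu(s,a)}{\alpha}\right)
    \right]

% Policy transition operator (assumed definition)
P^{\pi}\nu(s,a) = \mathbb{E}_{s'\sim T(s,a),\; a'\sim\pi(s')}\big[\nu(s',a')\big]

% Quadratic case: f(x) = x^2/2 gives f^*(y) = y^2/2, so the second term becomes
\frac{1}{2\alpha}\,\mathbb{E}_{(s,a)\sim d^{D}}\Big[
  \big(r(s,a) + \gamma\,P^{\pi}\nu(s,a) - \nu(s,a)\big)^{2}\Big]
```

Only the first term depends on samples from the policy at initial states; the second is an expectation over the off-policy data alone, which is why no importance weights appear.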
Experimental Results
Research Questions
- RQ1: Can we express max-return optimization as an off-policy problem without importance weighting?
- RQ2: Does dual optimization yield the on-policy policy gradient when learning from arbitrary off-policy data?
- RQ3: How can f-divergence regularization and dual embeddings enable stable off-policy policy optimization?
- RQ4: What are the theoretical guarantees and practical implications for policy learning using AlgaeDICE under off-policy data?
Key Findings
- The off-policy objective with a dual function reproduces the on-policy policy gradient when the dual is optimized.
- The regularized dual formulation yields a unified objective that trains policy and critic from off-policy data without importance weights.
- Choosing a quadratic f leads to an actor-critic-like objective, but with a principled behavior-agnostic off-policy basis.
- The Lagrangian/LP view provides strong duality and allows recovering a Fenchel AlgaeDICE objective that matches the regularized max-return objective.
- Empirical results show AlgaeDICE can perform well on benchmark tasks, including offline Four Rooms and continuous control suites.
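To make the quadratic-f finding above concrete, here is a minimal tabular sketch of how the off-policy objective could be estimated from a batch of transitions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based tables, and the toy data are all hypothetical, f is taken to be f(x) = x²/2 (so its conjugate is f*(y) = y²/2), and the single sampled next state is used directly inside the square, which is only unbiased for deterministic transitions.

```python
def f_star_quadratic(y):
    """Convex conjugate of f(x) = x^2 / 2, which is f*(y) = y^2 / 2."""
    return 0.5 * y * y


def algaedice_objective(nu, policy, init_states, transitions, gamma, alpha):
    """Sample estimate of the off-policy objective for quadratic f.

    nu:          dict mapping (state, action) -> dual value (hypothetical table)
    policy:      dict mapping state -> {action: probability}
    init_states: list of sampled initial states
    transitions: list of (s, a, r, s_next) tuples from arbitrary data
    """
    def expected_nu(s):
        # Expectation of nu(s, a) over actions drawn from the current policy.
        return sum(p * nu[(s, a)] for a, p in policy[s].items())

    # First term: (1 - gamma) * E[nu(s0, a0)] with a0 drawn from the policy.
    init_term = (1.0 - gamma) * sum(expected_nu(s0) for s0 in init_states)
    init_term /= len(init_states)

    # Second term: alpha * E_D[f*((r + gamma * P^pi nu - nu) / alpha)].
    # Assumption: deterministic transitions, so one s_next sample suffices.
    residual_term = 0.0
    for s, a, r, s_next in transitions:
        delta = r + gamma * expected_nu(s_next) - nu[(s, a)]
        residual_term += alpha * f_star_quadratic(delta / alpha)
    residual_term /= len(transitions)

    return init_term + residual_term


# Toy two-state, two-action example (all numbers are made up for illustration).
nu = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 4.0}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.25, 1: 0.75}}
transitions = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]

value = algaedice_objective(nu, policy, init_states=[0],
                            transitions=transitions, gamma=0.5, alpha=2.0)
print(value)
```

In an actor-critic loop one would minimize this quantity with respect to `nu` and maximize it with respect to the policy. For stochastic environments the squared residual suffers from the double-sampling problem mentioned in the method section, which this sketch does not address.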
This review was generated by AI and checked by a human editor.