QUICK REVIEW

[Paper Review] GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values

Shangtong Zhang, Bo Liu|arXiv (Cornell University)|Jan 29, 2020

Reinforcement Learning in Robotics | Cited by 35

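ONE-SENTENCE SUMMARY

GradientDICE is an off-policy density-ratio estimator for stationary values. It replaces GenDICE's divergence-based objective with a formulation based on the Perron-Frobenius theorem, which makes it provably convergent under linear function approximation, and it empirically outperforms GenDICE and DualDICE.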

ABSTRACT

We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced to ensure positivity, so any primal-dual algorithm is not guaranteed to converge or find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.

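MOTIVATION AND OBJECTIVES

Address the distribution mismatch in off-policy evaluation by learning the density ratio between the target policy's state distribution and the sampling (behavior) distribution.
Fix the theoretical and convergence problems in GenDICE that stem from its nonlinear parameterization and divergence-based objective.
Propose a new objective and algorithm that are provably convergent under linear function approximation.
Provide empirical evidence of an advantage over GenDICE and DualDICE on benchmark tasks.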
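
To make the role of the density ratio concrete, the following is a minimal sketch of reweighting with a known ratio. The three-element distributions and rewards are made-up numbers for illustration, not values from the paper, and the ratio is computed exactly here rather than learned.

```python
# Toy illustration of density-ratio reweighting (all numbers are hypothetical).
d_mu = [0.5, 0.3, 0.2]   # sampling distribution over three state-action pairs
d_pi = [0.2, 0.3, 0.5]   # target policy's distribution over the same pairs
r = [1.0, 0.0, 2.0]      # reward attached to each pair

# Density ratio tau(i) = d_pi(i) / d_mu(i); an estimator would have to learn this.
tau = [p / q for p, q in zip(d_pi, d_mu)]

# Expected reward under the target distribution (the quantity we want).
on_policy_value = sum(p * ri for p, ri in zip(d_pi, r))

# The same quantity written as an expectation under the sampling distribution.
reweighted_value = sum(q * t * ri for q, t, ri in zip(d_mu, tau, r))

# Ignoring the mismatch averages rewards under the wrong distribution.
naive_value = sum(q * ri for q, ri in zip(d_mu, r))

print(on_policy_value, reweighted_value, naive_value)
```

The reweighted expectation matches the on-policy value, while the naive average does not, which is why an accurate ratio estimate matters for off-policy evaluation.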

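PROPOSED METHOD

Replaces GenDICE's divergence-based objective with the quadratic objective L(τ) = 1/2 ||Tτ − Dτ||^2_{D^{-1}} + (λ/2)(d_μ^⊤ τ − 1)^2.
Uses the Perron-Frobenius theorem to remove the need for a nonlinear parameterization that enforces positivity.
Shows that, under linear function approximation, the resulting saddle-point problem is convex in the parameters of τ and concave in the maximization (dual) variables, which enables provable convergence.
Derives the GradientDICE updates for the dual variables κ and η and for the weights w of the linear parameterization τ_w = Xw (Eqs. (21)-(24) in the paper).
Provides a convergence analysis: under linear function approximation with ridge regularization, the iterates converge almost surely to the correct τ.
Discusses finite-sample guarantees via a projected variant (Projected GradientDICE) with iterate averaging.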
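
The link between the quadratic objective and the maximization variables can be sketched with a standard conjugate identity. This is a worked derivation under stated assumptions, not a reproduction of the paper's equations: D is taken to be a positive-definite diagonal matrix, λ > 0, Tτ is assumed affine in τ, and the dual vector f and the shorthands x and z are notation introduced here for illustration.

```latex
\begin{aligned}
\tfrac{1}{2}\lVert x \rVert_{D^{-1}}^{2}
  &= \max_{f}\; f^\top x - \tfrac{1}{2} f^\top D f,
  &&\text{maximizer } f = D^{-1}x,\quad x = T\tau - D\tau, \\
\tfrac{\lambda}{2} z^{2}
  &= \lambda \max_{\eta}\; \bigl(\eta z - \tfrac{1}{2}\eta^{2}\bigr),
  &&\text{maximizer } \eta = z,\quad z = d_\mu^\top \tau - 1, \\
L(\tau)
  &= \max_{f,\,\eta}\; f^\top (T\tau - D\tau) - \tfrac{1}{2} f^\top D f
     + \lambda \bigl(\eta\,(d_\mu^\top \tau - 1) - \tfrac{1}{2}\eta^{2}\bigr).
\end{aligned}
```

For fixed (f, η) the bracketed expression is affine in τ, hence convex, and for fixed τ it is a concave quadratic in (f, η). With τ_w = Xw, and presumably a linear parameterization of f by κ, the same structure carries over to (w, κ, η), which is consistent with the convex-concave claim above.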

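EXPERIMENTAL RESULTS

Research Questions

RQ1: Under linear function approximation, does GradientDICE provably converge to the true density ratio τ*?
RQ2: Does removing the divergence and the nonlinear parameterization resolve the instability and non-convergence observed for GenDICE in off-policy/offline settings?
RQ3: On stationary-value estimation benchmarks, how does GradientDICE compare with GenDICE and DualDICE across tabular, linear, and neural-network representations?
RQ4: What finite-sample guarantees does GradientDICE have, and how does projection affect performance and consistency?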

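Key Findings

GradientDICE provably converges to the true density ratio τ* under linear function approximation.
Dropping the divergence term and the positivity constraint removes the need for a nonlinear parameterization and thereby avoids the non-convexity of the objective.
With ridge regularization, convergence is also guaranteed for the γ = 1 case, and the resulting estimator is consistent under the proposed formulation.
A finite-sample analysis of the projected variant gives a high-probability error bound for the averaged iterates.
Empirically, GradientDICE outperforms GenDICE and DualDICE on density-ratio learning tasks in the tabular and linear settings; the code is publicly available for reproduction.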
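
As a small sanity check of the first finding (this is not one of the paper's experiments), the sketch below evaluates the quadratic objective L(τ) on a made-up three-element problem and confirms that the true density ratio drives it to zero. It assumes the operator has the form Tτ = (1 − γ) μ0 + γ P^⊤ D τ, which this review does not spell out; the transition matrix, distributions, γ, and λ are all hypothetical.

```python
# Toy check: the true density ratio zeroes a GradientDICE-style quadratic objective.
# Assumption (not stated in this review): T tau = (1 - gamma) * mu0 + gamma * P^T D tau.

gamma = 0.9
lam = 1.0
mu0 = [1.0, 0.0, 0.0]          # initial distribution (hypothetical)
d_mu = [0.5, 0.3, 0.2]         # sampling distribution (hypothetical)
P = [[0.1, 0.6, 0.3],          # P[i][j] = Pr(next = j | current = i) under the target policy
     [0.4, 0.2, 0.4],
     [0.3, 0.3, 0.4]]
n = len(mu0)


def apply_T(tau):
    """Return (1 - gamma) * mu0 + gamma * P^T (d_mu * tau)."""
    y = [d_mu[i] * tau[i] for i in range(n)]
    return [(1 - gamma) * mu0[j] + gamma * sum(P[i][j] * y[i] for i in range(n))
            for j in range(n)]


def objective(tau):
    """L(tau) = 1/2 ||T tau - D tau||^2_{D^{-1}} + lam/2 (d_mu^T tau - 1)^2."""
    t_tau = apply_T(tau)
    resid = [t_tau[i] - d_mu[i] * tau[i] for i in range(n)]
    norm_term = 0.5 * sum(resid[i] ** 2 / d_mu[i] for i in range(n))
    z = sum(d_mu[i] * tau[i] for i in range(n)) - 1.0
    return norm_term + 0.5 * lam * z ** 2


# Discounted stationary distribution of the target policy via fixed-point iteration
# of d = (1 - gamma) * mu0 + gamma * P^T d (a contraction for gamma < 1).
d_pi = list(mu0)
for _ in range(2000):
    d_pi = [(1 - gamma) * mu0[j] + gamma * sum(P[i][j] * d_pi[i] for i in range(n))
            for j in range(n)]

tau_star = [d_pi[i] / d_mu[i] for i in range(n)]   # true density ratio
tau_ones = [1.0] * n                               # "no correction" baseline

loss_at_truth = objective(tau_star)
loss_at_ones = objective(tau_ones)
print(loss_at_truth, loss_at_ones)
```

The objective vanishes (up to floating-point error) at the true ratio and is strictly positive at the all-ones guess, even though that guess already satisfies the normalization term. The residual term is therefore what separates the two.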


This review was generated by AI and reviewed by a human editor.