QUICK REVIEW

[Paper Review] DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Ofir Nachum, Yinlam Chow|arXiv (Cornell University)|Jun 10, 2019

Reinforcement Learning in Robotics|References: 44|Cited by: 42

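One-Sentence Summary

DualDICE is a behavior-agnostic method for estimating the discounted stationary distribution corrections used in off-policy evaluation. It avoids per-step importance weights, comes with theoretical guarantees, and improves empirically on prior methods.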

ABSTRACT

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

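Motivation and Objectives

Advance off-policy evaluation (OPE) when access to the environment is limited to a fixed dataset, possibly collected by multiple policies.
Define a bias-corrected value-estimation framework based on discounted stationary distribution ratios.
Develop an optimization-based method that estimates the distribution corrections without knowledge of the behavior policy and without importance weights.
Provide theoretical convergence guarantees and demonstrate empirical improvements over prior methods on several benchmarks.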
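
The role of the distribution ratios in value estimation can be written as a single identity. The block below is a sketch in assumed notation: ρ(π) for the normalized discounted value of the target policy π, β for the initial state distribution, r for the reward function, d^{π} and d^{𝒟} for the discounted stationary distribution of π and the dataset distribution, and w_{π/𝒟} = d^{π}/d^{𝒟} for their ratio.

```latex
\rho(\pi)
  = (1-\gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)
      \,\Big|\, s_0\sim\beta,\ a_t\sim\pi(\cdot\mid s_t)\Big]
  = \mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
  = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[w_{\pi/\mathcal{D}}(s,a)\,r(s,a)\big]
```

Read right to left, the last equality says that an average of dataset rewards, reweighted by the correction, estimates the value of π. No per-step importance weights appear in it.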

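Proposed Method

Formulate the stationary distribution correction w_{π/𝒟}(s,a) = d^{π}(s,a)/d^{𝒟}(s,a) and relate it to the OPE objective.
Introduce a convex objective J(ν) whose minimizer ν* has a Bellman residual equal to the desired correction: (ν* − B^{π}ν*)(s,a) = w_{π/𝒟}(s,a).
Apply Fenchel duality to turn the squared Bellman residual objective into a saddle-point problem over ν and ζ, which admits unbiased stochastic gradients.
Derive a min-max optimization problem over (ν, ζ) whose solution yields the stationary correction directly as ζ*(s,a) = w_{π/𝒟}(s,a).
Extend the framework to general convex penalty functions f, yielding a family of saddle-point objectives with similar advantages.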
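
The claim that the minimizer's Bellman residual equals the correction can be checked numerically on a small tabular problem. The sketch below is illustrative only and is not the paper's implementation. It assumes a random MDP, a random target policy, and an arbitrary full-support dataset distribution, and it assumes the squared-residual objective J(ν) = ½ E_{d^𝒟}[(ν − B^{π}ν)²] − (1 − γ) E_{s₀∼β, a₀∼π}[ν(s₀,a₀)], with B^{π} the zero-reward Bellman operator. All variable names (`T`, `pi`, `beta`, `d_D`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
n = n_states * n_actions  # (s, a) pairs, flat index = s * n_actions + a

# Hypothetical random problem (illustrative values, not from the paper).
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]
beta = rng.dirichlet(np.ones(n_states))                           # initial state dist.
d_D = 0.5 * rng.dirichlet(np.ones(n)) + 0.5 / n                   # dataset dist. over (s, a)

# P[(s,a), (s',a')] = T(s'|s,a) * pi(a'|s');  mu[(s,a)] = beta(s) * pi(a|s).
P = np.einsum("ijk,kl->ijkl", T, pi).reshape(n, n)
mu = (beta[:, None] * pi).reshape(n)

# A @ nu is the Bellman residual nu - B^pi nu, with (B^pi nu) = gamma * P @ nu.
A = np.eye(n) - gamma * P

# Ground truth: d^pi = (1 - gamma) * (I - gamma * P^T)^{-1} mu, and w = d^pi / d^D.
d_pi = (1.0 - gamma) * np.linalg.solve(A.T, mu)
w_true = d_pi / d_D

# Minimize J(nu) = 0.5 * sum_i d_D[i] * (A nu)[i]^2 - (1 - gamma) * mu @ nu.
# Setting the gradient to zero gives (A^T D A) nu = (1 - gamma) * mu.
D = np.diag(d_D)
nu_star = np.linalg.solve(A.T @ D @ A, (1.0 - gamma) * mu)

# The Bellman residual of the minimizer should match the true correction.
w_est = A @ nu_star
print("max abs error:", np.max(np.abs(w_est - w_true)))
```

The closed-form solve is only possible because the problem is tabular and the dynamics are known. With sampled data and function approximation, the squared expectation inside J(ν) cannot be estimated without bias from single transitions, which is the motivation for the Fenchel-dual saddle-point form.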

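Experimental Results

Research Questions

RQ1: How can discounted stationary distribution corrections be estimated when the off-policy dataset was generated by unknown or multiple behavior policies?
RQ2: Can these corrections be estimated without per-step importance weights while retaining convergence guarantees and practical optimization properties?
RQ3: Under function approximation, how accurate is the proposed DualDICE objective for off-policy evaluation compared with TD-based and IS-based baselines?
RQ4: How does the choice of convex penalty f affect optimization stability and estimation accuracy?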

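Key Findings

DualDICE provides a behavior-agnostic estimator of discounted stationary distribution corrections that does not rely on per-step importance weights.
At the optimum of the ν-parameterized objective, the Bellman residual equals the desired distribution correction w_{π/𝒟}(s,a).
The Fenchel-dualized min-max form yields unbiased gradient estimates and stable optimization.
On control tasks, the method is competitive with or better than TD-based methods for OPE, particularly under function approximation and with unknown behavior policies.
The extension to general convex penalties retains the computational advantages and offers flexibility in balancing approximation error against optimization error.
Empirical results show better stability and accuracy than TD methods in complex environments.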
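
The Fenchel-dualized min-max form mentioned in the findings can be sketched as an empirical loss over sampled transitions. The function below is a minimal illustration, not the authors' code. It assumes tabular arrays `nu` and `zeta` indexed by (state, action), the quadratic penalty f(x) = x²/2, and a hypothetical `batch` dictionary holding dataset transitions (s, a, s'), next actions a' sampled from the target policy, and initial pairs (s0, a0) with a0 sampled from the target policy. In this sketch ν is the minimizing player and ζ the maximizing player.

```python
import numpy as np

def dualdice_saddle_loss(nu, zeta, batch, gamma):
    """Empirical saddle-point objective: minimize over nu, maximize over zeta.

    nu, zeta : arrays of shape (n_states, n_actions)
    batch    : dict of integer index arrays (hypothetical layout)
               's', 'a', 's_next' come from the dataset;
               'a_next' and 'a0' are assumed sampled from the target policy;
               's0' is assumed sampled from the initial state distribution.
    """
    s, a = batch["s"], batch["a"]
    s_next, a_next = batch["s_next"], batch["a_next"]
    s0, a0 = batch["s0"], batch["a0"]

    # Single-sample Bellman residual nu(s, a) - gamma * nu(s', a').
    residual = nu[s, a] - gamma * nu[s_next, a_next]
    z = zeta[s, a]

    # The residual enters linearly, so averaging over samples is unbiased.
    dual_term = np.mean(residual * z - 0.5 * z**2)
    init_term = (1.0 - gamma) * np.mean(nu[s0, a0])
    return dual_term - init_term

# Tiny usage example with made-up numbers.
nu = np.array([[1.0, 2.0], [3.0, 4.0]])
zeta = np.array([[0.5, 1.0], [1.5, 2.0]])
batch = {
    "s": np.array([0, 1]), "a": np.array([1, 0]),
    "s_next": np.array([1, 0]), "a_next": np.array([1, 0]),
    "s0": np.array([0, 0]), "a0": np.array([0, 1]),
}
loss = dualdice_saddle_loss(nu, zeta, batch, gamma=0.5)
print(loss)
```

For a fixed ν, each term is maximized by setting ζ(s,a) to the expected residual at (s,a), which recovers the squared-residual objective. At the saddle point, ζ itself is the estimated correction, so the value estimate is a ζ-weighted average of dataset rewards.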

This review was generated by AI and reviewed by human editors.