QUICK REVIEW

[Paper Review] CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai, Ofir Nachum | arXiv (Cornell University) | Jan 1, 2020

Reinforcement Learning in Robotics | Cited by 11

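One-Sentence Summary

CoinDICE is a novel, efficient algorithm for off-policy confidence interval estimation in reinforcement learning that combines generalized estimating equations with the generalized empirical likelihood method, yielding intervals that are valid in both the asymptotic and finite-sample regimes and that are tighter and more accurate than those of existing methods on a variety of benchmarks.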

ABSTRACT

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.

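Research Motivation and Goals

Develop an off-policy evaluation method that produces valid confidence intervals without any knowledge of the behavior policy.
Guarantee that the confidence intervals are valid in both the asymptotic and finite-sample regimes.
Obtain confidence interval estimates that are tighter and more accurate than those of existing off-policy evaluation methods.
Achieve behavior-agnostic estimation using only a static dataset collected by unknown behavior policies.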

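Proposed Method

The method starts from a function space embedding of the linear program formulation of the Q-function.
From the Q-function constraints, it builds an optimization problem with generalized estimating equation (GEE) constraints.
The generalized empirical likelihood method is applied to the Lagrangian of this constrained optimization problem.
The resulting algorithm, CoinDICE, computes confidence intervals by solving the transformed optimization problem.
Theoretical guarantees in both the asymptotic and finite-sample regimes establish that the resulting confidence intervals are valid.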
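To give a feel for how empirical likelihood turns a constraint into a confidence interval, here is a minimal sketch of the classical empirical likelihood interval for a scalar mean. It is not CoinDICE itself: CoinDICE applies generalized empirical likelihood to the Lagrangian of the Q-function linear program, whereas this toy has a single moment constraint and no policy, Q-function, or off-policy correction. The function names, the synthetic `rewards` sample, and the 95% level are all assumptions made for illustration.

```python
import numpy as np

# 0.95 quantile of the chi-squared distribution with 1 degree of freedom.
CHI2_1_95 = 3.841458820694124


def neg2_log_el_ratio(x, mu, iters=100):
    """-2 log empirical likelihood ratio for the hypothesis mean(x) == mu.

    Hypothetical helper: the optimal weights are w_i = 1 / (n * (1 + lam * z_i))
    with z_i = x_i - mu, where lam solves sum(z_i / (1 + lam * z_i)) = 0.
    """
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf  # mu lies outside the convex hull of the data
    # lam must keep every 1 + lam * z_i strictly positive.
    lo, hi = -1.0 / z.max(), -1.0 / z.min()
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    # The score below is strictly decreasing in lam, so bisection finds its root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(z / (1.0 + mid * z)) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(lam * z))


def el_confidence_interval(x, threshold=CHI2_1_95, iters=60):
    """Set of mu whose -2 log EL ratio stays below the chi-squared threshold."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()

    def endpoint(outer):
        inner = xbar  # the sample mean always lies inside the interval
        for _ in range(iters):
            mid = 0.5 * (inner + outer)
            if neg2_log_el_ratio(x, mid) <= threshold:
                inner = mid
            else:
                outer = mid
        return inner

    return endpoint(x.min()), endpoint(x.max())


# Synthetic, skewed "returns" used purely as example data.
rng = np.random.default_rng(0)
rewards = rng.exponential(scale=1.0, size=200)
lower, upper = el_confidence_interval(rewards)
print(f"95% EL interval: [{lower:.3f}, {upper:.3f}], sample mean {rewards.mean():.3f}")
```

The interval is read off as the set of candidate values that the reweighted data cannot reject, which is why it needs no variance estimate and can be asymmetric around the sample mean for skewed data. The summary above describes CoinDICE as following the same pattern with the policy value in place of the mean and GEE constraints in place of the single moment condition.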

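Experimental Results

Research Questions

RQ1: Can we construct off-policy evaluation confidence intervals that remain valid when the behavior policy is unknown?
RQ2: How can finite-sample validity of confidence intervals be guaranteed in off-policy evaluation?
RQ3: Can confidence interval estimates be made tighter and more accurate than those of existing methods?
RQ4: What is the effect of using generalized estimating equations and empirical likelihood in this setting?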

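Key Findings

The confidence intervals produced by CoinDICE are valid in both the asymptotic and finite-sample regimes.
The confidence intervals produced by CoinDICE are tighter than those of existing off-policy evaluation methods.
Empirically, CoinDICE's confidence interval estimates are more accurate than those of existing methods across a variety of benchmarks.
The method is behavior-agnostic: it requires no knowledge of the policies that generated the dataset.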

This review was generated by AI and reviewed by human editors.