QUICK REVIEW

[Paper Review] Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu, Lihong Li|arXiv (Cornell University)|Oct 29, 2018

Age of Information Optimization | Cited by 112

One-Sentence Summary

This paper proposes an off-policy estimator based on the stationary-state density ratio. By applying importance sampling to state-visitation distributions rather than to whole trajectories, it reduces variance in infinite-horizon settings relative to trajectory-based IS methods. It provides a mini-max density-ratio estimation framework with closed-form RKHS solutions, and backs the method with empirical validation on long-horizon tasks.

ABSTRACT

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.

Research Motivation and Objectives

Motivate and address the high-variance problem of off-policy estimation in long- and infinite-horizon MDPs (the curse of horizon).
Introduce an estimator that applies importance sampling to stationary state visitation distributions rather than whole trajectories.
Develop a mini-max density-ratio estimation framework to compute the stationary ratio between target and behavior policies, with RKHS closed-form results.
Theoretically analyze the proposed loss and its connection to Bellman equations; empirically demonstrate effectiveness on long-horizon tasks.
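
To make the curse of horizon concrete, the following is a minimal sketch rather than anything taken from the paper. It assumes a single-state problem with hypothetical two-action policies `pi` and `pi0`, so that the per-step ratios pi(a)/pi0(a) are i.i.d. and the variance of the trajectory-wise importance weight (the product of T ratios) has the closed form (E[ratio^2])^T - 1. The function name is illustrative.

```python
import numpy as np

def trajectory_is_weight_variance(pi, pi0, horizon):
    """Exact variance of the product of `horizon` i.i.d. ratios pi(a)/pi0(a), a ~ pi0.

    Assumes state-independent policies, so the ratios are i.i.d. with mean 1.
    """
    pi = np.asarray(pi, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    # E_{a ~ pi0}[(pi(a)/pi0(a))^2] = sum_a pi(a)^2 / pi0(a)
    second_moment = np.sum(pi**2 / pi0)
    # The product has mean 1, so Var = E[product^2] - 1 = second_moment^T - 1.
    return second_moment**horizon - 1.0

pi = [0.7, 0.3]    # hypothetical target policy
pi0 = [0.5, 0.5]   # hypothetical behavior policy
for T in (1, 10, 100):
    print(T, trajectory_is_weight_variance(pi, pi0, T))
```

Whenever the two policies differ, the per-step second moment exceeds 1, so the variance grows exponentially in the horizon. This is the behavior that weighting by stationary state distributions is meant to avoid.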

Proposed Method

Formulate off-policy evaluation via a density ratio w_pi/pi0(s)=d_pi(s)/d_pi0(s) between stationary visitation distributions.
Derive an importance-sampling estimator R_pi = E_{(s,a)~d_pi0}[ w_pi/pi0(s) beta_pi/pi0(a|s) r(s,a) ].
Propose a mini-max objective to learn w_pi/pi0: minimize over w the maximum of a discriminator-based loss L(w,f) over test functions f in a function class F, with normalization of w to avoid trivial solutions.
Provide a closed-form representation of the max over discriminators when F is the unit ball of an RKHS, enabling practical estimation.
Extend to discounted rewards (gamma<1) and average reward (gamma=1) cases with corresponding equations and normalization.
Provide a theoretical analysis linking L(w,f) to Bellman operators, with bounds showing that a sufficiently rich function class F yields bounded estimation error for w_pi/pi0 and R_pi.
Demonstrate empirically that the stationary-density-ratio method achieves lower variance and better performance than trajectory-wise and step-wise IS/WIS on long-horizon tasks.
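
The estimator in the list above can be checked on a small tabular problem. The sketch below is illustrative only: the 3-state, 2-action MDP, the reward table, and both policies are made-up values, and the stationary distributions are computed exactly from the known transition tensor instead of being estimated from data as in the paper. It verifies the average-reward identity R_pi = E_{(s,a)~d_pi0}[ w(s) beta(a|s) r(s,a) ] with w = d_pi/d_pi0 and beta = pi/pi0.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P (d = d P, sum d = 1)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def state_transition(T, policy):
    """State-to-state matrix P[s, s'] = sum_a policy[s, a] * T[s, a, s']."""
    return np.einsum("sa,sat->st", policy, T)

# Hypothetical MDP: T[s, a, s'] transition probabilities, r[s, a] rewards.
T = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
    [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],
])
r = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])
pi0 = np.full((3, 2), 0.5)                            # behavior policy
pi = np.array([[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])   # target policy

d_pi0 = stationary_distribution(state_transition(T, pi0))
d_pi = stationary_distribution(state_transition(T, pi))

w = d_pi / d_pi0   # stationary density ratio w_pi/pi0(s)
beta = pi / pi0    # policy ratio beta_pi/pi0(a|s)

# Expectation under the behavior policy's stationary state-action distribution.
R_is = np.einsum("s,sa,s,sa,sa->", d_pi0, pi0, w, beta, r)
# Ground truth: average reward of the target policy.
R_true = np.einsum("s,sa,sa->", d_pi, pi, r)
print(R_is, R_true)
```

In practice d_pi is unknown, which is why the paper estimates w from behavior data. The sketch only shows that, given the correct ratio, a single per-sample weight w(s) beta(a|s) suffices, with no product over time steps.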

Experimental Results

Research Questions

RQ1: Can off-policy evaluation for infinite-horizon MDPs be made variance-robust by weighting over stationary state-visitation distributions instead of entire trajectories?
RQ2: How can we consistently estimate the stationary density ratio w_pi/pi0(s) using only off-policy data from the behavior policy?
RQ3: Does a mini-max density-ratio estimation framework with RKHS yield closed-form solutions and theoretical guarantees for off-policy evaluation?
RQ4: How does the proposed method perform in long-horizon scenarios compared to traditional IS/WIS approaches across discrete and continuous state spaces?

Key Findings

An importance-sampling estimator based on stationary state densities reduces variance and eliminates horizon dependence.
A mini-max density-ratio estimator is derived, with a closed-form RKHS solution for the max-discriminator objective.
The loss underlying the density-ratio estimator is connected to Bellman operators, which yields bounds that control the error of the estimated reward.
Empirical results across Taxi, Pendulum, and SUMO environments show improved performance over trajectory-based IS/WIS, particularly as horizon length increases or discount factors approach 1.
The method remains effective in continuous state spaces by parameterizing w with neural nets and using RKHS-based discriminators.
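
The RKHS closed form mentioned in the findings can be illustrated in the tabular, average-reward case. The sketch below is written under stated assumptions and is not the authors' implementation: it takes the loss to be L(w,f) = E_{(s,a,s')~d_pi0}[ (w(s) beta(a|s) - w(s')) f(s') ], uses a Gaussian kernel on state indices as a stand-in kernel, and evaluates all expectations exactly from a made-up MDP. With f ranging over the unit ball of the RKHS, the maximum of L(w,f)^2 reduces to the quadratic form g^T K g, where g(s') = sum_s d_pi0(s) w(s) P_pi(s'|s) - d_pi0(s') w(s').

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def rkhs_loss(w, d_pi0, P_pi, K):
    """max over the RKHS unit ball of L(w, f)^2, computed as g^T K g."""
    weighted = d_pi0 * w
    g = weighted @ P_pi - weighted   # residual of the stationarity equation
    return float(g @ K @ g)

# Hypothetical 3-state, 2-action MDP (same illustrative style as a toy example).
T = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
    [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],
])
pi0 = np.full((3, 2), 0.5)
pi = np.array([[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])

P_pi0 = np.einsum("sa,sat->st", pi0, T)
P_pi = np.einsum("sa,sat->st", pi, T)
d_pi0 = stationary_distribution(P_pi0)
d_pi = stationary_distribution(P_pi)

# Gaussian kernel on state indices (an assumed, illustrative choice).
idx = np.arange(3)
K = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)

loss_true = rkhs_loss(d_pi / d_pi0, d_pi0, P_pi, K)   # true ratio
loss_ones = rkhs_loss(np.ones(3), d_pi0, P_pi, K)     # wrong guess w = 1
print(loss_true, loss_ones)
```

The true ratio drives the loss to zero up to floating-point error, while the wrong guess gives a strictly positive value. That gap is what lets the minimization over w identify the density ratio once w is normalized.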

This review was generated by AI and checked by a human editor.