QUICK REVIEW

[Paper Review] Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel, Alon Cohen|arXiv (Cornell University)|Feb 18, 2026

Reinforcement Learning in Robotics | Citations: 0

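One-line summary

This paper introduces a new online-MDP-based exploration algorithm for the reward-free and reward-agnostic settings in episodic finite-horizon MDPs, achieves near-optimal leading-order sample complexity, and establishes a tight lower bound for reward-free exploration.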

ABSTRACT

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

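Motivation and objectives

Motivate reward-free and reward-agnostic exploration, in which the agent learns the environment dynamics without observing rewards during data collection.
Reduce the sample complexity of obtaining an ε-optimal policy once the reward is revealed or drawn from a reward class.
Develop an algorithm that reuses samples and has smaller lower-order terms than prior work.
Provide a tight lower bound for reward-free exploration in time-inhomogeneous MDPs, settling a conjecture about optimality.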

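Proposed method

Use the online-MDP (OMD) framework to design a single exploration run whose carefully constructed auxiliary rewards produce a policy cover.
Construct the exploration policy by running an online algorithm on a sequence of rewards that encourage visits to under-explored state-action-time triples.
Estimate the MDP dynamics from trajectories generated by the exploration policy, using either a fixed sample budget or one that depends on the reward class.
After the reward is revealed, run pessimistic planning on the empirical MDP, with Bernstein-type penalties for uncertainty, to compute an ε-optimal policy.
Formulate the exploration step as a convex optimization problem over occupancy measures, and analyze it using first-order optimality conditions and regret bounds.
Prove that a single online-MDP run achieves nearly uniform exploration of the significant triples and keeps the lower-order terms under control.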
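The estimation and planning steps listed above can be illustrated with a minimal sketch. It assumes a tabular, time-inhomogeneous MDP with rewards in [0, 1]; the function names `estimate_dynamics` and `pessimistic_plan`, the confidence parameter `delta`, and the penalty constants are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def estimate_dynamics(trajectories, H, S, A):
    """Empirical transition kernel and visit counts from reward-free data.

    Each trajectory is a list of H (state, action, next_state) tuples,
    one tuple per time step.
    """
    counts = np.zeros((H, S, A, S))
    for traj in trajectories:
        for h, (s, a, s_next) in enumerate(traj):
            counts[h, s, a, s_next] += 1
    n = counts.sum(axis=-1)  # visits to each (h, s, a) triple
    # Unvisited triples fall back to a uniform next-state distribution.
    p_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n, 1)[..., None],
                     1.0 / S)
    return p_hat, n

def pessimistic_plan(p_hat, n, reward, delta=0.1):
    """Backward induction on the empirical MDP with a Bernstein-style penalty.

    reward has shape (H, S, A). The penalty below is an illustrative
    variance-plus-range form; its constants are assumptions.
    """
    H, S, A, _ = p_hat.shape
    log_term = np.log(H * S * A / delta)
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        mean_next = p_hat[h] @ V[h + 1]                       # shape (S, A)
        var_next = p_hat[h] @ V[h + 1] ** 2 - mean_next ** 2  # shape (S, A)
        var_next = np.maximum(var_next, 0.0)
        n_h = np.maximum(n[h], 1)
        penalty = np.sqrt(2 * var_next * log_term / n_h) + H * log_term / n_h
        # Subtracting the penalty makes the value estimate conservative.
        Q = np.clip(reward[h] + mean_next - penalty, 0.0, H - h)
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V
```

In this sketch, triples with few visits receive a large penalty, so the planner only trusts parts of the empirical MDP that the exploration data covers well. This is why near-uniform coverage of the significant triples matters for the final policy.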

Figure 1: Multiple states MDP construction for lower bound. Solid lines represent deterministic transition, and dashed lines represent probabilistic transitions. Blue, red and green represent classes of deterministic actions (see Definition C.3).

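Results

Research questions

RQ1: What is the optimal sample complexity of reward-free exploration in time-inhomogeneous episodic MDPs?
RQ2: Can reward-agnostic exploration achieve near-optimal ε-optimality with substantially smaller lower-order terms than prior work?
RQ3: Can online-MDP algorithms be used to design an exploration policy that uniformly covers the space of significant state-action-time triples?
RQ4: What is the tight lower bound for reward-free exploration, and how does it match existing upper bounds?
RQ5: Can exploration data be reused to minimize the additional sampling cost of the dynamics-estimation phase?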

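Key findings

Proposes an algorithm (Algorithm 1) that achieves leading-order sample complexity Õ(H^3|S|^2|A|/ε^2) in both the reward-free and reward-agnostic settings, with smaller lower-order terms than prior work.
Proves a tight lower bound for reward-free exploration in time-inhomogeneous MDPs: Ω(|S|^2|A|H^3/ε^2) trajectories are required in expectation.
Shows that a single online-MDP run, applied to a sequence of crafted rewards, yields an exploration policy with nearly uniform coverage of the ω-significant triples.
Offers a convex-optimization view in which the exploration objective over occupancy measures leads to a practically efficient sampling strategy.
Establishes that the proposed exploration policy yields dynamics estimates accurate enough to compute an ε-optimal policy after the reward is revealed.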
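The upper and lower bounds reported above can be placed side by side to show what "tight" means here. The symbol $N^\star(\varepsilon)$ is introduced only for this illustration and denotes the number of trajectories needed for reward-free exploration to accuracy $\varepsilon$. Logarithmic factors, lower-order terms, and any conditions on the range of $\varepsilon$ are omitted.

```latex
\[
  \Omega\!\left(\frac{|S|^2 |A| H^3}{\varepsilon^2}\right)
  \;\le\; N^\star(\varepsilon) \;\le\;
  \tilde{O}\!\left(\frac{H^3 |S|^2 |A|}{\varepsilon^2}\right)
  \quad\Longrightarrow\quad
  N^\star(\varepsilon) = \tilde{\Theta}\!\left(\frac{H^3 |S|^2 |A|}{\varepsilon^2}\right).
\]
```

Both sides scale identically in $H$, $|S|$, $|A|$, and $\varepsilon$, so the leading-order term cannot be improved beyond logarithmic factors.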

Figure 2: Single state lower bound scheme MDP construction for lower bound. Solid lines represent deterministic transition, and dashed lines represent probabilistic transitions.


This review was generated by AI and checked by a human editor.