QUICK REVIEW

[Paper Review] Phased Exploration with Greedy Exploitation in Stochastic Combinatorial Partial Monitoring Games

Sougata Chaudhuri, Ambuj Tewari|arXiv (Cornell University)|Jan 1, 2016

Advanced Bandit Algorithms Research | References: 7 | Citations: 52

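One-Sentence Summary

This paper proposes the Phased Exploration with Greedy Exploitation (PEGE) framework for stochastic combinatorial partial monitoring (CPM) games. Using only the arg-max oracle, the framework achieves O(T^{2/3}√log T) distribution-independent regret and O(log² T) distribution-dependent regret. Unlike prior work, the distribution-dependent bound does not assume a unique optimal action, and avoiding the arg-secondmax oracle allows an efficient application to online ranking with feedback at the top.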

ABSTRACT

Partial monitoring games are repeated games where the learner receives feedback that might be different from adversary's move or even the reward gained by the learner. Recently, a general model of combinatorial partial monitoring (CPM) games was proposed \cite{lincombinatorial2014}, where the learner's action space can be exponentially large and adversary samples its moves from a bounded, continuous space, according to a fixed distribution. The paper gave a confidence bound based algorithm (GCB) that achieves $O(T^{2/3}\log T)$ distribution independent and $O(\log T)$ distribution dependent regret bounds. The implementation of their algorithm depends on two separate offline oracles and the distribution dependent regret additionally requires existence of a unique optimal action for the learner. Adopting their CPM model, our first contribution is a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework for the problem. Different algorithms within the framework achieve $O(T^{2/3}\sqrt{\log T})$ distribution independent and $O(\log^2 T)$ distribution dependent regret respectively. Crucially, our framework needs only the simpler "argmax" oracle from GCB and the distribution dependent regret does not require existence of a unique optimal action. Our second contribution is another algorithm, PEGE2, which combines gap estimation with a PEGE algorithm, to achieve an $O(\log T)$ regret bound, matching the GCB guarantee but removing the dependence on size of the learner's action space. However, like GCB, PEGE2 requires access to both offline oracles and the existence of a unique optimal action. Finally, we discuss how our algorithm can be efficiently applied to a CPM problem of practical interest: namely, online ranking with feedback at the top.
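To make the comparison in the abstract concrete, the ratios of the stated bounds can be worked out directly. This is only an order-of-magnitude reading: constants and problem-dependent factors hidden by the O-notation are ignored.

```latex
\[
\frac{T^{2/3}\log T}{T^{2/3}\sqrt{\log T}} = \sqrt{\log T},
\qquad
\frac{\log^{2} T}{\log T} = \log T .
\]
```

Read this way, the distribution-independent bound of PEGE is tighter than that of GCB by a factor of \sqrt{\log T}, while its distribution-dependent bound is looser by a factor of \log T. The looser bound is the price of dropping the second oracle and the unique-optimal-action assumption. For T = 10^6 and natural logarithms, \log T is about 13.8 and \sqrt{\log T} is about 3.7.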

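Motivation and Objectives

Remove the limitation of prior CPM algorithms that require both the argmax and arg-secondmax oracles.
Develop regret-minimization algorithms for combinatorial partial monitoring games with exponentially large action spaces and continuous adversary moves.
Eliminate the unique-optimal-action assumption from the distribution-dependent regret analysis.
Enable practical use in real-world applications such as online ranking with feedback at the top.
Achieve regret bounds comparable to or better than those of existing methods while reducing the dependence on offline oracles.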

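Proposed Method

Proposes a phased framework that alternates between exploration phases and greedy exploitation phases.
Uses only the arg-max oracle, which is simpler than the two-oracle requirement of prior methods.
Implements greedy exploitation by selecting the action that maximizes the current reward estimate.
Introduces PEGE2, which combines gap estimation with a PEGE algorithm to achieve O(log T) distribution-dependent regret.
Operates under the assumptions of the CPM model, including global observability and Lipschitz continuity of the reward function.
Applies the framework to online ranking, modeled as a CPM game whose actions are permutations.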
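The alternation described above can be sketched on a toy problem. This is a minimal illustration, not the authors' algorithm: the linear reward, the least-squares estimator, the doubling exploitation schedule `2 ** phase`, and all names (`pege_sketch`, `argmax_oracle`, `explore_set`, `feedback_mats`) are assumptions made for the example. The only ingredients taken from the review are the phase structure, the arg-max oracle, and the requirement that the exploration set makes the parameter globally observable.

```python
import numpy as np

def argmax_oracle(actions, theta_hat):
    """Index of the action with the highest estimated reward (linear reward assumed)."""
    return int(np.argmax(actions @ theta_hat))

def pege_sketch(actions, feedback_mats, explore_set, sample_theta, horizon,
                exploit_len=lambda phase: 2 ** phase):
    """Alternate exploration passes with greedy exploitation blocks (toy version)."""
    # Stacking the feedback maps of the exploration set must give full column
    # rank (a stand-in for global observability) so theta can be recovered.
    stacked = np.vstack([feedback_mats[i] for i in explore_set])
    pinv = np.linalg.pinv(stacked)
    feedback_sum = np.zeros(stacked.shape[0])
    theta_hat = np.zeros(actions.shape[1])
    played, n_passes, phase, t = [], 0, 0, 0
    while t < horizon:
        phase += 1
        # Exploration: one full pass over the exploration set, if it still fits.
        if horizon - t >= len(explore_set):
            obs = [feedback_mats[i] @ sample_theta() for i in explore_set]
            feedback_sum += np.concatenate(obs)
            n_passes += 1
            theta_hat = pinv @ (feedback_sum / n_passes)
            played.extend(explore_set)
            t += len(explore_set)
        # Exploitation: repeat the greedy action; feedback in these rounds is ignored.
        best = argmax_oracle(actions, theta_hat)
        n_exploit = min(exploit_len(phase), horizon - t)  # hypothetical schedule
        played.extend([best] * n_exploit)
        t += n_exploit
    return played, theta_hat

# Toy instance (all values are made up for illustration).
rng = np.random.default_rng(0)
theta_star = np.array([0.2, 0.5, 0.9])
actions = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)
feedback_mats = [a.reshape(1, -1) for a in actions]  # each action reveals its own reward
explore_set = [0, 1, 2]                              # together these identify theta
sample_theta = lambda: theta_star + rng.uniform(-0.05, 0.05, size=3)
played, theta_hat = pege_sketch(actions, feedback_mats, explore_set,
                                sample_theta, horizon=500)
```

Because the exploitation blocks grow while each exploration pass has fixed length, the fraction of rounds spent exploring shrinks over time. The choice of schedule is what trades off the two kinds of regret bound, and the doubling rule here is only a placeholder for that choice.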

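Results

Research Questions

RQ1: Is there a CPM algorithm that achieves O(log² T) distribution-dependent regret without assuming a unique optimal action?
RQ2: Can an O(log T) regret bound be achieved without relying on the arg-secondmax oracle?
RQ3: Can the PEGE framework be applied efficiently to online ranking with feedback at the top?
RQ4: Does phased exploration combined with greedy exploitation outperform confidence-bound-based methods in CPM games?
RQ5: Can the framework handle continuous learner action spaces while maintaining low regret?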

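Key Findings

Using only the arg-max oracle, PEGE algorithms achieve O(T^{2/3}√log T) distribution-independent regret and O(log² T) distribution-dependent regret.
Unlike prior distribution-dependent bounds, the PEGE framework does not assume a unique optimal action.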
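PEGE2 achieves O(log T) distribution-dependent regret, matching the GCB guarantee while removing the dependence on the size of the learner's action space. Like GCB, however, it requires both offline oracles and a unique optimal action.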
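Online ranking with feedback at the top is formally modeled as a CPM game and satisfies all of the required assumptions.
The framework accommodates both finite and continuous learner action spaces and can be applied to continuous score vectors for ranking.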
Experimental validation indicates that the method is practical even for large-scale ranking problems under feedback constraints.
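The ranking finding above depends on the arg-max oracle being cheap even though the action space contains every permutation. A minimal sketch of why this can hold, assuming (this is not stated in the review) that the estimated reward of a ranking is a position-weighted sum of item scores with non-increasing, non-negative weights, as in DCG-style measures: the best permutation is then obtained by sorting. The function names and the form of `top_k_feedback` are hypothetical.

```python
import numpy as np

def ranking_argmax_oracle(score_estimates):
    """Return item indices ordered by descending estimated score.

    Under the assumed reward sum_k w_k * s[pi(k)] with non-increasing,
    non-negative weights w, this ordering maximizes the reward over all
    permutations, at the cost of a single sort.
    """
    return np.argsort(-np.asarray(score_estimates), kind="stable")

def top_k_feedback(ranking, relevance, k=1):
    """Reveal the relevance of only the top-k ranked items (assumed feedback form)."""
    return np.asarray(relevance)[ranking[:k]]

# Illustrative values only.
scores = np.array([0.3, 0.9, 0.1, 0.5])
relevance = np.array([0, 1, 0, 1])
weights = 1.0 / np.log2(np.arange(2, 6))   # DCG-style position discounts
best_ranking = ranking_argmax_oracle(scores)
best_value = float(weights @ scores[best_ranking])
feedback = top_k_feedback(best_ranking, relevance, k=1)
```

The sketch covers only the oracle and the shape of the feedback. How scores are estimated from top-only feedback is the part handled by the exploration phases and is not shown here.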


This review was generated by AI and checked by a human editor.