QUICK REVIEW

[Paper Review] A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

Yifang Chen, Chung‐Wei Lee|arXiv (Cornell University)|Feb 3, 2019

Advanced Bandit Algorithms Research|References: 27|Citations: 39

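One-Sentence Summary

This paper proposes the first parameter-free, efficient, and optimal contextual bandit algorithm for non-stationary environments; using replay phases and no prior knowledge of S or Δ, it achieves a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}).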

ABSTRACT

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.
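
To see why the rates quoted in the abstract are never worse than those of Luo et al. (2018), one can compare them term by term. The sketch below assumes $S \le T$ and $\Delta \le T$ (an assumption made here for illustration, covering the regime where the bounds are non-trivial) and ignores constants and logarithmic factors.

```latex
\begin{align*}
\frac{\sqrt{ST}}{S^{1/4}\,T^{3/4}} &= \left(\frac{S}{T}\right)^{1/4} \le 1, \\
\frac{\Delta^{1/3}\,T^{2/3}}{\Delta^{1/5}\,T^{4/5}} &= \left(\frac{\Delta}{T}\right)^{2/15} \le 1.
\end{align*}
```

For example, with $T = 10^4$ and $S = 16$, the new switch-based rate is $\sqrt{ST} = 400$, while the earlier rate is $S^{1/4}T^{3/4} = 2 \cdot 10^3 = 2000$, a ratio of $(16/10^4)^{1/4} = 0.2$.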

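Motivation and Objectives

Motivate and address non-stationary environments, in which no single policy remains optimal over time.
Propose a parameter-free contextual bandit algorithm with dynamic regret guarantees.
Achieve adaptive performance without prior knowledge of the number of switches or the amount of variation in the environment.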
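
For reference, the quantities named above are commonly formalized as follows. This is a sketch in assumed notation that does not appear in this review: $\mathcal{D}_t$ is the round-$t$ distribution over context-reward pairs, $\Pi$ is the policy class, $x_t$ is the observed context, and $a_t$ is the action played. The paper's exact definitions may differ in constants or in the choice of norm.

```latex
\begin{align*}
\pi_t^\star &= \arg\max_{\pi \in \Pi} \; \mathbb{E}_{(x,r)\sim\mathcal{D}_t}\big[r(\pi(x))\big], \\
\text{Dynamic regret} &= \sum_{t=1}^{T} \Big( r_t\big(\pi_t^\star(x_t)\big) - r_t(a_t) \Big), \\
S &= 1 + \sum_{t=2}^{T} \mathbf{1}\{\mathcal{D}_t \neq \mathcal{D}_{t-1}\}, \\
\Delta &= \sum_{t=2}^{T} \lVert \mathcal{D}_t - \mathcal{D}_{t-1} \rVert_{\mathrm{TV}}.
\end{align*}
```

The comparator $\pi_t^\star$ changes from round to round, which is what separates dynamic regret from the usual static regret against one fixed policy.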

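Proposed Method

Introduce replay phases, in which the algorithm acts according to its past decisions, to detect non-stationarity.
Develop an online learning framework that balances exploration and exploitation across replay phases and regular phases.
Prove a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}) over T rounds with K actions, S switches, and total variation Δ.
Assume access to an ERM (empirical risk minimization) oracle for efficient implementation.
The algorithm is adaptive to S and Δ and parameter-free.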
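
To make the ERM oracle assumption concrete, here is a minimal sketch of what such an oracle could look like for a small, finite policy class. It is an illustration only, not the paper's implementation: the names `erm_oracle`, `Context`, and `Policy` are hypothetical, and the reward vectors are assumed to be full per-action reward estimates (for example, importance-weighted estimates built from bandit feedback).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Context = Tuple[float, ...]
Policy = Callable[[Context], int]  # maps a context to an action index


def erm_oracle(
    policies: Sequence[Policy],
    data: Sequence[Tuple[Context, List[float]]],
) -> Policy:
    """Return the policy with the highest total (estimated) reward on `data`.

    Each element of `data` is (context, reward_vector), where reward_vector[a]
    is the assumed reward estimate for action a in that context.
    """
    def total_reward(policy: Policy) -> float:
        return sum(rewards[policy(x)] for x, rewards in data)

    # Brute-force search; a practical oracle would use a learning algorithm.
    return max(policies, key=total_reward)


def always_0(x: Context) -> int:
    return 0


def always_1(x: Context) -> int:
    return 1


def threshold(x: Context) -> int:
    # Picks action 0 for small first features, action 1 otherwise.
    return 0 if x[0] < 0.5 else 1


data = [((0.2,), [1.0, 0.0]), ((0.8,), [0.0, 1.0]), ((0.9,), [0.2, 0.7])]
best = erm_oracle([always_0, always_1, threshold], data)
print(best.__name__)
```

Treating the oracle as a black box is the point of the assumption: the bandit algorithm only needs the maximizing policy, so its running time is measured in oracle calls rather than in the size of the policy class.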

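Results

Research Questions

RQ1: How can non-stationarity in contextual bandits be detected and handled efficiently without knowing S and Δ in advance?
RQ2: What dynamic regret guarantees are achievable for non-stationary contextual bandits given an ERM oracle?
RQ3: Can a replay mechanism deliver optimal or near-optimal performance in the contextual setting without sacrificing efficiency?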

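Key Findings

Achieves a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}).
The algorithm is parameter-free and adaptive even when S and Δ are unknown.
Replay phases enable detection of non-stationarity while maintaining the balance between exploration and exploitation.
Improves on earlier bounds: it generalizes the O(√(ST)) result of Auer et al. (2018), which holds only for the two-armed bandit without context, and improves the O(min{S^{1/4} T^{3/4}, Δ^{1/5} T^{4/5}}) bound of Luo et al. (2018).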
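
The min{·} in the bound means that whichever measure of non-stationarity is more favorable determines the rate. The short sketch below evaluates both terms and reports which one is active. It ignores constants and logarithmic factors, so the numbers are order-of-magnitude aids only, and the function name `dynamic_regret_rate` is hypothetical.

```python
import math
from typing import Tuple


def dynamic_regret_rate(K: int, S: int, delta: float, T: int) -> Tuple[float, str]:
    """Evaluate min{sqrt(KST), K^(1/3) * delta^(1/3) * T^(2/3)}.

    Returns the smaller term and a label saying which term it is.
    Constants and log factors are dropped (an assumption for illustration).
    """
    switch_term = math.sqrt(K * S * T)
    variation_term = (K * delta) ** (1 / 3) * T ** (2 / 3)
    if switch_term <= variation_term:
        return switch_term, "switch"
    return variation_term, "variation"


# A few abrupt switches with large total variation: the switch term is smaller.
few_switches = dynamic_regret_rate(K=4, S=4, delta=100.0, T=10_000)
# Many tiny drifts with small total variation: the variation term is smaller.
slow_drift = dynamic_regret_rate(K=4, S=5_000, delta=1.0, T=10_000)

print(few_switches)
print(slow_drift)
```

Because the algorithm is described as adaptive, it is not told which regime it is in; the bound states that it performs as if it had chosen the better of the two.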

This review was generated by AI and checked by a human editor.