QUICK REVIEW

[Paper Review] Batched Multi-armed Bandits Problem

Zijun Gao, Yanjun Han|arXiv (Cornell University)|Apr 3, 2019

Advanced Bandit Algorithms Research|References: 34|Citations: 42

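One-Sentence Summary

This paper introduces BaSE, a batched successive elimination policy that achieves near-optimal minimax and problem-dependent regret for the batched multi-armed bandit (MAB) problem, with matching lower bounds under both static and adaptive grids.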

ABSTRACT

In this paper, we study the multi-armed bandit problem in the batched setting where the employed policy must split data into a small number of batches. While the minimax regret for the two-armed stochastic bandits has been completely characterized in \cite{perchet2016batched}, the effect of the number of arms on the regret for the multi-armed case is still open. Moreover, the question whether adaptively chosen batch sizes will help to reduce the regret also remains underexplored. In this paper, we propose the BaSE (batched successive elimination) policy to achieve the rate-optimal regrets (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined in an adaptive manner.

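Motivation and Objectives

Motivates learning in settings where data arrive in batches and only a limited number of interaction rounds is available.
Characterizes the minimax and problem-dependent regret as a function of the number of arms K, the number of batches M, and the time horizon T.
Develops a policy that achieves rate-optimal regret, up to polylogarithmic factors, under the batching constraint.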

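Proposed Method

Proposes BaSE, which explores during the first M-1 batches and commits to a single arm in the last batch.
Eliminates active arms using confidence intervals: at the end of each batch, arms that are clearly suboptimal are removed.
Provides two static grids (minimax and geometric) to prove the upper bounds, and analyzes the regret bounds under each grid.
Shows that when M grows with T, the upper bounds for BaSE match the known fully adaptive rates up to polylogarithmic factors.
Derives lower bounds for static grids and for general adaptive grids, establishing the minimax and problem-dependent limits.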
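
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. This review does not give the grid formulas or the elimination rule, so the following are assumptions: the minimax grid t_m = floor(a^(2 - 2^(1-m))) with a = T^(1/(2 - 2^(1-M))), the geometric grid t_m = floor(b^m) with b = T^(1/M), and the elimination threshold sqrt(gamma · log(TK) / tau), where tau is the smallest pull count among active arms. The names `minimax_grid`, `geometric_grid`, `run_base`, and the tuning constant `gamma` are hypothetical. Rewards are simulated as unit-variance Gaussians purely for illustration.

```python
import math
import random


def minimax_grid(T, M):
    """Assumed minimax grid: t_m = floor(a^(2 - 2^(1-m))), a = T^(1/(2 - 2^(1-M)))."""
    a = T ** (1.0 / (2.0 - 2.0 ** (1 - M)))
    # The small epsilon guards against results like 9.9999999 flooring one too low.
    grid = [min(T, math.floor(a ** (2.0 - 2.0 ** (1 - m)) + 1e-9))
            for m in range(1, M + 1)]
    grid[-1] = T  # the last batch always ends at the horizon
    return grid


def geometric_grid(T, M):
    """Assumed geometric grid: t_m = floor(b^m), b = T^(1/M)."""
    b = T ** (1.0 / M)
    grid = [min(T, math.floor(b ** m + 1e-9)) for m in range(1, M + 1)]
    grid[-1] = T
    return grid


def run_base(means, T, grid, gamma=1.0, seed=0):
    """Simulate one run on Gaussian arms and return the pseudo-regret."""
    rng = random.Random(seed)
    K = len(means)
    best_mean = max(means)
    active = list(range(K))
    sums = [0.0] * K
    pulls = [0] * K
    regret = 0.0
    t = 0
    for t_end in grid[:-1]:
        # Exploration batch: pull the active arms in round-robin order.
        for s in range(t_end - t):
            arm = active[s % len(active)]
            sums[arm] += rng.gauss(means[arm], 1.0)
            pulls[arm] += 1
            regret += best_mean - means[arm]
        t = max(t, t_end)
        # Elimination is decided only at the batch boundary.
        tau = min(pulls[i] for i in active)
        if tau > 0:
            avg = {i: sums[i] / pulls[i] for i in active}
            top = max(avg.values())
            width = math.sqrt(gamma * math.log(T * K) / tau)
            active = [i for i in active if top - avg[i] < width]

    def empirical_mean(i):
        return sums[i] / pulls[i] if pulls[i] else float("-inf")

    # Final batch: commit to the surviving arm with the best empirical mean.
    champion = max(active, key=empirical_mean)
    regret += (T - t) * (best_mean - means[champion])
    return regret
```

For example, `run_base([0.9, 0.1], 10000, minimax_grid(10000, 3))` runs two exploration batches and then commits. Because the policy only looks at the data at batch boundaries, it needs exactly M - 1 decision points regardless of T.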

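Experimental Results

Research Questions

RQ1: In K-armed batched bandits, how does the number of batches M affect the minimax and problem-dependent regret?
RQ2: Can batched policies approach the fully adaptive regret rates, and which grid type (static or adaptive) is needed to do so?
RQ3: What are the fundamental lower bounds for the batched MAB under static and adaptive grids?
RQ4: In this setting, do adaptive batch sizes offer a meaningful improvement over fixed grids?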

Key Findings

For any K ≥ 2, T ≥ 1, and 1 ≤ M ≤ T, the BaSE policy achieves E[R_T] ≤ polylog(K,T) · sqrt(K) · T^{1/(2-2^{1-M})} under the minimax grid.
In the same setting, BaSE achieves E[R_T] ≤ polylog(K,T) · K · T^{1/M} / min_{i≠*} Δ_i under the geometric grid.
Corollary: M = O(log log T) batches suffice for the minimax regret Θ(sqrt(K T)), and M = O(log T) batches suffice for the problem-dependent regret Θ(K log T), both up to logarithmic factors.
Lower bounds for static grids: R_min-max ≥ c · sqrt(K) · T^{1/(2-2^{1-M})} and R_pro-dep ≥ c · K · T^{1/M}.
Lower bounds for adaptive grids lose a polynomial factor M^{-2}: R_min-max ≥ c · M^{-2} · sqrt(K) · T^{1/(2-2^{1-M})} and R_pro-dep ≥ c · M^{-2} · K · T^{1/M} (still close to the static-grid bounds, up to polylogarithmic factors).
Corollaries: Ω(log log T) batches are necessary for minimax optimality, and Ω(log T / log log T) batches are necessary for problem-dependent optimality, under either grid type.
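
The O(log log T) figure follows from simple arithmetic on the minimax exponent. Since 1/(2 - 2^{1-M}) - 1/2 = 1/(2^{M+1} - 2), the batched bound exceeds the fully adaptive sqrt(T) rate by a factor of T^{1/(2^{M+1} - 2)}. That factor is at most e once 2^{M+1} - 2 ≥ ln T, which holds for M on the order of log log T. The short sketch below checks this numerically. The function names are hypothetical, and the threshold e is an arbitrary choice of "constant factor" made for illustration.

```python
import math


def minimax_exponent(M):
    """Exponent of T in the minimax bound: 1 / (2 - 2^(1-M))."""
    return 1.0 / (2.0 - 2.0 ** (1 - M))


def excess_exponent(M):
    """Gap between the batched exponent and the fully adaptive 1/2."""
    return 1.0 / (2 ** (M + 1) - 2)


def batches_until_constant_gap(T):
    """Smallest M for which T^(excess exponent) is at most e."""
    M = 1
    while T ** excess_exponent(M) > math.e:
        M += 1
    return M
```

With these definitions, `batches_until_constant_gap(10**6)` returns 3 and `batches_until_constant_gap(10**9)` returns 4, so a thousandfold longer horizon costs only one extra batch. The same reasoning applied to the geometric grid gives T^{1/M} ≤ e once M ≥ ln T, which is consistent with the O(log T) batches quoted for the problem-dependent regret.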


This review was generated by AI and checked by a human editor.