QUICK REVIEW

[Paper Review] Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Yanjun Han, Zhengqing Zhou|arXiv (Cornell University)|Apr 14, 2020

Advanced Bandit Algorithms Research | References: 55 | Citations: 31

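One-line summary

This paper analyzes sequential batch learning in finite-action linear contextual bandits, derives regret upper and lower bounds under both adversarial and stochastic contexts, and proposes a matching algorithm for each setting.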

ABSTRACT

We study the sequential batch learning problem in linear contextual bandits with finite action sets, where the decision maker is constrained to split incoming individuals into (at most) a fixed number of batches and can only observe outcomes for the individuals within a batch at the batch's end. Compared to both standard online contextual bandit learning and offline policy learning in contextual bandits, this sequential batch learning problem provides a finer-grained formulation of many personalized sequential decision making problems in practical applications, including medical treatment in clinical trials, product recommendation in e-commerce and adaptive experiment design in crowdsourcing. We study two settings of the problem: one where the contexts are arbitrarily generated and the other where the contexts are iid drawn from some distribution. In each setting, we establish a regret lower bound and provide an algorithm, whose regret upper bound nearly matches the lower bound. As an important insight revealed therefrom, in the former setting, we show that the number of batches required to achieve the fully online performance is polynomial in the time horizon, while for the latter setting, a pure-exploitation algorithm with a judicious batch partition scheme achieves the fully online performance even when the number of batches is less than logarithmic in the time horizon. Together, our results provide a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.

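Motivation and objectives

Motivate and formalize sequential batch learning, in which rewards are observed only at the end of each batch.
Characterize how a fixed number of batches M affects regret in linear contextual bandits with finite action sets.
Develop algorithms for both the adversarial-context and stochastic-context settings, and prove regret upper and lower bounds for each.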

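Proposed method

Extend online contextual bandits to batch-constrained feedback, formulating sequential batch learning in terms of a grid of M batches and a batch policy.
Propose the sequential batch UCB (SBUCB) algorithm, which updates the estimate of θ at the end of each batch and uses upper confidence bounds within each batch.
Provide a master algorithm that handles the dependence issues and establishes that the confidence intervals remain valid.
Derive regret upper and lower bounds for adversarial contexts, showing polylogarithmic factors in T and the dependence on M.
Analyze stochastic contexts with a pure-exploitation algorithm and derive the corresponding regret bounds.
Present problem-dependent regret bounds and discuss high-probability guarantees.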
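
The batch-constrained loop described above can be illustrated with a minimal Python sketch. It is not the paper's SBUCB or its master algorithm: it is a simplified ridge-regression variant written for this review, and the function name `batched_linear_bandit`, the parameters `alpha`, `lam`, and `noise`, the Gaussian reward noise, and the uniform grid in the usage lines are all assumptions. The estimate of θ and the confidence widths are frozen inside a batch and refreshed only when the batch ends; setting `alpha=0` removes the confidence bonus and gives a pure-exploitation rule.

```python
import numpy as np


def batched_linear_bandit(contexts, theta_star, grid, alpha=1.0, lam=1.0,
                          noise=0.1, seed=0):
    """Simulate a batched linear contextual bandit (illustrative sketch).

    contexts:   array of shape (T, K, d), feature vector of each action per round
    theta_star: true parameter, used only to simulate rewards and measure regret
    grid:       strictly increasing batch end points, with grid[-1] == T
    alpha:      width of the confidence bonus; alpha=0 is pure exploitation
    Returns the cumulative pseudo-regret.
    """
    rng = np.random.default_rng(seed)
    T, K, d = contexts.shape
    A = lam * np.eye(d)          # regularized Gram matrix of revealed data
    b = np.zeros(d)              # sum of reward-weighted features
    theta_hat = np.zeros(d)      # estimate built from completed batches only
    regret, start = 0.0, 0

    for end in grid:
        A_inv = np.linalg.inv(A)             # frozen for the whole batch
        xs, rs = [], []
        for t in range(start, end):
            X = contexts[t]                  # shape (K, d)
            widths = np.einsum("kd,de,ke->k", X, A_inv, X)
            ucb = X @ theta_hat + alpha * np.sqrt(np.maximum(widths, 0.0))
            a = int(np.argmax(ucb))
            x = X[a]
            rs.append(x @ theta_star + noise * rng.standard_normal())
            xs.append(x)
            regret += float(np.max(X @ theta_star) - x @ theta_star)
        # Rewards become visible only now, at the end of the batch.
        if xs:
            P, r = np.array(xs), np.array(rs)
            A += P.T @ P
            b += P.T @ r
            theta_hat = np.linalg.solve(A, b)
        start = end
    return regret


# Usage with synthetic data and a uniform grid of M batches (both assumptions).
rng = np.random.default_rng(1)
T, K, d, M = 2000, 2, 5, 10
contexts = rng.standard_normal((T, K, d)) / np.sqrt(d)
theta_star = np.ones(d) / np.sqrt(d)
grid = [T * (m + 1) // M for m in range(M)]
ucb_regret = batched_linear_bandit(contexts, theta_star, grid, alpha=1.0)
greedy_regret = batched_linear_bandit(contexts, theta_star, grid, alpha=0.0)
```

The only difference from a fully online loop is where the update happens: moving the Gram-matrix update inside the inner loop would recover a per-round learner, which is the comparison the batch count M is measured against.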

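Results

Research questions

RQ1: Under adversarial contexts, how does restricting feedback to M batches affect the regret of finite-action linear contextual bandits?
RQ2: In the adversarial-context setting, what near-optimal regret rate is attainable with the sequential batch UCB algorithm?
RQ3: How do stochastic contexts change the optimal batching strategy and the achievable regret?
RQ4: What tight lower bounds show that a certain number of batches is necessary to attain optimal regret?
RQ5: How does a pure-exploitation strategy perform under stochastic contexts, and what regret properties does it have?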

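Key findings

In the adversarial-context setting, there is a sequential batch algorithm whose expected regret is at most polylog(T) * (sqrt(dT) + dT/M).
For K=2, the regret is at least c*(sqrt(dT) + min{T sqrt(d)/M, T/sqrt(M)}), a lower bound that matches the upper bound up to polylogarithmic and constant factors.
This implies that Theta(sqrt(dT)) batches suffice to achieve the fully online regret, since the dT/M term in the upper bound falls to sqrt(dT) once M reaches sqrt(dT); in the low-dimensional regime, O(sqrt(Td)) batches likewise suffice.
In the stochastic-context setting, a pure-exploitation algorithm achieves the minimax regret tilde Theta(sqrt(dT)) with a number of batches that is less than logarithmic in T, specifically on the order of log log(T/d^2).
Under stochastic contexts, the upper and lower bounds agree up to polylogarithmic factors, showing near-minimax optimality with far fewer batches than in the adversarial case.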
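
To make the trade-off in the adversarial bounds concrete, the short Python sketch below evaluates the two rates quoted above for a range of batch counts. Constants and polylogarithmic factors are dropped, so the numbers indicate scaling only; the function names and the example values of d and T are assumptions made for illustration, and the natural logarithm in `stochastic_batch_scale` is also an assumption since the review does not fix a base.

```python
import math


def adversarial_upper(d, T, M):
    """Upper-bound rate sqrt(dT) + dT/M, with the polylog(T) factor omitted."""
    return math.sqrt(d * T) + d * T / M


def adversarial_lower(d, T, M):
    """Lower-bound rate (K=2), with the constant c omitted."""
    return math.sqrt(d * T) + min(T * math.sqrt(d) / M, T / math.sqrt(M))


def stochastic_batch_scale(d, T):
    """Order of the batch count quoted for stochastic contexts: log log(T/d^2)."""
    return math.log(math.log(T / d**2))


# Example values chosen for illustration only.
d, T = 10, 100_000
for M in (10, 100, 1_000, 10_000):
    print(M, round(adversarial_upper(d, T, M)), round(adversarial_lower(d, T, M)))
print("log log(T/d^2):", stochastic_batch_scale(d, T))
```

With these example values sqrt(dT) equals 1000, so at M = 1000 the dT/M term equals the sqrt(dT) term and adding more batches no longer improves the order of the upper bound.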

This review was generated by AI and checked by a human editor.