QUICK REVIEW

[Paper Review] Bayesian Online Model Selection

Aida Afshar, Yuke Zhang|arXiv (Cornell University)|Feb 20, 2026

Advanced Bandit Algorithms Research | Citations: 0

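One-line summary

Applies a Bayesian online model selection algorithm to stochastic bandits and proves an oracle-best Bayesian regret bound of Õ(d⋆ M √T + √(M T)). Shows experimentally the effects of data sharing and prior mis-specification.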

ABSTRACT

Online model selection in Bayesian bandits raises a fundamental exploration challenge: When an environment instance is sampled from a prior distribution, how can we design an adaptive strategy that explores multiple bandit learners and competes with the best one in hindsight? We address this problem by introducing a new Bayesian algorithm for online model selection in stochastic bandits. We prove an oracle-style guarantee of $O\left( d^* M \sqrt{T} + \sqrt{(MT)} \right)$ on the Bayesian regret, where $M$ is the number of base learners, $d^*$ is the regret coefficient of the optimal base learner, and $T$ is the time horizon. We also validate our method empirically across a range of stochastic bandit settings, demonstrating performance that is competitive with the best base learner. Additionally, we study the effect of sharing data among base learners and its role in mitigating prior mis-specification.

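Motivation and objectives

Motivate online model selection in Bayesian bandits, where the environment instance is drawn from a prior distribution.
Design a meta-learner that selects among multiple base bandit learners and carries an oracle-best guarantee.
Provide a data-driven approach that compares base learners via posterior sampling, without requiring known regret bounds.
Demonstrate empirical performance and robustness with respect to prior mis-specification and data sharing among base learners.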

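Proposed method

Proposes a Bayesian online model selection (B-MS) algorithm that maintains a global posterior distribution over all base learners and samples mean rewards from it.
Defines a balancing potential $\phi_t(i) = n_t^i \, \tilde{\mu}_t^* - \sum_{l \in I_t^i} \tilde{\mu}_t(a_l)$ to compare base learners, and selects the learner with the smallest potential.
Shows that when the base learners are stationary arms, the method recovers a Bayesian regret bound similar to that of Thompson Sampling (TS).
Proves an oracle-best Bayesian regret bound of $\mathrm{BayesRegret}_T \le \tilde{O}\left( d^* M \sqrt{T} + \sqrt{MT} \right)$.
Demonstrates that sharing data among base learners improves the meta-learner's performance and helps mitigate prior mis-specification.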
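The selection rule built on the balancing potential can be sketched in a few lines. This is only an illustrative reading of the formula, not the authors' implementation. It assumes that n_t^i is the number of rounds in which learner i was selected, that I_t^i is the set of those rounds, that a_l is the arm played in round l, and that μ̃_t^* is the largest sampled mean over all arms. The function names and the `history` structure are hypothetical.

```python
from typing import Dict, List, Sequence


def balancing_potential(sampled_means: Sequence[float],
                        arms_played: Sequence[int]) -> float:
    """phi_t(i) = n_t^i * mu_tilde_t^* - sum over l in I_t^i of mu_tilde_t(a_l)."""
    # Assumption: mu_tilde_t^* is the largest posterior-sampled mean over arms.
    best_sampled_mean = max(sampled_means)
    # Assumption: n_t^i is the number of rounds in which learner i was selected.
    n_rounds = len(arms_played)
    # Sampled value of the arms this learner actually played.
    collected = sum(sampled_means[arm] for arm in arms_played)
    return n_rounds * best_sampled_mean - collected


def select_learner(sampled_means: Sequence[float],
                   history: Dict[int, List[int]]) -> int:
    """Return the index of the base learner with the smallest potential."""
    potentials = {
        learner: balancing_potential(sampled_means, arms)
        for learner, arms in history.items()
    }
    return min(potentials, key=potentials.get)


# Hypothetical example: three arms, two base learners.
sampled_means = [0.2, 0.5, 0.9]        # one posterior sample of the arm means
history = {0: [2, 2], 1: [0, 1, 2]}    # arms each learner played when selected
chosen = select_learner(sampled_means, history)
```

Under this reading the potential acts as a sampled estimate of the regret each learner has accumulated so far, so picking the smallest potential favors the learner that currently looks least costly.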

[Figure, panel (a): Well-specified meta learner, one well-specified base learner]

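Experimental results

Research questions

RQ1: Can a Bayesian meta-learner be built that competes with the best base learner chosen in hindsight, on realized environments drawn from the prior?
RQ2: How does sharing data among base learners affect learning efficiency and robustness to prior mis-specification?
RQ3: How does the Bayesian regret of the proposed online model selection algorithm scale with the horizon T, the number of base learners M, and the regret coefficient d⋆ of the optimal base learner?
RQ4: How does the Bayesian online model selection framework relate to, and generalize, Thompson Sampling with stationary arms?
RQ5: Under specific conditions, how does the proposed method recover the guarantees of classical TS?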
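To get a feel for the scaling asked about in RQ3, the two terms of the stated bound d⋆ M √T + √(M T) can be evaluated numerically. The sketch below drops the constants and logarithmic factors hidden by the Õ notation, so the numbers only indicate relative growth, not actual regret. The function name and the example values of d⋆, M, and T are made up for illustration.

```python
import math
from typing import Tuple


def bound_terms(d_star: float, num_learners: int, horizon: int) -> Tuple[float, float]:
    """Evaluate the two terms of d* M sqrt(T) + sqrt(M T), ignoring constants and logs."""
    first_term = d_star * num_learners * math.sqrt(horizon)
    second_term = math.sqrt(num_learners * horizon)
    return first_term, second_term


# Hypothetical values: d* = 2, M = 4 base learners, growing horizon T.
for horizon in (1_000, 10_000, 100_000):
    first, second = bound_terms(d_star=2.0, num_learners=4, horizon=horizon)
    print(f"T={horizon:>7}: d*M*sqrt(T)={first:10.1f}  sqrt(M*T)={second:8.1f}")
```

Both terms grow as √T, while the first grows linearly in M and d⋆ and the second only as √M.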

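Key findings

The proposed B-MS algorithm achieves a Bayesian regret bound of Õ(d⋆ M √T + √(M T)).
The method generalizes Thompson Sampling: when each base learner is fixed to a single arm, it recovers a Bayesian regret of Õ(√(K T)).
Sharing data among base learners improves the meta-learner's performance across the experiments.
The meta-learner is robust to mis-specification: it can recover as long as at least one base learner is well-specified.
Experiments show that B-MS is competitive with the best base learner in UCB and LinTS settings and under a variety of priors.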
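The Thompson Sampling special case listed among the findings can be read off the general bound by substitution. The sketch below rests on two assumptions not stated in this review: there is one base learner per arm, so that M = K, and the learner fixed to the optimal arm incurs no regret of its own, so that d⋆ = 0. The paper's own derivation may proceed differently.

```latex
\begin{align*}
\mathrm{BayesRegret}_T
  &\le \tilde{O}\!\left( d^* M \sqrt{T} + \sqrt{MT} \right) \\
  &= \tilde{O}\!\left( 0 \cdot K \sqrt{T} + \sqrt{KT} \right)
     && \text{(assuming } M = K,\ d^* = 0\text{)} \\
  &= \tilde{O}\!\left( \sqrt{KT} \right).
\end{align*}
```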

[Figure, panel (b): Mis-specified meta learner, one well-specified base learner]


This review was generated by AI and checked by a human editor.