QUICK REVIEW

[Paper Review] Bayesian decision-making under misspecified priors with applications to meta-learning

Max Simchowitz, Christopher Tosh|arXiv (Cornell University)|Dec 6, 2021

Advanced Bandit Algorithms Research | Cited by 2

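One-line summary

This paper establishes that Thompson sampling with a misspecified prior does not suffer a severe loss in performance: the difference in expected reward is at most $\tilde{\mathcal{O}}(H^2 \epsilon)$, where $\epsilon$ is the total-variation distance between the priors and $H$ is the learning horizon. The analysis applies broadly to Bayesian decision-making, including meta-learning and POMDPs, and yields a tight nonparametric bound that, for priors with bounded support, does not depend on the size of the action space.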
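To make the quantity $\epsilon$ concrete, the sketch below computes the total-variation distance between two priors over a finite set of candidate environments. It is only an illustration: the function name `total_variation` and the two example priors are hypothetical, and the paper's priors need not be discrete.

```python
import numpy as np


def total_variation(p, q):
    """Total-variation distance between two discrete distributions.

    For probability vectors p and q on the same finite set,
    TV(p, q) = 0.5 * sum_i |p_i - q_i|.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape")
    if not (np.isclose(p.sum(), 1.0) and np.isclose(q.sum(), 1.0)):
        raise ValueError("p and q must each sum to 1")
    return 0.5 * np.abs(p - q).sum()


# Hypothetical priors over three candidate environments.
true_prior = np.array([0.5, 0.3, 0.2])
misspecified_prior = np.array([0.4, 0.3, 0.3])

# Half of the summed absolute differences: 0.5 * (0.1 + 0.0 + 0.1).
eps = total_variation(true_prior, misspecified_prior)
print(f"epsilon (TV distance) is approximately {eps:.3f}")
```

The distance ranges from 0 for identical priors to 1 for priors with disjoint support, so $\epsilon$ in the bound is always at most 1.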

ABSTRACT

Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$ from TS with a well specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.

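Motivation and objectives

Understand the effect of prior misspecification on Bayesian sequential decision-making algorithms such as Thompson sampling.
Establish a general sensitivity bound for prior misspecification that does not depend on the structure of the action space.
Extend these bounds to the Bayesian meta-learning setting and derive PAC guarantees for various families of priors.
Generalize the results beyond contextual bandits to the broader class of Bayesian POMDPs.
Illustrate, through simulations, the practical impact of prior misspecification and one-step look-ahead.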

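Proposed method

Derives a non-asymptotic bound on the difference in expected reward between Thompson sampling with a misspecified prior and with the correct prior, expressed in terms of the total-variation distance $\epsilon$ and the horizon $H$.
Takes a nonparametric approach that assumes no particular parametric form for the prior.
Shows that, for priors with bounded support, the bound is tight up to universal constants in the worst case.
Extends the sensitivity analysis to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient (KG) algorithm.
Applies the results to derive generic PAC guarantees in the Bayesian meta-learning setting.
Runs numerical simulations that evaluate the convergence of meta-learning and the effect of one-step look-ahead under structured and correlated priors.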
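The comparison at the heart of the sensitivity analysis can be tried out on a toy problem. The sketch below runs Beta-Bernoulli Thompson sampling twice on tasks drawn from a true prior: once with that prior and once with a perturbed one. It is not the paper's experimental setup; the three-armed bandit, the Beta parameters, and the names `run_thompson` and `compare_priors` are all assumptions chosen for illustration.

```python
import numpy as np


def run_thompson(true_means, alpha0, beta0, horizon, rng):
    """Beta-Bernoulli Thompson sampling; returns the total reward collected."""
    alpha = np.array(alpha0, dtype=float)  # copies, so the caller's prior is untouched
    beta = np.array(beta0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        # Sample one plausible mean per arm from the current posterior.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))
        # Bernoulli reward from the true (unknown to the agent) arm mean.
        reward = float(rng.random() < true_means[arm])
        # Conjugate posterior update for the pulled arm.
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total += reward
    return total


def compare_priors(horizon=50, n_tasks=500, seed=0):
    """Average reward of TS with the true prior vs. a misspecified prior."""
    rng = np.random.default_rng(seed)
    # Hypothetical true prior: each arm mean is drawn from Beta(2, 2).
    true_a = np.array([2.0, 2.0, 2.0])
    true_b = np.array([2.0, 2.0, 2.0])
    # Hypothetical misspecified prior: slightly optimistic about arm 0.
    wrong_a = np.array([2.5, 2.0, 2.0])
    wrong_b = np.array([2.0, 2.0, 2.0])

    well, mis = 0.0, 0.0
    for _ in range(n_tasks):
        means = rng.beta(true_a, true_b)  # task drawn from the true prior
        well += run_thompson(means, true_a, true_b, horizon, rng)
        mis += run_thompson(means, wrong_a, wrong_b, horizon, rng)
    return well / n_tasks, mis / n_tasks


if __name__ == "__main__":
    well, mis = compare_priors()
    print(f"well-specified prior: {well:.2f}  misspecified prior: {mis:.2f}")
```

Both runs face the same task in each iteration, so the gap between the two averages is a Monte-Carlo estimate of the reward difference that the bound controls. The estimate is noisy for small `n_tasks`, and the sketch does not compute the total-variation distance between the two Beta priors.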

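Experimental results

Research questions

RQ1: How does prior misspecification affect the expected reward of Thompson sampling, and can the degradation be bounded?
RQ2: Can the sensitivity of Bayesian decision-making to prior misspecification be quantified without assuming a parametric form for the prior?
RQ3: What effect does prior misspecification have on meta-learning algorithms in the Bayesian setting?
RQ4: How do the derived bounds extend beyond contextual bandits to general Bayesian POMDPs?
RQ5: Does one-step look-ahead, as in KG, mitigate or worsen the effect of prior misspecification in meta-learning?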

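Key findings

The difference in expected reward between Thompson sampling with a misspecified prior and with the correct prior is at most $\tilde{\mathcal{O}}(H^2 \epsilon)$, where $\epsilon$ is the total-variation distance between the priors and $H$ is the learning horizon.
When the prior has bounded support, the bound does not depend on the cardinality or structure of the action space.
For priors with bounded support, the bound is tight up to universal constants in the worst case.
The sensitivity analysis generalizes to a broader class of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm.
As a consequence, generic PAC guarantees are obtained that apply to various families of priors.
Numerical simulations illustrate that prior misspecification and one-step look-ahead can noticeably affect the convergence of meta-learning under structured and correlated priors.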
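The first finding can be written schematically as follows. The notation is assumed here for illustration and may differ from the paper's: $P$ is the true prior, $P'$ the misspecified prior, $\mathrm{TS}(\cdot)$ Thompson sampling run with the given prior, and $R_H$ the cumulative reward over $H$ steps, with expectations taken over environments drawn from $P$.

```latex
\left| \mathbb{E}_{P}\big[ R_H(\mathrm{TS}(P)) \big]
     - \mathbb{E}_{P}\big[ R_H(\mathrm{TS}(P')) \big] \right|
\;\le\; \tilde{\mathcal{O}}\!\left( H^2 \epsilon \right),
\qquad \epsilon = \mathrm{TV}(P, P').
```

As a rough reading that ignores constants and logarithmic factors, a horizon of $H = 100$ with $\epsilon = 10^{-4}$ gives $H^2 \epsilon = 1$, so the two runs would differ by about one unit of cumulative reward. Because of the quadratic dependence on $H$, doubling the horizon requires $\epsilon$ to shrink by a factor of four to keep the same guarantee.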


This review was generated by AI and checked by a human editor.