QUICK REVIEW

[Paper Review] Hedging the Drift: Learning to Optimize under Non-Stationarity

Wang Chi Cheung, David Simchi‐Levi|arXiv (Cornell University)|Mar 4, 2019

Advanced Bandit Algorithms Research|References: 54|Citations: 35

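One-Line Summary

Introduces data-driven algorithms for non-stationary bandits that achieve state-of-the-art dynamic regret bounds. The work covers the sliding window UCB (SW-UCB) algorithm and the Bandit-over-Bandit (BOB) framework, extends both to several bandit models, and validates them empirically.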

ABSTRACT

We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe for a wide variety of non-stationary bandit problems. Specifically, we design and analyze the sliding window-upper confidence bound algorithm that achieves the optimal dynamic regret bound for each of the settings when we know the respective underlying variation budget, which quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, we can further enjoy the (nearly) optimal dynamic regret bounds in a (surprisingly) parameter-free manner. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning processes, which is vital in changing environments. Our extensive numerical experiments on both synthetic and real world online auto-loan datasets show that our proposed algorithms achieve superior empirical performance compared to existing algorithms.

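Motivation and Objectives

Addresses non-stationarity in bandit learning, where reward distributions change over time.
Develops algorithms that balance exploration and exploitation while hedging adaptively against change.
Quantifies dynamic regret and establishes (nearly) optimal bounds under both known and unknown variation budgets.
Extends the framework from drifting linear bandits to related bandit settings (MAB, GLM, combinatorial).
Demonstrates empirical performance gains over existing methods on synthetic and real datasets.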
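
For readers unfamiliar with the two quantities named above, the block below writes them out for the drifting linear bandit case. It follows the usual formulation for this setting and is not copied from the paper: the symbols D_t (action set at round t), X_t (action chosen at round t), and \theta_t (unknown reward parameter at round t) are notation assumed here, and the paper's exact norms and expectations should be checked against the source.

```latex
% Dynamic regret: the learner is compared with the best action of EACH round,
% not with a single fixed action in hindsight.
\[
  \mathrm{Regret}_T
  \;=\;
  \sum_{t=1}^{T}
  \Bigl(
    \max_{x \in D_t} \langle x, \theta_t \rangle
    \;-\;
    \langle X_t, \theta_t \rangle
  \Bigr)
\]

% Variation budget: an upper bound on the total drift of the parameter sequence.
\[
  \sum_{t=1}^{T-1} \lVert \theta_{t+1} - \theta_t \rVert_2 \;\le\; B_T
\]
```

A small B_T means the environment is close to stationary. A B_T that grows linearly in T allows the parameter to change substantially in every round, in which case sublinear dynamic regret is not generally achievable.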

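Proposed Method

Introduces a sliding window regularized least squares estimator (SW-RLSE) that adapts to recent data.
Proposes sliding window UCB (SW-UCB), which combines optimism in the face of uncertainty with a data-dependent confidence radius.
Derives dynamic regret bounds that depend on the window size w and the variation budget B_T, and attains optimality (up to logarithmic factors) when B_T is known.
Develops Bandit-over-Bandit (BOB), which adaptively tunes the SW-UCB window size and achieves nearly optimal dynamic regret without knowledge of B_T.
Extends the approach to several bandit variants (MAB, generalized linear bandits, combinatorial semi-bandits) and discusses the forgetting principle in non-stationary settings.
Provides a theoretical lower bound on dynamic regret for drifting linear bandits and an upper bound that matches it (up to logarithmic factors).
Evaluates the algorithms on synthetic data and an online auto-loan dataset, showing empirical gains.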
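
The sliding-window idea above can be sketched in a few lines for the simplest case, a K-armed bandit. When every arm is a one-hot vector, a ridge-regularized least squares estimate over the last w rounds reduces to a per-arm windowed sum divided by (lam + count), and the confidence width reduces to 1 / sqrt(lam + count). The class name `SlidingWindowUCB`, the helper `run`, and the constants `beta` and `lam` are illustrative assumptions; in particular `beta` is a hand-set constant, not the confidence radius derived in the paper.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Sketch of sliding-window UCB for K one-hot arms (not the paper's code)."""

    def __init__(self, n_arms, window, beta=1.0, lam=1.0):
        self.n_arms = n_arms
        self.beta = beta  # assumed constant; the paper uses a derived radius
        self.lam = lam    # ridge regularization strength
        # A bounded deque implements "forgetting": once full, each new
        # (arm, reward) pair pushes the oldest one out.
        self.history = deque(maxlen=window)

    def estimate(self):
        """Windowed ridge estimate and confidence width for each arm."""
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        theta_hat = [s / (self.lam + c) for s, c in zip(sums, counts)]
        widths = [1.0 / math.sqrt(self.lam + c) for c in counts]
        return theta_hat, widths

    def select(self):
        """Optimistic choice: highest estimate plus scaled width."""
        theta_hat, widths = self.estimate()
        scores = [m + self.beta * w for m, w in zip(theta_hat, widths)]
        return max(range(self.n_arms), key=scores.__getitem__)

    def update(self, arm, reward):
        self.history.append((arm, reward))


def run(means_per_round, window, beta=1.0, noise_sd=0.1, seed=0):
    """Play one pass over a sequence of mean-reward vectors; return dynamic regret."""
    rng = random.Random(seed)
    agent = SlidingWindowUCB(len(means_per_round[0]), window, beta)
    regret = 0.0
    for means in means_per_round:
        arm = agent.select()
        agent.update(arm, means[arm] + rng.gauss(0.0, noise_sd))
        # Compare against the best arm of THIS round (dynamic benchmark).
        regret += max(means) - means[arm]
    return regret


if __name__ == "__main__":
    # Toy environment: the better arm switches abruptly halfway through.
    means = [(1.0, 0.0)] * 200 + [(0.0, 1.0)] * 200
    print("window=50 :", run(means, window=50))
    print("window=400:", run(means, window=400))
```

In this toy environment the shorter window discards pre-change rewards sooner, so the estimate for the formerly good arm falls faster after the switch. The right window length depends on how quickly the environment drifts, which is why the review's point about tuning w against B_T matters.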

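Experimental Results

Research Questions

RQ1: When the variation budget B_T is known, what dynamic regret bound is achievable for drifting linear bandits?
RQ2: When B_T is unknown, can an adaptive framework achieve nearly optimal performance without knowledge of B_T?
RQ3: Can the SW-UCB framework be applied to settings beyond linear bandits (MAB, GLM, combinatorial semi-bandits)?
RQ4: Does incorporating the forgetting principle and an adaptive window improve performance in non-stationary environments?
RQ5: How do the proposed methods compare empirically with existing non-stationary bandit algorithms on synthetic and real datasets?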

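Key Findings

SW-UCB with a tuned window size attains near-optimal dynamic regret (up to logarithmic factors) when B_T is known.
The BOB framework adaptively tunes the SW-UCB window size, achieves nearly optimal dynamic regret even when B_T is unknown, and improves on prior methods.
Incorporating the forgetting principle into optimistic learning handles non-stationarity effectively while still providing regret guarantees.
The extensions to MAB, generalized linear bandits, and combinatorial semi-bandits broaden applicability to a range of operations research problems.
On synthetic data and an online auto-loan dataset, the proposed algorithms show superior empirical performance compared with existing algorithms.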
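
The window-size adaptation attributed to BOB can be illustrated with a two-level loop: the horizon is cut into blocks, an adversarial bandit master (standard EXP3 here) picks one candidate window size per block, a base learner runs for that block, and the block's average reward is fed back to the master. This is a simplified sketch under stated assumptions, not the paper's algorithm: the names `Exp3`, `bob`, and `run_block` are hypothetical, the candidate window set and `gamma` are arbitrary, and rewards are assumed to lie in [0, 1] so that no further rescaling is needed.

```python
import math
import random


class Exp3:
    """Standard EXP3 over a finite set of choices, rewards in [0, 1]."""

    def __init__(self, n_arms, gamma, seed=0):
        self.n = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def draw(self):
        return self.rng.choices(range(self.n), weights=self.probabilities())[0]

    def update(self, arm, reward01):
        # Importance-weighted reward estimate for the arm that was played.
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * reward01 / (p * self.n))
        # Rescale so the weights stay bounded; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def bob(horizon, block_len, windows, run_block, gamma=0.1, seed=0):
    """Pick one window size per block with EXP3.

    run_block(start, length, window) is a caller-supplied (hypothetical)
    function that runs a freshly initialized base learner for `length`
    rounds and returns the sum of rewards, each assumed to be in [0, 1].
    """
    master = Exp3(len(windows), gamma, seed)
    chosen = []
    for start in range(0, horizon, block_len):
        length = min(block_len, horizon - start)
        j = master.draw()
        total = run_block(start, length, windows[j])
        master.update(j, total / length)  # average reward stays in [0, 1]
        chosen.append(windows[j])
    return chosen, master


if __name__ == "__main__":
    # Stand-in base learner: pretend window 8 always earns full reward.
    def fake_block(start, length, window):
        return float(length) if window == 8 else 0.0

    picks, master = bob(2000, 10, [2, 8, 32], fake_block)
    print(master.probabilities())
```

Restarting the base learner at each block boundary keeps the blocks comparable from the master's point of view, at the cost of discarding data across blocks. A real base learner, such as a sliding-window UCB routine, would replace `fake_block`.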

This review was generated by AI and checked by a human editor.