QUICK REVIEW

[Paper Review] Linear Stochastic Bandits Under Safety Constraints

Sanae Amani, Mahnoosh Alizadeh|arXiv (Cornell University)|Jan 1, 2019

Machine Learning and Algorithms | Cited by 25

One-line summary

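This paper proposes Safe-LUCB, a UCB-based algorithm for linear stochastic bandits whose actions must satisfy a safety constraint that depends linearly on an unknown parameter vector. The algorithm has two phases: it first runs a pure exploration phase aimed at estimating the set of safe actions, then moves to a safe exploration-exploitation phase that minimizes regret while guaranteeing safety with high probability. The resulting regret bound is problem dependent: it depends on where the optimal action lies within the safe set.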

ABSTRACT

Bandit algorithms have various applications in safety-critical systems, where it is important to respect, at every round, system constraints that depend on the bandit's unknown parameters. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that her actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose a new UCB-based algorithm called Safe-LUCB, which includes necessary modifications to respect safety constraints. The algorithm has two phases. During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set. Once this goal is achieved, the algorithm begins a safe exploration-exploitation phase where the learner gradually expands her estimate of the set of safe actions while controlling the growth of regret. We provide a general regret bound for the algorithm, as well as a problem-dependent bound that is connected to the location of the optimal action within the safe set. We then propose a modified heuristic that exploits our problem-dependent analysis to improve the regret.
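
For orientation, the setting described in the abstract can be written compactly as below. The notation is an assumption made for this sketch, since the review itself fixes no symbols: $\mu_\star$ is the unknown parameter vector, $x_t$ the action played at round $t$ from a known action set $\mathcal{D}_0$, $\eta_t$ zero-mean noise, and $B$ and $c$ a known matrix and threshold that define the linear safety constraint.

```latex
\begin{align*}
  r_t &= \langle \mu_\star, x_t \rangle + \eta_t
      && \text{(observed reward)} \\
  \langle \mu_\star, B x_t \rangle &\le c \quad \text{for all } t
      && \text{(safety constraint)} \\
  \mathcal{D}_0^{s}(\mu_\star) &= \{\, x \in \mathcal{D}_0 : \langle \mu_\star, B x \rangle \le c \,\}
      && \text{(unknown safe set)} \\
  R_T &= \sum_{t=1}^{T} \bigl( \langle \mu_\star, x_\star \rangle - \langle \mu_\star, x_t \rangle \bigr),
      \quad x_\star \in \arg\max_{x \in \mathcal{D}_0^{s}(\mu_\star)} \langle \mu_\star, x \rangle
      && \text{(regret against the best safe action)}
\end{align*}
```

Because the safe set depends on $\mu_\star$, the learner cannot evaluate it directly and has to work with a conservative estimate built from data.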

Motivation and objectives

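Address safety-critical applications in which constraints that depend on unknown parameters must be satisfied.
Develop a bandit algorithm that guarantees safety at every round even though the safe set is unknown at the outset.
Minimize regret while gradually learning and expanding the estimated set of safe actions.
Provide a theoretical regret bound that depends on where the optimal action lies within the safe set.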

Proposed method

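The algorithm takes a two-phase approach. The first phase is pure exploration over a restricted action set, aimed at estimating the safe set.
The second phase, safe exploration-exploitation, starts once a good approximation of the safe set has been obtained.
Safety is guaranteed with high probability by maintaining a confidence region for the unknown parameter vector.
A UCB-style selection rule is used, modified so that actions are chosen from the estimated safe set.
The safe-set estimate is refined iteratively using statistical confidence bounds on the linear constraint.
A heuristic based on the problem-dependent analysis is introduced to reduce regret further.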
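
The two phases above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it simplifies in several ways that are assumptions of this example: the action set is finite (rows of `actions`), the constraint has the form `mu @ (B @ x) <= c` with known `B` and `c`, the confidence radius `beta` is a fixed constant rather than a theoretically derived sequence, and the exploration length `T_explore` is supplied by the caller. The function name `safe_lucb_sketch` and all parameter names are hypothetical.

```python
import numpy as np


def width(A, v, beta):
    """beta * ||v||_{A^{-1}}: largest deviation of <mu, v> inside the ellipsoid."""
    return beta * np.sqrt(v @ np.linalg.solve(A, v))


def safe_lucb_sketch(actions, B, c, S, mu_true, T, T_explore,
                     beta=2.0, lam=1.0, noise=0.1, seed=0):
    """Illustrative two-phase safe bandit loop (assumes ||mu_true|| <= S)."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    # Seed safe set: actions that are safe for EVERY mu with norm at most S,
    # because <mu, B x> <= ||mu|| * ||B x|| <= S * ||B x||.
    seed_idx = [i for i, x in enumerate(actions)
                if S * np.linalg.norm(B @ x) <= c]
    if not seed_idx:
        raise ValueError("no action is known to be safe a priori")

    A = lam * np.eye(d)   # regularized Gram matrix
    b = np.zeros(d)       # running sum of reward * action
    played = []
    for t in range(T):
        if t < T_explore:
            # Phase 1: pure exploration, uniformly over the seed safe set.
            i = int(rng.choice(seed_idx))
        else:
            # Phase 2: safe exploration-exploitation.
            mu_hat = np.linalg.solve(A, b)
            # Pessimistic safety test: keep x only if the constraint holds
            # for every parameter in the confidence ellipsoid.
            safe_idx = {j for j, x in enumerate(actions)
                        if mu_hat @ (B @ x) + width(A, B @ x, beta) <= c}
            safe_idx |= set(seed_idx)  # seed actions are always safe
            # Optimistic (UCB) choice restricted to the estimated safe set.
            i = max(sorted(safe_idx),
                    key=lambda j: mu_hat @ actions[j]
                    + width(A, actions[j], beta))
        x = actions[i]
        reward = mu_true @ x + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += reward * x
        played.append(i)
    return played
```

The asymmetry is the point of the design: the safety test is pessimistic (worst case over the confidence region) while the reward index is optimistic (best case over the same region). As more data arrives the ellipsoid shrinks, so the estimated safe set grows toward the true one.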

Experimental results

Research questions

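RQ1: When the safe set depends on unknown parameters, how can a linear stochastic bandit algorithm guarantee safety at every round?
RQ2: In this constrained setting, what is the best trade-off between exploring for safety and minimizing regret?
RQ3: How does the location of the optimal action within the safe set affect the achievable regret?
RQ4: Can a two-phase approach that combines pure exploration with safe exploitation achieve sublinear regret under safety constraints?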

Key findings

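The Safe-LUCB algorithm achieves a general regret bound that holds under the safety constraint.
A problem-dependent regret bound is derived that depends on where the optimal action sits relative to the safe set.
With high probability, every action the algorithm plays is safe throughout the entire learning process.
A heuristic based on the problem-dependent analysis improves regret performance.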


This review was generated by AI and checked by a human editor.