QUICK REVIEW

[Paper Review] Efficient Algorithms for Adversarial Contextual Learning

Vasilis Syrgkanis, Akshay Krishnamurthy|arXiv (Cornell University)|Feb 8, 2016

Advanced Bandit Algorithms Research | References: 30 | Citations: 45

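Quick Summary

This paper presents the first oracle-efficient algorithms with sublinear regret for adversarial contextual bandits and online combinatorial optimization, building on the Follow-the-Perturbed-Leader framework with an optimization oracle. The algorithms achieve regret $O(T^{3/4}\sqrt{K\log N})$ in the transductive setting and $O(T^{2/3}d^{3/4}K\sqrt{\log N})$ in the small separator setting, where $T$ is the number of rounds, $K$ is the number of actions, $N$ is the number of policies, and $d$ is the size of the separator.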

ABSTRACT

We provide the first oracle efficient sublinear regret algorithms for adversarial versions of the contextual bandit problem. In this problem, the learner repeatedly makes an action on the basis of a context and receives reward for the chosen action, with the goal of achieving reward competitive with a large class of policies. We analyze two settings: i) in the transductive setting the learner knows the set of contexts a priori, ii) in the small separator setting, there exists a small set of contexts such that any two policies behave differently in one of the contexts in the set. Our algorithms fall into the follow the perturbed leader family (Kalai and Vempala, 2005) and achieve regret $O(T^{3/4}\sqrt{K\log(N)})$ in the transductive setting and $O(T^{2/3} d^{3/4} K\sqrt{\log(N)})$ in the separator setting, where $K$ is the number of actions, $N$ is the number of baseline policies, and $d$ is the size of the separator. We actually solve the more general adversarial contextual semi-bandit linear optimization problem, whilst in the full information setting we address the even more general contextual combinatorial optimization. We provide several extensions and implications of our algorithms, such as switching regret and efficient learning with predictable sequences.

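Motivation and Objectives

Close the gap between statistical performance and computational efficiency in adversarial contextual learning.
Develop algorithms that remain computationally efficient even when the policy space is exponentially large.
Achieve sublinear regret in the adversarial setting using only oracle access to a batch optimization problem.
Extend the Follow-the-Perturbed-Leader framework to adversarial contextual and semi-bandit settings.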

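Proposed Method

Propose new Follow-the-Perturbed-Leader (FTPL) algorithms that rely only on an optimization oracle to select policies.
Apply the FTPL framework to the transductive setting, in which the set of contexts is known in advance.
Introduce the small separator setting, which assumes a small set of contexts on which any two policies differ in at least one context.
Use the Natarajan dimension, a generalization of the VC dimension, as the complexity measure of the policy class.
Adapt the algorithms to the semi-bandit and bandit settings using the technique of Neu & Bartók (2013).
Maintain computational efficiency by combining random perturbations with oracle-based policy selection.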
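The steps above can be sketched in code. The following is a minimal full-information illustration, not the authors' exact algorithm. It assumes a policy class small enough that the optimization oracle can be replaced by brute-force enumeration, and it assumes Laplace-distributed perturbations attached to a given set of contexts (all known contexts in the transductive setting, or a separator in the small separator setting). The names `oracle`, `is_separator`, `ftpl_full_info`, and `scale` are hypothetical, and the bandit extension with loss estimates is omitted.

```python
import itertools
import random


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def oracle(policies, examples):
    """Stand-in optimization oracle (assumption: brute force over a tiny class).

    `examples` is a list of (context, loss_vector) pairs; a policy is a tuple
    with policy[context] = action. Returns the policy with the lowest total loss.
    """
    def total(policy):
        return sum(loss[policy[x]] for x, loss in examples)
    return min(policies, key=total)


def is_separator(policies, contexts):
    """True if every pair of distinct policies differs on some context in `contexts`."""
    signatures = {tuple(p[x] for x in contexts) for p in policies}
    return len(signatures) == len(set(policies))


def ftpl_full_info(policies, rounds, perturb_contexts, num_actions, scale, seed=0):
    """Full-information FTPL sketch: perturb, call the oracle, play, observe.

    Returns (learner_loss, best_fixed_policy_loss).
    """
    rng = random.Random(seed)
    history = []
    learner_loss = 0.0
    for x, loss in rounds:
        # Fresh fake examples: one random loss vector per perturbed context.
        fake = [(z, [laplace(rng, scale) for _ in range(num_actions)])
                for z in perturb_contexts]
        policy = oracle(policies, history + fake)
        learner_loss += loss[policy[x]]
        history.append((x, loss))  # full information: whole loss vector is seen
    best = oracle(policies, history)
    best_loss = sum(loss[best[x]] for x, loss in history)
    return learner_loss, best_loss


if __name__ == "__main__":
    # All maps from 3 contexts to 2 actions (8 policies).
    policies = list(itertools.product(range(2), repeat=3))
    data_rng = random.Random(1)
    rounds = [(data_rng.randrange(3), [data_rng.random(), data_rng.random()])
              for _ in range(200)]
    learner, best = ftpl_full_info(policies, rounds, [0, 1, 2], 2, scale=2.0)
    print("regret:", learner - best)
```

The learner only ever touches the policy class through `oracle`, which is the point of oracle efficiency: with a real oracle in place of the enumeration, the per-round cost does not depend on listing all $N$ policies.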

Results

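Research Questions

RQ1: Can sublinear regret be achieved for adversarial contextual bandits while remaining computationally efficient when the policy space is very large?
RQ2: Can the Follow-the-Perturbed-Leader framework be extended to adversarial contextual and semi-bandit settings using only oracle access?
RQ3: Which structural properties of the policy class make efficient learning possible in adversarial contextual settings?
RQ4: How does the size of the smallest separator affect the regret bound in online learning?
RQ5: In the non-transductive setting, can sublinear regret be achieved against adversarial sequences of contexts and losses?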

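Key Findings

In the transductive setting, the algorithm achieves regret $O(T^{3/4}\sqrt{K\log N})$, where $T$ is the number of rounds, $K$ is the number of actions, and $N$ is the number of policies.
In the small separator setting, the regret bound is $O(T^{2/3}d^{3/4}K\sqrt{\log N})$, where $d$ is the size of the smallest separator.
If the Natarajan dimension of the policy class is bounded, sublinear regret is maintained even against adaptive, adversarial sequences of contexts and losses.
For a policy class with Natarajan dimension $\nu$, choosing $\epsilon$ optimally gives regret $O((d\nu\log K\log(dK/\nu))^{1/4}\sqrt{T})$.
There are policy classes with VC dimension 1 for which no algorithm can achieve sublinear regret in the non-transductive setting.
This result shows that, for general policy classes, advance knowledge of the contexts, as in the transductive setting, is essential for achieving sublinear regret in adversarial contextual learning.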
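To see what "sublinear" means for the two headline bounds, the short sketch below evaluates them numerically. It drops the constants hidden by the $O(\cdot)$ notation, so only the growth rates are meaningful, not the absolute values. The function names and the sample values of $T$, $K$, $N$, and $d$ are illustrative assumptions.

```python
import math


def transductive_bound(T, K, N):
    # T^{3/4} * sqrt(K log N), constants omitted.
    return T ** 0.75 * math.sqrt(K * math.log(N))


def separator_bound(T, K, N, d):
    # T^{2/3} * d^{3/4} * K * sqrt(log N), constants omitted.
    return T ** (2 / 3) * d ** 0.75 * K * math.sqrt(math.log(N))


if __name__ == "__main__":
    K, N, d = 2, 100, 2  # illustrative values only
    for T in (10**3, 10**4, 10**5, 10**6):
        per_round_t = transductive_bound(T, K, N) / T
        per_round_s = separator_bound(T, K, N, d) / T
        print(T, round(per_round_t, 4), round(per_round_s, 4))
```

Dividing by $T$ gives the average regret per round, which decays like $T^{-1/4}$ in the transductive setting and like $T^{-1/3}$ in the small separator setting. The separator bound therefore wins for large $T$, but its extra factors of $d^{3/4}$ and $\sqrt{K}$ can make it the larger of the two for short horizons.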


This review was generated by AI and checked by a human editor.