QUICK REVIEW

[Paper Review] Efficient Algorithms for Adversarial Contextual Learning

Vasilis Syrgkanis, Akshay Krishnamurthy|arXiv (Cornell University)|Feb 8, 2016

Advanced Bandit Algorithms Research|References: 30|Cited by: 45

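One-Sentence Summary

This paper presents the first oracle-efficient, sublinear-regret algorithms for adversarial contextual bandits and online combinatorial optimization, built on a Follow-the-Perturbed-Leader framework that relies only on an optimization oracle. The algorithms achieve regret $O(T^{3/4}\sqrt{K\log N})$ in the transductive setting and $O(T^{2/3}d^{3/4}K\sqrt{\log N})$ in the small separator setting, where $T$ is the time horizon, $K$ is the number of actions, $N$ is the number of policies, and $d$ is the size of the separator.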

ABSTRACT

We provide the first oracle efficient sublinear regret algorithms for adversarial versions of the contextual bandit problem. In this problem, the learner repeatedly makes an action on the basis of a context and receives reward for the chosen action, with the goal of achieving reward competitive with a large class of policies. We analyze two settings: i) in the transductive setting the learner knows the set of contexts a priori, ii) in the small separator setting, there exists a small set of contexts such that any two policies behave differently in one of the contexts in the set. Our algorithms fall into the follow the perturbed leader family (Kalai and Vempala, 2005) and achieve regret $O(T^{3/4}\sqrt{K\log(N)})$ in the transductive setting and $O(T^{2/3} d^{3/4} K\sqrt{\log(N)})$ in the separator setting, where $K$ is the number of actions, $N$ is the number of baseline policies, and $d$ is the size of the separator. We actually solve the more general adversarial contextual semi-bandit linear optimization problem, whilst in the full information setting we address the even more general contextual combinatorial optimization. We provide several extensions and implications of our algorithms, such as switching regret and efficient learning with predictable sequences.

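Motivation and Objectives

Bridge the gap between statistical performance and computational efficiency in adversarial contextual learning.
Develop algorithms that remain computationally efficient even when the policy space is exponentially large.
Achieve sublinear regret in adversarial settings using only oracle access to a batch optimization problem.
Extend the Follow-the-Perturbed-Leader framework to adversarial contextual and semi-bandit settings.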

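Proposed Method

Proposes a new Follow-the-Perturbed-Leader (FTPL) algorithm whose policy selection relies only on an optimization oracle.
Applies the FTPL framework to the transductive setting, in which all contexts are known in advance.
Introduces the small separator setting, in which a small set of contexts distinguishes any two policies.
Uses the Natarajan dimension, which generalizes the VC dimension, as the complexity measure of the policy class.
Adapts the algorithm to the semi-bandit and bandit settings using the technique of Neu & Bartók (2013).
Combines random perturbations with oracle-based policy selection to preserve computational efficiency.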
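
The perturb-then-optimize idea in the list above can be illustrated for the transductive, full-information case. This is a minimal sketch and not the paper's exact algorithm: `argmin_oracle` is a hypothetical brute-force stand-in for the optimization oracle, policies are assumed to be dictionaries mapping contexts to actions, the Laplace noise with parameter `scale` is an illustrative choice, and redrawing the noise every round is an assumption.

```python
import random


def argmin_oracle(policies, samples):
    """Hypothetical brute-force oracle: return the policy with the smallest
    total loss on `samples`, a list of (context, loss_vector) pairs."""
    def total_loss(policy):
        return sum(loss[policy[ctx]] for ctx, loss in samples)
    return min(policies, key=total_loss)


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def ftpl_transductive(policies, contexts, rounds, num_actions, scale, seed=0):
    """Play FTPL over `rounds`, a list of (context, loss_vector) pairs.

    `contexts` is the context set known in advance (transductive assumption).
    Returns the learner's cumulative loss.
    """
    rng = random.Random(seed)
    history = []          # real (context, loss_vector) pairs seen so far
    cumulative = 0.0
    for ctx, loss in rounds:
        # One fake sample per known context, with random per-action losses.
        fake = [(c, [laplace(scale, rng) for _ in range(num_actions)])
                for c in contexts]
        # A single oracle call on the perturbed history picks the policy.
        policy = argmin_oracle(policies, history + fake)
        cumulative += loss[policy[ctx]]
        history.append((ctx, loss))   # full information: whole vector revealed
    return cumulative
```

The learner touches the policy class only through `argmin_oracle`, so replacing the brute-force search with an efficient solver would leave the rest of the loop unchanged. A larger `scale` makes the chosen policy more stable across rounds at the cost of more noise in each choice.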

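Results

Research Questions

RQ1: Can sublinear regret be achieved in adversarial contextual bandits in a computationally efficient way when the policy space is large?
RQ2: Can the Follow-the-Perturbed-Leader framework be extended to adversarial contextual and semi-bandit settings using only oracle access?
RQ3: Which structural properties of the policy class make efficient learning possible in adversarial contextual settings?
RQ4: How does the size of the minimal separator affect the regret bound in online learning?
RQ5: Can sublinear regret be achieved against adversarial context and loss sequences in the non-transductive setting?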

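Key Findings

In the transductive setting, the algorithm achieves regret $O(T^{3/4}\sqrt{K\log N})$, where $T$ is the time horizon, $K$ is the number of actions, and $N$ is the number of policies.
In the small separator setting, the regret bound is $O(T^{2/3}d^{3/4}K\sqrt{\log N})$, where $d$ is the size of the minimal separator.
When the policy class has bounded Natarajan dimension, the algorithm maintains sublinear regret even against adversarial and adaptive sequences of contexts and losses.
For a policy class with Natarajan dimension $\nu$, the regret is $O((d\nu\log K\log(dK/\nu))^{1/4}\sqrt{T})$ when $\epsilon$ is chosen optimally.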
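There exist policy classes of VC dimension 1 for which no algorithm can achieve sublinear regret against an adaptive adversary in the non-transductive setting.
These results indicate that, for general policy classes, transductive knowledge is essential for achieving sublinear regret in adversarial contextual learning.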
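
To get a feel for the first two bounds above, the sketch below evaluates them numerically with all hidden constants dropped, which is an assumption: big-O notation says nothing about the actual constants. The function names are illustrative. Setting the two expressions equal gives $T^{1/12} = d^{3/4}K^{1/2}$, so the separator expression is the smaller one only when $T > d^9K^6$.

```python
import math


def transductive_bound(T, K, N):
    # T^{3/4} * sqrt(K log N), constants ignored
    return T ** 0.75 * math.sqrt(K * math.log(N))


def separator_bound(T, K, N, d):
    # T^{2/3} * d^{3/4} * K * sqrt(log N), constants ignored
    return T ** (2.0 / 3.0) * d ** 0.75 * K * math.sqrt(math.log(N))


def crossover_horizon(K, d):
    # Horizon at which the two expressions coincide: T = d^9 * K^6
    return d ** 9 * K ** 6


if __name__ == "__main__":
    K, N, d = 2, 1000, 2
    for T in (10 ** 4, 10 ** 6, 10 ** 8):
        # Sublinear regret means the per-round average shrinks as T grows.
        print(T,
              transductive_bound(T, K, N) / T,
              separator_bound(T, K, N, d) / T)
```

The two settings rest on different assumptions, so this compares only the shapes of the two expressions and not the algorithms themselves.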

This review was generated by AI and reviewed by a human editor.