QUICK REVIEW

[Paper Review] Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles

Dylan J. Foster, Alexander Rakhlin|arXiv (Cornell University)|Feb 12, 2020

Advanced Bandit Algorithms Research|Cited by 52

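One-Sentence Summary

This paper proposes SquareCB, a universal and optimal reduction from contextual bandits to online regression. Given a regression oracle, it attains minimax regret bounds under realizability and requires no distributional assumptions beyond realizability.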

ABSTRACT

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.

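Motivation and Objectives

Develop general-purpose contextual bandit algorithms whose runtime and memory costs are practical and comparable to those of supervised learning.
Reduce contextual bandits to online regression via a regression oracle to obtain strong regret guarantees.
Characterize the minimax rates for rich function classes and establish the optimality of the SquareCB reduction.
Provide end-to-end guarantees for concrete function classes (linear, kernel, GLMs) under realizability and adversarially chosen contexts.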

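Proposed Method

Introduce the notion of an online regression oracle (SqAlg) together with its square-loss regret guarantee, Reg_Sq(T).
Present SquareCB, a reduction that queries the regression oracle for per-action predictions and samples each action with probability inversely proportional to the gap between its predicted score and the best predicted score.
Prove that Reg_CB(T) ≤ C * sqrt(K T * Reg_Sq(T)) holds with high probability, where K is the number of actions and T is the number of rounds, and that SquareCB inherits the oracle's memory and runtime bounds.
Show that SquareCB is minimax optimal for an appropriate choice of SqAlg and function class.
Instantiate SquareCB for several function classes (linear, high-dimensional linear, kernel, GLMs) to derive concrete regret guarantees.
Discuss robustness to model misspecification and extensions to large action spaces.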
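
The inverse-gap sampling step described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `inverse_gap_weighting`, `RunningMeanOracle`, and `squarecb_round`, the `predict`/`update` oracle interface, and the default `mu = K` are all hypothetical choices made for this sketch. Predictions are treated as losses (lower is better), and `gamma` is an exploration parameter left to the caller.

```python
import numpy as np


def inverse_gap_weighting(predicted_losses, gamma, mu=None):
    """Map predicted losses (lower is better) to action probabilities."""
    y_hat = np.asarray(predicted_losses, dtype=float)
    num_actions = y_hat.size
    if mu is None:
        mu = float(num_actions)          # assumption for this sketch: mu = K
    best = int(np.argmin(y_hat))         # greedy action under the oracle
    gaps = y_hat - y_hat[best]           # nonnegative, zero for the greedy action
    probs = 1.0 / (mu + gamma * gaps)    # weight shrinks as the gap grows
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()      # greedy action takes the remaining mass
    return probs


class RunningMeanOracle:
    """Hypothetical stand-in for SqAlg: predicts each action's mean observed loss.

    It ignores the context, so it only illustrates the predict/update interface.
    """

    def __init__(self, num_actions):
        self.sums = np.zeros(num_actions)
        self.counts = np.zeros(num_actions)

    def predict(self, context):
        # One predicted loss per action; unseen actions default to 0.0.
        return self.sums / np.maximum(self.counts, 1.0)

    def update(self, context, action, loss):
        self.sums[action] += loss
        self.counts[action] += 1.0


def squarecb_round(oracle, context, gamma, loss_fn, rng):
    """Play one round: predict, sample by inverse gap weighting, update the oracle."""
    probs = inverse_gap_weighting(oracle.predict(context), gamma)
    action = int(rng.choice(probs.size, p=probs))
    loss = loss_fn(context, action)      # only the chosen action's loss is observed
    oracle.update(context, action, loss)
    return action, loss, probs
```

Because every non-greedy action receives probability at most 1/mu, the greedy action always keeps at least 1/K of the mass when mu = K, so the returned vector is a valid distribution. A larger `gamma` concentrates more probability on the greedy action. In this sketch the oracle is the only component that stores state, which is consistent with the summary's point that the reduction inherits the oracle's runtime and memory.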

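Experimental Results

Research Questions

RQ1: Under realizability, what is the minimax regret rate for contextual bandits with rich (potentially nonparametric) function classes?
RQ2: When a gap exists, can near-logarithmic regret be achieved in RichCBs with general function classes and large action sets?
RQ3: How can contextual bandits be reduced to online regression with no distributional assumptions beyond realizability while remaining computationally efficient?
RQ4: For practical function classes (linear, kernel, GLMs), how does SquareCB perform in terms of regret and computational efficiency?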

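Key Findings

SquareCB converts online regression regret into contextual bandit regret, Reg_CB(T) = O( sqrt(K T Reg_Sq(T)) ), and inherits the oracle's runtime and memory.
SquareCB is universal: for any function class, there exists a choice of SqAlg that attains the minimax rate, matching the lower bound up to constants and the dependence on K.
For finite F, when SqAlg attains Reg_Sq(T) = O(log|F|), SquareCB yields Reg_CB(T) ≤ O( sqrt(K T log|F|) ).
Concrete instantiations give favorable regret bounds for linear, high-dimensional linear, kernel, and generalized linear models under realizability, with scalable per-round cost.
The framework is robust to model misspecification: performance degrades gracefully when realizability holds only approximately.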
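
To get a feel for the scale of these bounds, the arithmetic can be sketched as below. The function names and the `constant` parameter are hypothetical; the O(·) notation hides constants (and any confidence terms) that this summary does not specify, so the numbers indicate growth rates only, not actual guarantees.

```python
import math


def squarecb_regret_bound(num_actions, horizon, regression_regret, constant=1.0):
    """Evaluate constant * sqrt(K * T * Reg_Sq(T)); the constant is a placeholder."""
    return constant * math.sqrt(num_actions * horizon * regression_regret)


def finite_class_bound(num_actions, horizon, class_size, constant=1.0):
    """Finite-class case: plug Reg_Sq(T) = log|F| into the bound above."""
    if class_size < 2:
        raise ValueError("class_size must be at least 2 so that log|F| > 0")
    return squarecb_regret_bound(
        num_actions, horizon, math.log(class_size), constant
    )
```

For example, with K = 10, T = 100,000, and |F| = 1,000, `finite_class_bound(10, 100_000, 1000)` evaluates to roughly 2.6 × 10^3 with the placeholder constant of 1, far below the trivial linear bound of T. Quadrupling T only doubles the value, reflecting the sqrt(T) dependence.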

This review was generated by AI and reviewed by human editors.