QUICK REVIEW

[Paper Review] Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Yanjun Han, Zhengqing Zhou|arXiv (Cornell University)|Apr 14, 2020

Advanced Bandit Algorithms Research | References: 55 | Cited by: 31

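One-Sentence Summary

This paper analyzes sequential batch learning in finite-action linear contextual bandits, derives regret upper and lower bounds under both adversarial and stochastic contexts, and proposes corresponding algorithms.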

ABSTRACT

We study the sequential batch learning problem in linear contextual bandits with finite action sets, where the decision maker is constrained to split incoming individuals into (at most) a fixed number of batches and can only observe outcomes for the individuals within a batch at the batch's end. Compared to both standard online contextual bandit learning and offline policy learning in contextual bandits, this sequential batch learning problem provides a finer-grained formulation of many personalized sequential decision making problems in practical applications, including medical treatment in clinical trials, product recommendation in e-commerce and adaptive experiment design in crowdsourcing. We study two settings of the problem: one where the contexts are arbitrarily generated and the other where the contexts are iid drawn from some distribution. In each setting, we establish a regret lower bound and provide an algorithm, whose regret upper bound nearly matches the lower bound. As an important insight revealed therefrom, in the former setting, we show that the number of batches required to achieve the fully online performance is polynomial in the time horizon, while for the latter setting, a pure-exploitation algorithm with a judicious batch partition scheme achieves the fully online performance even when the number of batches is less than logarithmic in the time horizon. Together, our results provide a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.

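Motivation and Objectives

Motivates and formalizes sequential batch learning, in which rewards are observed only at the end of each batch.
Characterizes how a fixed number of batches M affects the regret in finite-action linear contextual bandits.
Develops algorithms and proves regret upper and lower bounds in both the adversarial-context and stochastic-context settings.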

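Proposed Method

Formalizes sequential batch learning through a grid of M batches together with a batching policy, extending online contextual bandits to batch-constrained feedback.
Proposes a sequential batch UCB (SBUCB) algorithm that updates its estimate of theta at the end of each batch and uses upper confidence bounds within each batch.
Provides a master algorithm to handle dependence issues and establish the validity of the confidence bounds.
Derives regret upper and lower bounds under adversarial contexts, exhibiting polylogarithmic factors in T and an explicit dependence on M.
Analyzes stochastic contexts with a pure-exploitation algorithm and derives the corresponding regret bounds.
Gives problem-dependent regret bounds and discusses high-probability guarantees.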
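
The batch-constrained feedback loop described above can be sketched as follows. This is a minimal illustration, not the paper's SBUCB algorithm: the uniform grid, the ridge regularizer of 1, the confidence multiplier `beta`, the Gaussian noise model, and all function names (`uniform_grid`, `solve`, `batched_ucb`) are assumptions made for this example. It only shows the key constraint, namely that the estimate of theta changes at batch end points and nowhere else.

```python
import math
import random


def uniform_grid(T, M):
    """End points t_1 < ... < t_M = T of M roughly equal batches (assumed grid)."""
    return [math.ceil(m * T / M) for m in range(1, M + 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def batched_ucb(contexts, theta_star, M, beta=1.0, noise=0.1, seed=0):
    """Run a UCB rule whose statistics are refreshed only at batch ends.

    contexts[t][a] is the feature vector of action a at time t.
    theta_star is used only to simulate rewards and to measure regret.
    """
    rng = random.Random(seed)
    T, d = len(contexts), len(theta_star)
    batch_ends = set(uniform_grid(T, M))
    # Ridge-regularized Gram matrix and response vector (regularizer 1, assumed).
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    theta_hat = [0.0] * d
    buffer, regret, updates = [], 0.0, 0

    for t in range(1, T + 1):
        arms = contexts[t - 1]

        def ucb(x):
            # A and theta_hat stay fixed inside a batch.
            width = math.sqrt(max(dot(x, solve(A, x)), 0.0))
            return dot(x, theta_hat) + beta * width

        chosen = max(arms, key=ucb)
        reward = dot(chosen, theta_star) + rng.gauss(0.0, noise)
        buffer.append((chosen, reward))  # outcome stays hidden until the batch ends
        regret += max(dot(a, theta_star) for a in arms) - dot(chosen, theta_star)

        if t in batch_ends:
            # Batch end: reveal buffered outcomes and refit theta_hat.
            for x, r in buffer:
                for i in range(d):
                    b[i] += r * x[i]
                    for j in range(d):
                        A[i][j] += x[i] * x[j]
            buffer.clear()
            theta_hat = solve(A, b)
            updates += 1

    return regret, updates, theta_hat
```

Calling `batched_ucb(contexts, theta_star, M=5)` on T rounds performs exactly five refits, one per batch, regardless of T. Setting `M` equal to T would recover an ordinary per-round update.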

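Experimental Results

Research Questions

RQ1: Under adversarial contexts, how does restricting feedback to M batches affect the regret in finite-action linear contextual bandits?
RQ2: In the adversarial-context setting, what near-optimal regret rate can a sequential batch UCB algorithm achieve?
RQ3: How do stochastic contexts change the optimal batching strategy and the achievable regret?
RQ4: Can tight lower bounds be established that show how many batches are necessary to reach the optimal regret?
RQ5: Under stochastic contexts, how does a pure-exploitation strategy perform, and what does its regret look like?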

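Key Findings

Under adversarial contexts, there is a sequential batch algorithm whose expected regret is at most polylog(T) * (sqrt(dT) + dT/M).
A lower bound shows that when K = 2 the regret is at least c * (sqrt(dT) + min{T sqrt(d)/M, T/sqrt(M)}), nearly matching the upper bound up to polylogarithmic factors and constants when the dimension d is small.
This implies that on the order of sqrt(dT) batches suffice to attain the fully online regret rate, since the batch term dT/M drops to sqrt(dT) once M reaches sqrt(dT); the number of batches required is therefore polynomial in T.
Under stochastic contexts, a pure-exploitation algorithm attains the minimax regret tilde Theta(sqrt(dT)) with a number of batches that is less than logarithmic in T, specifically on the order of log log(T/d^2).
For stochastic contexts, the upper and lower bounds agree up to polylogarithmic factors, so near-minimax-optimal regret is attainable with significantly fewer batches than in the adversarial case.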
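
To make the rates above concrete, the following sketch evaluates them numerically. It drops all constants and polylogarithmic factors, so the outputs indicate orders of magnitude only; the function names are hypothetical and chosen for this example.

```python
import math


def adversarial_upper_rate(T, d, M):
    """sqrt(dT) + dT/M, the upper-bound rate without polylog(T) factors."""
    return math.sqrt(d * T) + d * T / M


def adversarial_lower_rate(T, d, M):
    """sqrt(dT) + min{T sqrt(d)/M, T/sqrt(M)}, the lower-bound rate without the constant c."""
    return math.sqrt(d * T) + min(T * math.sqrt(d) / M, T / math.sqrt(M))


def batches_for_online_rate(T, d):
    """Smallest M with dT/M <= sqrt(dT), that is, M >= sqrt(dT)."""
    return math.ceil(math.sqrt(d * T))


def stochastic_batch_order(T, d):
    """log log(T/d^2), the order of the batch count under stochastic contexts.

    Requires T > d^2 so that the inner logarithm is positive.
    """
    return math.log(math.log(T / d ** 2))


T, d = 10_000, 4
M = batches_for_online_rate(T, d)
print(M, adversarial_upper_rate(T, d, M), adversarial_lower_rate(T, d, M))
print(stochastic_batch_order(T, d))
```

With T = 10,000 and d = 4, the adversarial setting needs about sqrt(dT) = 200 batches before the batch term stops dominating, whereas log log(T/d^2) is below 2, which illustrates the gap between the two settings.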

This review was generated by AI and checked by a human editor.