QUICK REVIEW

[Paper Review] Finite-Time Analysis of Kernelised Contextual Bandits

Michal Vaľko, Nathan Korda | arXiv (Cornell University) | Sep 26, 2013

Advanced Bandit Algorithms Research | References: 28 | Cited by: 91

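One-Sentence Summary

This paper proposes KernelUCB, a kernelised upper confidence bound algorithm for contextual bandits that exploits similarities between action contexts through a reproducing kernel Hilbert space (RKHS). It establishes a finite-time regret bound that improves on the GP-UCB bound in the agnostic case and, for the linear kernel, matches the lower bound for contextual linear bandits, giving a theoretical basis for efficient exploration in large action spaces with structured context similarity.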

ABSTRACT

We tackle the problem of online reward maximisation over a large finite set of actions described by their contexts. We focus on the case when the number of actions is too big to sample all of them even once. However we assume that we have access to the similarities between actions' contexts and that the expected reward is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS). We propose KernelUCB, a kernelised UCB algorithm, and give a cumulative regret bound through a frequentist analysis. For contextual bandits, the related algorithm GP-UCB turns out to be a special case of our algorithm, and our finite-time analysis improves the regret bound of GP-UCB for the agnostic case, both in the terms of the kernel-dependent quantity and the RKHS norm of the reward function. Moreover, for the linear kernel, our regret bound matches the lower bound for contextual linear bandits.

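Motivation and Objectives

Address online reward maximisation when the action set is too large to sample every action even once.
Exploit context similarities by modelling the expected reward as an arbitrary linear function of the contexts' images in a reproducing kernel Hilbert space (RKHS).
Design a kernelised UCB algorithm that balances exploration and exploitation efficiently in this setting.
Provide a finite-time regret bound that improves on existing methods such as GP-UCB in the agnostic case.
Establish tight theoretical guarantees, including matching the known lower bound in the linear-kernel case.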
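
The setting described above can be written down compactly. The following is a sketch in assumed notation (the symbols \phi, \theta^\ast, x_{a,t}, a_t and R_T are chosen here for illustration and may differ from the paper's): the expected reward is linear in the RKHS image of the context, the kernel is the inner product of those images, and the cumulative regret compares the chosen action with the best available one in each round.

```latex
% Reward model: linear in the feature map \phi of the RKHS \mathcal{H}
\mathbb{E}\left[ r_{a,t} \mid x_{a,t} \right]
  = \langle \phi(x_{a,t}), \theta^\ast \rangle_{\mathcal{H}},
\qquad
k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}

% Cumulative regret after T rounds, where a_t is the action chosen in round t
R_T = \sum_{t=1}^{T} \left(
        \max_{a} \langle \phi(x_{a,t}), \theta^\ast \rangle_{\mathcal{H}}
        - \langle \phi(x_{a_t,t}), \theta^\ast \rangle_{\mathcal{H}}
      \right)
```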

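Proposed Method

Proposes KernelUCB, a kernelised UCB algorithm that models the expected reward through context similarities, with the complexity of the reward function measured by its RKHS norm.
Uses a frequentist analysis to derive a cumulative regret bound for the algorithm.
Models the reward function as an element of an RKHS, which allows nonparametric function approximation.
Uses a kernel function to encode similarities between action contexts, enabling generalisation across actions.
Derives upper confidence bounds from the RKHS norm and an empirical variance estimate to guide exploration.
Shows that, for contextual bandits, GP-UCB arises as a special case of KernelUCB, which allows the two regret bounds to be compared directly.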
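
The steps above can be illustrated with a minimal Python sketch of a kernelised UCB rule. It is a simplified illustration, not the authors' implementation: it assumes a fixed set of arm contexts (the paper's setting allows contexts to change each round), a Gaussian kernel, Gaussian reward noise, and hypothetical names (`rbf_kernel`, `kernel_ucb_scores`, `run_kernel_ucb`) with a regulariser `gamma` and an exploration weight `eta` chosen here for illustration. The score of each arm is a kernel ridge regression estimate plus a width computed from the kernel matrix.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def kernel_ucb_scores(X_hist, y_hist, X_arms, kernel, gamma=1.0, eta=1.0):
    """Return one upper-confidence score per candidate arm.

    gamma (ridge regulariser) and eta (exploration weight) are assumed
    tuning parameters for this sketch.
    """
    prior_var = np.diag(kernel(X_arms, X_arms))
    if len(X_hist) == 0:
        # No observations yet: zero mean estimate, width from the kernel alone.
        return eta * np.sqrt(prior_var / gamma)
    K = kernel(X_hist, X_hist) + gamma * np.eye(len(X_hist))
    k_arms = kernel(X_hist, X_arms)          # shape (n_hist, n_arms)
    solved = np.linalg.solve(K, k_arms)      # (K + gamma I)^{-1} k_x per arm
    mean = solved.T @ y_hist                 # kernel ridge regression estimate
    var = prior_var - np.sum(k_arms * solved, axis=0)
    width = np.sqrt(np.maximum(var, 0.0) / gamma)
    return mean + eta * width


def run_kernel_ucb(X_arms, reward_fn, n_rounds, kernel,
                   gamma=1.0, eta=1.0, noise=0.1, seed=0):
    """Play n_rounds on a fixed arm set and return the chosen arm indices."""
    rng = np.random.default_rng(seed)
    X_hist = np.empty((0, X_arms.shape[1]))
    y_hist = np.empty(0)
    chosen = []
    for _ in range(n_rounds):
        scores = kernel_ucb_scores(X_hist, y_hist, X_arms, kernel, gamma, eta)
        a = int(np.argmax(scores))
        r = reward_fn(X_arms[a]) + noise * rng.standard_normal()
        X_hist = np.vstack([X_hist, X_arms[a]])
        y_hist = np.append(y_hist, r)
        chosen.append(a)
    return chosen


if __name__ == "__main__":
    # 200 arms, only 40 rounds: most arms are never sampled.
    arms = np.linspace(-1.0, 1.0, 200)[:, None]
    picks = run_kernel_ucb(arms, lambda x: float(np.exp(-8.0 * (x[0] - 0.3) ** 2)),
                           n_rounds=40, kernel=rbf_kernel, gamma=0.5, eta=0.5)
    print("arms pulled in the last 5 rounds:", picks[-5:])
```

The demo uses more arms than rounds on purpose: the kernel lets an observation at one context shrink the width at nearby contexts, which is how the rule can generalise without pulling every arm.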

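Experimental Results

Research Questions

RQ1: Can we design a contextual bandit algorithm that handles large action spaces efficiently using only context similarities?
RQ2: How does the regret of a kernelised UCB algorithm scale with the kernel-dependent quantity and the RKHS norm of the reward function?
RQ3: With the linear kernel, does KernelUCB match the known lower bound for contextual linear bandits?
RQ4: In the agnostic setting, how does the finite-time analysis of KernelUCB improve on that of GP-UCB?
RQ5: Can the proposed method generalise across actions without sampling the whole action set?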

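Key Findings

KernelUCB attains a cumulative regret bound that improves on the GP-UCB bound in the agnostic case, both in the kernel-dependent quantity and in the RKHS norm of the reward function.
For the linear kernel, the KernelUCB regret bound matches the known lower bound for contextual linear bandits, which indicates theoretical optimality in that setting.
The finite-time analysis gives a tighter regret bound than previous analyses, which is most relevant for high-dimensional or complex context structures.
KernelUCB subsumes GP-UCB as a special case, placing both within a single theoretical framework.
By exploiting context similarities through the kernel, the method avoids having to sample every action, which enables efficient learning in large action spaces.
The theoretical results indicate that the algorithm balances exploration and exploitation effectively even though the reward function is unknown in advance.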
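
The linear-kernel finding can be made concrete with a small numerical check. The sketch below is an illustration under assumed names (`linear_kernel_equivalence`, `gamma`) and random synthetic data, not code from the paper: it verifies that, with the linear kernel, the kernel-form mean and width coincide with the ridge-regression mean and width used by linear bandit rules such as LinUCB. It demonstrates only that the indices agree (a consequence of the Woodbury identity); it does not prove the regret bound.

```python
import numpy as np


def linear_kernel_equivalence(n=30, d=5, gamma=0.7, seed=0):
    """Compare primal (ridge) and dual (kernel) estimates for k(x, x') = x . x'."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # past contexts (synthetic)
    y = rng.standard_normal(n)        # past rewards (synthetic)
    x = rng.standard_normal(d)        # a new candidate context

    # Primal form: d x d system, as in linear bandit algorithms.
    A = X.T @ X + gamma * np.eye(d)
    mean_primal = x @ np.linalg.solve(A, X.T @ y)
    width_primal = np.sqrt(x @ np.linalg.solve(A, x))

    # Dual form: n x n kernel system with the linear kernel.
    K = X @ X.T + gamma * np.eye(n)
    k_x = X @ x
    mean_dual = k_x @ np.linalg.solve(K, y)
    width_dual = np.sqrt((x @ x - k_x @ np.linalg.solve(K, k_x)) / gamma)

    return mean_primal, mean_dual, width_primal, width_dual


if __name__ == "__main__":
    mp, md, wp, wd = linear_kernel_equivalence()
    print("means agree: ", np.isclose(mp, md))
    print("widths agree:", np.isclose(wp, wd))
```

The dual form never needs the feature dimension explicitly, which is why the same rule carries over to kernels whose feature space is infinite-dimensional.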


This review was generated by AI and checked by a human editor.