QUICK REVIEW

[Paper Review] Phased Exploration with Greedy Exploitation in Stochastic Combinatorial Partial Monitoring Games

Sougata Chaudhuri, Ambuj Tewari|arXiv (Cornell University)|Jan 1, 2016

Advanced Bandit Algorithms Research | References: 7 | Cited by: 52

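One-Sentence Summary

The paper proposes a Phased Exploration with Greedy Exploitation (PEGE) framework for stochastic combinatorial partial monitoring (CPM) games that, using only an argmax oracle, achieves $O(T^{2/3}\sqrt{\log T})$ distribution-independent regret and $O(\log^2 T)$ distribution-dependent regret. Unlike prior work, the framework does not assume a unique optimal action and avoids the more complex arg-secondmax oracle, so it can be applied efficiently to online ranking with feedback at the top.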

ABSTRACT

Partial monitoring games are repeated games where the learner receives feedback that might be different from adversary's move or even the reward gained by the learner. Recently, a general model of combinatorial partial monitoring (CPM) games was proposed \cite{lincombinatorial2014}, where the learner's action space can be exponentially large and adversary samples its moves from a bounded, continuous space, according to a fixed distribution. The paper gave a confidence bound based algorithm (GCB) that achieves $O(T^{2/3}\log T)$ distribution independent and $O(\log T)$ distribution dependent regret bounds. The implementation of their algorithm depends on two separate offline oracles and the distribution dependent regret additionally requires existence of a unique optimal action for the learner. Adopting their CPM model, our first contribution is a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework for the problem. Different algorithms within the framework achieve $O(T^{2/3}\sqrt{\log T})$ distribution independent and $O(\log^2 T)$ distribution dependent regret respectively. Crucially, our framework needs only the simpler "argmax" oracle from GCB and the distribution dependent regret does not require existence of a unique optimal action. Our second contribution is another algorithm, PEGE2, which combines gap estimation with a PEGE algorithm, to achieve an $O(\log T)$ regret bound, matching the GCB guarantee but removing the dependence on size of the learner's action space. However, like GCB, PEGE2 requires access to both offline oracles and the existence of a unique optimal action. Finally, we discuss how our algorithm can be efficiently applied to a CPM problem of practical interest: namely, online ranking with feedback at the top.

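Motivation and Objectives

Address the limitation that prior CPM algorithms depend on both an argmax and an arg-secondmax oracle.
Design a regret-minimizing algorithm for combinatorial partial monitoring games with exponentially large action spaces and continuous adversary moves.
Remove the unique-optimal-action assumption from the distribution-dependent regret analysis.
Enable efficient deployment in practical limited-feedback settings such as online ranking.
Achieve regret bounds comparable to or better than existing methods while reducing the computational dependencies.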

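Proposed Method

Introduces a phased framework that alternates between exploration phases and greedy exploitation phases.
Uses only the argmax oracle, which is simpler than the two-oracle mechanism required by the prior method (GCB).
During exploitation, selects actions greedily with respect to the current reward estimates.
Introduces PEGE2, which combines gap estimation with a PEGE algorithm to achieve $O(\log T)$ distribution-dependent regret.
Works under the CPM model's assumptions, including global observability and Lipschitz continuity of the reward function.
Applies the framework to online ranking with feedback at the top, modeled as a CPM game whose actions are permutations.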
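The alternation between exploration and greedy exploitation described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' algorithm: the reward weights, the feedback model (only the top-slot item is observed), the uniform noise, the hand-picked exploration set, and the `2 ** b` exploitation schedule are all hypothetical choices made for the example, and `pege_sketch`, `argmax_oracle`, and `expected_reward` are illustrative names.

```python
import random


def argmax_oracle(theta_hat, k=2):
    """Return the ordered k-tuple of items with the highest estimated means.

    For the position-weighted reward below, sorting by estimate is optimal.
    """
    order = sorted(range(len(theta_hat)), key=lambda i: theta_hat[i], reverse=True)
    return tuple(order[:k])


def expected_reward(action, theta, weights=(1.0, 0.5)):
    """Hypothetical reward: position-weighted sum of the chosen items' means."""
    return sum(w * theta[i] for w, i in zip(weights, action))


def pege_sketch(theta, num_phases, rng, noise=0.05):
    """Alternate exploration passes with greedy exploitation (toy setting).

    Returns the final estimate of theta and the cumulative expected regret.
    """
    n = len(theta)
    # Assumed exploration set: one action per item, with that item on top,
    # so that every coordinate of theta is observed once per pass.
    observable = [(i, (i + 1) % n) for i in range(n)]
    sums = [0.0] * n
    theta_hat = [0.0] * n
    best = expected_reward(argmax_oracle(theta), theta)
    regret = 0.0
    for b in range(1, num_phases + 1):
        # Exploration: play each action in the exploration set once.
        for action in observable:
            top = action[0]
            # Partial feedback: a bounded-noise observation of the top item only.
            sums[top] += theta[top] + rng.uniform(-noise, noise)
            regret += best - expected_reward(action, theta)
        theta_hat = [s / b for s in sums]
        # Exploitation: play the greedy action for 2**b rounds
        # (a hypothetical schedule chosen for illustration).
        greedy = argmax_oracle(theta_hat)
        regret += (best - expected_reward(greedy, theta)) * (2 ** b)
    return theta_hat, regret
```

Calling `pege_sketch([0.9, 0.1, 0.5, 0.3], num_phases=8, rng=random.Random(0))` returns the estimated means and the accumulated regret. In this toy setting only the single sorting-based oracle is ever called, which mirrors the point that exploitation needs nothing beyond an argmax over the current estimates.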

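Results

Research Questions

RQ1: Can a CPM algorithm achieve $O(\log^2 T)$ distribution-dependent regret without assuming a unique optimal action?
RQ2: Can the regret bound be improved to $O(\log T)$ while avoiding reliance on the arg-secondmax oracle?
RQ3: Can the PEGE framework be applied efficiently to online ranking with feedback at the top?
RQ4: Does phased exploration with greedy exploitation perform better in CPM games than confidence-bound-based methods?
RQ5: Can the framework handle continuous learner action spaces while keeping regret low?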

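Key Findings

Using only the argmax oracle, PEGE algorithms achieve $O(T^{2/3}\sqrt{\log T})$ distribution-independent regret and $O(\log^2 T)$ distribution-dependent regret.
Unlike the earlier distribution-dependent bound, the PEGE framework does not require a unique optimal action.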
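PEGE2 achieves $O(\log T)$ distribution-dependent regret, matching the GCB guarantee while removing the dependence on the size of the learner's action space. Like GCB, however, it requires access to both offline oracles and the existence of a unique optimal action.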
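Online ranking with feedback at the top is formally modeled as a CPM game that satisfies all required assumptions.
The framework applies to both finite and continuous learner action spaces, including continuous score vectors for ranking.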
The paper discusses how the framework can be applied efficiently to online ranking with feedback at the top.
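A back-of-the-envelope way to see where a $T^{2/3}\sqrt{\log T}$ rate can come from is the standard explore-then-exploit balancing argument. This is a heuristic sketch, not the paper's proof. Assume that $n$ exploration rounds each cost at most constant regret, that they leave an estimation error of order $\sqrt{\log T / n}$, and that by Lipschitz continuity of the reward each of the remaining rounds incurs regret of the same order. The constant $C$ and the choice of $n$ are hypothetical:

$$
R(T) \;\le\; C\left(n + T\sqrt{\frac{\log T}{n}}\right),
\qquad
n = T^{2/3}
\;\Longrightarrow\;
R(T) \;\le\; C\,T^{2/3}\left(1 + \sqrt{\log T}\right) = O\!\left(T^{2/3}\sqrt{\log T}\right).
$$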


This review was generated by AI and checked by human editors.