QUICK REVIEW

[Paper Review] A Contextual Bandit Bake-off

Alberto Bietti, Alekh Agarwal|arXiv (Cornell University)|Feb 12, 2018

Advanced Bandit Algorithms Research | References: 35 | Cited by: 53

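One-sentence summary

A large-scale empirical evaluation on supervised learning datasets that compares contextual bandit algorithms built on reductions to supervised learning oracles; RegCB, Greedy, and Cover variants perform best across the settings studied.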

ABSTRACT

Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes on statistically and computationally efficient methods, the practical behavior of these algorithms is still poorly understood. We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning. We find that a recent method (Foster et al., 2018) using optimism under uncertainty works the best overall. A surprisingly close second is a simple greedy baseline that only explores implicitly through the diversity of contexts, followed by a variant of Online Cover (Agarwal et al., 2014) which tends to be more conservative but robust to problem specification by design. Along the way, we also evaluate various components of contextual bandit algorithm design such as loss estimators. Overall, this is a thorough study and review of contextual bandit methodology.

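Motivation and objectives

Evaluate the practical performance of contextual bandit algorithms that rely on optimization oracles from supervised learning.
Compare loss estimators and reductions to off-policy learning in realistic, high-dimensional settings.
Identify which methods are most robust and practical for real-world deployment.
Give practitioners guidance on algorithm design choices and evaluation methodology.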

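Proposed approach

Simulate contextual bandit problems on a large collection of cost-sensitive and multiclass datasets by hiding the losses of the actions that were not chosen.
Evaluate online implementations of loss estimators (IPS, DR, IWR) and optimization through online oracles (CSC and regression).
Implement and compare several algorithms: RegCB (confidence-based), Cover and Cover-NU, epsilon-greedy variants, Bag/Online BTS, and Greedy.
Use Vowpal Wabbit for online updates, with adaptive, normalized, and importance-weight-aware gradient methods.
Explore choices of loss encoding and alternative reductions to off-policy learning.
Analyze how these methods perform on datasets with five or more actions.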
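The simulation protocol and two of the loss estimators listed above can be illustrated with a minimal Python sketch. It is not the paper's Vowpal Wabbit setup. The function names (ips_estimate, dr_estimate, simulate_bandit) are hypothetical, the loss is assumed to be 0/1, and the learner ignores contexts entirely, keeping only a running sum of IPS estimates per action as a stand-in for a real oracle. IWR is omitted because it needs a regression learner.

```python
import random


def ips_estimate(num_actions, action, loss, prob):
    """IPS: importance-weight the observed loss; unobserved actions get 0."""
    estimate = [0.0] * num_actions
    estimate[action] = loss / prob
    return estimate


def dr_estimate(num_actions, action, loss, prob, predicted):
    """DR: start from a loss model's predictions, then correct the chosen
    action with an importance-weighted residual."""
    estimate = list(predicted)
    estimate[action] += (loss - predicted[action]) / prob
    return estimate


def simulate_bandit(labels, num_actions, epsilon, seed=0):
    """Replay multiclass labels as bandit rounds under epsilon-greedy.

    Assumption: 0/1 loss, and a context-free learner (illustration only).
    Returns the average incurred loss and a log of (action, loss, prob).
    """
    rng = random.Random(seed)
    estimate_sums = [0.0] * num_actions
    total_loss = 0.0
    log = []
    for label in labels:
        # Greedy action under current estimates (ties go to the lowest index).
        greedy = min(range(num_actions), key=lambda a: estimate_sums[a])
        # Epsilon-greedy: uniform exploration mass plus the rest on greedy.
        probs = [epsilon / num_actions] * num_actions
        probs[greedy] += 1.0 - epsilon
        action = rng.choices(range(num_actions), weights=probs)[0]
        # Only the chosen action's loss is revealed; the others stay hidden.
        loss = 0.0 if action == label else 1.0
        total_loss += loss
        log.append((action, loss, probs[action]))
        ips = ips_estimate(num_actions, action, loss, probs[action])
        for a in range(num_actions):
            estimate_sums[a] += ips[a]
    return total_loss / len(labels), log


if __name__ == "__main__":
    labels = [0, 1, 2, 1, 0] * 20
    avg_loss, log = simulate_bandit(labels, num_actions=3, epsilon=0.1)
    print("average loss:", avg_loss)
```

Setting epsilon to 0 in this sketch gives a purely greedy learner, which then chooses every action with probability 1. Both estimators are unbiased for the full loss vector whenever every action has positive probability; DR additionally uses a loss model to reduce variance.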

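Experimental results

Research questions

RQ1: Which practical contextual bandit algorithms achieve the best overall performance across a large and diverse collection of datasets?
RQ2: How do different loss estimators and reductions to supervised learning affect exploration and regret in practice?
RQ3: What role do loss encodings and reduction mechanisms play in the empirical effectiveness of contextual bandits?
RQ4: Which methods are robust to problem specification and dataset characteristics, and what are the practical trade-offs?

Key findings

RegCB generally performs best across the experimental conditions.
A simple Greedy baseline often matches or outperforms many exploration methods in practice.
A variant of Online Cover (Cover-NU) is competitive on a large number of datasets and is robust by design.
The choice of loss encoding and reduction technique (such as importance-weighted regression) significantly affects performance and variance.
Logs collected from deploying these methods may not be suitable for off-policy evaluation, which is a practical consideration for deployment.
The study suggests that theoretical work should focus on understanding the greedy strategy and on exploiting datasets where exploration is easy.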

This review was generated by AI and checked by human editors.