QUICK REVIEW

[Paper Review] A Large-scale Open Dataset for Bandit Algorithms

Yuta Saito, Shunsuke Aihara|arXiv (Cornell University)|Aug 17, 2020

Advanced Bandit Algorithms Research | References: 50 | Cited by: 5

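One-Sentence Summary

This paper presents a large-scale open dataset and a standardized pipeline for off-policy evaluation (OPE) of contextual bandit algorithms in real-world settings, built from real user interactions on the ZOZOTOWN fashion e-commerce platform; the dataset supports fair benchmarking of OPE estimators and shows that a well-performing estimator can identify counterfactual policies that significantly outperform the historical policies on a real recommender system.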

ABSTRACT

We build and publicize the Open Bandit Dataset and Pipeline to facilitate scalable and reproducible research on bandit algorithms. They are especially suitable for off-policy evaluation (OPE), which attempts to predict the performance of hypothetical algorithms using data generated by a different algorithm. We construct the dataset based on experiments and implementations on a large-scale fashion e-commerce platform, ZOZOTOWN. The data contain the ground-truth about the performance of several bandit policies and enable the fair comparisons of different OPE estimators. We also provide a pipeline to make its implementation easy and consistent. As a proof of concept, we use the dataset and pipeline to implement and evaluate OPE estimators. First, we find that a well-established estimator fails, suggesting that it is critical to choose an appropriate estimator. We then select a well-performing estimator and use it to improve the platform's fashion item recommendation. Our analysis succeeds in finding a counterfactual policy that significantly outperforms the historical ones. Our open data and pipeline will allow researchers and practitioners to easily evaluate and compare their bandit algorithms and OPE estimators with others in a large, real-world setting.

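Motivation and Objectives

Address the lack of large-scale real-world datasets for evaluating off-policy evaluation (OPE) estimators for bandit algorithms.
Enable fair and reproducible comparison of OPE estimators using real-world data from a production e-commerce platform.
Provide a standardized pipeline that keeps the implementation and evaluation of bandit algorithms and OPE methods consistent.
Demonstrate the practical impact of OPE by identifying a counterfactual policy that significantly outperforms the historical policies in a real-world recommender system.
Help researchers and practitioners evaluate and improve bandit algorithms in a scalable real-world setting.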

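Proposed Method

The dataset is built from logged interaction data of multiple bandit policies deployed on ZOZOTOWN, a large-scale fashion e-commerce platform.
The dataset contains context information, the actions taken, and the observed rewards, which supports counterfactual evaluation of hypothetical policies.
A standardized pipeline keeps the implementation and evaluation of OPE estimators consistent across research environments.
The authors use the dataset to evaluate several OPE estimators and find performance gaps among widely used estimators.
They select a well-performing OPE estimator and use counterfactual analysis to improve the platform's fashion item recommendation policy.
The pipeline supports end-to-end evaluation, from data loading to estimator comparison, for reproducibility and scalability.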
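
The logged actions, rewards, and action probabilities described above are the inputs an OPE estimator consumes. As a minimal sketch, and not the authors' pipeline, the following uses synthetic data with inverse probability weighting (IPW) and its self-normalized variant, two standard OPE estimators. The function names (`simulate_logged_data`, `ipw_estimate`, `snipw_estimate`), the policies, and the reward rates are all assumptions made for illustration, and contexts are omitted to keep the example short.

```python
import random


def simulate_logged_data(n_rounds, behavior_probs, reward_probs, rng):
    """Draw synthetic (action, reward, propensity) logs; contexts omitted for brevity."""
    actions = rng.choices(range(len(behavior_probs)), weights=behavior_probs, k=n_rounds)
    logs = []
    for a in actions:
        reward = 1 if rng.random() < reward_probs[a] else 0
        logs.append((a, reward, behavior_probs[a]))
    return logs


def ipw_estimate(logs, eval_probs):
    """Inverse probability weighting: mean of importance-weighted rewards."""
    return sum(eval_probs[a] / p * r for a, r, p in logs) / len(logs)


def snipw_estimate(logs, eval_probs):
    """Self-normalized IPW: normalize by the sum of weights instead of n."""
    weights = [eval_probs[a] / p for a, _, p in logs]
    weighted_rewards = sum(w * r for w, (_, r, _) in zip(weights, logs))
    return weighted_rewards / sum(weights)


# Hypothetical setup: three actions, a logging (behavior) policy, and a
# different evaluation policy whose value we want to estimate offline.
rng = random.Random(0)
behavior_probs = [0.5, 0.3, 0.2]    # assumed logging policy
eval_probs = [0.1, 0.2, 0.7]        # assumed counterfactual policy
reward_probs = [0.02, 0.05, 0.10]   # assumed click rates per action

logs = simulate_logged_data(100_000, behavior_probs, reward_probs, rng)

# In this synthetic world the true value of the evaluation policy is known.
true_value = sum(pe * q for pe, q in zip(eval_probs, reward_probs))
ipw = ipw_estimate(logs, eval_probs)
snipw = snipw_estimate(logs, eval_probs)

print(f"true value: {true_value:.4f}")
print(f"IPW:        {ipw:.4f}")
print(f"SNIPW:      {snipw:.4f}")
```

Both estimators reweight each logged reward by the ratio of the evaluation policy's action probability to the logging policy's, which is why the logs must record the probability with which each action was chosen.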

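Experimental Results

Research Questions

RQ1: Which OPE estimators perform stably and reliably on large-scale real-world bandit data from a production e-commerce platform?
RQ2: Can a well-performing OPE estimator identify a counterfactual policy that significantly outperforms the historical policies in a real-world recommender system?
RQ3: How does the performance of existing widely used OPE estimators degrade when they are applied to real-world data with complex behavior policies?
RQ4: To what extent can OPE enable safe and scalable policy improvement without online A/B testing?
RQ5: How effective are the proposed dataset and pipeline at enabling consistent and reproducible evaluation of bandit algorithms?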

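Key Findings

A well-established OPE estimator fails to give accurate performance estimates on the real-world dataset, which underscores the importance of estimator choice in practice.
Another, well-performing OPE estimator successfully identifies a counterfactual policy that significantly outperforms the historical policies in the recommender system.
The proposed dataset and pipeline enable consistent and reproducible evaluation of OPE estimators across research environments.
The counterfactual policy found through OPE yields a measurable improvement in recommendation performance, demonstrating the practical value of off-policy evaluation.
The dataset and pipeline support benchmarking of bandit algorithms at large, real-world scale, enabling fair comparison and continued progress of OPE methods.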
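
Because the dataset records the ground-truth performance of the logged policies, estimators can be scored by how far their estimates fall from that ground truth. The sketch below shows one simple way to do this, using relative error as the metric. The helper names and all numbers are placeholders chosen for illustration and are not results from the paper.

```python
def relative_error(estimate, ground_truth):
    """Absolute gap between estimate and ground truth, relative to the ground truth."""
    if ground_truth == 0:
        raise ValueError("ground_truth must be non-zero")
    return abs(estimate - ground_truth) / abs(ground_truth)


def rank_estimators(estimates, ground_truth):
    """Return (name, relative_error) pairs sorted from most to least accurate."""
    scored = {name: relative_error(v, ground_truth) for name, v in estimates.items()}
    return sorted(scored.items(), key=lambda item: item[1])


# Placeholder values only: a hypothetical on-policy (ground-truth) value and
# three hypothetical OPE estimates of the same policy.
ground_truth = 0.0050
estimates = {
    "estimator_a": 0.0062,
    "estimator_b": 0.0051,
    "estimator_c": 0.0040,
}

ranking = rank_estimators(estimates, ground_truth)
for name, err in ranking:
    print(f"{name}: relative error = {err:.3f}")
```

In practice such a comparison would be repeated over many resampled subsets of the logged data so that the ranking reflects the variance of each estimator as well as its bias.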


This review was generated by AI and reviewed by human editors.