QUICK REVIEW

[Paper Review] RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

David Rohde, Stephen Bonner|arXiv (Cornell University)|Aug 2, 2018

Advanced Bandit Algorithms Research | References: 13 | Cited by: 60

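One-Sentence Summary

RecoGym introduces an OpenAI Gym-compatible reinforcement learning environment for product recommendation in online advertising. It models both organic user interactions and bandit interactions, with the aim of aligning offline evaluation with online evaluation.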

ABSTRACT

Recommender Systems are becoming ubiquitous in many settings and take many forms, from product recommendation in e-commerce stores, to query suggestions in search engines, to friend recommendation in social networks. Current research directions, which are largely based upon supervised learning from historical data, appear to be showing diminishing returns, with many practitioners reporting a discrepancy between improvements in offline metrics for supervised learning and the online performance of the newly proposed models. One possible reason is that we are using the wrong paradigm: when looking at the long-term cycle of collecting historical performance data, creating a new version of the recommendation model, A/B testing it and then rolling it out, we see that there are a lot of commonalities with the reinforcement learning (RL) setup, where the agent observes the environment and acts upon it in order to change its state towards better states (states with higher rewards). To this end we introduce RecoGym, an RL environment for recommendation, which is defined by a model of user traffic patterns on e-commerce and the users' response to recommendations on the publisher websites. We believe that this is an important step forward for the field of recommendation systems research, one that could open up an avenue of collaboration between the recommender systems and reinforcement learning communities and lead to better alignment between offline and online performance metrics.

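Research Motivation and Objectives

Motivate a shift from purely supervised recommender systems toward reinforcement learning by highlighting the gap between offline metrics and online performance.
Provide a reinforcement learning environment with tunable parameters that models both organic (site browsing) and bandit (advertising) interactions, so that the long-term effects of recommendations can be studied.
Evaluate policies in a controlled simulator driven by user traffic patterns and ad exposure effects.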

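Proposed Method

Define a parameterized model of user traffic that comprises organic sessions and bandit sessions on publisher sites.
Create an OpenAI Gym-compliant environment that exposes Reset and Step routines for RL agents.
Include a controllable correlation between organic and bandit behavior, along with tunable hidden user-item clusters.
Model the effect of ad exposure on click-through rate, and allow time-varying non-stationarity such as ad fatigue.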
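
The components listed above can be pictured with a toy environment. What follows is a minimal sketch under stated assumptions, not RecoGym's actual implementation: the class name `ToyRecoEnv`, its parameters, the linear mixing used for the organic/bandit correlation, the exponential fatigue rule, and the 10% session-end probability are all hypothetical choices made for illustration. It mimics a Gym-style reset/step contract using only the Python standard library.

```python
import math
import random


class ToyRecoEnv:
    """Hypothetical toy simulator with a Gym-style reset/step interface."""

    def __init__(self, num_products=5, correlation=0.8, fatigue=0.1, seed=0):
        self.num_products = num_products
        self.correlation = correlation  # assumed link between organic and bandit tastes
        self.fatigue = fatigue          # assumed per-exposure decay of click probability
        self.rng = random.Random(seed)

    def reset(self):
        # Draw a hidden organic preference vector for a new user.
        self.organic_pref = [self.rng.random() for _ in range(self.num_products)]
        # Bandit (ad) preference mixes the organic preference with independent noise.
        c = self.correlation
        self.bandit_pref = [c * p + (1 - c) * self.rng.random() for p in self.organic_pref]
        self.exposures = [0] * self.num_products
        return self._organic_session()

    def _organic_session(self):
        # Observation: products viewed on the e-commerce site, sampled by preference.
        length = self.rng.randint(1, 4)
        return self.rng.choices(range(self.num_products), weights=self.organic_pref, k=length)

    def click_probability(self, action):
        # Click probability shrinks as the same ad is shown repeatedly (ad fatigue).
        base = 0.2 * self.bandit_pref[action]
        return base * math.exp(-self.fatigue * self.exposures[action])

    def step(self, action):
        # Bandit event: show one recommendation and observe a click (1) or no click (0).
        reward = 1 if self.rng.random() < self.click_probability(action) else 0
        self.exposures[action] += 1
        observation = self._organic_session()
        done = self.rng.random() < 0.1  # assumed chance that the user leaves
        return observation, reward, done, {}


# Usage: run one user episode with a uniformly random policy.
env = ToyRecoEnv(seed=42)
observation = env.reset()
done = False
total_reward = 0
while not done:
    action = env.rng.randrange(env.num_products)
    observation, reward, done, info = env.step(action)
    total_reward += reward
```

Setting `correlation` near 1 makes organic browsing a strong signal for ad clicks, while a value near 0 makes it almost uninformative, which is the kind of knob the summary describes as controllable.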

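Experimental Results

Research Questions

RQ1: How can organic and bandit information be combined to improve recommendation performance across different volumes of bandit data?
RQ2: How does the level of correlation between organic and bandit behavior affect the effectiveness of different learning strategies?
RQ3: At moderate data scales, can a single combined model outperform purely organic or purely bandit approaches?
RQ4: Which sanity checks can verify that an RL agent using both data sources learns a reasonable policy within RecoGym?
RQ5: Which baseline agents can provide a reasonable benchmark for RL methods in this environment?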

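Key Findings

Proposes RecoGym for the first time as a reinforcement learning environment for recommendation in online advertising.
Supports both organic and bandit interactions, with tunable correlation and a tunable user-item clustering dimension.
Provides baseline agents (Random, Logistic, Supervised-Prod2Vec) that interact with the simulator.
Provides a sanity-check framework that relates organic and bandit data to the expected performance at different data scales.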
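
A sanity check of this kind can be sketched as follows. This is an illustrative toy, not the paper's experiment: the simulator and the names `simulate_ctr`, `random_agent`, and `organic_popularity_agent` are hypothetical, and the organic-popularity agent is a deliberately simple stand-in rather than one of the baselines named above. The sketch assumes that click probability rises with the user's organic preference, so an agent that uses organic views should beat a uniformly random one.

```python
import random
from collections import Counter


def random_agent(organic_views, num_products, rng):
    # Baseline: ignore the user's history and pick any product.
    return rng.randrange(num_products)


def organic_popularity_agent(organic_views, num_products, rng):
    # Recommend the product this user viewed most often organically.
    return Counter(organic_views).most_common(1)[0][0]


def simulate_ctr(agent, num_users=5000, num_products=10, seed=0):
    """Estimate click-through rate for an agent in a hypothetical toy simulator."""
    rng = random.Random(seed)
    clicks = 0
    for _ in range(num_users):
        # Hidden per-user preferences; cubing makes a few products strongly preferred.
        prefs = [rng.random() ** 3 for _ in range(num_products)]
        # Organic session: 10 product views sampled in proportion to preference.
        views = rng.choices(range(num_products), weights=prefs, k=10)
        action = agent(views, num_products, rng)
        # Assumed bandit response: click probability proportional to preference.
        clicks += rng.random() < 0.5 * prefs[action]
    return clicks / num_users


random_ctr = simulate_ctr(random_agent)
organic_ctr = simulate_ctr(organic_popularity_agent)
print(f"random: {random_ctr:.3f}, organic-popularity: {organic_ctr:.3f}")
```

If the organic-aware agent did not outperform the random one in a setting built to favor it, that would point to a bug in the agent or the simulator rather than to a property of the data.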

This review was generated by AI and reviewed by human editors.