QUICK REVIEW

[Paper Review] RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

David Rohde, Stephen Bonner | arXiv (Cornell University) | Aug 2, 2018

Advanced Bandit Algorithms Research | References: 13 | Citations: 60

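One-Line Summary

RecoGym introduces an OpenAI Gym-compatible RL environment for product recommendation in online advertising. It models both organic and bandit user interactions, with the aim of bringing offline and online evaluation into closer alignment.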

ABSTRACT

Recommender Systems are becoming ubiquitous in many settings and take many forms, from product recommendation in e-commerce stores, to query suggestions in search engines, to friend recommendation in social networks. Current research directions, which are largely based upon supervised learning from historical data, appear to be showing diminishing returns, with many practitioners reporting a discrepancy between improvements in offline metrics for supervised learning and the online performance of the newly proposed models. One possible reason is that we are using the wrong paradigm: when looking at the long-term cycle of collecting historical performance data, creating a new version of the recommendation model, A/B testing it and then rolling it out, we see that there are a lot of commonalities with the reinforcement learning (RL) setup, where the agent observes the environment and acts upon it in order to change its state towards better states (states with higher rewards). To this end we introduce RecoGym, an RL environment for recommendation, which is defined by a model of user traffic patterns on e-commerce and the users' response to recommendations on the publisher websites. We believe that this is an important step forward for the field of recommendation systems research, one that could open up an avenue of collaboration between the recommender systems and reinforcement learning communities and lead to better alignment between offline and online performance metrics.

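Motivation and Objectives

Highlight the gap between offline metrics and online performance, and encourage a shift from recommender systems built purely on supervised learning toward reinforcement learning.
Provide a tunable RL environment that models both organic (site browsing) and bandit (advertising) interactions, for studying the long-term effects of recommendations.
Enable policies to be evaluated inside a controlled simulator that accounts for both user traffic patterns and the effect of ad exposure.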

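Proposed Method

Define a parameterized model of user traffic that includes organic sessions and bandit sessions on publisher websites.
Build an OpenAI Gym-compliant environment with Reset and Step routines for RL agents.
Incorporate a controllable correlation between organic and bandit behavior, along with tunable hidden user-item clusters.
Model the effect of ad exposure on click-through rate, allowing time-varying non-stationarity such as ad fatigue.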
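
The four design points above can be illustrated with a minimal, self-contained sketch. This is not RecoGym's actual implementation or API: the class name `ToyRecoEnv`, every parameter name, the latent-vector user model, and the linear fatigue penalty are assumptions made for illustration. The only convention taken from the text is the classic Gym shape, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class ToyRecoEnv:
    """Hypothetical Gym-style simulator; all names and dynamics are assumptions."""

    def __init__(self, num_products=5, latent_dim=3, correlation=1.0,
                 p_continue_browsing=0.6, fatigue=0.0, click_bias=-2.0, seed=0):
        self.num_products = num_products
        self.latent_dim = latent_dim
        self.p_continue_browsing = p_continue_browsing
        self.fatigue = fatigue        # click-score penalty per past exposure
        self.click_bias = click_bias  # keeps baseline click rates low
        self.rng = random.Random(seed)
        # Hidden product vectors that drive organic browsing.
        self.organic_emb = [self._noise() for _ in range(num_products)]
        # Bandit vectors blend the organic ones with independent noise.
        # `correlation` is a blending weight (1.0 = identical), not a
        # statistical correlation coefficient.
        self.bandit_emb = [
            [correlation * x + (1.0 - correlation) * n
             for x, n in zip(row, self._noise())]
            for row in self.organic_emb
        ]

    def _noise(self):
        return [self.rng.gauss(0.0, 1.0) for _ in range(self.latent_dim)]

    def _organic_session(self):
        """Sample product views until the user stops browsing."""
        weights = [math.exp(_dot(self.user, e)) for e in self.organic_emb]
        views = []
        while True:
            views.append(
                self.rng.choices(range(self.num_products), weights=weights)[0])
            if self.rng.random() >= self.p_continue_browsing:
                return views

    def reset(self):
        """Start a new user and return their first organic session."""
        self.user = self._noise()
        self.exposures = [0] * self.num_products
        return self._organic_session()

    def step(self, action):
        """Show product `action` as an ad; reward is 1 on a click, else 0."""
        score = (self.click_bias
                 + _dot(self.user, self.bandit_emb[action])
                 - self.fatigue * self.exposures[action])
        p_click = _sigmoid(score)
        reward = 1 if self.rng.random() < p_click else 0
        self.exposures[action] += 1
        observation = self._organic_session()
        done = False  # this sketch never ends a user's episode on its own
        return observation, reward, done, {"p_click": p_click}
```

In this sketch, lowering `correlation` makes organic views a weaker signal of which ads get clicked, and setting `fatigue` above zero makes repeated exposure to the same product progressively less likely to earn a click.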

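Experimental Results

Research Questions

RQ1: How can organic and bandit information be combined to improve recommendation performance as the amount of bandit data varies?
RQ2: How does the level of correlation between organic and bandit behavior affect the effectiveness of different learning strategies?
RQ3: In a moderate-data regime, can a single combined model outperform purely organic or purely bandit approaches?
RQ4: What sanity checks verify that an RL agent using both data sources in RecoGym is learning a sensible policy?
RQ5: Which baseline agents provide a reasonable benchmark for RL methods in this environment?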

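Key Findings

Introduces RecoGym as the first RL environment for recommendation in online advertising.
Supports both organic and bandit interactions, with tunable correlation and user-item clustering dimensions.
Provides baseline agents (Random, Logistic, Supervised-Prod2Vec) for interacting with the simulator.
Provides a validation framework that links organic and bandit data to expected performance across data regimes.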
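
To make the idea of benchmarking baseline agents against a simulator concrete, here is a minimal sketch of an agent interface and an evaluation loop that measures click-through rate. Everything in it is hypothetical: `StubEnv` is a stand-in simulator with made-up click probabilities, `RandomAgent` only mirrors the idea of the Random baseline named above, and `MostViewedAgent` is an illustrative organic-only heuristic that is not one of the paper's baselines. The Logistic and Supervised-Prod2Vec baselines are not reproduced here.

```python
import random
from collections import Counter


class StubEnv:
    """Hypothetical simulator: each user has a single favourite product."""

    def __init__(self, num_products=5, seed=0):
        self.num_products = num_products
        self.rng = random.Random(seed)

    def _organic_views(self):
        # Users mostly view their favourite product, with some noise.
        return [
            self.favourite if self.rng.random() < 0.8
            else self.rng.randrange(self.num_products)
            for _ in range(3)
        ]

    def reset(self):
        self.favourite = self.rng.randrange(self.num_products)
        return self._organic_views()

    def step(self, action):
        # Made-up click probabilities, chosen only for illustration.
        p_click = 0.3 if action == self.favourite else 0.02
        reward = int(self.rng.random() < p_click)
        return self._organic_views(), reward, False, {}


class RandomAgent:
    """Recommends a uniformly random product and ignores the observation."""

    def __init__(self, num_products, seed=0):
        self.num_products = num_products
        self.rng = random.Random(seed)

    def reset(self):
        pass

    def act(self, observation):
        return self.rng.randrange(self.num_products)


class MostViewedAgent:
    """Recommends the product the current user has viewed most often."""

    def __init__(self):
        self.counts = Counter()

    def reset(self):
        self.counts = Counter()

    def act(self, observation):
        self.counts.update(observation)
        return self.counts.most_common(1)[0][0]


def evaluate(agent, env, num_users=200, steps_per_user=10):
    """Return the click-through rate of `agent` over simulated users."""
    clicks = 0
    impressions = 0
    for _ in range(num_users):
        observation = env.reset()
        agent.reset()
        for _ in range(steps_per_user):
            action = agent.act(observation)
            observation, reward, done, _ = env.step(action)
            clicks += reward
            impressions += 1
            if done:
                break
    return clicks / impressions


if __name__ == "__main__":
    print("random:     ", evaluate(RandomAgent(5, seed=1), StubEnv(seed=1)))
    print("most viewed:", evaluate(MostViewedAgent(), StubEnv(seed=1)))
```

Because organic views and ad clicks are driven by the same favourite product in this stub, the organic-only heuristic should beat random recommendation. In a simulator where the two behaviours are only weakly correlated, that advantage would be expected to shrink.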

This review was generated by AI and checked by a human editor.