QUICK REVIEW

[Paper Review] Deep Reinforcement Learning for List-wise Recommendations

Xiangyu Zhao, Liang Zhang|arXiv (Cornell University)|Dec 30, 2017

Recommender Systems and Techniques | References: 32 | Cited by: 109

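One-Sentence Summary

This paper proposes LIRD, a deep reinforcement learning framework for list-wise recommendation that pairs an actor-critic architecture with an online environment simulator, so the model can be trained and evaluated offline before online deployment; it improves on the baselines on real-world e-commerce data.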

ABSTRACT

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wise recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.

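Motivation and Objectives

Motivation: recommender systems need to move beyond static, short-term strategies toward dynamic, long-term optimization.
Model user-recommender interactions as an MDP in order to maximize cumulative reward over time.
Develop an online environment simulator that enables offline pre-training and evaluation before going live.
Introduce a scalable list-wise RL framework (LIRD) that can handle a large and dynamically changing item space.
Demonstrate the effectiveness of list-wise recommendation on real-world e-commerce data.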

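Proposed Method

Model the recommender system as an MDP: the state s is the user's browsing history, the action a is a list of K recommended items, the reward r comes from user feedback, and γ is the discount factor.
Use an online environment simulator that maps (state, action) pairs to rewards using a memory of historical interactions and cosine similarity, which enables offline training.
Use an actor-critic architecture: the Actor generates state-specific weight vectors that score items and produce a list-wise action; the Critic estimates Q(s,a) with a deep Q-network approximator.
Train with Deep Deterministic Policy Gradient (DDPG), using experience replay, target networks, and prioritized sampling.
Follow a two-stage training procedure: first generate transitions from interactions, then update the Actor and Critic networks with mini-batches.
Evaluate the list-wise strategy by varying K, and compare against CF, FM, DNN, RNN, and DQN baselines on a real-world dataset.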
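
To make the Actor's list-wise action concrete, here is a minimal sketch of turning weight vectors into a list of K items. It assumes one weight vector per list slot, dot-product scoring against item embeddings, and greedy selection without repeats. The function name `listwise_action` and the array shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def listwise_action(weights, item_embeddings):
    """Pick one item per weight vector, never repeating an item.

    weights:         (K, d) array, one state-specific weight vector per slot
                     (assumed to come from the Actor network)
    item_embeddings: (N, d) array of candidate item embeddings, N >= K
    Returns a list of K distinct item indices.
    """
    weights = np.asarray(weights, dtype=float)
    item_embeddings = np.asarray(item_embeddings, dtype=float)
    if weights.shape[0] > item_embeddings.shape[0]:
        raise ValueError("need at least K candidate items")

    available = np.ones(item_embeddings.shape[0], dtype=bool)
    chosen = []
    for w in weights:
        scores = item_embeddings @ w      # dot-product score for every item
        scores[~available] = -np.inf      # mask items already placed in the list
        best = int(np.argmax(scores))
        chosen.append(best)
        available[best] = False
    return chosen
```

Each slot costs one matrix-vector product over the candidate set, so building the list does not require a separate Q-value evaluation per candidate action.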

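Experimental Results

Research Questions

RQ1: Does the proposed framework outperform representative baselines on the item recommendation task?
RQ2: How does list-wise recommendation (with varying K) affect performance in long-term scenarios?
RQ3: Can the online simulator provide reliable offline pre-training and thereby narrow the gap to online deployment?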

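Key Findings

The proposed framework outperforms the baselines in both short and long sessions, with larger gains in long sessions because it optimizes long-term reward.
List-wise recommendation with K=4 performs better than the other values of K, suggesting a balance between capturing item correlations and avoiding noise.
LIRD trains faster than DQN with comparable or better performance, because it reduces computation by not evaluating every action.
The online simulator makes offline training and evaluation possible, mitigates the offline-online gap, and provides parameter initialization for online use.
Using historical user-item embeddings and item history helps improve both the modeling of user preferences and scalability.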
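
The simulator behind the offline-training finding can be sketched as a memory lookup. The version below is a simplified illustration, not the paper's exact formulation: it assumes states and actions are already fixed-length vectors, mixes state and action cosine similarity with a hypothetical weight `alpha`, clips negative similarities to zero, and returns a similarity-weighted average of the stored rewards.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

class MemorySimulator:
    """Toy simulator: estimates the reward of a (state, action) pair from memory."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha    # assumed trade-off between state and action similarity
        self.memory = []      # entries of (state_vec, action_vec, reward)

    def add(self, state, action, reward):
        self.memory.append((np.asarray(state, dtype=float),
                            np.asarray(action, dtype=float),
                            float(reward)))

    def predict_reward(self, state, action):
        if not self.memory:
            raise ValueError("memory is empty")
        state = np.asarray(state, dtype=float)
        action = np.asarray(action, dtype=float)
        sims = np.array([
            self.alpha * cosine(state, s) + (1.0 - self.alpha) * cosine(action, a)
            for s, a, _ in self.memory
        ])
        sims = np.clip(sims, 0.0, None)   # assumption: ignore dissimilar entries
        rewards = np.array([r for _, _, r in self.memory])
        total = sims.sum()
        if total == 0.0:
            return float(rewards.mean())  # fallback when nothing is similar
        return float(sims @ rewards / total)
```

A pair that closely matches a stored interaction receives roughly that interaction's reward, while a pair equally similar to several stored interactions receives their average. This is what lets an agent be scored offline on actions that were never logged.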

This review was generated by AI and checked by a human editor.