QUICK REVIEW

[Paper Review] Reinforcement Learning based Recommender System using Biclustering Technique

Sungwoon Choi, Heonseok Ha|arXiv (Cornell University)|Jan 17, 2018

Recommender Systems and Techniques|References: 18|Cited by: 62

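One-Sentence Summary

The paper proposes a reinforcement-learning-based recommender system that is framed as a gridworld, uses biclustering to shrink the state and action spaces, and relies on online updates to support cold-start recommendation while keeping its recommendations explainable.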

ABSTRACT

A recommender system aims to recommend items that a user is interested in among many items. The need for the recommender system has been expanded by the information explosion. Various approaches have been suggested for providing meaningful recommendations to users. One of the proposed approaches is to consider a recommender system as a Markov decision process (MDP) problem and try to solve it using reinforcement learning (RL). However, existing RL-based methods have an obvious drawback. To solve an MDP in a recommender system, they encountered a problem with the large number of discrete actions that bring RL to a larger class of problems. In this paper, we propose a novel RL-based recommender system. We formulate a recommender system as a gridworld game by using a biclustering technique that can reduce the state and action space significantly. Using biclustering not only reduces space but also improves the recommendation quality effectively handling the cold-start problem. In addition, our approach can provide users with some explanation why the system recommends certain items. Lastly, we examine the proposed algorithm on a real-world dataset and achieve a better performance than the widely used recommendation algorithm.

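Motivation and Objectives

Motivate the use of reinforcement learning for sequential recommendation and address the large action space of RL-based recommender systems.
Introduce biclustering to build a gridworld-like Markov decision process that reduces the state and action spaces.
Support online updates so that user feedback dynamically changes the rewards and the policy.
Provide explainable recommendations by tying each recommendation to a specific bicluster (state).
Evaluate the approach empirically on the Movielens datasets and compare it against standard baselines.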

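Proposed Method

Formulate the recommender system as a gridworld MDP in which n^2 biclusters form the states and each state has at most four directional actions.
Map each state to a bicluster (U, I) using a two-dimensional embedding of the user vectors and a greedy nearest-placement algorithm.
Learn the Q-function with Q-learning or SARSA, using epsilon-greedy exploration.
Define the reward as the Jaccard distance between the user sets of adjacent states, to favor moves between similar user groups.
Generate recommendations by selecting the top-k start states, then visiting states under the epsilon-greedy policy and proposing their items.
Update the model online by adjusting the user sets of states according to observed satisfaction, which in turn changes the rewards and the policy.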
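
The reward and learning steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy grid `TOY_USERS`, the helper names, and all hyperparameter values are hypothetical. The reward is assumed here to be the overlap ratio |A ∩ B| / |A ∪ B| of the two user sets, which is one reading of the summary's statement that the reward favors similar user groups, and the update is the standard tabular Q-learning rule.

```python
import random

# Four directional actions as (row, col) offsets on the n x n grid.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Hypothetical 2 x 2 grid: each state (row, col) holds the user set of its bicluster.
TOY_USERS = {
    (0, 0): {1, 2, 3},
    (0, 1): {2, 3, 4},
    (1, 0): {7, 8},
    (1, 1): {8, 9},
}


def jaccard(a, b):
    """Overlap ratio |a & b| / |a | b| of two user sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def valid_actions(state, n):
    """Actions that keep the agent inside the n x n grid (at most four)."""
    r, c = state
    return [name for name, (dr, dc) in ACTIONS.items()
            if 0 <= r + dr < n and 0 <= c + dc < n]


def step(state, action, users):
    """Move to the neighboring state; the reward is the user-set overlap (assumed form)."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt, jaccard(users[state], users[nxt])


def q_learning(users, n, episodes=500, horizon=10,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on the bicluster grid."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in users for a in valid_actions(s, n)}
    for _ in range(episodes):
        state = (rng.randrange(n), rng.randrange(n))  # random start state
        for _ in range(horizon):
            acts = valid_actions(state, n)
            if rng.random() < epsilon:
                action = rng.choice(acts)  # explore
            else:
                action = max(acts, key=lambda a: q[(state, a)])  # exploit
            nxt, reward = step(state, action, users)
            best_next = max(q[(nxt, a)] for a in valid_actions(nxt, n))
            # Standard Q-learning update toward the bootstrapped target.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


q_table = q_learning(TOY_USERS, n=2)
```

In this toy grid the two top-row states share two of their four distinct users, so a move between them earns a reward of 0.5, while a move from (0, 0) down to (1, 0) earns 0 because those user sets are disjoint. Replacing `best_next` with the Q-value of the action actually chosen in the next state would turn the update into SARSA.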

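Experimental Results

Research Questions

RQ1: Can biclustering reduce the state and action spaces enough to make RL a feasible approach for recommender systems?
RQ2: Under cold-start conditions, does the gridworld-style RL method improve ranking metrics over standard methods?
RQ3: Do Q-learning and SARSA perform differently in this bicluster-based RL setting?
RQ4: Can the system explain its recommendations in terms of bicluster states?
RQ5: How does updating user-state associations online affect recommendations over time?

Key Findings

Under cold-start conditions on the Movielens datasets, the proposed method outperforms the global-average, user-based, and item-based baselines on P@30 and R@30.
On Movielens_100k, the proposed method achieves P@30 = 0.246 and R@30 = 0.169; on Movielens_1M, it achieves P@30 = 0.277 and R@30 = 0.155.
Q-learning and SARSA show similar learning curves and comparable performance in this environment.
The system can explain a recommendation by pointing to the corresponding bicluster state and its item and user groups.
Updating the state definitions online from user feedback adjusts the rewards and enables recommendations that adapt in real time.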
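
For readers unfamiliar with the reported metrics, the sketch below shows how P@k and R@k are commonly computed for a single user. The summary does not give the paper's exact evaluation protocol, so the conventional definitions are assumed here, and the function name and sample data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (P@k, R@k) for one user.

    recommended: ranked list of item ids, best first.
    relevant: set of item ids the user actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One of the top three recommendations ("b") is among the three relevant items.
p_at_3, r_at_3 = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
```

Here both values equal 1/3: one hit among three recommended items, and one of three relevant items retrieved. Dataset-level scores such as those reported above are typically averages of these per-user values.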

This review was generated by AI and reviewed by a human editor.