QUICK REVIEW

[Paper Review] Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

Peter Sunehag, Richard Evans | arXiv (Cornell University) | Dec 3, 2015

Reinforcement Learning in Robotics | References: 13 | Cited by: 32

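One-Sentence Summary

This paper proposes a deep reinforcement learning framework for Slate Markov Decision Processes (slate-MDPs) with high-dimensional state and action spaces, combining an attention mechanism with risk-seeking training to optimize whole slates of actions. By jointly modeling sequential and combinatorial value, the method significantly outperforms baseline agents in recommendation-system environments with action dimensionality as high as 2000.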

ABSTRACT

Many real-world problems come with action spaces represented as feature vectors. Although high-dimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as $2000$, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slate-MDPs). A Slate-MDP is an MDP with a combinatorial action space consisting of slates (tuples) of primitive actions of which one is executed in an underlying MDP. The agent does not control the choice of this executed action and the action might not even be from the slate, e.g., for recommendation systems for which all recommendations can be ignored. We use deep Q-learning based on feature representations of both the state and action to learn the value of whole slates. Unlike existing methods, we optimize for both the combinatorial and sequential aspects of our tasks. The new agent's superiority over agents that either ignore the combinatorial or sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system. Further, we use deep deterministic policy gradients to learn a policy that for each position of the slate, guides attention towards the part of the action space in which the value is the highest and we only evaluate actions in this area. The attention is used within a sequentially greedy procedure leveraging submodularity. Finally, we show how introducing risk-seeking can dramatically improve the agent's performance and ability to discover more far-reaching strategies.

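Motivation and Objectives

Address reinforcement learning in high-dimensional combinatorial action spaces, which are common in real-world applications such as recommendation systems.
Formalize Slate Markov Decision Processes (slate-MDPs), in which a single action is executed in the underlying MDP (typically, though not necessarily, one from the slate) while the agent must optimize the slate as a whole.
Overcome the limitations of standard reinforcement learning agents, which either treat the actions in a slate independently or must exhaustively evaluate every possible slate.
Develop a scalable method that uses an attention mechanism and deep Q-learning to focus on high-value regions of the action space and avoid full enumeration.
Show that risk-seeking training, implemented through a reward transformation, can discover long-term, high-reward strategies.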
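
The slate-MDP formalization named above can be written compactly. The following is only a sketch consistent with the abstract, not a quotation of the paper's definition: the symbols $ l $ (slate size), $ \varphi $ (the selection distribution over executed actions), $ P' $ and $ R' $ are notation assumed here for illustration. Given an underlying MDP with states $ S $, primitive actions $ A $, transition kernel $ P $ and reward $ R $, a slate $ \mathbf{a} $ induces:

$$
\mathbf{a} = (a_1, \dots, a_l) \in A^l, \qquad
P'(s' \mid s, \mathbf{a}) = \mathbb{E}_{a \sim \varphi(\cdot \mid s, \mathbf{a})}\big[ P(s' \mid s, a) \big], \qquad
R'(s, \mathbf{a}) = \mathbb{E}_{a \sim \varphi(\cdot \mid s, \mathbf{a})}\big[ R(s, a) \big]
$$

The agent chooses $ \mathbf{a} $ but not the executed action $ a $, and the support of $ \varphi $ is not restricted to the slate entries, which covers the case where every recommendation is ignored.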

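Proposed Method

Introduce slate-MDPs as a formal framework for problems in which the agent chooses a slate (an ordered tuple) of actions but the environment executes only one action.
Use deep Q-learning, paired with an attention mechanism, to learn the value of whole slates from feature representations of both the state and the actions.
Use the attention mechanism within a sequentially greedy procedure that fills the slate one position at a time, evaluates only the most promising subset of actions, and leverages submodularity for efficiency.
Train a parameterized policy network with deep deterministic policy gradients (DDPG) to guide attention toward high-value regions of the action space.
Introduce risk-seeking behavior by transforming the training reward to $ r^\alpha $ with $ \alpha > 1 $, inspired by prospect theory, to encourage exploration of high-variance, high-reward trajectories.
Combine nearest-neighbor lookup with value-function evaluation on a restricted candidate set to reduce computational cost while preserving performance.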
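
The sequentially greedy, attention-restricted slate construction described above could look roughly like the sketch below. It is illustrative only and not the authors' implementation: `build_slate`, `nearest_candidates`, `toy_attention_policy` and `make_toy_slate_value` are hypothetical names, the "attention policy" is a trivial stand-in for a trained DDPG actor, and the slate value is a toy submodular objective rather than a learned Q-network.

```python
import numpy as np


def nearest_candidates(proto_action, action_features, k):
    """Return indices of the k actions closest (Euclidean) to proto_action."""
    dists = np.linalg.norm(action_features - proto_action, axis=1)
    return np.argsort(dists)[:k]


def build_slate(state, action_features, slate_size, k, attention_policy, slate_value):
    """Fill the slate one position at a time, scoring only k nearby candidates."""
    slate = []
    for position in range(slate_size):
        # Hypothetical actor: proposes a point in action-feature space.
        proto = attention_policy(state, slate, position)
        # Fetch a few extra neighbours so k unused candidates remain after filtering.
        near = nearest_candidates(proto, action_features, k + len(slate))
        candidates = [int(i) for i in near if int(i) not in slate][:k]
        # Greedy step: keep the candidate that maximises the value of the extended slate.
        best = max(candidates, key=lambda i: slate_value(state, slate + [i]))
        slate.append(best)
    return slate


# --- Toy stand-ins (assumptions, not the paper's networks) ---
def toy_attention_policy(state, slate, position):
    return state  # pretend the actor points at the state vector itself


def make_toy_slate_value(action_features):
    def slate_value(state, slate):
        # Probability that at least one slate item is accepted: a submodular toy objective.
        p = 1.0 / (1.0 + np.exp(-action_features[slate] @ state))
        return 1.0 - np.prod(1.0 - p)
    return slate_value


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))   # 200 actions, 8-dimensional features
state = rng.normal(size=8)
slate = build_slate(state, features, slate_size=3, k=20,
                    attention_policy=toy_attention_policy,
                    slate_value=make_toy_slate_value(features))
print(slate)
```

Here only 20 of 200 actions (10%) are scored per slate position, so the value function is called 60 times instead of 600. The sketch assumes there are at least `slate_size` distinct actions; a real system would likely replace the brute-force distance computation with an approximate nearest-neighbor index.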

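Experimental Results

Research Questions

RQ1: Can deep reinforcement learning effectively optimize slate-MDPs with action spaces of up to 2000 dimensions, in which only a single action is executed?
RQ2: In the combinatorial slate setting, does attention-based value-function approximation outperform independent per-action value estimation?
RQ3: Can a policy network trained with deterministic policy gradients guide attention efficiently toward high-value action subsets without full enumeration?
RQ4: Does risk-seeking training via the $ r^\alpha $ transformation help the agent discover long-term strategies that are superior to those found by standard training?
RQ5: How does the full-slate agent compare with a simple Top-K baseline across slate sizes and action-space dimensionalities?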

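Key Findings

For every slate size larger than 1 and every action-space dimensionality tested, the full-slate agent significantly outperforms the simple Top-K baseline, and its advantage grows with slate size.
With a slate size of 1, the full-slate agent and the Top-K agent coincide because every action is evaluated, which confirms that the baseline is equivalent in this case.
An agent that evaluates only 10% of the candidate actions performs almost as well as one that evaluates all of them, demonstrating the effectiveness of attention-driven pruning.
The nearest-neighbor agent performs slightly worse and fluctuates more, but it explores more; its higher variability lets it outperform the other methods in certain scenarios.
With risk-seeking training ($ \alpha > 1 $), performance in the largest environment (N=13138) improves substantially and exceeds that of the best myopic policy.
In the N=13138 environment, the agent trained with risk-seeking obtains far higher long-term reward than the agent trained in the standard way, confirming the value of non-myopic exploration.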
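
Why a transformation $ r^\alpha $ with $ \alpha > 1 $ makes training risk-seeking can be seen in a small worked example. The sketch below is an illustration under stated assumptions, not the paper's code: `risk_seeking_reward` is a hypothetical helper, rewards are assumed non-negative, and the two reward distributions are invented for the example. Both options have the same mean raw reward, but the convex transform makes the high-variance option look better during training.

```python
import numpy as np


def risk_seeking_reward(r, alpha):
    """Training-time reward transform r -> r**alpha (non-negative r assumed)."""
    r = np.asarray(r, dtype=float)
    if np.any(r < 0):
        raise ValueError("this sketch assumes non-negative rewards")
    return r ** alpha


safe = np.array([0.5, 0.5])    # always pays 0.5
risky = np.array([0.0, 1.0])   # pays 0 or 1 with equal probability

for alpha in (1.0, 2.0):
    print(alpha,
          risk_seeking_reward(safe, alpha).mean(),
          risk_seeking_reward(risky, alpha).mean())
```

With $ \alpha = 1 $ both options average 0.5. With $ \alpha = 2 $ the safe option averages 0.25 and the risky one 0.5, so an agent trained on transformed rewards prefers the high-variance choice. Evaluation would still use the untransformed reward $ r $.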

This review was generated by AI and reviewed by a human editor.