QUICK REVIEW

[Paper Review] SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

Kimin Lee, Michael Laskin|arXiv (Cornell University)|Jul 9, 2020

Reinforcement Learning in Robotics | References: 60 | Cited by: 47

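One-Sentence Summary

SUNRISE is a simple unified ensemble method for off-policy deep reinforcement learning. It re-weights target Q-values using ensemble uncertainty and explores with upper-confidence bounds, with diversity enforced by bootstrapping, to improve SAC and Rainbow DQN on both continuous and discrete control tasks.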

ABSTRACT

Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from several issues, such as instability in Q-learning and balancing exploration and exploitation. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. By enforcing the diversity between agents using Bootstrap with random initialization, we show that these different ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.

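Motivation and Objectives

Address instability and low sample efficiency in off-policy deep reinforcement learning.
Propose a unified ensemble framework, compatible with SAC and Rainbow DQN, that improves their performance.
Use ensemble uncertainty to re-weight Bellman backups and guide exploration, yielding a better signal-to-noise ratio during learning.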

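Proposed Method

Introduces ensemble-based weighted Bellman backups, in which each agent i uses a weight w(s,a) driven by the standard deviation of the ensemble's target Q-values: w(s,a) = sigmoid(-Qstd_bar(s,a) * T) + 0.5, where T > 0 is a temperature (Equation 6).
Applies bootstrap with random initialization, using binary masks m_{t,i} during updates to enforce diversity among agents.
Selects exploration actions with an upper-confidence bound (mean + lambda * std) across the Q-functions: a_t = argmax_a [Q_mean(s_t,a) + lambda * Q_std(s_t,a)].
Combines weighted Bellman backups with existing off-policy methods (SAC for continuous control; Rainbow DQN for discrete control).
Provides an algorithm (SUNRISE, Algorithm 1) detailing the SAC-based training procedure, including weighted Bellman backups (WBB), bootstrap masks, and UCB exploration.
Demonstrates scalability and compatibility across continuous and discrete tasks, and analyzes the effect of ensemble size.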
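
The three ingredients listed above (uncertainty-based backup weights, bootstrap masks, and UCB action selection) can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' implementation: the function names, the use of the population standard deviation, and the Bernoulli parameter `beta` used to sample masks are assumptions made for this example, and `ucb_action` assumes a small discrete action set for simplicity.

```python
import math
import random
import statistics
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def backup_weight(target_q_values: Sequence[float], temperature: float) -> float:
    """w(s, a) = sigmoid(-Q_std(s, a) * T) + 0.5, following the formula in the summary.

    target_q_values holds one target Q estimate per ensemble member for the
    same (s, a). Population standard deviation is an assumption of this sketch.
    """
    q_std = statistics.pstdev(target_q_values)
    return sigmoid(-q_std * temperature) + 0.5


def bootstrap_mask(n_members: int, beta: float, rng: random.Random) -> List[int]:
    """Binary mask m_{t,i}: member i trains on the transition only if its entry is 1.

    Sampling each entry from Bernoulli(beta) is an assumption of this sketch.
    """
    return [1 if rng.random() < beta else 0 for _ in range(n_members)]


def ucb_action(q_values_per_member: Sequence[Sequence[float]], lam: float) -> int:
    """argmax_a [Q_mean(s, a) + lam * Q_std(s, a)] over a discrete action set.

    q_values_per_member[i][a] is member i's Q estimate for action a.
    """
    n_actions = len(q_values_per_member[0])
    scores = []
    for a in range(n_actions):
        qs = [member[a] for member in q_values_per_member]
        scores.append(statistics.fmean(qs) + lam * statistics.pstdev(qs))
    return max(range(n_actions), key=scores.__getitem__)
```

Because the standard deviation is non-negative, the weight equals 1.0 when the ensemble members agree exactly and falls toward 0.5 as their disagreement grows, so uncertain targets are down-weighted rather than discarded. Setting `lam` to 0 reduces `ucb_action` to greedy selection on the ensemble mean.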

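Experimental Results

Research Questions

RQ1: Does SUNRISE improve off-policy RL algorithms such as SAC and Rainbow DQN on continuous and discrete tasks?
RQ2: How critical are weighted Bellman backups for improving learning stability and data efficiency?
RQ3: Is UCB-based exploration beneficial in environments with sparse or noisy rewards?
RQ4: Do SUNRISE's gains come from more than simply using a single larger network or more updates?
RQ5: How does ensemble size affect performance, and where does it saturate?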

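Key Findings

SUNRISE consistently improves SAC on continuous control benchmarks and outperforms several model-based baselines on OpenAI Gym and DeepMind Control Suite.
SUNRISE also improves Rainbow DQN on Atari games, outperforming CURL and SimPLe on multiple games.
Weighted Bellman backups substantially improve learning stability and data efficiency, especially in settings with noisy rewards; their gains exceed those of DisCor in complex environments.
UCB exploration with the ensemble improves performance on sparse-reward tasks.
The ensemble's gains do not come merely from more updates or a larger network; an ensemble of five members provides robust improvements, with diminishing returns beyond five.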

This review was generated by AI and reviewed by human editors.