QUICK REVIEW

[Paper Review] Multi-task Deep Reinforcement Learning with PopArt

Matteo Hessel, Hubert Soyer|arXiv (Cornell University)|Sep 12, 2018

Reinforcement Learning in Robotics | References: 51 | Cited by: 24

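One-Sentence Summary

This paper applies PopArt normalization to multi-task deep reinforcement learning so that a single agent can learn many tasks at once, with each task contributing comparably to the updates. By making value-function updates invariant to reward scale and sparsity, the method reaches state-of-the-art results with one policy and one shared set of weights: it exceeds median human performance across 57 Atari games and achieves a mean human-normalized score of 72.8% on 30 DeepMind Lab tasks.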

ABSTRACT

The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at the time, each new task requiring to train a brand new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state of the art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy - with a single set of weights - that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state of the art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.

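Motivation and Objectives

Address the imbalanced learning dynamics in multi-task reinforcement learning that arise when reward scale and sparsity differ across tasks.
Enable a single agent to learn many diverse tasks simultaneously without sacrificing performance on any one of them.
Develop a method that automatically adapts each task's contribution to the learning updates, so that all tasks have a similar impact on policy optimization.
Improve data efficiency and training stability in parallel multi-task RL by making value-function updates invariant to reward magnitude and sparsity.
Show that a single shared policy can exceed median human performance across a broad set of environments, a milestone for multi-task reinforcement learning.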
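
To make the imbalance described above concrete, the following is a minimal sketch, not taken from the paper. It assumes two hypothetical tasks whose returns differ by a factor of about 1000 and compares the average gradient magnitude of a squared-error value loss before and after per-task normalization. It uses batch statistics for simplicity, whereas PopArt tracks running statistics; the task names and distributions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical returns for two tasks with very different reward scales.
returns = {
    "small_rewards": rng.normal(1.0, 0.5, 10_000),
    "large_rewards": rng.normal(1000.0, 500.0, 10_000),
}

def mean_abs_grad(targets, prediction=0.0):
    """Mean |d/dv 0.5 * (v - G)^2| = mean |v - G| for a fixed prediction v."""
    return np.mean(np.abs(prediction - targets))

# Raw targets: the large-reward task dominates the gradient.
raw = {name: mean_abs_grad(g) for name, g in returns.items()}

# Per-task normalized targets: both tasks contribute at a similar scale.
normalized = {
    name: mean_abs_grad((g - g.mean()) / g.std()) for name, g in returns.items()
}

raw_ratio = raw["large_rewards"] / raw["small_rewards"]
normalized_ratio = normalized["large_rewards"] / normalized["small_rewards"]
print(f"raw gradient ratio: {raw_ratio:.1f}")
print(f"normalized gradient ratio: {normalized_ratio:.2f}")
```

With shared network weights, the raw ratio means one task's value loss would swamp the other's; after normalization the two are comparable.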

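Proposed Method

The method applies PopArt normalization to the value-function head of an actor-critic network, so that learning state-value estimates is invariant to the scale of returns.
PopArt normalizes the value-function output using running estimates of the mean and standard deviation of returns; these statistics are updated adaptively rather than by backpropagation.
The normalization parameters (μ and σ) are updated online during training with a step size (decay rate) of β = 3×10⁻⁴, which keeps the statistics stable and avoids numerical issues.
The value function is trained with a modified loss on the normalized output, and a linear transformation of the output layer preserves the unnormalized outputs when the statistics change, maintaining the integrity of the value estimates.
The method is integrated into the IMPALA framework: the normalization statistics are updated online, and the standard actor-critic update is then applied.
Hyperparameters are tuned with population-based training (PBT); β and the normalization bounds require no manual tuning.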
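
The update described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a linear value head with one output and one set of statistics per task, and the class name, method names, and the clipping bounds `sigma_min` and `sigma_max` are hypothetical choices made for this example.

```python
import numpy as np

class PopArtHead:
    """Linear value head with one PopArt-normalized output per task (sketch)."""

    def __init__(self, num_features, num_tasks, beta=3e-4,
                 sigma_min=1e-4, sigma_max=1e6, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(num_tasks, num_features))
        self.b = np.zeros(num_tasks)
        self.mu = np.zeros(num_tasks)   # running mean of returns
        self.nu = np.ones(num_tasks)    # running second moment of returns
        self.beta = beta
        # Assumed bounds that keep sigma away from zero and from overflow.
        self.sigma_min, self.sigma_max = sigma_min, sigma_max

    def sigma(self):
        var = np.maximum(self.nu - self.mu ** 2, 0.0)
        return np.clip(np.sqrt(var), self.sigma_min, self.sigma_max)

    def normalized_value(self, h, task):
        return self.w[task] @ h + self.b[task]

    def value(self, h, task):
        # Unnormalized value: sigma * n(s) + mu.
        return self.sigma()[task] * self.normalized_value(h, task) + self.mu[task]

    def normalize_target(self, target, task):
        # Regression target for the normalized output.
        return (target - self.mu[task]) / self.sigma()[task]

    def update_stats(self, target, task):
        """Move mu/sigma toward the new return, then rescale w and b so
        that the unnormalized value() is unchanged by the statistics update."""
        old_mu, old_sigma = self.mu[task], self.sigma()[task]
        self.mu[task] = (1 - self.beta) * old_mu + self.beta * target
        self.nu[task] = (1 - self.beta) * self.nu[task] + self.beta * target ** 2
        new_mu, new_sigma = self.mu[task], self.sigma()[task]
        self.w[task] *= old_sigma / new_sigma
        self.b[task] = (old_sigma * self.b[task] + old_mu - new_mu) / new_sigma


# Usage: a large return shifts the statistics but not the predicted value.
head = PopArtHead(num_features=4, num_tasks=2, beta=0.1)
h = np.array([0.5, -1.0, 2.0, 0.25])
before = head.value(h, task=0)
head.update_stats(target=1000.0, task=0)
after = head.value(h, task=0)
print(np.isclose(before, after))
```

The rescaling of `w` and `b` is what lets the statistics change without backpropagation while the unnormalized prediction stays fixed. The larger `beta=0.1` in the usage lines is only there to make the effect visible in one step.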

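Experimental Results

Research Questions

RQ1: Can a single deep reinforcement learning agent master many diverse tasks at once while keeping performance balanced across all of them?
RQ2: How can differences in reward scale and sparsity across tasks be mitigated so that no task dominates the learning dynamics?
RQ3: Compared with standard value-function updates, does PopArt normalization improve data efficiency and training stability in multi-task deep reinforcement learning?
RQ4: Can a single shared policy reach superhuman performance across a large set of environments such as Atari-57 and DmLab-30?
RQ5: To what extent does scale-invariant value-function learning yield better generalization and performance in the multi-task reinforcement learning setting?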

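Key Findings

On the 57-game Atari benchmark, the PopArt-based method achieves a median human-normalized score of 110%, exceeding median human performance with a single shared policy.
On the 30-task DeepMind Lab benchmark, the method achieves a mean human-normalized score of 72.8%, a new state of the art for multi-task reinforcement learning.
Thanks to the adaptive normalization of value-function updates, the method is more data-efficient while adding very little computational overhead.
The method balances learning across tasks whose reward magnitude and sparsity vary widely, preventing any single task from dominating training.
The results indicate that a single agent can generalize across diverse environments and exceed human-level performance on more than half of the Atari games simultaneously, as implied by a median score above 100%.
The method is compatible with existing multi-task reinforcement learning frameworks such as IMPALA and can be combined with other techniques such as policy distillation and active sampling.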
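
For readers unfamiliar with the human-normalized scores reported above, the following sketch shows how such a score is commonly computed: 100 × (agent − random) / (human − random), aggregated by median or mean across tasks. The raw scores are invented for illustration, and the optional cap at 100 is an assumption about the aggregation, not a detail stated in this review.

```python
import numpy as np

def human_normalized(agent, random_baseline, human, cap=None):
    """Per-task score where 0 means random play and 100 means human level."""
    agent, random_baseline, human = (
        np.asarray(x, dtype=float) for x in (agent, random_baseline, human)
    )
    scores = 100.0 * (agent - random_baseline) / (human - random_baseline)
    if cap is not None:
        scores = np.minimum(scores, cap)  # limit credit from any single task
    return scores

# Hypothetical raw scores for three tasks (not from the paper).
agent_scores = [50.0, 300.0, 12.0]
random_scores = [0.0, 100.0, 2.0]
human_scores = [100.0, 200.0, 22.0]

scores = human_normalized(agent_scores, random_scores, human_scores)
median_score = np.median(scores)
mean_capped = human_normalized(
    agent_scores, random_scores, human_scores, cap=100.0
).mean()

print(scores)        # per-task scores: 50, 200, 50
print(median_score)  # 50.0
print(round(mean_capped, 2))
```

The median is insensitive to a few outlier games, while a capped mean prevents one superhuman task from hiding weak performance elsewhere.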

This review was generated by AI and reviewed by human editors.