QUICK REVIEW

[Paper Digest] Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction

Hongyao Tang, Jianye Hao|arXiv (Cornell University)|Sep 25, 2018

Reinforcement Learning in Robotics | References: 38 | Cited by: 33

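One-Sentence Summary

This paper proposes a hierarchical deep multiagent reinforcement learning method with temporal abstraction to address sparse and delayed rewards in cooperative multiagent environments. It decomposes the task into high-level coordination and low-level skills and introduces a new experience replay mechanism, Augmented Concurrent Experience Replay (ACER), so that learning can proceed efficiently at different time scales. On sparse-reward tasks such as Fever Basketball Defense and Multiagent Trash Collection, it clearly outperforms standard MARL methods.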

ABSTRACT

Multiagent reinforcement learning (MARL) is commonly considered to suffer from non-stationary environments and exponentially increasing policy space. It would be even more challenging when rewards are sparse and delayed over long trajectories. In this paper, we study hierarchical deep MARL in cooperative multiagent problems with sparse and delayed reward. With temporal abstraction, we decompose the problem into a hierarchy of different time scales and investigate how agents can learn high-level coordination based on the independent skills learned at the low level. Three hierarchical deep MARL architectures are proposed to learn hierarchical policies under different MARL paradigms. Besides, we propose a new experience replay mechanism to alleviate the issue of the sparse transitions at the high level of abstraction and the non-stationarity of multiagent learning. We empirically demonstrate the effectiveness of our approaches in two domains with extremely sparse feedback: (1) a variety of Multiagent Trash Collection tasks, and (2) a challenging online mobile game, i.e., Fever Basketball Defense.

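Motivation and Goals

Address the difficulty of learning effective policies in cooperative multiagent reinforcement learning (MARL) when rewards are sparse and delayed.
Explore hierarchical MARL with temporal abstraction in the deep learning setting, so that learning takes place at multiple time scales.
Mitigate non-stationarity in multiagent training and the sparsity of high-level transitions through a new experience replay mechanism.
Validate the hierarchical architectures under different MARL paradigms in environments that resemble real-world settings.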

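Proposed Method

Three hierarchical deep MARL architectures are proposed, each matching a different MARL paradigm: h-IL (hierarchical independent learners), h-Comm (hierarchical communication network), and h-Qmix (hierarchical Qmix).
A two-level hierarchy is used: low-level policies learn basic skills, and high-level policies coordinate over subgoals and the sub-transitions they span.
Augmented Concurrent Experience Replay (ACER) augments high-level transitions with sub-transition information and replays experience concurrently across agents, which stabilizes learning.
h-Qmix and h-Comm use centralized training with decentralized execution (CTDE), whereas h-IL trains each agent independently while still learning high-level coordination.
Temporal abstraction breaks a long-horizon task into manageable subtasks, easing both credit assignment and exploration.
h-Qmix uses a joint action-value function and h-Comm introduces explicit communication to improve high-level coordination.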
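The two-level hierarchy and the ACER idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy line-world, the scripted low-level skill, the random high-level goal choice, and all names (`HighLevelTransition`, `ConcurrentReplay`, `rollout`) are hypothetical, and every goal is assumed to last a fixed `goal_len` steps (synchronous termination). It shows only how a high-level transition can carry its sub-transitions and how agents can be sampled at shared indices.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HighLevelTransition:
    """One high-level step, augmented with the low-level steps it spanned."""
    obs: Tuple[int, ...]        # joint positions when the goal was chosen
    goal: int                   # subgoal selected by the high-level policy
    reward: float               # environment reward collected under this goal
    next_obs: Tuple[int, ...]   # joint positions when the goal terminated
    done: bool
    sub_transitions: List[tuple] = field(default_factory=list)


class ConcurrentReplay:
    """Per-agent buffers that are written and sampled at shared indices."""

    def __init__(self, n_agents: int, capacity: int):
        self.buffers = [[] for _ in range(n_agents)]
        self.capacity = capacity

    def __len__(self) -> int:
        return len(self.buffers[0])

    def add(self, joint: List[HighLevelTransition]) -> None:
        # One transition per agent, all from the same high-level step.
        assert len(joint) == len(self.buffers)
        for buf, tr in zip(self.buffers, joint):
            buf.append(tr)
            if len(buf) > self.capacity:
                buf.pop(0)

    def sample(self, batch_size: int, rng: random.Random):
        # The same indices are used for every agent, so each batch row
        # describes one shared moment of joint experience.
        # Assumes the buffer is non-empty.
        idx = [rng.randrange(len(self)) for _ in range(batch_size)]
        return [[buf[i] for i in idx] for buf in self.buffers]


def rollout(n_agents, horizon, goal_len, replay, rng, n_cells=5):
    """Toy episode: agents walk on a line toward goals; reward is sparse."""
    positions = [rng.randrange(n_cells) for _ in range(n_agents)]
    t = 0
    while t < horizon:
        start = tuple(positions)
        # Placeholder high-level policy: a random goal cell per agent.
        goals = [rng.randrange(n_cells) for _ in range(n_agents)]
        subs = [[] for _ in range(n_agents)]
        reward = 0.0
        for _ in range(goal_len):
            if t >= horizon:
                break
            for i in range(n_agents):
                before = positions[i]
                # Scripted low-level skill: move one cell toward the goal.
                action = (goals[i] > before) - (goals[i] < before)
                positions[i] = before + action
                intrinsic = 1.0 if positions[i] == goals[i] else 0.0
                subs[i].append((before, action, intrinsic, positions[i]))
            t += 1
            # Sparse, delayed team reward: only at the final step,
            # and only if all agents occupy distinct cells.
            if t == horizon and len(set(positions)) == n_agents:
                reward += 1.0
        end, done = tuple(positions), t >= horizon
        replay.add([
            HighLevelTransition(start, goals[i], reward, end, done, subs[i])
            for i in range(n_agents)
        ])


if __name__ == "__main__":
    rng = random.Random(0)
    replay = ConcurrentReplay(n_agents=2, capacity=100)
    rollout(n_agents=2, horizon=12, goal_len=4, replay=replay, rng=rng)
    batch = replay.sample(batch_size=4, rng=rng)
    print(len(replay), len(batch), len(batch[0]))
```

With a horizon of 12 and goals lasting 4 steps, each agent stores only 3 high-level transitions per episode, which illustrates why high-level experience is scarce and why keeping the sub-transitions alongside each one is attractive. How the sub-transitions are actually used to generate extra training data is left out here, since the digest does not specify it.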

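Experimental Results

Research Questions

RQ1: Can hierarchical deep MARL with temporal abstraction learn cooperative policies effectively in environments with sparse and delayed rewards?
RQ2: How does learning at multiple time scales, with low-level skills and high-level coordination, improve the sample efficiency and performance of MARL?
RQ3: To what extent does the proposed ACER mechanism alleviate sparse high-level transitions and non-stationarity in multiagent training?
RQ4: How do different MARL paradigms (independent, communication-based, value-based) perform under hierarchical abstraction in sparse-reward environments?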

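Key Findings

h-IL outperforms both IL-DQN and Low-Level-Only, showing that hierarchical learning with temporal abstraction is valuable in sparse-reward environments.
h-Comm and h-Qmix outperform h-IL: in Fever Basketball Defense, h-Comm reaches a 36% block rate and h-Qmix reaches 37%.
ACER markedly improves high-level policy learning: h-IL-ACER raises the block rate of base h-IL from 27% to 36%, which brings it close to h-Comm.
h-Comm and h-Qmix converge to different defensive strategies: joint defense, which favors a higher block rate, and one-on-one defense, which favors better coverage. The difference shows up in both their scores and their observed behavior.
Asynchronous termination lowers performance by 3–5%, indicating that non-stationarity remains a challenge in asynchronous hierarchical MARL.
ACER improves h-IL more than it improves h-Comm, suggesting that it is more effective at stabilizing independent learning than at helping communication-based or value-based architectures.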
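To put the ACER finding above in relative terms, the arithmetic below uses only the two block rates quoted for h-IL and h-IL-ACER (0.27 and 0.36). The symbols $b_{\text{h-IL}}$ and $b_{\text{h-IL-ACER}}$ are shorthand introduced here, not notation from the paper.

```latex
\Delta_{\text{abs}} = b_{\text{h-IL-ACER}} - b_{\text{h-IL}} = 0.36 - 0.27 = 0.09,
\qquad
\Delta_{\text{rel}} = \frac{\Delta_{\text{abs}}}{b_{\text{h-IL}}} = \frac{0.09}{0.27} \approx 33\%.
```

In other words, the gain is 9 percentage points in absolute terms, or roughly one third relative to the base h-IL block rate.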

This digest was generated by AI and reviewed by a human editor.