QUICK REVIEW

[Paper Review] Episodic Multi-agent Reinforcement Learning with Curiosity-Driven Exploration

Lulu Zheng, Jiarui Chen|arXiv (Cornell University)|Nov 22, 2021

Reinforcement Learning in Robotics|Cited by 40

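One-Sentence Summary

EMC introduces a curiosity-driven intrinsic reward based on the prediction error of individual Q-values and uses episodic memory to improve sample efficiency, achieving strong coordination and outperforming MARL baselines on the StarCraft II micromanagement benchmark.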

ABSTRACT

Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel Episodic Multi-agent reinforcement learning with Curiosity-driven exploration, called EMC. We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function captures the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration to new or promising states. We illustrate the advantages of our method by didactic examples, and demonstrate its significant outperformance over state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.

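Motivation and Goals

Enable efficient coordination and exploration in cooperative multi-agent reinforcement learning under centralized training with decentralized execution (CTDE).
Propose a curiosity-driven mechanism that guides exploration by predicting individual Q-values.
Use episodic memory to regularize learning and to reuse informative past experience.
Remain compatible with value factorization frameworks such as VDN, QMIX, and QPLEX, to improve scalability.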

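Proposed Method

Within a linear value factorization framework, define curiosity as the prediction error of individual Q-values.
Compute the intrinsic reward r^int as the L2 distance between the predicted and the extrinsic individual Q-values, averaged over agents.
Train the inference module with a one-step temporal-difference target that combines the extrinsic and intrinsic rewards.
Maintain an episodic memory keyed on the global state that stores the best return seen so far, and use it to form a memory target H for regularization.
Apply soft updates to the targets to stabilize learning.
Integrate the curiosity module and the memory into CTDE-based MARL algorithms (such as VDN, QMIX, or QPLEX) to obtain EMC.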
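
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and class names, the hyperparameter values (beta, lam, tau, gamma), the use of caller-supplied hashable state keys, and the one-step form of the memory target (reward plus the discounted best stored return of the next state) are all hypothetical choices made for this example.

```python
import numpy as np


def intrinsic_reward(predicted_q, extrinsic_q):
    """Average over agents of the L2 distance between predicted and
    extrinsic individual Q-values. Inputs: shape (n_agents, n_actions)."""
    predicted_q = np.asarray(predicted_q, dtype=float)
    extrinsic_q = np.asarray(extrinsic_q, dtype=float)
    per_agent = np.linalg.norm(predicted_q - extrinsic_q, axis=1)
    return float(per_agent.mean())


class EpisodicMemory:
    """Stores, per state key, the best discounted return observed so far.
    Keys are assumed to be hashable summaries of the global state."""

    def __init__(self):
        self.best_return = {}

    def update_from_episode(self, state_keys, rewards, gamma=0.99):
        # Walk the episode backwards to accumulate discounted returns.
        ret = 0.0
        for key, r in zip(reversed(state_keys), reversed(rewards)):
            ret = r + gamma * ret
            self.best_return[key] = max(ret, self.best_return.get(key, -np.inf))

    def lookup(self, key):
        return self.best_return.get(key)


def memory_target(memory, r_ext, next_key, gamma=0.99):
    """Assumed one-step memory target H: reward plus the discounted best
    stored return of the next state. Returns None if the state is unseen."""
    best_next = memory.lookup(next_key)
    if best_next is None:
        return None
    return r_ext + gamma * best_next


def regularized_td_loss(q_tot, q_tot_next_max, r_ext, r_int, h_target,
                        beta=0.1, lam=0.1, gamma=0.99):
    """One-step TD loss on extrinsic plus weighted intrinsic reward, with an
    optional squared-error pull toward the memory target H."""
    td_target = r_ext + beta * r_int + gamma * q_tot_next_max
    td_loss = (td_target - q_tot) ** 2
    mem_loss = 0.0 if h_target is None else (h_target - q_tot) ** 2
    return td_loss + lam * mem_loss


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward its online value."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

In a full training loop these pieces would operate on batched tensors produced by the mixing network; scalars and plain lists are used here only to keep the arithmetic easy to follow.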

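Experimental Results

Research Questions

RQ1: Does curiosity based on predicting individual Q-values lead to better coordinated exploration than curiosity based on predicting observation histories?
RQ2: Does EMC outperform state-of-the-art baselines on challenging MARL tasks such as SMAC?
RQ3: How do the curiosity module and episodic memory affect learning efficiency and stability?
RQ4: How does EMC scale with the number of agents under CTDE and value factorization?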

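Key Findings

On difficult SMAC tasks, EMC significantly outperforms state-of-the-art MARL baselines.
On hard maps such as corridor and 3s5z_vs_3s6z, EMC achieves the best performance and learns quickly.
EMC performs strongly across 17 SMAC scenarios, usually ranking at or near the top in median win rate and achieving the best result on several maps.
Ablation studies show that curiosity-based exploration is critical on challenging tasks, while episodic memory mainly improves sample efficiency.
The method is compatible with several factorization schemes (VDN, QMIX, QPLEX) under the CTDE paradigm.
Didactic experiments show that the benefit for coordinated exploration is more pronounced when predicting Q-values rather than observation histories.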

This review was generated by AI and reviewed by human editors.