QUICK REVIEW

[Paper Review] Deep Reinforcement Learning for Swarm Systems

Maximilian Hüttenrauch, Adrian Šošić|arXiv (Cornell University)|Jul 17, 2018

Insect Pheromone Research and Control | Cited by 153

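One-Sentence Summary

This paper proposes mean embeddings to represent swarm neighbor information in deep MARL, yielding permutation-invariant, scalable policies; the approach is evaluated with TRPO on rendezvous and pursuit-evasion tasks.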

ABSTRACT

Recently, deep reinforcement learning (RL) methods have been applied successfully to multi-agent scenarios. Typically, these methods rely on a concatenation of agent states to represent the information content required for decentralized decision making. However, concatenation scales poorly to swarm systems with a large number of homogeneous agents as it does not exploit the fundamental properties inherent to these systems: (i) the agents in the swarm are interchangeable and (ii) the exact number of agents in the swarm is irrelevant. Therefore, we propose a new state representation for deep multi-agent RL based on mean embeddings of distributions. We treat the agents as samples of a distribution and use the empirical mean embedding as input for a decentralized policy. We define different feature spaces of the mean embedding using histograms, radial basis functions and a neural network learned end-to-end. We evaluate the representation on two well known problems from the swarm literature (rendezvous and pursuit evasion), in a globally and locally observable setup. For the local setup we furthermore introduce simple communication protocols. Of all approaches, the mean embedding representation using neural network features enables the richest information exchange between neighboring agents facilitating the development of more complex collective strategies.
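
As a reading aid, the empirical mean embedding described in the abstract can be written as follows. The symbols are notation chosen here for illustration and may differ from the paper's: $\mathcal{N}_i$ is the set of neighbors agent $i$ currently observes, $o_{i,j}$ is what agent $i$ observes about neighbor $j$, and $\phi$ is the feature map (histogram, RBF, or neural network).

```latex
\hat{\mu}_i \;=\; \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \phi\!\left(o_{i,j}\right)
```

Because the sum is unaffected by the order of its terms and the result has the dimension of $\phi$ regardless of $|\mathcal{N}_i|$, this form reflects the two properties the abstract names: agents are interchangeable, and their exact number does not change the input size.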

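Research Motivation and Goals

Address the challenge of high-dimensional, variable-size observations in swarm MARL.
Propose a state representation based on mean embeddings to encode neighbor information.
Evaluate neural network, histogram, and radial basis function (RBF) feature spaces for the mean embedding.
Demonstrate centralized learning with decentralized execution using TRPO in swarm settings.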

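Proposed Method

Model swarm agents as homogeneous agents that act under partial observability and share a single policy.
Treat neighbor observations as samples from a distribution and feed their empirical mean embedding to the policy as input.
Explore feature spaces for the mean embedding: neural networks, histograms, and radial basis functions.
Compare mean embeddings against concatenation and pooling-based methods in globally and locally observable scenarios.
Train policies with Trust Region Policy Optimization (TRPO) under centralized learning and decentralized execution.
Introduce simple communication protocols in the locally observable setting to augment observations.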
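
The mean-embedding step listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer ReLU feature map, the layer sizes, the random (untrained) weights, and the names `feature_map`, `mean_embedding`, and `policy_input` are all hypothetical. In the paper the feature network is learned end-to-end together with the policy.

```python
import numpy as np

def feature_map(obs, w1, b1, w2, b2):
    """Hypothetical feature map phi: a two-layer ReLU network applied to each row of obs."""
    hidden = np.maximum(0.0, obs @ w1 + b1)
    return np.maximum(0.0, hidden @ w2 + b2)

def mean_embedding(neighbor_obs, params):
    """Average phi over all neighbors; the result has shape (feature_dim,)."""
    feature_dim = params[2].shape[1]
    if neighbor_obs.shape[0] == 0:
        # Assumption: an agent with no visible neighbors receives a zero embedding.
        return np.zeros(feature_dim)
    return feature_map(neighbor_obs, *params).mean(axis=0)

def policy_input(own_obs, neighbor_obs, params):
    """Assumed policy input: the agent's own observation concatenated with the embedding."""
    return np.concatenate([own_obs, mean_embedding(neighbor_obs, params)])

rng = np.random.default_rng(0)
obs_dim, hidden_dim, feature_dim = 4, 16, 8  # illustrative sizes only
params = (
    rng.normal(size=(obs_dim, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, feature_dim)), np.zeros(feature_dim),
)

neighbors = rng.normal(size=(5, obs_dim))      # 5 observed neighbors
shuffled = neighbors[rng.permutation(5)]       # same neighbors, different order
many_neighbors = rng.normal(size=(50, obs_dim))

emb = mean_embedding(neighbors, params)
emb_shuffled = mean_embedding(shuffled, params)
emb_many = mean_embedding(many_neighbors, params)
x = policy_input(rng.normal(size=3), neighbors, params)
```

Here `emb` and `emb_shuffled` agree up to floating-point error, and `emb_many` has the same shape as `emb`, so the policy input stays fixed in size whether the agent sees 5 neighbors or 50. A concatenation-based input, by contrast, would grow with the neighbor count and depend on neighbor ordering.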

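Experimental Results

Research Questions

RQ1: Can mean embeddings provide a permutation-invariant, scalable representation of swarm neighbor information for deep MARL?
RQ2: How do neural network, histogram, and RBF mean embeddings compare in learning effective swarm policies?
RQ3: Do mean-embedding inputs improve learning speed and policy quality over concatenation or other pooling methods?
RQ4: How do global and local observability affect the learned swarm behavior and performance?
RQ5: How do communication protocols affect policy performance under local observability?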

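Key Findings

Mean embeddings with neural network features enable the richest information exchange between neighboring agents.
Compared with the baselines, mean embeddings yield faster learning and higher-quality policies on swarm tasks.
Neural network embeddings can incorporate more informative observations without increasing the input dimensionality.
Histogram and RBF embeddings face greater dimensionality challenges and may blur or discretize neighbor information.
Local communication protocols can augment the mean-embedding input and improve performance in locally observable settings.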
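
The dimensionality concern for histogram and RBF features noted in the findings can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the two-dimensional observation space scaled to [-1, 1], the 5 bins per dimension, the RBF bandwidth of 0.5, and the function names. None of these values come from the paper.

```python
import numpy as np

def histogram_embedding(neighbor_obs, edges):
    """Normalized multi-dimensional histogram of neighbor observations, flattened."""
    counts, _ = np.histogramdd(neighbor_obs, bins=edges)
    return counts.ravel() / max(len(neighbor_obs), 1)

def rbf_embedding(neighbor_obs, centers, bandwidth):
    """Mean Gaussian RBF activation per center, averaged over neighbors."""
    diff = neighbor_obs[:, None, :] - centers[None, :, :]   # (N, M, d)
    sq_dist = (diff ** 2).sum(axis=-1)                      # (N, M)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=0)

bins_per_dim, obs_dim = 5, 2  # illustrative grid resolution and observation size
edges = [np.linspace(-1.0, 1.0, bins_per_dim + 1)] * obs_dim

# Place one RBF center per grid point (a 5 x 5 grid, so 25 centers).
axis = np.linspace(-1.0, 1.0, bins_per_dim)
centers = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1).reshape(-1, obs_dim)

rng = np.random.default_rng(0)
neighbors = rng.uniform(-1.0, 1.0, size=(7, obs_dim))

hist = histogram_embedding(neighbors, edges)
rbf = rbf_embedding(neighbors, centers, bandwidth=0.5)

# Feature count for a grid layout grows exponentially with observation dimension.
features_needed = {d: bins_per_dim ** d for d in (2, 3, 4)}
```

Both embeddings are fixed in size and independent of neighbor order, but a grid layout needs `bins_per_dim ** d` features for a `d`-dimensional observation: 25 in two dimensions, 625 in four. The histogram also discards where a neighbor sits within its bin, and wide RBF kernels smear nearby neighbors together, which matches the blurring and discretization effects mentioned above. A learned neural network feature map avoids committing to such a grid.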

This review was generated by AI and reviewed by a human editor.