QUICK REVIEW

[Paper Review] Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents

Kaiqing Zhang, Zhuoran Yang|arXiv (Cornell University)|Feb 23, 2018

Distributed Control, Multi-Agent Systems | References: 71 | Cited by: 250

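One-Sentence Summary

This paper develops two actor-critic algorithms for fully decentralized multi-agent reinforcement learning over time-varying networks, using consensus-based critics with function approximation, and proves convergence under linear function approximation.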

ABSTRACT

We consider the problem of fully decentralized multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.

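Motivation and Objectives

Motivate and formalize a fully decentralized MARL setting in which agents on a time-varying network use only their local rewards and communication with neighbors to maximize the globally averaged return.
Propose two decentralized actor-critic algorithms with function approximation that require no central controller.
Scale to large state spaces and large numbers of agents by combining local policies with consensus-based value estimation.
Establish theoretical convergence guarantees for the proposed algorithms under linear function approximation.
Validate the theory empirically through simulations.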

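Proposed Method

Formulate a networked multi-agent MDP with a time-varying communication graph and local rewards.
Derive a policy gradient theorem for MARL that decomposes across agents and uses local policies.
Propose two decentralized actor-critic algorithms in which each actor update is local and the critic update is a consensus step among neighbors.
Approximate Q and V with function approximators that have local parameters, and share the estimates across the network through consensus steps.
Provide two online, incremental algorithms, with variants based on either the state-value TD error or the action-value TD error.
Establish convergence guarantees under linear function approximation and analyze the consensus updates.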
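The local-actor, consensus-critic structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact algorithm: it assumes a linear action-value critic Q(s, a; w) = phi(s, a)^T w, an average-reward form of the TD error, and a given consensus weight matrix `C`. All names (`critic_consensus_step`, `local_advantage`, `actor_step`, `beta_w`, `beta_theta`) are hypothetical and do not come from the summary.

```python
import numpy as np

def critic_consensus_step(omega, mu, phi_t, phi_next, rewards, C, beta_w):
    """One critic step for all N agents (assumed linear critic, average-reward TD).

    omega:    (N, d) local critic parameters, one row per agent
    mu:       (N,)   local running estimates of the long-run average reward
    phi_t:    (d,)   features of the current (state, joint action)
    phi_next: (d,)   features of the next (state, joint action)
    rewards:  (N,)   local rewards; each entry is used only by its own agent
    C:        (N, N) consensus weights, C[i, j] > 0 only if j can send to i
    """
    q_t = omega @ phi_t                    # each agent's own estimate of Q_t
    q_next = omega @ phi_next              # each agent's own estimate of Q_{t+1}
    delta = rewards - mu + q_next - q_t    # local TD errors
    mu_new = (1.0 - beta_w) * mu + beta_w * rewards
    # Local TD update, computed from local information only.
    omega_tilde = omega + beta_w * delta[:, None] * phi_t[None, :]
    # Consensus: each agent takes a weighted average of its neighbors' parameters.
    omega_new = C @ omega_tilde
    return omega_new, mu_new, delta

def local_advantage(q_own_actions, pi_i, a_i):
    """Q at the taken action minus the policy-weighted mean over agent i's own actions."""
    return q_own_actions[a_i] - pi_i @ q_own_actions

def actor_step(theta_i, score_i, advantage_i, beta_theta):
    """Local policy-gradient ascent step; score_i is grad log pi_i(s, a_i)."""
    return theta_i + beta_theta * advantage_i * score_i

# Toy call with random data and uniform averaging weights (illustrative only).
rng = np.random.default_rng(0)
N, d = 4, 3
omega = rng.normal(size=(N, d))
mu = np.zeros(N)
phi_t, phi_next = rng.normal(size=d), rng.normal(size=d)
rewards = rng.normal(size=N)
C = np.full((N, N), 1.0 / N)
omega_new, mu_new, delta = critic_consensus_step(
    omega, mu, phi_t, phi_next, rewards, C, beta_w=0.1
)
```

Note that only the critic parameters cross the network in this sketch; rewards and policy parameters stay with their owning agent, which matches the decentralized structure the abstract describes.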

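Experimental Results

Research Questions

RQ1: How can fully decentralized MARL be formulated for a system of networked agents with local rewards and no central controller?
RQ2: Do decentralized actor-critic algorithms converge when function approximation is used over a time-varying network topology?
RQ3: What role do consensus updates play in achieving network-wide optimality in MARL?
RQ4: How does linear function approximation affect the convergence guarantees of the proposed framework?
RQ5: Do the proposed algorithms scale to large numbers of agents and high-dimensional state-action spaces while remaining operable online?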

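Key Findings

Two decentralized actor-critic algorithms with function approximation are proposed for networked MARL over time-varying graphs.
A policy gradient theorem for MARL is established, which allows local actor updates to be combined with consensus-based critic estimates.
Convergence is proved for both algorithms under linear function approximation.
The algorithms are fully incremental and can be implemented online; agents never transmit their individual rewards, which protects their privacy.
Simulations with both linear and nonlinear function approximation validate the proposed methods and support the theory.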
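To illustrate why a consensus step can stand in for a central coordinator, the sketch below averages local values over a time-varying graph. It assumes Metropolis weights, a standard way to build doubly stochastic weights for an undirected graph; the summary does not say how the paper chooses its weights, and the helper names (`metropolis_weights`, `ring`, `path`) are hypothetical. The graph alternates between a ring and a path, and agents exchange only their current estimates.

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic weights for an undirected graph (assumed choice)."""
    adj = np.array(adj, dtype=bool)        # copy so the caller's matrix is untouched
    np.fill_diagonal(adj, False)           # ignore self-loops when counting neighbors
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                C[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()         # remaining mass stays with agent i
    return C

def ring(n):
    """Adjacency matrix of an undirected ring on n nodes."""
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = True
    return adj

def path(n):
    """Adjacency matrix of an undirected path: the ring with one edge removed."""
    adj = ring(n)
    adj[0, n - 1] = adj[n - 1, 0] = False
    return adj

n = 5
graphs = [metropolis_weights(ring(n)), metropolis_weights(path(n))]
x = np.arange(n, dtype=float)              # one local value per agent
target = x.mean()                          # the network-wide average
for t in range(200):
    x = graphs[t % 2] @ x                  # topology changes at every step
```

Because each weight matrix here is doubly stochastic, the network-wide mean is preserved at every step, and on connected graphs the local values contract toward it. In this sketch every entry of `x` ends up numerically equal to `target`.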

This review was generated by AI and reviewed by human editors.