QUICK REVIEW

[Paper Review] Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning

Ahmed Rashwan, Keith Briggs|arXiv (Cornell University)|Jan 16, 2026

Reinforcement Learning in Robotics | Cited by 0

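One-Sentence Summary

Summary: The paper introduces the Diffusion Value Function (DVF), a factored critic for graph-based multi-agent reinforcement learning, and builds DA2C and LD-GNN on top of it to enable scalable, decentralised learning with communication-aware policies; it reports performance gains on a firefighting benchmark and on distributed computation tasks.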

ABSTRACT

Credit assignment is a core challenge in multi-agent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.

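Motivation and Objectives

Address credit assignment in large-scale MARL where local interactions are represented by a graph.
Propose a factored value function that diffuses rewards over the influence graph with temporal discounting and spatial attenuation.
Provide a critic that can be dropped into standard RL algorithms and estimated scalably with graph neural networks.
Develop a diffusion-based algorithm (DA2C) and a sparse message-passing actor (LD-GNN) for decentralised learning under communication costs.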

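Proposed Method

Define DVF as a per-agent value component obtained by diffusing rewards over the influence graph, combining temporal discounting with spatial attenuation.
Prove that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property.
Use DVF as a drop-in critic in standard RL algorithms and estimate it with graph neural networks.
Introduce Diffusion A2C (DA2C) and Learned DropEdge GNN (LD-GNN) for decentralised learning under communication constraints.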
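This review does not reproduce the paper's formula for DVF, so the following is only a toy illustration of the general idea, not the authors' definition. It assumes an undirected influence graph, uses Metropolis weights as a stand-in diffusion operator (a hypothetical choice), and computes per-agent values with the backward recursion v_t = r_t + gamma * W v_{t+1}, so that a reward t steps in the future is both discounted by gamma^t and spread over t hops. Because the assumed W is doubly stochastic, the per-agent values sum to the global discounted return, which mirrors the averaging property mentioned above.

```python
def metropolis_weights(adj):
    """Build a symmetric, doubly stochastic mixing matrix from a 0/1
    adjacency matrix of an undirected graph (assumed diffusion operator)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                W[i][j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        # The diagonal is still 0 here, so this is the off-diagonal row sum.
        W[i][i] = 1.0 - sum(W[i])
    return W


def diffused_returns(rewards, W, gamma):
    """Per-agent values from a finite reward sequence.

    rewards: list over time of per-agent reward lists
    Applies v_t = r_t + gamma * W v_{t+1}, starting from v_T = 0.
    """
    n = len(W)
    values = [0.0] * n
    for r_t in reversed(rewards):
        mixed = [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]
        values = [r_t[i] + gamma * mixed[i] for i in range(n)]
    return values


# Path graph 0 - 1 - 2, two time steps of per-agent rewards.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
rewards = [[1.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
gamma = 0.9
W = metropolis_weights(adj)
values = diffused_returns(rewards, W, gamma)
global_return = sum(gamma ** t * sum(r) for t, r in enumerate(rewards))
```

In this example the reward of 3 earned by agent 1 at the second step is shared equally among all three agents after one diffusion step, while agent 0 keeps its own immediate reward in full. The sum of `values` matches `global_return`, so each agent receives a local learning signal without losing any of the global objective.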

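Experimental Results

Research Questions

RQ1: Does a factored value function obtained by diffusing rewards over the influence graph give each agent a better learning signal than global or naive local critics?
RQ2: Is DVF well-defined, does it admit a Bellman fixed point, and does it decompose the global value via averaging?
RQ3: How does a diffusion-based critic perform in standard RL settings and with decentralised learners under communication costs?
RQ4: Do DA2C and LD-GNN improve performance over baselines on graph-based MARL tasks?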

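Key Findings

DVF assigns each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation.
DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property.
DA2C and LD-GNN, with DVF as the critic, enable scalable decentralised learning with communication-aware policies.
On the firefighting benchmark and the three distributed computation tasks, DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.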
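The review does not describe how LD-GNN decides which messages to send, so the sketch below is not the paper's architecture. It only illustrates what sparse message passing under a communication cost can look like: each directed edge is used with some probability, and the number of messages actually sent is counted. The function name, the `keep_prob` table (a hypothetical stand-in for whatever the actor learns), and the mean aggregation are all assumptions made for illustration.

```python
import random


def sparse_message_pass(features, edges, keep_prob, rng):
    """One round of mean aggregation over a randomly thinned edge set.

    features:  list of per-agent scalar features
    edges:     list of directed (sender, receiver) pairs
    keep_prob: dict mapping each edge to its probability of being used
               (hypothetical placeholder for a learned quantity)
    rng:       random.Random instance, passed in for reproducibility
    Returns (new_features, messages_sent).
    """
    inbox = [[] for _ in features]
    sent = 0
    for src, dst in edges:
        # rng.random() lies in [0, 1), so probability 1.0 always sends
        # and probability 0.0 never sends.
        if rng.random() < keep_prob[(src, dst)]:
            inbox[dst].append(features[src])
            sent += 1
    new_features = []
    for own, msgs in zip(features, inbox):
        pool = [own] + msgs  # every agent always keeps its own feature
        new_features.append(sum(pool) / len(pool))
    return new_features, sent


features = [0.0, 3.0, 6.0]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
rng = random.Random(0)

dense, dense_cost = sparse_message_pass(
    features, edges, {e: 1.0 for e in edges}, rng)
silent, silent_cost = sparse_message_pass(
    features, edges, {e: 0.0 for e in edges}, rng)
```

With every edge kept, all four messages are sent and each agent averages over its neighbourhood; with every edge dropped, no messages are sent and the features are unchanged. A communication-aware objective could then trade task reward against the returned message count, though how the paper weighs the two is not stated in this review.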

This review was generated by AI and reviewed by human editors.