QUICK REVIEW

[Paper Review] Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning

Ying Wen, Yaodong Yang | arXiv (Cornell University) | Jan 26, 2019

Reinforcement Learning in Robotics | Cited by 51

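One-Sentence Summary

Introduces PR2, a probabilistic recursive reasoning framework for multi-agent deep reinforcement learning that uses variational Bayes to model opponents' conditional policies and derives the decentralized algorithms PR2-Q and PR2-AC, which come with convergence guarantees in self-play.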

ABSTRACT

Humans are capable of attributing latent mental contents such as beliefs or intentions to others. The social skill is critical in daily life for reasoning about the potential consequences of others' behaviors so as to plan ahead. It is known that humans use such reasoning ability recursively by considering what others believe about their own beliefs. In this paper, we start from level-$1$ recursion and introduce a probabilistic recursive reasoning (PR2) framework for multi-agent reinforcement learning. Our hypothesis is that it is beneficial for each agent to account for how the opponents would react to its future behaviors. Under the PR2 framework, we adopt variational Bayes methods to approximate the opponents' conditional policies, to which each agent finds the best response and then improve their own policies. We develop decentralized-training-decentralized-execution algorithms, namely PR2-Q and PR2-Actor-Critic, that are proved to converge in the self-play scenarios when there exists one Nash equilibrium. Our methods are tested on both the matrix game and the differential game, which have a non-trivial equilibrium where common gradient-based methods fail to converge. Our experiments show that it is critical to reason about how the opponents believe about what the agent believes. We expect our work to contribute a new idea of modeling the opponents to the multi-agent reinforcement learning community.

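Motivation and Objectives

Motivate the use of recursive reasoning to model how opponents react to an agent's future actions.
Propose a probabilistic framework (PR2) that accounts for opponents' beliefs about the agent through learned conditional policies.
Develop decentralized-training-decentralized-execution algorithms (PR2-Q and PR2-AC) on top of this framework.
Provide theoretical convergence guarantees in self-play when a single Nash equilibrium exists.
Demonstrate better performance than baselines in matrix games, differential games, and particle-world environments.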

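Proposed Method

Model the joint policy with a level-1 recursive factorization that captures how opponents respond to the agent's action.
Approximate the opponents' conditional policy with variational inference, denoted $\rho^{-i}_{\phi^{-i}}(a^{-i} \mid s, a^{i})$.
Derive a multi-agent policy gradient that contains the expected Q-value under the opponents' conditional policy (the PR2-GD update).
Provide decentralized-training-decentralized-execution algorithms (PR2-AC and PR2-Q) that do not require access to the opponents' policy parameters.
Prove convergence of PR2 in self-play when a unique Nash equilibrium exists, via a contraction operator for PR2 soft value iteration.
In continuous action spaces, use amortized Stein variational gradient descent (SVGD) to sample from the opponents' conditional policy.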
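
The level-1 factorization and the expected Q-value under the opponent model can be made concrete for a single state with discrete actions. The sketch below is illustrative only and is not the authors' implementation. It assumes the joint policy factorizes as $\pi^{i}(a^{i} \mid s)\,\rho^{-i}(a^{-i} \mid s, a^{i})$ and that the opponent model takes an energy-based (soft-Q) form, $\rho^{-i}(a^{-i} \mid s, a^{i}) \propto \exp\big((Q^{i}(s, a^{i}, a^{-i}) - Q^{i}(s, a^{i}))/\alpha\big)$. The temperature `alpha`, the function names, and the example Q-table are all assumptions made for illustration.

```python
import math


def soft_marginal_q(joint_q_row, alpha=1.0):
    """Soft marginal Q(a_i) = alpha * log sum_{a_-i} exp(Q(a_i, a_-i) / alpha).

    joint_q_row holds Q(a_i, a_-i) for one fixed a_i and every opponent action.
    The maximum is subtracted first for numerical stability.
    """
    m = max(joint_q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in joint_q_row))


def opponent_conditional_policy(joint_q, alpha=1.0):
    """Assumed opponent model rho(a_-i | a_i), one row per own action a_i."""
    policy = []
    for row in joint_q:
        marginal = soft_marginal_q(row, alpha)
        # exp((Q - soft marginal) / alpha) is a softmax over opponent actions.
        policy.append([math.exp((q - marginal) / alpha) for q in row])
    return policy


def expected_q_under_opponent_model(joint_q, alpha=1.0):
    """Expected Q(a_i, a_-i) with a_-i drawn from rho(. | a_i)."""
    rho = opponent_conditional_policy(joint_q, alpha)
    return [
        sum(p * q for p, q in zip(rho_row, q_row))
        for rho_row, q_row in zip(rho, joint_q)
    ]


if __name__ == "__main__":
    # Hypothetical joint Q-table: rows are own actions, columns are opponent actions.
    joint_q = [[0.0, 3.0], [1.0, 2.0]]
    for a_i, rho_row in enumerate(opponent_conditional_policy(joint_q)):
        print("a_i =", a_i, "rho(a_-i | a_i) =", [round(p, 3) for p in rho_row])
    print("expected Q per own action:", expected_q_under_opponent_model(joint_q))
```

In a policy-gradient setting, the per-action expectation returned by `expected_q_under_opponent_model` would play the role of the Q-term that weights the gradient of the agent's own log-policy. With function approximation or continuous actions, the closed-form softmax would be replaced by a learned or sampled approximation.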

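Experimental Results

Research Questions

RQ1: Can recursive reasoning about opponents' beliefs outperform learning with a non-correlated factorization in multi-agent reinforcement learning?
RQ2: How can variational inference be used to model opponents' conditional policies in a controllable, decentralized training setting?
RQ3: Do PR2-Q and PR2-AC converge to the equilibrium in self-play when a single Nash equilibrium exists?
RQ4: Do PR2 methods outperform conventional baselines in matrix games, differential games, and particle-world environments?
RQ5: How does reasoning about opponents' beliefs affect exploration and convergence in continuous action spaces?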

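Key Findings

PR2 lets an agent account for how opponents will respond to its actions, which leads to better learning than the baselines.
In self-play scenarios with one Nash equilibrium, PR2-Q and PR2-AC converge.
In the iterated matrix game, PR2 avoids the non-convergent rotational dynamics observed with Infinitesimal Gradient Ascent (IGA) and reaches the central equilibrium.
In the differential game whose reward is the maximum of two quadratic functions, PR2-AC converges to the global equilibrium, while many baselines get stuck in a local optimum.
PR2 methods perform well in both cooperative and competitive particle-world settings, particularly with respect to decentralized execution.
Variational inference offers a practical way to approximate opponents' conditional policies, making multi-agent reasoning scalable.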
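
The rotational behaviour attributed to Infinitesimal Gradient Ascent above can be reproduced in a few lines. The sketch below is a hedged illustration, not a reproduction of the paper's experiment: it uses matching pennies as a stand-in zero-sum game (an assumption, not the game used in the paper), and the function name, step size, and starting point are all hypothetical. Each player independently ascends the gradient of its own expected payoff. With a finite step size, the strategies circle the mixed equilibrium at (0.5, 0.5) and drift slowly outward instead of converging.

```python
import math


def iga_matching_pennies(p0, q0, step=0.01, n_steps=200):
    """Simultaneous gradient ascent on matching pennies.

    p and q are the probabilities that player 1 and player 2 play their first
    action. Player 1's expected payoff is V1 = 4pq - 2p - 2q + 1 and player 2
    receives V2 = -V1. Returns the distance from (0.5, 0.5) after every step.
    """
    p, q = p0, q0
    radii = [math.hypot(p - 0.5, q - 0.5)]
    for _ in range(n_steps):
        grad_p = 4.0 * q - 2.0  # dV1/dp
        grad_q = 2.0 - 4.0 * p  # dV2/dq
        # Both players update from the same old (p, q), then clip to [0, 1].
        p, q = (
            min(1.0, max(0.0, p + step * grad_p)),
            min(1.0, max(0.0, q + step * grad_q)),
        )
        radii.append(math.hypot(p - 0.5, q - 0.5))
    return radii


if __name__ == "__main__":
    radii = iga_matching_pennies(0.6, 0.5)
    print("initial distance from equilibrium:", round(radii[0], 4))
    print("final distance from equilibrium:  ", round(radii[-1], 4))
```

The distance from the equilibrium never shrinks in this run, which is the non-convergence that independent gradient-based learners exhibit in this kind of game and that conditioning on the opponent's response is meant to address.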


This review was generated by AI and reviewed by a human editor.