QUICK REVIEW

[Paper Review] Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Sriram Srinivasan, Marc Lanctot | arXiv (Cornell University) | Oct 21, 2018

Reinforcement Learning in Robotics | Cited by 71

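One-Sentence Summary

The paper connects actor-critic policy gradients to regret minimization in partially observable multiagent games, proposes regret-based policy updates, and evaluates them on poker domains, where they converge empirically to approximate Nash equilibria.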

ABSTRACT

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero sum games, without any domain-specific state space reductions.

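Motivation and Goals

Motivate and formalize policy gradient and actor-critic methods in multiagent, partially observable environments.
Relate actor-critic updates to regret minimization and counterfactual regret in game-theoretic terms.
Propose and analyze several regret-based policy update rules.
Demonstrate model-free, online learning in partially observable adversarial sequential decision problems.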

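Proposed Method

Define several policy update rules inspired by regret minimization: Regret Policy Gradient (RPG), Regret Matching Policy Gradient (RMPG), and a Q-value-based counterpart (QPG).
Relate counterfactual values to standard Q-values under partial observability through a Bayes normalization term, which yields an approximation of counterfactual regret.
Approximate the policy and the value function with an actor-critic architecture built on neural network function approximators, trained end to end in a model-free, online fashion.
Supply the theoretical link through the PGPI/ACPI dynamics, with proofs of sublinear regret bounds for the tabular two-player zero-sum case.
Evaluate on zero-sum, partially observable multiagent games (Kuhn and Leduc poker) against baseline agents and a CFR reference.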
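
The three update rules can be illustrated at a single information state. The sketch below is not the authors' implementation: it assumes a softmax policy over logits, treats the critic's Q-values as fixed numbers, and uses hypothetical names (`policy_gradients`, `softmax_jacobian`). QPG weights the policy gradient by the advantage, RMPG by the positive part of the advantage, and RPG descends the sum of positive advantages directly.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


def softmax_jacobian(pi):
    """J[a, k] = d pi[a] / d logits[k] for a softmax policy."""
    return np.diag(pi) - np.outer(pi, pi)


def policy_gradients(logits, q):
    """Gradient-ascent directions for one state (assumption: q is held fixed).

    QPG : sum_a grad(pi_a) * adv_a
    RMPG: sum_a grad(pi_a) * max(adv_a, 0)
    RPG : -sum_a grad(max(adv_a, 0)), differentiating the baseline pi . q
    where adv_a = q_a - sum_b pi_b q_b.
    """
    pi = softmax(logits)
    jac = softmax_jacobian(pi)
    adv = q - pi @ q
    positive_adv = np.maximum(adv, 0.0)

    qpg = jac.T @ adv
    rmpg = jac.T @ positive_adv
    # d adv_a / d logits = -(jac.T @ q) for every action a, so the negated
    # gradient of sum_a relu(adv_a) is (#positive advantages) * (jac.T @ q).
    rpg = np.count_nonzero(adv > 0) * (jac.T @ q)
    return {"QPG": qpg, "RPG": rpg, "RMPG": rmpg}


if __name__ == "__main__":
    grads = policy_gradients(np.zeros(3), np.array([1.0, 0.0, -1.0]))
    for name, g in grads.items():
        print(name, g)
```

Under these assumptions the RPG direction is the QPG direction scaled by the number of actions with positive advantage; that is a property of this softmax sketch and need not hold once the critic is learned jointly or the parameters are shared across states.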

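Experimental Results

Research Questions

RQ1: Can actor-critic methods be grounded in regret minimization in partially observable multiagent environments?
RQ2: Under partial observability and multiagent interaction, how does counterfactual regret relate to standard advantage estimates?
RQ3: When learning online and model-free, do regret-based actor-critic updates converge to approximate Nash equilibria in zero-sum poker?
RQ4: Which of the proposed updates (RPG, RMPG, QPG) performs best in practice on adversarial sequential decision problems?
RQ5: How do these methods compare with the CFR baseline in convergence speed and robustness?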
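
For RQ2, the relationship can be written compactly. The notation here is assumed for illustration rather than quoted from the paper: s is an information state of player i, h ranges over the histories consistent with s, \eta^{\pi}_{-i}(h) is the probability that the opponents and chance reach h, v^c_i is the counterfactual value, and B_{-i} is the Bayes normalization term.

```latex
B_{-i}(\pi, s) = \sum_{h \in s} \eta^{\pi}_{-i}(h),
\qquad
q_{\pi,i}(s, a) = \frac{v^{c}_{i}(\pi, s, a)}{B_{-i}(\pi, s)}

q_{\pi,i}(s, a) - \sum_{b} \pi(s, b)\, q_{\pi,i}(s, b)
  = \frac{1}{B_{-i}(\pi, s)}
    \Bigl( v^{c}_{i}(\pi, s, a) - \sum_{b} \pi(s, b)\, v^{c}_{i}(\pi, s, b) \Bigr)
```

Read this way, the usual advantage at s is the instantaneous counterfactual regret divided by B_{-i}(\pi, s), so the two share a sign and differ only by a state-dependent positive scale.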

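Key Findings

The actor-critic variants converge to approximate Nash equilibria in Kuhn and Leduc poker, at rates similar to or better than the model-free baseline (NFSP).
QPG and RPG generally outperform RMPG in the reported poker experiments.
The methods are model-free and online, so they avoid storing large transition buffers while still converging well.
Against fixed bots derived from CFR, RPG and QPG perform favorably in the long run, and in self-play they often outperform the NFSP baseline.
The work establishes a theoretical link between regret minimization and standard policy gradient updates in partially observable MARL.
All methods run without domain-specific state-space reductions, which suggests potential to generalize across adversarial multiagent environments.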
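
"Approximate Nash equilibrium" is commonly quantified by NashConv: the total amount the players could gain by each switching to a best response. The sketch below computes it for a one-shot two-player zero-sum matrix game, which is a much simpler setting than the poker benchmarks; the function name `nash_conv` and the example game are illustrative assumptions.

```python
import numpy as np


def nash_conv(payoff, row_policy, col_policy):
    """NashConv of a policy pair in a zero-sum matrix game.

    payoff[a, b] is the row player's payoff; the column player receives
    -payoff[a, b]. The result is 0 exactly at a Nash equilibrium and
    positive otherwise.
    """
    payoff = np.asarray(payoff, dtype=float)
    x = np.asarray(row_policy, dtype=float)
    y = np.asarray(col_policy, dtype=float)

    value = x @ payoff @ y               # row player's expected payoff
    row_best = np.max(payoff @ y)        # best pure response to col_policy
    col_best = np.max(-(x @ payoff))     # best pure response to row_policy
    return (row_best - value) + (col_best + value)


if __name__ == "__main__":
    matching_pennies = [[1, -1], [-1, 1]]
    print(nash_conv(matching_pennies, [0.5, 0.5], [0.5, 0.5]))
    print(nash_conv(matching_pennies, [1.0, 0.0], [1.0, 0.0]))
```

In sequential games such as Kuhn or Leduc poker the best responses have to be computed over the full game tree, but the quantity being tracked is the same sum of per-player improvement gaps.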


This review was generated by AI and checked by a human editor.