QUICK REVIEW

[Paper Review] Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

Natasha Jaques, Angeliki Lazaridou|arXiv (Cornell University)|Oct 19, 2018

Experimental Behavioral Economics Studies | Cited by 262

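One-Sentence Summary

This paper proposes a social influence intrinsic reward for multi-agent reinforcement learning. The reward uses counterfactual reasoning to measure how strongly one agent causally influences the others, which coordinates the agents and yields meaningful emergent communication without centralized training.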

ABSTRACT

We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken, and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically increasing the learning curves of the deep RL agents, and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.

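Research Motivation and Objectives

Encourage coordination and communication in multi-agent reinforcement learning through an intrinsic reward based on social influence.
Define and compute causal influence with counterfactual reasoning, to quantify how much one agent affects the others.
Show that the influence reward is equivalent to maximizing the mutual information between agents' actions, which promotes coordination.
Show that influence can be computed from an internal model of other agents (MOA), so that agents can be trained independently and still coordinate.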
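
The link between the influence reward and mutual information can be written out in a few lines. This is a sketch under one assumption that this summary does not state: the "change" in another agent's behavior is measured as the KL divergence between agent j's policy conditioned on agent k's action and agent j's marginal policy. The symbols (s for the state, a^k and a^j for the two agents' actions, c^k for agent k's influence reward) are notation introduced here for illustration.

```latex
\begin{align}
p(a^j \mid s) &= \sum_{\tilde{a}^k} p(a^j \mid \tilde{a}^k, s)\, p(\tilde{a}^k \mid s) \\
c^k &= D_{\mathrm{KL}}\!\left[\, p(a^j \mid a^k, s) \,\middle\|\, p(a^j \mid s) \,\right] \\
\mathbb{E}_{a^k \sim p(\cdot \mid s)}\!\left[ c^k \right]
  &= \sum_{a^k,\, a^j} p(a^k, a^j \mid s) \log \frac{p(a^j \mid a^k, s)}{p(a^j \mid s)}
   = I(A^k; A^j \mid s)
\end{align}
```

Under that assumption, averaging the per-action influence over agent k's own policy gives exactly the mutual information between the two agents' actions in state s.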

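Proposed Method

Define an intrinsic influence reward that uses counterfactual actions to quantify how much one agent changes another agent's action distribution.
Relate the influence reward to the mutual information between agents' actions, and verify the improved coordination empirically.
Extend the framework with an explicit communication channel guided by the influence reward, and evaluate the quality of the emergent communication.
Introduce a Model of Other Agents (MOA) so that agents can be trained independently and can compute influence without centralized access.
Train policies end to end from pixels with a recurrent architecture and A3C-style updates, applying curriculum learning to the weight of the influence reward.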
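
The counterfactual computation described in the list above might look roughly like the following for a single pair of agents with discrete actions. It is a minimal sketch, not the authors' implementation: it assumes KL divergence as the measure of change, assumes the action actually taken has nonzero probability under agent k's policy, and takes agent j's conditional policy as a given table (in practice it would come from the other agent or from a learned MOA). The function names, the linear `influence_weight` schedule, and the additive `total_reward` combination are all hypothetical choices made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def marginal_policy(cond_policy_j, policy_k):
    """Agent j's policy with agent k's action averaged out.

    cond_policy_j[a][b] is p(a_j = b | a_k = a, s).
    policy_k[a] is p(a_k = a | s).
    """
    n_actions_j = len(cond_policy_j[0])
    return [
        sum(policy_k[a] * cond_policy_j[a][b] for a in range(len(policy_k)))
        for b in range(n_actions_j)
    ]


def influence_reward(cond_policy_j, policy_k, action_k):
    """Influence of agent k's chosen action on agent j (assumed KL form).

    Compares what j does given k's actual action against what j would do
    on average over k's counterfactual actions.
    """
    marginal = marginal_policy(cond_policy_j, policy_k)
    return kl_divergence(cond_policy_j[action_k], marginal)


def influence_weight(step, warmup_steps, max_weight):
    """Hypothetical linear curriculum: ramp the influence weight from 0 up."""
    fraction = min(max(step / warmup_steps, 0.0), 1.0)
    return max_weight * fraction


def total_reward(env_reward, influence, weight):
    """Hypothetical combination of environment reward and influence reward."""
    return env_reward + weight * influence


if __name__ == "__main__":
    # Agent j strongly reacts to agent k's action in this toy table.
    cond = [[0.9, 0.1], [0.2, 0.8]]
    pk = [0.5, 0.5]
    c = influence_reward(cond, pk, action_k=0)
    w = influence_weight(step=50, warmup_steps=100, max_weight=2.0)
    print(total_reward(env_reward=1.0, influence=c, weight=w))
```

If agent j's conditional policy is the same for every action of agent k, the marginal equals each conditional row and the influence reward is zero, which matches the intuition that such an action changes nothing about j's behavior.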

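Experimental Results

Research Questions

RQ1: Can an intrinsic reward based on causal influence improve coordination in multi-agent environments without centralized training?
RQ2: Does maximizing causal influence between agents lead to more meaningful emergent communication?
RQ3: Can agents equipped with an MOA be trained independently and still achieve coordinated behavior?
RQ4: In practice, does the influence reward correlate with maximizing the mutual information between agents' actions?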

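Key Findings

In Sequential Social Dilemmas (SSDs), agents trained with the social influence reward achieve higher collective reward than baseline and ablated agents.
Influence-based communication leads to faster learning and higher collective reward, as well as more meaningful and better coordinated messaging.
Agents equipped with an MOA can compute influence internally and coordinate without centralized control, outperforming the baseline.
Being influenced through communication correlates significantly with higher individual reward, which supports the view that the communication is informative.
By creating explicit dependencies between agents' actions, influence can reduce gradient variance in large-scale MARL settings.
The influence mechanism can induce emergent communication that is aligned with the listener's environmental reward and improves cooperation.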

This review was generated by AI and checked by a human editor.