QUICK REVIEW

[Paper Review] Learning Social Conventions in Markov Games

Adam Lerer, Alexander Peysakhovich | arXiv (Cornell University) | Jun 26, 2018

Opinion Dynamics and Social Influence | Cited by 10

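One-Sentence Summary

This paper proposes a reinforcement learning framework that integrates imitation learning into self-play training to help agents learn social conventions in multi-agent Markov games. By using a limited number of observations of social behavior during training, the method substantially increases the likelihood of converging to a compatible equilibrium at test time, even in an environment where standard independent multi-agent reinforcement learning very rarely finds the correct convention.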

ABSTRACT

Social conventions - arbitrary ways to organize group behavior - are an important part of social life. Any agent that wants to enter an existing society must be able to learn its conventions (e.g. which side of the road to drive on, which language to speak) from relatively few observations or risk being unable to coordinate with everyone else. We consider the game theoretic framework of David Lewis which views the selection of a social convention as the selection of an equilibrium in a coordination game. We ask how to construct reinforcement learning based agents that can solve the convention learning task in the self-play paradigm: at training time the agent has access to a good model of the environment and a small amount of observations about how individuals in society act. The agent then has to construct a policy that is compatible with the test-time social convention. We study three environments from the literature which have multiple conventions: traffic, communication, and risky coordination. In each of these we observe that adding a small amount of imitation learning during self-play training greatly increases the probability that the strategy found by self-play fits well with the social convention the agent will face at test time. We show that this works even in an environment where standard independent multi-agent RL very rarely finds the correct test-time equilibrium.

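Motivation and Objectives

Address the challenge of how an agent can learn social conventions (arbitrary coordination rules, such as which side of the road to drive on) from only a limited number of observations of social behavior.
Investigate whether combining imitation learning with self-play increases the likelihood that a self-play-trained agent aligns with the correct social convention at test time.
Evaluate the approach in environments with multiple equilibria: traffic coordination, communication, and risky coordination.
Demonstrate that the approach outperforms standard independent multi-agent reinforcement learning at finding the correct test-time equilibrium.
Show that even a small amount of observational data about social behavior substantially improves convention learning in self-play.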

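Proposed Method

The method trains via self-play with a modified learning objective that incorporates imitation learning on a small number of observed social behaviors.
The agent is trained by combining self-play reinforcement learning with behavioral cloning on observed trajectories of individuals acting under the social convention.
The imitation component encourages the policy to match the behavior patterns of individuals in the society, even when those behaviors are not optimal in isolation.
The framework is applied to three benchmark environments: traffic coordination, language-based communication, and a risky coordination game.
Training aims for a final policy that is not only effective in self-play but also compatible with the test-time social convention.
The method does not require prior knowledge of the correct equilibrium; instead, it infers the convention from the observational data.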
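
The combined objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-shot, two-action pure coordination game (payoff 1 when both agents pick the same action, 0 otherwise), a single softmax policy shared by both players in self-play, and exact gradients in place of sampled policy gradients. The names `train`, `imitation_weight`, and `observed_actions` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def train(observed_actions, imitation_weight,
          init_logits=(0.1, 0.0), lr=0.5, steps=2000):
    """Gradient ascent on: self-play payoff + imitation_weight * log-likelihood.

    Assumed toy setting (not from the paper): action 0 = "left",
    action 1 = "right", and both players share the same policy.
    """
    theta = np.array(init_logits, dtype=float)
    counts = np.bincount(np.asarray(observed_actions, dtype=int), minlength=2)
    freq = counts / counts.sum()  # empirical action frequencies in the data

    for _ in range(steps):
        p = softmax(theta)
        # Self-play payoff J = sum_a p_a^2; its gradient w.r.t. the logits
        # is 2 * p_k * (p_k - J).
        expected_payoff = np.sum(p ** 2)
        grad_selfplay = 2.0 * p * (p - expected_payoff)
        # Behavioral-cloning term: gradient of the mean log-likelihood of
        # the observed actions under the softmax policy is freq - p.
        grad_imitation = freq - p
        theta += lr * (grad_selfplay + imitation_weight * grad_imitation)

    return softmax(theta)


# A handful of observations of a society that drives on the right.
observed = [1, 1, 1, 1, 1]

# The default initialization slightly favors "left", so pure self-play
# settles on the convention that conflicts with the observed society.
selfplay_only = train(observed, imitation_weight=0.0)
with_imitation = train(observed, imitation_weight=0.5)

print("self-play only :", selfplay_only)
print("with imitation :", with_imitation)
```

In this toy setting both runs reach a valid equilibrium of the game, but only the run with the imitation term reaches the one the observed society uses. The weight 0.5 is an arbitrary choice for the sketch.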

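Experimental Results

Research Questions

RQ1: Does combining imitation learning with self-play increase the probability that a self-play-trained agent adopts the correct social convention at test time?
RQ2: How effective is the method in multi-equilibrium environments where standard independent multi-agent reinforcement learning often fails to converge to the correct convention?
RQ3: Compared with pure self-play, does a small amount of observational data about social behavior substantially improve convention learning?
RQ4: In which types of coordination games does adding imitation learning yield the largest improvement in convention alignment?
RQ5: Does the method generalize across different social convention tasks, such as traffic rules, language use, and risky coordination?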

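Key Findings

Adding imitation learning to self-play training substantially increased the probability that the learned policy aligned with the test-time social convention in all three environments.
In the risky coordination game, where standard independent multi-agent reinforcement learning very rarely finds the correct equilibrium, the proposed method converged to the correct convention.
Even with only a small number of behavioral demonstrations, the method achieved high compatibility with the observed social convention.
Integrating imitation learning yielded faster convergence and more stable policy learning than pure self-play.
The method outperformed the independent multi-agent reinforcement learning baseline in all evaluated environments, particularly in settings with high equilibrium multiplicity.
The results indicate that observational data about social behavior is sufficient to guide agents toward socially compatible equilibria, even without explicit reward shaping.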
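
The distinction between doing well in self-play and being compatible with the test-time convention can be made concrete with a small sketch. It assumes a hypothetical two-action coordination game with an identity payoff matrix and deterministic policies; none of these values come from the paper. The helper `expected_payoff` is an illustrative name, not part of any library.

```python
import numpy as np


def expected_payoff(payoff, row_policy, col_policy):
    """Expected payoff when the row player's mixed policy meets the column player's."""
    return float(row_policy @ payoff @ col_policy)


# Assumed coordination game: payoff 1 if both pick the same action, else 0.
payoff = np.eye(2)

society = np.array([0.0, 1.0])       # test-time convention: always action 1
mismatched = np.array([1.0, 0.0])    # agent that converged to action 0
matched = np.array([0.0, 1.0])       # agent that converged to action 1

# Self-play score: the agent paired with a copy of itself.
selfplay_mismatched = expected_payoff(payoff, mismatched, mismatched)  # 1.0
selfplay_matched = expected_payoff(payoff, matched, matched)           # 1.0

# Cross-play score: the agent paired with a member of the existing society.
crossplay_mismatched = expected_payoff(payoff, mismatched, society)    # 0.0
crossplay_matched = expected_payoff(payoff, matched, society)          # 1.0
```

Both agents look identical under the self-play score; only the cross-play score against the society's policy separates the compatible agent from the incompatible one.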


This review was generated by AI and reviewed by a human editor.