QUICK REVIEW

[Paper Review] Emergent Tool Use From Multi-Agent Autocurricula

Bowen Baker, Ingmar Kanitscheider|arXiv (Cornell University)|Sep 17, 2019

Reinforcement Learning in Robotics | References: 70 | Cited by: 335

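One-Sentence Summary

The paper shows that multi-agent self-play in a physics-based hide-and-seek environment induces a self-supervised autocurriculum with six emergent strategy phases, including tool use, and it proposes transfer-based evaluation with targeted intelligence tests.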

ABSTRACT

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.

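Research Motivation and Goals

Advance unsupervised skill discovery in open-ended, physically grounded environments.
Show that multi-agent competition induces an autocurriculum of progressively more sophisticated strategies.
Demonstrate the emergence of human-relevant skills such as tool use and coordination.
Propose transfer learning and targeted intelligence tests for evaluating open-ended agents.
Open-source the environments and code to support further research.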

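Proposed Method

Use a physics-based hide-and-seek environment that mixes competition (between teams) and cooperation (within a team).
Train agents with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE), using centralized training with decentralized execution.
Adopt an egocentric, entity-based policy architecture that applies self-attention over a variable number of entities.
Observe, through self-play, the emergence of up to six strategy phases driven solely by the hide-and-seek objective.
Compare the multi-agent autocurriculum against intrinsic motivation and random initialization baselines on domain-specific tests.
Propose transfer and fine-tuning on a suite of intelligence tasks as the evaluation framework.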
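
Two of the ingredients listed above, GAE and attention over a variable number of entities, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `mask` argument (for example, to hide entities an agent should not attend to), the default `gamma` and `lam` values, and the omission of learned projection weights and episode-termination handling are all choices made here for brevity.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards[t] and values[t] cover steps 0..T-1; last_value is the value
    estimate used to bootstrap after the final step. Assumes no episode
    boundary inside the segment (an assumption of this sketch).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def masked_attention_pool(query, keys, values, mask):
    """Scaled dot-product attention of one query over a list of entities.

    mask[i] is True when entity i may be attended to. Learned projections
    are omitted; keys and values are used as given (sketch assumption).
    """
    out_dim = len(values[0]) if values else 0
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    allowed = [s for s, m in zip(scores, mask) if m]
    if not allowed:
        # Nothing to attend to: return a zero vector.
        return [0.0] * out_dim
    top = max(allowed)  # subtract the max for numerical stability
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(out_dim)
    ]
```

Because the attention weights are computed per entity and then normalized, the same function handles any number of entities, which is the property the entity-based architecture relies on. Setting `lam=0` in `gae_advantages` reduces the estimate to the one-step TD error, while `lam=1` gives the discounted return minus the value baseline.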

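Experimental Results

Research Questions

RQ1: Does multi-agent competition induce an autocurriculum that produces complex, tool-using behavior in a physically grounded environment?
RQ2: Which phases of strategy emerge when agents are trained in competition with one another?
RQ3: Does the multi-agent autocurriculum scale with environment complexity, and how does it differ from training on intrinsic motivation alone?
RQ4: Can transfer learning and targeted intelligence tests quantify progress in open-ended learning?
RQ5: How do the pretrained agents perform against baselines on domain-specific manipulation and cognitive tasks?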

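Key Findings

Over the course of training, agents exhibit up to six distinct phases of strategy and counter-strategy.
Hiders learn to build shelters from movable boxes and walls; seekers learn to use ramps to breach those shelters.
The strategies of seekers and hiders include ramp use, ramp defense, box surfing, and surf defense.
The paper reports evidence that the multi-agent autocurriculum scales with environment complexity and produces more human-relevant behavior than the intrinsic motivation baselines.
Transfer experiments show that, compared with the baselines, agents pretrained on hide-and-seek converge faster or reach better performance on 3 of the 5 target tasks.
The work releases open-source environments and code to support further research.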

This review was generated by AI and checked by a human editor.