QUICK REVIEW

[Paper Review] Emergent Complexity via Multi-Agent Competition

Trapit Bansal, Jakub Pachocki|arXiv (Cornell University)|Oct 10, 2017

Reinforcement Learning in Robotics | References: 31 | Cited by: 146

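One-Sentence Summary

This paper shows that self-play in competitive multi-agent environments can produce highly complex behaviors in simple 3D physics tasks, using a distributed PPO training framework combined with an exploration curriculum and an opponent sampling strategy.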

ABSTRACT

Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally, the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents of this level will have the right level of difficulty. This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environments themselves are relatively simple. The skills include behaviors such as running, blocking, ducking, tackling, fooling opponents, kicking, and defending using both arms and legs. A highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX

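Research Motivation and Goals

Explain why competitive multi-agent self-play can produce behaviors more complex than the environment itself.
Introduce four competitive 3D environments with simple rules and physics.
Show the natural curriculum that arises from playing against opponents of similar skill.
Demonstrate that an exploration curriculum aids learning under sparse rewards.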

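Proposed Method

Train with Proximal Policy Optimization (PPO) in a decentralized, distributed setup, with large-scale rollouts across multiple GPUs.
Use two 3D agents (ant and humanoid) across four tasks: Run to Goal, You Shall Not Pass, Sumo, and Kick and Defend.
Add an exploration curriculum by annealing a dense exploration reward to zero over the course of training.
Sample opponents at random from older versions of the policy, which stabilizes self-play training and keeps one side from quickly outpacing the other.
Use GAE with the clipped PPO objective and, where needed, train separate policies for asymmetric games.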
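The exploration curriculum and the opponent sampling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear annealing schedule, the function names, and the `delta` parameter (the fraction of the oldest checkpoints to exclude) are all hypothetical choices made for this example.

```python
import random


def exploration_weight(iteration: int, anneal_iters: int) -> float:
    """Weight on the dense exploration reward, decayed linearly from 1.0 to 0.0.

    Assumption: a linear schedule over `anneal_iters` training iterations.
    """
    if anneal_iters <= 0:
        return 0.0
    return max(0.0, 1.0 - iteration / anneal_iters)


def shaped_reward(dense: float, sparse: float, done: bool, alpha: float) -> float:
    """Blend the dense exploration reward with the sparse competition reward.

    The sparse reward is only paid on the final step of an episode.
    """
    terminal = sparse if done else 0.0
    return alpha * dense + (1.0 - alpha) * terminal


def sample_opponent(checkpoints: list, delta: float, rng: random.Random):
    """Pick an opponent uniformly from the newest part of the checkpoint history.

    Hypothetical parameter `delta` in [0, 1]:
    delta = 1.0 always returns the latest checkpoint,
    delta = 0.0 samples uniformly over the whole history.
    """
    if not checkpoints:
        raise ValueError("no opponent checkpoints saved yet")
    start = int(delta * (len(checkpoints) - 1))
    return checkpoints[rng.randrange(start, len(checkpoints))]
```

With `alpha` close to 1 early in training, the agent is rewarded mostly for basic motor progress; as `alpha` reaches 0, only the competition outcome matters. Lowering `delta` exposes the agent to a wider range of past opponents.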

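Experimental Results

Research Questions

RQ1: Can self-play in competitive multi-agent environments produce emergent, complex behaviors that exceed the inherent complexity of the environment?
RQ2: Does an exploration curriculum improve learning efficiency and uncover non-trivial motor skills under sparse rewards?
RQ3: Which training strategies (for example, opponent sampling or a curriculum over randomization) yield robust policies in competitive 3D tasks?
RQ4: How well do the learned policies transfer to non-episodic or perturbed conditions (robustness tests)?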

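Key Findings

Competitive multi-agent training produces diverse emergent skills such as blocking, ducking, tackling, kicking, and defending.
The exploration curriculum is essential for learning under sparse rewards and improves sample efficiency.
Training against randomly sampled older opponents stabilizes learning and supports continued progress.
Training an ensemble of policies improves robustness compared with single-policy self-play, especially for the humanoid agent.
A randomization curriculum over environment parameters helps policies generalize without sacrificing early learning progress.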
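One way to picture the randomization curriculum in the last finding is to widen the sampling range of an environment parameter as training progresses. The sketch below is an assumption-laden illustration: the linear ramp, the function names, and the parameters `ramp_iters` and `max_halfwidth` are hypothetical and are not taken from the paper.

```python
import random


def randomization_halfwidth(iteration: int, ramp_iters: int, max_halfwidth: float) -> float:
    """Half-width of the sampling range, grown linearly from 0 to max_halfwidth.

    Assumption: a linear ramp over `ramp_iters` iterations, then constant.
    """
    if ramp_iters <= 0:
        return max_halfwidth
    progress = min(1.0, max(0.0, iteration / ramp_iters))
    return progress * max_halfwidth


def sample_env_param(nominal: float, iteration: int, ramp_iters: int,
                     max_halfwidth: float, rng: random.Random) -> float:
    """Sample an environment parameter (e.g. an arena size) around its nominal value."""
    h = randomization_halfwidth(iteration, ramp_iters, max_halfwidth)
    return rng.uniform(nominal - h, nominal + h)
```

Early in training every episode uses the nominal value, so the agent first learns in a fixed setting; later episodes draw from a progressively wider range, which is what pushes the policy to generalize.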


This review was generated by AI and checked by a human editor.