QUICK REVIEW

[Paper Review] Emergent Tool Use From Multi-Agent Autocurricula

Bowen Baker, Ingmar Kanitscheider | arXiv (Cornell University) | Sep 17, 2019

Reinforcement Learning in Robotics | References: 70 | Citations: 335

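One-Sentence Summary

Summary: This study shows that multi-agent self-play in a physics-based hide-and-seek environment induces a self-supervised autocurriculum with six emergent strategy phases, including tool use, and it proposes transfer-based evaluation with targeted intelligence tests.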

ABSTRACT

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.

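Motivation and Objectives

Motivate unsupervised skill discovery in open-ended, physically grounded environments.
Demonstrate that multi-agent competition induces autocurricula with progressively evolving strategies.
Show the emergence of human-relevant skills such as tool use and coordination.
Propose transfer learning and targeted intelligence tests for evaluating open-ended agents.
Open-source the environment and code to enable further research.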

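Proposed Method

Use a physics-based hide-and-seek environment with mixed competitive and cooperative dynamics.
Train agents with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE) under centralized training and decentralized execution.
Adopt an egocentric, entity-based policy architecture that applies self-attention over a variable number of entities.
Observe the emergence of up to six strategy phases through self-play driven solely by the hide-and-seek objective.
Compare multi-agent autocurricula against intrinsic motivation and random initialization baselines on domain-specific tests.
Propose transfer learning and fine-tuning on a suite of intelligence tasks as an evaluation framework.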
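As a rough illustration of the GAE step named in the method list, the sketch below computes advantages for a single trajectory. It is not the authors' implementation: the function name `gae_advantages`, the default `gamma` and `lam` values, and the assumption that the trajectory contains no mid-episode termination are all hypothetical choices made for this example.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one trajectory.

    `values` must hold one more entry than `rewards`; the final entry is
    the bootstrap value of the state reached after the last step.
    The gamma and lam defaults are illustrative, not taken from the paper.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors with decay gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0.0` reduces the estimate to the one-step TD error, while `lam=1.0` recovers the discounted return minus the value baseline. A PPO update would then use these advantages in its clipped surrogate objective.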

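Experimental Results

Research Questions

RQ1: Can multi-agent competition induce autocurricula that produce complex, tool-using behavior in physically grounded environments?
RQ2: What phases of strategy emerge as agents train against one another?
RQ3: Does the multi-agent autocurriculum scale with environment complexity, and how does it compare with intrinsic motivation alone?
RQ4: Can transfer learning and targeted intelligence tests quantify progress in open-ended learning?
RQ5: How do pretrained agents perform on domain-specific manipulation and cognition tasks compared with baselines?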

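Key Findings

Agents exhibit up to six distinct phases of strategy and counter-strategy during training.
Hiders learn to build shelters from movable boxes and walls, and seekers learn to use ramps to breach those shelters.
Seeker and hider strategies include ramp use, ramp defense, box surfing, and surf defense.
The authors present evidence that multi-agent autocurricula may scale better with environment complexity and yield more human-relevant behavior than intrinsic motivation baselines.
In transfer experiments, hide-and-seek pretrained agents outperform baselines or converge faster on 3 of the 5 targeted tasks.
The study releases an open-source environment and code to support further research.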

This review was generated by AI and checked by a human editor.