QUICK REVIEW

[Paper Review] MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies

Xue Bin Peng, Michael Chang|arXiv (Cornell University)|May 23, 2019

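Human Pose and Action Recognition|Cited by 55

One-line summary

MCP learns multiplicative compositional policies in which multiple primitives are active at the same time, yielding transferable skills that solve complex continuous control tasks for agents with many degrees of freedom.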

ABSTRACT

Humans are able to perform a myriad of sophisticated tasks by drawing upon skills acquired through prior experience. For autonomous agents to have this capability, they must be able to extract reusable skills from past experience that can be recombined in new ways for subsequent tasks. Furthermore, when controlling complex high-dimensional morphologies, such as humanoid bodies, tasks often require coordination of multiple skills simultaneously. Learning discrete primitives for every combination of skills quickly becomes prohibitive. Composable primitives that can be recombined to create a large variety of behaviors can be more suitable for modeling this combinatorial explosion. In this work, we propose multiplicative compositional policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. Our method factorizes an agent's skills into a collection of primitives, where multiple primitives can be activated simultaneously via multiplicative composition. This flexibility allows the primitives to be transferred and recombined to elicit new behaviors as necessary for novel tasks. We demonstrate that MCP is able to extract composable skills for highly complex simulated characters from pre-training tasks, such as motion imitation, and then reuse these skills to solve challenging continuous control tasks, such as dribbling a soccer ball to a goal, and picking up an object and transporting it to a target location.

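Motivation and objectives

Motivate learning transferable, reusable skills from an autonomous agent's prior experience.
Avoid the combinatorial explosion of skill combinations by allowing multiple primitives to be active simultaneously.
Propose a multiplicative composition framework that yields a flexible, reusable action space.
Show that pre-trained primitives transfer to challenging downstream tasks.
Show that multiplicative composition yields better performance than the baselines as task complexity increases.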

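Proposed method

Factorizes the agent's behavior into a set of primitives, each modeled as an action distribution.
Composes the primitives multiplicatively so that several primitives can influence the action at a single timestep.
Uses Gaussian primitives and derives closed-form expressions for the composite mean and variance (Eq. 3).
Pre-trains the primitives on a motion imitation corpus with an asymmetric model, in which the primitives observe only the state while the gating function also observes the goal, to encourage specialization.
For a new task, freezes the primitives and trains a new gating network to compose them.
Trains end-to-end with PPO on continuous control tasks, with the policy operating at 30 Hz.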
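
The closed-form composition of Gaussian primitives can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes diagonal Gaussian primitives described by explicit per-dimension variances and non-negative gating weights, and the function name `compose_gaussian_primitives` is hypothetical. Raising a Gaussian to the power w_i scales its precision by w_i, so the composite precision is the sum of w_i / var_i and the composite mean is the precision-weighted average of the primitive means.

```python
def compose_gaussian_primitives(means, variances, weights):
    """Multiplicatively compose k diagonal-Gaussian primitives.

    means, variances: k lists of per-dimension values, one list per primitive.
    weights: k non-negative gating weights for the current state and goal.
    Returns (composite_mean, composite_variance), one value per action dimension.
    Hypothetical helper for illustration only.
    """
    dims = len(means[0])
    comp_mean, comp_var = [], []
    for j in range(dims):
        # Primitive i contributes precision w_i / var_i in dimension j.
        precisions = [w / v[j] for w, v in zip(weights, variances)]
        total = sum(precisions)
        if total <= 0.0:
            raise ValueError("at least one primitive needs a positive weight")
        # Composite mean: precision-weighted average of the primitive means.
        comp_mean.append(sum(p * m[j] for p, m in zip(precisions, means)) / total)
        # Composite variance: inverse of the summed precisions.
        comp_var.append(1.0 / total)
    return comp_mean, comp_var


# Two 1-D primitives with equal weights; the more confident one (variance 1.0)
# pulls the composite mean toward its own mean.
mean, var = compose_gaussian_primitives(
    means=[[0.0], [4.0]], variances=[[1.0], [3.0]], weights=[1.0, 1.0]
)
print(mean, var)
```

Under the transfer setup described above, only the function that produces `weights` would be retrained for a new task, while the primitives that produce `means` and `variances` stay frozen.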

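Experimental results

Research questions

RQ1: Does multiplicative composition of multiple primitives produce a richer set of behaviors than additive mixing?
RQ2: Do pre-trained, reusable primitives transfer effectively to new tasks with different goals and morphologies?
RQ3: Does MCP scale better than prior hierarchical and latent space methods to high-degree-of-freedom characters and long-horizon tasks?
RQ4: What exploration and specialization properties do the primitives learned during pre-training exhibit?
RQ5: How does MCP perform on transfer tasks compared with the scratch, finetuning, hierarchical, MOE, and latent space baselines?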

Key findings

Environment	Scratch	Finetune	Hierarchical	Option-Critic	MOE	Latent Space	MCP (Ours)
Heading: Biped	0.927±0.032	0.970±0.002	0.834±0.001	0.952±0.012	0.918±0.002	0.970±0.001	0.976±0.002
Carry: Biped	0.027±0.035	0.324±0.014	0.001±0.002	0.346±0.011	0.013±0.013	0.456±0.031	0.575±0.032
Dribble: Biped	0.072±0.012	0.651±0.025	0.546±0.024	0.046±0.008	0.073±0.021	0.768±0.012	0.782±0.008
Dribble: Humanoid	0.076±0.024	0.598±0.030	0.198±0.002	0.058±0.007	0.043±0.021	0.751±0.006	0.805±0.006
Dribble: T-Rex	0.065±0.032	0.074±0.011	-	0.098±0.013	0.070±0.017	0.115±0.013	0.781±0.021
Holdout: Ant	0.951±0.093	0.885±0.062	-	-	-	0.745±0.060	0.812±0.030

MCP allows multiple primitives to be active simultaneously, making it more expressive than additive methods.
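On the Biped, Humanoid, and T-Rex transfer tasks, MCP outperforms every baseline in the table, and it is the only method that solves the hardest task, Dribble: T-Rex (0.781 versus at most 0.115). On Holdout: Ant, training from scratch scores higher (0.951 versus 0.812).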
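As task complexity increases, MCP learns faster and reaches higher asymptotic performance than the baselines.
The primitives specialize to phases of the gait cycle and produce distinct action clusters, indicating a meaningful decomposition of skills.
Latent space models can overfit to the pre-training tasks, whereas MCP composes a valid convex combination of the primitive means, which helps transfer.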
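On Holdout: Ant, MCP (0.812) outperforms the latent space model (0.745) but trails finetuning (0.885) and training from scratch (0.951); its primitives also produce structured exploration behavior.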
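
To make the contrast between additive and multiplicative composition concrete, the sketch below compares the two for a pair of 1-D Gaussian primitives. It is an illustration under stated assumptions, not the paper's implementation: both function names are hypothetical, the additive case is modeled as a standard Gaussian mixture, and the numbers are toy values rather than results from the paper.

```python
def additive_mixture_moments(means, variances, weights):
    """Mean and variance of the 1-D Gaussian mixture sum_i p_i N(mu_i, var_i),
    where p_i are the weights normalized to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(p * m for p, m in zip(probs, means))
    second_moment = sum(p * (v + m * m) for p, m, v in zip(probs, means, variances))
    return mean, second_moment - mean * mean


def multiplicative_moments(means, variances, weights):
    """Mean and variance of the renormalized 1-D product prod_i N(mu_i, var_i)^w_i."""
    precisions = [w / v for w, v in zip(weights, variances)]
    total = sum(precisions)
    # The coefficients precisions[i] / total are non-negative and sum to 1,
    # so the composite mean is a convex combination of the primitive means.
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    return mean, 1.0 / total


# Toy primitives: one pushes the action toward -1, the other toward +1.
means, variances, weights = [-1.0, 1.0], [0.1, 0.1], [1.0, 1.0]
mix_mean, mix_var = additive_mixture_moments(means, variances, weights)
mcp_mean, mcp_var = multiplicative_moments(means, variances, weights)
print("additive mixture:", mix_mean, mix_var)
print("multiplicative:  ", mcp_mean, mcp_var)
```

Both compositions have mean 0 here, but the mixture is bimodal with variance of roughly 1.1, so a sample follows one primitive or the other, while the product is a single narrow Gaussian centered between them with variance of roughly 0.05, so each sampled action blends both primitives.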

This review was generated by AI and checked by a human editor.