QUICK REVIEW

[Paper Review] MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies

Xue Bin Peng, Michael Chang | arXiv (Cornell University) | May 23, 2019

Human Pose and Action Recognition | Citations: 55

One-Line Summary

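MCP composes multiple primitives multiplicatively so that they can be active simultaneously, yielding transferable skills that solve complex continuous control tasks for agents with many degrees of freedom.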

ABSTRACT

Humans are able to perform a myriad of sophisticated tasks by drawing upon skills acquired through prior experience. For autonomous agents to have this capability, they must be able to extract reusable skills from past experience that can be recombined in new ways for subsequent tasks. Furthermore, when controlling complex high-dimensional morphologies, such as humanoid bodies, tasks often require coordination of multiple skills simultaneously. Learning discrete primitives for every combination of skills quickly becomes prohibitive. Composable primitives that can be recombined to create a large variety of behaviors can be more suitable for modeling this combinatorial explosion. In this work, we propose multiplicative compositional policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. Our method factorizes an agent's skills into a collection of primitives, where multiple primitives can be activated simultaneously via multiplicative composition. This flexibility allows the primitives to be transferred and recombined to elicit new behaviors as necessary for novel tasks. We demonstrate that MCP is able to extract composable skills for highly complex simulated characters from pre-training tasks, such as motion imitation, and then reuse these skills to solve challenging continuous control tasks, such as dribbling a soccer ball to a goal, and picking up an object and transporting it to a target location.

Motivation and Objectives

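Motivate the learning of transferable, reusable skills from an autonomous agent's prior experience.
Address the combinatorial explosion of skill combinations by allowing primitives to be activated simultaneously.
Propose a multiplicative composition framework that yields a flexible, reusable action space.
Transfer pre-trained primitives to challenging downstream tasks.
Provide evidence that the advantage of multiplicative composition grows as task complexity increases.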

Proposed Method

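Decompose the agent's behavior into a collection of primitives, each of which is a distribution over actions.
Compose the primitives multiplicatively so that several primitives can influence the action at a single timestep.
Use Gaussian primitives and derive closed-form expressions for the composite mean and variance (Equation 3).
Pre-train the primitives on a motion imitation corpus using an asymmetric model that encourages specialization: the primitives observe only the state, while the gating function uses the goal.
Transfer by freezing the primitives and training a new gating network that composes them for the new task.
Train end to end with PPO on continuous control tasks, with control running at 30 Hz.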
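
As an illustration of the closed-form composition mentioned above, the following is a minimal NumPy sketch, assuming diagonal-Gaussian primitives and non-negative gating weights. The function name `compose_gaussian_primitives` and the array layout are hypothetical choices made for this example, not the authors' code. It relies on a standard fact: a product of Gaussians raised to powers w_i is again Gaussian, with precision equal to the weighted sum of the primitive precisions and mean equal to the precision-weighted average of the primitive means.

```python
import numpy as np


def compose_gaussian_primitives(means, variances, weights):
    """Multiplicatively compose k diagonal-Gaussian primitives.

    means, variances: arrays of shape (k, action_dim); variances must be > 0.
    weights: array of shape (k,), non-negative gating weights (not all zero).
    Returns the mean and variance of the composite Gaussian, each (action_dim,).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Each primitive contributes precision w_i / variance_i per action dimension.
    weighted_precisions = weights[:, None] / variances
    total_precision = weighted_precisions.sum(axis=0)

    composite_variance = 1.0 / total_precision
    composite_mean = (weighted_precisions * means).sum(axis=0) / total_precision
    return composite_mean, composite_variance


# Two hypothetical primitives acting on a 2-D action space.
means = [[1.0, 0.0], [0.0, 2.0]]
variances = [[0.1, 1.0], [1.0, 0.1]]  # each primitive is confident in one dimension
mean, var = compose_gaussian_primitives(means, variances, weights=[1.0, 1.0])
print(mean, var)
```

With equal weights, each action dimension is pulled toward whichever primitive has the lower variance there, so the composite takes dimension 0 mostly from the first primitive and dimension 1 mostly from the second.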

Experimental Results

Research Questions

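RQ1: Can multiplicative composition of multiple primitives produce a richer set of behaviors than linear mixing?
RQ2: Do pre-trained, reusable primitives transfer effectively to new tasks with different goals or morphologies?
RQ3: Does MCP scale better than prior hierarchical and latent-space methods to high-degree-of-freedom characters and long-horizon tasks?
RQ4: What exploration and specialization properties do the primitives learned during pre-training exhibit?
RQ5: How does MCP's performance on transfer tasks compare with the scratch, finetune, hierarchical, MOE, and latent-space baselines?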

Key Results

Environment	Scratch	Finetune	Hierarchical	Option-Critic	MOE	Latent Space	MCP (Ours)
Heading: Biped	0.927±0.032	0.970±0.002	0.834±0.001	0.952±0.012	0.918±0.002	0.970±0.001	0.976±0.002
Carry: Biped	0.027±0.035	0.324±0.014	0.001±0.002	0.346±0.011	0.013±0.013	0.456±0.031	0.575±0.032
Dribble: Biped	0.072±0.012	0.651±0.025	0.546±0.024	0.046±0.008	0.073±0.021	0.768±0.012	0.782±0.008
Dribble: Humanoid	0.076±0.024	0.598±0.030	0.198±0.002	0.058±0.007	0.043±0.021	0.751±0.006	0.805±0.006
Dribble: T-Rex	0.065±0.032	0.074±0.011	-	0.098±0.013	0.070±0.017	0.115±0.013	0.781±0.021
Holdout: Ant	0.951±0.093	0.885±0.062	-	-	-	0.745±0.060	0.812±0.030

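MCP allows multiple primitives to be active simultaneously, which makes it more expressive than additive methods.
On the Biped, Humanoid, and T-Rex transfer tasks, MCP scores highest among all methods in the table, and it is the only method to make substantial progress on Dribble: T-Rex (0.781, versus at most 0.115 for the baselines).
As task complexity increases, MCP learns faster and reaches a higher peak performance than the baselines.
The primitives specialize in different phases of the gait and produce distinct clusters of actions, suggesting a meaningful decomposition of skills.
Latent-space models can overfit to the pre-training tasks, whereas MCP offers a flexible convex combination of the primitive means, which aids transfer.
On Holdout: Ant, MCP (0.812) outperforms the latent-space model (0.745) but trails Scratch (0.951) and Finetune (0.885); it also exhibits structured exploration behaviors.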
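
To make the contrast with additive methods concrete, here is a small sketch, assuming two 1-D Gaussian primitives with made-up parameters and hypothetical helper names. An additive mixture draws each action from a single primitive, so it keeps both primitives' modes; the multiplicative composite is a single Gaussian whose mean is a convex combination of the primitive means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 1-D primitives with means -1 and +1 and equal variance.
mus = np.array([-1.0, 1.0])
sigma2 = np.array([0.04, 0.04])
w = np.array([0.5, 0.5])


def sample_additive(n):
    # Mixture: pick one primitive per sample, then draw from it.
    idx = rng.choice(len(mus), size=n, p=w / w.sum())
    return rng.normal(mus[idx], np.sqrt(sigma2[idx]))


def sample_multiplicative(n):
    # Product of Gaussians: summed precision, precision-weighted mean.
    precision = (w / sigma2).sum()
    mean = (w / sigma2 * mus).sum() / precision
    return rng.normal(mean, np.sqrt(1.0 / precision), size=n)


additive = sample_additive(10_000)
multiplicative = sample_multiplicative(10_000)

# Fraction of samples within 0.5 of the blended action 0.
print("additive near 0:      ", np.mean(np.abs(additive) < 0.5))
print("multiplicative near 0:", np.mean(np.abs(multiplicative) < 0.5))
```

In this toy setting almost no additive samples land near the blended action, while nearly all multiplicative samples do, which is one way to read the claim that simultaneous activation lets primitives be combined rather than merely selected.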

This review was generated by AI and checked by a human editor.