QUICK REVIEW

[Paper Review] FACMAC: Factored Multi-Agent Centralised Policy Gradients

Bei Peng, Tabish Rashid|arXiv (Cornell University)|2020. 03. 14.

Reinforcement Learning in Robotics | References: 70 | Citations: 106

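One-Line Summary

FACMAC introduces a centralised but factored critic and a centralised gradient estimator for cooperative MARL, supports both continuous and discrete action tasks, and outperforms MADDPG and other baselines across multiple domains.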

ABSTRACT

We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic policy gradients to learn policies. However, FACMAC learns a centralised but factored critic, which combines per-agent utilities into the joint action-value function via a non-linear monotonic function, as in QMIX, a popular multi-agent Q-learning algorithm. However, unlike QMIX, there are no inherent constraints on factoring the critic. We thus also employ a nonmonotonic factorisation and empirically demonstrate that its increased representational capacity allows it to solve some tasks that cannot be solved with monolithic, or monotonically factored critics. In addition, FACMAC uses a centralised policy gradient estimator that optimises over the entire joint action space, rather than optimising over each agent's action space separately as in MADDPG. This allows for more coordinated policy changes and fully reaps the benefits of a centralised critic. We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks. Empirical results demonstrate FACMAC's superior performance over MADDPG and other baselines on all three domains.

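Motivation and Objectives

Motivate and develop a scalable, centralised but factored critic for cooperative multi-agent RL.
Enable policy optimisation over the entire joint action space to improve coordination.
Demonstrate the benefits and greater representational capacity of nonmonotonic factorisation.
Show applicability to both discrete and continuous action spaces on challenging tasks.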

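Proposed Method

Define a centralised but factored critic that combines per-agent utilities through a (non-)linear mixing function.
Introduce a centralised gradient estimator that optimises over the entire joint action space.
Explore monotonic (QMIX-style) and nonmonotonic factorisations of the critic.
Adapt to discrete actions using Gumbel-Softmax with the Straight-Through estimator.
Evaluate on continuous and discrete MARL benchmarks, including MAMuJoCo, Continuous Predator-Prey, and SMAC.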
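The monotonic (QMIX-style) factorisation listed above can be illustrated with a small sketch. This is not the authors' implementation: the class name `MonotonicMixer`, the linear hypernetworks, the hidden size, and the random initialisation are all assumptions chosen to keep the example dependency-free. The only property it is meant to show is that taking absolute values of the state-conditioned mixing weights makes the joint value non-decreasing in every per-agent utility.

```python
import math
import random


def elu(x):
    """ELU activation; monotonically increasing, which preserves monotonic mixing."""
    return x if x > 0 else math.exp(x) - 1.0


def make_hypernet(state_dim, out_dim, rng):
    """Hypothetical linear hypernetwork mapping a state vector to out_dim values."""
    weights = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(out_dim)]
    return lambda state: [sum(w * s for w, s in zip(row, state)) for row in weights]


class MonotonicMixer:
    """Illustrative two-layer mixer: Q_tot = f(Q_1, ..., Q_n; state)."""

    def __init__(self, n_agents, state_dim, hidden=4, seed=0):
        rng = random.Random(seed)
        self.n_agents, self.hidden = n_agents, hidden
        self.w1 = make_hypernet(state_dim, n_agents * hidden, rng)
        self.b1 = make_hypernet(state_dim, hidden, rng)
        self.w2 = make_hypernet(state_dim, hidden, rng)
        self.b2 = make_hypernet(state_dim, 1, rng)

    def __call__(self, utilities, state):
        # abs() forces non-negative mixing weights; biases stay unconstrained.
        w1 = [abs(v) for v in self.w1(state)]
        w2 = [abs(v) for v in self.w2(state)]
        b1 = self.b1(state)
        b2 = self.b2(state)[0]
        hidden_out = []
        for j in range(self.hidden):
            pre = sum(utilities[a] * w1[a * self.hidden + j]
                      for a in range(self.n_agents)) + b1[j]
            hidden_out.append(elu(pre))
        return sum(h * w for h, w in zip(hidden_out, w2)) + b2


mixer = MonotonicMixer(n_agents=3, state_dim=2)
state = [0.5, -1.0]
low = mixer([0.1, 0.2, 0.3], state)
high = mixer([0.1, 0.9, 0.3], state)  # only agent 2's utility was raised
print(high >= low)
```

Dropping the `abs()` calls would give one possible nonmonotonic variant, since the mixing weights could then take either sign; whether that matches the paper's nonmonotonic critic is not stated in this review.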

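Experimental Results

Research Questions

RQ1: Does a centralised but factored critic improve coordination in MARL compared with a monolithic critic?
RQ2: Does nonmonotonic factorisation provide greater representational capacity for solving complex tasks?
RQ3: Does centralised policy gradient estimation benefit learning compared with per-agent gradients?
RQ4: How well does FACMAC perform in continuous versus discrete action domains, and does it scale to more agents?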

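Key Results

FACMAC outperforms MADDPG and other baselines on continuous and discrete cooperative tasks.
Factoring the critic enables better scalability as the number of agents and/or actions grows.
Nonmonotonic factorisation can solve tasks that monotonic or monolithic critics cannot.
Centralised gradient estimation improves coordination and helps avoid local optima on both simple and complex tasks.
FACMAC scales to larger numbers of agents and to complex domains such as MAMuJoCo and SMAC, outperforming baselines on several maps.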
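To make the contrast between per-agent and centralised gradient estimation concrete, here is a toy sketch. Everything in it is an assumption for illustration: the bilinear critic `critic(u1, u2) = u1 * u2`, the constant policies stored in `theta`, and the stale replay-buffer action. It shows only that evaluating the critic with another agent's buffered action (as in the MADDPG-style estimator) can yield a different gradient direction than evaluating it with every agent's current policy action (as in the centralised estimator).

```python
def critic(u1, u2):
    """Toy joint critic (assumed): high value when both actions share a sign."""
    return u1 * u2


def dq_du1(u1, u2, eps=1e-5):
    """Central finite-difference estimate of dQ/du1 at (u1, u2)."""
    return (critic(u1 + eps, u2) - critic(u1 - eps, u2)) / (2 * eps)


# Deterministic policies that output a constant action, so d(action)/d(theta) = 1
# and the policy gradient for agent 1 equals dQ/du1.
theta = {"agent1": 1.0, "agent2": 1.0}

# A stale replay-buffer sample, recorded when agent 2 behaved differently.
buffer_action_agent2 = -1.0

# Per-agent estimator: only agent 1's action comes from its current policy;
# agent 2's action is read from the replay buffer.
per_agent_grad = dq_du1(theta["agent1"], buffer_action_agent2)

# Centralised estimator: all actions come from the current policies.
centralised_grad = dq_du1(theta["agent1"], theta["agent2"])

print(per_agent_grad, centralised_grad)
```

In this toy setting the per-agent estimate pushes agent 1 away from the action that is actually good given agent 2's current policy, while the centralised estimate pushes towards it.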


This review was generated by AI and reviewed by a human editor.