QUICK REVIEW

[Paper Review] Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Qian Long, Zihan Zhou|arXiv (Cornell University)|2020. 03. 23.

Reinforcement Learning in Robotics|References 52|Citations 38

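One-Line Summary

This paper introduces Evolutionary Population Curriculum (EPC), a curriculum learning framework that scales multi-agent reinforcement learning by progressively increasing the agent population and using evolutionary selection to preserve adaptability across stages.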

ABSTRACT

In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially.

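Motivation and Objectives

Motivates the problem: learning good policies is hard when environment complexity grows exponentially with the number of agents.
Proposes a population-invariant policy/critic architecture that generalizes to a variable number of agents.
Introduces an evolutionary selection mechanism to resolve the objective misalignment between curriculum stages.
Applies EPC to MADDPG on a range of multi-agent tasks to demonstrate its scalability and robustness.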

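Proposed Method

Adopts a self-attention-based, population-invariant architecture for the Q-function and the policy so that an arbitrary number of agents can be handled.
Splits training into stages with an increasing agent population, forming a curriculum.
Maintains K parallel sets of agents per role and performs mix-and-match (crossover) across the sets to build the scaled-up population.
Uses MARL fine-tuning as a guided mutation operator while the curriculum grows.
Applies evolutionary selection: the agent sets that adapt best to the next stage are chosen based on their fitness in the scaled-up environment.
Demonstrates EPC on MADDPG and compares it against baselines in three environments.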
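
The crossover, mutation, and selection steps listed above can be sketched as one curriculum stage. This is a minimal toy illustration, not the authors' implementation. It assumes a single role and that the population doubles between stages (the summary only says the population grows). The function names `mix_and_match` and `epc_stage`, the callables `fine_tune` and `evaluate`, and the toy data are all hypothetical. Each "agent" is a single float standing in for a policy network, and `toy_fine_tune` stands in for MADDPG fine-tuning.

```python
import itertools
from typing import Callable, List, Sequence

AgentSet = List[float]  # toy stand-in: one float per agent instead of a network


def mix_and_match(parents: Sequence[AgentSet]) -> List[AgentSet]:
    """Crossover: combine every pair of parent sets (a set may pair with
    itself), so each child holds twice as many agents as a parent.
    Doubling is an assumption of this sketch."""
    pairs = itertools.combinations_with_replacement(range(len(parents)), 2)
    return [list(parents[i]) + list(parents[j]) for i, j in pairs]


def epc_stage(
    parents: Sequence[AgentSet],
    fine_tune: Callable[[AgentSet], AgentSet],
    evaluate: Callable[[AgentSet], float],
    k: int,
) -> List[AgentSet]:
    """One stage: crossover, then fine-tuning as guided mutation, then
    selection of the k fittest sets in the scaled-up setting."""
    candidates = mix_and_match(parents)
    tuned = [fine_tune(c) for c in candidates]  # mutation step
    ranked = sorted(tuned, key=evaluate, reverse=True)
    return ranked[:k]  # survivors promoted to the next stage


# Hypothetical toy problem: every agent parameter should approach 1.0.
def toy_fine_tune(agents: AgentSet) -> AgentSet:
    # Stand-in for MARL fine-tuning: move each parameter halfway to 1.0.
    return [p + 0.5 * (1.0 - p) for p in agents]


def toy_evaluate(agents: AgentSet) -> float:
    # Fitness: negative mean squared distance from the target value 1.0.
    return -sum((p - 1.0) ** 2 for p in agents) / len(agents)


parents = [[0.0, 0.25], [0.75, 1.0], [0.5, 0.5]]  # K = 3 sets of 2 agents
survivors = epc_stage(parents, toy_fine_tune, toy_evaluate, k=3)
for agent_set in survivors:
    print(len(agent_set), round(toy_evaluate(agent_set), 4))
```

Selection happens after fine-tuning in the larger population, so a parent set that scored well at the small scale is promoted only if its combinations still score well once scaled up. That ordering is how the sketch reflects the objective misalignment the method targets.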

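Experimental Results

Research Questions

RQ1: How can the agent population be scaled up in MARL without losing stability or performance?
RQ2: Does the evolutionary mix-and-match approach improve adaptation to larger populations compared with simple cloning?
RQ3: Can an attention-based, population-invariant architecture support scalable MARL training across arbitrary numbers of agents?
RQ4: As the number of agents grows exponentially, what advantages does EPC offer over the vanilla population curriculum and non-curriculum MARL baselines?

Key Results

EPC consistently outperforms the baselines as the number of agents increases, even when the population grows exponentially.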
Attention-based, population-invariant architectures improve MADDPG performance compared to baseline MADDPG and mean-field methods.
Vanilla population curricula degrade as population scales, while EPC maintains superior performance across scales.
EPC yields higher survival and more grass consumption in Grassland, and better collaboration and resource collection in Adversarial Battle and Food Collection.

This review was generated by AI and checked by a human editor.