QUICK REVIEW

[Paper Review] Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents

Kaiqing Zhang, Zhuoran Yang|arXiv (Cornell University)|2018. 02. 23.

Distributed Control Multi-Agent Systems | References: 71 | Citations: 250

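One-Line Summary

This paper develops two decentralized actor-critic algorithms, with function approximation and a consensus-based critic, for fully decentralized multi-agent reinforcement learning (MARL) over a time-varying network, and provides convergence guarantees under linear function approximation.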

ABSTRACT

We consider the problem of fully decentralized multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.

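Motivation and Objectives

Motivates and formalizes a fully decentralized MARL setting in which agents on a time-varying network maximize the globally averaged return using only their local rewards and communication with neighbors.
Proposes two decentralized actor-critic algorithms with function approximation that operate without a central controller.
Enables scalability to large state spaces and large numbers of agents through local policies and consensus-based value estimation.
Establishes theoretical convergence guarantees for the proposed algorithms under linear function approximation.
Validates the theory empirically through simulations.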

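Proposed Method

Formalizes a networked multi-agent MDP with a time-varying communication graph and local rewards.
Derives a policy gradient theorem for MARL that uses local policies and decomposes across agents.
Proposes two decentralized actor-critic algorithms in which the actor update is local and the critic update is consensus-based among neighbors.
Uses function approximation for Q and V, with local parameters and a consensus step that shares estimates across the network.
Provides two online, incremental algorithms: one variant based on the state-value TD-error and one based on the action-value TD-error.
Establishes convergence guarantees under linear function approximation and analyzes the consensus update.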
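The critic step described above can be illustrated with a minimal sketch, assuming the action-value variant with a linear critic in the average-reward setting. The function name `consensus_critic_step`, the array layout, the single step size `beta`, and the shared feature vectors are all assumptions made for illustration; the review does not give the update equations. Each agent takes a local TD step using only its own reward, then averages its critic weights with those of its neighbors through a row-stochastic weight matrix `C` for the current graph.

```python
import numpy as np

def consensus_critic_step(omega, mu, phi_t, phi_next, rewards, C, beta=0.1):
    """One hypothetical critic step for N agents with a linear Q approximation.

    omega:    (N, d) critic weights, one row per agent
    mu:       (N,)   each agent's running estimate of its average reward
    phi_t:    (d,)   features of the current (state, joint action) pair
    phi_next: (d,)   features of the next (state, joint action) pair
    rewards:  (N,)   local rewards; entry i is used only in agent i's update
    C:        (N, N) row-stochastic consensus weights for the current graph
    """
    q_t = omega @ phi_t          # each agent's estimate of Q at time t
    q_next = omega @ phi_next    # each agent's estimate of Q at time t+1
    # Local average-reward TD errors, computed from each agent's own reward.
    delta = rewards - mu + q_next - q_t
    # Track the long-run average of the local reward.
    mu_new = (1.0 - beta) * mu + beta * rewards
    # Local TD step on the critic weights (gradient of linear Q is phi_t).
    omega_local = omega + beta * delta[:, None] * phi_t[None, :]
    # Consensus step: weighted average of neighbors' critic weights.
    omega_new = C @ omega_local
    return omega_new, mu_new
```

Only critic weights cross the network in this sketch; the rewards stay local. The actor update is omitted: in the setting summarized above, each agent would adjust its own policy parameters using its local critic, without communicating.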

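Experimental Results

Research Questions

RQ1: How can fully decentralized MARL be formulated for networked agents with local rewards and no central controller?
RQ2: Can decentralized actor-critic algorithms converge when function approximation is used over a time-varying network topology?
RQ3: What role does the consensus update play in achieving network-wide optimization in MARL?
RQ4: How does linear function approximation affect the convergence guarantees in the proposed framework?
RQ5: Can the proposed algorithms scale to many agents and high-dimensional state-action spaces while remaining online?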

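Key Results

Two decentralized actor-critic algorithms with function approximation are proposed for networked MARL over time-varying graphs.
A policy gradient theorem for MARL is established that enables local actor updates and admits consensus-based critic estimates.
Convergence guarantees are proved for both algorithms in the linear function approximation case.
The algorithms are fully incremental and can be implemented online, and they preserve agent privacy by avoiding the transmission of individual rewards.
Simulations with both linear and nonlinear function approximation validate the proposed methods and support the theory.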
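For orientation, the objective and the policy gradient result mentioned above can be written roughly as follows. The notation is assumed here, since the review states no formulas: $N$ agents, agent $i$ with local reward $r^i$ and policy $\pi^i_{\theta^i}$, joint action $a = (a^1, \dots, a^N)$, stationary state distribution $d_\theta$, and global advantage function $A_\theta$ under the averaged reward.

```latex
J(\theta) = \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[ \sum_{t=0}^{T-1} \frac{1}{N} \sum_{i=1}^{N} r^i_{t+1} \right],
\qquad
\nabla_{\theta^i} J(\theta) =
  \mathbb{E}_{s \sim d_\theta,\, a \sim \pi_\theta}\!\left[
    \nabla_{\theta^i} \log \pi^i_{\theta^i}(s, a^i)\, A_\theta(s, a)
  \right].
```

The gradient with respect to agent $i$'s parameters involves only that agent's own score function, which is why the actor step can be local. The advantage depends on the globally averaged reward, which no single agent observes, so each agent relies on an estimate maintained through the consensus critic.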

This review was generated by AI and checked by a human editor.