QUICK REVIEW

[Paper Review] Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning

Jakob Foerster, Francis Song | arXiv (Cornell University) | November 4, 2018

Reinforcement Learning in Robotics | References: 29 | Citations: 49

One-Line Summary

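BAD introduces the public belief MDP (PuB-MDP), a framework that enables scalable counterfactual reasoning in cooperative, partially observable multi-agent reinforcement learning, and achieves state-of-the-art performance on Hanabi.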

ABSTRACT

When observing the actions of others, humans make inferences about why they acted as they did, and what this implies about the world; humans also use the fact that their actions will be interpreted in this manner, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. BAD introduces a new Markov decision process, the public belief MDP, in which the action space consists of all deterministic partial policies, and exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over all partial policies mapping private information into environment actions. The Bayesian update is closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms policy gradient methods; we then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where, in the two-player setting, it surpasses all previously published learning and hand-coded approaches, establishing a new state of the art.

Motivation and Objectives

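Motivate the learning of effective communication and conventions in cooperative, partially observable multi-agent settings.
Introduce the public belief framework (PuB-MDP) to coordinate agents that hold private information.
Develop a method that uses deep networks to learn deterministic partial policies conditioned on private observations.
Demonstrate performance gains over baselines on a toy matrix game and on Hanabi.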

Proposed Method

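Define the public belief as B_t = P(f^pri | f^pub_≤t) and construct the PuB-MDP, whose state is (B_t, f^pub) and whose action space consists of deterministic partial policies.
The public agent, BAD, selects a partial policy conditioned on B_t and f^pub; the acting agent then applies that partial policy to its private observation to choose the environment action.
Maintain B_t with an approximate, factorized Bayesian update that uses per-feature likelihoods and sample-based updates.
Parameterize the BAD policy with a structure that factorizes across private observations, which makes learning over partial policies scalable with deep networks.
Share a common random seed so that all agents sample the same partial policy from the BAD policy, enabling coordinated, team-level exploration.
Introduce self-consistent belief refinement (the V0, V1, and V2 beliefs) to handle interactions between features, along with an optional procedure for improving consistency.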
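The public-belief mechanism described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the paper's implementation: the private feature is a single card with three possible values, there are two environment actions, and the Bayesian update is exact rather than the approximate, factorized update used for Hanabi. The names PRIVATE_VALUES, ACTIONS, all_partial_policies, sample_partial_policy, and update_public_belief are hypothetical and chosen only for illustration.

```python
import random
from itertools import product

# Hypothetical toy setting (assumption, not from the paper): the acting agent
# privately holds one of three card values and has two environment actions.
PRIVATE_VALUES = [0, 1, 2]
ACTIONS = [0, 1]


def all_partial_policies():
    """Enumerate every deterministic partial policy (private value -> action)."""
    return [
        dict(zip(PRIVATE_VALUES, acts))
        for acts in product(ACTIONS, repeat=len(PRIVATE_VALUES))
    ]


def sample_partial_policy(policy_probs, seed):
    """Sample a partial-policy index from a distribution using a shared seed.

    Agents that call this with the same probabilities and the same seed
    draw the same index, which mimics the common-seed trick.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(policy_probs)), weights=policy_probs, k=1)[0]


def update_public_belief(belief, partial_policy, observed_action):
    """Exact Bayesian update of the belief over the private value.

    Because the partial policy is deterministic, the likelihood of the
    observed action is 1 for private values mapped to it and 0 otherwise.
    """
    unnormalized = [
        b if partial_policy[v] == observed_action else 0.0
        for v, b in zip(PRIVATE_VALUES, belief)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observed action is impossible under this belief")
    return [u / total for u in unnormalized]


policies = all_partial_policies()  # 2**3 = 8 deterministic partial policies
uniform = [1 / len(policies)] * len(policies)

# Two agents with the same seed pick the same partial policy.
assert sample_partial_policy(uniform, seed=7) == sample_partial_policy(uniform, seed=7)

# Everyone knows the chosen partial policy, so the observed action is informative.
chosen = {0: 0, 1: 1, 2: 1}
prior = [0.5, 0.25, 0.25]
posterior = update_public_belief(prior, chosen, observed_action=1)
print(posterior)  # [0.0, 0.5, 0.5]
```

Because the partial policy is public and deterministic, observing action 1 rules out private value 0 and renormalizes the rest. Enumerating all partial policies is only feasible in a toy this small, which is why the summary above refers to a factorized parameterization and an approximate update for Hanabi.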

Experimental Results

Research Questions

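RQ1: Does the public belief MDP (PuB-MDP) make communication-based convention learning scalable in cooperative MARL with private observations?
RQ2: Do the factorized, approximate Bayesian updates of the public belief yield practical performance gains in large state spaces such as Hanabi?
RQ3: How does BAD compare with policy-gradient baselines and hand-coded agents in two-player Hanabi?
RQ4: Under BAD, how much do grounded information and conventions each contribute to Hanabi performance?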

Key Results

Agent	Training steps	Mean ± standard error	Proportion of perfect games
SmartBot	-	23.09	29.52%
FireFlower	-	23.37 ± 0.0002	52.6%
V0-LSTM	20.2B	23.622 ± 0.005	36.5%
V1-LSTM	21.1B	23.919 ± 0.004	47.5%
BAD	16.3B	24.174 ± 0.004	58.6%
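For readers who want to reproduce the two statistics reported in the table, the following is a minimal sketch of how a mean, its standard error, and the proportion of perfect games could be computed from a list of per-game scores. The function name summarize_scores and the example scores are hypothetical; the only fact assumed about the game is that a perfect Hanabi score is 25. The paper's exact evaluation protocol is not described in this review.

```python
import math


def summarize_scores(scores, perfect_score=25):
    """Return (mean, standard error of the mean, proportion of perfect games)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two games to estimate a standard error")
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(variance / n)
    perfect = sum(1 for s in scores if s == perfect_score) / n
    return mean, sem, perfect


# Made-up scores for illustration only; they are not results from the paper.
mean, sem, perfect = summarize_scores([25, 25, 23, 23])
print(f"{mean:.3f} ± {sem:.3f}, perfect games: {perfect:.1%}")
```

The standard error shrinks with the square root of the number of evaluation games, which is consistent with the very small error bars reported for agents evaluated over many games.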

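BAD outperforms policy-gradient baselines on the proof-of-principle two-step matrix game.
In two-player Hanabi, BAD achieves a mean score of 24.174, about 9 points above previously published learning methods and approaching the performance of cheating open-hand play.
BAD achieves a high proportion of perfect games in evaluation (58.6% in Table 1).
The belief learned through the Bayesian update reduces uncertainty about the hand by about 40% relative to a baseline based on grounded inference.
About 40% of the information in a Hanabi game is conveyed through conventions, as confirmed by an analysis of in-game behavior.
BAD sets a new state of the art for two-player self-play in the Hanabi Learning Environment.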


This review was generated by AI and reviewed by a human editor.