QUICK REVIEW

[Paper Review] Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Hengyuan Hu, Jakob Foerster | arXiv (Cornell University) | 2020. 04. 30.

Reinforcement Learning in Robotics | References: 30 | Citations: 17

One-line summary

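This paper introduces the Simplified Action Decoder (SAD), a method for cooperative multi-agent reinforcement learning that lets agents see their teammates' greedy (intended) actions alongside the exploratory actions actually taken during training. By exploiting the centralized training phase and adding an auxiliary state-prediction task, SAD reaches state-of-the-art performance for 2–5 players on the self-play part of the Hanabi challenge and resolves the tension between exploration and informativeness.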

ABSTRACT

In recent years we have seen fast progress on a number of benchmark problems in AI, with modern methods achieving near or super human performance in Go, Poker and Dota. One common aspect of all of these challenges is that they are by design adversarial or, technically speaking, zero-sum. In contrast to these settings, success in the real world commonly requires humans to collaborate and communicate with others, in settings that are, at least partially, cooperative. In the last year, the card game Hanabi has been established as a new benchmark environment for AI to fill this gap. In particular, Hanabi is interesting to humans since it is entirely focused on theory of mind, i.e. the ability to effectively reason over the intentions, beliefs and point of view of other agents when observing their actions. Learning to be informative when observed by others is an interesting challenge for Reinforcement Learning (RL): Fundamentally, RL requires agents to explore in order to discover good policies. However, when done naively, this randomness will inherently make their actions less informative to others during training. We present a new deep multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this contradiction exploiting the centralized training phase. During training SAD allows other agents to not only observe the (exploratory) action chosen, but agents instead also observe the greedy action of their team mates. By combining this simple intuition with an auxiliary task for state prediction and best practices for multi-agent learning, SAD establishes a new state of the art for 2-5 players on the self-play part of the Hanabi challenge.

Research motivation and goals

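Address the challenge of balancing exploration and informativeness in cooperative multi-agent reinforcement learning.
Let agents infer their teammates' intended actions even when exploratory actions occur during training.
Improve communication efficiency in partially observable cooperative environments such as Hanabi.
Overcome the inherent contradiction that exploratory actions reduce the information shared during training.
Establish a new state of the art for 2–5 players in self-play Hanabi using a simple yet effective architecture.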

Proposed method

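Introduces a centralized training mechanism in which agents observe not only the (exploratory) action a teammate actually takes but also that teammate's greedy action.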
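Makes the teammate's intended action available through this extra greedy-action observation, instead of leaving it to be inferred from noisy exploratory behavior.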
Adds an auxiliary state-prediction task intended to improve policy generalization and communication.
Applies established best practices for multi-agent learning.
Trains the policy end to end, combining the reinforcement learning objective with the auxiliary state-prediction loss.
Decouples exploration from communication, so that intent remains readable even when the executed actions are stochastic.
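
The mechanism listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`greedy_action`, `epsilon_greedy`, `sad_step`), the Q-value list, and the dictionary layout of what teammates observe are all assumptions made for the example. It shows an agent choosing an exploratory action with epsilon-greedy while also exposing its greedy action to teammates; with exploration switched off, the two coincide.

```python
import random

def greedy_action(q_values):
    """Index of the highest Q-value, i.e. the action the agent intends."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

def sad_step(q_values, epsilon, rng, training=True):
    """Return what teammates observe for one acting agent (hypothetical layout)."""
    # Assumption: exploration is only active during training.
    eps = epsilon if training else 0.0
    return {
        # The action that is actually played in the environment.
        "executed_action": epsilon_greedy(q_values, eps, rng),
        # The intended action, shared as an extra observation for teammates.
        "greedy_action": greedy_action(q_values),
    }

rng = random.Random(0)
q = [0.1, 0.7, 0.2]  # illustrative Q-values for three actions
train_obs = sad_step(q, epsilon=0.5, rng=rng, training=True)
eval_obs = sad_step(q, epsilon=0.5, rng=rng, training=False)
print(train_obs, eval_obs)
```

During training the executed action may differ from the greedy one, but teammates always see the greedy action as well. At evaluation the sketch sets epsilon to 0, so both fields are identical and the extra observation carries no additional information.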

Experimental results

Research questions

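RQ1: Can decoding teammates' intentions during training improve cooperative multi-agent communication in partially observable environments such as Hanabi?
RQ2: How does incorporating centralized access to teammates' greedy actions affect the performance of cooperative multi-agent reinforcement learning?
RQ3: How much does the auxiliary state-prediction task contribute to communication and policy learning in cooperative settings?
RQ4: Can a simple architectural modification resolve the trade-off between exploration and informativeness in cooperative multi-agent reinforcement learning?
RQ5: Does the proposed method achieve state-of-the-art performance for 2–5 players in self-play Hanabi?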

Key findings

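SAD sets a new state of the art for 2–5 players on the self-play part of the Hanabi challenge.
It effectively decouples exploration from communication by letting agents infer their teammates' intended actions.
The auxiliary state-prediction task contributes to better policy generalization and communication efficiency.
The method resolves the fundamental conflict between exploratory behavior and informative action selection during training.
It shows a clear performance gain over previous methods without complex architectural changes.
Exploiting the centralized training phase enables effective intent decoding, which substantially improves team-level cooperation.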
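
To make the conflict between exploration and informativeness concrete, the sketch below computes how often an observed action matches the acting agent's greedy action under epsilon-greedy exploration. It assumes that exploration draws uniformly from all actions, including the greedy one; the function name and the example numbers are illustrative and are not taken from the paper.

```python
def prob_executed_is_greedy(epsilon, num_actions):
    """P(executed == greedy) when, with probability epsilon, the action
    is drawn uniformly from all num_actions actions."""
    return (1.0 - epsilon) + epsilon / num_actions

# Illustrative settings only: 20 possible actions, several exploration rates.
for eps in (0.0, 0.1, 0.5):
    p = prob_executed_is_greedy(eps, num_actions=20)
    print(f"epsilon={eps}: executed action equals greedy action with p={p:.3f}")
```

With epsilon = 0.1 and 20 actions, about 9.5% of executed actions differ from the greedy one, so a teammate who only sees executed actions receives a noisier signal about intent. Observing the greedy action directly removes that noise without reducing exploration.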


This review was generated by AI and reviewed by a human editor.