QUICK REVIEW

[Paper Review] A multi-agent reinforcement learning model of common-pool resource appropriation

Julien Pérolat, Joel Z. Leibo|arXiv (Cornell University)|2017. 07. 20.

Experimental Behavioral Economics Studies | References: 30 | Citations: 68

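One-line summary

The paper uses independent deep reinforcement learning agents in a spatially dynamic common-pool resource (CPR) game to study emergent behaviors, including exclusion, sustainability, and inequality, and analyzes them with empirical game-theoretic tools.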

ABSTRACT

Humanity faces numerous problems of common-pool resource appropriation. This class of multi-agent social dilemma includes the problems of ensuring sustainable use of fresh water, common fisheries, grazing pastures, and irrigation systems. Abstract models of common-pool resource appropriation based on non-cooperative game theory predict that self-interested agents will generally fail to find socially positive equilibria, a phenomenon called the tragedy of the commons. However, in reality, human societies are sometimes able to discover and implement stable cooperative solutions. Decades of behavioral game theory research have sought to uncover aspects of human behavior that make this possible. Most of that work was based on laboratory experiments where participants only make a single choice: how much to appropriate. Recognizing the importance of spatial and temporal resource dynamics, a recent trend has been toward experiments in more complex real-time video game-like environments. However, standard methods of non-cooperative game theory can no longer be used to generate predictions for this case. Here we show that deep reinforcement learning can be used instead. To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation. Our experiments highlight the importance of trial-and-error learning in common-pool resource appropriation and shed light on the relationship between exclusion, sustainability, and inequality.

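Motivation and objectives

Motivates modeling CPR problems beyond static game theory, in environments that are dynamic, spatial, and evolve over time.
Investigates whether independently learning agents can self-organize to appropriate a common resource sustainably.
Examines how exclusion mechanisms and territory formation affect sustainability and inequality.
Connects learning dynamics to game-theoretic concepts and provides metrics that summarize social outcomes.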

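Proposed method

Models a partially observable N-player Markov game in which agents harvest apples whose regrowth rate depends on the local stock.
Uses independent Deep Q-Network (DQN) agents that learn policies through interaction, without centralized coordination.
Introduces four social outcome metrics to summarize group behavior: Utilitarian (U), Equality (E), Sustainability (S), and Peace (P).
Analyzes the emergent policies and performs an empirical game-theoretic analysis that characterizes incentives via Schelling diagrams.
Examines several variants, including a time-out tagging mechanism that lets an agent exclude others from the resource.
Provides observations and video examples of learned policies across the stages of learning.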
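The four social outcome metrics are named above but not defined in this review. The sketch below is one reading of them, stated as an assumption: U as total reward per timestep, E as one minus the Gini coefficient of per-agent returns, S as the average timestep at which agents collect positive reward, and P as the average number of agents not timed out per timestep. The data layout (`rewards[i][t]`, `timed_out[i][t]`) and all function names are hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # hypothetical layout: one row per agent, one column per timestep


def utilitarian(rewards: Matrix) -> float:
    """Total reward collected by all agents, divided by episode length."""
    episode_length = len(rewards[0])
    return sum(sum(row) for row in rewards) / episode_length


def equality(rewards: Matrix) -> float:
    """One minus the Gini coefficient of per-agent returns."""
    returns = [sum(row) for row in rewards]
    n, total = len(returns), sum(returns)
    if total == 0:
        return 1.0  # assumption: if nobody earned anything, treat the outcome as equal
    pairwise = sum(abs(a - b) for a in returns for b in returns)
    return 1.0 - pairwise / (2 * n * total)


def sustainability(rewards: Matrix) -> float:
    """Mean timestep at which positive reward is collected, averaged over agents."""
    per_agent = []
    for row in rewards:
        times = [t for t, r in enumerate(row) if r > 0]
        # assumption: an agent that never collects reward contributes 0
        per_agent.append(sum(times) / len(times) if times else 0.0)
    return sum(per_agent) / len(per_agent)


def peace(timed_out: Matrix) -> float:
    """Average number of agents that are not timed out, per timestep."""
    n, episode_length = len(timed_out), len(timed_out[0])
    tagged_steps = sum(sum(row) for row in timed_out)
    return (n * episode_length - tagged_steps) / episode_length


# Toy episode: two agents, four timesteps; agent 1 is timed out for two steps.
rewards = [[1, 1, 0, 0], [0, 0, 1, 1]]
timed_out = [[0, 0, 0, 0], [0, 1, 1, 0]]
print(utilitarian(rewards), equality(rewards), sustainability(rewards), peace(timed_out))
```

Under this reading, a higher S means rewards arrive later in the episode, which is only possible if the stock has not already been exhausted.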

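Experimental results

Research questions

RQ1: Can independent deep reinforcement learning agents self-organize to appropriate a CPR sustainably in a spatially dynamic environment?
RQ2: How does an exclusion mechanism (tagging) affect sustainability, equality, and overall efficiency?
RQ3: How do the social-psychology-like phases that emerge during learning (naïvety, tragedy, maturity) relate to the resource stock?
RQ4: How do empirical game-theoretic tools (Schelling diagrams) characterize the strategic incentives that develop among learning agents?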
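For readers unfamiliar with Schelling diagrams: they plot the payoff of a focal agent under each of two strategies as a function of how many other agents follow the first strategy. The sketch below shows one way such curves could be estimated empirically by averaging sampled returns. The strategy labels, the sample format, and the numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample format: (focal strategy, number of OTHER cooperators, focal return)
Sample = Tuple[str, int, float]


def schelling_curves(samples: Iterable[Sample]) -> Dict[str, Dict[int, float]]:
    """Average the focal agent's return per strategy and per count of other cooperators."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for strategy, others_cooperating, ret in samples:
        sums[strategy][others_cooperating] += ret
        counts[strategy][others_cooperating] += 1
    return {
        s: {k: sums[s][k] / counts[s][k] for k in sorted(sums[s])}
        for s in sums
    }


def defection_advantage(curves: Dict[str, Dict[int, float]]) -> Dict[int, float]:
    """Gap between the 'defect' and 'cooperate' curves wherever both are observed."""
    coop, defect = curves["cooperate"], curves["defect"]
    return {k: defect[k] - coop[k] for k in sorted(set(coop) & set(defect))}


# Made-up returns, for illustration only.
samples = [
    ("cooperate", 0, 1.0), ("cooperate", 0, 3.0), ("defect", 0, 4.0),
    ("cooperate", 1, 5.0), ("defect", 1, 8.0),
]
curves = schelling_curves(samples)
print(curves)
print(defection_advantage(curves))
```

A gap that stays positive at every count indicates that defecting pays regardless of what others do, while curves that both rise with the number of cooperators indicate a positive externality from cooperation.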

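Key findings

Single-agent learning can yield a sustainable policy in isolation.
In the multi-agent setting, group return does not consistently reflect individual learning progress; the social metrics reveal phase changes that individual rewards alone do not.
Three learning phases emerge: naïvety (healthy stock and high efficiency), tragedy (rapid depletion), and maturity (stock maintained through exclusion dynamics).
Exclusion via time-out tagging can create private territories that sustain the stock and raise the tagger's individual return, while increasing inequality among agents.
Territorial structure and easier exclusion lead to greater inequality; maps with multiple entrances or no walls reduce it.
Empirical game-theoretic analysis with Schelling diagrams shows strategic incentives shifting over time from uniform externalities to situation-dependent externalities, indicating evolving strategic dynamics.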
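The tragedy phase in the findings above follows from regrowth that depends on the remaining stock: once the stock is gone, nothing grows back. The toy model below illustrates this with an aggregate logistic regrowth rule, which is a stand-in chosen for simplicity and not the paper's spatial apple-respawn rule. All parameter values and the `simulate` function are hypothetical.

```python
from typing import Tuple


def simulate(harvest_per_step: float, steps: int = 200, stock: float = 50.0,
             capacity: float = 100.0, growth_rate: float = 0.1) -> Tuple[float, float]:
    """Return (total harvested, final stock) for a fixed per-step harvest demand."""
    total = 0.0
    for _ in range(steps):
        taken = min(harvest_per_step, stock)  # cannot take more than what is left
        stock -= taken
        total += taken
        # Logistic regrowth: zero at an empty stock and at full capacity.
        stock += growth_rate * stock * (1.0 - stock / capacity)
    return total, stock


restrained_total, restrained_stock = simulate(harvest_per_step=2.0)
greedy_total, greedy_stock = simulate(harvest_per_step=10.0)
print("restrained:", restrained_total, restrained_stock)
print("greedy:    ", greedy_total, greedy_stock)
```

In this toy setting the greedy harvester empties the stock within a few steps and ends with a much smaller total than the restrained one, which mirrors the gap between the tragedy phase and a sustainable policy.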

This review was generated by AI and reviewed by a human editor.