QUICK REVIEW

[Paper Review] Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Kaiqing Zhang, Sham M. Kakade|arXiv (Cornell University)|2020. 07. 15.

Reinforcement Learning in Robotics|References 69|Citations 24

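One-Line Summary

This paper establishes the first near-minimax-optimal sample complexity for model-based multi-agent reinforcement learning in two-player zero-sum Markov games with a generative model. The bound is minimax-optimal in the reward-agnostic setting and near-optimal in the reward-aware setting. The approach finds an $\epsilon$-Nash equilibrium with a sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}||\mathcal{B}|(1-\gamma)^{-3}\epsilon^{-2})$.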

ABSTRACT

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde{\Omega}(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), which particularly arises in the multi-agent context.

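Research Motivation and Objectives

To analyze the sample complexity of model-based multi-agent reinforcement learning in two-player zero-sum Markov games with a generative model.
To address the fundamental question of whether the simple model-based approach (first learn a model, then plan) achieves near-optimal sample efficiency.
To distinguish the reward-aware and reward-agnostic settings when considering sample complexity lower bounds in MARL.
To establish minimax optimality (up to logarithmic factors) in the reward-agnostic setting by proving a matching lower bound.
To clarify the tradeoff between the method's power (it handles many reward functions without resampling) and its limitation (suboptimal dependence on |A| and |B| in the reward-aware case).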

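Proposed Method

Takes a model-based approach: samples data from the generative model, estimates the transition model, and then computes equilibrium policies by planning.
Uses a smooth planning oracle to compute Nash equilibrium policies in the empirical model, which guarantees that the resulting policies form an $\epsilon$-Nash equilibrium.
Bounds the value-function estimation error with concentration inequalities, relying on the i.i.d. nature of the samples drawn from the generative model.
Derives a sample complexity upper bound of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}||\mathcal{B}|(1-\gamma)^{-3}\epsilon^{-2})$ for reaching an $\epsilon$-Nash equilibrium.
Proves a matching lower bound in the reward-agnostic setting, establishing minimax optimality (up to logarithmic factors).
Distinguishes the reward-aware and reward-agnostic settings, showing that the model-based method is near-optimal in the former and optimal in the latter.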
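
The sample-then-plan pipeline above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes two actions per player so that each stage game has a closed-form value, and plain Shapley-style value iteration stands in for the smooth planning oracle. The names `sample_model`, `matrix_game_value`, `plan`, and `toy_generative_model` are hypothetical, and the toy dynamics are invented for illustration.

```python
import random


def sample_model(generative_model, n_states, n_samples, rng):
    """Estimate transition probabilities from n_samples draws per (s, a, b)."""
    p_hat = {}
    for s in range(n_states):
        for a in range(2):  # assumption: two actions for each player
            for b in range(2):
                counts = [0] * n_states
                for _ in range(n_samples):
                    counts[generative_model(s, a, b, rng)] += 1
                p_hat[(s, a, b)] = [c / n_samples for c in counts]
    return p_hat


def matrix_game_value(m):
    """Value of a 2x2 zero-sum matrix game; the row player maximizes."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:  # pure-strategy saddle point
        return maximin
    # No saddle point: closed-form value of the mixed equilibrium.
    return (a * d - b * c) / ((a - b) + (d - c))


def plan(p_hat, reward, n_states, gamma, n_iters=500):
    """Shapley-style value iteration on the empirical model."""
    v = [0.0] * n_states
    for _ in range(n_iters):
        new_v = []
        for s in range(n_states):
            # Stage game at state s: immediate reward plus discounted future value.
            q = [[reward(s, a, b)
                  + gamma * sum(p * x for p, x in zip(p_hat[(s, a, b)], v))
                  for b in range(2)] for a in range(2)]
            new_v.append(matrix_game_value(q))
        v = new_v
    return v


def toy_generative_model(s, a, b, rng):
    """Hypothetical 2-state dynamics, for illustration only."""
    p_one = 0.8 if a == b else 0.3
    return 1 if rng.random() < p_one else 0


rng = random.Random(0)
p_hat = sample_model(toy_generative_model, n_states=2, n_samples=200, rng=rng)
# Reward-agnostic reuse: the same empirical model serves two reward functions.
v_pennies = plan(p_hat, lambda s, a, b: 1.0 if a == b else 0.0, 2, gamma=0.9)
v_state = plan(p_hat, lambda s, a, b: float(s), 2, gamma=0.9)
print(v_pennies, v_state)
```

Sampling happens once, before any reward function is consulted, and uses `n_states * 2 * 2 * n_samples` generative-model calls in total, which mirrors the $|\mathcal{S}||\mathcal{A}||\mathcal{B}|$ factor in the bound. Both planning calls then reuse `p_hat` without drawing new samples.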

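Experimental Results

Research Questions

RQ1: What is the sample complexity of model-based multi-agent reinforcement learning in two-player zero-sum Markov games with access to a generative model?
RQ2: Is the model-based approach minimax-optimal in the reward-agnostic setting, where rewards are not used during data collection?
RQ3: How does the sample complexity of the model-based approach compare with the information-theoretic lower bound in the reward-aware case?
RQ4: What is the fundamental tradeoff between the method's ability to handle many reward functions and its dependence on the action-space sizes |A| and |B|?
RQ5: Can a lower bound matching the upper bound be derived in the reward-agnostic case to establish minimax optimality?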

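Key Results

The model-based multi-agent reinforcement learning approach achieves a sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}||\mathcal{B}|(1-\gamma)^{-3}\epsilon^{-2})$ for finding an $\epsilon$-Nash equilibrium in two-player zero-sum Markov games.
This sample complexity is minimax-optimal (up to logarithmic factors) in the reward-agnostic setting, where rewards are not used while sampling the model.
In the reward-aware setting it is near-optimal: compared with the lower bound $\tilde{\Omega}(|\mathcal{S}|(|\mathcal{A}|+|\mathcal{B}|)(1-\gamma)^{-3}\epsilon^{-2})$, a gap remains only in the dependence on |A| and |B|.
The method is sample-efficient: its sample complexity scales with the product of the state-space and action-space sizes rather than with the number of reward functions.
Because the same model can be reused across different reward functions, the method can handle multiple reward functions without resampling.
The analysis reveals a fundamental tradeoff: the method is robust and efficient in the reward-agnostic setting, but less adaptive and suboptimal in its dependence on |A| and |B| in the reward-aware setting.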
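
To make the size of the reward-aware gap concrete, the two rates can be compared numerically. The sketch below simply evaluates the leading terms of the stated upper and lower bounds with constants and logarithmic factors dropped, so the numbers indicate scaling only and are not actual sample counts. The function names are hypothetical.

```python
def upper_bound_rate(n_s, n_a, n_b, gamma, eps):
    """Leading term |S||A||B| (1-gamma)^-3 eps^-2 of the upper bound."""
    return n_s * n_a * n_b / ((1 - gamma) ** 3 * eps ** 2)


def reward_aware_lower_rate(n_s, n_a, n_b, gamma, eps):
    """Leading term |S|(|A|+|B|) (1-gamma)^-3 eps^-2 of the lower bound."""
    return n_s * (n_a + n_b) / ((1 - gamma) ** 3 * eps ** 2)


def action_gap(n_a, n_b):
    """Ratio of the two rates; it depends only on the action-space sizes."""
    return n_a * n_b / (n_a + n_b)


for n_actions in (2, 10, 100):
    up = upper_bound_rate(50, n_actions, n_actions, gamma=0.9, eps=0.1)
    low = reward_aware_lower_rate(50, n_actions, n_actions, gamma=0.9, eps=0.1)
    print(n_actions, up, low, action_gap(n_actions, n_actions))
```

The ratio $|\mathcal{A}||\mathcal{B}|/(|\mathcal{A}|+|\mathcal{B}|)$ is independent of $|\mathcal{S}|$, $\gamma$, and $\epsilon$, which is why the gap is described as one in the |A|, |B| dependence only. With equal action-space sizes it grows linearly, as $|\mathcal{A}|/2$.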

This review was generated by AI and reviewed by a human editor.