QUICK REVIEW

[Paper Review] Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Gen Li, Yuting Wei|arXiv (Cornell University)|2020. 05. 26.

Reinforcement Learning in Robotics | References: 52 | Citations: 29

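One-Line Summary

This paper breaks the long-standing sample size barrier in model-based reinforcement learning with a generative model. For discounted infinite-horizon MDPs, it certifies the minimax optimality of two algorithms, a perturbed model-based algorithm and a conservative model-based algorithm, as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (up to log factors). It further shows that a plain model-based planner is minimax optimal for time-inhomogeneous finite-horizon MDPs, giving the first minimax-optimal guarantees that cover the entire range of sample sizes.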

ABSTRACT

This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).

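Motivation and Objectives

To address the long-standing sample size barrier in model-based reinforcement learning, where prior guarantees hold only when the sample size exceeds $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$.
To establish the minimax-optimal sample complexity of model-based planning for discounted infinite-horizon MDPs under a generative model.
To extend minimax optimality to time-inhomogeneous finite-horizon MDPs using a plain model-based planner.
To fully characterize the fundamental trade-off between sample complexity and statistical accuracy over the entire range of feasible sample sizes.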

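Proposed Method

Proposes a perturbed model-based planning algorithm that is certified minimax optimal as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (up to log factors).
Introduces a conservative model-based algorithm that attains the same minimax optimality under the same sample size condition.
Uses $(s,a)$-absorbing MDPs to decouple the statistical dependency that arises in value function estimation.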
Uses a tie-breaking strategy to control the variance of policy evaluation under Bernoulli-type conditions.
Analyzes value function dynamics in both the infinite-horizon and finite-horizon settings using matrix notation and the Bellman equations.
Applies telescoping sums and the Cauchy-Schwarz inequality to bound the growth rate of the value function sequence in finite-horizon MDPs.
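The perturbed model-based planning idea listed above can be illustrated with a minimal sketch: draw samples from a generative model, build an empirical transition kernel, add small random noise to the rewards, and plan on the resulting model. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy two-state MDP (`TRUE_P`, `REWARDS`), the Uniform(0, `xi`) perturbation, the use of plain value iteration as the planner, and all function names.

```python
import random


def estimate_model(sampler, n_states, n_actions, n_samples):
    """Build the empirical kernel P_hat[s][a][s'] from n_samples draws per (s, a)."""
    p_hat = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s][a][sampler(s, a)] += 1.0 / n_samples
    return p_hat


def perturb_rewards(rewards, xi, rng):
    """Add independent Uniform(0, xi) noise to every reward entry (assumed form)."""
    return [[r + rng.uniform(0.0, xi) for r in row] for row in rewards]


def q_value(p, r, v, gamma, s, a):
    """One-step Bellman backup for the pair (s, a)."""
    return r[s][a] + gamma * sum(p[s][a][t] * v[t] for t in range(len(v)))


def plan(p, r, gamma, n_iters=500):
    """Value iteration on the model (p, r); returns values and a greedy policy."""
    n_states, n_actions = len(r), len(r[0])
    v = [0.0] * n_states
    for _ in range(n_iters):
        v = [max(q_value(p, r, v, gamma, s, a) for a in range(n_actions))
             for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a: q_value(p, r, v, gamma, s, a))
              for s in range(n_states)]
    return v, policy


# Hypothetical 2-state, 2-action MDP, used only to exercise the functions.
TRUE_P = [[[1.0, 0.0], [0.1, 0.9]],   # state 0: action 0 stays, action 1 mostly moves to state 1
          [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 returns to state 0
REWARDS = [[0.0, 0.0], [1.0, 0.0]]    # only (state 1, action 0) is rewarded

rng = random.Random(0)


def sampler(s, a):
    """Generative model: one next-state draw from the true kernel."""
    return rng.choices([0, 1], weights=TRUE_P[s][a])[0]


p_hat = estimate_model(sampler, 2, 2, n_samples=200)
r_pert = perturb_rewards(REWARDS, xi=1e-3, rng=rng)
values, policy = plan(p_hat, r_pert, gamma=0.9)
print("greedy policy on the perturbed empirical MDP:", policy)
```

The perturbation is kept tiny relative to the reward scale so that it only separates near-ties between actions; how its size should be chosen in general is not specified in this review and is left as a free parameter here.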

Results

Research Questions

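RQ1: Can the sample size barrier of model-based reinforcement learning under a generative model be overcome? In particular, can minimax optimality be achieved even in the nonlinear sampling regime?
RQ2: What is the optimal sample complexity of model-based planning in discounted infinite-horizon MDPs, and can it be attained with provable guarantees?
RQ3: Is a plain model-based planner sufficient to achieve minimax optimality in finite-horizon MDPs? If so, under what conditions?
RQ4: Can a minimax-optimal algorithm handle the entire range of sample sizes, from sublinear to superlinear, without losing statistical accuracy?
RQ5: How does statistical dependency in value function estimation affect sample complexity, and can it be decoupled effectively?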

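Key Findings

The perturbed model-based algorithm is minimax optimal as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (up to log factors), breaking the earlier $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ barrier.
The conservative model-based algorithm attains minimax optimality under the same sample size condition, indicating that the result is robust across algorithm designs.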
For finite-horizon MDPs, a plain model-based planner achieves minimax optimality with sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|H^2}{N}\right)$, where $N$ is the number of samples per state-action pair.
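The analysis shows that, for discounted infinite-horizon MDPs, the minimax-optimal guarantees hold once the sample size reaches the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (up to log factors); below this level, finding a meaningful policy is information-theoretically infeasible.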
The paper provides the first minimax-optimal guarantees that remain valid over the entire range of feasible sample sizes, including the sublinear regime.
The growth of the value function sequence is bounded as $\max_j \|\bm{V}_j^{(l)}\|_\infty \leq (\sqrt{3H})^l H$, which ensures convergence within a finite number of steps under the proposed framework.
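To make the size of the improvement concrete, the short calculation below compares the earlier sample size threshold $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ with the new one $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$. It is only an order-of-magnitude illustration: constants and log factors are ignored, and the function name and the example problem sizes are hypothetical.

```python
def barrier_thresholds(n_states, n_actions, gamma):
    """Return (old, new) sample size thresholds, ignoring constants and log factors."""
    effective_horizon = 1.0 / (1.0 - gamma)
    old = n_states * n_actions * effective_horizon ** 2  # |S||A| / (1 - gamma)^2
    new = n_states * n_actions * effective_horizon       # |S||A| / (1 - gamma)
    return old, new


# Hypothetical problem size: 100 states, 10 actions, discount factor 0.99.
old, new = barrier_thresholds(n_states=100, n_actions=10, gamma=0.99)
print(f"old threshold ~ {old:.3g}, new threshold ~ {new:.3g}, ratio ~ {old / new:.3g}")
```

The ratio between the two thresholds equals the effective horizon $\frac{1}{1-\gamma}$, so the gap widens as $\gamma$ approaches 1.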

This review was generated by AI and reviewed by a human editor.