QUICK REVIEW

[Paper Review] Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin, Zhuoran Yang | arXiv (Cornell University) | 2019-07-11

Advanced Bandit Algorithms Research | Citations: 219

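One-Line Summary

This paper presents the first provably efficient RL algorithm with polynomial runtime and polynomial sample complexity in the linear MDP setting, achieving regret Õ(√(d^3 H^3 T)) that is independent of the number of states and actions. The algorithm is optimistic LSVI with a UCB bonus, and it remains robust under small model misspecification.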

ABSTRACT

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.

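Motivation and Objectives

Motivates the design of provably efficient RL algorithms that use function approximation without relying on a simulator or strong assumptions.
Studies the linear MDP setting, in which transitions and rewards are linear in a feature map, and establishes regret and sample complexity guarantees.
Develops and analyzes an algorithm that achieves sublinear regret independent of the sizes of the state and action spaces.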
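
For reference, the linear MDP assumption can be written out as below. This is a sketch of the definition as it is commonly stated for this setting; the symbols \mu_h (a vector of d unknown measures over states) and \theta_h (an unknown reward vector) are not named in this review and are introduced here as notation.

```latex
% Linear MDP with known feature map \phi : S \times A \to \mathbb{R}^d.
% For every step h, state x, and action a:
\mathbb{P}_h(\cdot \mid x, a) = \langle \phi(x, a), \mu_h(\cdot) \rangle,
\qquad
r_h(x, a) = \langle \phi(x, a), \theta_h \rangle .
```

Under this structure the action-value function of any policy is linear in \phi, which is why a d-dimensional weight vector per step suffices.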

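Proposed Method

Adopts an optimistic modification of Least-Squares Value Iteration (LSVI) with Upper-Confidence Bounds (UCB).
Represents Q_h as a linear function of the features: Q_h(x,a) = w_h^T φ(x,a).
Updates w_h by regularized least squares, using the observed rewards and the next-step value estimates.
Adds a UCB bonus β(φ^T Λ_h^{-1} φ)^{1/2} to encourage exploration, where Λ_h is the Gram matrix.
Proves that, with suitable λ and β, the total regret under Assumption A (linear MDP) is Õ(√(d^3 H^3 T)).
For a ζ-approximate linear MDP, robustness shows up as an additive regret term Õ(ζ d H T).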
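
The update described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lsvi_ucb_step` and `optimistic_q` are hypothetical, the data layout (one row per visited state-action pair) is a choice made for the example, and the values of `lam` and `beta` are placeholders rather than the constants the analysis prescribes.

```python
import numpy as np


def lsvi_ucb_step(phi, rewards, next_values, lam=1.0):
    """Regularized least-squares update for a single step h (illustrative).

    phi:         (n, d) features of the visited (state, action) pairs
    rewards:     (n,)   observed rewards
    next_values: (n,)   max_a Q_{h+1}(x', a) at the observed next states
    lam:         ridge parameter lambda (placeholder value)
    """
    d = phi.shape[1]
    gram = lam * np.eye(d) + phi.T @ phi        # Lambda_h
    targets = rewards + next_values             # regression targets
    w = np.linalg.solve(gram, phi.T @ targets)  # w_h
    return w, gram


def optimistic_q(w, gram, phi_query, beta, horizon):
    """Q_h = min(w^T phi + beta * sqrt(phi^T Lambda^{-1} phi), H)."""
    # Each row of `solved` is Lambda^{-1} phi for one query feature vector.
    solved = np.linalg.solve(gram, phi_query.T).T
    bonus = beta * np.sqrt(np.sum(phi_query * solved, axis=1))
    return np.minimum(phi_query @ w + bonus, horizon)


# Usage with made-up data: 5 transitions, 3 features, horizon H = 4.
rng = np.random.default_rng(0)
phi = rng.normal(size=(5, 3))
w, gram = lsvi_ucb_step(phi, rng.uniform(size=5), np.zeros(5))
q_values = optimistic_q(w, gram, phi, beta=1.0, horizon=4.0)
```

Solving against the Gram matrix with `np.linalg.solve` avoids forming an explicit inverse; an implementation that cares about runtime might instead maintain the inverse incrementally, which this sketch does not attempt.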

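Results

Research Questions

RQ1: With function approximation, can an RL algorithm with polynomial runtime and polynomial sample complexity be designed without a simulator or restrictive assumptions?
RQ2: Is the linear MDP structure sufficient to guarantee sublinear regret independent of the sizes of the state and action spaces?
RQ3: How do the regret and learning guarantees change under misspecification (a ζ-approximate linear MDP)?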

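Key Findings

The proposed LSVI-UCB algorithm achieves regret Õ(√(d^3 H^3 T)) with high probability, independent of the number of states S and actions A.
The algorithm runs in O(d^2 A K T) time and O(d^2 H + d A T) space, where K is the number of episodes; both are independent of S, though they scale linearly with A.
For a ζ-approximate linear MDP, the regret becomes Õ(√(d^3 H^3 T) + ζ d H T), where Õ hides logarithmic factors; the term due to model error grows linearly in T.
Includes a PAC-style guarantee: when the initial state is fixed, an ε-optimal policy can be learned with Õ(d^3 H^4 / ε^2) samples.
The method bridges tabular RL and RL with function approximation, achieving sublinear regret without a simulator.
The analysis introduces value-aware uniform concentration and a linear structure that serves as a bridge between the true transition measure and the empirical structure.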
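
The PAC-style bound is consistent with the regret bound via a standard online-to-batch conversion, sketched below. This derivation is an illustration, not taken from the review: it assumes T = KH, a fixed initial state x_1, and an output policy \pi drawn uniformly from the K policies played. Note that the count it produces is in episodes K; reading that as the "samples" in the stated bound is an assumption of this sketch.

```latex
% Regret over K episodes from a fixed initial state x_1, with T = KH:
\mathrm{Regret}(K)
  = \sum_{k=1}^{K} \bigl[ V_1^{*}(x_1) - V_1^{\pi_k}(x_1) \bigr]
  \le \tilde{\mathcal{O}}\bigl(\sqrt{d^3 H^3 T}\bigr)
  = \tilde{\mathcal{O}}\bigl(\sqrt{d^3 H^4 K}\bigr).

% Average suboptimality of a policy \pi drawn uniformly from \pi_1, \dots, \pi_K:
\mathbb{E}\bigl[ V_1^{*}(x_1) - V_1^{\pi}(x_1) \bigr]
  = \frac{\mathrm{Regret}(K)}{K}
  \le \tilde{\mathcal{O}}\Bigl(\sqrt{d^3 H^4 / K}\Bigr)
  \le \varepsilon
  \quad \text{once} \quad
  K \ge \tilde{\mathcal{O}}\bigl(d^3 H^4 / \varepsilon^2\bigr).
```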


This review was generated by AI and reviewed by a human editor.