QUICK REVIEW

[Paper Review] Bridging Exploration and General Function Approximation in Reinforcement Learning: Provably Efficient Kernel and Neural Value Iterations.

Zhuoran Yang, Chi Jin|arXiv (Cornell University)|2020. 11. 09.

Advanced Bandit Algorithms Research|References 33|Citations 18

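One-line summary

This paper proposes the first provably efficient reinforcement learning algorithm based on kernel and neural network function approximation. It combines an optimistic variant of least-squares value iteration with exploration to achieve a regret of $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$. The method guarantees polynomial runtime and sample complexity without additional assumptions on the data-generating model, so it extends to large or infinite state spaces.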

ABSTRACT

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with functional approximation poses a fundamental challenge to developing algorithms with provable computational and statistical efficiency, due to the need to take into consideration both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate in statistical estimation. To address such a challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or over-parametrized neural network, we propose the first provable RL algorithm with both polynomial runtime and sample complexity, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore even allows it to diverge, which exhibits the benefit of function approximation.

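Research motivation and goals

Address the theoretical challenge of handling the exploration-exploitation tradeoff and the bias-variance tradeoff at the same time in reinforcement learning with function approximation.
Develop provably efficient reinforcement learning algorithms for large or infinite state spaces using kernel and neural network function approximation.
Obtain polynomial runtime and sample complexity without additional assumptions on the data-generating model.
Establish a regret bound that does not depend on the number of states, making extension to high-dimensional or continuous environments more feasible.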

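Proposed method

Proposes an optimistic modification of the least-squares value iteration algorithm to balance exploration and exploitation.
Represents the action-value function within a function class $\mathcal{F}$ using a kernel function or an over-parameterized neural network.
Incorporates uncertainty estimates into the value update to encourage exploration of less-visited state-action pairs.
Uses the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class $\mathcal{F}$ to bound the regret in terms of $\delta_{\mathcal{F}}$, $H$, and $T$.
Applies statistical learning theory to control the estimation error and guarantee generalization under function approximation.
Derives a regret bound of $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ that does not depend on the number of states.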
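The optimistic value update described above can be illustrated with a small sketch. This is not the paper's implementation: it assumes a kernel ridge regression estimate plus an uncertainty bonus in the usual upper-confidence-bound style, and the RBF kernel, the function names (`rbf_kernel`, `optimistic_q`), and the parameters `lam`, `beta`, and `q_max` are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between rows of A (n, d) and B (m, d)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def optimistic_q(Z_train, y_train, Z_query, lam=1.0, beta=1.0, q_max=1.0):
    """Kernel ridge estimate plus an uncertainty bonus, clipped to [0, q_max].

    Z_train: (n, d) past state-action features (assumed representation).
    y_train: (n,) regression targets, e.g. reward plus next-step value.
    Z_query: (m, d) state-action points to score.
    Returns the optimistic values and the bonus at each query point.
    """
    n = Z_train.shape[0]
    K = rbf_kernel(Z_train, Z_train)
    k_q = rbf_kernel(Z_train, Z_query)            # shape (n, m)
    A = K + lam * np.eye(n)                       # regularized Gram matrix
    mean = k_q.T @ np.linalg.solve(A, y_train)    # ridge prediction, shape (m,)
    # Width term k(z, z) - k(z)^T (K + lam I)^{-1} k(z); k(z, z) = 1 for RBF.
    width = 1.0 - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
    bonus = np.sqrt(np.maximum(width, 0.0) / lam)
    return np.clip(mean + beta * bonus, 0.0, q_max), bonus

# Usage: one observed point; compare a query at that point with a distant one.
Z_train = np.array([[0.0, 0.0]])
y_train = np.array([0.5])
Z_query = np.array([[0.0, 0.0], [5.0, 5.0]])
q_values, bonuses = optimistic_q(Z_train, y_train, Z_query)
```

In this sketch the bonus is smaller at the query that coincides with observed data and larger far from it, which is the mechanism that steers the agent toward less-visited state-action pairs. The clipping reflects the fact that value estimates are kept within a known range.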

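Results

Research questions

RQ1: Can a provably efficient reinforcement learning algorithm be designed that balances exploration and generalization with kernel and neural network function approximation?
RQ2: In episodic reinforcement learning with function approximation, what is the best achievable regret bound that does not depend on the size of the state space?
RQ3: How can optimism be incorporated into value iteration to obtain both computational and statistical efficiency?
RQ4: What role does the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class play in the regret of reinforcement learning with function approximation?
RQ5: Can polynomial runtime and sample complexity be achieved without restrictive assumptions on the data-generating process?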

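Key results

The proposed algorithm achieves a regret bound of $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$, which does not depend on the number of states.
The regret bound scales with the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class $\mathcal{F}$, reflecting the tradeoff between approximation error and estimation error.
The algorithm retains polynomial runtime and sample complexity even when the number of states is very large or infinite.
The optimistic least-squares value iteration framework successfully balances exploration and exploitation in reinforcement learning with function approximation.
The theoretical analysis holds without additional assumptions on the data-generating model, which makes the method more general.
The results show that function approximation can be used effectively, with provable efficiency, in high-dimensional or continuous environments.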
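To make the $\sqrt{T}$ scaling concrete, the bound can be read against the usual definition of regret in the episodic setting. The notation below ($V_1^{\star}$ for the optimal value at the first step, $\pi^t$ for the policy used in episode $t$, $x_1^t$ for that episode's initial state) is standard notation assumed here for illustration and does not appear in this review.

$$
\mathrm{Regret}(T) = \sum_{t=1}^{T} \left[ V_1^{\star}(x_1^t) - V_1^{\pi^t}(x_1^t) \right],
\qquad
\frac{\mathrm{Regret}(T)}{T} = \tilde{\mathcal{O}}\!\left( \frac{\delta_{\mathcal{F}} H^2}{\sqrt{T}} \right).
$$

Under this reading, a regret that grows like $\sqrt{T}$ means the average per-episode gap to the optimal policy shrinks at rate $1/\sqrt{T}$, up to logarithmic factors, with no dependence on the number of states.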


This review was generated by AI and checked by a human editor.