QUICK REVIEW

[Paper Review] Model-Based Reinforcement Learning with Value-Targeted Regression

Alex Ayoub, Zeyu Jia|arXiv (Cornell University)|2020. 06. 01.

Advanced Bandit Algorithms Research | References: 44 | Citations: 71

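One-Line Summary

The paper introduces UCRL-VTR, a model-based RL algorithm that uses value-targeted regression to construct confidence sets and then plans optimistically over them. Its regret bound scales with the complexity of the model class rather than with the size of the state or action space, and it specializes to an explicit bound for linear mixture models.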

ABSTRACT

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_\theta = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are "consistent" with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).

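Motivation and Objectives

Address regret minimization in online model-based RL when the transition model belongs to a known family P.
Propose value-targeted regression as a way to build confidence sets over P that are consistent with the collected data.
Develop an optimistic planning algorithm (UCRL-VTR) that uses these sets.
Establish theoretical regret bounds and evaluate the method experimentally.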

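Proposed Method

Define an episodic MDP with a known model family P, including the linear mixture case P_θ = Σ_j θ_j P_j.
Introduce value-targeted regression, which forms a regression loss L_{k+1}(P, P̂_{k+1}) from the predicted values V_{h+1,k} and the observed targets y_{h,k}.
Construct the confidence set B_k from the regression loss, e.g. B_{k+1} = {P' ∈ P : L_{k+1}(P', P̂_{k+1}) ≤ β_{k+1}}.
In each episode, plan optimistically over B_k by selecting the model P_k that maximizes V^{*}_{P',1}(s_1^k), execute the induced policy, and update the value targets.
Provide regret bounds in terms of the Eluder dimension and covering numbers; specialized to linear mixtures, these give an upper bound of Õ(d √(H^3 T)) against a lower bound of Ω(√(HdT)), where T is the total number of steps.
Discuss implementation considerations and the connection to MuZero.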
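For the linear mixture case, the value-targeted regression step listed above reduces to a least-squares problem: the feature for a transition (s, a) is the vector of next-state values predicted by each base model, and the target is the value at the observed next state. The following is a minimal sketch under stated assumptions, not the authors' implementation. The random base kernels, the random stand-in value estimates, the uniformly sampled state-action pairs, the ridge parameter `lam`, the radius `beta`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def vtr_features(base_models, V, s, a):
    """Feature x_j = sum_{s'} P_j(s' | s, a) * V(s') for each base model j."""
    return base_models[:, s, a, :] @ V


def vtr_ridge_update(M, w, x, y):
    """Rank-one update of the regularized Gram matrix M and target vector w."""
    return M + np.outer(x, x), w + y * x


def optimistic_next_value(theta_hat, M, x, beta):
    """Largest <theta, x> over the ellipsoid ||theta - theta_hat||_M <= sqrt(beta).

    Assumption: an ellipsoidal confidence set, the usual form for ridge regression.
    """
    bonus = np.sqrt(beta * (x @ np.linalg.solve(M, x)))
    return theta_hat @ x + bonus


def simulate(num_samples=20000, d=3, S=5, A=2, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical base kernels P_j(s' | s, a), shape (d, S, A, S).
    base_models = rng.dirichlet(np.ones(S), size=(d, S, A))
    theta_star = rng.dirichlet(np.ones(d))  # true mixture weights
    P_true = np.tensordot(theta_star, base_models, axes=1)  # shape (S, A, S)

    M, w = lam * np.eye(d), np.zeros(d)
    for _ in range(num_samples):
        # Stand-in for the current value estimate V_{h+1,k}; in the algorithm
        # this would come from optimistic planning, not from random draws.
        V = rng.uniform(0.0, 1.0, size=S)
        s, a = rng.integers(S), rng.integers(A)
        s_next = rng.choice(S, p=P_true[s, a])
        x = vtr_features(base_models, V, s, a)
        y = V[s_next]  # value target observed along the transition
        M, w = vtr_ridge_update(M, w, x, y)

    theta_hat = np.linalg.solve(M, w)  # ridge estimate of the mixture weights
    return theta_star, theta_hat, M


if __name__ == "__main__":
    theta_star, theta_hat, M = simulate()
    print("true weights     :", np.round(theta_star, 3))
    print("estimated weights:", np.round(theta_hat, 3))
```

The regression never tries to match full next-state distributions; it only has to predict scalar values, which is why the estimate depends on the dimension d of θ rather than on the number of states.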

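Experimental Results

Research Questions

RQ1: Can value-targeted regression achieve sublinear regret for a general model class P?
RQ2: How does the regret bound depend on the complexity of P (e.g. its Eluder dimension) and on noise or irregularity in the value targets?
RQ3: What are the advantages and limitations of optimistic planning with value-targeted confidence sets compared with traditional model-based approaches?
RQ4: How does the regret scale when the method is specialized to linear mixture models?
RQ5: How does the method compare experimentally with other model-based RL methods and with variants of value-targeted regression?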

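Key Results

For linear mixture models, the algorithm achieves a regret bound of Õ(d √(H^3 T)).
In the general model class setting, the regret is upper-bounded in terms of the Eluder dimension of the function class defined by the value targets.
The upper bound does not depend on the size of the state or action space, and in the linear case it is close to the lower bound Ω(√(HdT)).
Value-targeted regression focuses model learning on the task-relevant dynamics, which can make it more efficient than likelihood-based regression.
Experiments show that value-targeted regression combined with optimistic planning is effective, and that removing either the optimism or the value-targeted regression degrades performance.
The work is connected to MuZero, which independently uses value-targeted regression to build its model.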
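To make "close to the lower bound" concrete, the ratio of the two rates quoted above can be worked out directly. This is a back-of-the-envelope calculation that ignores constants and the logarithmic factors hidden in the Õ notation:

```latex
\frac{d\sqrt{H^{3}T}}{\sqrt{HdT}}
  = \frac{d\,H^{3/2}\,T^{1/2}}{d^{1/2}\,H^{1/2}\,T^{1/2}}
  = H\sqrt{d}
```

Under these simplifications the upper and lower bounds agree in their dependence on T and differ by a factor of H√d.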

This review was generated by AI and reviewed by a human editor.