QUICK REVIEW

[Paper Review] Finite-Sample Analysis for SARSA with Linear Function Approximation

Shaofeng Zou, Tengyu Xu | arXiv (Cornell University) | 2019-02-06

Reinforcement Learning in Robotics | Citations: 65

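One-Line Summary

This paper presents the first non-asymptotic finite-sample analysis of on-policy SARSA with linear function approximation under non-i.i.d. data and a time-varying behavior policy, along with a fitted SARSA variant that carries finite-sample guarantees.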

ABSTRACT

SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under non-i.i.d. data, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically (Perkins and Precup, 2003; Melo et al., 2008). However, its non-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d. samples and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithms, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we provide the finite-sample analysis on the mean square error of the SARSA algorithm. We then further study a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant in Perkins and Precup (2003) as special cases. This fitted SARSA algorithm provides a more general framework for iterative on-policy fitted policy iteration, which is more memory and computationally efficient. For this fitted SARSA algorithm, we also provide its finite-sample analysis.

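Research Motivation and Goals

Understand how quickly SARSA with linear function approximation converges from non-i.i.d. samples under a time-varying policy.
Develop a technique for characterizing the bias of stochastic approximation with time-varying Markov kernels.
Derive finite-sample mean-square-error bounds for SARSA and a generalized fitted SARSA algorithm.
Show that the fitted SARSA scheme can be more memory- and computation-efficient while preserving convergence properties.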

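Proposed Method

Introduce a new bias characterization technique for stochastic approximation with time-varying Markov transition kernels.
Model SARSA with linear function approximation and a Lipschitz continuous policy improvement operator.
Provide a finite-sample analysis using a gradient-descent-style framework and bias bounds.
Extend to a general on-policy fitted SARSA algorithm with TD(0)-based fitting steps between policy improvements.
Derive explicit finite-sample bounds for both diminishing and constant step sizes.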
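
The update being analyzed can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The softmax policy stands in for a generic Lipschitz policy improvement operator. The projection radius, the 1/(t+1) step size, the random toy MDP, and every function and parameter name (`softmax_policy`, `project`, `sarsa_linear`, `fit_steps`) are hypothetical choices made for this example. Setting `fit_steps=1` gives the usual SARSA update; larger values freeze the behavior policy for several TD(0) steps, which is one simple reading of the fitted variant.

```python
import numpy as np

def softmax_policy(theta, phi_s, temperature=1.0):
    """Action probabilities from linear Q-values; smooth (Lipschitz) in theta."""
    q = phi_s @ theta                      # phi_s has shape (num_actions, dim)
    z = (q - q.max()) / temperature        # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def project(theta, radius):
    """Euclidean projection onto a ball (assumed here to keep iterates bounded)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def sarsa_linear(P, R, phi, gamma=0.9, T=2000, radius=10.0, fit_steps=1, seed=0):
    """Projected SARSA(0) along a single trajectory of a small tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    phi: (S, A, d) features. All inputs are illustrative.
    """
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    theta = np.zeros(d)
    policy_theta = theta.copy()            # parameters defining the behavior policy
    s = rng.integers(S)
    a = rng.choice(A, p=softmax_policy(policy_theta, phi[s]))
    for t in range(T):
        s_next = rng.choice(S, p=P[s, a])
        # The behavior policy depends on learned parameters, so it changes over time.
        a_next = rng.choice(A, p=softmax_policy(policy_theta, phi[s_next]))
        td_error = R[s, a] + gamma * phi[s_next, a_next] @ theta - phi[s, a] @ theta
        alpha = 1.0 / (t + 1)              # diminishing step size (assumed schedule)
        theta = project(theta + alpha * td_error * phi[s, a], radius)
        if (t + 1) % fit_steps == 0:
            policy_theta = theta.copy()    # policy improvement step
        s, a = s_next, a_next
    return theta

# Toy usage on a random MDP with unit-norm features.
rng = np.random.default_rng(1)
S, A, d = 4, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))     # shape (S, A, S)
R = rng.uniform(0.0, 1.0, size=(S, A))
phi = rng.normal(size=(S, A, d))
phi /= np.linalg.norm(phi, axis=-1, keepdims=True)
theta_sarsa = sarsa_linear(P, R, phi, fit_steps=1)
theta_fitted = sarsa_linear(P, R, phi, fit_steps=10)
```

The `temperature` argument controls how sharply the policy reacts to changes in `theta`. A larger temperature gives a smoother policy, which is the kind of smoothness condition the analysis relies on.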

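Results

Research Questions

RQ1: Can non-asymptotic convergence guarantees be obtained for on-policy SARSA with linear function approximation under non-i.i.d. samples and a time-varying policy?
RQ2: How does the stochastic bias arising from time-varying Markov kernels affect convergence, and what is the resulting rate?
RQ3: What finite-sample error bounds can be established for SARSA and the generalized fitted SARSA algorithm?
RQ4: Does the fitted SARSA framework offer the same or improved sample complexity along with potential computational advantages?
RQ5: Does the Lipschitz condition on policy improvement ensure convergence and a manageable bias?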

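Key Results

SARSA with linear function approximation attains finite-sample mean-square-error bounds under both diminishing and constant step sizes, and converges to the limit point theta* at a quantified rate.
With a diminishing step size, the error scales as O(log^3 T / T) for large T, which implies a sample complexity of O((1/delta) log^3(1/delta)) to reach error delta.
With a constant step size, the iterates converge to a small neighborhood of theta* when the step size is sufficiently small and T is sufficiently large.
The analysis of the general on-policy fitted SARSA algorithm yields the same overall O((1/delta) log^3(1/delta)) sample complexity as SARSA, with possible computational gains from running TD iterations between policy improvements.
Terminating the fitting step before it fully converges does not harm overall convergence or sample complexity.
The bias characterization for the time-varying Markov process is developed through an auxiliary uniformly ergodic Markov chain, which makes the non-asymptotic analysis possible.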
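
The step from the O(log^3 T / T) rate to the stated sample complexity can be checked with a short calculation. The sketch below is not taken from the paper. C, K, and c are hypothetical constants introduced for illustration, and delta is assumed small enough that log(1/delta) dominates the lower-order terms.

```latex
\begin{align*}
&\text{Assume } \mathbb{E}\,\|\theta_T - \theta^*\|_2^2 \le C\,\frac{\log^3 T}{T}
  \text{ and choose } T = \frac{K}{\delta}\,\log^3\frac{1}{\delta}. \\
&\log T = \log K + \log\frac{1}{\delta} + 3\log\log\frac{1}{\delta}
  \le c\,\log\frac{1}{\delta} \quad \text{for sufficiently small } \delta, \\
&C\,\frac{\log^3 T}{T}
  \le C\,c^3 \log^3\frac{1}{\delta} \cdot \frac{\delta}{K \log^3\frac{1}{\delta}}
  = \frac{C c^3}{K}\,\delta \le \delta \quad \text{whenever } K \ge C c^3 .
\end{align*}
```

So a trajectory of length on the order of (1/delta) log^3(1/delta) is enough to push the bound below delta, which matches the sample complexity quoted above.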


This review was generated by AI and reviewed by a human editor.