QUICK REVIEW

[Paper Review] Successor Features for Transfer in Reinforcement Learning

André Sales Barreto, Will Dabney | arXiv (Cornell University) | 2016-06-16

Reinforcement Learning in Robotics | References: 17 | Citations: 177

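One-Line Summary

This paper presents successor features (SFs), which decouple the environment's dynamics from the rewards, and a generalized policy improvement (GPI) framework that enables transfer across tasks that share dynamics but differ in reward, with both theoretical guarantees and empirical validation.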

ABSTRACT

Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement", a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that set our approach in firm theoretical ground and present experiments that show that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.

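Motivation and Objectives

Motivates and formalizes transfer for the setting where the reward function changes between tasks while the dynamics stay fixed.
Introduces successor features, which decouple the dynamics from the rewards, to support scalable transfer.
Develops generalized policy improvement to combine multiple policies when the task changes.
Provides theoretical guarantees for the transferred policy before any further learning takes place.
Demonstrates transfer in practice through experiments on navigation tasks and a simulated robotic arm.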

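Proposed Method

Expresses the one-step reward as r(s,a,s') = phi(s,a,s')^T w and defines the successor features psi^pi(s,a) = E^pi[ sum_{i=t}^{inf} gamma^{i-t} phi_{i+1} | S_t = s, A_t = a ], where phi_{i+1} = phi(S_i, A_i, S_{i+1}).
Writes the action-value function as Q^pi(s,a) = psi^pi(s,a)^T w, so that the dynamics (captured by psi^pi) are separated from the reward (captured by w).
Extends Bellman policy improvement to Generalized Policy Improvement (GPI), which acts greedily with respect to max_i Q~^{pi_i}(s,a) over a set of policies (Q~ denotes an approximate action-value function), and proves a performance bound for the resulting policy.
With phi fixed, tasks differ only in w; transfer is framed over the set of MDPs M^phi = {M(phi, w) | w in R^d}.
Computes and stores the successor features of previously learned policies; on a new task with weights w_{n+1}, obtains Q-values as psi^{pi_i}(s,a)^T w_{n+1} and applies GPI.
Two theorems: (1) a GPI guarantee that holds with approximate value functions, and (2) a bound relating performance on a new task to its similarity to previous tasks in w space.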
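
The steps above can be sketched on a small tabular problem. This is a minimal illustration under stated assumptions, not the authors' SFQL or SFDQN: the MDP is random, the function names (`successor_features`, `optimal_policy`, `gpi_policy`) are hypothetical, and the successor features are computed exactly with a linear solve instead of being learned from samples.

```python
import numpy as np


def successor_features(P, phi, policy, gamma):
    """Exact SFs psi^pi(s, a) of a deterministic policy, shape (S, A, d)."""
    n_states = P.shape[0]
    # Expected one-step features: phi_bar[s, a] = sum_s' P[s, a, s'] * phi[s, a, s'].
    phi_bar = np.einsum("sat,satd->sad", P, phi)
    idx = np.arange(n_states)
    P_pi = P[idx, policy]          # (S, S) transitions under the policy
    phi_pi = phi_bar[idx, policy]  # (S, d) expected features under the policy
    # Solve psi_pi = phi_pi + gamma * P_pi @ psi_pi for psi_pi[s] = psi^pi(s, pi(s)).
    psi_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, phi_pi)
    # One-step backup gives psi^pi for every (s, a) pair.
    return phi_bar + gamma * np.einsum("sat,td->sad", P, psi_pi)


def optimal_policy(P, phi, w, gamma, sweeps=500):
    """Greedy policy from value iteration on the task with reward phi^T w."""
    r = np.einsum("sat,satd,d->sa", P, phi, w)  # expected reward per (s, a)
    q = np.zeros_like(r)
    for _ in range(sweeps):
        q = r + gamma * np.einsum("sat,t->sa", P, q.max(axis=1))
    return q.argmax(axis=1)


def gpi_policy(psis, w):
    """GPI: act greedily with respect to max_i psi^{pi_i}(s, a)^T w."""
    q = np.stack([psi @ w for psi in psis])  # (n_policies, S, A)
    return q.max(axis=0).argmax(axis=1)


# Toy setup (all sizes and the random MDP are assumptions for illustration).
rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 4, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into transition probabilities
phi = rng.random((S, A, S, d))     # features phi(s, a, s')

# Source tasks: solve each one and store the SFs of its policy.
source_ws = [rng.normal(size=d) for _ in range(3)]
policies = [optimal_policy(P, phi, w, gamma) for w in source_ws]
psis = [successor_features(P, phi, pi, gamma) for pi in policies]

# New task: only w changes, so Q-values of the stored policies are one dot product.
w_new = rng.normal(size=d)
q_src = np.stack([psi @ w_new for psi in psis])
pi_gpi = gpi_policy(psis, w_new)
q_gpi = successor_features(P, phi, pi_gpi, gamma) @ w_new
```

Because the values here are exact, the GPI policy should be at least as good as every stored policy on the new task at every state-action pair, which can be checked by comparing `q_gpi` with `q_src.max(axis=0)`. With learned, approximate SFs the comparison holds only up to an error term.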

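Experimental Results

Research Questions

RQ1: Can successor features enable effective transfer when the reward changes but the dynamics stay fixed?
RQ2: Can generalized policy improvement with SFs provide performance guarantees on a new task before any learning?
RQ3: How does task similarity in the phi weight space (w) translate into transfer performance, and what practical guidance does this give for building a skill library?
RQ4: What empirical advantages do SFs and GPI offer over baselines on navigation and robotic control tasks?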

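Key Results

SFs provide a value-function representation that decouples the dynamics from the rewards, which facilitates transfer.
Combined, GPI and SFs provide performance guarantees and use a set of policies to improve performance on a new task.
In the experiments, SFQL and SFDQN outperform the baselines by a substantial margin on the navigation tasks and the reacher domain.
Using learned SFs (SFQL-h) achieves fast and robust transfer even when phi is not known exactly.
The reacher experiments show that learning on the training tasks improves performance on test tasks that were never explicitly trained on.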
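
The decoupling in the first result follows in a few lines from the linear reward assumption. A short derivation, assuming r(s,a,s') = phi(s,a,s')^T w holds exactly and writing r_{i+1} and phi_{i+1} for the reward and features of the transition (S_i, A_i, S_{i+1}):

```latex
\begin{align*}
Q^{\pi}(s,a)
  &= \mathbb{E}^{\pi}\!\left[\sum_{i=t}^{\infty} \gamma^{\,i-t}\, r_{i+1} \,\middle|\, S_t = s,\ A_t = a\right] \\
  &= \mathbb{E}^{\pi}\!\left[\sum_{i=t}^{\infty} \gamma^{\,i-t}\, \phi_{i+1}^{\top} w \,\middle|\, S_t = s,\ A_t = a\right] \\
  &= \mathbb{E}^{\pi}\!\left[\sum_{i=t}^{\infty} \gamma^{\,i-t}\, \phi_{i+1} \,\middle|\, S_t = s,\ A_t = a\right]^{\top} w
   = \psi^{\pi}(s,a)^{\top} w .
\end{align*}
```

Since w is constant, it moves outside the expectation. Only psi^pi depends on the dynamics and the policy, so a task with a different w can reuse the stored psi^pi unchanged.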

This review was generated by AI and checked by a human editor.