QUICK REVIEW

[Paper Review] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Benjamin Eysenbach, Ruslan Salakhutdinov|arXiv (Cornell University)|2019. 06. 12.

Reinforcement Learning in Robotics | Citations: 70

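One-Line Summary

SoRB combines planning via graph search with goal-conditioned reinforcement learning: it builds a distance graph over the states in the replay buffer and plans a shortest path toward distant goals.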

ABSTRACT

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only need to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.

Research Motivation and Objectives

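Tackle long-horizon, sparse-reward tasks with high-dimensional observations by decomposing the goal into subgoals that planning discovers automatically from past observations.
Use a goal-conditioned RL policy to reach each subgoal, and plan by treating the replay buffer as a nonparametric graph over states.
Obtain robust distance estimates to guide graph search by using distributional RL and ensembles.
Demonstrate improved performance over standard RL on long-horizon navigation tasks and show generalization to unseen environments.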
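
The robust distance estimate in the third objective can be illustrated with a minimal sketch. It assumes each ensemble member outputs a categorical distribution over the number of steps to the goal, with the last bin absorbing all larger distances, and that the ensemble is aggregated pessimistically by taking the largest expected distance. The function names `expected_distance` and `ensemble_distance` and the toy probabilities are hypothetical, not taken from the paper's code.

```python
from typing import Sequence


def expected_distance(bin_probs: Sequence[float]) -> float:
    """Expected number of steps under a categorical distribution.

    Assumption: bin i holds the probability that the goal is i steps away,
    and the last bin absorbs every distance at or beyond len(bin_probs) - 1.
    """
    total = sum(bin_probs)
    if total <= 0:
        raise ValueError("bin probabilities must sum to a positive value")
    # Normalize defensively in case the inputs do not sum exactly to 1.
    return sum(i * p for i, p in enumerate(bin_probs)) / total


def ensemble_distance(ensemble_bin_probs: Sequence[Sequence[float]]) -> float:
    """Pessimistic aggregate (an assumption): largest expected distance."""
    return max(expected_distance(p) for p in ensemble_bin_probs)


# Toy predictions from two hypothetical ensemble members.
member_a = [0.0, 0.5, 0.5, 0.0]
member_b = [0.0, 0.0, 0.5, 0.5]
print(ensemble_distance([member_a, member_b]))
```

Taking the maximum makes the planner distrust any edge that at least one ensemble member considers long, which is one way to guard against spuriously short predicted distances.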

Proposed Method

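Train a goal-conditioned policy and its Q/value function with an off-policy RL algorithm that uses goal relabeling and distributional RL.
Define the shortest-path distance d_sp(s, s_g), and relate V(s, s_g) and Q(s, a, s_g) to the negative shortest-path distance.
Build a graph over replay buffer observations, using predicted distances as edge weights with MaxDist as an upper bound.
Use Dijkstra's algorithm to find the shortest path between the start and the goal on the replay buffer graph.
At execution time, plan a sequence of waypoints along this path and condition the policy on the next waypoint, or on the goal itself when it is close enough.
Improve the distance estimates with distributional RL (bins over the number of steps to the goal) and with ensembles that capture uncertainty.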
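
The planning steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: `dist_fn` stands in for the learned distance estimate, edges are kept only when the predicted distance is below `max_dist` (one reading of the MaxDist bound), and the names `build_graph`, `dijkstra`, `next_target`, and `manhattan` as well as the toy 2-D states are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, float]
Graph = Dict[int, List[Tuple[int, float]]]


def build_graph(nodes: Sequence[State],
                dist_fn: Callable[[State, State], float],
                max_dist: float) -> Graph:
    """Directed graph: keep edge i -> j only if predicted distance < max_dist."""
    graph: Graph = {i: [] for i in range(len(nodes))}
    for i, s in enumerate(nodes):
        for j, t in enumerate(nodes):
            if i != j:
                d = dist_fn(s, t)
                if d < max_dist:
                    graph[i].append((j, d))
    return graph


def dijkstra(graph: Graph, source: int, target: int) -> List[int]:
    """Return node indices on a shortest path, or [] if target is unreachable."""
    best = {source: 0.0}
    parent: Dict[int, int] = {}
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == target:
            break
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u]:
            new_cost = cost + w
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                parent[v] = u
                heapq.heappush(heap, (new_cost, v))
    if target not in best:
        return []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]


def next_target(state: State, goal: State, buffer: Sequence[State],
                dist_fn: Callable[[State, State], float],
                max_dist: float) -> State:
    """Pick what the goal-conditioned policy should be conditioned on."""
    if dist_fn(state, goal) < max_dist:
        return goal  # goal is close: skip planning
    nodes = [state] + list(buffer) + [goal]
    graph = build_graph(nodes, dist_fn, max_dist)
    path = dijkstra(graph, 0, len(nodes) - 1)
    # path[0] is the current state; path[1] is the first waypoint.
    return nodes[path[1]] if len(path) > 1 else goal


def manhattan(a: State, b: State) -> float:
    """Toy stand-in for the learned distance estimate."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


buffer = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(next_target((0.0, 0.0), (4.0, 0.0), buffer, manhattan, max_dist=1.5))
```

In this toy setup the goal is four steps away, so the planner returns the nearest buffer state on the path as the first waypoint. A real agent would call something like `next_target` at each step and pass the result to the goal-conditioned policy.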

Experimental Results

Research Questions

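RQ1: In high-dimensional environments, can planning by graph search over the replay buffer find a sequence of subgoals that reaches distant goals?
RQ2: Do distance estimates learned with distributional RL and ensembles provide reliable planning guidance for SoRB?
RQ3: Compared with standard goal-conditioned RL, does SoRB raise success rates on long-horizon, sparse-reward tasks and generalize to unseen environments?
RQ4: How does SoRB compare with Semi-Parametric Topological Memory (SPTM) and other baselines on image-based navigation tasks?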

Key Results

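SoRB makes it possible to solve long-horizon, sparse-reward tasks that span more than 100 steps, and it generalizes better than standard RL methods.
Graph search over the replay buffer, guided by distances derived from the goal-conditioned value function, yields effective waypoint sequences in image-based domains.
Distributional RL and ensembles substantially improve the robustness of distance estimation and planning, especially for distant goals.
In visual navigation, SoRB markedly outperforms baselines including SPTM, C51, VIN, and HER, and the gap widens as the goal distance increases.
SoRB generalizes to new houses in SUNCG, maintaining its goal success rate as distance grows, whereas pure goal-conditioned RL struggles.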

This review was generated by AI and reviewed by a human editor.