QUICK REVIEW

[Paper Review] COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning

Avi Singh, Albert S. Yu | arXiv (Cornell University) | 2020-10-27

Robot Manipulation and Learning | References: 42 | Citations: 38

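One-Line Summary

COG uses offline reinforcement learning to combine task-specific data with a large, unlabeled prior dataset, so that the policy can compose learned behaviors to solve new multi-stage tasks from previously unseen initial conditions.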

ABSTRACT

Reinforcement learning has been applied to a wide variety of robotics problems, but most of such applications involve collecting data from scratch for each new task. Since the amount of robot data we can collect for any single task is limited by time and cost considerations, the learned behavior is typically narrow: the policy can only execute the task in a handful of scenarios that it was trained on. What if there was a way to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behaviors? While most prior work on extending robotic skills using pre-collected data focuses on building explicit hierarchies or skill decompositions, we show in this paper that we can reuse prior data to extend new skills simply through dynamic programming. We show that even when the prior data does not actually succeed at solving the new task, it can still be utilized for learning a better policy, by providing the agent with a broader understanding of the mechanics of its environment. We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task, with our hardest experimental setting involving composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real world domains. Additional materials and source code can be found on our project website: https://sites.google.com/view/cog-rl

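Motivation and Objectives

Motivates the use of task-agnostic prior data in robotics as a way to broaden policy generalization.
Proposes a simple, data-driven method that connects (stitches) behaviors through offline reinforcement learning, without specifying an explicit hierarchy.
Shows that prior data can help in learning new multi-stage tasks from unseen initial conditions.
Demonstrates end-to-end learning from visual observations to low-level control using offline data and sparse rewards.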

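Proposed Method

Extends Conservative Q-learning (CQL) in the offline RL setting to train on both prior data and task-specific data.
Initializes the replay buffer with prior data labeled with zero reward, and trains on a mixture of prior and task data.
Uses Q-learning dynamics to propagate value information from rewarded task trajectories into the regions covered by the prior data.
Optionally fine-tunes the offline policy with a limited amount of online interaction after offline training.
Trains end-to-end networks (ConvNets) that map 48x48 or 64x64 images and robot state to continuous 6-DoF actions and discrete gripper control.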
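The value-propagation idea can be illustrated with a toy tabular sketch. This is not the authors' implementation: it assumes a hypothetical five-state chain (states 0 to 4, goal at 4), uses plain Q-iteration over a fixed dataset, and leaves out the conservative regularizer of CQL as well as the image-based networks. All names (`label_prior`, `fitted_q`, `greedy_rollout`) are made up for illustration. Prior transitions cover states 0 to 2 and receive zero reward; task transitions cover states 2 to 4 and carry the sparse +1 reward.

```python
from collections import defaultdict

GAMMA = 0.9
STAY, ADVANCE = 0, 1  # hypothetical action set for the toy chain


def label_prior(transitions):
    """Attach reward 0 and done=False to unlabeled (s, a, s_next) tuples."""
    return [(s, a, 0.0, s_next, False) for (s, a, s_next) in transitions]


def fitted_q(dataset, sweeps=50):
    """Tabular Q-iteration over a fixed offline dataset (no CQL penalty)."""
    q = defaultdict(float)  # (state, action) pairs never seen stay at 0
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            bootstrap = 0.0 if done else max(q[(s_next, STAY)], q[(s_next, ADVANCE)])
            q[(s, a)] = r + GAMMA * bootstrap
    return q


def greedy_rollout(q, start, goal=4, max_steps=10):
    """Follow the greedy policy under the assumed chain dynamics."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            return True
        if q[(s, ADVANCE)] > q[(s, STAY)]:
            s += 1  # ADVANCE moves one state to the right
        # otherwise STAY: the state does not change
    return s == goal


# Prior data: undirected interaction around states 0..2, no reward labels.
prior_data = label_prior([(0, ADVANCE, 1), (1, ADVANCE, 2), (0, STAY, 0), (1, STAY, 1)])
# Task data: starts at state 2, sparse +1 reward only on reaching the goal.
task_data = [(2, ADVANCE, 0.0, 3, False), (3, ADVANCE, 1.0, 4, True)]

q_task = fitted_q(task_data)              # task data only
q_cog = fitted_q(prior_data + task_data)  # prior + task data in one buffer

print(greedy_rollout(q_task, start=0))  # False
print(greedy_rollout(q_cog, start=0))   # True
```

With task data alone, the values at states 0 and 1 are never updated, so the greedy policy stalls there. With the merged buffer, the reward at the goal is backed up through the zero-reward prior transitions (the value of advancing from state 0 approaches 0.9 cubed), which is the stitching effect described above in miniature.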

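Experimental Results

Research Questions

RQ1: Can model-free (offline) reinforcement learning use task-agnostic prior datasets to learn new skills?
RQ2: Can the policy stitch together behaviors seen in prior data to solve a new task from new initial conditions?
RQ3: How does offline RL with prior data compare with behavioral cloning (BC) baselines that also use the prior data?
RQ4: Is online fine-tuning after offline training necessary or beneficial?
RQ5: How well does the approach generalize beyond simulation to a real robot setup?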

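Key Results

COG solves multi-stage tasks by combining drawer opening, grasping, and obstacle removal, even though the complete sequence never appears in the data.
In simulation, COG outperforms the BC baselines, SAC, and ablations on new initial conditions.
Online fine-tuning raises the success rate on the drawer task above 90%, even with relatively little additional data.
In real-robot experiments, COG succeeded in 7 of 8 trials when the drawer started closed, outperforming the BC-oracle baseline.
BC-init fails on unseen initial conditions, underscoring the value of incorporating prior data.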

This review was generated by AI and reviewed by a human editor.