QUICK REVIEW

[Paper Review] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto | arXiv (Cornell University) | October 16, 2023

Domain Adaptation and Few-Shot Learning | Citations: 10

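One-Line Summary

SuSIE uses a pretrained image-editing diffusion model to generate future subgoals and a low-level goal-conditioned policy to reach them, enabling zero-shot language-conditioned robotic manipulation with strong generalization.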

ABSTRACT

If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie .

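Research Motivation and Goals

Enable general-purpose robots to operate on novel objects and in novel scenarios not seen during training.
Leverage a pretrained image-editing diffusion model to provide high-level subgoal plans from language instructions.
Train a low-level goal-conditioned policy on robot data to reach those subgoals, enabling strong zero-shot transfer.
Demonstrate improved generalization and precision on real-world manipulation tasks and the CALVIN benchmark.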

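Proposed Method

Fine-tune InstructPix2Pix on language-labeled video data so that it outputs a hypothetical future subgoal observation given the current observation and a language command.
Train a low-level goal-conditioned policy with behavioral cloning to reach a generated subgoal within k_max steps.
At test time, alternate between generating a subgoal and executing a short rollout of the low-level policy (k_test steps per subgoal).
Use classifier-free guidance to condition the diffusion model on both the language input and the image input during subgoal generation.
Use a diffusion-based policy that predicts action chunks, with temporal averaging for robustness.
Decouple high-level subgoal synthesis from low-level control, relying on zero-shot planning without task-specific data.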
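The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name here (dual_cfg, sample_goal_index, temporal_average, run_episode, and the toy_* stand-ins) is hypothetical, the guidance rule follows the two-scale form published with InstructPix2Pix, and the uniform weighting in temporal_average is an assumption because the review does not say how overlapping action chunks are weighted.

```python
import random
from typing import Callable, List, Sequence, Tuple


def dual_cfg(eps_uncond: Sequence[float], eps_img: Sequence[float],
             eps_full: Sequence[float], s_img: float, s_txt: float) -> List[float]:
    """Two-scale classifier-free guidance (InstructPix2Pix-style form).

    eps_uncond: noise estimate with no conditioning
    eps_img:    noise estimate conditioned on the image only
    eps_full:   noise estimate conditioned on image and text
    """
    return [u + s_img * (i - u) + s_txt * (f - i)
            for u, i, f in zip(eps_uncond, eps_img, eps_full)]


def sample_goal_index(t: int, traj_len: int, k_max: int, rng: random.Random) -> int:
    """Pick a goal frame 1..k_max steps after t (assumes t < traj_len - 1)."""
    return rng.randint(t + 1, min(t + k_max, traj_len - 1))


def temporal_average(chunks: Sequence[Tuple[int, Sequence[float]]], t: int) -> float:
    """Average every action predicted for time t by overlapping chunks.

    chunks holds (start_time, actions) pairs; uniform weights are an assumption.
    """
    preds = [acts[t - start] for start, acts in chunks if 0 <= t - start < len(acts)]
    if not preds:
        raise ValueError("no chunk covers time step %d" % t)
    return sum(preds) / len(preds)


def run_episode(obs, instruction: str,
                propose_subgoal: Callable, policy: Callable, step: Callable,
                k_test: int, max_steps: int):
    """Alternate subgoal generation with k_test low-level steps per subgoal."""
    if k_test <= 0:
        raise ValueError("k_test must be positive")
    t, n_subgoals = 0, 0
    while t < max_steps:
        subgoal = propose_subgoal(obs, instruction)  # high-level planner
        n_subgoals += 1
        for _ in range(min(k_test, max_steps - t)):
            action = policy(obs, subgoal)            # goal-conditioned policy
            obs = step(obs, action)
            t += 1
    return obs, n_subgoals


# Toy 1-D stand-ins (hypothetical) so the loop runs without a robot or a model.
def toy_propose_subgoal(obs: float, instruction: str) -> float:
    return obs + 1.0


def toy_policy(obs: float, subgoal: float) -> float:
    return max(-0.25, min(0.25, subgoal - obs))


def toy_step(obs: float, action: float) -> float:
    return obs + action


final_obs, n_subgoals = run_episode(0.0, "move right", toy_propose_subgoal,
                                    toy_policy, toy_step, k_test=4, max_steps=20)
print(final_obs, n_subgoals)
```

In a real system, propose_subgoal would wrap the fine-tuned image-editing diffusion model (applying something like dual_cfg at each denoising step) and policy would wrap the learned goal-conditioned controller. The toy functions only show how the two levels hand off every k_test steps.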

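Experimental Results

Research Questions

RQ1: Can SuSIE solve tasks zero-shot in novel environments with unseen objects and language commands?
RQ2: Does subgoal-guided planning improve precision and manipulation ability compared with language-conditioned policies that do not use subgoals?
RQ3: How essential are Internet-scale pretraining and video co-training for zero-shot generalization?
RQ4: How does SuSIE perform on real-world manipulation tasks compared with strong baselines?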

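Key Results

SuSIE achieves state-of-the-art zero-shot performance on CALVIN (trained on environments A–C, tested on D).
Across real-world scenes, it outperforms baselines including RT-2-X, UniPi, and LCBC, especially in scenes with novel objects and distractors.
Subgoal guidance improves low-level manipulation precision, increasing the likelihood of success on challenging tasks such as grasping a bell pepper.
Internet pretraining and co-training on video data substantially improve subgoal quality and zero-shot generalization.
Co-training the subgoal model with Something-Something data improves performance in the unseen scenes (B, C).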


This review was generated by AI and reviewed by a human editor.