QUICK REVIEW

[Paper Review] VIMA: General Robot Manipulation with Multimodal Prompts

Yunfan Jiang, Agrim Gupta | arXiv (Cornell University) | 2022-10-06

Multimodal Machine Learning Applications | Citations: 65

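One-Line Summary

VIMA introduces multimodal prompts to unify a wide range of robot manipulation tasks, presents the VIMA-Bench benchmark, and trains a transformer-based agent with object-centric representations that generalizes strongly in zero-shot settings.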

ABSTRACT

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/

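Motivation and Goals

Formulate a broad spectrum of robot manipulation tasks as prompts that interleave text and images.
Build VIMA-Bench, a large-scale, procedurally generated benchmark for evaluating scalability and generalization.
Develop VIMA, a transformer-based robot agent that processes multimodal prompts and outputs motor actions autoregressively.
Demonstrate scalability and data efficiency across model sizes and training-data sizes.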

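Proposed Method

Define a multimodal prompt as an interleaved sequence of text and image tokens.
Use an object-centric visual tokenizer (Mask R-CNN) to convert each image into a sequence of object tokens.
Use an encoder-decoder transformer whose decoder is conditioned on the prompt through cross-attention and outputs motor actions autoregressively.
Train with offline behavior cloning, maximizing the likelihood of expert actions given the prompt and the interaction history.
Evaluate with the four-level VIMA-Bench protocol, which tests progressively harder zero-shot generalization.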
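
To make the idea of an interleaved multimodal prompt concrete, here is a minimal Python sketch. It is not the VIMA implementation: the `TextToken` and `ObjectToken` classes, the `build_prompt` helper, the `{name}` placeholder syntax, and the example object names are all hypothetical and chosen only for illustration. The sketch assumes that an object detector has already turned each prompt image into a list of object tokens.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class TextToken:
    word: str


@dataclass(frozen=True)
class ObjectToken:
    # Hypothetical stand-in for one detected object: an identifier for the
    # cropped image region plus its bounding box (x, y, width, height).
    crop_id: str
    bbox: Tuple[int, int, int, int]


PromptToken = Union[TextToken, ObjectToken]


def build_prompt(template: str, images: Dict[str, List[ObjectToken]]) -> List[PromptToken]:
    """Interleave text tokens with object tokens.

    Each "{name}" placeholder in `template` is replaced by the object tokens
    registered under `name`; all other text is split into word tokens.
    """
    tokens: List[PromptToken] = []
    # The capturing group keeps the placeholders in the split result.
    for piece in re.split(r"(\{\w+\})", template):
        if not piece.strip():
            continue
        if piece.startswith("{") and piece.endswith("}"):
            tokens.extend(images[piece[1:-1]])
        else:
            tokens.extend(TextToken(w) for w in piece.split())
    return tokens


# Illustrative usage with made-up objects.
images = {
    "obj": [ObjectToken("red_block", (10, 20, 32, 32))],
    "container": [ObjectToken("green_bowl", (80, 40, 48, 48))],
}
prompt = build_prompt("Put the {obj} into the {container}", images)
print([type(t).__name__ for t in prompt])
```

In this toy version the result is a single flat sequence in which text and visual tokens keep their original order, which is the property the prompt definition above relies on. A real system would map both token types to embedding vectors before feeding them to the encoder.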

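Experimental Results

Research Questions

RQ1: Can a single model learn multiple manipulation tasks specified by multimodal prompts?
RQ2: How do model capacity and training-data size affect zero-shot generalization in multimodal robot learning?
RQ3: How do visual tokenization and prompt conditioning affect policy performance?
RQ4: How robust is the approach to distractors and corrupted prompts?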

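Key Results

VIMA outperforms the baseline designs at every zero-shot generalization level and every model size.
In the hardest setting, given the same training data, it achieves up to 2.9x higher task success rate.
With 10x less training data, VIMA still substantially outperforms the competing variants (2.7x better than the best competing variant).
Object-centric tokens outperform methods that operate on raw pixels or downsampled tokens.
Decoder cross-attention to the prompt is important for generalization, and its benefit is most visible in small models.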
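
Since cross-attention to the prompt comes up as a key design choice, a bare-bones sketch of the mechanism may help. This is a generic single-head scaled dot-product cross-attention in pure Python, not VIMA's actual layer: it omits the learned query, key, and value projections, multiple heads, and everything else a real transformer block contains. The function names and the toy vectors are assumptions for illustration. Queries stand for decoder-side tokens (for example, the interaction history), while keys and values stand for encoded prompt tokens.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Return one output per query: a softmax-weighted mix of `values`.

    Weights come from scaled dot products between each query and every key,
    so each decoder-side token can read from all prompt tokens.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy example: two prompt tokens, two decoder-side queries.
prompt_keys = [[1.0, 0.0], [0.0, 1.0]]
prompt_values = [[2.0, 0.0], [0.0, 4.0]]
history_queries = [[0.0, 0.0], [100.0, 0.0]]
print(cross_attention(history_queries, prompt_keys, prompt_values))
```

The first query is orthogonal to both keys, so it averages the two prompt values; the second aligns strongly with the first key and therefore reads almost exclusively from the first prompt token. The point is only that every decoding step can look directly at the whole prompt.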


This review was generated by AI and reviewed by a human editor.