QUICK REVIEW

[Paper Review] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang | arXiv (Cornell University) | 2024. 11. 29.

Robotics and Automated Systems | Citations: 5

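One-Line Summary

CogACT introduces a foundational Vision-Language-Action architecture with a specialized action module guided by VLM output, models action sequences with a diffusion action transformer, and achieves strong cross-robot generalization and higher task success rates.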

ABSTRACT

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rate of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).

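Research Motivation and Goals

Advance robotic manipulation by synergizing cognition and action in a Vision-Language-Action (VLA) model.
Go beyond simple action quantization by designing a specialized action module conditioned on VLM output.
Demonstrate scaling and adaptability across multiple robot embodiments and unseen objects and backgrounds.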
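For readers unfamiliar with the "simple action quantization" that the paper contrasts itself with, the following is a minimal sketch of uniform binning of continuous action values into discrete tokens. The action range of [-1, 1], the bin count of 256, and the function names are assumptions chosen for illustration; the review does not state the settings used by any particular baseline.

```python
def quantize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action value to an integer bin index (assumed uniform bins)."""
    clipped = min(max(value, low), high)
    width = (high - low) / bins
    index = int((clipped - low) / width)
    return min(index, bins - 1)  # a value equal to `high` falls into the last bin


def dequantize(index, low=-1.0, high=1.0, bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / bins
    return low + (index + 0.5) * width


# Made-up action values, used only to show the round trip.
action = [0.1234, -0.5, 1.0]
tokens = [quantize(a) for a in action]
recovered = [dequantize(t) for t in tokens]
errors = [abs(a - r) for a, r in zip(action, recovered)]
```

The round trip loses up to half a bin width per dimension, which is one way to see why a continuous action head might be attractive; whether that is the main source of the reported gains is not something this summary establishes.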

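Proposed Method

Introduce a componentized VLA architecture with a dedicated action module conditioned on VLM output.
Evaluate diffusion action transformers for action sequence modeling.
Conduct ablation studies to identify effective action module designs and assess scaling behavior.
Test on five robot embodiments in simulation and the real world.
Compare against the OpenVLA (7B) and RT-2-X (55B) baselines to measure gains in task success rate.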
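To make the idea of a diffusion-based action module conditioned on VLM output more concrete, here is a rough sketch of one training step for a generic noise-prediction diffusion objective. Everything specific in it is an assumption: the linear beta schedule, the noise-prediction target, the action chunk layout, and `toy_denoiser`, which is a hypothetical stand-in for the transformer. It is not the authors' implementation.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean_actions, noise, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(clean_actions, noise)]


def toy_denoiser(noisy_actions, step, condition):
    """Hypothetical placeholder for the action module.

    A real model would be a transformer that also uses `step`; here the
    condition only shifts the output so the data flow is visible.
    """
    bias = sum(condition) / len(condition)
    return [0.5 * x - bias for x in noisy_actions]


def noise_prediction_loss(clean_actions, condition, alpha_bars, rng):
    """Sample a step and noise, corrupt the actions, and score the noise estimate."""
    step = rng.randrange(len(alpha_bars))
    noise = [rng.gauss(0.0, 1.0) for _ in clean_actions]
    noisy = add_noise(clean_actions, noise, alpha_bars[step])
    predicted = toy_denoiser(noisy, step, condition)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(100)
action_chunk = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]  # made-up flattened actions
vlm_feature = [0.2, -0.4, 0.1]  # placeholder for a feature produced by the VLM
loss = noise_prediction_loss(action_chunk, vlm_feature, alpha_bars, rng)
```

The point of the sketch is the interface: the VLM contributes a conditioning feature, while the action module operates on continuous action sequences rather than on quantized tokens.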

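Experimental Results

Research Questions

RQ1: Can a specialized action module conditioned on Vision-Language-Model (VLM) output improve manipulation success rates over directly repurposing the VLM to predict quantized actions?
RQ2: Do diffusion action transformers provide better action sequence modeling and scaling in VLA models for robotic manipulation?
RQ3: How well does CogACT generalize to new robots, unseen objects, and varied backgrounds in simulated and real-world tests?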

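Key Results

CogACT significantly outperforms existing VLAs in task performance across five robot embodiments.
In simulation, CogACT exceeds the average success rate of the OpenVLA baseline (similar model size, 7B) by over 35%.
In real-robot experiments, CogACT exceeds OpenVLA's average success rate by over 55%.
In simulation, CogACT outperforms the larger RT-2-X model (55B) by 18 percentage points in absolute success rate.
The approach shows remarkable adaptation to new robots and generalization to unseen objects and backgrounds.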
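Because the RT-2-X comparison is stated explicitly as an absolute difference, it may help to see how an absolute gap in percentage points differs from a relative gain. The sketch below uses made-up placeholder success rates, not numbers from the paper, and the function names are hypothetical.

```python
def average_success(rates):
    """Mean success rate (in percent) over a set of evaluation settings."""
    return sum(rates) / len(rates)


def absolute_gap(ours, baseline):
    """Difference in percentage points between two average success rates."""
    return ours - baseline


def relative_gain(ours, baseline):
    """Improvement expressed as a percentage of the baseline's rate."""
    return (ours - baseline) / baseline * 100.0


# Placeholder values for illustration only; NOT results reported in the paper.
model_rates = [60.0, 70.0]
baseline_rates = [40.0, 50.0]

model_avg = average_success(model_rates)        # 65.0
baseline_avg = average_success(baseline_rates)  # 45.0
gap_points = absolute_gap(model_avg, baseline_avg)       # 20.0 percentage points
gain_percent = relative_gain(model_avg, baseline_avg)    # roughly 44.4 percent
```

With the same underlying averages, the two readings give noticeably different headline numbers, so it is worth checking which one a reported figure uses.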

This review was generated by AI and reviewed by a human editor.