QUICK REVIEW

[Paper Review] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang|arXiv (Cornell University)|Nov 29, 2024

Robotics and Automated Systems|Cited by 5

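One-line summary

CogACT introduces a foundational Vision-Language-Action architecture with a specialized action module guided by VLM output. It models action sequences with diffusion action transformers and achieves cross-robot generalization and higher task success rates.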

ABSTRACT

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a model size (7B) similar to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).

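Motivation and objectives

Advance robotic manipulation that integrates cognition and action through Vision-Language-Action (VLA) models.
Design a specialized action module conditioned on VLM output, going beyond simple action quantization.
Demonstrate scaling behavior and adaptation across multiple robot embodiments and to unseen objects and backgrounds.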
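
To make "simple action quantization" concrete, the sketch below discretizes each dimension of a continuous action into uniform bins and maps the bin index back to a value. The 256-bin count, the [-1, 1] range, the function names, and the 7-dimensional example action are illustrative assumptions and are not taken from the review. The point is only that a round trip through discrete tokens loses up to half a bin width of precision per dimension.

```python
def quantize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action value to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    scaled = (clipped - low) / (high - low)  # normalised to [0, 1]
    return min(int(scaled * bins), bins - 1)  # the top edge falls in the last bin


def dequantize(index, low=-1.0, high=1.0, bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / bins
    return low + (index + 0.5) * width


# Hypothetical 7-dimensional action (e.g. end-effector deltas plus gripper).
action = [0.1234, -0.5678, 0.9, 0.0, 0.3333, -1.0, 1.0]
tokens = [quantize(a) for a in action]
recovered = [dequantize(t) for t in tokens]

# Worst-case round-trip error is bounded by half a bin width.
max_error = max(abs(a - r) for a, r in zip(action, recovered))
half_bin_width = (1.0 - (-1.0)) / 256 / 2
```

A dedicated action module that outputs continuous values avoids this rounding step, which is one plausible reading of why the paper moves away from token-based action prediction.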

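Proposed method

Introduce a componentized VLA architecture with a dedicated action module conditioned on VLM output.
Evaluate diffusion action transformers for action sequence modeling.
Run ablation studies to identify effective action module designs and to assess scaling behavior.
Validate across five robot embodiments in both simulation and real-world settings.
Compare against the OpenVLA (7B) and RT-2-X (55B) baselines to measure gains in task success rate.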
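
The idea of an action module that denoises an action sequence while conditioned on VLM output can be sketched as a small sampling loop. This is a toy under stated assumptions and is not the paper's implementation: `toy_target` and `toy_noise_predictor` are hypothetical stand-ins for a trained diffusion transformer, the `cognition` vector stands in for the VLM output feature, the linear beta schedule and step count are arbitrary, and a deterministic DDIM-style update is used for simplicity.

```python
import math
import random

HORIZON, DIM = 4, 2  # hypothetical: 4 future steps, 2 action dimensions


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.05):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def toy_target(cognition):
    """Stand-in for the action sequence a trained model would associate with
    this conditioning vector (purely illustrative)."""
    return [[math.tanh(cognition[d] * (k + 1)) for d in range(DIM)]
            for k in range(HORIZON)]


def toy_noise_predictor(x_t, alpha_bar, cognition):
    """Oracle noise estimate; a real system would use a learned network here."""
    x0 = toy_target(cognition)
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[(x - s * c) / n for x, c in zip(row_x, row_c)]
            for row_x, row_c in zip(x_t, x0)]


def sample_actions(cognition, num_steps=200, seed=0):
    """Start from Gaussian noise and denoise step by step, conditioning every
    step on the cognition vector."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(HORIZON)]
    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = toy_noise_predictor(x, ab_t, cognition)
        # Estimate the clean sequence, then step to the previous noise level.
        x0_hat = [[(xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t)
                   for xi, ei in zip(rx, re)] for rx, re in zip(x, eps)]
        x = [[math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * ei
              for x0, ei in zip(r0, re)] for r0, re in zip(x0_hat, eps)]
    return x


actions = sample_actions([0.5, -0.3])
```

Because the noise predictor here is an oracle, the loop recovers `toy_target(cognition)` almost exactly; with a learned predictor the output would instead reflect whatever the network has learned from demonstration data. The sketch outputs a whole sequence of continuous actions in one pass, in contrast to decoding one discrete token per action dimension.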

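Experimental results

Research questions

RQ1: Can a specialized action module conditioned on Vision-Language-Model output improve manipulation success rates compared with directly quantizing the VLM output into actions?
RQ2: Do diffusion action transformers offer an advantage for action sequence modeling and scaling in VLA models for robotic manipulation?
RQ3: How well does CogACT generalize to new robots, unseen objects, and varied backgrounds across simulation and real-world trials?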

Key findings

CogACT shows task performance that significantly surpasses existing VLAs across five robot embodiments.
In simulation, CogACT exceeds the average success rate of the OpenVLA baseline (similar model size, 7B) by over 35%.
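In real-robot experiments, CogACT exceeds OpenVLA's average success rate by over 55%.
CogACT outperforms the large RT-2-X model (55B) by 18% in absolute success rate in simulation.
It shows notable adaptation to new robots and generalization to unseen objects and backgrounds.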

This review was generated by AI and checked by a human editor.