QUICK REVIEW

[Paper Review] 3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu|arXiv (Cornell University)|Mar 14, 2024

Robotics and Automated Systems|Citations: 14

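ONE-SENTENCE SUMMARY

3D-VLA introduces a 3D vision-language-action embodied foundation model that reasons in 3D, generates goal images and point clouds, and predicts manipulation actions by combining diffusion-based goal generation with a 3D-LLM backbone and interaction tokens.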

ABSTRACT

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

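Motivation and Objectives

Motivate the need for a 3D embodied world model that integrates perception, reasoning, and action beyond 2D inputs.
Develop a scalable 3D embodied foundation model that can reason in 3D environments, imagine multimodal goals, and plan actions.
Build a large-scale 3D embodied instruction-tuning dataset for training such a model.
Enable interaction with 3D environments through dedicated tokens and diffusion-based goal generation aligned with the LLM.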

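Proposed Method

Build 3D-VLA on a 3D-LLM backbone equipped with dedicated interaction tokens for objects, locations, and scenes, plus tokens for 7-DoF robot actions.
Pretrain RGB-D-to-RGB-D and point-cloud-to-point-cloud diffusion models to inject goal-generation ability, then align them with the LLM through a transformer-based projector.
Extract 3D annotations (depth, point clouds, 3D bounding boxes, actions) from robotics and human-object interaction datasets to build a 2M-sample 3D-language-action dataset.
Train only the newly added token embeddings, the output layer, and the projector, while fine-tuning the diffusion models with LoRA.
Bridge the language model and the diffusion models (DMs) through additional tokens (such as <image> and <pcd>) and the projector, which maps LLM embeddings into the DM feature space.
Train the pipeline to produce instruction-conditioned 3D goal outputs (images, depth, point clouds) that guide embodied planning.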
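The review mentions tokens for 7-DoF robot actions but gives no encoding details. The following is a minimal sketch of one way a continuous action could be turned into discrete tokens for a language model. Everything specific here is an assumption made for illustration: the split into three position values, three rotation values, and one gripper value; uniform binning into 256 bins; the normalized range [-1, 1]; and the token names `<locN>`, `<rotN>`, and `<gripperN>`.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed bin count, not stated in the review


def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)  # clamp out-of-range inputs
    fraction = (clipped - low) / (high - low)
    # The upper bound maps to num_bins, so cap it at the last valid bin.
    return min(int(fraction * num_bins), num_bins - 1)


def action_to_tokens(
    action: Sequence[float], low: float = -1.0, high: float = 1.0
) -> List[str]:
    """Convert (x, y, z, roll, pitch, yaw, gripper) into hypothetical action tokens."""
    if len(action) != 7:
        raise ValueError("expected a 7-DoF action")
    x, y, z, roll, pitch, yaw, gripper = action
    tokens = [f"<loc{discretize(v, low, high)}>" for v in (x, y, z)]
    tokens += [f"<rot{discretize(v, low, high)}>" for v in (roll, pitch, yaw)]
    # Gripper is treated as binary: values above 0.5 count as open.
    tokens.append("<gripper1>" if gripper > 0.5 else "<gripper0>")
    return tokens
```

For example, `action_to_tokens([-1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])` returns `['<loc0>', '<loc128>', '<loc255>', '<rot128>', '<rot128>', '<rot128>', '<gripper1>']`. Discretizing actions this way lets the same output layer that predicts text also predict actions, which fits the method's choice to train only the new token embeddings and output layer.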

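Experimental Results

Research Questions

RQ1: Can a 3D-oriented world model outperform 2D baselines in embodied reasoning, localization, and planning?
RQ2: How much can goal-generation diffusion models aligned with the LLM improve multimodal outputs (images, depth, point clouds) on embodied tasks?
RQ3: Can the 3D embodied instruction-tuning dataset enable robust 3D localization, task captioning, and action prediction in robotics environments?
RQ4: How do interaction tokens and 3D representations affect action planning and generalization to unseen tasks?
RQ5: How does 3D-VLA compare with 2D vision-language models on embodied tasks?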

Key Findings

Method	IoU	Acc@25	Acc@50
Kosmos-2 (w/ GT Depth)	10.92	12.73	3.85
CoVLM (w/ GT Depth)	19.81	25.39	16.61
3D-VLA	29.33	42.26	27.09
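The review does not define the IoU, Acc@25, and Acc@50 columns. A common reading, assumed here, is that IoU is the intersection-over-union between predicted and ground-truth boxes, and Acc@25 / Acc@50 are the percentages of predictions whose IoU reaches 0.25 / 0.5. The sketch below illustrates that reading for axis-aligned 3D boxes; the box format `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        intersection *= max(0.0, high - low)  # zero if no overlap on this axis
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at(ious: Sequence[float], threshold: float) -> float:
    """Percentage of predictions whose IoU is at least the threshold."""
    if not ious:
        raise ValueError("need at least one IoU value")
    hits = sum(1 for value in ious if value >= threshold)
    return 100.0 * hits / len(ious)
```

Under this reading, Acc@50 can never exceed Acc@25 for the same set of predictions, which is consistent with every row of the table.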

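3D-VLA outperforms 2D baselines on reasoning and localization tasks across the embodied datasets.
Aided by intermediate predicted bounding boxes, it outperforms non-3D and non-aligned baselines on RGB and point-cloud goal generation.
It shows strong action planning on the RLBench and CALVIN benchmarks, with notable improvements over the selected baselines on multiple tasks.
On localization with 3D annotations, 3D-VLA outperforms the Kosmos-2 and CoVLM baselines, both of which are given ground-truth depth (IoU 29.33 vs. 10.92 and 19.81; Acc@25 42.26 vs. 12.73 and 25.39).
The dataset and 3D tokens enable robust interaction with dynamic scenes and improved manipulation prediction.
The dedicated 3D embodied instruction-tuning dataset is effective for enabling 3D reasoning, goal imagination, and action execution.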

This review was generated by AI and checked by a human editor.