QUICK REVIEW

[Paper Review] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown|arXiv (Cornell University)|Jul 28, 2023

Multimodal Machine Learning Applications | Cited by 265

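One-line summary

RT-2 fine-tunes large vision-language models to output robot actions, enabling end-to-end control that inherits web-scale vision-language pretraining and improves generalization and emergent semantic reasoning.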

ABSTRACT

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).

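Motivation and objectives

Improve the generalization of robot control by leveraging web-scale vision-language pretraining.
Realize a single end-to-end model that maps observations to actions and draws on language-grounded semantics.
Investigate, on robot tasks, the emergent capabilities that arise from web-scale training.
Evaluate how co-fine-tuning on robot trajectories and web data affects performance and generalization.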

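Proposed method

Represent robot actions as text tokens, and train vision-language models to output action tokens alongside natural language.
Fine-tune pretrained vision-language models (PaLI-X and PaLM-E) on a combination of robot trajectories and web-scale vision-language tasks (e.g., VQA, captioning).
Co-fine-tune on web data and robot data so the model retains concepts learned from the web while adapting to robot control.
Discretize each dimension of the 6-DoF action space into 256 bins and map the bins to tokens in the model's vocabulary.
Constrain decoding to valid action tokens when the model is prompted with a robot task, so that every output is executable.
Serve the large models from a cloud service on multi-TPU infrastructure for real-time inference, reaching 1–3 Hz with the 55B model.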
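
The discretization and constrained-decoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the value range of each action dimension, the rendering of bins as space-separated integers, and all function names (`discretize`, `action_to_text`, `greedy_action_token`, and so on) are assumptions made for the example. How bin indices are actually tied to vocabulary entries depends on the tokenizer of the underlying model.

```python
from typing import List, Sequence, Set, Tuple

NUM_BINS = 256  # bins per action dimension, as stated in the review

Bounds = Sequence[Tuple[float, float]]  # hypothetical (low, high) per dimension


def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land on num_bins, so clamp it into the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(bin_index: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Return the centre of a bin as the continuous value to execute."""
    return low + (bin_index + 0.5) * (high - low) / num_bins


def action_to_text(action: Sequence[float], bounds: Bounds) -> str:
    """Render an action as a string of integer tokens, one per dimension."""
    return " ".join(str(discretize(v, lo, hi)) for v, (lo, hi) in zip(action, bounds))


def text_to_action(text: str, bounds: Bounds) -> List[float]:
    """Parse a string of integer tokens back into a continuous action."""
    bins = [int(token) for token in text.split()]
    if len(bins) != len(bounds):
        raise ValueError("expected one token per action dimension")
    return [undiscretize(b, lo, hi) for b, (lo, hi) in zip(bins, bounds)]


def mask_logits(logits: Sequence[float], valid_token_ids: Set[int]) -> List[float]:
    """Constrained decoding: give every non-action token a logit of -inf."""
    return [x if i in valid_token_ids else float("-inf") for i, x in enumerate(logits)]


def greedy_action_token(logits: Sequence[float], valid_token_ids: Set[int]) -> int:
    """Pick the highest-scoring token among the valid action tokens only."""
    masked = mask_logits(logits, valid_token_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Assumed range of [-1, 1] for each of the six dimensions (illustrative only).
BOUNDS = [(-1.0, 1.0)] * 6
```

With these assumed bounds, `action_to_text([0.0] * 6, BOUNDS)` yields six integer tokens, and `text_to_action` recovers each value to within one bin width. The masking step means that even when a non-action token has the highest raw logit, the decoder can only emit a token that maps back to a bin.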

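Experimental results

Research questions

RQ1: How well do RT-2 models generalize to unseen objects, backgrounds, and environments compared with the baselines?
RQ2: What emergent capabilities appear when web-scale vision-language pretraining is transferred to robot control?
RQ3: How do model size and training strategy (co-fine-tuning versus fine-tuning on robot data alone or training from scratch) affect generalization?
RQ4: How does chain-of-thought prompting contribute to RT-2's reasoning and to success in robot manipulation?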

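Key findings

RT-2 (the PaLI-X and PaLM-E variants) generalizes substantially better than RT-1 and MOO across objects, scenes, and instructions, with improvements of roughly 2x to 6x depending on the test.
RT-2 exhibits emergent semantic reasoning, such as placing an object at a semantically specified location or selecting an object based on its relation to others.
Chain-of-thought prompting enables multi-stage semantic reasoning and improves planning and execution.
Larger RT-2 models generalize somewhat better, and co-fine-tuning with web data yields stronger generalization than fine-tuning on robot data alone.
In the Language-Table simulation, RT-2-PaLI-3B outperforms the baselines, suggesting that web-scale pretraining also transfers to other robotics-style tasks.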
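
The co-fine-tuning compared in the findings above amounts to drawing each training batch from both robot trajectories and web vision-language examples. A rough sketch of such a batch sampler follows. The function name `mixed_batch`, the `robot_fraction` parameter, and the value 0.5 used below are hypothetical; the review does not state the mixing ratio that was actually used.

```python
import random
from typing import List, Sequence, Tuple


def mixed_batch(
    robot_examples: Sequence[object],
    web_examples: Sequence[object],
    batch_size: int,
    robot_fraction: float,
    rng: random.Random,
) -> List[Tuple[str, object]]:
    """Sample one batch that mixes robot and web examples.

    robot_fraction is an assumed knob: the share of the batch drawn from
    robot data. Each item is tagged with its source so a training loop
    could treat action targets and text targets uniformly as token sequences.
    """
    if not 0.0 <= robot_fraction <= 1.0:
        raise ValueError("robot_fraction must be between 0 and 1")
    n_robot = round(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    # Sample with replacement from each source, then shuffle the combined batch.
    batch = [("robot", rng.choice(robot_examples)) for _ in range(n_robot)]
    batch += [("web", rng.choice(web_examples)) for _ in range(n_web)]
    rng.shuffle(batch)
    return batch


# Illustrative usage with placeholder examples and an assumed 50/50 split.
example_batch = mixed_batch(["r1", "r2"], ["w1", "w2", "w3"], 8, 0.5, random.Random(0))
```

Setting `robot_fraction` to 1.0 reduces this to fine-tuning on robot data alone, which is the setting the findings report as generalizing less well.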


This review was generated by AI and checked by a human editor.