QUICK REVIEW

[Paper Review] Vision-Language Models Provide Promptable Representations for Reinforcement Learning

William Chen, Oier Mees | arXiv (Cornell University) | Feb 5, 2024

Multimodal Machine Learning Applications | Citations: 6

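One-line summary

PR2L initializes RL policies with promptable representations from vision-language models, and on Minecraft and Habitat tasks it outperforms both non-promptable embeddings and direct VLM action generation.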

ABSTRACT

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex, long horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings. Finally, we show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.

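Motivation and objectives

Motivation: use world knowledge from VLMs (vision-language models) to improve the sample efficiency of RL.
Introduce PR2L: prompt a VLM to produce task-relevant embeddings for RL, without end-to-end fine-tuning of the VLM.
Show that promptable VLM embeddings can be grounded in low-level control signals through RL.
Compare PR2L against non-promptable embeddings and direct VLM action-generation methods across long-horizon tasks.
Show that promptable representations are on par in quality with domain-specific embeddings.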

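Proposed method

Query a generative VLM with a task-relevant prompt for each observation to obtain promptable representations.
Use the embeddings from selected VLM layers (the last few layers) as input to a Transformer-based policy that has a CLS token for summarizing variable-length inputs.
Discard the decoded text and train an RL policy that maps embeddings to actions (PPO in Minecraft; offline RL with CQL/QR-DQN in Habitat).
Use greedy decoding for efficiency, and rely on prompt design to elicit task-relevant semantic features.
Design task-relevant prompts that encode features such as the presence or absence of target entities, along with auxiliary context text.
Evaluate prompts on a small labeled dataset, rather than by end-to-end optimization, as a proxy for downstream task usefulness.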
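The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: `stub_vlm_embed` is a hypothetical stand-in that returns deterministic pseudo-random vectors instead of real VLM hidden states, `ClsPoolingPolicy` replaces the full Transformer policy with a single attention-pooling step driven by a CLS vector, the prompt string is only illustrative, and RL training (PPO, CQL, or QR-DQN) is omitted. The weights are random, so the action probabilities carry no meaning; the point is only that variable-length embedding sequences are reduced to one fixed-size summary before the action head.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def stub_vlm_embed(observation, prompt, dim=8):
    """Hypothetical stand-in for a VLM (assumption, not a real API).

    A real VLM would return hidden states for the image, the prompt, and the
    greedily decoded answer. Here each whitespace token is mapped to a
    deterministic pseudo-random vector so the sketch runs without a model.
    """
    tokens = (observation + " " + prompt).split()
    embeddings = []
    for tok in tokens:
        rng = random.Random(tok)  # string seeds are deterministic
        embeddings.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return embeddings


class ClsPoolingPolicy:
    """Simplified policy: a CLS vector attends over the token embeddings."""

    def __init__(self, dim, num_actions, seed=0):
        rng = random.Random(seed)
        # In a trained policy these would be learned parameters.
        self.cls = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        self.w_out = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_actions)
        ]

    def summarize(self, token_embeddings):
        """Reduce a variable-length sequence to one fixed-size vector."""
        seq = [self.cls] + token_embeddings  # prepend the CLS token
        scale = math.sqrt(len(self.cls))
        weights = softmax([dot(self.cls, t) / scale for t in seq])
        dim = len(self.cls)
        return [sum(w * t[i] for w, t in zip(weights, seq)) for i in range(dim)]

    def action_probs(self, token_embeddings):
        """Map the summary vector to a distribution over actions."""
        summary = self.summarize(token_embeddings)
        return softmax([dot(row, summary) for row in self.w_out])


prompt = "Is there a spider in this image?"  # illustrative prompt
short_seq = stub_vlm_embed("frame_a", prompt)
long_seq = stub_vlm_embed("frame_b with extra context", prompt)

policy = ClsPoolingPolicy(dim=8, num_actions=4)
probs_short = policy.action_probs(short_seq)
probs_long = policy.action_probs(long_seq)
```

Both calls return a distribution over the same four actions even though the two embedding sequences differ in length, which is the role the CLS token plays in the policy described above.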

Figure 1: An example instantiation of PR2L for the combat spider Minecraft task. We query a VLM with a task-relevant prompt about observations to produce promptable representations, which we train a policy on via RL. Rather than directly asking for actions or specifying the task, the prompt enables

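Experimental results

Research questions

RQ1: Do promptable VLM representations improve learning efficiency and performance compared with non-promptable visual embeddings?
RQ2: How does PR2L compare with methods that generate actions directly from a VLM?
RQ3: Can promptable representations compete with domain-specific embeddings in Minecraft and Habitat?
RQ4: How do prompt design and decoding method affect RL performance?
RQ5: Is PR2L effective in offline RL settings, where exploration is limited?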

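Key findings

PR2L outperforms the non-promptable VLM image-encoder baseline on Minecraft tasks.
PR2L outperforms the direct VLM action-generation baseline on Minecraft tasks.
On Habitat ObjectNav, offline RL performance exceeds the baseline, nearly doubling the average success rate.
In Habitat, promptable representations yield structured VLM outputs that correlate with expert state values.
In Minecraft, PR2L embeddings show a bimodal distribution that includes high-reward transitions, which helps learning.
PR2L with a general-purpose VLM is competitive with domain-specific representations.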

Figure 3: Example tasks, observations, and task-relevant prompts from MineDojo and Habitat.

This review was generated by AI and checked by a human editor.