QUICK REVIEW

[Paper Review] Vision-Language Models Provide Promptable Representations for Reinforcement Learning

William Chen, Oier Mees|arXiv (Cornell University)|Feb 5, 2024

Multimodal Machine Learning Applications|Cited by 6

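One-Sentence Summary

PR2L uses promptable representations from vision-language models to initialize reinforcement learning policies; on Minecraft and Habitat tasks it outperforms non-promptable embeddings and direct VLM action generation.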

ABSTRACT

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex, long horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings. Finally, we show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.

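Motivation and Goals

Motivate using the world knowledge in vision-language models (VLMs) to improve the sample efficiency of RL.
Introduce PR2L: prompt a VLM to produce task-relevant embeddings for RL, without fine-tuning the VLM end to end.
Show that RL can ground promptable VLM embeddings in low-level control signals.
Compare PR2L with non-promptable embeddings and direct VLM action methods on long-horizon tasks.
Show that promptable representations are comparable in quality to domain-specific embeddings.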

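Proposed Method

For each observation, query a generative VLM with a task-relevant prompt to obtain a promptable representation.
Take embeddings from selected VLM layers (the last few) as input to a Transformer-based policy, which uses a CLS token to summarize the variable-length input.
Discard the decoded text and train an RL policy to map embeddings to actions (PPO in Minecraft; offline RL with CQL/QR-DQN in Habitat).
Use greedy decoding for efficiency, and rely on prompt design to elicit task-relevant semantic features.
Design task-relevant prompts that encode features such as the presence of target entities, along with auxiliary context text.
Evaluate prompts on a small labeled dataset as a proxy for downstream usefulness, rather than optimizing them end to end.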
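The CLS-token step above can be illustrated with a minimal sketch. It is not the authors' implementation: a single attention layer stands in for the full Transformer policy, the weights are random rather than trained with RL, and `fake_vlm_embeddings` is a hypothetical placeholder that returns random vectors where a real system would read hidden states from a frozen VLM. The class and function names are assumptions made for illustration only.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def fake_vlm_embeddings(num_tokens, embed_dim, seed):
    """Hypothetical placeholder for per-token embeddings from a frozen VLM."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(num_tokens)]


class ClsAttentionPolicy:
    """Simplified stand-in for the policy: one attention layer with a CLS query."""

    def __init__(self, embed_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.cls_token = [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
        self.w_query = random_matrix(embed_dim, embed_dim, rng)
        self.w_key = random_matrix(embed_dim, embed_dim, rng)
        self.w_value = random_matrix(embed_dim, embed_dim, rng)
        self.w_action = random_matrix(num_actions, embed_dim, rng)

    def summarize(self, token_embeddings):
        # Prepend CLS so it attends over itself and every VLM token.
        sequence = [self.cls_token] + list(token_embeddings)
        query = matvec(self.w_query, self.cls_token)
        keys = [matvec(self.w_key, tok) for tok in sequence]
        values = [matvec(self.w_value, tok) for tok in sequence]
        scale = math.sqrt(self.embed_dim)
        weights = softmax([dot(query, k) / scale for k in keys])
        # The weighted sum of values is a fixed-size summary of the sequence.
        return [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(self.embed_dim)
        ]

    def action_probs(self, token_embeddings):
        summary = self.summarize(token_embeddings)
        return softmax(matvec(self.w_action, summary))


policy = ClsAttentionPolicy(embed_dim=8, num_actions=4)
short_obs = fake_vlm_embeddings(num_tokens=3, embed_dim=8, seed=1)
long_obs = fake_vlm_embeddings(num_tokens=11, embed_dim=8, seed=2)
probs_short = policy.action_probs(short_obs)
probs_long = policy.action_probs(long_obs)
print(len(probs_short), len(probs_long))
```

Both observations produce a distribution over the same four actions even though their token counts differ, which is the point of pooling through a CLS token. In the setup the summary describes, an RL algorithm such as PPO would train the policy weights while the VLM stays frozen.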

Figure 1: An example instantiation of PR2L for the combat spider Minecraft task. We query a VLM with a task-relevant prompt about observations to produce promptable representations, which we train a policy on via RL. Rather than directly asking for actions or specifying the task, the prompt enables

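Experimental Results

Research Questions

RQ1: Do promptable representations from VLMs improve learning efficiency and performance over non-promptable visual embeddings?
RQ2: How does PR2L compare with methods that generate actions directly from VLMs?
RQ3: In Minecraft and Habitat, can promptable representations compete with domain-specific embeddings?
RQ4: How do prompt design and the decoding scheme affect RL performance?
RQ5: Is PR2L also effective in offline RL settings, where exploration is limited?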
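On the prompt-design question, the method summary mentions scoring candidate prompts on a small labeled dataset as a proxy for downstream usefulness. One plausible reading of that idea is sketched below. Everything here is hypothetical: `toy_vlm_answer` is a stub that stands in for a real VLM's decoded text, the observations are dictionaries rather than images, and the yes/no accuracy metric is an assumption, not the metric reported in the paper.

```python
def toy_vlm_answer(observation, prompt):
    """Hypothetical stub: answers "yes" only if the prompt names a visible entity."""
    mentioned = [e for e in observation["entities"] if e in prompt.lower()]
    return "Yes, I can see one." if mentioned else "No."


def prompt_accuracy(answer_fn, prompt, labeled_examples):
    """Fraction of examples where the decoded yes/no answer matches the label."""
    correct = 0
    for observation, has_target in labeled_examples:
        answer = answer_fn(observation, prompt)
        predicted = answer.strip().lower().startswith("yes")
        correct += int(predicted == has_target)
    return correct / len(labeled_examples)


# Small hand-labeled set: does the observation contain the target (a spider)?
labeled_examples = [
    ({"entities": ["spider", "grass"]}, True),
    ({"entities": ["cow", "grass"]}, False),
    ({"entities": ["spider", "tree"]}, True),
    ({"entities": ["sheep"]}, False),
]

candidate_prompts = [
    "Is there a spider in this image?",
    "What do you see?",
]

scores = {
    prompt: prompt_accuracy(toy_vlm_answer, prompt, labeled_examples)
    for prompt in candidate_prompts
}
best_prompt = max(scores, key=scores.get)
print(best_prompt, scores[best_prompt])
```

The appeal of a proxy like this is cost: ranking prompts needs only a handful of labeled observations and VLM queries, not a full RL training run per candidate prompt.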

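Key Findings

PR2L outperforms the non-promptable VLM image-encoder baseline on Minecraft tasks.
PR2L outperforms the direct VLM action-generation baseline on Minecraft tasks.
In offline RL on Habitat ObjectNav, PR2L outperforms the baseline, nearly doubling the average success rate.
In Habitat, the VLM outputs behind the promptable representations are structured and correlate with states of high expert value.
In Minecraft, PR2L embeddings show a bimodal distribution associated with high-reward transitions, which aids learning.
PR2L with a general-purpose VLM is competitive with domain-specific representations.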

Figure 3: Example tasks, observations, and task-relevant prompts from MineDojo and Habitat.

This review was generated by AI and reviewed by a human editor.