QUICK REVIEW

[Paper Digest] RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Yufei Wang, Zhanyi Sun|arXiv (Cornell University)|Feb 6, 2024

Robotics and Automated Systems | Cited by 6

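One-Sentence Summary

RL-VLM-F automatically generates reward functions for RL by querying a vision-language foundation model to compare pairs of the agent's image observations against a text description of the task goal, so that agents learn diverse manipulation tasks and reach strong performance without hand-designed rewards.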

ABSTRACT

Reward engineering has long been a challenge in Reinforcement Learning (RL) research, as it often requires extensive human effort and iterative processes of trial-and-error to design effective reward functions. In this paper, we propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent's visual observations, by leveraging feedbacks from vision language foundation models (VLMs). The key to our approach is to query these models to give preferences over pairs of the agent's image observations based on the text description of the task goal, and then learn a reward function from the preference labels, rather than directly prompting these models to output a raw reward score, which can be noisy and inconsistent. We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains - including classic control, as well as manipulation of rigid, articulated, and deformable objects - without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions. Videos can be found on our project website: https://rlvlmf2024.github.io/

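Motivation and Objectives

Eliminate manual reward engineering by relying only on a text description of the task and visual observations.
Learn reward functions automatically from the preferences of a vision-language foundation model (VLM).
Demonstrate applicability to classic control and to the manipulation of rigid, articulated, and deformable objects.
Analyze how VLM-based preferences affect reward learning and policy performance.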

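Proposed Method

Use a two-stage VLM prompting process to obtain a preference over each image pair, conditioned on the task description.
Learn a reward function rψ from the VLM-provided preferences through a Bradley-Terry-based likelihood (Eq. 1).
Optimize the reward function by minimizing the standard preference loss (Eq. 2), and update the policy with off-policy RL.
Use SAC as the underlying RL algorithm, and relabel the replay buffer whenever the reward function is updated.
Iterate: collect rollouts, sample image pairs, query the VLM for preferences, and update both the policy and the reward model (Algorithm 1).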
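The preference-learning step above can be sketched in a few lines. The following is a minimal NumPy illustration of a Bradley-Terry preference probability and the matching cross-entropy loss, not the authors' implementation. It assumes the commonly used form in which the probability that image 1 is preferred over image 0 is exp(r1) / (exp(r0) + exp(r1)); the notation in the paper's Eq. 1 and Eq. 2 may differ. The function names, the label convention (1 means the second image is preferred), and the use of precomputed scalar rewards in place of a neural reward model are all assumptions made for illustration.

```python
import numpy as np


def preference_prob(r0, r1):
    """Bradley-Terry probability that image 1 is preferred over image 0.

    exp(r1) / (exp(r0) + exp(r1)) equals sigmoid(r1 - r0).
    r0, r1: predicted rewards for the first and second image of each pair.
    """
    r0 = np.asarray(r0, dtype=float)
    r1 = np.asarray(r1, dtype=float)
    return 1.0 / (1.0 + np.exp(-(r1 - r0)))


def preference_loss(r0, r1, labels):
    """Mean negative log-likelihood of the preference labels.

    labels[i] == 1 means image 1 was preferred, 0 means image 0 was
    preferred (an assumed convention for this sketch).
    """
    labels = np.asarray(labels, dtype=float)
    diff = np.asarray(r1, dtype=float) - np.asarray(r0, dtype=float)
    # -log sigmoid(diff)       = logaddexp(0, -diff)  (image 1 preferred)
    # -log (1 - sigmoid(diff)) = logaddexp(0,  diff)  (image 0 preferred)
    nll = labels * np.logaddexp(0.0, -diff) + (1.0 - labels) * np.logaddexp(0.0, diff)
    return float(nll.mean())
```

In a full pipeline the rewards would come from a learned network applied to image observations, and the loss would be minimized by gradient descent; this sketch only shows the quantity being minimized.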

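Experimental Results

Research Questions

RQ1: Can vision-language foundation models provide preference labels reliable enough to learn task rewards from raw image observations?
RQ2: How does VLM-based reward learning compare with raw VLM scores and other baselines across a range of robotic tasks?
RQ3: How well do rewards learned automatically from task descriptions generalize across classic control and the manipulation of rigid, articulated, and deformable objects?
RQ4: How much does the two-stage prompting strategy improve VLM feedback relative to single-stage prompting?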

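Key Findings

Across seven tasks, RL-VLM-F outperforms baselines that rely on raw VLM scores, CLIP/BLIP-2 similarity, or RoboCLIP-style rewards.
On six of the seven tasks, RL-VLM-F matches or exceeds the performance obtained with ground-truth (GT) preferences, showing strong automatic reward learning.
The two-stage VLM prompting strategy outperforms single-stage prompting on most tasks.
VLM-generated preference labels are correct more often than they are wrong, and their accuracy rises as the difference in visual progress between the two images grows.
The learned rewards track task progress closely enough to yield effective policies, even though the reward signal is noisy and contains local minima.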
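The finding that label accuracy depends on the progress gap between two images suggests a simple way to check it. The sketch below bins image pairs by the absolute difference in a ground-truth progress measure and reports VLM label accuracy per bin. It is an illustrative assumption, not the paper's evaluation code: the function name, the scalar progress inputs, and the label convention (1 means the second image is preferred) are all hypothetical.

```python
import numpy as np


def accuracy_by_gap(progress_a, progress_b, vlm_labels, bin_edges):
    """VLM label accuracy grouped by the progress gap within each pair.

    progress_a, progress_b: ground-truth task progress of the first and
        second image of each pair (hypothetical scalar measure).
    vlm_labels: 0 if the VLM preferred the first image, 1 if the second.
    bin_edges: increasing edges; bin i covers [edges[i], edges[i + 1]).
    Returns a list of (low, high, accuracy_or_None, pair_count).
    """
    a = np.asarray(progress_a, dtype=float)
    b = np.asarray(progress_b, dtype=float)
    labels = np.asarray(vlm_labels, dtype=int)

    gap = np.abs(b - a)
    # The "correct" label picks the image with more progress. Pairs with
    # equal progress have no correct answer and are excluded below.
    true_labels = (b > a).astype(int)
    correct = labels == true_labels
    has_winner = gap > 0

    results = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        mask = has_winner & (gap >= low) & (gap < high)
        count = int(mask.sum())
        accuracy = float(correct[mask].mean()) if count else None
        results.append((low, high, accuracy, count))
    return results
```

If accuracy increases from the small-gap bins to the large-gap bins, that would be consistent with the trend reported above.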


This digest was generated by AI and reviewed by a human editor.