QUICK REVIEW

[Paper Review] Vision-Language Models Unlock Task-Centric Latent Actions

Alexander Nikulin, Ilya Zisman|arXiv (Cornell University)|Jan 30, 2026

Multimodal Machine Learning Applications | Cited by 0

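One-Sentence Summary

The paper uses promptable representations from vision-language models to filter out distractors and improve latent action learning, without action supervision, yielding up to a six-fold increase in downstream success rates for offline imitation learning on Distracting MetaWorld.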

ABSTRACT

Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.

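Research Motivation and Objectives

Address the failure of latent action learning under action-correlated distractors when learning from offline observations.
Propose using promptable VLM representations, obtained without supervision, as targets for latent action models in order to separate controllable changes from noise.
Benchmark a wide range of VLMs to assess the quality of their promptable representations, their robustness, and the effect of language conditioning.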
Demonstrate that promptable representations can significantly improve latent action quality and downstream performance without true-action supervision.

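Proposed Method

Define promptable representations: obtain observation embeddings from a VLM using a task-specific prompt and a simple pooling strategy.
Use these representations as targets for the forward dynamics model (FDM) in the latent action model, avoiding action quantization.
Benchmark a variety of VLMs on MT10 across 29k+ experiments to assess representation quality and robustness to prompts and hyperparameters.
Evaluate latent action quality with a linear probe that predicts ground-truth actions from latent actions, and measure downstream success after fine-tuning with action labels.
Build a controlled Distracting MetaWorld setting by adding distractor videos, and compare against the standard LAPO baseline.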
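
The first two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: mean pooling stands in for the unspecified "simple pooling strategy", `idm` and `fdm` are hypothetical callables for the inverse and forward dynamics models, and the token embeddings are made-up numbers rather than real VLM outputs.

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-size representation.

    Assumption: mean pooling is used here only as one example of a
    simple pooling strategy.
    """
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def lam_loss(idm, fdm, obs_t, obs_next, target_next):
    """LAPO-style objective with a VLM representation as the FDM target.

    idm(obs_t, obs_next) -> continuous latent action (no quantization)
    fdm(obs_t, latent)   -> predicted representation of the next observation
    target_next          -> promptable representation of the next observation
    """
    latent = idm(obs_t, obs_next)
    return mse(fdm(obs_t, latent), target_next)


# Toy usage: hypothetical hidden states for three tokens of the next frame.
next_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
target = mean_pool(next_tokens)  # pooled target vector

# Hypothetical linear IDM/FDM, chosen only to make the example runnable.
toy_idm = lambda o, o_next: [b - a for a, b in zip(o, o_next)]
toy_fdm = lambda o, z: [a + d for a, d in zip(o, z)]
loss = lam_loss(toy_idm, toy_fdm, [0.0, 1.0], target, target)
```

In a real pipeline the IDM and FDM would be trained networks and the loss would be minimized over a dataset of consecutive frames; the point of the sketch is only that the regression target is the prompted VLM embedding rather than raw pixels.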

Figure 1: Main result. Success rate on the MetaWorld-10 benchmark for LAPO and the proposed LAPO+VLM (Molmo), which uses promptable representations. We use three random seeds and report IQM and $95\%$ CI based on stratified bootstrapping, following Agarwal et al. (2021). See Section 7 for full res
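
For readers unfamiliar with the statistics named in the caption, the following is a simplified sketch of an interquartile mean (IQM) with a stratified bootstrap confidence interval. It is an illustration, not the evaluation code used in the paper: the function names, the resample count, and the example scores are all assumptions, and stratification here means resampling runs with replacement within each task.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: average after dropping the lowest and highest 25%."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(scores_by_task, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, (low, high)) for the IQM over all tasks.

    scores_by_task maps a task name to its per-seed success rates.
    Each bootstrap replicate resamples runs independently within each task.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = []
        for runs in scores_by_task.values():
            resampled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(iqm(resampled))
    estimates.sort()
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    point = iqm([s for runs in scores_by_task.values() for s in runs])
    return point, (low, high)


# Made-up success rates for two tasks with three seeds each.
example = {"task-a": [0.2, 0.4, 0.3], "task-b": [0.7, 0.9, 0.8]}
point, (low, high) = stratified_bootstrap_ci(example)
```

With only three seeds per task the interval will be wide, which is why the stratified resampling pools information across tasks.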

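Experimental Results

Research Questions

RQ1: Can promptable representations from Vision-Language Models separate controllable changes from distractor noise and thereby help latent action learning?
RQ2: Under distractors, which VLMs and prompting strategies yield the best latent actions and downstream policy performance?
RQ3: Do language-conditioned prompts outperform self-supervised baselines (e.g., CLIP, DINOv2) as LAM targets?
RQ4: How does the chosen latent action dimensionality affect the effectiveness of VLM-guided targets?
RQ5: How well do improvements found on a small MT10 subset carry over to the full dataset?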

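Key Findings

Promptable representations substantially improve on LAPO, with Molmo the most robust to hyperparameters.
VLMs with language conditioning and task-oriented prompts significantly improve latent action quality and reduce the impact of distractors.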
On Distracting MetaWorld, LAPO+VLM with promptable representations achieves up to a six-fold increase in downstream success rate over LAPO.
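Embedding-only models (e.g., CLIP-based) do not outperform promptable VLMs; language conditioning is critical for performance.
On the full MT10 data, LAPO+Molmo and related VLMs narrow the gap to distractor-free performance, and reducing the latent action dimensionality improves results further.
In distracting settings, promptable representations can outperform baselines such as OTTER and UniVLA.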
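Molmo's gains most likely stem from data quality rather than architectural changes, since different backbones trained on the same data yield different results.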
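
The "latent action quality" behind these findings is measured with a linear probe from latent actions to ground-truth actions. The sketch below shows the idea in its simplest form, assuming scalar latents and scalar actions so that ordinary least squares has a closed form; the function names and data are hypothetical, and a real probe would use vector-valued inputs and a held-out split.

```python
from statistics import mean


def fit_linear_probe(latents, actions):
    """Least-squares fit of action ≈ w * latent + b for scalar values."""
    z_bar, a_bar = mean(latents), mean(actions)
    var = sum((z - z_bar) ** 2 for z in latents)
    cov = sum((z - z_bar) * (a - a_bar) for z, a in zip(latents, actions))
    w = cov / var
    return w, a_bar - w * z_bar


def probe_mse(latents, actions, w, b):
    """Mean squared error of the probe; lower means more action information."""
    return mean((w * z + b - a) ** 2 for z, a in zip(latents, actions))


# Made-up data: the latents are an exact linear function of the true actions,
# so a linear probe recovers them with zero error.
latents = [0.0, 1.0, 2.0, 3.0]
actions = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear_probe(latents, actions)
error = probe_mse(latents, actions, w, b)
```

If the latents mostly encode distractor motion instead, no linear map fits the true actions well and the probe error stays high, which is what makes it a useful quality signal.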

Figure 2: Visualization of how task-relevant promptable representations are extracted from the VLMs and then used as targets during latent action learning.

This review was generated by AI and reviewed by a human editor.