QUICK REVIEW

[Paper Review] Vision-Language Models Unlock Task-Centric Latent Actions

Alexander Nikulin, Ilya Zisman|arXiv (Cornell University)|Jan 30, 2026

Multimodal Machine Learning Applications|Citations: 0

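One-Line Summary

This paper uses promptable representations from Vision-Language Models (VLMs) to filter out distractors and improve latent action learning, raising downstream success rates on Distracting MetaWorld by up to six-fold without supervision.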

ABSTRACT

Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.

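Motivation and Objectives

Address the failure of latent action learning under action-correlated distractors in offline imitation learning from observation.
Use promptable VLM representations, obtained without supervision, as targets for latent action models to separate controllable changes from noise.
Benchmark a wide range of VLMs to assess the quality of promptable representations, their robustness, and the effect of language conditioning.
Show that promptable representations can substantially improve latent action quality and downstream performance without ground-truth action supervision.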

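Proposed Method

Obtain observation embeddings from a VLM using a task-specific prompt and a simple pooling strategy; these embeddings define the promptable representations.
Use these representations as targets for the Forward Dynamics Model (FDM) in the Latent Action Model (LAM), avoiding action quantization.
Benchmark multiple VLMs through 29k+ experiments across MT10, evaluating representation quality and robustness to prompts and hyperparameters.
Evaluate latent action quality with a linear probe that predicts ground-truth actions from latent actions, and measure downstream success after fine-tuning on labeled data.
Run a controlled Distracting MetaWorld setting, in which distractor videos are added, and compare against the original LAPO baseline.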
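
The training objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`mean_pool`, `linear`, `lam_loss`), the use of mean pooling, the bias-free linear layers standing in for neural networks, and the mean squared error loss are all assumptions made for the example. The point it shows is that the FDM is trained to predict a VLM-derived target rather than raw pixels, and that the latent action stays continuous (no quantization).

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(token_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token hidden states into one observation embedding.

    Assumption: mean pooling stands in for the "simple pooling strategy".
    """
    n_tokens = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n_tokens for d in range(dim)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Bias-free linear layer, a hypothetical stand-in for a neural network."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def lam_loss(
    w_idm: Matrix,
    w_fdm: Matrix,
    obs_t: Vector,
    obs_next: Vector,
    target_next: Vector,
) -> float:
    """Loss for one transition, with the promptable representation as target."""
    # IDM: infer a continuous latent action from two consecutive observations.
    latent_action = linear(w_idm, obs_t + obs_next)
    # FDM: predict the next promptable representation from obs_t and the action.
    predicted = linear(w_fdm, obs_t + latent_action)
    # Mean squared error against the VLM-derived target (assumed loss).
    return sum((p - t) ** 2 for p, t in zip(predicted, target_next)) / len(target_next)
```

In this sketch, `target_next` would be `mean_pool(...)` applied to the VLM hidden states for the next frame under a task-specific prompt. Because the target no longer contains distractor pixels, the IDM has less incentive to encode them in the latent action.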

Figure 1: Main result. Success rate on the MetaWorld-10 benchmark for LAPO and the proposed LAPO+VLM (Molmo), which uses promptable representations. We use three random seeds and report IQM and $95\%$ CI based on stratified bootstrapping, following Agarwal et al. (2021). See Section 7 for full res
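
For readers unfamiliar with the statistics named in the caption, the following is a rough sketch of an interquartile mean (IQM) with a stratified percentile bootstrap. It is an illustration only, not the evaluation code used in the paper: the function names, the number of resamples, and the percentile-interval construction are assumptions. "Stratified" here means that runs are resampled with replacement within each task separately before the scores are pooled.

```python
import random
from typing import Dict, List, Tuple


def iqm(scores: List[float]) -> float:
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(
    scores_by_task: Dict[str, List[float]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for the IQM over all tasks and runs."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample: List[float] = []
        # Resample runs (seeds) with replacement inside each task separately.
        for runs in scores_by_task.values():
            sample.extend(rng.choices(runs, k=len(runs)))
        stats.append(iqm(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With three seeds per task, as in the caption, each value of `scores_by_task` would hold three success rates. The IQM is less sensitive to outlier runs than the plain mean, which is the usual reason for preferring it with few seeds.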

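Experimental Results

Research Questions

RQ1: Can promptable representations from Vision-Language Models separate controllable changes from distractor noise for latent action learning?
RQ2: Which VLMs and prompting strategies yield the best latent actions and downstream policy performance under distractors?
RQ3: Does prompting with language conditioning outperform self-supervised baselines (e.g., CLIP, DINOv2) as LAM targets?
RQ4: How does the chosen latent action dimensionality affect the effectiveness of VLM-guided targets?
RQ5: How well do improvements found on a small MT10 subset benchmark transfer to the full dataset?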

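Key Findings

Promptable representations yield marked improvements over LAPO, and Molmo is the most robust across hyperparameters.
VLMs with language conditioning and task-focused prompts substantially improve latent action quality and reduce the influence of distractors.
Under distractors, LAPO+VLM with promptable representations improves downstream success rates by up to six-fold.
Embedding-only VLMs (e.g., CLIP-based) do not outperform promptable VLMs, indicating that language conditioning matters for performance.
On the full MT10 data, LAPO+Molmo and related VLMs narrow the gap to distractor-free performance, and reducing the latent action dimensionality improves results further.
Promptable representations may outperform baselines such as OTTER and UniVLA in the distractor setting.
Molmo's advantage is attributed to data quality, although results still differ for other backbones trained on the same data.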

Figure 2: Visualization of the task-relevant promptable representations extraction from the VLMs and their subsequent use as targets during latent action learning.


This review was generated by AI and checked by a human editor.