QUICK REVIEW

[Paper Digest] VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu, Jie Lu|arXiv (Cornell University)|Feb 7, 2026

Multimodal Machine Learning Applications | Cited by 0

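One-Sentence Summary

VGAS reframes few-shot Vision-Language-Action adaptation as a generate-then-select problem: a Transformer-based critic, Q-Chunk-Former, trained with Explicit Geometric Regularization, ranks candidate action chunks by long-horizon success and geometric feasibility, which improves robustness.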

ABSTRACT

Vision--Language--Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a \emph{generation--selection} perspective and propose a novel framework \textbf{VGAS} (\textbf{V}alue-\textbf{G}uided \textbf{A}ction-chunk \textbf{S}election). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, \textbf{VGAS} employs a finetuned VLA as a high-recall proposal generator and introduces the \textrm{Q-Chunk-Former}, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose \textit{Explicit Geometric Regularization} (\texttt{EGR}), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that \textbf{VGAS} consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.

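Research Motivation and Objectives

Make few-shot adaptation of Vision-Language-Action (VLA) policies robust when demonstrations are scarce.
Replace end-to-end likelihood-based generation with a generate-then-select paradigm driven by a value-based critic.
Develop a geometrically grounded critic (Q-Chunk-Former) that preserves fine-grained geometric cues.
Propose Explicit Geometric Regularization (EGR) to maintain high ranking resolution under scarce supervision and distribution shift.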

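Proposed Method

Propose VGAS, a generate-then-select framework that pairs a high-recall base policy $\pi_\mu$ with a high-precision critic $Q_\theta$.
Introduce Q-Chunk-Former with a State-Action Fusion (SAF) module, which grounds each action chunk in the proprioceptive state before multimodal fusion.
For temporal consistency, adopt a chunked Expected-Max backup ($T_\mu^N$) aligned with Best-of-$N$ selection.
Add Explicit Geometric Regularization (EGR), comprising geometric anchoring and geometric ranking, to preserve ranking resolution and calibrate the value landscape.
Train with the chunked TD loss combined with EGR ($L_{TD} + L_{EGR}$), using a target network for stability.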
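
The generate-then-select loop and the chunked backup listed above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_best_chunk`, `expected_max_target`), the toy critic, the toy sampler, and the reward layout are all assumptions made for illustration, and the backup shown is a single-sample Monte Carlo estimate of an expected-max target rather than the exact operator defined in the paper.

```python
import numpy as np

def select_best_chunk(q_fn, obs, candidates):
    """Best-of-N selection: score every candidate chunk, return the top one."""
    scores = np.array([q_fn(obs, chunk) for chunk in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

def expected_max_target(q_target_fn, sample_chunks, next_obs,
                        rewards, gamma, n, done):
    """Chunked TD target (hypothetical form).

    Sums the discounted rewards collected inside one chunk, then bootstraps
    with the max target-critic value over n proposals drawn at the next
    observation. One draw of n proposals is a Monte Carlo estimate of the
    expectation over proposal sets.
    """
    rewards = np.asarray(rewards, dtype=float)
    h = len(rewards)                                   # chunk length
    chunk_return = float(np.dot(gamma ** np.arange(h), rewards))
    if done:
        return chunk_return                            # no bootstrap at episode end
    proposals = sample_chunks(next_obs, n)
    bootstrap = max(q_target_fn(next_obs, c) for c in proposals)
    return chunk_return + gamma ** h * bootstrap

# Toy stand-ins (assumptions): a critic that prefers chunks near a fixed
# goal, and a Gaussian sampler in place of the fine-tuned VLA policy.
rng = np.random.default_rng(0)
goal = np.zeros((4, 2))                                # 4 steps, 2-D actions

def toy_q(obs, chunk):
    return -float(np.linalg.norm(chunk - goal))

def toy_sampler(obs, n):
    return [rng.normal(size=(4, 2)) for _ in range(n)]

candidates = toy_sampler(None, 8)
best_chunk, scores = select_best_chunk(toy_q, None, candidates)
target = expected_max_target(toy_q, toy_sampler, None,
                             rewards=[0.0, 0.0, 0.0, 1.0],
                             gamma=0.99, n=8, done=False)
```

Using the same `n` in the backup and at inference keeps the training target consistent with the policy that Best-of-N selection actually executes, which appears to be the motivation for aligning the two.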

Figure 1: Illustration of the near-miss action distribution under 5-shot VLA fine-tuning.

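Experimental Results

Research Questions

RQ1: Which critic architecture can ground high-dimensional VLA observations into precise value estimates for temporally extended action chunks?
RQ2: How can a value function be trained from demonstration data so that it maintains high ranking resolution under scarce supervision and distribution shift?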

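Key Findings

VGAS outperforms SFT and conventional offline RL baselines on the LIBERO benchmark, and performs particularly well under distribution shift.
Ablations show that Explicit Geometric Regularization (EGR) contributes the largest improvement, while the temporal-consistency (TD) term aids stability.
The Transformer-based Q-Chunk-Former with SAF outperforms an MLP-based critic, which highlights the need for fine-grained multimodal geometric grounding.
EGR prevents the value landscape from collapsing, preserving the near-miss discrimination that Best-of-N selection depends on.
Temporal consistency, enforced through the chunked TD objective, is necessary for stable long-horizon value estimates.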
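
To make the point about value-landscape collapse concrete, the sketch below uses a generic pairwise hinge ranking loss as a stand-in. It is not the paper's EGR objective, whose exact form is not given in this digest. The function name `pairwise_ranking_loss`, the `margin` parameter, and the use of distance to a demonstrated chunk as the ordering signal are all assumptions.

```python
import numpy as np

def pairwise_ranking_loss(q_values, distances, margin=0.1):
    """Hinge ranking loss over candidate pairs (hypothetical stand-in for a
    geometric ranking term).

    For every pair where candidate i is closer to the demonstrated chunk
    than candidate j, the critic is penalized unless q_i exceeds q_j by at
    least `margin`.
    """
    q = np.asarray(q_values, dtype=float)
    d = np.asarray(distances, dtype=float)
    closer = d[:, None] < d[None, :]        # closer[i, j]: i is nearer than j
    gap = q[:, None] - q[None, :]           # gap[i, j] = q_i - q_j
    hinge = np.maximum(0.0, margin - gap)
    n_pairs = closer.sum()
    return float((hinge * closer).sum() / n_pairs) if n_pairs else 0.0

# Three candidates at increasing distance from the demonstrated chunk.
distances = [0.0, 0.1, 0.5]

collapsed = pairwise_ranking_loss([1.0, 1.0, 1.0], distances)   # flat values
separated = pairwise_ranking_loss([1.0, 0.5, 0.0], distances)   # ordered values
```

A critic that assigns the same value to every candidate cannot rank near-misses, so it pays the full margin on every pair, while a critic whose values follow the distance ordering pays nothing. Best-of-N selection relies on exactly this kind of separation between close candidates.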

Figure 2: The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes $N$ candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function $Q$ via the EGR + TD objective. Best-of-$N$ selection defines the induced policy $\pi_{\mu,Q}^{(N)}$ by max

This digest was generated by AI and reviewed by human editors.