QUICK REVIEW

[Paper Review] VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu, Jie Lu|arXiv (Cornell University)|Feb 7, 2026
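Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

VGAS reframes few-shot Vision-Language-Action adaptation as a generate-then-select problem: a Transformer-based Q-Chunk-Former critic, trained with Explicit Geometric Regularization, ranks action chunks by long-horizon success and geometric feasibility, which improves robustness.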

ABSTRACT

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.

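Motivation and Objectives

  • Enable robust few-shot adaptation of Vision-Language-Action (VLA) policies when demonstrations are scarce.
  • Shift from end-to-end likelihood-based generation to a generate-then-select paradigm that uses a value-based critic.
  • Develop a geometrically interpretable critic (Q-Chunk-Former) that preserves fine-grained geometric cues.
  • Propose Explicit Geometric Regularization (EGR) to maintain high ranking resolution under scarce supervision and distribution shift.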

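Proposed Method

  • Proposes VGAS: generate-then-select with a high-recall base policy $\pi_\mu$ and a high-precision Q-critic $Q_\theta$.
  • Introduces the Q-Chunk-Former with a State-Action Fusion (SAF) module, which anchors action chunks to proprioception before multimodal fusion.
  • Adopts a chunked Expected-Max backup ($T_\mu^N$), aligned with Best-of-N selection, for temporal consistency.
  • Adds Explicit Geometric Regularization (EGR), comprising geometric anchoring and geometric ranking, to preserve ranking resolution and calibrate the value landscape.
  • Trains with a combination of the chunked TD loss and EGR ($L_{\mathrm{TD}} + L_{\mathrm{EGR}}$), using a target network for stability.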
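
The generate-then-select loop and the chunked expected-max backup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `best_of_n`, `chunked_expected_max_target`, `toy_propose`, and `toy_critic` are hypothetical, and the target form (discounted in-chunk rewards plus a discounted maximum over N sampled next chunks) is an assumed reading of "Expected-Max backup aligned with Best-of-N selection". The toy proposer and critic stand in for the fine-tuned VLA and the Q-Chunk-Former.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Chunk = Tuple[float, ...]  # an action chunk flattened to floats (illustrative type)
Proposer = Callable[[State], Chunk]
Critic = Callable[[State, Chunk], float]


def best_of_n(state: State, propose: Proposer, q_value: Critic, n: int) -> Tuple[Chunk, float]:
    """Sample n candidate chunks from the proposer and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[Chunk] = [propose(state) for _ in range(n)]
    scores = [q_value(state, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


def chunked_expected_max_target(
    rewards: Sequence[float],
    next_state: State,
    done: bool,
    propose: Proposer,
    q_target: Critic,
    n: int,
    gamma: float = 0.99,
) -> float:
    """One-sample estimate of an assumed chunked expected-max TD target.

    target = sum_k gamma^k * r_k + gamma^H * max_i Q_target(s', a'_i),
    where H = len(rewards) and a'_1..a'_N are sampled from the proposer at s'.
    """
    horizon = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return discounted  # no bootstrap after a terminal chunk
    _, bootstrap = best_of_n(next_state, propose, q_target, n)
    return discounted + (gamma ** horizon) * bootstrap


# Toy stand-ins (assumptions): the proposer jitters around the state,
# and the critic prefers chunks close to the state.
def toy_propose(state: State) -> Chunk:
    return tuple(random.gauss(s, 0.1) for s in state)


def toy_critic(state: State, chunk: Chunk) -> float:
    return -sum((a - s) ** 2 for a, s in zip(chunk, state))


random.seed(0)
chunk, score = best_of_n([0.2, -0.1], toy_propose, toy_critic, n=8)
target = chunked_expected_max_target(
    [0.0, 0.0, 1.0], [0.3, 0.0], False, toy_propose, toy_critic, n=8
)
print(chunk, score, target)
```

Using the same `best_of_n` routine for both inference and the bootstrap term is the point of the sketch: the critic is then trained on the value of the policy that selection actually induces, rather than on the value of the raw proposal policy.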
Figure 1: Illustration of the near-miss action distribution under 5-shot VLA fine-tuning.

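Experimental Results

Research Questions

  • RQ1: Which critic architecture can ground high-dimensional VLA observations into precise value estimates for temporally extended action chunks?
  • RQ2: How can a value function be trained from demonstration data so that it maintains high ranking resolution under scarce supervision and distribution shift?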

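Key Findings

  • VGAS outperforms SFT and conventional offline RL baselines on the LIBERO benchmark, with especially strong results under distribution shift.
  • Ablations show that Explicit Geometric Regularization (EGR) yields the largest gain, while temporal consistency (TD) contributes to stability.
  • The Transformer-based Q-Chunk-Former with SAF outperforms MLP-based critics, highlighting the need for fine-grained multimodal geometric grounding.
  • EGR prevents the value landscape from collapsing, preserving the near-miss discrimination that Best-of-N selection depends on.
  • Temporal consistency, enforced through the chunked TD objective, is necessary for stable long-horizon value estimation.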
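
To make "ranking resolution" concrete, the sketch below scores two hypothetical critics with a generic pairwise hinge ranking loss. This is an assumption for illustration only: the summary does not give the EGR formulation, so the loss, the `margin` parameter, and the per-candidate `errors` (for example, a distance to the demonstrated chunk) are stand-ins rather than the paper's objective.

```python
from typing import Sequence


def pairwise_ranking_loss(
    values: Sequence[float], errors: Sequence[float], margin: float = 0.1
) -> float:
    """Mean hinge loss over candidate pairs ordered by geometric error.

    For every pair (i, j) where candidate i has a smaller error than j,
    the critic is penalized unless values[i] >= values[j] + margin.
    """
    total, pairs = 0.0, 0
    for i in range(len(values)):
        for j in range(len(values)):
            if errors[i] < errors[j]:
                total += max(0.0, margin - (values[i] - values[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical geometric errors for three near-miss candidates (assumed numbers).
errors = [0.01, 0.03, 0.20]
collapsed = [0.50, 0.50, 0.50]  # flat value landscape: all candidates tie
resolved = [0.90, 0.60, 0.10]   # values decrease as error grows

print(pairwise_ranking_loss(collapsed, errors))
print(pairwise_ranking_loss(resolved, errors))
```

A collapsed critic pays the full margin on every ordered pair, and its argmax over candidates is arbitrary, whereas a critic whose values track geometric error incurs no penalty. That is the failure mode a ranking term of this kind is meant to discourage.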
Figure 2: The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes $N$ candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function $Q$ via the EGR + TD objective. Best-of-$N$ selection defines the induced policy $\pi_{\mu,Q}^{(N)}$ by max


This review was generated by AI and reviewed by a human editor.