QUICK REVIEW

[Paper Review] VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu, Jie Lu | arXiv (Cornell University) | 2026-02-07

Multimodal Machine Learning Applications | Citations: 0

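One-Line Summary

VGAS recasts few-shot Vision-Language-Action adaptation as a generate-then-select problem. It uses a Transformer-based critic, Q-Chunk-Former, together with Explicit Geometric Regularization to rank action chunks by long-horizon success likelihood and geometric plausibility, which improves robustness.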

ABSTRACT

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.

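Motivation and Objectives

Enable robust few-shot adaptation of Vision-Language-Action (VLA) policies when demonstrations are scarce.
Move from end-to-end likelihood-based generation to a generate-then-select paradigm driven by a value-based critic.
Develop a geometrically grounded critic (Q-Chunk-Former) that preserves fine-grained geometric signals.
Propose Explicit Geometric Regularization (EGR) to maintain high ranking resolution under sparse supervision and distribution shift.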

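Proposed Method

Proposes VGAS: a generate-then-select framework that pairs a high-recall base policy $\pi_\mu$ with a high-precision Q-critic $Q_\theta$.
Introduces Q-Chunk-Former, together with a State-Action Fusion (SAF) module that grounds action chunks in proprioception before multimodal fusion.
Adopts a chunked Expected-Max backup ($T_\mu^N$) so that temporal consistency is aligned with Best-of-N selection.
Adds Explicit Geometric Regularization (EGR), composed of Geometric Anchoring and Geometric Ranking, to preserve ranking resolution and calibrate the value landscape.
Trains with a combination of the chunked TD loss and EGR ($L_{TD} + L_{EGR}$), using a target network for stabilization.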
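
The generate-then-select step and the chunked Expected-Max backup listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose`, `q_value`, and `q_target` are hypothetical callables standing in for the fine-tuned VLA policy, the critic, and its target network; action chunks are flattened into plain lists of floats; and the expectation over $N$ proposals is approximated with a single sampled set.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical signatures: the review does not specify these interfaces.
Chunk = List[float]
ProposeFn = Callable[[Sequence[float]], Chunk]
QFn = Callable[[Sequence[float], Chunk], float]


def best_of_n(state: Sequence[float], propose: ProposeFn, q_value: QFn, n: int) -> Chunk:
    """Sample n candidate chunks from the proposal policy and keep the top-scoring one."""
    candidates = [propose(state) for _ in range(n)]
    return max(candidates, key=lambda chunk: q_value(state, chunk))


def expected_max_target(
    chunk_rewards: Sequence[float],
    next_state: Sequence[float],
    propose: ProposeFn,
    q_target: QFn,
    n: int,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Chunked TD target: discounted in-chunk rewards plus a bootstrapped max over n proposals.

    The max over one sampled candidate set is a one-sample estimate of the
    expectation; this is an assumption made for illustration.
    """
    horizon = len(chunk_rewards)
    chunk_return = sum((gamma ** k) * r for k, r in enumerate(chunk_rewards))
    if done:
        return chunk_return
    next_candidates = [propose(next_state) for _ in range(n)]
    bootstrap = max(q_target(next_state, c) for c in next_candidates)
    return chunk_return + (gamma ** horizon) * bootstrap


if __name__ == "__main__":
    rng = random.Random(0)
    goal = [0.3, -0.1]  # toy "correct" chunk, purely illustrative

    def toy_propose(state):
        # Noisy proposals around the goal mimic near-miss candidates.
        return [g + rng.gauss(0.0, 0.2) for g in goal]

    def toy_q(state, chunk):
        # Toy critic: higher score for chunks closer to the goal.
        return -sum((c - g) ** 2 for c, g in zip(chunk, goal))

    print(best_of_n([0.0], toy_propose, toy_q, n=8))
```

Using the same max-over-proposals operator in both the training target and the inference-time selector is what keeps the learned values consistent with how they are later used, which appears to be the point of aligning the backup with Best-of-N selection.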

Figure 1: Illustration of the near-miss action distribution under 5-shot VLA fine-tuning.

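Experimental Results

Research Questions

RQ1: What critic architecture can ground high-dimensional VLA observations into precise value estimates for temporally extended action chunks?
RQ2: How can a value function be trained from demonstration-heavy data so that it retains high ranking resolution under sparse supervision and distribution shift?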

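Key Results

VGAS outperforms SFT and standard offline RL baselines on the LIBERO benchmark, with the largest advantage under distribution shift.
Ablation studies show that Explicit Geometric Regularization (EGR) contributes the largest gain, while temporal consistency (TD) contributes to stabilization.
The Transformer-based Q-Chunk-Former with SAF outperforms an MLP-based critic, underscoring the need for fine-grained multimodal geometric grounding.
EGR prevents collapse of the value landscape and preserves the ability to discriminate near-miss candidates, which is essential for Best-of-N selection.
Temporal consistency through the chunked TD objective is needed to stabilize long-horizon value estimation.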
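
To make "value landscape collapse" concrete, the sketch below shows one generic way a distance-aware ranking term can keep near-miss candidates separated. The review does not give the EGR formulas, so this is not the paper's Geometric Ranking loss: the hinge form, the L2 distance to a demonstrated chunk, and the names `pairwise_ranking_loss` and `margin_scale` are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flattened action chunks."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_ranking_loss(
    q_values: Sequence[float],
    candidates: Sequence[Sequence[float]],
    expert_chunk: Sequence[float],
    margin_scale: float = 1.0,
) -> float:
    """Hinge loss asking that candidates closer to the expert chunk score higher.

    For every ordered pair (i, j) where candidate i is closer to the expert
    chunk than candidate j, the critic is penalized unless q_i exceeds q_j by
    a margin proportional to the distance gap. Illustrative assumption only.
    """
    dists: List[float] = [l2_distance(c, expert_chunk) for c in candidates]
    total, pairs = 0.0, 0
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            if dists[i] < dists[j]:
                margin = margin_scale * (dists[j] - dists[i])
                total += max(0.0, margin - (q_values[i] - q_values[j]))
                pairs += 1
    return total / pairs if pairs else 0.0
```

Under this toy loss, a critic that assigns the same value to every candidate (a collapsed landscape) is penalized, while a critic whose scores fall off with distance from the demonstration incurs no loss. A Best-of-N selector needs exactly that separation, because `max` over identical scores carries no information.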

Figure 2: The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes $N$ candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function $Q$ via the EGR + TD objective. Best-of-$N$ selection defines the induced policy $\pi_{\mu,Q}^{(N)}$ by max


This review was generated by AI and reviewed by a human editor.