QUICK REVIEW

[Paper Review] Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

William Chen, Jagdeep Singh Bhatia | arXiv (Cornell University) | 2026-02-13

Multimodal Machine Learning Applications | Citations: 0

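One-Line Summary

This paper introduces Steerable Policies, a family of low-level VLAs that accept diverse steering commands spanning multiple levels of abstraction (tasks, subtasks, motions, gripper traces, and points), and shows that a high-level embodied reasoner or an in-context learning VLM can control them to improve generalization and long-horizon robot task performance.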

ABSTRACT

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io

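Motivation and Objectives

Motivates and defines steerability as a key bottleneck in grounding VLM knowledge in robot policies.
Develops Steerable Policies (VLAs) that accept commands at multiple levels of abstraction to steer robot behavior.
Shows how a high-level embodied reasoner and an in-context learning VLM can control Steerable Policies to improve generalization and long-horizon task performance.
Demonstrates a scalable method for generating synthetic steering commands for training these policies.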

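Proposed Method

Trains Steerable Policies to follow a broad range of steering commands, including task-level, subtask-level, atomic motion, gripper trace, and point commands, as well as combinations of these.
Automatically generates steering commands at scale from robot trajectories, using a pipeline in which foundation models extract grounded features and subtasks and a VLM is then queried to generate diverse commands.
Integrates Steerable Policies with two high-level VLM control methods: (i) a fine-tuned embodied reasoner that generates reasoning and steering commands; (ii) an in-context learning VLM that selects which command abstraction to use to steer the policy.
Evaluates on real-world Bridge WidowX manipulation tasks across in-distribution, motion, spatial, and semantic generalization axes, and also explores long-horizon tasks.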
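To make the idea of commands at several abstraction levels concrete, here is a minimal Python sketch of how such commands might be represented and serialized into a single text prompt. This is not the authors' implementation: the `Abstraction` enum, the `SteeringCommand` class, the `render` and `combine` helpers, the "; " separator, and the example command strings are all hypothetical names and formats chosen for illustration. Only the list of abstraction levels comes from the review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Abstraction(Enum):
    """Abstraction levels named in the review; the enum itself is hypothetical."""
    TASK = "task"
    SUBTASK = "subtask"
    MOTION = "motion"
    GRIPPER_TRACE = "gripper_trace"
    POINT = "point"


@dataclass(frozen=True)
class SteeringCommand:
    """One steering command (hypothetical container, not the paper's format)."""
    abstraction: Abstraction
    text: str
    # Optional pixel coordinates (x, y) for grounded commands such as
    # points or gripper traces; empty for purely linguistic commands.
    pixels: Tuple[Tuple[int, int], ...] = ()

    def render(self) -> str:
        """Serialize the command into a single text string."""
        coords = " ".join(f"({x}, {y})" for x, y in self.pixels)
        return f"{self.text} {coords}".strip()


def combine(commands: Sequence[SteeringCommand]) -> str:
    """Join several commands into one prompt (assumed '; ' separator)."""
    return "; ".join(c.render() for c in commands)


# Illustrative commands only; the wording and coordinates are made up.
subtask = SteeringCommand(Abstraction.SUBTASK, "grasp the carrot")
point = SteeringCommand(Abstraction.POINT, "move the gripper to", ((112, 87),))
prompt = combine([subtask, point])
print(prompt)  # grasp the carrot; move the gripper to (112, 87)
```

Keeping every abstraction behind one text-rendering interface reflects the premise that a single language-conditioned policy consumes all command styles; the actual serialization used in the paper may differ.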

Figure 0 : The hierarchical policy inference loop, where a high-level model sends commands to the low-level Steerable Policy.
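The hierarchical inference loop can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's code: the function `run_hierarchy`, its callable parameters, and in particular the fixed `replan_every` schedule are hypothetical, since the review does not say how often the high-level model is queried. The stub models and the example task string exist only to make the sketch runnable.

```python
from typing import Any, Callable, List


def run_hierarchy(
    task: str,
    get_observation: Callable[[], Any],
    high_level: Callable[[str, Any], str],
    low_level: Callable[[Any, str], Any],
    apply_action: Callable[[Any], None],
    max_steps: int = 100,
    replan_every: int = 10,  # assumption: fixed re-query interval
) -> List[str]:
    """Run a two-level control loop and return the commands issued."""
    issued: List[str] = []
    command = ""
    for step in range(max_steps):
        obs = get_observation()
        # The high-level model periodically emits a new steering command.
        if step % replan_every == 0:
            command = high_level(task, obs)
            issued.append(command)
        # The low-level policy maps (observation, command) to an action.
        apply_action(low_level(obs, command))
    return issued


# Stub components standing in for a camera, a VLM, a VLA, and a robot.
executed: List[Any] = []
commands = run_hierarchy(
    task="put the carrot on the plate",
    get_observation=lambda: {"image": None},
    high_level=lambda task, obs: f"subtask for: {task}",
    low_level=lambda obs, command: ("noop", command),
    apply_action=executed.append,
    max_steps=25,
    replan_every=10,
)
print(len(commands), len(executed))  # 3 25
```

Passing the two models in as plain callables keeps the loop agnostic to whether the high-level controller is a fine-tuned embodied reasoner or an off-the-shelf VLM used with in-context learning.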

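Experimental Results

Research Questions

RQ1: Can steering commands spanning diverse abstractions induce compositional and generalizable behaviors in Steerable Policies?
RQ2: How can a high-level embodied reasoning model use its training data to generalize when controlling Steerable Policies?
RQ3: Can an off-the-shelf VLM use in-context learning to select steering abstractions that improve long-horizon robot tasks?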

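Key Results

A human oracle issuing unrestricted steering commands accomplishes nearly every task (roughly 100% success on Bridge tasks).
No single steering style is universally best; a diverse spectrum of abstractions offers complementary strengths and improved performance.
The fine-tuned embodied reasoner paired with Steerable Policies outperforms baselines including OpenVLA and ECoT variants, particularly on motion and semantic generalization.
An off-the-shelf VLM with in-context reasoning can select abstractions effectively, outperforming a SayCan-like baseline and standard OpenVLA.
In-context learning enables corrective steering and dynamic abstraction selection based on scene understanding and task progress.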

Figure 1 : Our automated pipeline for annotating robot data with synthetic steering commands at scale. 1: We use a suite of foundation models to extract subtasks and grounded features (bounding boxes, motions, and gripper traces) from each trajectory. 2: We query a VLM to generate diverse steering c


This review was generated by AI and reviewed by a human editor.