QUICK REVIEW

[Paper Review] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Xunlan Zhou, Xuanlin Chen | arXiv (Cornell University) | 2026-01-28

Robot Manipulation and Learning | Citations: 0

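One-Line Summary

MARVL fine-tunes a vision-language model and introduces multi-stage task decomposition, combined with task direction projection and confidence-thresholded shaping, to provide reliable, progress-aware rewards for robotic manipulation in sparse-reward environments.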

ABSTRACT

Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL: Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

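Motivation and Objectives

Identify the limitations of existing VLM-based rewards in robotic manipulation: spatial grounding, progress awareness, and semantic alignment.
Propose a plug-and-play framework that improves VLM rewards through targeted fine-tuning and reward structuring.
Improve sample efficiency and robustness across diverse Meta-World manipulation tasks, and demonstrate cross-domain transfer to Panda-Gym.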

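Proposed Method

Fine-tune the VLM with Scene-View Decomposition to separate scene semantics from viewpoint noise.
Apply Task Direction Projection (TDP) and multi-stage decomposition to align the progress signal with subtask goals.
Introduce Confidence-Thresholded Shaping (CTS), which gates the VLM reward on semantic confidence to reduce noise.
Compute rewards from embeddings using the cosine similarity between the projected start/goal embeddings and the current observation.
Manage stage transitions automatically using a similarity threshold on intermediate goals.
Demonstrate that MARVL is compatible with different RL backbones (SAC and TD3).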
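
The reward pipeline summarized above (direction projection, confidence gating, threshold-based stage switching) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: this review does not give the paper's exact formulas, so the names `direction_progress` and `StagedReward`, the threshold values, the `confidence` input, and the way stage index and progress are combined into one reward are all hypothetical. Embeddings are assumed to be plain 1-D NumPy vectors produced by some VLM encoder.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero


def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + EPS))


def direction_progress(obs, start, goal):
    """Scalar projection of (obs - start) onto the start->goal direction.

    Roughly 0 at the start embedding and 1 at the goal embedding; motion
    orthogonal to the task direction does not change the value.
    """
    d = goal - start
    return float(np.dot(obs - start, d) / (np.dot(d, d) + EPS))


class StagedReward:
    """Hypothetical multi-stage reward (an assumption, not the paper's code)."""

    def __init__(self, stage_embs, conf_threshold=0.5, switch_threshold=0.9):
        # stage_embs: list of (start_emb, goal_emb) pairs, one per subtask.
        self.stage_embs = stage_embs
        self.conf_threshold = conf_threshold      # assumed gating threshold
        self.switch_threshold = switch_threshold  # assumed stage-switch threshold
        self.stage = 0

    def __call__(self, obs_emb, confidence):
        # Confidence gate: emit no shaping reward when the (assumed)
        # semantic confidence score falls below the threshold.
        if confidence < self.conf_threshold:
            return 0.0
        start, goal = self.stage_embs[self.stage]
        progress = min(max(direction_progress(obs_emb, start, goal), 0.0), 1.0)
        # Completed stages plus within-stage progress, scaled to [0, 1].
        reward = (self.stage + progress) / len(self.stage_embs)
        # Advance once the observation is close enough to the stage goal.
        last_stage = len(self.stage_embs) - 1
        if cosine(obs_emb, goal) >= self.switch_threshold and self.stage < last_stage:
            self.stage += 1
        return reward


# Toy usage with 2-D embeddings and two stages.
stages = [
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), np.array([-1.0, 0.0])),
]
reward_fn = StagedReward(stages)
r = reward_fn(np.array([0.5, 0.5]), confidence=0.8)
```

In this sketch, adding the stage index to the clipped within-stage progress keeps the reward from dropping when a stage switches, which is one simple way to obtain the kind of monotonic progress signal the review describes.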

Figure 1: Radar plot of performance across eight Meta-World manipulation tasks. MARVL achieves consistently strong and balanced performance across all skill categories, surpassing the Oracle reward on several tasks and outperforming prior VLM-based reward methods.

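Experimental Results

Research Questions

RQ1: How does MARVL perform against existing VLM-reward baselines on the Meta-World benchmark?
RQ2: How effective are MARVL's individual components (Scene-View Decomposition, TDP, CTS), and when does each contribute most?
RQ3: Does MARVL generalize across different RL backbones and camera configurations?
RQ4: Can MARVL transfer to Panda-Gym without target-domain adaptation?
RQ5: Do multi-stage, direction-projected rewards provide a stable, monotonic progress signal for learning?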

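Key Results

MARVL consistently outperforms prior VLM-based rewards across eight Meta-World tasks.
MARVL matches or exceeds the Oracle dense reward on several tasks, such as Button Press and Window Close.
Scene-View Decomposition improves spatial grounding and stabilizes embeddings.
Task Direction Projection and Multi-Stage Decomposition improve sample efficiency and convergence speed.
Confidence-Thresholded Shaping reduces noise, guards against reward hacking, and improves stability.
MARVL generalizes across camera views and RL backbones (from SAC to TD3), and transfers to Panda-Gym without domain adaptation.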

Figure 2: Reward Misalignment in VLM-Based Methods. Left: VLM reward signals along an oracle button-press-topdown trajectory. The green dashed curve denotes the environment-provided dense reward in Meta-World, whose scale differs from VLM rewards and is shown only to indicate the overall trend of ta


This review was generated by AI and reviewed by a human editor.