QUICK REVIEW

[Paper Review] Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

Ruixing Jin, Zicheng Zhu | arXiv (Cornell University) | 2026. 03. 24.

Robot Manipulation and Learning | Citations: 0

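One-Line Summary

This paper empirically studies zero-shot Sim2Real transfer of Vision-Language-Action models in dexterous manipulation, isolating the effects of domain randomization, rendering fidelity, and RL fine-tuning through a unified RoboTwin-based benchmark and real-world evaluation.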

ABSTRACT

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

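Motivation and Objectives

Understand the key drivers of zero-shot Sim2Real transfer for VLA models in dexterous manipulation.
Quantify how much the domain randomization factors, rendering fidelity, and reinforcement learning fine-tuning each contribute to real-world performance.
Provide a standardized evaluation protocol and open-access benchmarking resources to support reproducibility.
Assess how simulation design choices map to real-world robustness across a variety of manipulation tasks.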

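Proposed Method

Conduct a disentangled (factor-by-factor) empirical study of zero-shot transfer for VLA models trained in simulation and deployed on real hardware.
Use RoboTwin 2.0 for unified simulation-based evaluation with controlled distribution shifts.
Decompose domain randomization into five factors (Background, Table Distractor, Camera Pose, Lighting, Table Height) and compare episode-level vs. frame-level granularity.
Systematically vary rendering fidelity (photorealism) and physics realism to measure their effect on transfer.
Investigate reinforcement learning fine-tuning (GRPO) on top of supervised fine-tuning (SFT), including variants with and without domain randomization during RL.
Evaluate with more than 10k real-world trials on five dexterous dual-arm tasks under a controlled real-world evaluation protocol.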
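
The episode-level versus frame-level comparison described above can be illustrated with a minimal Python sketch. The five factor names come from the study; the parameter ranges, the `SceneParams` container, and the `sample_scene` and `rollout_params` helpers are hypothetical and are not taken from RoboTwin 2.0 or the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """One draw of the five randomized factors (all ranges are hypothetical)."""
    background: int          # index of a background texture
    table_distractor: int    # number of distractor objects on the table
    camera_pose: float       # camera offset in meters
    lighting: float          # light intensity multiplier
    table_height: float      # table height in meters


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw every factor independently from an assumed range."""
    return SceneParams(
        background=rng.randrange(100),
        table_distractor=rng.randrange(5),
        camera_pose=rng.uniform(-0.05, 0.05),
        lighting=rng.uniform(0.5, 1.5),
        table_height=rng.uniform(0.70, 0.80),
    )


def rollout_params(num_frames: int, granularity: str, seed: int = 0) -> list:
    """Return the scene parameters used at each frame of one rollout."""
    rng = random.Random(seed)
    if granularity == "episode":
        # Sample once and hold the draw fixed for the whole episode.
        params = sample_scene(rng)
        return [params] * num_frames
    if granularity == "frame":
        # Resample all factors at every frame.
        return [sample_scene(rng) for _ in range(num_frames)]
    raise ValueError(f"unknown granularity: {granularity!r}")
```

With `granularity="episode"` every frame of a rollout shares one draw, whereas `granularity="frame"` redraws all factors at each step. How the study applies per-frame resampling to each individual factor is not specified in this summary.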

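Experimental Results

Research Questions

RQ1: What are the main determinants of zero-shot Sim2Real generalization for VLA models in dexterous manipulation?
RQ2: How do the individual domain randomization factors, their temporal granularity, rendering fidelity, and RL fine-tuning each contribute to Sim2Real transfer performance?
RQ3: Does higher photorealism or physics fidelity consistently lead to better real-world transfer, and if so, by how much?
RQ4: Can RL fine-tuning compensate for the Sim2Real gap, and does combining RL with domain randomization yield additional gains?
RQ5: How do spatial variations compare with appearance variations in improving the real-world robustness of VLA policies?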

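Key Findings

Spatial features (camera pose, table height) yield larger improvements in Sim2Real generalization than appearance perturbations (background, lighting).
Frame-level randomization generally yields larger gains than episode-level randomization, especially for spatial factors and background textures.
Higher rendering fidelity (photorealism) improves real-world success rates, with diminishing returns beyond a moderate level; physics fidelity provides a smaller but still positive benefit.
Reinforcement learning fine-tuning substantially improves robustness and gives larger gains when combined with domain randomization (e.g., a real-world success rate of about 42.8% with DR applied during RL).
RL alone improves over SFT even in clean simulation and approaches the performance of DR, while RL combined with DR achieves the strongest Sim2Real gains (e.g., 70.8% Sim-OOD and 42.8% Real in the reported results).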
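
GRPO, the RL method named in this review, scores each rollout relative to other rollouts sampled for the same task rather than against a learned value function. A minimal sketch of that group-relative advantage follows, assuming a binary success reward and population standard deviation; the function name and the `eps` constant are illustrative, and the exact variant used in the study may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own rollout group.

    `rewards` holds one scalar per rollout of the same task instance.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with a binary success reward (illustrative values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, while a group in which every rollout has the same reward contributes no learning signal because all advantages are zero.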

This review was generated by AI and reviewed by a human editor.