[Paper Review] CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning
tldr: CRAFT introduces a force-aware curriculum fine-tuning framework using a variational information bottleneck to prioritize force signals when adapting Vision–Language–Action models to contact-rich manipulation, improving generalization and performance across architectures.
Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.
Research Motivation and Objectives
- Address the gap in VLA models for contact-rich manipulation, where force/tactile signals are undervalued relative to vision and language inputs.
- Propose a lightweight, model-agnostic framework that prioritizes force information during early training via a variational information bottleneck (VIB).
- Enable robust, generalizable force-aware manipulation by integrating torque-based proprioception and a force-aware data collection setup.
- Demonstrate cross-architecture applicability by adapting multiple VLA models and validating on real-world tasks.
Proposed Method
- Insert a variational information bottleneck after the vision–language encoder to compress high-entropy visual and language features.
- Apply a force-aware curriculum with an exponential decay schedule to gradually reintegrate visual and language information.
- Use joint torque as proprioception under impedance control to capture force-rich interaction data.
- Develop a homologous leader–follower teleoperation system to collect synchronized vision, language, and force data across tasks.
- Fine-tune pretrained VLA models (e.g., π0-base and RDT) on force-aware demonstrations with the VIB module, without modifying the encoder architectures.
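The bottleneck and curriculum steps above can be sketched in a few lines. This is a minimal illustration of one plausible reading, not the authors' implementation: it assumes the VIB predicts a diagonal Gaussian over the vision–language embedding, penalizes its KL divergence from a standard normal, and scales that penalty by a weight that decays exponentially with the training step. The names `beta_schedule`, `kl_to_standard_normal`, `sample_latent`, and `total_loss`, and the hyperparameters `beta_0`, `decay_rate`, and `beta_min`, are all hypothetical.

```python
import math
import random


def beta_schedule(step, beta_0=1.0, decay_rate=1e-3, beta_min=0.0):
    """Exponentially decaying bottleneck weight (hypothetical hyperparameters).

    A large weight early in training compresses vision/language features
    strongly; as it decays, more of that information passes through.
    """
    return beta_min + (beta_0 - beta_min) * math.exp(-decay_rate * step)


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def total_loss(action_loss, mu, log_var, step):
    """Action loss plus the KL penalty, weighted by the curriculum schedule."""
    return action_loss + beta_schedule(step) * kl_to_standard_normal(mu, log_var)


# Toy usage: the same embedding statistics are penalized less as training proceeds.
rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, 0.0]
z = sample_latent(mu, log_var, rng)  # compressed embedding passed downstream
for step in (0, 1000, 5000):
    print(step, beta_schedule(step), total_loss(1.0, mu, log_var, step))
```

In this sketch the force (joint torque) input bypasses the bottleneck entirely, which is what would let it dominate early training; whether CRAFT routes the signals exactly this way is not stated in this review.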
Experimental Results
Research Questions
- RQ1: Can force-aware curriculum fine-tuning improve VLA models on contact-rich manipulation tasks?
- RQ2: Do CRAFT-enhanced VLA models generalize to unseen objects and task variations?
- RQ3: Is the approach broadly applicable across different VLA architectures and tasks?
- RQ4: What is the contribution of the VIB module and torque-based proprioception to performance gains?
Key Findings
| Model | w/o CRAFT (%) | w/ CRAFT (%) | Δ (percentage points) |
|---|---|---|---|
| RDT | 22.66 | 48.32 | 25.66 ↑ |
| π0-base | 25.32 | 60.68 | 35.36 ↑ |
- CRAFT improves task success across five contact-rich tasks for both π0-base and RDT models.
- π0-base + CRAFT increases average success from 25.32% to 60.68% (35.36 percentage points).
- RDT + CRAFT increases average success from 22.66% to 48.32% (25.66 percentage points).
- In Wipe Whiteboard, CRAFT boosted success from 33.3% to 66.7%.
- Rolling Plasticine saw a gain of 50 percentage points with CRAFT (8.3% to 58.3%).
- Generalization to object- and task-level variants improved average success from 22.50% to 58.75% (36.25 percentage points).
- Ablations show VIB improves over base; adding torque as proprioception yields further gains.
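The gains listed above are absolute differences in success rate (percentage points), which is easy to confuse with relative improvement. The short check below recomputes the absolute gains from the reported before/after figures and prints the relative change alongside them. The helper `gains` and the dictionary labels are illustrative names, not taken from the paper.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after - before
    relative_pct = 100.0 * absolute_pp / before
    return round(absolute_pp, 2), round(relative_pct, 1)


# Success rates (%) as reported in the findings above: (without CRAFT, with CRAFT).
results = {
    "RDT (average)": (22.66, 48.32),
    "pi0-base (average)": (25.32, 60.68),
    "Generalization (average)": (22.50, 58.75),
    "Rolling Plasticine": (8.3, 58.3),
}

for name, (before, after) in results.items():
    pp, rel = gains(before, after)
    print(f"{name}: +{pp} pp absolute, +{rel}% relative")
```

Because the baselines are low, the relative improvements are much larger than the percentage-point figures, so the two should not be quoted interchangeably.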
This review was generated by AI and checked by a human editor.