QUICK REVIEW

[Paper Review] OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch | arXiv (Cornell University) | Jun 13, 2024

Semantic Web and Ontologies | Cited by 37

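One-Sentence Summary

OpenVLA is a 7B-parameter open-source vision-language-action model trained on 970k real-world robot demonstrations. It delivers strong generalist manipulation performance and supports efficient fine-tuning via LoRA and efficient serving via quantization. It outperforms the closed VLA RT-2-X on multiple tasks while being smaller and publicly available.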

ABSTRACT

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

Research Motivation and Goals

Motivate the need for open, accessible VLA models for robust robotics generalization.
Develop a 7B-parameter open VLA that leverages diverse real-world demonstrations to generalize across robots and tasks.
Demonstrate fine-tuning efficacy to new robots and tasks with data-efficient methods.
Show compute-efficient training and inference via LoRA and quantization.
Release model checkpoints, fine-tuning notebooks, and the PyTorch codebase to enable open research and replication.

Proposed Method

Adopt a Prismatic-7B VLM backbone that pairs a fused SigLIP and DINOv2 visual encoder with a Llama 2 7B language model.
Represent continuous robot actions as discrete tokens compatible with the LLM tokenizer by discretizing each action dimension into 256 bins, which are mapped to the 256 least-used tokens in the Llama vocabulary.
Train via next-token prediction on sequences that pair image observations, language instructions, and action tokens.
Curate a training set of 970k robot demonstrations from the Open X-Embodiment dataset, spanning multiple embodiments and tasks, with data mixture weights inspired by Octo.
Fine-tune the vision encoder during training (not frozen) to capture fine-grained spatial details necessary for control.
Explore efficient fine-tuning and inference techniques including LoRA, sandwich fine-tuning, and model quantization to enable consumer-GPU deployment.
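The action tokenization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the 256-bin count comes from the summary, while the quantile-based clipping bounds, the vocabulary size of 32000, the choice to place the bins in the last 256 vocabulary slots, and all function names are assumptions made for the example.

```python
import numpy as np

N_BINS = 256  # from the summary: 256 bins per action dimension


def fit_bounds(actions, lo_q=1.0, hi_q=99.0):
    """Per-dimension clipping bounds (the quantile choice is an assumption)."""
    low = np.percentile(actions, lo_q, axis=0)
    high = np.percentile(actions, hi_q, axis=0)
    return low, high


def discretize(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def undiscretize(bins, low, high, n_bins=N_BINS):
    """Map bin indices back to the continuous value at each bin center."""
    return low + (bins + 0.5) / n_bins * (high - low)


def bins_to_token_ids(bins, vocab_size=32000, n_bins=N_BINS):
    """Hypothetical mapping: bins occupy the last n_bins vocabulary slots."""
    return vocab_size - n_bins + bins


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = rng.normal(size=(1000, 7))  # stand-in for 7-dimensional robot actions
    low, high = fit_bounds(demos)
    bins = discretize(demos[0], low, high)
    print("bins:", bins)
    print("token ids:", bins_to_token_ids(bins))
    print("decoded:", undiscretize(bins, low, high))
```

With this scheme the round-trip error per dimension is at most half a bin width, which is why a fixed 256-way discretization can be precise enough for control when the clipping range is tight.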

Figure 1: OpenVLA model architecture. Given an image observation and a language instruction, the model predicts 7-dimensional robot control actions. The architecture consists of three key components: (1) a vision encoder that concatenates DINOv2 [25] and SigLIP [77] features, (2) a projector th
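The feature concatenation named in the caption can be illustrated with a toy sketch. Everything here is an assumption made for readability rather than the released architecture: the feature widths are tiny placeholders, the projector is modeled as a two-layer MLP with a GELU activation, and the function names are hypothetical.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def fuse_and_project(dino_feats, siglip_feats, w1, b1, w2, b2):
    """Concatenate per-patch features channel-wise, then project them
    into the language model's embedding width with a small MLP."""
    fused = np.concatenate([dino_feats, siglip_feats], axis=-1)
    hidden = gelu(fused @ w1 + b1)
    return hidden @ w2 + b2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, d_dino, d_siglip, d_llm = 16, 8, 12, 32  # toy sizes, not the real ones
    dino = rng.normal(size=(n_patches, d_dino))
    siglip = rng.normal(size=(n_patches, d_siglip))
    w1 = rng.normal(size=(d_dino + d_siglip, d_llm))
    b1 = np.zeros(d_llm)
    w2 = rng.normal(size=(d_llm, d_llm))
    b2 = np.zeros(d_llm)
    tokens = fuse_and_project(dino, siglip, w1, b1, w2, b2)
    print(tokens.shape)  # one embedding per image patch
```

The point of the sketch is the shape flow: both encoders must produce features for the same set of patches, so that concatenation along the channel axis yields one fused vector per patch.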

Experimental Results

Research Questions

RQ1: Can OpenVLA provide strong out-of-the-box performance across multiple robot embodiments and tasks?
RQ2: How does OpenVLA compare to prior generalist policies and to a larger closed VLA (RT-2-X) on standard benchmarks?
RQ3: How effectively can OpenVLA be fine-tuned to new robot setups with limited data?
RQ4: Are parameter-efficient fine-tuning methods (e.g., LoRA) and quantization viable for training and inference on consumer hardware without degrading performance?
RQ5: What are the trade-offs between model size, data diversity, and compute for OpenVLA-style VLAs?

Key Findings

OpenVLA (7B) outperforms the closed RT-2-X (55B) by 16.5 percentage points in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters.
Fine-tuned OpenVLA generalizes well to new settings, particularly multi-task environments with multiple objects that require strong language grounding, and outperforms the from-scratch imitation learning method Diffusion Policy by 20.4%.
LoRA fine-tuning matches full fine-tuning performance while training only 1.4% of the parameters, enabling training on consumer GPUs within 10–15 hours per task.
Quantization (including int4) enables memory-efficient inference with minimal or no loss in downstream performance.
OpenVLA demonstrates strong out-of-the-box generalist manipulation across multiple embodiments (WidowX, Google robot) and supports scalable training/inference workflows.
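The LoRA and quantization findings above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a round parameter count of 7e9 and counts weight storage only (no activations, optimizer state, or quantization overhead); the rank of 32 and the hidden size of 4096 in the per-layer example are hypothetical values chosen for illustration.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight matrix:
    a rank x d_in factor plus a d_out x rank factor."""
    return rank * (d_in + d_out)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 7e9          # "7B" taken as a round number
TRAINABLE_FRACTION = 0.014  # the 1.4% reported for LoRA

if __name__ == "__main__":
    trainable = TRAINABLE_FRACTION * TOTAL_PARAMS
    print(f"LoRA trainable parameters: about {trainable / 1e6:.0f}M")
    print("Extra parameters for one 4096 x 4096 layer at rank 32:",
          lora_extra_params(4096, 4096, 32))
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is the kind of reduction that brings a 7B model within reach of consumer GPU memory.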

Figure 2: BridgeData V2 WidowX robot evaluation tasks and results. We evaluate OpenVLA and prior state-of-the-art generalist robot policies on a comprehensive suite of tasks covering several axes of generalization, as well as tasks that specifically assess language conditioning ability. OpenVLA achi

This review was generated by AI and reviewed by human editors.