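[Paper Explained] QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization
QVLA introduces action-centric, channel-wise quantization that outperforms methods adapted from LLM/MLLM quantization and incorporates pruning (0-bit) within an overall INT8 budget.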
The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On the LIBERO benchmark, the version of OpenVLA-OFT quantized with our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
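Research Motivation and Goals
- Quantization must be tailored to embodied VLA models to avoid catastrophic task failures caused by small action deviations.
- Show that channel-wise sensitivity within a module is heterogeneous, with critical interfaces driving performance.
- Propose QVLA to align quantization with action-space fidelity and to unify quantization and pruning.
- Establish a fast sensitivity proxy and a greedy bit-demotion algorithm for channel-wise bit allocation.
- Evaluate QVLA against LLM/MLLM-derived quantization methods on OpenVLA/OpenVLA-OFT with the LIBERO benchmark.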
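Proposed Method
- Quantify channel-wise sensitivity by quantizing a single channel to bit-widths {0, 2, 4, 8, 16} and measuring the resulting action-space error.
- Define single-step Action-MSE and cumulative task accuracy as the action-space evaluation criteria.
- Use a first-order Taylor sensitivity proxy based on the Jacobian to rank channel importance efficiently.
- Use a greedy bit-demotion algorithm to allocate per-channel bit-widths under a target average budget, starting from 16 bits and progressively lowering the least sensitive channels.
- Apply channel-wise weight quantization with bit-widths assigned per output channel, use uniform-bit-width activations for stability, and adopt a row-wise weight storage scheme for hardware efficiency.
- Verify that channel-wise quantization outperforms layer-wise or uniform-bit-width schemes in action fidelity and stability, with pruning treated as 0-bit channels.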
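The measurement and allocation steps listed above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the two-layer toy policy, the symmetric uniform quantizer, the exhaustive (non-Taylor) sensitivity table, and every function name (`quantize_channel`, `channel_sensitivity`, `greedy_bit_allocation`) are assumptions chosen for illustration only. It quantizes one output channel at a time, records the resulting Action-MSE, and then greedily demotes the cheapest channel until the average bit-width meets the target.

```python
import random

BIT_CHOICES = [16, 8, 4, 2, 0]  # candidate bit-widths, high to low; 0 means pruned


def quantize_channel(weights, bits):
    """Symmetric uniform fake-quantization of one output channel (assumed scheme)."""
    if bits == 0:
        return [0.0] * len(weights)   # pruning treated as 0-bit quantization
    if bits >= 16:
        return list(weights)          # 16-bit is taken as the reference precision
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / (2 ** (bits - 1) - 1)
    return [round(w / scale) * scale for w in weights]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def policy(w1, w2, obs):
    """Toy stand-in for a VLA model: linear layer, ReLU, linear action head."""
    hidden = [max(0.0, h) for h in matvec(w1, obs)]
    return matvec(w2, hidden)


def action_mse(w1, w1_ref, w2, observations):
    """Single-step Action-MSE between perturbed and reference policies."""
    total, count = 0.0, 0
    for obs in observations:
        act, ref = policy(w1, w2, obs), policy(w1_ref, w2, obs)
        total += sum((a - r) ** 2 for a, r in zip(act, ref))
        count += len(act)
    return total / count


def channel_sensitivity(w1, w2, observations):
    """Action-MSE caused by quantizing each channel alone to each bit-width."""
    table = {}
    for c, row in enumerate(w1):
        for bits in BIT_CHOICES:
            perturbed = [list(r) for r in w1]
            perturbed[c] = quantize_channel(row, bits)
            table[(c, bits)] = action_mse(perturbed, w1, w2, observations)
    return table


def greedy_bit_allocation(table, num_channels, target_avg_bits):
    """Start at 16 bits everywhere; repeatedly demote the cheapest channel."""
    bits = [16] * num_channels
    while sum(bits) / num_channels > target_avg_bits:
        best = None
        for c in range(num_channels):
            step = BIT_CHOICES.index(bits[c]) + 1
            if step == len(BIT_CHOICES):
                continue  # channel is already pruned
            lower = BIT_CHOICES[step]
            cost = table[(c, lower)] - table[(c, bits[c])]
            if best is None or cost < best[0]:
                best = (cost, c, lower)
        if best is None:
            break  # nothing left to demote
        bits[best[1]] = best[2]
    return bits


random.seed(0)
w1 = [[random.gauss(0, 1) for _ in range(6)] for _ in range(8)]
w2 = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
observations = [[random.gauss(0, 1) for _ in range(6)] for _ in range(32)]

table = channel_sensitivity(w1, w2, observations)
bits = greedy_bit_allocation(table, len(w1), target_avg_bits=8)
print(bits, sum(bits) / len(bits))
```

The exhaustive table here costs one forward pass per channel, bit-width, and observation, which is why a cheap proxy such as the first-order Taylor estimate mentioned above matters at real model scale. The demotion cost used in the sketch (the increase in isolated Action-MSE) is one plausible choice, not necessarily the criterion used in the paper.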
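Experimental Results
Research Questions
- RQ1: How does quantization affect the action outputs of VLA models, compared with standard LLM/MLLM quantization methods?
- RQ2: Can channel-wise, action-space sensitivity be estimated effectively and used to allocate bits for real-time robot inference?
- RQ3: On OpenVLA/OpenVLA-OFT with the LIBERO benchmark, does channel-wise mixed-precision quantization with pruning outperform uniform or layer-wise quantization?
- RQ4: When QVLA is applied on resource-constrained robotic hardware, what are the trade-offs among memory footprint, speed, and task performance?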
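Key Findings
- Channel-wise quantization reveals significant intra-layer heterogeneity; the projection head and the action head are the most sensitive to quantization perturbations.
- Single-step action-space sensitivity rankings agree with long-horizon performance, as validated by cumulative metrics.
- QVLA's channel-wise bit allocation with pruning achieves high accuracy with a smaller memory footprint and faster inference, outperforming LLM/MLLM-derived methods (e.g., SmoothQuant, OmniQuant).
- On OpenVLA/OpenVLA-OFT, QVLA achieves comparable or better task performance while substantially reducing VRAM (to about 29.2% of the original) and delivering up to a 1.49x speedup; in many settings the average performance drop is close to zero.
- Under an INT8 budget, channel-wise quantization with pruning matches or exceeds FP performance across multiple configurations, whereas layer-wise quantization degrades accuracy.
- Empirical results show that, under an overall INT8 budget, channel-wise gating with pruning outperforms uniform-bit-width quantization, especially on long-horizon tasks.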
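To make the phrase "overall INT8 budget" concrete, the short sketch below compares a uniform 8-bit allocation with a mixed allocation that prunes some channels and spends the saved bits on others. The allocations and helper names are hypothetical, all channels are assumed to have the same number of weights, and the arithmetic does not reproduce the 29.2% VRAM figure reported in the paper, which depends on the full model and runtime.

```python
def average_bits(bit_widths):
    """Mean bit-width across channels (assumes equal-sized channels)."""
    return sum(bit_widths) / len(bit_widths)


def weight_memory_ratio(bit_widths, baseline_bits=16):
    """Weight memory relative to storing every channel at baseline_bits."""
    return sum(bit_widths) / (baseline_bits * len(bit_widths))


# Hypothetical allocations over eight channels
uniform = [8] * 8                      # every channel at INT8
mixed = [16, 16, 8, 8, 8, 8, 0, 0]     # two pruned channels pay for two 16-bit ones

for name, alloc in (("uniform", uniform), ("mixed", mixed)):
    print(name, average_bits(alloc), weight_memory_ratio(alloc))
```

Both allocations average 8 bits and use the same weight memory, but the mixed one keeps its most sensitive channels at full precision, which is the trade that channel-wise allocation with pruning is meant to exploit.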
This explainer was generated by AI and reviewed by a human editor.