QUICK REVIEW

[Paper Review] QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization

Yuhao Xu, Yantai Yang|arXiv (Cornell University)|Feb 3, 2026

Advanced Neural Network Applications | Citations: 0

One-Line Summary

QVLA provides action-centric, channel-wise quantization that outperforms LLM/MMLM-based quantization methods and allows 0-bit pruning within an overall INT8 budget.

ABSTRACT

The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model's quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. In the LIBERO, the quantization version of OpenVLA-OFT with our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.

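Research Motivation and Objectives

Motivate the need for quantization tailored to embodied VLA models by showing that small action deviations can compound into catastrophic task failures
Show that per-channel sensitivity within a module is heterogeneous and that a few critical interfaces govern performance
Propose QVLA, which aligns quantization with action-space fidelity and unifies pruning and quantization
Develop a fast sensitivity proxy and a greedy demotion algorithm for per-channel bit allocation
Evaluate QVLA against LLM/MMLM-derived quantization methods on OpenVLA/OpenVLA-OFT and the LIBERO benchmark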

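Proposed Method

Quantify per-channel sensitivity by quantizing each individual channel to 0, 2, 4, 8, or 16 bits and measuring the resulting action-space error
Define a single-step Action-MSE and a cumulative task-accuracy metric so that evaluation is guided by the action space
Compute a Jacobian-based, first-order Taylor sensitivity proxy to rank channel importance efficiently
Starting from 16 bits for every channel, apply a greedy demotion algorithm that lowers bit-widths beginning with the least sensitive channels until a target average bit budget is met
For stability, adopt channel-wise weight quantization with one bit-width per output channel, keep activations at a uniform bit-width, and store weights row by row for hardware efficiency
Verify that channel-wise quantization outperforms layer-wise and uniform-bit schemes in action fidelity and stability, with pruning treated as 0-bit channels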
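The measure-then-demote procedure listed above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the two-layer toy policy, the symmetric rounding scheme in `fake_quantize`, the 7-dimensional action, and all function names are assumptions made for illustration. The sketch measures sensitivity directly by quantizing one channel at a time (the Jacobian-based Taylor proxy is not implemented), and it treats per-channel errors as additive when ranking demotions.

```python
import heapq
import numpy as np

BIT_LEVELS = [16, 8, 4, 2, 0]  # candidate bit-widths from the review; 0 means pruned


def fake_quantize(row, bits):
    """Symmetric uniform fake-quantization of one weight row (assumed scheme)."""
    if bits == 0:
        return np.zeros_like(row)  # 0-bit: the channel is pruned
    if bits >= 16:
        return row.copy()  # 16-bit is treated as the unquantized reference
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(row)) / qmax
    if scale == 0:
        return row.copy()
    return np.clip(np.round(row / scale), -qmax, qmax) * scale


def policy(w1, w2, obs):
    """Hypothetical toy policy: observations -> hidden channels -> actions."""
    return np.tanh(obs @ w1.T) @ w2.T


def channel_sensitivity(w1, w2, obs):
    """sens[c][b] = Action-MSE when only output channel c of w1 uses b bits."""
    ref = policy(w1, w2, obs)
    sens = []
    for c in range(w1.shape[0]):
        per_bits = {}
        for bits in BIT_LEVELS:
            w = w1.copy()
            w[c] = fake_quantize(w1[c], bits)
            per_bits[bits] = float(np.mean((policy(w, w2, obs) - ref) ** 2))
        sens.append(per_bits)
    return sens


def greedy_demotion(sens, target_avg_bits):
    """Start all channels at 16 bits; repeatedly demote the channel whose next
    lower bit-width adds the least measured error until the budget is met."""
    n = len(sens)
    level = [0] * n  # index into BIT_LEVELS for each channel
    total_bits = BIT_LEVELS[0] * n

    def cost(c):
        cur, nxt = BIT_LEVELS[level[c]], BIT_LEVELS[level[c] + 1]
        return sens[c][nxt] - sens[c][cur]

    heap = [(cost(c), c) for c in range(n)]
    heapq.heapify(heap)
    while total_bits > target_avg_bits * n and heap:
        _, c = heapq.heappop(heap)
        total_bits -= BIT_LEVELS[level[c]] - BIT_LEVELS[level[c] + 1]
        level[c] += 1
        if level[c] + 1 < len(BIT_LEVELS):
            heapq.heappush(heap, (cost(c), c))
    return [BIT_LEVELS[i] for i in level]


rng = np.random.default_rng(0)
w1 = rng.normal(size=(32, 16))   # 32 output channels (rows) to allocate bits for
w2 = rng.normal(size=(7, 32))    # maps hidden channels to a 7-dim action (assumed)
obs = rng.normal(size=(64, 16))  # calibration observations (random stand-ins)

sens = channel_sensitivity(w1, w2, obs)
bits = greedy_demotion(sens, target_avg_bits=8)
print(bits, float(np.mean(bits)))
```

Ranking by added error alone is one possible choice; a variant could divide the added error by the number of bits saved so that large demotions are not penalized relative to small ones.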

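Experimental Results

Research Questions

RQ1: How does quantization in VLA models affect action outputs compared with standard LLM/MMLM quantization approaches?
RQ2: Can per-channel, action-space-driven sensitivity be estimated effectively, and can bits be allocated robustly for real-time robotic inference?
RQ3: Do channel-wise mixed-precision quantization and pruning outperform uniform or layer-wise schemes on OpenVLA/OpenVLA-OFT and the LIBERO benchmark?
RQ4: When QVLA is applied to resource-constrained robotic hardware, what are the trade-offs among memory usage, speed, and task performance?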

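Key Findings

Channel-wise analysis reveals strong heterogeneity within layers; the projector and the action head are the most sensitive to quantization perturbations
The single-step action-space sensitivity ranking agrees with long-horizon performance as verified by the cumulative metric
QVLA's per-channel bit allocation and pruning achieve higher accuracy with less memory than LLM/MMLM-derived methods (e.g., SmoothQuant, OmniQuant), while also improving speed
On OpenVLA/OpenVLA-OFT, QVLA cuts VRAM to about 29.2% of the original and delivers up to a 1.49x speedup, with an average performance drop close to zero in many settings
Under an INT8 budget, channel-wise pruning often matches or exceeds full-precision (FP) performance, whereas layer-wise quantization degrades accuracy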
Under an overall INT8 hardware budget, channel-wise gating with pruning tends to outperform uniform-bit quantization, especially on long-horizon tasks
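To make the "overall INT8 budget" concrete, the short sketch below compares a uniform 8-bit allocation with a mixed allocation that prunes one channel and keeps two at 16 bits. The function name, the eight-channel layout, and the channel size are hypothetical, and the calculation counts weight bits only (no scale metadata, activations, or caches), so it does not reproduce the 29.2% VRAM figure reported in the paper.

```python
def weight_memory_ratio(bits_per_channel, channel_size, ref_bits=16):
    """Fraction of the reference (16-bit) weight memory used by an allocation."""
    used = sum(b * channel_size for b in bits_per_channel)
    return used / (ref_bits * channel_size * len(bits_per_channel))


# Hypothetical 8-channel layer with 4096 weights per channel.
uniform_int8 = [8] * 8
mixed = [16, 16, 8, 8, 8, 4, 4, 0]  # same average of 8 bits; one channel pruned

print(weight_memory_ratio(uniform_int8, 4096))
print(weight_memory_ratio(mixed, 4096))
```

Both allocations use the same weight memory, but the mixed one spends the bits freed by the pruned and 4-bit channels on keeping two channels at 16 bits, which is the kind of trade the findings above describe.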


This review was generated by AI and checked by a human editor.