QUICK REVIEW

[Paper Review] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Hsieh Y-s|arXiv (Cornell University)|Feb 23, 2026

Multimodal Machine Learning Applications|Citations: 0

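One-sentence summary

QuantVLA is a training-free post-training quantization (PTQ) framework for Vision-Language-Action models with a Diffusion Transformer action head. Using a selective quantization layout and lightweight calibration, it achieves substantial memory savings while matching or exceeding full-precision baselines on VLA tasks.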

ABSTRACT

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
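
Component (1) of the abstract, the selective quantization layout, can be illustrated with a minimal sketch. The module names (`llm.`, `dit.`, `q_proj`, and so on) are hypothetical, since the review does not list the real ones. The sketch also assumes that the floating-point exception applies to the DiT attention projections only, and it uses plain symmetric per-channel rounding as a stand-in for whatever quantizer the paper actually uses.

```python
import numpy as np

# Hypothetical module-name suffixes; the real model's names are not given in the review.
DIT_ATTENTION_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")

def should_quantize(layer_name: str) -> bool:
    """Selective layout (assumed reading): quantize every linear layer
    except the attention projections inside the DiT action head."""
    is_dit = layer_name.startswith("dit.")
    return not (is_dit and layer_name.endswith(DIT_ATTENTION_SUFFIXES))

def quantize_weight(w: np.ndarray, bits: int = 4):
    """Symmetric per-output-channel quantization so that w ~= scale * q."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

layers = [
    "llm.layer0.q_proj",
    "llm.layer0.mlp.up_proj",
    "dit.block0.q_proj",
    "dit.block0.mlp.fc1",
]
plan = {name: should_quantize(name) for name in layers}

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
q, scale = quantize_weight(w, bits=4)
max_error = np.max(np.abs(w - q * scale))  # bounded by half a quantization step
```

In this toy plan only `dit.block0.q_proj` stays in floating point; the other three layers are marked for integer weights.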

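Motivation and objectives

Analyze the quantization sensitivity of Vision-Language-Action models with a Diffusion Transformer action head.
Propose QuantVLA, a training-free PTQ framework with a selective quantization layout and lightweight calibration that stabilizes low-bit inference.
Demonstrate memory savings together with competitive or better task performance for OpenPI 0.5 and GR00T N1.5 on the LIBERO benchmark.
Show that QuantVLA is robust and generalizes across precision levels and task settings.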

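Proposed method

Adopts a selective quantization layout that integerizes all linear layers of the language model and the DiT MLP while keeping the attention projections (Q, K, V, O) in floating point.
Incorporates a DuQuant-inspired reparameterization to improve the low-bit robustness of the linear layers.
Introduces Attention Temperature Matching (ATM), which aligns logit distributions per head at the language-action interface.
Introduces Output Head Balancing (OHB), which uses a per-layer scalar to restore post-projection energy and stabilize the residual path.
Calibrates ATM and OHB on a small unlabeled calibration buffer and folds the scalars into the dequantization scales, so the operator schedule is unchanged.
Preserves the original architecture, requires no training, and enables low-bit integer kernels for many components.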
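
The two calibration scalars and the folding step can be sketched as follows. The review does not state which statistics ATM and OHB match, so this sketch assumes the per-head standard deviation of the attention logits for ATM and the per-layer RMS of the projected output for OHB. The function names and the synthetic drift values are illustrative only, not the paper's implementation.

```python
import numpy as np

def atm_scales(logits_fp: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-head scalar that matches the logit spread of the quantized model
    to the floating-point one. Inputs have shape (heads, queries, keys).
    Assumption: standard deviation is the matched statistic."""
    std_fp = logits_fp.std(axis=(1, 2))
    std_q = logits_q.std(axis=(1, 2))
    return std_fp / np.maximum(std_q, 1e-8)

def ohb_scale(out_fp: np.ndarray, out_q: np.ndarray) -> float:
    """Per-layer scalar that restores the RMS energy after the output projection.
    Assumption: RMS is the energy measure."""
    rms_fp = np.sqrt(np.mean(out_fp ** 2))
    rms_q = np.sqrt(np.mean(out_q ** 2))
    return float(rms_fp / max(rms_q, 1e-8))

rng = np.random.default_rng(0)

# Synthetic calibration buffer: the "quantized" logits drift by a per-head factor.
logits_fp = rng.normal(size=(2, 4, 4))
drift = np.array([0.8, 1.25]).reshape(2, 1, 1)
logits_q = logits_fp * drift
alpha = atm_scales(logits_fp, logits_q)  # undoes the synthetic drift per head

# Synthetic post-projection outputs that lost half of their amplitude.
out_fp = rng.normal(size=(4, 8))
out_q = 0.5 * out_fp
beta = ohb_scale(out_fp, out_q)

# Folding: applying a scalar after dequantization is the same as rescaling
# the dequantization scale, so no extra operator is needed at inference.
acc = rng.integers(-100, 100, size=(4, 8))  # stand-in integer accumulator
dequant_scale = 0.01
folded = acc * (beta * dequant_scale)
unfolded = beta * (acc * dequant_scale)
```

With this synthetic drift, `alpha` comes out as approximately `[1.25, 0.8]` and `beta` as `2.0`, and `folded` equals `unfolded` up to floating-point rounding, which is why the scalars add no inference-time operators.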

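Experimental results

Research questions

RQ1: How does quantization perturb a tightly coupled VLA stack with a DiT action head?
RQ2: Can a training-free PTQ framework stabilize both the language backbone and the diffusion-based action head under low-bit quantization?
RQ3: How much memory can be saved in VLA models without retraining, and how does accuracy compare with the full-precision baseline?
RQ4: Does the calibration of ATM and OHB generalize across different VLA models and tasks within LIBERO?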

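Key findings

QuantVLA achieves about 70% relative memory savings on the quantized components compared with the baseline FP16 model.
QuantVLA matches or exceeds the task success rates of the full-precision baselines on the evaluated LIBERO tasks.
On OpenPI 0.5, QuantVLA reaches an average success rate of 97.6% and reduces memory from 4.27 GB to 1.28 GB.
On GR00T N1.5, QuantVLA reaches an average success rate of 88.0% and reduces memory from 2.02 GB to 0.91 GB.
Calibrated ATM and OHB restore logit statistics and post-projection energy, stabilizing low-bit inference with no compute overhead beyond calibration.
QuantVLA stays robust at lower bit widths (for example, a 95.3% average on OpenPI 0.5 with W4A4) and across denoising steps.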
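
The relative savings implied by the reported memory figures can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted above. Whether both pairs of figures are measured on the same set of quantized components is not stated in the review.

```python
def relative_saving(before_gb: float, after_gb: float) -> float:
    """Fraction of memory saved when going from before_gb to after_gb."""
    return 1.0 - after_gb / before_gb

# (before, after) memory in GB, as reported in the findings above.
reported = {
    "OpenPI 0.5": (4.27, 1.28),
    "GR00T N1.5": (2.02, 0.91),
}

savings = {name: relative_saving(b, a) for name, (b, a) in reported.items()}
for name, s in savings.items():
    print(f"{name}: {s:.1%}")
```

The OpenPI 0.5 figures work out to roughly 70%, in line with the headline number, while the GR00T N1.5 figures work out to roughly 55%.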


This review was generated by AI and checked by a human editor.