QUICK REVIEW

[Paper Review] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Hsieh Y-s | arXiv (Cornell University) | Feb 23, 2026
Multimodal Machine Learning Applications | 0 citations
TL;DR

QuantVLA introduces a training-free post-training quantization framework for Vision-Language-Action models with a Diffusion Transformer action head. It uses a selective quantization layout plus lightweight calibrations to achieve substantial memory savings while matching or surpassing full-precision baselines on VLA tasks.

ABSTRACT

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines and achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

Motivation & Objective

  • Analyze quantization sensitivity in Vision-Language-Action models with Diffusion Transformer action heads.
  • Propose QuantVLA, a training-free PTQ framework with a selective quantization layout and lightweight calibrations to stabilize low-bit inference.
  • Demonstrate memory savings and competitive or superior task performance on LIBERO benchmarks for OpenPI 0.5 and GR00T N1.5.
  • Show robustness and generalization of QuantVLA across precision levels and task settings.

Proposed method

  • Adopt a selective quantization layout that integerizes all linear layers in the language model and DiT MLPs while keeping attention projections (Q, K, V, O) in floating point.
  • Incorporate a DuQuant-inspired reparameterization to improve low-bit robustness of linear layers.
  • Introduce Attention Temperature Matching (ATM), a per-head scalar that aligns the attention-logit distribution at the language–action interface.
  • Introduce Output Head Balancing (OHB), a per-layer scalar that restores post-projection energy and stabilizes the residual path.
  • Calibrate ATM and OHB from a small unlabeled calibration buffer and fold scalars into dequantization scales without changing the operator schedule.
  • Preserve original architecture, require no training, and enable low-bit integer kernels for many components.
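
The calibration and folding steps listed above can be illustrated with a minimal sketch. The summary does not give the exact statistics QuantVLA uses, so everything here is an assumption made for illustration: the per-head ATM scalar is taken as the ratio of full-precision to quantized logit standard deviation, the per-layer OHB scalar as the ratio of output RMS values, and the helper names (`atm_scalars`, `ohb_scalar`, `fold_into_dequant_scale`) are hypothetical. The calibration buffer is synthetic.

```python
import math
import random


def std(values):
    """Population standard deviation of a flat list of floats."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def rms(values):
    """Root-mean-square magnitude, used here as a simple 'energy' proxy."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def atm_scalars(fp_logits_per_head, q_logits_per_head, eps=1e-8):
    """One scalar per head (assumed statistic): FP logit spread / quantized logit spread."""
    return [
        std(fp) / (std(q) + eps)
        for fp, q in zip(fp_logits_per_head, q_logits_per_head)
    ]


def ohb_scalar(fp_output, q_output, eps=1e-8):
    """One scalar per layer (assumed statistic): FP output RMS / quantized output RMS."""
    return rms(fp_output) / (rms(q_output) + eps)


def fold_into_dequant_scale(dequant_scale, scalar):
    """Folding is a product, so no extra operator is needed at inference:
    (scalar * dequant_scale) * int_acc == scalar * (dequant_scale * int_acc)."""
    return dequant_scale * scalar


# Synthetic calibration buffer: two heads whose quantized logits are
# uniformly shrunk or stretched relative to full precision.
random.seed(0)
fp_heads = [[random.gauss(0.0, 1.0) for _ in range(256)] for _ in range(2)]
distortion = [0.8, 1.1]  # hypothetical per-head drift caused by quantization
q_heads = [[d * v for v in head] for d, head in zip(distortion, fp_heads)]
alphas = atm_scalars(fp_heads, q_heads)

# Synthetic layer output whose energy dropped by 10% after quantization.
fp_out = [random.gauss(0.0, 2.0) for _ in range(256)]
q_out = [0.9 * v for v in fp_out]
beta = ohb_scalar(fp_out, q_out)

# Fold the per-layer scalar into an existing dequantization scale.
dequant_scale = 0.02  # hypothetical weight-scale * activation-scale product
int_acc = 37          # hypothetical integer accumulator value
folded_scale = fold_into_dequant_scale(dequant_scale, beta)

print("ATM scalars per head:", [round(a, 4) for a in alphas])
print("OHB scalar:", round(beta, 4))
print("Rescaled output:", folded_scale * int_acc)
```

In this toy setup each scalar is simply the inverse of the injected distortion. Because the correction is a single multiplication on a scale that the dequantization step already applies, it is consistent with the claim that the operator schedule stays unchanged.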

Experimental results

Research questions

  • RQ1: How does quantization perturbation propagate through tightly coupled VLA stacks with a DiT action head?
  • RQ2: Can a training-free PTQ framework stabilize both the language backbone and the diffusion-based action head under low-bit quantization?
  • RQ3: What level of memory savings is achievable for VLA models without retraining, and how does accuracy compare to full-precision baselines?
  • RQ4: Do ATM and OHB calibrations generalize across different VLA models and tasks within LIBERO?

Key findings

  • QuantVLA achieves about 70% relative memory savings on quantized components compared to baseline FP16 models.
  • QuantVLA matches or exceeds full-precision baseline task success rates on evaluated LIBERO tasks.
  • On OpenPI 0.5, QuantVLA attains an average success rate of 97.6% with memory reduced from 4.27 GB to 1.28 GB.
  • On GR00T N1.5, QuantVLA attains an average success rate of 88.0% with memory reduced from 2.02 GB to 0.91 GB.
  • The ATM and OHB calibrations restore logit statistics and post-projection energy, stabilizing low-bit inference with no computational overhead beyond the calibration itself.
  • QuantVLA maintains strong performance even at lower bit widths (e.g., 95.3% average on OpenPI 0.5 at W4A4) and shows robustness across denoising steps.
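
As a quick check on the memory figures quoted above, the relative saving can be computed as 1 - after / before. The sketch below uses only the GB values reported in the findings; the function name `relative_saving` is a hypothetical helper, and whether the reported sizes cover the whole model or only the quantized components is not stated in this summary.

```python
def relative_saving(before_gb, after_gb):
    """Fraction of memory removed: 1 - after / before."""
    return 1.0 - after_gb / before_gb


# (before, after) memory in GB, as reported in the key findings.
reported = {
    "OpenPI 0.5": (4.27, 1.28),
    "GR00T N1.5": (2.02, 0.91),
}

savings = {
    name: relative_saving(before, after)
    for name, (before, after) in reported.items()
}

for name, saving in savings.items():
    print(f"{name}: {saving:.1%} relative memory saving")
```

Under these numbers the OpenPI 0.5 reduction works out to roughly 70%, matching the headline figure, while the GR00T N1.5 reduction is closer to 55%.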

This review was created by AI and reviewed by human editors.