QUICK REVIEW

[Paper Review] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Hsieh Y-s|arXiv (Cornell University)|2026. 02. 23.

Multimodal Machine Learning Applications | Citations: 0

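One-Line Summary

QuantVLA introduces a training-free post-training quantization framework for Vision-Language-Action models with a Diffusion Transformer action head. Using a selective quantization layout and lightweight calibration, it achieves about 70% relative memory savings on the quantized components while matching or exceeding full-precision baselines on VLA tasks.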

ABSTRACT

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
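
The selective quantization layout described in the abstract can be pictured as a simple filter over linear-layer names. The sketch below is only an illustration: the module names (`llm.…`, `dit.…`, `q_proj`, and so on) and the `fp_scope` parameter are hypothetical and do not come from the paper or from any real checkpoint. The abstract does not spell out which attention projections stay in floating point, so the scope is left configurable.

```python
# Hypothetical projection names, for illustration only; real checkpoints will differ.
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def plan_layout(linear_layer_names, fp_scope=""):
    """Map each linear layer name to "int" (integerized) or "fp" (kept in floating point).

    A layer stays in floating point only if it is an attention projection
    (Q, K, V, or O) and its name starts with fp_scope. The default scope ""
    matches every attention projection.
    """
    layout = {}
    for name in linear_layer_names:
        is_attention_projection = name.split(".")[-1] in ATTENTION_PROJECTIONS
        in_scope = name.startswith(fp_scope)
        layout[name] = "fp" if (is_attention_projection and in_scope) else "int"
    return layout


names = [
    "llm.layers.0.mlp.up_proj",
    "llm.layers.0.self_attn.q_proj",
    "dit.blocks.0.attn.q_proj",
    "dit.blocks.0.mlp.fc1",
]

# Assumed reading: only the DiT attention projections stay in floating point.
layout = plan_layout(names, fp_scope="dit.")
for name, precision in layout.items():
    print(f"{name}: {precision}")
```

Passing `fp_scope=""` instead keeps every attention projection in floating point, which is the other plausible reading of the abstract.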

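Motivation and Objectives

Analyzes quantization sensitivity in Vision-Language-Action models with a Diffusion Transformer action head.
Proposes QuantVLA, a training-free PTQ framework, together with a selective quantization layout and lightweight calibration for stabilizing low-bit inference.
Demonstrates memory savings with competitive or better task performance for OpenPI 0.5 and GR00T N1.5 on the LIBERO benchmark.
Shows the robustness and generalization of QuantVLA across precision levels and task settings.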

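Proposed Method

Adopts a selective quantization layout that integerizes all linear layers of the language model and the DiT MLPs while keeping the attention projections (query Q, key K, value V, output O) in floating point.
Introduces a DuQuant-inspired reparameterization to improve the low-bit robustness of the linear layers.
Introduces Attention Temperature Matching (ATM), a per-head scalar that aligns logit distributions at the language-action interface.
Introduces Output Head Balancing (OHB), a per-layer scalar that restores post-projection energy and stabilizes the residual path.
Calibrates ATM and OHB on a small unlabeled calibration buffer and folds the scalars into the dequantization scales without changing the operator schedule.
Preserves the original architecture, requires no training, and enables low-bit integer kernels for many components.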
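
The two calibration scalars and the folding step can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the paper's formulas: the matching rules (standard deviation for the ATM-style scalar, RMS energy for the OHB-style scalar), the use of 4-bit quantize-dequantize as a stand-in for low-bit perturbation, and all numbers and helper names are made up for illustration. What the sketch does show with certainty is why folding is free: multiplying a dequantized value by a scalar is the same as multiplying the dequantization scale by that scalar.

```python
import math


def quantize_symmetric(values, bits):
    """Symmetric per-tensor quantization: returns integer codes and a scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def matching_scalar(reference, perturbed, stat):
    """Scalar s such that stat(s * perturbed) equals stat(reference)."""
    denom = stat(perturbed)
    return stat(reference) / denom if denom > 0 else 1.0


# Toy calibration buffer (made-up numbers): full-precision logits for one head.
fp_logits = [0.9, -1.3, 2.1, 0.2, -0.7, 1.6, -2.4, 0.05]
# Stand-in for low-bit perturbation: 4-bit quantize-dequantize.
codes, scale = quantize_symmetric(fp_logits, bits=4)
lowbit_logits = dequantize(codes, scale)

# ATM-style per-head scalar (assumed rule: match the logit standard deviation).
alpha = matching_scalar(fp_logits, lowbit_logits, std)

# OHB-style per-layer scalar (assumed rule: match post-projection RMS energy).
fp_output = [0.4, -0.1, 0.8, -0.6, 0.3, 0.05, -0.9, 0.2]
out_codes, out_scale = quantize_symmetric(fp_output, bits=4)
lowbit_output = dequantize(out_codes, out_scale)
beta = matching_scalar(fp_output, lowbit_output, rms)

# Folding: scaling the dequantization scale once replaces a separate multiply.
folded = dequantize(codes, scale * alpha)
explicit = [alpha * v for v in lowbit_logits]

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Because the scalar is absorbed into an existing scale, the inference graph keeps the same operators, which is consistent with the claim that the operator schedule is unchanged.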

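Experimental Results

Research Questions

RQ1: How do quantization perturbations propagate through a tightly coupled VLA stack with a DiT action head?
RQ2: Can a training-free PTQ framework stabilize both the language backbone and the diffusion-based action head under low-bit quantization?
RQ3: What level of memory savings is achievable in VLA models without retraining, and how does accuracy compare with the full-precision baselines?
RQ4: Do the ATM and OHB calibrations generalize across different VLA models and tasks within LIBERO?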

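Key Results

QuantVLA achieves about 70% relative memory savings on the quantized components compared with the FP16 baseline model.
QuantVLA matches or exceeds the task success rates of the full-precision baselines on the evaluated LIBERO tasks.
On OpenPI 0.5, QuantVLA reduces memory from 4.27 GB to 1.28 GB and achieves an average success rate of 97.6%.
On GR00T N1.5, QuantVLA reduces memory from 2.02 GB to 0.91 GB and achieves an average success rate of 88.0%.
The ATM and OHB calibrations restore logit statistics and post-projection energy, stabilizing low-bit inference without adding calibration-related compute overhead.
QuantVLA still performs strongly at lower bit widths (e.g., a 95.3% average under W4A4 on OpenPI 0.5) and is robust across denoising steps.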
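
As a quick arithmetic check, the relative savings implied by the memory figures quoted above can be computed directly. The helper name `relative_saving` is introduced here for illustration; the inputs are only the numbers stated in this review.

```python
def relative_saving(before_gb, after_gb):
    """Fraction of memory removed relative to the starting footprint."""
    return (before_gb - after_gb) / before_gb


# (before, after) in GB, as quoted in the key results above.
reported = {
    "OpenPI 0.5": (4.27, 1.28),
    "GR00T N1.5": (2.02, 0.91),
}

for model, (before, after) in reported.items():
    saving = relative_saving(before, after)
    print(f"{model}: {before} GB -> {after} GB, saving {saving:.1%}")
# OpenPI 0.5: 4.27 GB -> 1.28 GB, saving 70.0%
# GR00T N1.5: 2.02 GB -> 0.91 GB, saving 55.0%
```

The OpenPI 0.5 figures line up with the headline "about 70%" number, while the GR00T N1.5 figures as quoted here work out to roughly 55%.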


This review was generated by AI and checked by a human editor.