QUICK REVIEW

[Paper Review] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization

Shaoqiu Zhang, Zizhong Ding | arXiv (Cornell University) | 2026-02-10

Generative Adversarial Networks and Image Synthesis | Citations: 0

One-Line Summary

AdaTSQ is a post-training quantization (PTQ) framework for Diffusion Transformers that exploits temporal sensitivity in two ways: Pareto-aware, timestep-dynamic bit-width allocation and Fisher-guided temporal calibration. It reports state-of-the-art W4A4 results, enables W3A3 image generation, and delivers strong video results with a low-overhead policy search.

ABSTRACT

Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.

Motivation and Objectives

Motivate the need to compress Diffusion Transformers (DiTs) without sacrificing temporal fidelity.
Develop a PTQ framework that exploits temporal heterogeneity in DiTs.
Propose a Pareto-aware beam-search strategy to allocate bit-widths across timesteps and layers.
Introduce Fisher-guided temporal calibration to prioritize calibration data from sensitive timesteps.

Proposed Method

Model the quantization policy search as a constrained pathfinding problem and solve it with a Pareto-aware beam search that minimizes reconstruction error under a bit budget.
Generate timestep-specific candidate configurations based on Fisher information to target sensitive timesteps.
Use a Pareto frontier to balance cumulative reconstruction error and bit-cost across timesteps.
Compute temporal importance via Fisher information and apply temperature-scaled softmax to re-weight calibration data per layer.
Reformulate weight quantization as a temporally weighted risk minimization problem and derive a Risk-Aware Hessian for calibration.
Optionally validate final candidates with lightweight end-to-end metrics (e.g., CLIP) to select the best perceptual quality under the bit-budget.
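
The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the additive per-timestep error table `err`, the helper names (`pareto_front`, `beam_search`, `temporal_weights`, `weighted_hessian`), and the choice of the activation second moment as the layer-wise Hessian proxy (as in GPTQ-style methods) are all hypothetical. The paper scores candidate paths by end-to-end reconstruction error, and its exact Risk-Aware Hessian may differ.

```python
import math


def pareto_front(cands):
    """Drop candidates dominated in (error, bits); lower is better for both."""
    front = []
    for cand in sorted(cands, key=lambda c: (c[1], c[2])):
        # Sorted by error, so a candidate survives only if it is cheaper
        # in bits than everything already kept.
        if all(cand[2] < kept[2] for kept in front):
            front.append(cand)
    return front


def beam_search(err, bit_options, budget, beam_width=4):
    """Pick one bit-width per timestep under a total bit budget.

    err[t][b] is a hypothetical, additive error for running timestep t at
    b bits; the paper instead scores paths by end-to-end reconstruction error.
    """
    steps = len(err)
    min_bits = min(bit_options)
    beam = [((), 0.0, 0)]  # (policy, cumulative error, cumulative bits)
    for t in range(steps):
        expanded = []
        for policy, e, bits in beam:
            for b in bit_options:
                used = bits + b
                # Keep only paths that can still finish within the budget.
                if used + min_bits * (steps - t - 1) <= budget:
                    expanded.append((policy + (b,), e + err[t][b], used))
        if not expanded:
            raise ValueError("budget is infeasible for the given bit options")
        beam = pareto_front(expanded)[:beam_width]
    return min(beam, key=lambda c: c[1])


def temporal_weights(fisher, temperature=1.0):
    """Temperature-scaled softmax over per-timestep Fisher scores."""
    scaled = [f / temperature for f in fisher]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def weighted_hessian(acts_per_step, weights):
    """Assumed proxy: H = sum_t w_t * mean_x(x x^T) for one linear layer."""
    dim = len(acts_per_step[0][0])
    hess = [[0.0] * dim for _ in range(dim)]
    for acts, w in zip(acts_per_step, weights):
        for x in acts:
            for i in range(dim):
                for j in range(dim):
                    hess[i][j] += w * x[i] * x[j] / len(acts)
    return hess


if __name__ == "__main__":
    # Toy data: timestep 0 is far more sensitive to low precision.
    err = [{3: 1.0, 4: 0.5, 8: 0.1},
           {3: 0.2, 4: 0.15, 8: 0.1},
           {3: 0.2, 4: 0.15, 8: 0.1}]
    policy, error, bits = beam_search(err, (3, 4, 8), budget=14)
    print(policy, round(error, 3), bits)
    w = temporal_weights([4.0, 1.0, 1.0], temperature=2.0)
    acts = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
    print(weighted_hessian(acts, w))
```

In the toy run, the search spends its only affordable 8-bit slot on the sensitive first timestep and keeps the remaining steps at 3 bits. Lowering `temperature` concentrates the calibration weights on the highest-Fisher timesteps, while raising it moves them toward a uniform average.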

Experimental Results

Research Questions

RQ1: How can quantization bit-widths be allocated across timesteps and layers in DiTs to maximize perceptual quality within a fixed bit budget?
RQ2: Can Fisher information identify temporally sensitive phases in diffusion denoising to guide calibration and optimization?
RQ3: Does Fisher-guided temporal calibration improve quantization robustness across image and video DiTs?
RQ4: How does AdaTSQ perform against state-of-the-art DiT quantization methods on image and video benchmarks?

Key Results

AdaTSQ outperforms state-of-the-art methods like SVDQuant and ViDiT-Q in both image and video generation scenarios.
It enables robust W4A4 quantization across Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1, preserving perceptual quality.
The Pareto-aware allocation achieves better structural clarity and semantic alignment than static quantization baselines.
Fisher-guided temporal calibration improves preservation of critical denoising steps, enhancing end-to-end generation metrics.
Search overhead is low: finding the mixed-precision policy for a 50-step model takes about 4 minutes on a single A100-80GB GPU. The selected policy is reported as roughly 80% 3-bit, 10% 4-bit, and 10% 8-bit, for an average of about 3.1 bits and FLOPs and memory savings of approximately 5.16× and 5.33×, respectively.
AdaTSQ achieves W3A3 image generation for text-to-image models and strong W4A4 performance on video models.

This review was generated by AI and reviewed by a human editor.