QUICK REVIEW

[Paper Review] Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization

Wenhao Zhao, Qiran Zou | arXiv (Cornell University) | Mar 17, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

One-Sentence Summary

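The paper introduces Progressive Vector Quantization (ProVQ), a curriculum-learning-based training strategy that decouples manifold warmup from discretization to prevent premature discretization in VQ-VAEs, improving reconstruction and generation across vision and protein modalities.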

ABSTRACT

Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve this, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ effectively guides the codebook toward well-expanded manifolds. Extensive experimental results demonstrate the broad effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet-1K and ImageNet-100 benchmarks, highlighting ProVQ's boost for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StructTokenBench leaderboard.

Research Motivation and Objectives

Identify why standard VQ training suffers from Premature Discretization and mutual co-adaptation deadlock between encoder and codebook.
Propose Progressive Vector Quantization (ProVQ) to decouple manifold warmup from discretization.
Demonstrate ProVQ improvements on ImageNet reconstruction and generation, and on protein structure tokenization benchmarks.
Provide a synthetic diagnostic tool (TopoDisc) to reveal discretization pathologies.
Show ablations validating the effectiveness of manifold warmup and soft transition components.

Proposed Method

Frame VQ training as curriculum learning to separate continuous manifold warmup from discrete bottleneck optimization.
Stage 1: Manifold Warmup using a standard continuous autoencoder to learn global data structure; initialize codebook with K-Means on embeddings.
Stage 2: Scheduled Discretization with a soft-to-hard transition, in which a cosine-annealed schedule $\alpha(t)$ controls a soft latent $\tilde{z}$ that lies between the continuous latent $z$ and the quantized latent $z_q$.
Use a straight-through estimator for $z_q$ and a dynamically weighted loss that combines the reconstruction and VQ/commitment terms through an adaptive weight $\omega(t)$.
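
The Stage 2 schedule can be illustrated with a minimal Python sketch. The summary does not give the exact form of the schedule or of the soft latent, so both are assumptions here: alpha(t) is taken to be a half-cosine ramp from 0 to 1, and the soft latent is taken to be a convex combination of z and its nearest code. All function names are hypothetical, and the straight-through estimator and the omega(t)-weighted loss are left out because they need an autograd framework.

```python
import math


def cosine_alpha(step, total_steps):
    """Assumed schedule: rises smoothly from 0 (continuous) to 1 (discrete)."""
    t = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return 0.5 * (1.0 - math.cos(math.pi * t))


def nearest_code(z, codebook):
    """Return the codebook vector closest to z in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((zi - ci) ** 2 for zi, ci in zip(z, c)))


def soft_latent(z, codebook, alpha):
    """Assumed interpolation: z_tilde = (1 - alpha) * z + alpha * z_q."""
    z_q = nearest_code(z, codebook)
    return [(1.0 - alpha) * zi + alpha * qi for zi, qi in zip(z, z_q)]


# Toy 2D example: one latent and a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
z = [0.8, 0.6]
for step in (0, 50, 100):
    alpha = cosine_alpha(step, 100)
    print(step, round(alpha, 3), soft_latent(z, codebook, alpha))
```

Under these assumptions, alpha = 0 passes the continuous latent through unchanged and alpha = 1 yields the hard-quantized code, which matches the soft-to-hard behavior described above.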

Figure 1: The Premature Discretization and resulting optimization deadlock. During early training stages, grid mapping forces the embedding distribution to contract and align with a sub-optimal clustered code, while uninformative guidance of embeddings causes the codebook vectors to stagnate.

Experimental Results

Research Questions

RQ1: Can decoupling manifold warmup from discretization prevent the encoder-codebook co-adaptation deadlock observed in vanilla VQ-VAEs?
RQ2: Does ProVQ improve reconstruction fidelity and generative performance across vision and biological sequence tokenization tasks?
RQ3: How do soft transition and manifold warmup contribute to stabilizing training and expanding latent space utilization?
RQ4: Is there a synthetic diagnostic tool to reveal discretization pathologies, and how well does ProVQ perform on it?
RQ5: How does ProVQ affect downstream protein structure modeling and tokenization benchmarks?

Key Findings

Latent Resolution	Tokenizer	rFID ↓	PSNR ↑	SSIM ↑	Perplexity ↑	Euc dist ↑
16×16	LlamaGen	2.19	20.79	0.675	8580.30	1.42
16×16	+ ProVQ	1.86	20.92	0.682	8591.85	6.49
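
To put the two rows above on a common scale, the relative change of each metric can be computed directly from the table. This is only an arithmetic sketch over the reported numbers; the dictionary keys and the helper name `relative_change` are illustrative choices, not anything defined in the paper.

```python
# Values copied from the two 16x16 rows of the table above.
baseline = {"rFID": 2.19, "PSNR": 20.79, "SSIM": 0.675,
            "Perplexity": 8580.30, "Euc dist": 1.42}
provq = {"rFID": 1.86, "PSNR": 20.92, "SSIM": 0.682,
         "Perplexity": 8591.85, "Euc dist": 6.49}


def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


changes = {name: relative_change(baseline[name], provq[name]) for name in baseline}
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

Read this way, rFID drops by roughly 15%, PSNR and perplexity move by well under 1%, SSIM by about 1%, and the largest shift by far is in the Euc dist column.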

ProVQ consistently improves reconstruction metrics on ImageNet-1K/100 (lower rFID, higher PSNR/SSIM) compared to baselines.
Generative performance improves with ProVQ (lower gFID and higher Recall) for LlamaGen-B/L models.
ProVQ yields stronger codebook utilization and greater latent space diversity (higher perplexity and larger Euclidean distance).
In protein tokenization, ProVQ + AminoAseed achieves leading averages in functional site, physicochemical property, and structure property tasks, surpassing baselines.
ProVQ attains state-of-the-art performance on StructTokenBench for protein structure modeling across multiple tasks.
Ablation studies confirm the importance of manifold warmup and cosine-based soft transition to achieve best performance.
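
The perplexity figure cited in these findings can be illustrated with a short sketch. It assumes the definition commonly used for VQ codebooks, the exponential of the Shannon entropy of empirical code usage; the summary does not state how the paper computes it, and the function name is hypothetical.

```python
import math
from collections import Counter


def codebook_perplexity(indices):
    """exp(entropy) of how often each code index is selected (assumed definition)."""
    counts = Counter(indices)
    n = len(indices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)


# Four tokens spread evenly over four codes versus all tokens on one code.
uniform_usage = codebook_perplexity([0, 1, 2, 3])
collapsed_usage = codebook_perplexity([5, 5, 5, 5])
print(uniform_usage, collapsed_usage)
```

Under this definition, perplexity equals the number of codes when usage is perfectly uniform and falls to 1 when every token maps to the same code, so higher values indicate fuller codebook utilization.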

Figure 2: Empirical validation on synthetic 2D datasets. (a) Synthetic dataset composed of disk-shaped data plus triangle-shaped data, so that the grid mapping becomes visible along the triangle's edges. (b) Comparison of reconstruction performance over different configurations, demonstrating that both the Soft Transition


This review was generated by AI and reviewed by human editors.