QUICK REVIEW

[Paper Review] Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu, Xin Li|arXiv (Cornell University)|Oct 9, 2021

Natural Language Processing Techniques|References: 60|Cited by: 92

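One-Sentence Summary

This paper proposes ViT-VQGAN, an improved image quantizer for Vector-quantized Image Modeling (VIM). The two-stage approach pairs a ViT-VQGAN encoder/decoder with a decoder-only Transformer that models image tokens autoregressively, achieving strong FID/IS on ImageNet and strong unsupervised representations.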

ABSTRACT

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at \(256\times256\) resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 73.2% for a similar model size. VIM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size.

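Research Motivation and Goals

Motivate and extend vector-quantized image modeling by replacing the CNN in the quantizer with a Vision Transformer, improving efficiency and fidelity.
Develop a ViT-VQGAN quantizer that achieves better codebook usage and reconstruction quality for downstream autoregressive modeling.
Show that a Transformer trained on discrete image tokens supports unconditional and class-conditioned generation as well as unsupervised representation learning.
Demonstrate strong image synthesis metrics (FID/IS) and competitive linear-probe performance relative to prior generative and discriminative pretraining methods.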

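Proposed Method

Replace the CNN encoder and decoder in the VQGAN framework with Vision Transformers (ViT-VQGAN) for end-to-end image quantization.
Improve codebook usage and reconstruction through factorized codes (codebook lookup in a low-dimensional space), L2 normalization of the codes, and a combined training objective (logit-Laplace, L2, perceptual, and GAN losses).
Train a decoder-only Transformer (VIM) to autoregressively model the 1024 image tokens produced by ViT-VQGAN.
For unsupervised learning, evaluate with a linear probe: train a softmax head on Transformer features averaged at an intermediate block.
For class-conditioned generation, prepend a class-ID token to the image tokens.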
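
The factorized, L2-normalized codebook lookup described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names (`l2_normalize`, `project`, `quantize`), the toy dimensions, and the plain linear projection are assumptions, and training details (gradient handling and the loss terms) are omitted.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def project(vec, weight):
    """Linear projection; `weight` is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def quantize(encoder_output, proj_weight, codebook):
    """Return (token index, unit-norm code) for one encoder output vector.

    Factorized codes: the encoder output is first projected to a
    low-dimensional space, and the nearest code is searched there.
    L2 normalization: both the projected vector and the codes are
    unit-normalized, so Euclidean distance ranks codes by cosine similarity.
    """
    z = l2_normalize(project(encoder_output, proj_weight))
    codes = [l2_normalize(code) for code in codebook]
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codes]
    index = min(range(len(dists)), key=dists.__getitem__)
    return index, codes[index]


# Toy example (hypothetical sizes): a 4-dim encoder output is projected
# to 2 dims and matched against a 3-entry codebook.
proj_weight = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]]
codebook = [[0.0, 5.0], [3.0, 0.1], [-1.0, 0.0]]
token, code = quantize([1.0, 0.0, 0.0, 0.0], proj_weight, codebook)
```

Applying `quantize` to every encoder output position yields the sequence of discrete token indices that the Stage 2 Transformer models.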

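Experimental Results

Research Questions

RQ1: Does ViT-based quantization (ViT-VQGAN) outperform CNN-based VQGAN in reconstruction quality and codebook usage?
RQ2: Does the VIM framework, a Transformer over discrete image tokens, achieve strong unconditional and class-conditioned image synthesis?
RQ3: Do the representations learned by VIM yield linear-probe accuracy on ImageNet that is competitive with other generative and discriminative pretraining methods?
RQ4: How do architectural choices (encoder/decoder size, codebook design, normalization) affect FID/IS and downstream linear evaluation?
RQ5: What is the effect of removing the perceptual loss for unsupervised training while keeping it for generation tasks?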

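Key Findings

ViT-VQGAN achieves better reconstruction quality and higher throughput than CNN-VQGAN across configurations.
On class-conditioned ImageNet generation, ViT-VQGAN with the VIM-Large Stage 2 Transformer reaches IS 175.1 and FID 4.17, versus IS 70.6 and FID 17.04 for vanilla VQGAN.
Classifier-based rejection sampling further improves these class-conditioned results to FID 3.04 and IS 227.4.
The reported tables show that ViT-VQGAN with a codebook size of 8192 and 1024 tokens attains a reconstruction FID of 1.28 on ImageNet, with comparable gains on CelebA-HQ and FFHQ.
VIM-Large reaches 73.2% linear-probe accuracy on ImageNet, surpassing iGPT-L (60.3%) and iGPT-XL, which indicates strong unsupervised representations.
This linear-probe accuracy approaches that of discriminative methods such as BYOL and DINO.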
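
Classifier-based rejection sampling, as mentioned in the findings above, can be sketched as keeping only the generated samples that a classifier scores most confidently for the target class. The function name, the `acceptance_rate` parameter, and the top-fraction selection rule are assumptions for illustration; this summary does not specify which classifier or acceptance rate was used.

```python
def rejection_sample(samples, class_probs, acceptance_rate):
    """Keep the top `acceptance_rate` fraction of samples by classifier score.

    samples:      generated items (any objects), one per score
    class_probs:  classifier probability of the target class for each sample
    Returns the accepted samples in their original order.
    """
    if not 0.0 < acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in (0, 1]")
    if len(samples) != len(class_probs):
        raise ValueError("samples and class_probs must have the same length")
    keep = max(1, int(len(samples) * acceptance_rate))
    ranked = sorted(range(len(samples)),
                    key=lambda i: class_probs[i], reverse=True)
    accepted = sorted(ranked[:keep])  # restore original ordering
    return [samples[i] for i in accepted]


# Hypothetical scores for four generated images of one class.
kept = rejection_sample(["img0", "img1", "img2", "img3"],
                        [0.9, 0.2, 0.7, 0.4],
                        acceptance_rate=0.5)
```

Discarding low-confidence samples trades sample diversity and extra compute for higher measured fidelity, which matches the direction of the FID and IS changes reported above.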


This review was generated by AI and reviewed by a human editor.