QUICK REVIEW

[Paper Review] Neural Discrete Representation Learning

Aäron van den Oord, Oriol Vinyals|arXiv (Cornell University)|Nov 2, 2017

Speech Recognition and Synthesis | Cited by 1,919

One-Sentence Summary

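Introduces VQ-VAE, a variational autoencoder whose discrete latent variables are learned through vector quantisation; it avoids posterior collapse and, paired with an autoregressive prior, enables high-quality generation.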

ABSTRACT

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.

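Motivation and Objectives

Advance unsupervised learning of useful representations for images, audio, and video.
Develop a discrete-latent VAE that overcomes the posterior collapse that arises with powerful decoders.
Show that discrete latents can match continuous VAEs in likelihood while supporting a strong prior for generation.
Demonstrate applications including image and video generation, speech understanding, and unsupervised speaker conversion.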

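Proposed Method

Define a latent embedding space e ∈ R^{K×D} containing K discrete codes, each a D-dimensional vector.
The encoder outputs z_e(x); the discrete latent z is obtained by nearest-neighbour lookup in the embedding space e, so that z_q(x) = e_k, where k is the index of the code closest to z_e(x).
Train with a three-term loss: the reconstruction log-likelihood log p(x | z_q(x)), a VQ loss that updates the embeddings e, and a commitment loss that keeps the encoder output close to its chosen embedding. The last two terms use the stop-gradient operator.
Use the straight-through estimator to propagate gradients through the discrete quantisation step.
Assume a uniform prior over z during training, which makes the KL term a constant; afterwards, fit an autoregressive prior over z (PixelCNN for images, WaveNet for audio) for generation.
Evaluate by approximating log p(x) with log p(x | z_q(x)) p(z_q(x)), and compare the result against continuous VAEs.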
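
The quantisation and loss steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `vq_loss_terms`, the toy codebook, the batch-mean reduction, and `beta = 0.25` are choices made for this example. NumPy has no automatic differentiation, so the stop-gradient and straight-through steps appear only as comments.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbour lookup: map each encoder output to its closest code.

    z_e:      (N, D) array of encoder outputs z_e(x)
    codebook: (K, D) array, the embedding space e
    Returns (indices, z_q) with z_q[i] = codebook[indices[i]].
    """
    # Squared Euclidean distance between every output and every code: (N, K).
    diff = z_e[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    indices = np.argmin(dist, axis=1)
    return indices, codebook[indices]

def vq_loss_terms(z_e, z_q, beta=0.25):
    """Forward values of the VQ term and the commitment term.

    With sg[.] denoting stop-gradient (not available in NumPy):
      VQ term:         ||sg[z_e] - e||^2         would update only the codebook
      commitment term: beta * ||z_e - sg[e]||^2  would update only the encoder
    Both share the same forward value up to the factor beta (assumed 0.25 here).
    """
    sq_err = np.mean(np.sum((z_e - z_q) ** 2, axis=-1))
    return sq_err, beta * sq_err

# Toy example: K = 3 codes in D = 2 dimensions, N = 2 encoder outputs.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.9, 1.2], [-0.2, 0.1]])

indices, z_q = quantize(z_e, codebook)
vq_term, commit_term = vq_loss_terms(z_e, z_q)
print(indices, vq_term, commit_term)
```

In a framework with automatic differentiation, the straight-through step is often written as `z_e + sg(z_q - z_e)`: the forward value equals `z_q`, while the decoder's gradient flows to `z_e` unchanged. That formulation is a common convention rather than something this summary specifies.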

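Experimental Results

Research Questions

RQ1 Can a discrete-latent VAE (VQ-VAE) achieve log-likelihoods comparable to continuous VAEs on standard datasets?
RQ2 When a powerful decoder is used, does discretising the latents help avoid posterior collapse while preserving reconstruction quality?
RQ3 Can an autoregressive prior learned over the discrete latents produce coherent, high-quality generations for images, audio, and video?
RQ4 Do the discrete latent representations capture meaningful high-level structure (for example, phonemes in speech) without supervision?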

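Key Findings

VQ-VAE attains a likelihood on CIFAR-10 comparable to a continuous VAE: 4.67 bits/dim for VQ-VAE, versus 4.51 for the continuous VAE and 5.14 for VIMCO.
On ImageNet (128x128x3), a 32x32x1 discrete latent space (K=512) combined with a PixelCNN prior yields high-quality reconstructions.
For audio, the model learns a latent space that is insensitive to low-level waveform detail, exhibits phoneme-like structure without supervision, and enables speaker conversion via a separate speaker embedding.
For video modelling, long-horizon generation works by sampling z from the learned prior and decoding to frames; the sequence is generated in the latent space rather than at the pixel level, and local geometry is preserved.
The model avoids posterior collapse and trains with a simple, robust recipe: a direct dictionary update (the VQ term) plus a commitment term.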
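
As a rough sanity check on the ImageNet setting listed above, the size of the discrete code can be compared with the size of the raw image. The sketch below assumes 8 bits per colour channel and that each latent position stores one index into a K-entry codebook; the helper names `image_bits` and `latent_bits` are made up for this example.

```python
def image_bits(height, width, channels, bits_per_value=8):
    """Bits needed to store a raw image (assumes 8 bits per channel value)."""
    return height * width * channels * bits_per_value

def latent_bits(height, width, num_codes):
    """Bits needed to store a grid of indices into a codebook of num_codes entries."""
    bits_per_index = (num_codes - 1).bit_length()  # 9 bits for K = 512
    return height * width * bits_per_index

raw = image_bits(128, 128, 3)      # 128x128x3 ImageNet image
coded = latent_bits(32, 32, 512)   # 32x32x1 latent grid with K = 512
ratio = raw / coded
print(raw, coded, round(ratio, 1))
```

Under these assumptions the latent grid is about 42.7 times smaller than the raw image, which gives a sense of how much the PixelCNN prior's modelling problem shrinks.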

This review was generated by AI and reviewed by a human editor.