QUICK REVIEW

[Paper Review] Classification Accuracy Score for Conditional Generative Models

Suman Ravuri, Oriol Vinyals|arXiv (Cornell University)|May 26, 2019

Generative Adversarial Networks and Image Synthesis | References: 44 | Cited by: 99

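One-Sentence Summary

Classification Accuracy Score (CAS) trains a classifier on synthetic data produced by a conditional generative model and tests it on real data to measure downstream-task performance. It exposes weaknesses that IS and FID miss and shows that likelihood-based models can outperform GANs on CAS.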

ABSTRACT

Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance (FID). These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes---variational autoencoders, autoregressive models, and generative adversarial networks (GANs)---to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), reveals some surprising results not identified by traditional metrics and constitute our contributions. First, when using a state-of-the-art GAN (BigGAN-deep), Top-1 and Top-5 accuracy decrease by 27.9% and 41.6%, respectively, compared to the original data; and conditional generative models from other model classes, such as Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical Autoregressive Models (HAMs), substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models failed to capture the data distribution, and were previously unknown in the literature. Third, we find traditional GAN metrics such as Inception Score (IS) and FID neither predictive of CAS nor useful when evaluating non-GAN models. Furthermore, in order to facilitate better diagnoses of generative models, we open-source the proposed metric.

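Research Motivation and Objectives

Evaluate generative models by downstream-task performance rather than only by perceptual metrics such as IS and FID.
Define and formalize Classification Accuracy Score (CAS) as a measure of how well synthetic data can stand in for real data when training a labeled-image classifier.
Compare CAS across several model classes (GANs, VQ-VAE-2, HAMs) on a large-scale dataset (ImageNet) and a smaller one (CIFAR-10).
Show that CAS reveals class-specific deficiencies and that traditional GAN metrics predict CAS poorly.
Open-source the CAS metric to encourage wider adoption and diagnostic use.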

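Proposed Method

Train an image classifier (ResNet-based) on synthetic data produced by a conditional generative model.
Evaluate that classifier on real data; the resulting Top-1 and Top-5 accuracies define CAS.
Compare CAS against Inception Score (IS) and Frechet Inception Distance (FID) across model classes.
Run a per-class analysis to identify which classes each model captures poorly.
Introduce Naive Augmentation Score (NAS), which trains the classifier on a mixture of real and synthetic data to study the effect of data augmentation.
Open-source the CAS computation workflow for reproducibility and wider use.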
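
The train-on-synthetic, test-on-real protocol can be illustrated with a toy sketch. This is not the authors' released implementation: the ResNet is replaced by a nearest-centroid classifier, images by 2-D points, and the conditional generator by a hypothetical Gaussian sampler whose last class has drifted away from the real data. All names (`sample_class_conditional`, `train_then_test_on_real`, `SYNTH_MEANS`, and so on) are assumptions made for illustration, and Top-2 stands in for Top-5 because the toy problem has only four classes.

```python
import random

def sample_class_conditional(class_means, n_per_class, noise, rng):
    """Hypothetical stand-in for a dataset or a conditional generator:
    draws 2-D points around one mean per class."""
    points, labels = [], []
    for label, (mean_x, mean_y) in enumerate(class_means):
        for _ in range(n_per_class):
            points.append((rng.gauss(mean_x, noise), rng.gauss(mean_y, noise)))
            labels.append(label)
    return points, labels

def fit_centroids(points, labels, num_classes):
    """Toy classifier (assumption: replaces the ResNet): one centroid per class."""
    sums = [[0.0, 0.0, 0] for _ in range(num_classes)]
    for (x, y), label in zip(points, labels):
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return [(sx / n, sy / n) for sx, sy, n in sums]

def topk_accuracy(centroids, points, labels, k):
    """Fraction of points whose true label is among the k nearest centroids."""
    hits = 0
    for (x, y), label in zip(points, labels):
        ranked = sorted(
            range(len(centroids)),
            key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
        )
        hits += label in ranked[:k]
    return hits / len(labels)

def train_then_test_on_real(train_set, real_test_set, num_classes, k):
    """Fit on train_set (synthetic, real, or mixed); always score on real data."""
    train_points, train_labels = train_set
    test_points, test_labels = real_test_set
    centroids = fit_centroids(train_points, train_labels, num_classes)
    return topk_accuracy(centroids, test_points, test_labels, k)

rng = random.Random(0)
REAL_MEANS = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
# Illustrative flaw: the "generator" places class 3 almost on top of class 2.
SYNTH_MEANS = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (0.5, 3.5)]

real_train = sample_class_conditional(REAL_MEANS, 200, 1.0, rng)
real_test = sample_class_conditional(REAL_MEANS, 200, 1.0, rng)
synthetic_train = sample_class_conditional(SYNTH_MEANS, 200, 1.0, rng)
mixed_train = (real_train[0] + synthetic_train[0],
               real_train[1] + synthetic_train[1])

baseline_top1 = train_then_test_on_real(real_train, real_test, 4, k=1)
cas_top1 = train_then_test_on_real(synthetic_train, real_test, 4, k=1)  # CAS-style
cas_top2 = train_then_test_on_real(synthetic_train, real_test, 4, k=2)
nas_top1 = train_then_test_on_real(mixed_train, real_test, 4, k=1)      # NAS-style

print(f"real-data baseline Top-1: {baseline_top1:.3f}")
print(f"CAS Top-1: {cas_top1:.3f}  CAS Top-2: {cas_top2:.3f}")
print(f"NAS Top-1: {nas_top1:.3f}")
```

The only thing that changes between the baseline, CAS, and NAS runs is the training set; the test set is always real data, which is what makes the scores comparable.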

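Experimental Results

Research Questions

RQ1: Does CAS reveal downstream-task deficiencies of conditional generative models that IS and FID do not?
RQ2: Which model classes (GANs versus likelihood-based models such as VQ-VAE-2 and HAMs) achieve higher CAS on ImageNet and CIFAR-10?
RQ3: Are there specific classes, surfaced by per-class CAS, for which generative models consistently fail to capture the data distribution?
RQ4: How does CAS relate to traditional metrics (IS, FID) across model families?
RQ5: Does augmenting real data with model-generated samples (NAS) improve downstream-task performance, and under what conditions?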

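Key Findings

On ImageNet, BigGAN-deep shows a large CAS drop relative to real data: Top-1 accuracy falls by 27.9% and Top-5 by 41.6%.
Likelihood-based conditional models (VQ-VAE-2, HAM) score higher on CAS than BigGAN-deep despite worse IS and FID.
Per-class CAS analysis identifies specific classes for which BigGAN-deep and other models fail to capture the data distribution (for example balloons, paddlewheel, pencil sharpener, and spatula, with accuracy as low as 0% in some cases).
IS and FID do not reliably predict CAS, especially for non-GAN models, which underlines the need for task-aligned evaluation metrics.
Naive Augmentation Score (NAS) shows that augmenting real data with synthetic samples can yield small classification gains (for example, about 0.2% in Top-5), but results vary with truncation and model.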
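
The per-class breakdown mentioned above amounts to grouping real-data predictions by true label. A minimal sketch follows, assuming the classifier's predictions are already available as a list. The labels and predictions are made-up toy values, not results from the paper, and `per_class_accuracy` and `worst_classes` are hypothetical helper names.

```python
from collections import defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    """Accuracy on real data, computed separately for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == prediction)
    return {label: correct[label] / total[label] for label in total}

def worst_classes(accuracy_by_class, threshold):
    """Classes at or below the threshold, lowest accuracy first."""
    flagged = [c for c, acc in accuracy_by_class.items() if acc <= threshold]
    return sorted(flagged, key=lambda c: accuracy_by_class[c])

# Toy values for illustration only (not taken from the paper).
true_labels = ["cat", "cat", "dog", "dog", "spatula", "spatula"]
predicted_labels = ["cat", "cat", "dog", "cat", "dog", "cat"]

accuracy_by_class = per_class_accuracy(true_labels, predicted_labels)
failing = worst_classes(accuracy_by_class, threshold=0.5)
print(accuracy_by_class)
print(failing)
```

Sorting by per-class accuracy is what lets poorly modeled classes surface automatically instead of being averaged away in a single aggregate score.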

This review was generated by AI and checked by a human editor.