QUICK REVIEW

[Paper Review] A Note on the Inception Score

Shane Barratt, Rishi Sharma | arXiv (Cornell University) | Jan 6, 2018

Generative Adversarial Networks and Image Synthesis | References: 22 | Cited by: 234

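One-Sentence Summary

This paper critiques the Inception Score (IS) as a metric for evaluating image generative models, exposes both its suboptimalities and its misuse, and proposes an improved, more interpretable alternative score that addresses the key problems.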

ABSTRACT

Deep generative models are powerful tools that have produced impressive results in recent years. These advances have been for the most part empirically driven, making it essential that we use high quality evaluation metrics. In this paper, we provide new insights into the Inception Score, a recently proposed and widely used evaluation metric for generative models, and demonstrate that it fails to provide useful guidance when comparing models. We discuss both suboptimalities of the metric itself and issues with its application. Finally, we call for researchers to be more systematic and careful when evaluating and comparing generative models, as the advancement of the field depends upon it.

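Research Motivation and Goals

Assess the validity and reliability of the Inception Score as a general-purpose metric for image generative models.
Identify suboptimalities in the metric itself and in the way it is commonly applied.
Propose improvements to the metric, along with guidance for more robust evaluation of generative models.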

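Proposed Method

Revisit the theoretical basis of the Inception Score and its relationship to mutual information (IS = exp(I(y; x))).
Analyze practical computation issues, including split-based estimation and the effect of the dataset's class distribution.
Introduce an improved score that removes the exponential and the dependence on batch splits: S(G) = (1/N) sum_i D_KL(p(y|x^(i)) || p_hat(y)), where p_hat(y) is the empirical marginal over all N samples.
Show that IS can be optimized adversarially, and that perturbations resembling adversarial examples reach near-perfect scores.
Discuss dataset and model compatibility when applying IS (IS is best suited to generators trained on ImageNet).
Offer recommendations for guarding against overfitting, and encourage more comprehensive evaluation that goes beyond a single metric.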
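
The two formulas above can be made concrete with a short NumPy sketch. It assumes the class probabilities p(y|x^(i)) are already available as an N x K array (in practice they would come from an Inception classifier, which is not included here); the function names `mean_kl_score` and `inception_score`, the `eps` smoothing term, and the synthetic Dirichlet input are illustrative choices, not code from the paper.

```python
import numpy as np

def mean_kl_score(probs, eps=1e-12):
    """S(G): mean KL divergence between each p(y|x_i) and the empirical marginal p_hat(y).

    probs: array of shape (N, K); each row is a probability vector over K classes.
    eps:   small constant (an assumption of this sketch) that keeps log() finite.
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p_hat(y), estimated from all N samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(kl.mean())

def inception_score(probs, eps=1e-12):
    """IS computed over the full sample set: exp(S(G))."""
    return float(np.exp(mean_kl_score(probs, eps)))

if __name__ == "__main__":
    # Synthetic stand-in for classifier outputs: 500 samples, 10 classes.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(10), size=500)
    print("S(G):", mean_kl_score(probs))
    print("IS  :", inception_score(probs))
```

Because S(G) is the logarithm of the full-dataset IS, it is expressed in nats and can be read directly as an estimate of the mutual information between images and predicted labels.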

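Experimental Results

Research Questions

RQ1: What are the main shortcomings of the Inception Score as a metric for image generative models?
RQ2: How do computational choices (splits, dataset, network weights) affect IS?
RQ3: Can IS be improved so that it is more robust and interpretable across datasets and models?
RQ4: What practices should researchers adopt to evaluate generative models more rigorously?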

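Key Findings

IS ranges from 1 to 1000 (the number of ImageNet classes the Inception network predicts); both bounds follow from properties of entropy.
Small changes in the Inception network's weights, even between networks with similar classification accuracy, cause large swings in IS on the same set of generated samples.
Computing IS over splits (n_splits) introduces artificial variance; computing it over the whole dataset and dropping the exponential yields a stable, interpretable score S(G).
Adversarial and optimization-based attacks can push IS to near-perfect values (for example, IS ≈ 900–986) without producing realistic images, which shows how easily the metric can be gamed.
IS is most meaningful when the Inception network was trained on the same dataset as the generator (for example, an ImageNet-trained network for ImageNet generators); applying IS to non-ImageNet data such as CIFAR-10 can lead to misleading conclusions.
Explicitly reporting controls for overfitting is essential, because memorization can raise IS.
The paper argues for a broader and more rigorous evaluation framework that goes beyond a single metric (for example, comparing several metrics and adapting the evaluation to the specific dataset).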
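
The findings on bounds and splits can be illustrated with a small experiment on synthetic data. This is a sketch under stated assumptions: the softmax of random logits stands in for real Inception predictions, and `softmax`, `inception_score`, and `split_inception_scores` are hypothetical helper names rather than functions from the paper or any library.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between each row and the marginal of `probs`."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def split_inception_scores(probs, n_splits=10):
    """Score each split separately, each against its own (smaller-sample) marginal."""
    probs = np.asarray(probs, dtype=np.float64)
    return [inception_score(part) for part in np.array_split(probs, n_splits)]

# Synthetic stand-in for p(y|x): 2000 samples over 1000 classes (assumption of this sketch).
rng = np.random.default_rng(0)
num_classes = 1000
probs = softmax(5.0 * rng.normal(size=(2000, num_classes)))

full = inception_score(probs)
per_split = split_inception_scores(probs, n_splits=10)

print("full-dataset IS :", full)
print("split mean / std:", np.mean(per_split), np.std(per_split))
print("bounds          : 1 <= IS <=", num_classes)
```

With equal-sized splits, each split is scored against a marginal estimated from fewer samples, so the average of the per-split log scores cannot exceed the full-dataset log score; the spread across splits reflects the choice of n_splits rather than any property of the generator.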


This review was generated by AI and checked by a human editor.