QUICK REVIEW

[Paper Review] HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

Sharon Zhou, Mitchell Gordon|arXiv (Cornell University)|Apr 1, 2019
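Visual perception and processing mechanisms|References: 59|Cited by: 72
One-Sentence Summary

HYPE establishes two benchmarks grounded in human perception (one time-limited, one untimed) that reliably measure the visual realism of generative models, so that models can be compared across datasets in a way that is cost-effective, reproducible, and able to separate them.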

ABSTRACT

Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE) a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.

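Motivation and Goals

  • Establish a gold-standard human benchmark for the visual realism of generative models, grounded in psychophysics.
  • Provide two evaluation variants (time-limited and untimed) that are reliable, separable, and cost-efficient.
  • Show that HYPE ranks models consistently across datasets and sampling methods.
  • Compare HYPE with automated metrics and demonstrate its use for tracking progress during training.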

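Proposed Method

  • Two HYPE variants: HYPE_time uses adaptive time constraints to find the perceptual threshold for telling real images from fake ones.
  • HYPE_infinity (HYPE_∞) measures the human error rate on 50 real and 50 fake images with no time constraint.
  • Evaluation sets are drawn from each model and from the real dataset (K = 5000 generated images per model, and 5000 real images per model).
  • Evaluators must pass a qualification task to ensure label quality; qualifying requires 65% accuracy on a 100-image task.
  • Bootstrapping is used to compute 95% confidence intervals and standard deviations, which supports the reliability of the scores.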
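
The untimed error rate and its bootstrap interval can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `error_rate` and `bootstrap_ci`, the choice to resample per-evaluator scores, the percentile method, the resample count, and the 30 synthetic evaluators in the demo are all assumptions made here for clarity.

```python
import random
from statistics import stdev


def error_rate(judgments):
    """Fraction of images one evaluator labeled incorrectly.

    `judgments` is a list of (is_real, judged_real) boolean pairs.
    """
    wrong = sum(1 for is_real, judged_real in judgments if is_real != judged_real)
    return wrong / len(judgments)


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over per-evaluator scores (an assumed design).

    Returns (point_estimate, bootstrap_std, (ci_low, ci_high)).
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample evaluators with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    low = means[int(alpha / 2 * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, stdev(means), (low, high)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in data, NOT results from the paper: 30 hypothetical
    # evaluators each judge 50 real and 50 fake images by guessing.
    evaluators = []
    for _ in range(30):
        judgments = [(i < 50, rng.random() < 0.5) for i in range(100)]
        evaluators.append(error_rate(judgments))
    point, std, (low, high) = bootstrap_ci(evaluators)
    print(f"error rate = {point:.1%} (std {std:.1%}, 95% CI [{low:.1%}, {high:.1%}])")
```

Two models are separable under this kind of analysis when their intervals do not overlap; a narrow interval across resamples indicates that the score does not depend much on which evaluators happened to be sampled.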

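Experimental Results

Research Questions

  • RQ1: Can a human benchmark grounded in psychophysics reliably distinguish perceptual realism across different GANs and sampling methods?
  • RQ2: Do the time-limited and untimed variants give consistent rankings and separable differences between models?
  • RQ3: How does HYPE correlate with, or diverge from, automated metrics such as FID, KID, and precision across datasets and models?
  • RQ4: Is HYPE scalable and cost-effective for large-scale model evaluation and for tracking training progress?
  • RQ5: How do the results extend from faces to objects and other datasets?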

Key Findings

  • HYPE_time and HYPE_infinity produce consistent model rankings for unconditional face generation across CelebA-64 and FFHQ-1024.
  • StyleGAN with truncation is the top performer on FFHQ-1024 with a HYPE_time of 363.2 ms and HYPE_infinity of 27.6%.
  • HYPE_infinity provides separable distinctions among models on CelebA-64, even when HYPE_time shows bottoming-out effects for some pairs.
  • HYPE_time and HYPE_infinity are strongly correlated with each other (rho = 1.0, p = 0.0), whereas their correlations with FID and KID are weak or vary across tasks.
  • On ImageNet-5, some classes exhibit separable differences among models, while harder classes show consistently low scores across models, indicating task difficulty impacts perceptual realism.
  • CIFAR-10 results show StyleGAN_trunc beginning to outperform earlier models in human perceptual realism; correlations with automated metrics are moderate or insignificant and vary by model class.
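
To make the rank-correlation finding concrete, the sketch below computes Spearman's rho from scratch in Python. It is illustrative only: the helper names `average_ranks` and `spearman_rho` are introduced here, and the scores in the demo are hypothetical placeholders for four unnamed models, not values reported in the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical placeholder scores for four models (NOT from the paper).
    time_threshold_ms = [150.0, 220.0, 300.0, 360.0]
    untimed_error_pct = [5.0, 12.0, 20.0, 27.0]
    print(spearman_rho(time_threshold_ms, untimed_error_pct))
```

A rho of 1.0 means the two measures order the models identically, even if the scores themselves are on different scales (milliseconds versus percent). With only a handful of models, the accompanying p-value should be read with care.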


This review was generated by AI and reviewed by human editors.