QUICK REVIEW

[Paper Review] The Role of ImageNet Classes in Fréchet Inception Distance

Tuomas Kynkäänniemi, Tero Karras|Aaltodoc (Aalto University)|Mar 11, 2022

Explainable Artificial Intelligence (XAI)|44 citations

TL;DR

This paper reveals that Fréchet Inception Distance (FID) largely reflects ImageNet class distributions, and that matching top-1 or top-N ImageNet predictions can drastically reduce FID without meaningful improvements in image quality, exposing vulnerabilities in FID.

ABSTRACT

Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.
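
For reference, the standard definition of FID is reproduced below; it is not quoted from this review. The symbols are assumptions introduced here: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the Inception-V3 features of the real and generated image sets, respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because only these first and second moments enter the formula, any change to the generated set that moves its feature mean and covariance toward those of the real set lowers FID, whether or not individual images look better.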

Motivation & Objective

Explain why FID sometimes disagrees with human judgments.
Visualize what features FID is actually using in generated images.
Test whether matching ImageNet class distributions can artificially lower FID.
Assess the impact of using ImageNet-pretrained discriminators on the reliability of FID.

Proposed method

Apply Grad-CAM to identify the image regions that most influence FID, by augmenting the real/generated feature statistics with individual samples.
Compare FID across feature spaces including pre-logits, logits, and multiple classifier backbones (Inception-V3, ResNet-50, SwAV, CLIP).
Perform Top-1 histogram matching to see if aligning ImageNet top predictions lowers FID.
Generalize to Top-N histogram matching by optimizing class-probability indicators to approximate fringe-feature alignment.
Optimize sampling weights to minimize FID under a fixed real data distribution and analyze the perceptual null space of FID.
Visualize results with heatmaps showing regions FID attends to and relate them to ImageNet Top-1 predictions.
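
The Top-1 histogram matching step above can be pictured with a minimal sketch. This is not the authors' code: it assumes Top-1 ImageNet labels have already been predicted for both image sets, and the function name `match_top1_histogram` and its parameters are hypothetical. It resamples generated images so that their Top-1 class histogram approximates that of the real set.

```python
import random
from collections import Counter, defaultdict


def match_top1_histogram(real_top1, gen_top1, num_samples, seed=0):
    """Pick indices into gen_top1 whose Top-1 class histogram
    approximates the histogram of real_top1 (illustrative only)."""
    rng = random.Random(seed)
    real_counts = Counter(real_top1)

    # Group generated image indices by their predicted Top-1 class.
    by_class = defaultdict(list)
    for idx, cls in enumerate(gen_top1):
        by_class[cls].append(idx)

    # Only classes that occur in the generated pool can be drawn from.
    usable = [c for c in real_counts if c in by_class]
    total = sum(real_counts[c] for c in usable)

    chosen = []
    for cls in usable:
        # Per-class quota proportional to the real-data frequency.
        quota = round(num_samples * real_counts[cls] / total)
        # Sample with replacement so a small class pool can fill its quota.
        chosen.extend(rng.choices(by_class[cls], k=quota))
    return chosen
```

Because the quotas are rounded, the number of returned indices may differ slightly from `num_samples`. FID would then be recomputed on the selected subset. The selected images are unchanged, so any drop in FID comes from the class mix alone.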

Experimental results

Research questions

RQ1: How does FID relate to the distribution of ImageNet classes in real versus generated images?
RQ2: Can FID be artificially reduced by aligning ImageNet class statistics without genuine perceptual improvement?
RQ3: What is the extent of the perceptual null space in FID when manipulating ImageNet-driven features?
RQ4: Do alternative feature spaces (ResNet-50, SwAV, CLIP) corroborate or counteract improvements seen in ImageNet-based FID?
RQ5: What are the practical implications for using ImageNet-pretrained discriminators in GAN setups regarding FID reliability?

Key findings

FID tends to focus on the regions that drive the ImageNet Top-1 prediction, which often lie outside the intended subject (e.g., away from the faces in FFHQ).
Simple Top-1 histogram matching between real and generated data consistently improves FID across datasets, but does not guarantee perceptual or human-evaluated improvement.
Resampling to match all fringe features can yield large reductions in FID, indicating a substantial perceptual null space tied to ImageNet features.
Top-N histogram matching shows FID improvements grow with N, driven by co-occurrence of top ImageNet classes, while CLIP-based FID remains largely unaffected, suggesting this improvement is tied to ImageNet pre-training.
Practical examples indicate that ImageNet-pretrained discriminators can make FID an unreliable indicator of true image quality, as in the FastGAN versus StyleGAN2 comparison, where FID is comparable but human evaluation rates FastGAN worse.
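
To make the Top-N finding above more concrete, here is a small sketch of one way to build a Top-N class histogram from classifier probabilities and to measure the gap between two image sets. It is an illustration under assumptions, not the paper's procedure: the helper names are hypothetical, and the L1 gap is chosen here only as a simple way to compare the two histograms.

```python
def top_n_indicator(probs, n):
    """Return a 0/1 vector marking the n highest-probability classes."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    marked = set(ranked[:n])
    return [1 if c in marked else 0 for c in range(len(probs))]


def top_n_histogram(prob_vectors, n):
    """Fraction of images in which each class appears among the Top-N."""
    num_classes = len(prob_vectors[0])
    totals = [0] * num_classes
    for probs in prob_vectors:
        for c, flag in enumerate(top_n_indicator(probs, n)):
            totals[c] += flag
    return [t / len(prob_vectors) for t in totals]


def histogram_gap(real_probs, gen_probs, n):
    """L1 distance between the Top-N histograms of two image sets
    (an assumed comparison measure, used here for illustration)."""
    real_hist = top_n_histogram(real_probs, n)
    gen_hist = top_n_histogram(gen_probs, n)
    return sum(abs(r - g) for r, g in zip(real_hist, gen_hist))
```

With `n=1` this reduces to the Top-1 histogram. Larger `n` also captures which classes tend to co-occur among the top predictions, which is the effect the finding attributes the growing FID improvement to.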

This review was created by AI and reviewed by human editors.