[Paper Review] Are GANs Created Equal? A Large-Scale Study
The paper conducts a large-scale, neutral comparison of state-of-the-art GANs, shows that many perform similarly given enough hyperparameter tuning, and proposes data sets on which precision and recall can be computed to supplement FID.
Generative adversarial networks (GANs) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced by Goodfellow et al. (2014).
Motivation & Objective
- Motivate fair, neutral comparison among GAN variants under practical computational budgets.
- Assess how hyperparameters, seeds, and data sets affect reported GAN performance.
- Evaluate robustness and limitations of current metrics (FID and IS) for GANs.
- Propose precision/recall based evaluations on controlled data manifolds to complement FID.
Proposed method
- Compare unconditional GANs using a common architecture and standardized training setup.
- Perform large-scale hyperparameter searches (wide then narrow) to assess sensitivity across models and data sets.
- Evaluate using Fréchet Inception Distance (FID), alongside precision, recall, and F1 on data sets where these can be approximated.
- Analyze bias, variance, and mode dropping effects on FID across data sets.
- Open-source experimental setup and implementations for reproducibility.
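FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below is only an illustration of that formula, not the paper's implementation. It assumes the feature vectors are already extracted and, to stay dependency-free, treats the covariances as diagonal, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. The helper name `fid_diagonal` is hypothetical.

```python
import math
from statistics import fmean, variance


def fid_diagonal(real_feats, fake_feats):
    """Frechet distance between per-dimension Gaussians fitted to two feature sets.

    Simplifying assumption (not the full FID): covariances are diagonal, so
    Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) becomes sum_i (sd_r[i] - sd_g[i])^2.
    """
    total = 0.0
    for i in range(len(real_feats[0])):
        r = [row[i] for row in real_feats]
        g = [row[i] for row in fake_feats]
        mean_term = (fmean(r) - fmean(g)) ** 2
        sd_term = (math.sqrt(variance(r)) - math.sqrt(variance(g))) ** 2
        total += mean_term + sd_term
    return total


# Toy features: identical sets give 0; shifting both dimensions by 1 adds 1 per dimension.
real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
shifted = [[x + 1.0, y + 1.0] for x, y in real]
print(fid_diagonal(real, real))     # 0.0
print(fid_diagonal(real, shifted))  # 2.0
```

A full implementation would keep the off-diagonal covariance terms and compute a matrix square root, which is why practical FID code relies on a linear-algebra library.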
Experimental results
Research questions
- RQ1: Do different GAN algorithms offer objective performance advantages when hyperparameters and budgets are controlled?
- RQ2: How sensitive are GANs to hyperparameters, seeds, and architecture under a fixed budget?
- RQ3: Is FID a robust metric for comparing GANs across data sets and encodings, and can precision/recall provide complementary insight?
- RQ4: Can we design data sets where precision and recall can be approximated to assess mode coverage and overfitting?
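To make the last question concrete, here is a toy sketch of precision, recall, and F1 on a manifold where distances are easy to compute. Everything in it is an assumption for illustration: the manifold is the unit circle in 2-D rather than any data set from the paper, precision is the share of generated points within `delta` of the circle, and recall is approximated by the share of held-out circle points that have a generated point within `delta` (a nearest-neighbour proxy chosen for simplicity).

```python
import math


def dist_to_circle(p):
    """Distance from a 2-D point to the unit circle (the toy manifold)."""
    return abs(math.hypot(p[0], p[1]) - 1.0)


def precision_recall_f1(generated, held_out, delta):
    # Precision: share of generated points lying within delta of the manifold.
    precision = sum(dist_to_circle(g) <= delta for g in generated) / len(generated)
    # Recall (proxy): share of held-out points with a generated neighbour within delta.
    covered = sum(any(math.dist(t, g) <= delta for g in generated) for t in held_out)
    recall = covered / len(held_out)
    denom = precision + recall
    f1 = 0.0 if denom == 0 else 2 * precision * recall / denom
    return precision, recall, f1


def circle_points(n, start=0.0, stop=2 * math.pi):
    """n points evenly spaced along an arc of the unit circle."""
    step = (stop - start) / n
    return [(math.cos(start + k * step), math.sin(start + k * step)) for k in range(n)]


held_out = circle_points(100)
# A "mode-dropping" generator: accurate samples, but only on the upper half circle.
half_circle = circle_points(200, 0.0, math.pi)
p, r, f1 = precision_recall_f1(half_circle, held_out, delta=0.05)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case every generated point sits on the manifold, so precision is perfect, while recall drops to roughly one half because the lower half of the circle is never covered. That is the kind of coverage gap a single scalar score can hide.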
Key findings
Best FID per model and data set (lower is better), reported as mean ± standard deviation:

| Data Set | MM GAN | NS GAN | LSGAN | WGAN | WGAN GP | DRAGAN | BEGAN | VAE |
|---|---|---|---|---|---|---|---|---|
| MNIST | 9.8±0.9 | 6.8±0.5 | 7.8±0.6 | 6.7±0.4 | 20.3±5.0 | 7.6±0.4 | 13.1±1.0 | 23.8±0.6 |
| FASHION | 29.6±1.6 | 26.5±1.6 | 30.7±2.2 | 21.5±1.6 | 24.5±2.1 | 27.7±1.2 | 22.9±0.9 | 58.7±1.2 |
| CIFAR | 72.7±3.6 | 58.5±1.9 | 87.1±47.5 | 55.2±2.3 | 55.8±0.9 | 69.8±2.0 | 71.4±1.6 | 155.7±11.6 |
| CELEBA | 65.6±4.2 | 55.0±3.3 | 53.9±2.8 | 41.3±2.0 | 30.0±1.0 | 42.3±3.0 | 38.9±0.9 | 85.7±3.8 |
- Most GAN variants achieve similar FID scores given sufficient hyperparameter optimization and random restarts.
- Best reported scores vary with data set and budget, suggesting no single algorithm dominates under fair comparison.
- FID shows robustness to some changes but is highly sensitive to mode dropping and encoding choice; it cannot detect overfitting.
- Precision, recall, and F1 can reveal diversity and coverage gaps not captured by FID or IS.
- Under small budgets, algorithmic differences are hard to distinguish; larger budgets can reverse the apparent ranking between models.
- Across a suite of data sets, NS GAN and WGAN often yield favorable F1 scores, while others show mixed results.
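The budget findings above can be illustrated by estimating, for a budget of k training runs, the expected best (minimum) FID a model reaches. The sketch below does this by bootstrap resampling from a pool of per-run scores. The scores are made-up placeholders, not values from the paper, and `expected_min_fid` is a hypothetical helper.

```python
import random
from statistics import fmean, stdev


def expected_min_fid(scores, budget, n_boot=2000, seed=0):
    """Bootstrap the mean and std of the best FID found with `budget` random runs."""
    rng = random.Random(seed)
    mins = [min(rng.choices(scores, k=budget)) for _ in range(n_boot)]
    return fmean(mins), stdev(mins)


# Hypothetical per-run FID scores for two models (illustrative only).
model_a = [30.0, 32.0, 35.0, 40.0, 45.0, 60.0, 80.0, 120.0]     # steady across runs
model_b = [22.0, 55.0, 70.0, 90.0, 110.0, 130.0, 150.0, 200.0]  # rare very good run

for k in (1, 5, 25):
    a_mean, _ = expected_min_fid(model_a, k)
    b_mean, _ = expected_min_fid(model_b, k)
    print(f"budget={k:>2}  A={a_mean:6.1f}  B={b_mean:6.1f}")
```

With these placeholder scores, model A looks better at a budget of one run, while model B wins once the budget is large enough to hit its rare good configuration. Reporting the score as a function of budget, rather than a single best number, keeps that dependence visible.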
This review was created by AI and reviewed by human editors.