[Paper Review] Skill Rating for Generative Models
This paper proposes a tournament-based evaluation framework for generative models using skill rating systems like Elo, where generators and discriminators compete in adversarial matches. The method enables tracking training progress via within-trajectory tournaments and comparing trained models via cross-model tournaments, showing strong correlation with ground-truth performance even for near-perfect generators.
We explore a new way to evaluate generative models using insights from evaluation of competitive games between human players. We show experimentally that tournaments between generators and discriminators provide an effective way to evaluate generative models. We introduce two methods for summarizing tournament outcomes: tournament win rate and skill rating. Evaluations are useful in different contexts, including monitoring the progress of a single model as it learns during the training process, and comparing the capabilities of two different fully trained models. We show that a tournament consisting of a single model playing against past and future versions of itself produces a useful measure of training progress. A tournament containing multiple separate models (using different seeds, hyperparameters, and architectures) provides a useful relative comparison between different trained GANs. Tournament-based rating methods are conceptually distinct from numerous previous categories of approaches to evaluation of generative models, and have complementary advantages and disadvantages.
Motivation & Objective
- To address the challenge of evaluating generative models in a way that is both computationally feasible and conceptually robust.
- To develop a method that enables monitoring the training progress of a single model over time without requiring external benchmarks.
- To provide a relative evaluation framework for comparing multiple trained generative models across different architectures, seeds, and hyperparameters.
- To leverage established skill rating systems (e.g., Elo, Glicko2) to summarize tournament outcomes into interpretable, scalable performance metrics.
- To demonstrate the method's applicability beyond standard image datasets, including unlabeled data and non-image modalities.
Proposed method
- Construct adversarial tournaments where each match involves a generator attempting to fool a discriminator into classifying fake samples as real.
- Use tournament win rate as a direct metric: the average proportion of generated samples misclassified as real by discriminators.
- Apply skill rating systems (e.g., Elo or Glicko2) to infer latent skill values for each generator based on match outcomes.
- Enable efficient rating of n players without running all n² matches by using probabilistic inference from partial match results.
- Use discriminators trained on real data and other generators to evaluate unseen generator samples, even when the generator is nearly perfect.
- Validate the method on both standard image datasets and non-standard modalities, including unlabeled data and toy distributions.
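The two summaries named above, tournament win rate and skill rating, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `fooled` table and its player names are hypothetical toy data, a match is simplified to a single sample that either fools the discriminator or does not, and the rating step uses the standard Elo formulas with the conventional K-factor of 32 and a starting rating of 1500. Rating from `n_matches` randomly drawn pairings, rather than every generator-discriminator pair, mirrors the idea of inferring skill from partial match results.

```python
import random


def tournament_win_rate(fooled_fraction):
    """Average, over discriminators, of the fraction of each generator's
    samples that a discriminator labels as real."""
    return {
        gen: sum(per_disc.values()) / len(per_disc)
        for gen, per_disc in fooled_fraction.items()
    }


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(gen_rating, disc_rating, gen_score, k=32.0):
    """One Elo update. gen_score is 1.0 if the generator fooled the
    discriminator and 0.0 otherwise; the two rating changes sum to zero."""
    delta = k * (gen_score - expected_score(gen_rating, disc_rating))
    return gen_rating + delta, disc_rating - delta


def rate_players(fooled_fraction, n_matches, seed=0, k=32.0):
    """Rate all players from randomly drawn matches instead of all pairings.

    Assumes every generator has an entry for every discriminator.
    """
    rng = random.Random(seed)
    gens = sorted(fooled_fraction)
    discs = sorted({d for per_disc in fooled_fraction.values() for d in per_disc})
    gen_ratings = {g: 1500.0 for g in gens}
    disc_ratings = {d: 1500.0 for d in discs}
    for _ in range(n_matches):
        g = rng.choice(gens)
        d = rng.choice(discs)
        # Simplified match: one sample, fooled with probability fooled_fraction[g][d].
        score = 1.0 if rng.random() < fooled_fraction[g][d] else 0.0
        gen_ratings[g], disc_ratings[d] = elo_update(
            gen_ratings[g], disc_ratings[d], score, k
        )
    return gen_ratings, disc_ratings


# Hypothetical toy data: fraction of each generator's samples judged real.
fooled = {
    "gen_early": {"disc_a": 0.05, "disc_b": 0.10},
    "gen_late": {"disc_a": 0.40, "disc_b": 0.50},
}
rates = tournament_win_rate(fooled)
gen_ratings, disc_ratings = rate_players(fooled, n_matches=200)
```

Win rate needs every pairing to be evaluated, whereas the rating loop only consumes whichever matches were actually played. A system such as Glicko2 would additionally track a rating uncertainty per player, which this sketch omits.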
Experimental results
Research questions
- RQ1: Can tournament-based evaluation provide a reliable and scalable metric for tracking the training progress of a single generative model?
- RQ2: Can skill rating systems effectively rank multiple trained generative models across different architectures, seeds, and hyperparameters?
- RQ3: How well do discriminators trained on one model generalize to judging samples from other models, including different GAN variants and non-GAN generators?
- RQ4: Can the method be applied to datasets without standardized embeddings or in non-image modalities?
- RQ5: How does the skill rating correlate with ground-truth metrics such as distributional similarity (e.g., covariance difference) in controlled settings?
Key findings
- Within-trajectory tournaments between a model’s own generator and discriminator snapshots at different training iterations provide a useful, continuous measure of training progress.
- Skill ratings derived from tournaments show strong correlation with ground-truth performance metrics, such as mean absolute difference in covariance matrices for a toy Gaussian problem.
- Discriminators trained against one generator can successfully judge samples from other generators, including generators with different architectures, which demonstrates that their judgments generalize.
- The method remains effective even when the generator is nearly perfect, as shown in experiments with a GAN trained to model a full-covariance Gaussian distribution.
- Tournament-based evaluation avoids the need for human raters and is reproducible, unlike human judgment-based metrics that vary across populations.
- The skill rating system allows inference of relative model performance across n players using significantly fewer than n² matches, enabling scalable evaluation.
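The ground-truth metric mentioned for the toy Gaussian problem, the mean absolute difference between covariance matrices, can be illustrated as follows. This is a sketch under assumptions: it compares the sample covariance of generated points with a known target covariance, averaging the absolute difference over all matrix entries. The function names and the exact averaging convention are illustrative and are not taken from the paper.

```python
def covariance(samples):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(x[i] for x in samples) / n for i in range(dim)]
    return [
        [
            sum((x[i] - means[i]) * (x[j] - means[j]) for x in samples) / (n - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]


def mean_abs_cov_difference(cov_a, cov_b):
    """Mean of the entrywise absolute differences between two square matrices."""
    dim = len(cov_a)
    total = sum(
        abs(cov_a[i][j] - cov_b[i][j]) for i in range(dim) for j in range(dim)
    )
    return total / (dim * dim)


# Hypothetical usage: two generated 2-D points against an identity target.
generated = [[0.0, 0.0], [2.0, 2.0]]
target_cov = [[1.0, 0.0], [0.0, 1.0]]
gap = mean_abs_cov_difference(covariance(generated), target_cov)
```

A lower value means the generator's second-order statistics are closer to the target, so a useful skill rating would be expected to rise as this gap shrinks.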
This review was created by AI and reviewed by human editors.