QUICK REVIEW

[Paper Review] Self-Consuming Generative Models Go MAD

Sina Alemohammad, Josue Casco-Rodriguez|arXiv (Cornell University)|Jul 4, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 15

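One-Sentence Summary

This paper analyzes autophagous (self-consuming) loops, in which models are trained on synthetic data produced by earlier generations. It shows that without enough fresh real data, model quality (precision) or diversity (recall) declines from generation to generation, a phenomenon the authors call Model Autophagy Disorder (MAD).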

ABSTRACT

Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous (self-consuming) loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), making analogy to mad cow disease.

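Motivation and Objectives

Motivate a formal study of autophagous loops, now that generated data is flooding into training pipelines.
Define and compare the data composition of three loop variants (fully synthetic, synthetic augmentation, fresh data) under biased sampling.
Set up metrics that quantify quality (precision) and diversity (recall), and track how they evolve across generations.
Show, through theory and experiments, that insufficient fresh real data leads to Model Autophagy Disorder (MAD).
Highlight how sampling bias affects the trade-off between quality and diversity in autophagous loops.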

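Proposed Method

Formalize the autophagous process as a sequence of models, each trained on a mixture of real and synthetic data.
Introduce three loop variants with precisely defined data composition: the fully synthetic loop, the synthetic augmentation loop, and the fresh data loop.
Introduce a general sampling bias parameter λ that models the quality-diversity trade-off in synthetic data generation.
Analyze loop behavior with a Gaussian toy model and with real generative models (DDPM, StyleGAN-2, WGAN, Normalizing Flows).
Use precision, recall, and Fréchet Inception Distance (FID) as proxies for quality and diversity, and the Wasserstein distance to measure distribution drift.
Report empirical results on the MNIST and FFHQ datasets to illustrate MAD dynamics under different loops and biases.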
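
To make the Gaussian toy model and the bias parameter λ concrete, here is a minimal sketch of a fully synthetic loop in one dimension. It is an illustration under stated assumptions, not the authors' code: the names `fully_synthetic_loop` and `w2_to_real` are hypothetical, the real distribution is assumed to be N(0, 1), and the bias is modeled by scaling the sampling standard deviation by `lam` (with `lam = 1` meaning unbiased sampling). Drift is measured with the closed-form 2-Wasserstein distance between one-dimensional Gaussians.

```python
import numpy as np


def fully_synthetic_loop(lam, generations=20, n_samples=1000, seed=0):
    """Simulate a 1-D Gaussian fully synthetic loop (toy assumption).

    Returns the fitted (mean, std) for every generation, starting with
    the assumed real distribution N(0, 1) as generation 0.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # assumed "real" distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Biased sampling: shrink the spread by lam (lam=1 is unbiased).
        samples = rng.normal(mu, lam * sigma, size=n_samples)
        # "Train" the next model by fitting a Gaussian to synthetic data only.
        mu, sigma = float(samples.mean()), float(samples.std(ddof=1))
        history.append((mu, sigma))
    return history


def w2_to_real(mu, sigma):
    """2-Wasserstein distance between N(mu, sigma^2) and the real N(0, 1)."""
    return float(np.sqrt(mu**2 + (sigma - 1.0) ** 2))


if __name__ == "__main__":
    for lam in (1.0, 0.8):
        hist = fully_synthetic_loop(lam)
        mu, sigma = hist[-1]
        print(f"lam={lam}: final std={sigma:.4f}, W2 to real={w2_to_real(mu, sigma):.4f}")
```

With `lam` below 1, the fitted standard deviation in this toy shrinks roughly geometrically, which mirrors the loss of diversity the review describes. With `lam = 1` the only source of drift is estimation noise from finite samples, so any degradation is much slower.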

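Experimental Results

Research Questions

RQ1: How do autophagous loops affect the convergence or divergence of generative models across generations?
RQ2: How does the loop type (fully synthetic, synthetic augmentation, fresh data) affect quality and diversity across generations?
RQ3: How does sampling bias (the quality-diversity trade-off) affect MAD behavior in these loops?
RQ4: Can fresh real data prevent degradation, and under what conditions does it succeed or fail to stabilize performance?
RQ5: Do the conclusions hold across model families and in data domains other than images?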

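Key Findings

In fully synthetic loops, both quality (precision) and diversity (recall) decline as the generations progress.
Introducing sampling bias can preserve quality at the cost of diversity: recall degrades faster and FID rises (lower FID is better).
Synthetic augmentation with a fixed real dataset delays MAD but does not prevent it, because quality or diversity eventually declines.
Fresh data loops, in which every generation includes a fresh real data component, can prevent degradation when each generation receives enough real data.
Across model families (DDPM, StyleGAN-2, WGAN, Normalizing Flows) and datasets (MNIST, FFHQ), autophagous loops exhibit MAD behavior when they lack enough fresh real data.
Sampling biased toward high-quality modes amplifies artifacts and reduces the diversity of synthetic data across generations.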
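
The contrast between a fully synthetic loop and a fresh data loop can be sketched with the same kind of one-dimensional Gaussian toy. This is a hedged illustration, not the paper's experiment: `run_loop`, `n_fresh`, and `n_synth` are hypothetical names, the real distribution is assumed to be N(0, 1), and the bias `lam` is assumed to scale the sampling standard deviation. The toy does not attempt to reproduce the synthetic augmentation result.

```python
import numpy as np


def run_loop(n_fresh, lam=0.8, generations=30, n_synth=500, seed=0):
    """Toy autophagous loop on 1-D Gaussians (illustrative assumption).

    Each generation trains on n_synth biased synthetic samples plus
    n_fresh new real samples; n_fresh=0 gives a fully synthetic loop.
    Returns the final fitted (mean, std).
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the assumed real N(0, 1)
    for _ in range(generations):
        # Synthetic data from the previous model, biased toward its mode.
        synth = rng.normal(mu, lam * sigma, size=n_synth)
        # Fresh real data drawn anew in every generation.
        fresh = rng.normal(0.0, 1.0, size=n_fresh)
        train = np.concatenate([synth, fresh])
        # "Train" the next model by fitting a Gaussian to the mixture.
        mu, sigma = float(train.mean()), float(train.std(ddof=1))
    return mu, sigma


if __name__ == "__main__":
    for n_fresh in (0, 500, 4500):
        mu, sigma = run_loop(n_fresh)
        print(f"n_fresh={n_fresh}: final mean={mu:.4f}, final std={sigma:.4f}")
```

In this toy, the fitted spread collapses toward zero when `n_fresh` is 0, while a fresh real component holds it at a stable value that moves closer to the real standard deviation as the share of fresh data grows. That is only a qualitative analogue of the "enough fresh real data" condition in the findings above.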


This review was generated by AI and checked by a human editor.