QUICK REVIEW

[Paper Review] Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser, Rylan Schaeffer|arXiv (Cornell University)|Apr 1, 2024

Semantic Web and Ontologies | Cited by 12

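One-Sentence Summary

The paper shows that model collapse can be avoided when each generation's synthetic data accumulate alongside the original real data. This holds empirically for language models, diffusion models, and image models (variational autoencoders), and theoretically in a linear regression framework. Replacing data leads to unbounded degradation, whereas accumulating data yields bounded error.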

ABSTRACT

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.

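Motivation and Objectives

Motivate and define model collapse in model-data feedback loops, where models are trained on their own outputs.
Empirically compare replacing data with accumulating data across multiple modalities and architectures.
Use a linear model framework to provide theoretical insight and to bound the test error under data accumulation.
Demonstrate that accumulating synthetic data alongside real data yields a finite upper bound on test error that is independent of the number of iterations.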

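Proposed Method

Pretrain sequences of language models (transformers) of varying sizes and sampling temperatures on TinyStories, comparing data replacement with data accumulation.
Train diffusion models on molecular conformations (GeoDiff on GEOM-Drugs) over multiple iterations, comparing replacing and accumulating data.
Train variational autoencoders on CelebA, comparing replacement and accumulation across iterations.
Use an analytically tractable linear model framework (Mobahi et al.; Dohmatob et al.) to derive test error expressions for the replace and accumulate settings.
Derive an upper bound on test error under accumulation and show that it is independent of the number of iterations n.
Provide ablations and controls to check that the results are robust to dataset size, number of training epochs, and generation temperature.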
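The replace and accumulate settings in the linear framework can be sketched numerically. The following is a minimal simulation written for this review, not the authors' code: the function names `simulate` and `mean_final_error`, the isotropic Gaussian features, and the default values of `T` (samples per iteration), `d` (dimension), and `sigma` (noise level) are all assumptions chosen for illustration. Each iteration fits ordinary least squares, labels fresh inputs with the fitted model plus noise, and then either discards the old data or keeps it.

```python
import numpy as np


def simulate(mode, n_iters=10, T=200, d=10, sigma=1.0, seed=0):
    """Return ||w_hat_n - w_star||^2 for n = 1..n_iters under one data policy.

    mode is "replace" (train only on the newest synthetic data) or
    "accumulate" (train on real data plus all synthetic data so far).
    All parameter names and defaults are illustrative assumptions.
    """
    if mode not in ("replace", "accumulate"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)  # hypothetical ground-truth weights

    # Iteration 1: real data drawn from the true linear model.
    X = rng.normal(size=(T, d))
    y = X @ w_star + sigma * rng.normal(size=T)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    errors = [float(np.sum((w_hat - w_star) ** 2))]

    for _ in range(n_iters - 1):
        # Synthetic data: fresh inputs labeled by the previous fitted model.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)
        if mode == "replace":
            X, y = X_new, y_new
        else:
            X = np.vstack([X, X_new])
            y = np.concatenate([y, y_new])
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        errors.append(float(np.sum((w_hat - w_star) ** 2)))
    return errors


def mean_final_error(mode, n_seeds=20, **kwargs):
    """Average the last-iteration error over several random seeds."""
    finals = [simulate(mode, seed=s, **kwargs)[-1] for s in range(n_seeds)]
    return float(np.mean(finals))


if __name__ == "__main__":
    for mode in ("replace", "accumulate"):
        print(mode, mean_final_error(mode))
```

With isotropic features, the squared distance to `w_star` stands in for the excess test error. Averaging over seeds matters because a single run is noisy; the qualitative gap between the two policies is what the sketch is meant to show, not any specific number.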

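Experimental Results

Research Questions

RQ1: In iterated model-data feedback loops, does accumulating synthetic data alongside real data prevent the degradation characteristic of model collapse?
RQ2: How do the effects of accumulating and replacing data differ across language, vision, and molecular data modalities?
RQ3: Can the linear model framework explain the difference in test error growth between the replace and accumulate settings?
RQ4: What are the theoretical bounds on test error when data are accumulated versus replaced?
RQ5: Are the empirical findings robust to hyperparameters, architectures, and datasets?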

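Key Findings

Replacing data causes test loss (cross-entropy) to increase with the number of iterations across all models and datasets studied.
In the language model, diffusion model, and VAE experiments, accumulating data keeps test loss flat or lower across iterations.
In the linear model framework, the test error under accumulation is bounded above by a constant that does not depend on the number of iterations (a factor of π^2/6 appears in the bound).
The linear analysis shows that replacing data makes test error grow linearly with the number of iterations, whereas accumulating data yields a finite bound.
The results hold across language model sizes (9M–125M parameters), diffusion models, and VAEs for images.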
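The contrast between linear growth and a finite bound can be written out as a short worked equation. This is a sketch under stated assumptions: C is a placeholder for the iteration-independent prefactor (which, in the paper's setting, depends on quantities such as the noise variance, the feature dimension, and the number of samples per iteration), and writing the accumulate error as a partial sum of 1/i^2 is this review's reading of where the π^2/6 factor comes from; the exact statement should be checked against the paper's theorem.

```latex
\begin{align*}
E_{\text{test}}^{\text{replace}}(n) &= C \cdot n, \\
E_{\text{test}}^{\text{accumulate}}(n) &= C \sum_{i=1}^{n} \frac{1}{i^{2}}
  \;\le\; C \sum_{i=1}^{\infty} \frac{1}{i^{2}}
  \;=\; C \cdot \frac{\pi^{2}}{6} \;\approx\; 1.645\, C .
\end{align*}
```

Under this reading, the error under replacement grows without limit as n increases, while under accumulation it never exceeds roughly 1.645 times the error of the first model fit on real data alone.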

This review was generated by AI and checked by a human editor.