QUICK REVIEW

[Paper Digest] The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Yanzhu Guo, Guokan Shang|arXiv (Cornell University)|Nov 16, 2023
Natural Language Processing Techniques|Cited by 8
One-Sentence Summary

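The paper studies how recursively fine-tuning large language models on synthetic text generated by their predecessors reduces the lexical, semantic, and syntactic diversity of their outputs across three NLP generation tasks. It also introduces new metrics for quantifying linguistic diversity that go beyond traditional performance metrics.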

ABSTRACT

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.

Motivation and Objectives

  • Measure the long-term impact on linguistic diversity of training LLMs on self-generated data.
  • Develop automatic metrics for lexical, semantic, and syntactic diversity beyond perplexity and standard BLEU-based measures.
  • Experimentally assess diversity changes through recursive finetuning across multiple natural language generation tasks.
  • Highlight potential risks to linguistic richness when relying on predecessor-generated data for training.

Proposed Method

  • Propose a recursive finetuning and sampling pipeline that iterates training on predecessor-produced synthetic data starting from human-authored data.
  • Define and compute metrics for lexical diversity (TTR, Distinct-2, Distinct-3, Self-BLEU), semantic diversity (embedding dispersion via Sentence-BERT), and syntactic diversity (dependency-tree graphs with Weisfeiler-Lehman kernel).
  • Evaluate diversity and perplexity across three tasks (news summarization, scientific abstract generation, story generation) under six recursive iterations.
  • Use OPT-350M as the base model, fine-tune it on each task's data, and generate synthetic data with nucleus and temperature sampling configured per task.
  • Compare model outputs at each iteration against human references and prior iterations to assess diversity decay.
  • Note: The evaluation emphasizes diversity over traditional task-performance metrics to expose long-term effects of synthetic-data training.
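
The lexical and semantic metrics listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the whitespace tokenizer, the function names, and the use of mean pairwise cosine distance as the "dispersion" measure are all assumptions made for illustration, and the embedding vectors are passed in as plain lists rather than computed with Sentence-BERT.

```python
from itertools import combinations
import math


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return text.lower().split()


def type_token_ratio(texts):
    # TTR: unique tokens divided by total tokens over the whole collection.
    tokens = [tok for t in texts for tok in tokenize(t)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def distinct_n(texts, n):
    # Distinct-n: unique n-grams divided by total n-grams.
    ngrams = []
    for t in texts:
        toks = tokenize(t)
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def cosine_distance(u, v):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def embedding_dispersion(vectors):
    # Assumed dispersion measure: mean pairwise cosine distance.
    # Needs at least two vectors.
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

In practice the vectors given to `embedding_dispersion` would come from a sentence encoder, and the same functions would be applied to the outputs of each iteration so that the values can be compared across iterations and against the human references.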

Experimental Results

Research Questions

  • RQ1: How can linguistic diversity be quantified automatically across lexical, semantic, and syntactic dimensions?
  • RQ2: Does recursive training on predecessor-generated text reduce linguistic diversity in model outputs across different NLG tasks?
  • RQ3: How do diversity trends differ across high-entropy versus low-entropy generation tasks?
  • RQ4: What is the relationship between perplexity and diversity in models trained on synthetic data?

Key Findings

  • Perplexity stays in a reasonable range, while diversity declines across iterations along all three dimensions (lexical, semantic, and syntactic).
  • Diversity decline is more pronounced in high-entropy tasks (story generation) than in low-entropy tasks (news summarization, scientific abstracts).
  • Syntactic diversity deteriorates significantly, often more than lexical or semantic diversity, indicating loss of structural variety.
  • Lexical diversity (TTR, Distinct-2/3, Self-BLEU) declines progressively with iterations, signaling reduced lexical variety.
  • Semantic diversity, measured via embedding dispersion, also decreases, but with task-dependent patterns that differ from the lexical and syntactic trends.
  • Model outputs increasingly converge toward the predecessor training distribution, risking linguistic richness over generations.
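
The convergence effect described above can be illustrated with a toy simulation. This is only an analogy under strong assumptions, not the paper's pipeline: instead of fine-tuning and sampling from a language model, each "generation" simply draws a same-size corpus with replacement from the previous generation's corpus. The function names and the seed are hypothetical. The point is structural: a type that is lost in one generation can never reappear, so the number of distinct types never increases.

```python
import random


def resample_generation(corpus, rng):
    # Toy stand-in for "fine-tune on the corpus, then sample from the model":
    # draw a corpus of the same size, with replacement, from the previous one.
    return [rng.choice(corpus) for _ in corpus]


def distinct_types_over_generations(corpus, generations, seed=0):
    # Track how many distinct types survive after each recursive step.
    rng = random.Random(seed)
    counts = [len(set(corpus))]
    for _ in range(generations):
        corpus = resample_generation(corpus, rng)
        counts.append(len(set(corpus)))
    return counts


if __name__ == "__main__":
    # Start from 200 distinct "words" and run six recursive iterations.
    print(distinct_types_over_generations(list(range(200)), generations=6))
```

A real model can also produce text it never saw in training, so this toy overstates how mechanical the loss is; it is meant only to show why recursion on finite samples tends to narrow a distribution.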

This digest was generated by AI and reviewed by human editors.