QUICK REVIEW

[Paper Review] The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys|arXiv (Cornell University)|Apr 22, 2019

Topic Modeling | 40 references | 1,100 citations

TL;DR

The paper analyzes decoding strategies for open-ended text generation and introduces Nucleus Sampling, which truncates the unreliable tail of the distribution to produce higher-quality and more diverse text than prior methods.

ABSTRACT

Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

Motivation & Objective

Expose neural text degeneration in open-ended generation.
Compare decoding strategies across distributional, perplexity, and human-evaluated criteria.
Propose and validate Nucleus Sampling as the preferred decoding method for long-form text.
Provide practical guidance on when and why to use nucleus sampling over alternatives.

Proposed method

Define the top-p (nucleus) vocabulary as the smallest set of most-probable tokens whose cumulative probability is at least p.
Renormalize the distribution over the nucleus and sample from it.
Compare nucleus sampling with top-k, temperature, beam search, and pure sampling using distributional metrics and HUSE, which combines human and statistical evaluation.
Evaluate on GPT-2 Large (762M parameters), a Generatively Pre-trained Transformer trained on WebText data.
Analyze perplexity, Zipf coefficient, Self-BLEU, repetition, and HUSE to assess quality and diversity.
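
The first two steps above (build the nucleus, then renormalize and sample) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nucleus_filter` and `nucleus_sample` are hypothetical, and `probs` is assumed to be a single next-token distribution that already sums to 1.

```python
import numpy as np

def nucleus_filter(probs, p=0.95):
    """Zero out everything outside the top-p nucleus and renormalize.

    Hypothetical helper for illustration; `probs` is assumed to be a
    1-D next-token distribution that sums to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs, kind="stable")   # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Index of the first position where the cumulative mass reaches p,
    # plus one, gives the size of the smallest qualifying prefix.
    size = int(np.searchsorted(cumulative, p, side="left")) + 1
    size = min(size, len(probs))                # guard against rounding near 1.0
    keep = order[:size]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def nucleus_sample(probs, p=0.95, rng=None):
    """Draw one token id from the renormalized nucleus."""
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=nucleus_filter(probs, p)))

# Toy distribution over four tokens (made-up numbers).
toy = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(toy, p=0.9))   # the last token falls outside the nucleus
```

With p=0.9 the toy nucleus holds the first three tokens, since two tokens cover only 0.8 of the mass; the fourth token can then never be sampled.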

Experimental results

Research questions

RQ1: Can maximization-based decoding (e.g., beam search) produce degenerate, repetitive text in open-ended generation?
RQ2: Does sampling from the model distribution with its unreliable tail truncated (nucleus sampling) yield text that is both high-quality and diverse?
RQ3: How do different decoding strategies compare to human text across distributional, statistical, and human-evaluated criteria?

Key findings

Method	Perplexity	Self-BLEU	Zipf Coefficient	Repetition %	HUSE
Human	12.38	0.31	0.93	0.28	-
Greedy	1.50	0.50	1.00	73.66	-
Beam, b=16	1.48	0.44	0.94	28.94	-
Stochastic Beam, b=16	19.20	0.28	0.91	0.32	-
Pure Sampling	22.73	0.28	0.93	0.22	0.67
Sampling, t=0.9	10.25	0.35	0.96	0.66	0.79
Top-k=40	6.88	0.39	0.96	0.78	0.19
Top-k=640	13.82	0.32	0.96	0.28	0.94
Top-k=40, t=0.7	3.48	0.44	1.00	8.86	0.08
Nucleus p=0.95	13.13	0.32	0.95	0.36	0.97
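
The Zipf Coefficient column refers to the exponent s of a power-law fit between word rank and word frequency, with s close to 1 corresponding to classic Zipfian text. As a rough illustration only, and not necessarily the estimation procedure used in the paper, the exponent can be approximated by a least-squares line fit in log-log space. The function name `zipf_coefficient` and the toy corpus are assumptions made for this sketch.

```python
from collections import Counter

import numpy as np

def zipf_coefficient(tokens):
    """Estimate the Zipf exponent s from a list of tokens.

    Fits log(frequency) = -s * log(rank) + c by least squares.
    Illustrative sketch; other estimators (e.g. maximum likelihood) exist.
    """
    counts = np.array(sorted(Counter(tokens).values(), reverse=True),
                      dtype=np.float64)
    ranks = np.arange(1, len(counts) + 1, dtype=np.float64)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return float(-slope)

# Toy corpus whose word frequencies are exactly proportional to 1/rank:
# counts are 60, 30, 20, 15, 12, 10 for ranks 1 through 6.
toy_tokens = [f"w{r}" for r in range(1, 7) for _ in range(60 // r)]
print(zipf_coefficient(toy_tokens))
```

Because the toy frequencies follow 1/rank exactly, the fitted exponent comes out at 1 up to floating-point error; real text would give a noisier fit.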

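Maximization-based decoding yields repetitive or generic text in open-ended generation: greedy decoding and beam search (b=16) show repetition rates of 73.66% and 28.94%, against 0.28% for human text.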
The model’s tail distribution is unreliable and should be truncated during generation.
Nucleus Sampling (p=0.95) comes closest to human perplexity (13.13 vs. 12.38) and achieves the highest HUSE score (0.97), indicating the best overall quality-diversity trade-off among the methods tested.
Nucleus Sampling is also near-human on Self-BLEU (0.32 vs. 0.31) and Zipf coefficient (0.95 vs. 0.93), with a repetition rate of 0.36% against 0.28% for human text.
Top-k sampling and temperature have context-dependent drawbacks, while pure sampling can be incoherent.
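
One way to see the context dependence of top-k is to compare how many candidates each method keeps when the next-token distribution is sharply peaked versus nearly flat. The sketch below uses two made-up 64-token distributions and hypothetical helper names (`nucleus_size`, `top_k_mass`); the values of k and p are chosen for illustration and are not the paper's settings.

```python
import numpy as np

def nucleus_size(probs, p):
    """Number of tokens in the smallest set whose mass is at least p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    size = int(np.searchsorted(cumulative, p, side="left")) + 1
    return min(size, len(probs))

def top_k_mass(probs, k):
    """Probability mass covered by the k most probable tokens."""
    return float(np.sort(probs)[::-1][:k].sum())

# Peaked context: one token holds 0.92, the other 63 share the rest.
peaked = np.concatenate([[0.92], np.full(63, 0.08 / 63)])
# Flat context: 64 equally likely tokens.
flat = np.full(64, 1 / 64)

for name, dist in [("peaked", peaked), ("flat", flat)]:
    print(name,
          "| nucleus size at p=0.9:", nucleus_size(dist, 0.9),
          "| mass kept by top-k, k=8:", round(top_k_mass(dist, 8), 4))
```

In this toy setting a fixed k=8 admits seven low-probability tail tokens in the peaked context yet keeps only 12.5% of the mass in the flat one, while the nucleus shrinks to a single token in the first case and grows to 58 tokens in the second.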

This review was created by AI and reviewed by human editors.