QUICK REVIEW

[Paper Review] Unifying Human and Statistical Evaluation for Natural Language Generation

Tatsunori Hashimoto, Hugh Zhang|arXiv (Cornell University)|Apr 4, 2019

Topic Modeling | References: 38 | Cited by: 41

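One-Sentence Summary

Introduces HUSE, a unified evaluation framework that combines human evaluation with model probabilities to jointly assess the quality and diversity of NLG output, and analyzes the resulting tradeoffs on tasks such as summarization, story generation, dialogue, and language modeling.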

ABSTRACT

How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.

Motivation and Objectives

Motivate the need for evaluating both quality and diversity in NLG rather than relying on human evaluations or perplexity alone.
Propose a theoretically grounded framework where the optimal discriminator between model and reference distributions determines a unified evaluation metric.
Show how to estimate this metric in practice by combining human judgments with model probabilities (HUSE).
Decompose HUSE into quality (HUSE-Q) and diversity (HUSE-D) components to analyze tradeoffs.
Empirically validate HUSE on language modeling, storytelling, chit-chat dialogue, and summarization tasks, examining annealing and other generation techniques.
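
The optimal-discriminator idea in the list above can be made concrete with a short derivation. This is a sketch under stated assumptions, not a reproduction of the paper's proof: the label z, the balanced prior, and the convention that total variation (TV) carries a factor of 1/2 are notation chosen here for illustration.

```latex
\begin{align*}
z &\sim \mathrm{Bernoulli}(1/2), \qquad
y \mid x, z \sim
\begin{cases}
p_{\mathrm{ref}}(\cdot \mid x) & z = 1 \\
p_{\mathrm{model}}(\cdot \mid x) & z = 0
\end{cases} \\
L^{*} &= 2 \inf_{f} \Pr\bigl[f(x, y) \neq z\bigr] \\
&= \sum_{x} p(x) \sum_{y} \min\bigl(p_{\mathrm{ref}}(y \mid x),\, p_{\mathrm{model}}(y \mid x)\bigr) \\
&= 1 - \sum_{x} p(x)\, \tfrac{1}{2} \sum_{y} \bigl|p_{\mathrm{ref}}(y \mid x) - p_{\mathrm{model}}(y \mid x)\bigr| \\
&= 1 - \mathbb{E}_{x}\bigl[\mathrm{TV}\bigl(p_{\mathrm{ref}}(\cdot \mid x),\, p_{\mathrm{model}}(\cdot \mid x)\bigr)\bigr]
\end{align*}
```

The second line follows because the best classifier picks whichever of the two distributions assigns y the higher probability, and the third uses min(a, b) = (a + b - |a - b|) / 2. Under these conventions the score is 1 when the model matches the reference exactly and 0 when the two have disjoint support.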

Proposed Method

Define L* as twice the minimal discriminative error between reference and model distributions, linking it to total variation distance.
Show that the two-dimensional statistic (p_ref(y|x), p_model(y|x)) is sufficient for the optimal discriminator, and use it to characterize that discriminator.
Introduce phi_huse(x,y) = [log p_model(y|x)/len(y), HJ(x,y)], where HJ(x,y) is a crowdworker-derived typicality judgment that stands in for p_ref(y|x).
Estimate the discriminator error with a 16-nearest-neighbor (16-NN) classifier on samples drawn from the reference and the model, which makes L(phi_huse) computable in practice.
Decompose HUSE into HUSE-Q (the quality component, based on human judgments) and HUSE-D (the diversity component), and analyze how they interact.
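
The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `knn_loo_error` and `huse_scores` are hypothetical, and the leave-one-out protocol, Euclidean distance, feature standardization, and tie-breaking rule are assumptions made here. The HUSE-D formula (1 + HUSE - HUSE-Q) is likewise an assumed reading of the decomposition, since the summary above does not spell it out.

```python
import numpy as np


def knn_loo_error(features, labels, k=16):
    """Leave-one-out error rate of a k-NN majority-vote classifier."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pairwise squared Euclidean distances between all samples.
    diffs = features[:, None, :] - features[None, :, :]
    dist = (diffs ** 2).sum(axis=-1)
    # A point must not vote for itself (leave-one-out).
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Fraction of neighbors labeled 1; ties (exactly 0.5) fall to class 0.
    votes = labels[neighbors].mean(axis=1)
    preds = (votes > 0.5).astype(int)
    return float((preds != labels).mean())


def huse_scores(logp_per_token, hj, is_reference, k=16):
    """Return (HUSE, HUSE-Q, HUSE-D) estimates from per-sample features.

    logp_per_token: log p_model(y|x) / len(y) for each sample
    hj:             human typicality judgment for each sample
    is_reference:   1 if the sample came from the reference, 0 if from the model
    """
    x = np.column_stack([logp_per_token, hj]).astype(float)
    # Assumption: standardize so neither feature dominates the distance.
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)
    huse = 2.0 * knn_loo_error(x, is_reference, k)
    # Quality component: discriminate using human judgments only.
    huse_q = 2.0 * knn_loo_error(x[:, 1:], is_reference, k)
    # Assumed decomposition; not stated in the summary above.
    huse_d = 1.0 + huse - huse_q
    return huse, huse_q, huse_d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    # Synthetic stand-ins for real features: model and reference look alike.
    logp = rng.normal(-3.0, 1.0, size=2 * n)
    hj = rng.normal(4.0, 1.0, size=2 * n)
    labels = np.array([1] * n + [0] * n)
    print(huse_scores(logp, hj, labels))
```

With finite samples the estimate is noisy and can land slightly above 1 even when the model and reference are indistinguishable, so scores should be read as estimates rather than exact error rates.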

Experimental Results

Research Questions

RQ1: How can we jointly quantify quality and diversity in NLG beyond traditional evaluation metrics?
RQ2: Can we approximate the optimal discriminator's error using a two-dimensional statistic involving model probabilities and crowd-sourced typicality judgments?
RQ3: Do standard quality-improving techniques (e.g., temperature annealing) hurt diversity, and vice versa?
RQ4: How do HUSE, HUSE-Q, and HUSE-D behave across tasks with varying entropy (language modeling, dialogue, summarization, storytelling)?
RQ5: What insights about model failures (quality vs. diversity) can HUSE reveal that human evaluation alone cannot?

Key Findings

HUSE detects diversity defects that human evaluation alone can miss.
Annealing to improve sample quality can decrease HUSE by reducing diversity, revealing tradeoffs between quality and diversity.
HUSE provides a two-dimensional assessment that can distinguish between quality and diversity issues across tasks such as summarization, story generation, dialogue, and language modeling.
Human judgments (HJ) correlate strongly with reference distribution likelihood, enabling practical estimation of the reference probability.
The framework yields interpretable diagnostics and visualizations of model failure modes (quality vs. diversity) at the sample level.
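
The annealing finding above rests on a simple mechanism: sampling at a temperature below 1 concentrates probability on the most likely outputs, which lowers the entropy of the sampling distribution. The toy example below illustrates that mechanism on a made-up five-outcome categorical distribution. The distribution, the temperatures, and the helper names `anneal` and `entropy` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def anneal(probs, temperature):
    """Return the distribution proportional to probs ** (1 / temperature).

    Assumes every entry of probs is strictly positive.
    """
    logits = np.log(np.asarray(probs, dtype=float)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    base = np.array([0.5, 0.25, 0.15, 0.07, 0.03])  # hypothetical distribution
    for t in (1.0, 0.7, 0.4):
        annealed = anneal(base, t)
        print(f"T={t}: top prob={annealed.max():.3f}, entropy={entropy(annealed):.3f}")
```

As the temperature drops, the top outcome absorbs more mass and the entropy falls. A human rater judging single samples would see only the gain in typical quality, whereas a discriminator that also sees model probabilities can pick up on the lost diversity.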

This review was generated by AI and reviewed by a human editor.