QUICK REVIEW

[Paper Review] The price of debiasing automatic metrics in natural language evaluation

Arun Tejasvi Chaganty, Stephen Mussmann | arXiv (Cornell University) | Jul 6, 2018
Topic Modeling | References: 30 | Cited by: 43
One-sentence summary

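The paper proposes a control variates estimator that combines automatic metrics with human judgments to obtain an unbiased evaluation at lower cost, and proves that the estimator is minimax optimal for fixed variance parameters.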

ABSTRACT

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks---the automatic metric and the prompt shown to human evaluators---both of which need to be improved to obtain greater cost savings.

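Motivation and objectives

  • Motivation: automatic evaluation metrics are biased, while unbiased human evaluation is expensive, so a cheaper unbiased evaluation is needed.
  • Introduce a control variates method that combines automatic metrics with human judgments to reduce variance.
  • Prove the minimax optimality of the estimator for fixed variance and correlation.
  • Quantify data efficiency and cost savings across tasks and prompts.
  • Propose practical guidelines for improving evaluation prompts and metrics to increase cost savings.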

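Proposed method

  • Define the evaluation problem in terms of the human score Y(z) and the automatic metric g(z), with f(z) = E[Y(z)] denoting the expected human judgment of output z.
  • Construct a control variates estimator hat_mu_cv = (1/n) sum_i [ y^(i) - alpha g(z^(i)) ], where alpha = Cov(f(z), g(z)).
  • Standardize g to zero mean and unit variance: the zero mean keeps the estimator unbiased, and the unit variance lets alpha be a plain covariance.
  • Var(hat_mu_cv) = (1/n)( sigma_f^2(1 - rho^2) + sigma_a^2 ), where sigma_f^2 is the variance of f(z), sigma_a^2 is the annotator variance, and rho is the correlation between f(z) and g(z).
  • Show minimax optimality among unbiased estimators for given sigma_f^2, sigma_a^2, and alpha.
  • Provide practical implementation guidance, including a plug-in estimate of alpha and sample-size planning.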
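
The estimator above can be sketched in a few lines of Python. This is a minimal illustration under assumptions the summary does not state: the metric g is taken to be cheap enough to score on a large pool of system outputs, so its mean and variance are treated as known, and alpha is replaced by a sample covariance (a plug-in estimate). The names `control_variates_estimate`, `g_all`, and `simulate_variances`, and the Gaussian synthetic data, are hypothetical and not taken from the paper.

```python
import numpy as np


def control_variates_estimate(y, g_labeled, g_all):
    """Estimate the mean human score, using metric g as a control variate.

    y         : one noisy human judgment per labeled example
    g_labeled : automatic metric scores on those same examples
    g_all     : metric scores on a large pool (assumed cheap to compute)
    """
    y = np.asarray(y, dtype=float)
    g_all = np.asarray(g_all, dtype=float)
    # Standardize g with pool statistics, so that E[g] = 0 and Var[g] = 1.
    g = (np.asarray(g_labeled, dtype=float) - g_all.mean()) / g_all.std()
    # Plug-in alpha: sample covariance between judgments and standardized g.
    alpha = np.cov(y, g)[0, 1]
    return float(np.mean(y - alpha * g))


def simulate_variances(n=200, trials=2000, rho=0.6, sigma_f=1.0, sigma_a=1.0, seed=0):
    """Compare the plain mean with the control variates estimate on synthetic data."""
    rng = np.random.default_rng(seed)
    pool = 100_000
    g_all = rng.normal(size=pool)
    # Synthetic assumption: f has variance sigma_f^2 and correlation rho with g.
    f_all = sigma_f * (rho * g_all + np.sqrt(1.0 - rho**2) * rng.normal(size=pool))
    plain, cv = [], []
    for _ in range(trials):
        idx = rng.integers(0, pool, size=n)
        y = f_all[idx] + sigma_a * rng.normal(size=n)  # add annotator noise
        plain.append(y.mean())
        cv.append(control_variates_estimate(y, g_all[idx], g_all))
    return float(np.var(plain)), float(np.var(cv))


if __name__ == "__main__":
    var_plain, var_cv = simulate_variances()
    print(f"plain mean variance:       {var_plain:.5f}")
    print(f"control variates variance: {var_cv:.5f}")
```

For these synthetic settings the variance formula gives (1/200)(1 + 1) = 0.0100 for the plain mean and (1/200)(0.64 + 1) = 0.0082 for the control variates estimator. The simulated values should land near these, up to Monte Carlo noise and the small extra error from estimating alpha. Note that standardizing g with the labeled sample itself, rather than with pool statistics, would make the correction term sum to zero and collapse the estimate back to the plain mean.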

Experimental results

Research questions

  • RQ1: Can automatic metrics be safely leveraged to reduce the cost of human evaluation without biasing the result?
  • RQ2: How much cost reduction (data efficiency) is achievable given annotator variance and correlation between human judgments and the automatic metric?
  • RQ3: What are the fundamental bottlenecks in achieving larger cost savings?
  • RQ4: How should evaluation prompts and metrics be improved to maximize efficiency?
  • RQ5: Is the proposed estimator minimax optimal under known variance and correlation parameters?

Key findings

  • The control variates estimator yields an unbiased evaluation whose variance, relative to plain averaging of human judgments, shrinks by a factor that depends on rho (the correlation between human judgments and the metric) and gamma (the annotator variance relative to sigma_f^2).
  • With current metrics and prompts, the cost reduction is only 7% to 13%, i.e., a data efficiency of DE ≈ 1.08–1.15.
  • Optimality: among all unbiased estimators, for fixed sigma_f^2, sigma_a^2, and alpha, hat_mu_cv attains the minimax (lowest worst-case) variance.
  • Data efficiency improves substantially only when annotator variance is reduced and the automatic metric correlates more strongly with human judgments; both need to improve.
  • Post-editing prompts can reduce annotator variance by a factor of about three compared to Likert-scale prompts.
  • ROUGE-L and post-editing prompts contribute to better data efficiency than VecSim or Likert prompts.
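
To make the interaction between rho and gamma concrete, the sketch below computes data efficiency from the variance expression in the method section. It rests on two assumptions not spelled out in this summary: that the baseline is the plain mean of human judgments, with variance (sigma_f^2 + sigma_a^2)/n, and that gamma means sigma_a^2 / sigma_f^2. Dividing the two variances then gives DE = (1 + gamma) / (1 - rho^2 + gamma). The function names and input values are hypothetical, chosen only for illustration.

```python
import math


def data_efficiency(rho, gamma):
    """Ratio of plain-mean variance to control variates variance.

    rho   : correlation between the mean human judgment f and the metric g
    gamma : sigma_a^2 / sigma_f^2 (assumed definition of gamma)
    """
    return (1.0 + gamma) / (1.0 - rho**2 + gamma)


def cost_reduction(rho, gamma):
    """Fraction of human annotations saved at equal variance: 1 - 1/DE."""
    return 1.0 - 1.0 / data_efficiency(rho, gamma)


def annotations_needed(target_std, sigma_f, sigma_a, rho=0.0):
    """Smallest n whose estimator standard deviation is at most target_std.

    rho = 0 corresponds to plain averaging of human judgments.
    """
    per_sample_var = sigma_f**2 * (1.0 - rho**2) + sigma_a**2
    return math.ceil(per_sample_var / target_std**2)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show how rho and gamma interact.
    for rho, gamma in [(0.3, 1.0), (0.3, 0.1), (0.9, 1.0), (0.9, 0.1)]:
        print(rho, gamma, round(data_efficiency(rho, gamma), 2))
```

Under this formula, DE = 1.08 corresponds to a cost reduction of 1 - 1/1.08, about 7%, and DE = 1.15 to about 13%, which matches the range quoted above. The formula also shows why both bottlenecks matter: when gamma is large, even a high rho changes DE very little, and when rho is small, lowering gamma alone does not help either.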

This review was generated by AI and reviewed by a human editor.