QUICK REVIEW

[Paper Review] Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation

Shikhar Sharma, Layla El Asri|arXiv (Cornell University)|Jun 29, 2017
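Topic Modeling | References: 19 | Cited by: 183
One-Sentence Summary

The paper empirically evaluates how well unsupervised automatic metrics (BLEU, METEOR, ROUGE, and embedding-based metrics) correlate with human judgment in task-oriented dialogue generation. It finds that METEOR generally aligns best and that multiple reference sentences improve the correlation. Several NLG models are evaluated on the DSTC2 and Restaurants datasets.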

ABSTRACT

Automated metrics such as BLEU are widely used in the machine translation literature. They have also been used recently in the dialogue community for evaluating dialogue response generation. However, previous work in dialogue response generation has shown that these metrics do not correlate strongly with human judgment in the non task-oriented dialogue setting. Task-oriented dialogue responses are expressed on narrower domains and exhibit lower diversity. It is thus reasonable to think that these automated metrics would correlate well with human judgment in the task-oriented setting where the generation task consists of translating dialogue acts into a sentence. We conduct an empirical study to confirm whether this is the case. Our findings indicate that these automated metrics have stronger correlation with human judgments in the task-oriented setting compared to what has been observed in the non task-oriented setting. We also observe that these metrics correlate even better for datasets which provide multiple ground truth reference sentences. In addition, we show that some of the currently available corpora for task-oriented language generation can be solved with simple models and advocate for more challenging datasets.

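Research Motivation and Goals

  • Assess how well unsupervised automatic metrics correlate with human judgment in task-oriented dialogue generation.
  • Compare word-overlap metrics with embedding-based metrics on two task-oriented datasets.
  • Assess how model complexity and dataset characteristics affect the agreement between metrics and human judgment.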

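Proposed Method

  • Survey automatic metrics (BLEU, METEOR, ROUGE, Skip-Thought, Embedding Average, Vector Extrema, Greedy Matching) and compute their correlation with human judgment on the DSTC2 and Restaurants datasets.
  • Implement and compare several NLG models (Random, LSTM, delex-scLSTM, hierarchical-lex-delex-scLSTM), each trained to translate dialogue acts into natural language.
  • Generate outputs with beam-search decoding that includes a slot error rate penalty, so that the models are compared fairly.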
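The three embedding-based metrics named above can be sketched in a few lines. This is a minimal illustration of their commonly used definitions, not the paper's implementation: the toy 3-dimensional vectors in `TOY_VECTORS` are invented for the example, and the function names are hypothetical. A real evaluation would load pretrained word embeddings instead.

```python
import math

# Hypothetical toy word vectors, for illustration only.
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.2],
    "restaurant": [0.9, 0.1, 0.3],
    "place": [0.8, 0.2, 0.3],
    "serves": [0.2, 0.9, 0.1],
    "offers": [0.3, 0.8, 0.2],
    "thai": [0.1, 0.3, 0.9],
    "food": [0.2, 0.4, 0.7],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embed(tokens, vectors):
    """Look up each known token; out-of-vocabulary tokens are skipped."""
    return [vectors[t] for t in tokens if t in vectors]


def embedding_average(hyp, ref, vectors):
    """Cosine between the mean word vectors of the two sentences."""
    h, r = embed(hyp, vectors), embed(ref, vectors)
    if not h or not r:
        return 0.0
    mean = lambda vecs: [sum(col) / len(vecs) for col in zip(*vecs)]
    return cosine(mean(h), mean(r))


def vector_extrema(hyp, ref, vectors):
    """Cosine between sentence vectors built from per-dimension extreme values."""
    h, r = embed(hyp, vectors), embed(ref, vectors)
    if not h or not r:
        return 0.0
    # For each dimension keep the value with the largest magnitude.
    extrema = lambda vecs: [max(col, key=abs) for col in zip(*vecs)]
    return cosine(extrema(h), extrema(r))


def greedy_matching(hyp, ref, vectors):
    """Average best-match cosine per word, symmetrized over both directions."""
    h, r = embed(hyp, vectors), embed(ref, vectors)
    if not h or not r:
        return 0.0
    one_way = lambda src, tgt: sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    return (one_way(h, r) + one_way(r, h)) / 2


if __name__ == "__main__":
    hyp = "the restaurant serves thai food".split()
    ref = "the place offers thai food".split()
    for metric in (embedding_average, vector_extrema, greedy_matching):
        print(metric.__name__, round(metric(hyp, ref, TOY_VECTORS), 3))
```

Unlike word-overlap metrics, all three give partial credit for near-synonyms such as "restaurant" and "place", provided their vectors are close.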

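Experimental Results

Research Questions

  • RQ1: Do unsupervised automatic metrics correlate with human judgment in task-oriented dialogue NLG more strongly than has been observed in the non-task-oriented setting?
  • RQ2: Within this domain, which automatic metrics correlate best with human evaluation?
  • RQ3: Do multiple reference sentences improve the correlation between automatic metrics and human judgment?
  • RQ4: Are complex neural decoding architectures necessary to obtain high metric scores on task-oriented NLG benchmarks?
  • RQ5: Are task-oriented datasets such as DSTC2 and Restaurants challenging enough for current NLG models and metrics?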

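Main Findings

  • Automatic metrics correlate positively with human judgment in the task-oriented setting, and more strongly than has been observed in the non-task-oriented setting.
  • METEOR consistently correlates best with human evaluation on both datasets.
  • For most models, embedding-based sentence similarity metrics correlate with human judgment about as well as word-overlap metrics do.
  • Multiple reference sentences (as in Restaurants) improve the correlation between automatic metrics and human judgment.
  • Simple models (such as an LSTM with beam search) achieve high scores on the automatic metrics, which suggests that these datasets may not be very challenging; the authors call for larger and more complex benchmarks.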
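To make "correlation with human judgment" and the multi-reference effect concrete, here is a minimal sketch under stated assumptions. It assumes Spearman's rank correlation (this summary does not say which coefficient the paper reports), uses a unigram F1 as a stand-in for a real metric such as METEOR, and scores against multiple references by taking the best-matching one, which is one common convention rather than the paper's confirmed procedure. All function names and the toy ratings are hypothetical.

```python
import math
from collections import Counter


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


def unigram_f1(hyp, ref):
    """Stand-in word-overlap metric (not BLEU, METEOR, or ROUGE)."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_ref_score(hyp, refs, metric):
    """Score a hypothesis against its best-matching reference."""
    return max(metric(hyp, ref) for ref in refs)


if __name__ == "__main__":
    # Toy data: three system outputs, their references, and invented human ratings.
    outputs = [
        "the restaurant serves thai food".split(),
        "it serves food".split(),
        "the phone number is unknown".split(),
    ]
    references = [
        ["the place offers thai food".split(), "the restaurant serves thai food".split()],
        ["the restaurant serves thai food".split()],
        ["the restaurant serves thai food".split()],
    ]
    human_ratings = [5, 3, 1]
    scores = [multi_ref_score(o, refs, unigram_f1) for o, refs in zip(outputs, references)]
    print("metric scores:", scores)
    print("spearman rho:", spearman(scores, human_ratings))
```

With several references, a valid paraphrase is more likely to match at least one of them, so the metric penalizes acceptable outputs less often; this is one plausible reading of why the correlation improves on datasets such as Restaurants.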

This review was generated by AI and checked by a human editor.