QUICK REVIEW

[Paper Review] On Evaluating and Comparing Conversational Agents

Anu Venkatesh, Chandra Khatri|arXiv (Cornell University)|Jan 11, 2018

Topic Modeling | References: 20 | Cited by: 50

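Summary in Brief

This paper proposes a comprehensive, multi-metric evaluation framework for non-goal-oriented conversational agents, choosing metrics that correlate well with human judgment in order to reduce subjectivity. Applied throughout the Alexa Prize competition, the framework supports automated, fine-grained evaluation across millions of conversations and serves as a reasonable proxy for human evaluation.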

ABSTRACT

Conversational agents are exploding in popularity. However, much work remains in the area of non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar university competition where sixteen selected university teams built conversational agents to deliver the best social conversational experience. Alexa Prize provided the academic community with the unique opportunity to perform research with a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgement. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. We show that these metrics can be used as a reasonable proxy for human judgment. We provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout the Alexa Prize competition. To our knowledge, to date it is the largest setting for evaluating agents with millions of conversations and hundreds of thousands of ratings from users. We believe that this work is a step towards an automatic evaluation process for conversational AIs.

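Motivation and Objectives

Address the subjectivity of evaluating non-goal-oriented dialogue systems, which hinders progress in conversational AI.
Develop an objective, automated evaluation strategy that reduces reliance on human judgment while staying consistent with human preferences.
Enable fine-grained, scalable analysis of conversational agents using metrics that capture different aspects of conversation quality.
Provide a unified mechanism for ranking the top-performing agents on objective metrics, suitable for large-scale, real-world settings.
Advance the state of automatic evaluation for conversational AI by applying the approach in a large-scale live deployment with millions of user interactions.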

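Proposed Method

Design a set of automated metrics that correlate well with human judgments of conversation quality, focusing on coherence, relevance, and engagement.
Select metrics by how well they predict human ratings, so that subjective aspects of a conversation are reflected in an objective way.
Apply the metrics to analyze conversational agents across the millions of real user interactions collected during the Alexa Prize competition.
Use a weighted aggregation mechanism to unify the individual metrics into a single interpretable score for ranking agents.
Validate the framework by showing that metric scores correlate well with human ratings across a variety of conversation types.
Use the large Alexa Prize dataset to check the robustness and generalizability of the evaluation method.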
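The selection and aggregation steps listed above can be illustrated with a minimal sketch. This summary does not give the paper's metric definitions, selection threshold, or weights, so everything concrete here is an assumption made for illustration: the metric names and toy values, the 0.5 Pearson threshold, and the choice to reuse each metric's correlation as its weight. The function names `pearson`, `select_metrics`, and `unified_score` are hypothetical, not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_metrics(metric_values, human_ratings, threshold=0.5):
    """Keep metrics whose correlation with human ratings reaches the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    Returns {metric_name: correlation}.
    """
    selected = {}
    for name, values in metric_values.items():
        r = pearson(values, human_ratings)
        if r >= threshold:
            selected[name] = r
    return selected


def unified_score(agent_metrics, weights):
    """Weighted average of one agent's metric values, normalized by total weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not be empty or sum to zero")
    return sum(w * agent_metrics[name] for name, w in weights.items()) / total


# Toy data (hypothetical): one value per agent for each metric.
human_ratings = [3.0, 3.5, 4.0, 2.5]
metric_values = {
    "coherence": [0.6, 0.7, 0.8, 0.5],
    "engagement": [0.4, 0.6, 0.9, 0.3],
    "filler": [0.5, 0.5, 0.4, 0.4],  # uncorrelated with the ratings above
}

selected = select_metrics(metric_values, human_ratings)
# Assumption: reuse each metric's correlation as its aggregation weight.
score = unified_score({"coherence": 0.8, "engagement": 0.9}, selected)
print(sorted(selected), round(score, 3))
```

Using the correlation as the weight is only one simple option; a learned model that predicts human ratings from the metrics would fill the same role.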

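Experimental Results

Research Questions

RQ1: Can automated metrics serve as a reliable proxy for human judgment when evaluating non-goal-oriented conversations?
RQ2: Which specific metrics correlate most strongly with human ratings of conversation quality?
RQ3: How can multiple metrics be combined to produce a unified, actionable ranking of conversational agents?
RQ4: To what extent does the framework enable fine-grained, scalable analysis beyond aggregate human ratings?
RQ5: Can the framework be applied effectively in a real-world, large-scale deployment with millions of user interactions?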

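Key Findings

The proposed multi-metric framework correlates well with human judgment, supporting its use as a reasonable proxy for human evaluation.
The framework enables fine-grained analysis of conversational agents, capturing distinctions that aggregate human ratings do not reflect.
The evaluation strategy was applied throughout the Alexa Prize competition and supported the selection of the top-performing agents.
The evaluation covered millions of real user conversations and hundreds of thousands of user ratings, making it, to the authors' knowledge, the largest evaluation setting to date.
The unified metric aggregation mechanism produces agent rankings consistent with human preferences, allowing scalable and objective comparison of agents.
The results indicate a feasible path toward automated, large-scale evaluation of conversational AI systems.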
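One way to check the claim that a metric-based ranking agrees with human preferences is a rank correlation between the two orderings. The sketch below uses Spearman's rho in its simple no-ties form; the summary does not say which agreement measure the authors used, so this is an illustrative choice, and the agent names and scores are invented toy values.

```python
def ranks(scores):
    """Map each agent to its rank (1 = best). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {agent: position + 1 for position, agent in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score dicts over the same agents.

    Uses the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)); needs n >= 2.
    """
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[agent] - ranks_b[agent]) ** 2 for agent in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy data (hypothetical): unified metric score vs. mean human rating per agent.
metric_scores = {"agent_a": 0.82, "agent_b": 0.64, "agent_c": 0.71, "agent_d": 0.40}
human_scores = {"agent_a": 3.9, "agent_b": 3.4, "agent_c": 3.2, "agent_d": 2.8}

rho = spearman_rho(metric_scores, human_scores)
print(round(rho, 2))
```

A value near 1 means the two rankings order the agents almost identically; in this toy data only the two middle agents swap places.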


This review was generated by AI and reviewed by a human editor.