QUICK REVIEW

[Paper Review] On Evaluating and Comparing Open Domain Dialog Systems

Anu Venkatesh, Chandra Khatri|arXiv (Cornell University)|Jan 11, 2018

Topic Modeling|References: 24|Cited by: 24

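One-Sentence Summary

This paper proposes a comprehensive multi-metric framework for evaluating open-domain conversational agents, combining coherence, engagement, topical diversity, domain coverage, and conversational depth to reduce the subjectivity of human judgment. The unified metric correlates strongly with human ratings (r = 0.66 against user ratings and r = 0.70 against ratings from frequent users), which supports its use as a reliable proxy for large-scale socialbot evaluation in real-world settings such as the Alexa Prize competition.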

ABSTRACT

Conversational agents are exploding in popularity. However, much work remains in the area of non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar university competition where sixteen selected university teams built conversational agents to deliver the best social conversational experience. Alexa Prize provided the academic community with the unique opportunity to perform research with a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgment. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. We show that these metrics can be used as a reasonable proxy for human judgment. We provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout the Alexa Prize competition. To our knowledge, to date it is the largest setting for evaluating agents with millions of conversations and hundreds of thousands of ratings from users. We believe that this work is a step towards an automatic evaluation process for conversational AIs.

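Research Motivation and Objectives

Address the lack of objective, scalable evaluation methods for open-domain conversational agents, a gap that stems from the subjective nature of conversation quality.
Develop a set of automated metrics that correlate well with human judgment for evaluating conversational agents.
Unify multiple fine-grained metrics into a single comparable score so that socialbots can be ranked and compared in a large-scale production environment.
Use machine learning to predict user ratings automatically, reducing reliance on costly human evaluation.
Establish a benchmark for conversational AI evaluation based on millions of real user interactions from the Alexa Prize competition.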

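Proposed Method

Design a multi-metric evaluation framework covering conversational user experience, coherence, engagement, domain coverage, topical depth, and topical diversity.
Collect and analyze more than one million real conversations and hundreds of thousands of user ratings from Alexa users during the Alexa Prize competition.
Validate the agreement between the automated metrics and human ratings using statistical correlation analysis (Pearson and Spearman correlation).
Unify the individual metrics into a single composite score through a weighted aggregation strategy, enabling comparison across agents.
Train a gradient boosted decision tree (GBDT) model on 60,000 conversations to predict user ratings from conversation-level features, including topic and coherence metrics.
Identify user-level features and topic representations as potential inputs for improving future automated rating prediction models.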
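The weighted aggregation step can be illustrated with a minimal Python sketch. It is not the competition's actual formula: the weights, the min-max normalization, the helper names (`min_max_normalize`, `unified_scores`, `rank_agents`), and the metric values for the three bots are all assumptions made up for illustration, since this review does not state how the metrics were scaled or weighted.

```python
# Hypothetical weights for illustration only; the values actually used in
# the Alexa Prize are not given in this review.
WEIGHTS = {
    "coherence": 0.30,
    "engagement": 0.30,
    "domain_coverage": 0.15,
    "topical_depth": 0.15,
    "topical_diversity": 0.10,
}


def min_max_normalize(values):
    """Rescale raw metric values to [0, 1]; a constant list maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def unified_scores(agents, weights=WEIGHTS):
    """Map {agent: {metric: raw_value}} to {agent: weighted score in [0, 1]}."""
    names = list(agents)
    total_weight = sum(weights.values())
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        # Normalize each metric across agents so that scales are comparable.
        normalized = min_max_normalize([agents[n][metric] for n in names])
        for name, value in zip(names, normalized):
            scores[name] += (weight / total_weight) * value
    return scores


def rank_agents(agents, weights=WEIGHTS):
    """Return agent names ordered from highest to lowest unified score."""
    scores = unified_scores(agents, weights)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up raw metric values for three hypothetical socialbots.
agents = {
    "bot_a": {"coherence": 0.82, "engagement": 14.0, "domain_coverage": 0.70,
              "topical_depth": 4.1, "topical_diversity": 35.0},
    "bot_b": {"coherence": 0.75, "engagement": 9.0, "domain_coverage": 0.55,
              "topical_depth": 3.0, "topical_diversity": 28.0},
    "bot_c": {"coherence": 0.60, "engagement": 6.0, "domain_coverage": 0.40,
              "topical_depth": 2.2, "topical_diversity": 20.0},
}
scores = unified_scores(agents)
ranking = rank_agents(agents)
```

Normalizing each metric before weighting keeps a metric with a large raw range (for example, a count of turns) from dominating one measured on a 0 to 1 scale. Other scaling choices, such as z-scores, would be equally plausible here.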

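Experimental Results

Research Questions

RQ1: Can automated metrics be designed to reduce subjectivity in the evaluation of open-domain conversational agents?
RQ2: Do the proposed metrics (such as coherence, engagement, and topical diversity) correlate strongly with human judgment in real conversations?
RQ3: Can the unified metric rank conversational agents in a way that reflects human user ratings?
RQ4: How well can a machine learning model predict human user ratings from conversation-level features?
RQ5: Can the evaluation framework scale to millions of conversations while remaining reliable and valid?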
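RQ2 and RQ3 come down to measuring how closely an automated score tracks human ratings. As a rough sketch of that measurement, the Python below computes Pearson and Spearman correlation from scratch. The function names and the sample numbers are hypothetical and are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


# Made-up example: an automated score and a mean user rating per agent.
automated_scores = [0.20, 0.40, 0.50, 0.90]
user_ratings = [2.9, 3.1, 3.4, 3.8]
r_pearson = pearson(automated_scores, user_ratings)
r_spearman = spearman(automated_scores, user_ratings)
```

Pearson measures linear agreement, while Spearman only asks whether the two orderings agree, which is the more relevant property when the goal is to rank agents.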

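Key Findings

The unified evaluation metric has a correlation of 0.66 with overall user ratings and 0.70 with ratings from frequent users, supporting its use as a reliable proxy for human judgment.
The proposed metrics (coherence, engagement, topical diversity, domain coverage, and topical depth) capture fine-grained aspects of conversation quality that human ratings alone do not reflect.
A preliminary GBDT model trained on the 60,000-conversation dataset reached a Spearman correlation of 0.352 and a Pearson correlation of 0.351 with user ratings, significantly better than random selection.
The study draws on the largest known evaluation of conversational agents to date, covering more than one million conversations and hundreds of thousands of ratings from real Alexa users.
The results suggest that the accuracy of automated rating prediction could improve substantially with larger datasets and the addition of user-level features.
The framework was applied in practice during the Alexa Prize competition to rank and compare socialbots, demonstrating its scalability and usefulness in a real-world setting.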
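To make the GBDT result more concrete, here is a toy gradient boosting regressor in pure Python that uses depth-1 trees (stumps) and squared-error loss. It only illustrates the general technique: the paper's model, features, and hyperparameters are not described in this review, so the two features (a coherence score and a turn count), the ratings, and the settings `n_rounds` and `learning_rate` are all made-up assumptions.

```python
def fit_stump(rows, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return None if best is None else best[1:]


def stump_output(stump, row):
    """Value predicted by one stump for one feature row."""
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_gbdt(rows, targets, n_rounds=50, learning_rate=0.1):
    """Boosting loop: each stump is fit to the current residuals."""
    base = sum(targets) / len(targets)
    predictions = [base] * len(targets)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        stumps.append(stump)
        predictions = [p + learning_rate * stump_output(stump, row)
                       for row, p in zip(rows, predictions)]
    return base, learning_rate, stumps


def predict(model, row):
    """Sum the base value and the shrunken output of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * stump_output(s, row) for s in stumps)


def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Made-up conversation-level features: [coherence score, number of turns].
rows = [[0.2, 3], [0.4, 5], [0.5, 8], [0.7, 10], [0.8, 14], [0.9, 20]]
ratings = [1.5, 2.0, 3.0, 3.5, 4.0, 4.5]

model = fit_gbdt(rows, ratings)
fitted = [predict(model, row) for row in rows]
baseline = [sum(ratings) / len(ratings)] * len(ratings)
```

On this toy data the boosted model fits the training ratings more closely than predicting the mean rating for every conversation. A real evaluation would score held-out conversations, for example by the Spearman correlation between predicted and actual ratings.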


This review was generated by AI and reviewed by human editors.