QUICK REVIEW

[Paper Review] FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Nouha Dziri, Ehsan Kamalloo|arXiv (Cornell University)|Apr 22, 2022

Topic Modeling | Cited by 25

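One-Sentence Summary

FaithDial edits hallucinated responses in Wizard of Wikipedia dialogues to build a faithful benchmark for information-seeking dialogue. The data supports training a hallucination critic and generating more faithful responses, with gains that carry over to zero-shot transfer and positive human evaluation results.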

ABSTRACT

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 12.8 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.

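Research Motivation and Goals

Improve the faithfulness of training data to its knowledge sources, advancing research on trustworthy knowledge-grounded dialogue.
Build a scalable annotation workflow for faithfulness by editing existing WoW utterances against Wikipedia snippets.
Provide data for training a hallucination critic and for improving faithful dialogue generation.
Test whether the benefits of FaithDial generalize to other datasets through zero-shot transfer.
Verify the effect on faithfulness and engagement through human evaluation.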

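Proposed Method

Edit hallucinated Wizard responses in Wizard of Wikipedia so that they are faithful to the corresponding knowledge source.
Define faithfulness formally in terms of semantic entailment by the knowledge snippet.
Annotate hallucinations and required edits through crowdsourcing with quality control.
Train a hallucination critic on FaithCritic, a dataset derived from FaithDial, and evaluate its transfer ability.
Benchmark a range of models and training variants (GPT2, DialoGPT, T5, DoHA, CTRL, InfoNCE), including auxiliary losses, to improve faithfulness.
Use an InfoNCE contrastive objective during training to separate faithful responses from hallucinated ones.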
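The contrastive objective in the last item can be illustrated with a minimal sketch of the standard InfoNCE loss. This is an illustration, not the authors' implementation: the function name `info_nce_loss`, the `temperature` parameter, and the idea of passing precomputed scalar scores are all assumptions made for the example. This summary does not say how the model scores a response, so the scores here are hypothetical inputs.

```python
import math
from typing import Sequence


def info_nce_loss(pos_score: float,
                  neg_scores: Sequence[float],
                  temperature: float = 1.0) -> float:
    """Standard InfoNCE loss for one faithful (positive) response and a
    set of hallucinated (negative) responses.

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )

    The scores are hypothetical model scores; how they are produced is
    outside the scope of this sketch.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log softmax of the positive entry.
    return log_denominator - logits[0]


if __name__ == "__main__":
    # The loss is small when the faithful response outscores the negatives
    # and large when a hallucinated response scores higher.
    print(info_nce_loss(5.0, [0.0, 0.0]))
    print(info_nce_loss(0.0, [5.0, 5.0]))
```

Minimizing this quantity pushes the score of the faithful response above the scores of the hallucinated ones, which is the sense in which the objective separates the two.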

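Experimental Results

Research Questions

RQ1: Compared with WoW, can FaithDial reduce hallucination in knowledge-grounded dialogue generation?
RQ2: How do models trained on FaithDial perform on faithfulness and abstractiveness metrics?
RQ3: Do the faithfulness gains transfer zero-shot to other datasets such as CMU-DoG and TopicalChat?
RQ4: Can a critic trained on FaithCritic, which is derived from FaithDial, transfer to other NLU tasks and benchmarks?
RQ5: How does training on FaithDial affect human perception of cooperativeness, interpretability, and engagement?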

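Key Findings

FaithDial contains about 50K utterances across 5.5K dialogues; in human verification, 94.4% of utterances are faithful and only 5.6% are hallucinated.
A hallucination critic trained on FaithCritic (derived from FaithDial) transfers better to MNLI and BEGIN in the zero-shot setting than baselines such as DNLI and DECODE.
Models trained on FaithDial hallucinate markedly less and score higher on faithfulness metrics (such as Q2-NLI) than models trained on WoW alone; a mixed FaithDial/WoW setup brings further gains.
Models trained on FaithDial generalize zero-shot to CMU-DoG and TopicalChat.
Human evaluation shows that responses from FaithDial-trained models are more interpretable, cooperative, and engaging than those from models trained on WoW.
Compared with WoW, FaithDial encourages abstractive use of knowledge (lower density at similar coverage), improving dialogue quality while preserving faithfulness.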
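To make "lower density at similar coverage" concrete, here is a minimal sketch of the extractive-fragment coverage and density statistics widely used in summarization. It assumes these are the metrics meant above, since this summary does not define them. The function names and the whitespace tokenization are choices made for the example only.

```python
from typing import List, Tuple


def extractive_fragments(knowledge: List[str],
                         response: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans of `response` that also
    appear contiguously in `knowledge`."""
    fragments = []
    i = 0
    while i < len(response):
        best = 0
        for j in range(len(knowledge)):
            if response[i] == knowledge[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(response) and j + k < len(knowledge)
                       and response[i + k] == knowledge[j + k]):
                    k += 1
                best = max(best, k)
        if best > 0:
            fragments.append(response[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def coverage_and_density(knowledge: List[str],
                         response: List[str]) -> Tuple[float, float]:
    """Coverage: share of response tokens copied from the knowledge.
    Density: mean squared fragment length, so long copied spans dominate."""
    if not response:
        return 0.0, 0.0
    fragments = extractive_fragments(knowledge, response)
    n = len(response)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


if __name__ == "__main__":
    knowledge = "the cat sat on the mat".split()
    # Verbatim copy: one long fragment, so density is high.
    print(coverage_and_density(knowledge, "the cat sat".split()))         # (1.0, 3.0)
    # Same vocabulary, reordered: equal coverage, lower density.
    print(coverage_and_density(knowledge, "mat the on cat".split()))      # (1.0, 1.0)
```

In this toy case both responses draw every token from the knowledge, so coverage is identical, but only the verbatim copy has high density. A response that reuses the knowledge's content without copying long spans shows the same pattern.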


This review was generated by AI and reviewed by human editors.