QUICK REVIEW

[Paper Digest] Survey of Hallucination in Natural Language Generation

Ziwei Ji, Nayeon Lee|arXiv (Cornell University)|Feb 8, 2022

Topic Modeling|Cited by 94

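One-sentence summary

A comprehensive survey of hallucination in NLG, covering definitions, metrics, mitigation, and progress across tasks including abstractive summarization, dialogue, generative question answering (GQA), data-to-text, MT, and vision-language generation.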

ABSTRACT

Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation; and (3) hallucinations in large language models (LLMs). This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.

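Motivation and objectives

Define and categorize hallucination in NLG, and clarify related terms such as faithfulness and factuality.
Summarize the factors that contribute to hallucination at the data, training, and inference stages.
Review the metrics used to measure hallucination and how well they correlate with human judgment.
Survey mitigation strategies across data, modeling, training, and post-processing.
Present task-specific progress in abstractive summarization, dialogue generation, generative QA, data-to-text, machine translation, and VL generation.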

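Proposed approach

Organize the literature around a general definition of hallucination, its types (intrinsic vs. extrinsic), and task-specific differences.
Classify the sources of hallucination by data divergence, training choices, exposure bias, and parametric knowledge.
Summarize evaluation metrics (statistical, model-based (IE-, QA-, NLI-, and LM-based), and human evaluation) along with their strengths and weaknesses.
Group mitigation methods into data-related, architecture-related, training, and post-processing approaches.
Synthesize task-specific definitions, metrics, and mitigation strategies for the major NLG tasks.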
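
To make the "statistical" metric family concrete, here is a minimal sketch of a lexical-overlap check: it reports the fraction of generated n-grams that never occur in the source. This is an illustrative toy, not a metric defined in the survey; the function names, the tokenizer, and the example sentences are all assumptions made for this sketch.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphanumeric tokens (toy tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unsupported_ngram_rate(source, generated, n=2):
    """Fraction of generated n-grams that do not appear anywhere in the source.

    Hypothetical helper for illustration: 0.0 means every generated n-gram
    is lexically present in the source; higher values mean more unsupported text.
    """
    source_ngrams = set(ngrams(tokenize(source), n))
    generated_ngrams = ngrams(tokenize(generated), n)
    if not generated_ngrams:
        return 0.0  # nothing to check
    unsupported = [g for g in generated_ngrams if g not in source_ngrams]
    return len(unsupported) / len(generated_ngrams)


# Made-up example sentences, not taken from the survey.
source = "The company reported revenue of 5 million dollars in 2020."
faithful = "The company reported revenue of 5 million dollars."
hallucinated = "The company reported revenue of 9 billion dollars."

print(unsupported_ngram_rate(source, faithful))      # all bigrams found in source
print(unsupported_ngram_rate(source, hallucinated))  # 3 of 7 bigrams unsupported
```

A purely lexical check like this would also flag a faithful paraphrase as unsupported, which is one reason model-based metrics and human evaluation are considered alongside statistical ones.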

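Results

Research questions

RQ1: What are the canonical definitions and categories of hallucination in NLG, and how do they differ across tasks?
RQ2: Which factors at the data, training, and inference stages contribute to hallucination, and how can they be mitigated?
RQ3: Which metrics best quantify hallucination, and how well do they agree with human judgment across tasks?
RQ4: Which mitigation strategies across data, modeling, training, and post-processing have shown promise on the major NLG tasks?
RQ5: What are the current progress and key challenges in hallucination research for abstractive summarization, dialogue generation, GQA, data-to-text, MT, and VL generation?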

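Key findings

Hallucination in NLG is categorized as intrinsic or extrinsic, and both its definition and the tolerance for it vary across tasks.
Contributing factors include data source divergence, data collection practices, training objectives, exposure bias, and parametric memory.
A range of metrics goes beyond ROUGE/BLEU, including information extraction (IE)-based, question answering (QA)-based, and natural language inference (NLI)-based metrics, faithfulness classifiers, language model (LM)-based metrics, and human evaluation; their correlation with human judgment varies.
Mitigation spans data curation and augmentation, architectural changes, training strategies, and post-processing techniques.
Task-specific analysis shows that abstractive summarization, dialogue, GQA, data-to-text, MT, and VL generation differ in their definitions, metrics, and mitigation methods.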
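
The intrinsic/extrinsic split can be illustrated with a toy, IE-style check over fact triples: a generated fact that contradicts the source is intrinsic, and one that the source neither supports nor contradicts is extrinsic. This is a sketch under strong assumptions, not the survey's procedure: facts are assumed to be already extracted as (subject, relation, object) triples, each (subject, relation) pair is assumed to have a single value in the source, and every name and triple below is hypothetical.

```python
from enum import Enum


class Label(Enum):
    SUPPORTED = "supported"   # the fact appears in the source
    INTRINSIC = "intrinsic"   # the fact contradicts the source
    EXTRINSIC = "extrinsic"   # the fact cannot be verified from the source


def classify_triple(triple, source_triples):
    """Label one generated (subject, relation, object) triple against the source.

    Assumption for this sketch: a (subject, relation) pair has exactly one
    correct object, so a different object counts as a contradiction.
    """
    subject, relation, obj = triple
    if triple in source_triples:
        return Label.SUPPORTED
    for s, r, o in source_triples:
        if s == subject and r == relation and o != obj:
            return Label.INTRINSIC
    return Label.EXTRINSIC


# Hypothetical triples standing in for the output of an IE system.
source_triples = {
    ("vaccine", "approved_in", "2019"),
    ("vaccine", "approved_by", "FDA"),
}
generated_triples = [
    ("vaccine", "approved_by", "FDA"),     # matches the source
    ("vaccine", "approved_in", "2021"),    # conflicts with the source year
    ("vaccine", "trialled_in", "Canada"),  # not mentioned in the source
]

for t in generated_triples:
    print(t, "->", classify_triple(t, source_triples).value)
```

In practice the extraction step is itself error-prone, so a check like this inherits the mistakes of whatever IE system produces the triples.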

This digest was generated by AI and reviewed by a human editor.