QUICK REVIEW

[Paper Review] Survey of Hallucination in Natural Language Generation

Ziwei Ji, Nayeon Lee|arXiv (Cornell University)|Feb 8, 2022

Topic Modeling|Citations: 94

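One-Line Summary

A comprehensive survey of hallucination in NLG, covering definitions, metrics, mitigation, and task-specific progress in summarization, dialogue, generative question answering (GQA), data-to-text, machine translation, and vision-language generation.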

ABSTRACT

Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation; and (3) hallucinations in large language models (LLMs). This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.

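Research Motivation and Objectives

Define and categorize hallucination in NLG, and clarify related terms such as faithfulness and factuality.
Summarize the causes of hallucination arising from data, training, and inference.
Review metrics for measuring hallucination and how well they correlate with human judgment.
Survey mitigation strategies spanning data, modeling, training, and post-processing.
Report task-specific progress for abstractive summarization, dialogue generation, generative QA, data-to-text, machine translation, and vision-language (VL) generation.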

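Proposed Approach

Organize the literature around a general definition of hallucination, its types (intrinsic and extrinsic), and task-specific nuances.
Classify the causes of hallucination into data divergence, training choices, exposure bias, and parametric knowledge.
Summarize evaluation metrics (statistical, model-based, IE/QA/NLI/LM-based, and human evaluation) along with their strengths and weaknesses.
Organize mitigation methods into data-related, architectural, training, and post-processing approaches.
Consolidate task-specific definitions, metrics, and mitigation strategies across the major NLG tasks.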
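
To make the "statistical" metric family above more concrete, here is a minimal sketch of a lexical-overlap proxy: the fraction of output n-grams that never appear in the source. This is an illustration under stated assumptions, not a metric defined in the survey; the function names (`tokenize`, `ngrams`, `unsupported_ngram_rate`) and the example sentences are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unsupported_ngram_rate(source: str, output: str, n: int = 1) -> float:
    """Fraction of output n-grams that do not occur anywhere in the source.

    0.0 means every output n-gram is found in the source; 1.0 means none is.
    Assumption: an output with no n-grams is scored 0.0.
    """
    source_ngrams = set(ngrams(tokenize(source), n))
    output_ngrams = ngrams(tokenize(output), n)
    if not output_ngrams:
        return 0.0
    missing = sum(1 for gram in output_ngrams if gram not in source_ngrams)
    return missing / len(output_ngrams)


# Hypothetical example texts, invented for illustration.
source = "The vaccine was approved in 2019."
faithful = "The vaccine was approved in 2019."
altered = "The vaccine was approved in 2021."

print(unsupported_ngram_rate(source, faithful))       # every token is supported
print(unsupported_ngram_rate(source, altered))        # one of six tokens is unsupported
print(unsupported_ngram_rate(source, altered, n=2))   # one of five bigrams is unsupported
```

A score like this only looks at surface tokens. It cannot tell whether an unsupported span contradicts the source (intrinsic) or is merely unverifiable from it (extrinsic), and it penalizes faithful paraphrases, which is one reason model-based metrics are also reviewed.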

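Experimental Results

Research Questions

RQ1: What are the standard definitions and categories of hallucination in NLG, and how do they differ across tasks?
RQ2: Which factors contribute to hallucination at the data, training, and inference stages, and how can they be mitigated?
RQ3: Which metrics best quantify hallucination, and how closely do they agree with human judgment across tasks?
RQ4: Which mitigation strategies across data, modeling, training, and post-processing are considered promising for the major NLG tasks?
RQ5: What are the current progress and main challenges of hallucination research in abstractive summarization, dialogue generation, GQA, data-to-text, MT, and VL generation?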

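Key Findings

Hallucination in NLG is categorized as intrinsic or extrinsic, and its tolerance level and definition differ by task.
Contributing factors include divergence in the data sources, data collection practices, training objectives, exposure bias, and parametric memory.
Many metrics go beyond ROUGE/BLEU, including IE-based, QA-based, NLI-based, faithfulness-classifier, and LM-based metrics as well as human evaluation; their correlation with human judgment varies.
Mitigation spans data curation and augmentation, architectural changes, training strategies, and post-processing techniques.
Definitions, metrics, and mitigation approaches are shown to differ by task across abstractive summarization, dialogue, GQA, data-to-text, MT, and VL generation.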
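
The finding that metrics correlate with human judgment to varying degrees can be checked for any given metric by computing a rank correlation between its scores and human ratings. The following is a minimal sketch using Spearman's rank correlation in plain Python; the helper names and the example scores are hypothetical and are not taken from the survey.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: automatic faithfulness scores and human ratings
# for the same five system outputs.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 4, 1, 3]

print(spearman(metric_scores, human_ratings))
```

In this invented example the two rankings agree exactly, so the coefficient is at its maximum; with real annotations the value is typically lower and depends on the task and dataset.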

This review was generated by AI and checked by a human editor.