QUICK REVIEW

[Paper Review] FELM: Benchmarking Factuality Evaluation of Large Language Models

Shiqi Chen, Yiran Zhao|arXiv (Cornell University)|Oct 1, 2023

Topic Modeling|Cited by 12

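One-Sentence Summary

FELM introduces a cross-domain benchmark for evaluating how well factuality evaluators detect errors in long-form outputs, with fine-grained segment-level annotations and an analysis of retrieval and reasoning aids.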

ABSTRACT

Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.

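Motivation and Objectives

Extend factuality evaluation beyond world knowledge to five domains: world knowledge, science and technology, math, writing and recommendation, and reasoning.
Provide fine-grained, segment-level annotations (factuality label, error type, reason, and references) to guide the development of evaluators.
Evaluate vanilla and augmented LLM-based factuality evaluators, using retrieval and chain-of-thought techniques.
Establish a sound annotation and verification workflow to ensure high-quality, interpretable factuality judgments.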

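Proposed Method

Collect prompts from diverse sources across the five domains and generate zero-shot responses with ChatGPT.
Split responses into fine-grained text segments using sentence-based or GPT-assisted methods.
Have expert annotators label each segment with a factuality label, error type, reason, and reference links.
Evaluate segment-level and response-level factuality with vanilla, chain-of-thought, reference-link-augmented, and reference-document-augmented evaluators on several LLM backbones (Vicuna-33B, ChatGPT, GPT-4).
Compare segment-based with claim-based evaluation, and analyze domain-specific performance and the effect of each augmentation.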
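
The segment-level annotation and its roll-up to a response-level label can be sketched as follows. This is a minimal illustration, not the authors' code: the `Segment` class, its field names, the naive regex sentence splitter, and the rule that a response counts as factual only when every one of its segments is factual are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One annotated text segment (field names are hypothetical)."""
    text: str
    is_factual: bool
    error_type: Optional[str] = None  # filled in only for non-factual segments
    reason: Optional[str] = None
    references: List[str] = field(default_factory=list)


def split_into_sentences(response: str) -> List[str]:
    """Naive sentence-based segmentation: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def response_is_factual(segments: List[Segment]) -> bool:
    """Assumed aggregation rule: a response is factual only if all segments are."""
    return all(seg.is_factual for seg in segments)
```

A simple splitter like this would break on abbreviations and decimal numbers, which may be one reason a GPT-assisted segmentation is listed as an alternative.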

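Experimental Results

Research Questions

RQ1: Can FELM's multi-domain, segment-level annotations reliably capture factual errors in long-form LLM outputs?
RQ2: How do vanilla, chain-of-thought, and retrieval-augmented evaluators perform across FELM's domains?
RQ3: Across domains and models, does segment-based evaluation detect factual errors better than claim-based evaluation?
RQ4: Do retrieved links or documents bring measurable gains for LLM evaluators in factuality detection?
RQ5: What are the limitations and domain-specific challenges of factuality evaluation with current LLMs?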

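Key Findings

Detecting factual errors remains challenging; GPT-4-based evaluators outperform the others in some settings, but the task is still hard overall.
Retrieval-augmented evaluators (with reference links or reference documents) improve F1 scores, with reference-document augmentation bringing notable gains.
Chain-of-thought prompting helps GPT-4 but is not always effective for GPT-3.5/ChatGPT, although self-consistency can improve CoT performance.
The world knowledge and reasoning domains benefit more from augmentation and CoT, whereas the recommendation/writing domain, with long responses and sparse errors, remains difficult.
Evaluated without external tools, ChatGPT-based detectors generally perform poorly on FELM, underscoring the need for external evidence in evaluation.
Vicuna-33B-based detectors are competitive at the segment level, but their balanced accuracy remains close to random.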
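
For reference, the two metrics named in these findings can be computed as in the sketch below. It assumes binary labels where `True` marks a segment that contains a factual error (the positive class); the function names are illustrative and not taken from the paper. When both classes are present, a detector that gives every segment the same label scores a balanced accuracy of 0.5, which is the sense of "close to random" here.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating True (segment has an error) as positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    return tp, fp, fn, tn


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 on the error class: harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(gold, pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tp, fp, fn, tn = confusion_counts(gold, pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2
```

Balanced accuracy is a useful companion to F1 when errors are sparse, because plain accuracy would reward a detector that never flags anything.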

This review was generated by AI and reviewed by a human editor.