QUICK REVIEW

[Paper Review] Long-form factuality in large language models

Jerry Wei, Chengrun Yang | arXiv (Cornell University) | Mar 27, 2024
Topic Modeling | Cited by 8
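One-sentence summary

The paper introduces LongFact, a prompt set for benchmarking long-form factuality; SAFE, an automated factuality evaluator that uses Google Search; and F1@K, a new F1-based metric. It benchmarks 13 models from four model families on prompts spanning 38 topics.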

ABSTRACT

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

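Motivation and objectives

  • Present LongFact, a large-scale long-form factuality prompt set covering 38 topics.
  • Introduce SAFE, an LLM-based automated evaluator that breaks a response into individual facts and verifies each one with Google Search.
  • Define F1@K, an aggregate metric for long-form factuality that accounts for recall as well as precision.
  • Benchmark 13 models across four model families to assess how model scale affects factuality.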

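Proposed method

  • Generate the LongFact prompts with GPT-4 so that they cover long-form factuality questions across 38 topics (LongFact-Concepts and LongFact-Objects).
  • Develop SAFE: an LLM agent that splits a response into individual facts, assesses each fact's relevance, and uses multi-step reasoning over Google Search results to label each fact as supported, not supported, or irrelevant.
  • Aggregate factuality with F1@K, a metric that combines precision and recall, where recall is tied to a user-specified ideal length K.
  • Benchmark 13 models (Gemini, GPT, Claude, PaLM-2) with SAFE on a fixed prompt set, reporting F1@K at K=64 and K=178.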
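
The F1@K aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: it assumes precision is the share of supported facts among supported plus not-supported facts, recall is the number of supported facts divided by K and capped at 1, and a response with no supported facts scores 0. The function and parameter names are hypothetical.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Illustrative F1@K for a single response (names are hypothetical)."""
    if k <= 0:
        raise ValueError("k must be a positive preferred number of facts")
    if num_supported <= 0:
        # Assumption: a response with no supported facts gets a score of 0.
        return 0.0
    # Precision: fraction of the rated (relevant) facts that are supported.
    precision = num_supported / (num_supported + num_not_supported)
    # Recall: supported facts relative to the preferred length K, capped at 1.
    recall = min(num_supported / k, 1.0)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A made-up response with 50 supported and 10 not-supported facts.
    for k in (64, 178):
        print(f"F1@{k} = {f1_at_k(50, 10, k):.3f}")
```

Irrelevant facts are left out of both precision and recall in this sketch, so only the supported and not-supported counts are passed in.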
Figure 1: Our automatic factuality evaluator, SAFE, uses a large language model to rate the factuality of a long-form response to a given prompt using Google Search. We empirically demonstrate that SAFE outperforms human annotators while being more than 20 times cheaper (Section 4).

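Experimental results

Research questions

  • RQ1: How well does a large-scale prompt set (LongFact) probe long-form factuality across diverse topics?
  • RQ2: Can an LLM-based automated evaluator (SAFE) assess long-form factuality as reliably as human annotators?
  • RQ3: Across the main model families (Gemini, GPT, Claude, PaLM-2), how do model scale and architecture affect long-form factuality?
  • RQ4: Does adding a recall component through the hyperparameter K make factuality metrics for long-form responses more useful?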

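Key findings

| Model | S (supported) | NS (not supported) | I (irrelevant) | Precision | R64 | R178 | F1@64 | F1@178 |
|---|---|---|---|---|---|---|---|---|
| Gemini-Ultra | 83.4 | 13.1 | 7.6 | 86.2 | 98.3 | 46.9 | 91.7 | 60.3 |
| Gemini-Pro | 66.6 | 12.1 | 5.5 | 82.0 | 88.5 | 37.4 | 83.7 | 50.4 |
| GPT-4-Turbo | 93.6 | 8.3 | 6.1 | 91.7 | 99.0 | 52.6 | 95.0 | 66.4 |
| GPT-4 | 59.1 | 7.0 | 2.4 | 89.4 | 88.3 | 33.2 | 88.0 | 48.0 |
| GPT-3.5-Turbo | 46.9 | 4.5 | 1.2 | 90.8 | 72.4 | 26.4 | 79.6 | 40.5 |
| Claude-3-Opus | 63.8 | 8.1 | 2.8 | 88.5 | 91.4 | 35.9 | 89.3 | 50.6 |
| Claude-3-Sonnet | 65.4 | 8.4 | 2.6 | 88.5 | 91.4 | 36.7 | 89.4 | 51.4 |
| Claude-3-Haiku | 39.8 | 2.8 | 1.3 | 92.8 | 62.0 | 22.4 | 73.5 | 35.8 |
| Claude-2.1 | 38.3 | 6.5 | 1.9 | 84.8 | 59.8 | 21.5 | 67.9 | 33.7 |
| Claude-2.0 | 38.5 | 6.3 | 1.3 | 85.5 | 60.0 | 21.6 | 68.7 | 34.0 |
| Claude-Instant | 43.9 | 8.1 | 1.3 | 84.4 | 68.4 | 24.6 | 73.8 | 37.6 |
| PaLM-2-L-IT-RLHF | 72.9 | 8.7 | 4.3 | 89.1 | 94.0 | 41.0 | 91.0 | 55.3 |
| PaLM-2-L-IT | 13.2 | 1.8 | 0.1 | 88.8 | 20.6 | 7.4 | 31.1 | 13.2 |
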
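  • LongFact provides a broad long-form factuality benchmark of 2,280 prompts spanning 38 topics.
  • SAFE agrees with human annotations on 72% of individual facts, and on a sampled subset of disagreement cases SAFE is correct 76% of the time, outperforming the human annotators.
  • SAFE costs roughly $0.19 per response, more than 20 times cheaper than crowdsourced human annotation.
  • Across the 13 models, larger models generally show better long-form factuality, with GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF leading at the reported values of K.
  • F1@K combines precision and recall through a preferred response length K, giving a consistent way to quantify long-form factuality.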
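
To see how the choice of K shapes the scores, it can help to write the metric out for a single response. The derivation below is a sketch that assumes precision is S/(S+N) and recall is min(S/K, 1), where S and N (symbols chosen here for illustration) are the response's supported and not-supported fact counts. Under that assumption F1@K reduces to two closed forms, depending on whether the response reaches K supported facts.

```latex
\begin{align*}
\mathrm{Prec} &= \frac{S}{S+N}, \qquad R_K = \min\!\left(\frac{S}{K},\, 1\right), \\
F_1@K &=
\begin{cases}
\dfrac{2\,\mathrm{Prec}\,R_K}{\mathrm{Prec}+R_K} & \text{if } S > 0, \\[1ex]
0 & \text{if } S = 0,
\end{cases} \\
\text{for } 0 < S < K:\quad F_1@K &= \frac{2S}{S+N+K}, \\
\text{for } S \ge K:\quad F_1@K &= \frac{2S}{2S+N}.
\end{align*}
```

For a hypothetical response with S = 50 and N = 10, this gives 100/124 ≈ 0.806 at K = 64 and 100/238 ≈ 0.420 at K = 178, so the same response scores much lower when the preferred length is larger. The table reports averages over prompts, so these per-response identities will not reproduce its rows exactly.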
Figure 3: SAFE is able to accurately rate factual claims that require multi-step evaluation by leveraging an LLM to iteratively issue Google Search queries and reason with search results.
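
The evaluation loop that the caption describes can be sketched as follows. This is a hedged illustration of the control flow only, not the released SAFE code: every function name, the `max_queries` parameter, and the toy stand-ins for the LLM and for Google Search are hypothetical, and a real evaluator would replace each callable with an LLM prompt or a search API call.

```python
from typing import Callable, Dict, List

# Hypothetical callables standing in for LLM prompts and a search backend.
SplitFn = Callable[[str], List[str]]         # response -> individual facts
RelevanceFn = Callable[[str, str], bool]     # (prompt, fact) -> is it relevant?
QueryFn = Callable[[str, List[str]], str]    # (fact, evidence so far) -> next query
SearchFn = Callable[[str], str]              # query -> search result text
JudgeFn = Callable[[str, List[str]], bool]   # (fact, evidence) -> supported?


def rate_response(prompt: str, response: str, split: SplitFn,
                  is_relevant: RelevanceFn, next_query: QueryFn,
                  search: SearchFn, judge: JudgeFn,
                  max_queries: int = 3) -> Dict[str, int]:
    """Count supported / not supported / irrelevant facts in one response."""
    counts = {"supported": 0, "not_supported": 0, "irrelevant": 0}
    for fact in split(response):
        if not is_relevant(prompt, fact):
            counts["irrelevant"] += 1
            continue
        # Multi-step search: each new query can depend on earlier results.
        evidence: List[str] = []
        for _ in range(max_queries):
            evidence.append(search(next_query(fact, evidence)))
        label = "supported" if judge(fact, evidence) else "not_supported"
        counts[label] += 1
    return counts


# Toy stand-ins so the sketch runs without any LLM or network access.
def toy_split(response: str) -> List[str]:
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_is_relevant(prompt: str, fact: str) -> bool:
    return "Eiffel" in fact


def toy_next_query(fact: str, evidence: List[str]) -> str:
    return fact


def toy_search(query: str) -> str:
    return "The Eiffel Tower is in Paris and opened in 1889"


def toy_judge(fact: str, evidence: List[str]) -> bool:
    return any(fact.lower() in e.lower() for e in evidence)


if __name__ == "__main__":
    answer = ("The Eiffel Tower is in Paris. "
              "The Eiffel Tower opened in 1950. I love travel.")
    print(rate_response("Tell me about the Eiffel Tower.", answer,
                        toy_split, toy_is_relevant, toy_next_query,
                        toy_search, toy_judge))
```

Passing the LLM and search steps in as callables keeps the loop testable offline; the resulting counts are the inputs that an aggregate metric such as F1@K would consume.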


This review was generated by AI and reviewed by a human editor.