QUICK REVIEW

[Paper Review] TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton|arXiv (Cornell University)|Sep 8, 2021

Topic Modeling | Cited by 116

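One-sentence summary

TruthfulQA evaluates whether language models avoid imitative falsehoods across 817 questions in 38 categories; the best model was truthful on 58% of questions versus 94% for humans, and the largest models were generally less truthful than smaller ones.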

ABSTRACT

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

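Motivation and objectives

Evaluate whether language models answer truthfully in a zero-shot setting on questions drawn from many domains.
Investigate whether scaling up model size improves or reduces truthfulness, and identify the factors involved.
Develop automated metrics that predict human evaluations of truthfulness.
Create a benchmark that distinguishes imitative falsehoods from non-imitative weaknesses.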

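Proposed method

Construct an adversarial benchmark of 817 questions covering 38 categories, designed to elicit imitative falsehoods.
Evaluate several model families (GPT-3, GPT-Neo/J, GPT-2, UnifiedQA) at different sizes and with different prompts in a true zero-shot setting.
Use human evaluators to rate generated answers for truthfulness and informativeness.
Develop and validate GPT-judge, a fine-tuned model that predicts whether an answer is truthful.
Include a multiple-choice variant and automatic likelihood scoring of reference answers.
Analyze whether larger models exhibit inverse scaling in truthfulness and informativeness.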
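
The multiple-choice variant and likelihood scoring mentioned above can be illustrated with a small sketch. It assumes that a model's log-likelihood for each true and false reference answer has already been computed; the function names and the numbers are hypothetical, and the scoring rule shown here is a simplified illustration rather than necessarily the paper's exact metric.

```python
import math

def true_answer_mass(true_logprobs, false_logprobs):
    """Fraction of total probability mass that falls on true reference answers."""
    true_mass = sum(math.exp(lp) for lp in true_logprobs)
    false_mass = sum(math.exp(lp) for lp in false_logprobs)
    return true_mass / (true_mass + false_mass)

def top_choice_is_true(true_logprobs, false_logprobs):
    """True when the single most likely option is a true reference answer."""
    return max(true_logprobs) > max(false_logprobs)

# Hypothetical log-likelihoods for one question (not real model output).
true_lp = [math.log(0.3)]
false_lp = [math.log(0.1)]

mass = true_answer_mass(true_lp, false_lp)
best_is_true = top_choice_is_true(true_lp, false_lp)
```

Averaging either quantity over all questions would give a benchmark-level score that requires no human rating, which is what makes likelihood-based scoring convenient for comparing model sizes.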

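Experimental results

Research questions

RQ1 How truthful are current language models on a benchmark designed to elicit imitative falsehoods?
RQ2 Does increasing model size improve truthfulness, or does inverse scaling appear, as observed?
RQ3 Can an automated metric (GPT-judge) accurately approximate human judgments of truthfulness?
RQ4 How much does the prompt affect the truthfulness and informativeness of model outputs?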

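Key findings

The best zero-shot model (GPT-3-175B with the helpful prompt) was truthful on 58% of questions.
The human baseline was truthful on 94% of questions; 87% of human answers were both truthful and informative.
Across model families, the largest models tended to be less truthful than smaller ones (inverse scaling).
Although truthfulness declined with size, larger models were more informative; in the multiple-choice task, larger models also performed worse.
GPT-judge predicts human truthfulness judgments with 90–96% validation accuracy and generalizes across model architectures.
Automated metrics provide an inexpensive proxy for human evaluation and correlate strongly with truthfulness judgments.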
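
To make the reported percentages concrete, the sketch below shows one way the headline numbers could be computed from per-answer labels: the share of truthful answers, the share that are both truthful and informative, and the agreement between an automated judge and human raters. The `Rating` class, the function names, and the sample labels are all hypothetical; this is not the paper's evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    truthful: bool      # human label: the answer contains no false statement
    informative: bool   # human label: the answer actually addresses the question

def truthful_rate(ratings):
    """Share of answers rated truthful."""
    return sum(r.truthful for r in ratings) / len(ratings)

def truthful_and_informative_rate(ratings):
    """Share of answers rated both truthful and informative."""
    return sum(r.truthful and r.informative for r in ratings) / len(ratings)

def judge_accuracy(human_labels, judge_labels):
    """Fraction of answers where the automated judge agrees with the human label."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Hypothetical ratings for four answers. The second is truthful but
# uninformative, as a refusal such as "I have no comment" would be.
ratings = [
    Rating(True, True),
    Rating(True, False),
    Rating(False, True),
    Rating(True, True),
]
pct_true = truthful_rate(ratings)
pct_both = truthful_and_informative_rate(ratings)

# Hypothetical judge predictions compared against the human truthfulness labels.
agreement = judge_accuracy([r.truthful for r in ratings], [True, True, True, True])
```

Reporting the two rates separately matters because a model can score high on truthfulness alone by declining to answer, which is why the human baseline is given both as 94% truthful and as 87% truthful and informative.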


This review was generated by AI and reviewed by a human editor.