QUICK REVIEW

[Paper Review] Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly|arXiv (Cornell University)|Jul 11, 2022

Topic Modeling | Cited by 159

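One-Sentence Summary

This paper shows that, when questions are presented in the right format, large language models are well calibrated on diverse multiple-choice and true/false questions. It also explores how models can evaluate their own answers and predict whether they know the answer to a question (P(IK)) without reference to any particular proposed answer.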

ABSTRACT

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

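Motivation and Objectives

Assess how well calibrated large language models are on diverse multiple-choice, true/false, and related tasks when questions are formatted with explicit answer options.
Study self-evaluation by having models generate answers and then assess their own outputs.
Train models to predict the probability that they know the answer, P(IK), independently of any proposed answer.
Examine how P(IK) generalizes across tasks and how it responds to the presence or absence of source materials or hints.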

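Proposed Method

Evaluate 800M, 3B, 12B, and 52B parameter models across a variety of formats on BIG Bench, MMLU, TruthfulQA, QuALITY, and LogiQA.
Format multiple-choice questions with lettered options and assess calibration with Expected Calibration Error (ECE) and related metrics.
Test true/false reformulations to measure the calibration of P(True).
Train a value head to predict P(IK) and compare it with natural-language approaches.
Use self-generated samples (T=1) and self-evaluation prompts to measure the accuracy and Brier score of P(True).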
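The two calibration metrics named above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the choice of ten equal-width bins, and the toy inputs are all assumptions made here, and the paper's exact binning scheme may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (an assumed, common convention).

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between predicted probability and outcome."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)
```

Lower is better for both metrics. ECE measures only the gap between stated confidence and observed accuracy, while the Brier score also rewards predictions that separate correct from incorrect answers.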

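Experimental Results

Research Questions

RQ1: When questions are presented with explicit options, can large language models produce calibrated probabilities for their outputs across diverse tasks?
RQ2: Can models effectively evaluate the correctness of their own samples (P(True)), and does brainstorming multiple samples improve that evaluation?
RQ3: Can models be trained to predict the probability that they know the answer (P(IK)) independently of any proposed answer, and how well does this generalize across tasks?
RQ4: How do source materials or hints affect P(IK) predictions and calibration?
RQ5: How do RLHF and prompt format affect model calibration and honesty?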
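To make the P(True) setup in RQ2 concrete, the sketch below builds a self-evaluation prompt and turns the model's scores for the two options into a probability. The template wording, the function names, and the `other_samples` parameter are hypothetical illustrations chosen here, not taken from this summary, and the log-probabilities are assumed to come from whatever model API is in use.

```python
import math


def build_self_eval_prompt(question, proposed_answer, other_samples=()):
    """Build a true/false self-evaluation prompt (hypothetical template).

    other_samples: optional extra model samples shown before the judgment,
    mimicking the "brainstorming" variant described in the findings.
    """
    lines = [f"Question: {question}"]
    for sample in other_samples:
        lines.append(f"Possible Answer: {sample}")
    lines.append(f"Proposed Answer: {proposed_answer}")
    lines.append("Is the proposed answer:")
    lines.append(" (A) True")
    lines.append(" (B) False")
    lines.append("The proposed answer is:")
    return "\n".join(lines)


def p_true_from_logprobs(logprob_true, logprob_false):
    """Renormalize the two option log-probabilities into P(True)."""
    m = max(logprob_true, logprob_false)  # subtract the max for numerical stability
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)
```

In use, the prompt would be sent to the model and the log-probabilities of the "(A)" and "(B)" continuations passed to `p_true_from_logprobs`; renormalizing over just those two options discards any probability mass the model places on other tokens.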

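Key Findings

When the answer options are visible and the format is favorable, large models are well calibrated on multiple-choice questions; calibration improves with model size and with few-shot prompting.
Replacing one option with "none of the above" degrades both accuracy and calibration, suggesting that models struggle when no listed option is explicitly correct.
True/false formulations yield well-calibrated predictions (P(True)) across tasks, and calibration is more robust in larger models.
The miscalibration of RLHF policies can be largely corrected with a simple temperature adjustment.
Self-evaluation of model-generated samples (P(True)) is feasible, and it becomes more accurate when the model is shown many samples before giving its judgment (brainstorming); calibration improves with model size.
Models can be trained with a value head to predict P(IK) and show generalization across tasks, although calibration is better in-distribution than out-of-distribution.
P(IK) increases when source materials or hints toward the solution are provided, indicating sensitivity to additional context.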
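The temperature adjustment mentioned for RLHF policies amounts to dividing the logits by a constant before the softmax. The sketch below shows the mechanism only; the function name and the example logits are illustrative assumptions, and the temperature that actually works for a given model would have to be fitted on held-out data.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature.

    temperature > 1 flattens the distribution (less confident),
    temperature < 1 sharpens it; the ranking of options is unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over four answer options from an overconfident model.
example_logits = [4.0, 1.0, 0.0, 0.0]
sharp = softmax_with_temperature(example_logits, temperature=1.0)
soft = softmax_with_temperature(example_logits, temperature=2.0)
```

Because the rescaling preserves the order of the options, accuracy is unaffected; only the confidence attached to each answer changes, which is why a single scalar can repair calibration without retraining.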

This review was generated by AI and reviewed by a human editor.