QUICK REVIEW

[Paper Review] Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly|arXiv (Cornell University)|Jul 11, 2022

Topic Modeling | Citations: 159

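One-sentence summary

The paper shows that large language models are well calibrated on diverse multiple-choice and true/false questions when those questions are formatted appropriately, and it explores how models can evaluate their own answers and predict whether they know the answer to a question, P(IK), without reference to any particular proposed answer.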

ABSTRACT

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

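Motivation and objectives

Evaluate whether large language models are calibrated on multiple-choice questions (MCQs) with explicit answer options, on true/false questions, and on related tasks.
Examine self-evaluation, in which the model generates outputs and then assesses them itself.
Train models to predict P(IK), the probability that they know the answer, without reference to any proposed answer.
Examine how P(IK) generalizes across tasks and how it responds to the presence or absence of source materials and hints.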

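Proposed method

Evaluate 800M, 3B, 12B, and 52B parameter models on BIG Bench, MMLU, TruthfulQA, QuALITY, and LogiQA in a variety of formats.
Format MCQs with lettered answer options, and assess calibration with Expected Calibration Error (ECE) and related metrics.
Test true/false rephrasings of the questions and measure the calibration of P(True).
Train a value head to predict P(IK) and compare it with a natural-language approach.
Measure the accuracy and Brier score of P(True) using self-generated samples (T=1) and self-evaluation prompts.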
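The two calibration metrics named above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the ECE here uses equal-width confidence bins for simplicity, whereas the paper's exact binning scheme may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins (an assumption made for this sketch).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of predictions that fall into it.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        # Clamp the index so that p == 1.0 lands in the last bin.
        idx = min(max(int(p * n_bins), 0), n_bins - 1)
        bins[idx].append((p, 1.0 if c else 0.0))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(c for _, c in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


def brier_score(probs: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum(
        (p - (1.0 if o else 0.0)) ** 2 for p, o in zip(probs, outcomes)
    ) / len(probs)
```

Lower is better for both metrics. A model that always answers with probability 0.5 on a balanced true/false set has a Brier score of 0.25, which is a useful baseline when reading P(True) results.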

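Experimental results

Research questions

RQ1: Can large language models produce calibrated probabilities for their outputs across a variety of tasks when questions are presented with explicit answer options?
RQ2: Can models effectively self-evaluate the correctness of their own samples (P(True)), and does brainstorming multiple samples improve this evaluation?
RQ3: Can models be trained to predict P(IK), the probability that they know the answer, without reference to any proposed answer, and how well does this generalize across tasks?
RQ4: How do source materials and hints affect the prediction and calibration of P(IK)?
RQ5: How do RLHF and prompt format affect model calibration and honesty?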
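To make RQ2 concrete, the sketch below shows one way a P(True) self-evaluation query could be assembled and scored. The prompt wording, the function names, and the renormalization over the two options are all assumptions made for illustration. The log-probabilities would come from whatever model API is in use, which is left out here.

```python
import math
from typing import Optional, Sequence


def build_p_true_prompt(
    question: str,
    proposed_answer: str,
    other_samples: Optional[Sequence[str]] = None,
) -> str:
    """Build a True/False self-evaluation prompt (hypothetical wording).

    If other_samples is given, the model's other sampled answers are shown
    first, mimicking the "brainstorming" condition.
    """
    lines = [f"Question: {question}"]
    if other_samples:
        lines.append("Here are some brainstormed ideas: " + "; ".join(other_samples))
    lines += [
        f"Proposed Answer: {proposed_answer}",
        "Is the proposed answer:",
        " (A) True",
        " (B) False",
        "The proposed answer is:",
    ]
    return "\n".join(lines)


def p_true_from_logprobs(logprob_a: float, logprob_b: float) -> float:
    """Renormalize over the two options: P(True) = p(A) / (p(A) + p(B)).

    Renormalizing is an assumption of this sketch; subtracting the max
    keeps the exponentials numerically stable.
    """
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_a / (exp_a + exp_b)
```

In use, the prompt is sent to the model, the log-probabilities of the "(A)" and "(B)" continuations are read off, and `p_true_from_logprobs` turns them into a single score that can be checked for calibration.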

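Key findings

When the options are visible and the format is appropriate, large models are well calibrated on multiple-choice tasks. Calibration improves with model size and with more few-shot examples.
Replacing an option with 'none of the above' degrades both performance and calibration, suggesting that models struggle when they must reject the listed options and the correct answer is left unspecified.
The true/false format yields well-calibrated P(True) predictions across tasks, and calibration becomes more robust with larger models.
RLHF policies are miscalibrated, but a simple temperature adjustment corrects this and makes their predicted probabilities more consistent with observed accuracy.
Self-evaluation of model-generated samples (P(True)) is feasible, and it becomes more accurate when the model is shown many of its own samples before judging (brainstorming). Calibration improves as model size increases.
A value head can predict P(IK) and generalizes across tasks to some extent, but calibration is better in-distribution than out-of-distribution.
P(IK) rises when relevant source material or hints toward the solution are available in the context, showing that the prediction is sensitive to additional context.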
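The temperature adjustment mentioned for RLHF policies can be illustrated with a temperature-scaled softmax. This is a generic sketch of the technique, not the paper's code. The function name and the logit values are hypothetical, and the temperature in the usage lines is illustrative only.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(
    logits: Sequence[float], temperature: float = 1.0
) -> List[float]:
    """Convert logits to probabilities after dividing by a temperature.

    temperature > 1 flattens the distribution (less confident);
    temperature < 1 sharpens it (more confident).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over four answer options from an overconfident policy.
logits = [4.0, 1.0, 0.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
softened = softmax_with_temperature(logits, temperature=2.5)
```

Raising the temperature lowers the top option's probability without changing which option ranks first, so accuracy is unchanged while the stated confidence moves closer to it. In practice the temperature would be chosen on held-out data.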

This review was written by AI and checked by a human editor.