QUICK REVIEW

[Paper Review] Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns, Haotian Ye | arXiv (Cornell University) | Dec 7, 2022

Topic Modeling | Cited by 45

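One-line summary

This paper proposes Contrast-Consistent Search (CCS), an unsupervised method that extracts latent truth representations from a language model's activations in order to answer yes/no questions. On average it outperforms zero-shot baselines, and it reduces sensitivity to the choice of prompt.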

ABSTRACT

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

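Motivation and objectives

Motivate and formalize the problem of extracting latent truth from language models without supervision.
Develop a lightweight probe that identifies a truth-related direction in activation space.
Show that this latent knowledge transfers across tasks and stays robust to misleading prompts.
Analyze the properties of the learned representations and their data/sample efficiency.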

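Proposed method

Build a contrast pair for each yes/no question by formatting it as both a positive statement (x+) and a negative statement (x−).
Extract the model activations for each contrast pair and normalize them.
Map the normalized activations to probabilities with a linear probe followed by a sigmoid.
Train the probe with an unsupervised loss that combines a consistency term, which pushes p(x+) toward 1−p(x−), and a confidence term, which rules out the degenerate solution p(x+) = p(x−) = 0.5.
Estimate the answer by averaging p(x+) and 1−p(x−) and applying a decision threshold of 0.5.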
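
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are synthetic stand-ins generated by a hypothetical `make_synthetic_pairs` helper, the function names (`normalize`, `ccs_loss_and_grad`, `fit_ccs`, `predict`) are invented for this example, and the probe is trained with plain gradient descent and a handful of random restarts purely for brevity. The loss follows the description above: a squared consistency term plus a squared `min(p(x+), p(x−))` confidence term.

```python
import numpy as np

def normalize(x):
    """Standardize each activation dimension across the dataset."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss_and_grad(theta, b, x_pos, x_neg):
    """Consistency + confidence loss for a linear probe, with gradients."""
    p_pos = sigmoid(x_pos @ theta + b)
    p_neg = sigmoid(x_neg @ theta + b)
    p_min = np.minimum(p_pos, p_neg)
    # Consistency: p(x+) should equal 1 - p(x-). Confidence: min(p) should be small.
    loss = np.mean((p_pos + p_neg - 1.0) ** 2 + p_min ** 2)
    # Gradient w.r.t. each probability; the confidence term only touches the smaller one.
    pos_is_min = p_pos <= p_neg
    g_pos = 2.0 * (p_pos + p_neg - 1.0) + 2.0 * p_min * pos_is_min
    g_neg = 2.0 * (p_pos + p_neg - 1.0) + 2.0 * p_min * (~pos_is_min)
    # Chain rule through the sigmoid, then through the linear map.
    z_pos = g_pos * p_pos * (1.0 - p_pos)
    z_neg = g_neg * p_neg * (1.0 - p_neg)
    n = x_pos.shape[0]
    grad_theta = (x_pos.T @ z_pos + x_neg.T @ z_neg) / n
    grad_b = (z_pos.sum() + z_neg.sum()) / n
    return loss, grad_theta, grad_b

def fit_ccs(x_pos, x_neg, steps=1000, lr=0.1, n_restarts=5, seed=0):
    """Gradient descent from several random starts; keep the lowest-loss probe."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta = rng.normal(size=x_pos.shape[1])
        theta /= np.linalg.norm(theta)
        b = 0.0
        for _ in range(steps):
            _, g_theta, g_b = ccs_loss_and_grad(theta, b, x_pos, x_neg)
            theta = theta - lr * g_theta
            b = b - lr * g_b
        loss, _, _ = ccs_loss_and_grad(theta, b, x_pos, x_neg)
        if best is None or loss < best[0]:
            best = (loss, theta, b)
    return best

def predict(theta, b, x_pos, x_neg):
    """Average p(x+) and 1 - p(x-), then threshold at 0.5."""
    p = 0.5 * (sigmoid(x_pos @ theta + b) + 1.0 - sigmoid(x_neg @ theta + b))
    return (p > 0.5).astype(int)

def make_synthetic_pairs(n=200, dim=8, seed=0):
    """Hypothetical stand-in for real hidden states (assumption, not from the paper)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    sign = (2 * labels - 1)[:, None]
    truth_dir = rng.normal(size=dim)
    # The constant +3 / -3 offsets mimic a feature that only encodes which
    # template was used; per-side normalization removes it.
    x_pos = sign * truth_dir + 0.5 * rng.normal(size=(n, dim)) + 3.0
    x_neg = -sign * truth_dir + 0.5 * rng.normal(size=(n, dim)) - 3.0
    return x_pos, x_neg, labels

x_pos, x_neg, labels = make_synthetic_pairs()
x_pos, x_neg = normalize(x_pos), normalize(x_neg)
loss, theta, b = fit_ccs(x_pos, x_neg)
acc = np.mean(predict(theta, b, x_pos, x_neg) == labels)
# The loss cannot tell "true" from "false", so the probe's sign is arbitrary.
acc = max(acc, 1.0 - acc)
print(f"final loss {loss:.3f}, accuracy {acc:.2f}")
```

Because the objective is symmetric in p and 1−p, the learned direction may come out flipped; the sketch handles this by taking the larger of accuracy and one minus accuracy. With an all-zero probe both probabilities are 0.5, the consistency term vanishes, and the confidence term alone contributes 0.25, which is the degenerate solution the confidence term is there to penalize.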

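Experimental results

Research questions

RQ1: Can latent truth representations be discovered in a language model from its activations alone, without supervision?
RQ2: Do these representations generalize to datasets and tasks beyond the training data?
RQ3: Are they robust to misleading prompts and to manipulation of the model's outputs?
RQ4: In which layers of the model do these truth representations reside, and how data-efficient is finding them?
RQ5: How do the discovered representations relate to the model's own outputs and to ground-truth labels?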

Key findings

Method	RoBERTa	DeBERTa	GPT-J	T5	UQA	T0	Mean
0-shot	60.1(5.7)	68.6(8.2)	53.2(5.2)	55.4(5.7)	76.8(9.6)	87.9(4.8)	62.8(6.9)
Calibrated 0-shot	64.3(6.2)	76.3(6.0)	56.0(5.2)	58.8(6.1)	80.4(7.1)	90.5(2.7)	67.2(6.1)
CCS	62.1(4.1)	78.5(3.8)	61.7(2.5)	71.5(3.0)	82.1(2.7)	77.6(3.3)	71.2(3.2)
CCS (All Data)	60.1(3.7)	77.1(4.1)	62.1(2.3)	72.7(6.0)	84.8(2.6)	84.8(3.7)	71.5(3.7)
LR (Ceiling)	79.8(2.5)	86.1(2.2)	78.0(2.3)	84.6(3.1)	89.8(1.9)	90.7(2.1)	83.7(2.4)
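
A quick arithmetic check on the table can make the Mean column easier to read. The sketch below copies the accuracies from the rows above and compares two candidate averages with the reported mean. The numbers match an average over the first five models, with T0 left out; that reading is an inference from the arithmetic, not something this review states. The helper name `mean_excluding` is made up for this example.

```python
models = ["RoBERTa", "DeBERTa", "GPT-J", "T5", "UQA", "T0"]

# Accuracies copied from the table (standard-deviation-style values in parentheses omitted).
accuracy = {
    "0-shot": [60.1, 68.6, 53.2, 55.4, 76.8, 87.9],
    "Calibrated 0-shot": [64.3, 76.3, 56.0, 58.8, 80.4, 90.5],
    "CCS": [62.1, 78.5, 61.7, 71.5, 82.1, 77.6],
    "CCS (All Data)": [60.1, 77.1, 62.1, 72.7, 84.8, 84.8],
    "LR (Ceiling)": [79.8, 86.1, 78.0, 84.6, 89.8, 90.7],
}
reported_mean = {
    "0-shot": 62.8,
    "Calibrated 0-shot": 67.2,
    "CCS": 71.2,
    "CCS (All Data)": 71.5,
    "LR (Ceiling)": 83.7,
}

def mean_excluding(values, skip_index):
    """Average the values while leaving out one column."""
    kept = [v for i, v in enumerate(values) if i != skip_index]
    return sum(kept) / len(kept)

t0 = models.index("T0")
for method, values in accuracy.items():
    without_t0 = mean_excluding(values, t0)
    with_t0 = sum(values) / len(values)
    print(f"{method:18s} excl. T0: {without_t0:5.1f}  "
          f"incl. T0: {with_t0:5.1f}  reported: {reported_mean[method]}")

# Gap between CCS and the stronger (calibrated) zero-shot baseline.
gap = reported_mean["CCS"] - reported_mean["Calibrated 0-shot"]
print(f"CCS minus calibrated 0-shot: {gap:.1f} points")
```

The five-model averages agree with the reported column to within rounding, while the six-model averages are off by a point or more. The gap between CCS and calibrated 0-shot in the Mean column is where the 4-point figure comes from.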

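Across 6 models and 10 datasets, CCS outperforms the strong zero-shot baseline by 4 points on average (71.2 vs. 67.2 for calibrated 0-shot in the Mean column).
CCS reduces prompt sensitivity: its accuracy varies less from one prompt to another.
Misleading prompts that degrade zero-shot performance have no significant effect on CCS accuracy.
The latent truth representation transfers across datasets and tasks, which points to a task-agnostic truth direction.
Intermediate layers often give better CCS performance than the final output, suggesting latent knowledge beyond what the outputs contain.
Discovering the truth representation is data-efficient and can work with very few contrast pairs.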

This review was generated by AI and checked by a human editor.