QUICK REVIEW

[Paper Review] Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder, James Chua|arXiv (Cornell University)|Oct 17, 2024

Natural Language Processing Techniques|Cited by 5

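One-Sentence Summary

This paper shows that certain LLMs can introspect: a model predicts its own behavior better than a different model trained on that model's behavioral data, which suggests privileged access to self-knowledge that cannot be derived from training data. It also identifies limitations on complex tasks and examines robustness to changes in behavior.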

ABSTRACT

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

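Motivation and Objectives

Define introspection in LLMs as access to facts about oneself that cannot be derived from training data.
Develop datasets, a finetuning procedure, and evaluation methods for measuring introspection.
Provide evidence that frontier LLMs exhibit introspective ability under certain conditions.
Evaluate the calibration and robustness of introspective predictions and identify their limitations.
Release code and datasets for replication and extension.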

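Proposed Method

Finetune M1 to predict its own behavior in hypothetical scenarios (self-prediction).
Train a different model M2 to predict M1's behavior (cross-prediction).
Test for introspection by comparing M1's self-predictions with M2's predictions on unseen tasks.
Evaluate the calibration of predicted distributions against actual behavior (MAD).
Manipulate M1's ground-truth behavior and check whether M1 updates its introspective predictions (behavioral change).
Control for non-introspective explanations and run a data-scaling analysis to rule out memorization and data bias.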

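Experimental Results

Research Questions

RQ1: Can LLMs report facts about their own behavior that are not contained in their training data?
RQ2: Do self-trained models outperform cross-trained models at predicting their own behavior on unseen tasks?
RQ3: Are introspective predictions well calibrated and robust to changes in ground-truth behavior?
RQ4: What are the limitations, particularly for long outputs and out-of-distribution generalization?
RQ5: What mechanisms or explanations account for introspection beyond self-simulation?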

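Key Findings

Models trained for self-prediction outperform cross-prediction models at predicting the target model's behavior on unseen tasks.
The self-prediction advantage persists even after the target model's ground-truth behavior is intentionally modified.
Models trained for self-prediction are better calibrated than cross-prediction or untrained models.
The introspection effect is stronger on simple tasks and weaker on complex, long-output tasks and on out-of-distribution generalization.
Models can adapt their introspective predictions to reflect changes in their own ground-truth behavior, which provides indirect evidence of introspection.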

This review was generated by AI and checked by a human editor.