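[Paper Explainer] Looking Inward: Language Models Can Learn About Themselves by Introspection
This paper shows that some large language models can introspect by predicting their own behavior, and that they do so more accurately than other models trained on their behavioral data. This suggests privileged access to self-knowledge that cannot be derived from the training data. The paper also identifies limitations on more complex tasks and on out-of-distribution generalization.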
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
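Research Motivation and Objectives
- Define introspection in LLMs as access to facts about themselves that cannot be derived from training data.
- Develop datasets, finetuning methods, and evaluations for measuring introspection.
- Provide evidence that frontier LLMs can introspect under certain conditions.
- Evaluate the calibration and robustness of introspective predictions and identify their limitations.
- Release code and datasets to support replication and extension.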
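Proposed Method
- Finetune M1 to predict its own behavior in hypothetical scenarios (self-prediction).
- Train a separate model M2 to predict M1's behavior (cross-prediction).
- Compare M1's self-predictions with M2's predictions on unseen tasks to test for introspection.
- Evaluate how well the predicted distribution is calibrated against actual behavior (MAD).
- Manipulate M1's ground-truth behavior and test whether M1 updates its introspective predictions (behavioral change).
- Control for non-introspective explanations and run data-scaling analyses to rule out memorization or data bias.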
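The self-prediction versus cross-prediction comparison above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's evaluation code: the function names (`accuracy`, `self_prediction_advantage`) and the toy labels are hypothetical, and it assumes each prediction is a single categorical property of M1's behavior (for example "short" versus "long").

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of held-out prompts where the predicted property matches M1's actual behavior."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one example")
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def self_prediction_advantage(
    self_preds: Sequence[str],
    cross_preds: Sequence[str],
    ground_truth: Sequence[str],
) -> float:
    """Accuracy of M1 predicting itself minus accuracy of M2 predicting M1."""
    return accuracy(self_preds, ground_truth) - accuracy(cross_preds, ground_truth)


# Hypothetical toy data: a property of M1's actual behavior on five unseen prompts.
m1_behavior = ["short", "long", "short", "short", "long"]
m1_self_preds = ["short", "long", "short", "long", "long"]    # M1 predicting itself
m2_cross_preds = ["short", "short", "short", "long", "long"]  # M2 predicting M1

advantage = self_prediction_advantage(m1_self_preds, m2_cross_preds, m1_behavior)
```

A positive `advantage` on unseen tasks is the pattern the method treats as evidence for introspection. A real evaluation would also need significance testing and the controls listed above.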
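Experimental Results
Research Questions
- RQ1: Can LLMs report facts about their own behavior that are not contained in their training data?
- RQ2: On unseen tasks, does a self-trained model predict its own behavior better than a cross-trained model?
- RQ3: Are introspective predictions well calibrated, and are they robust to changes in ground-truth behavior?
- RQ4: What are the limitations of introspection, particularly for longer outputs or out-of-distribution generalization?
- RQ5: What mechanisms or explanations could account for introspection that goes beyond self-simulation?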
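Key Findings
- Models trained for self-prediction predict the target model's behavior on unseen tasks better than cross-prediction models do.
- The self-prediction advantage persists even after the target model's ground-truth behavior is deliberately modified.
- Models trained for self-prediction are better calibrated than cross-prediction or untrained models.
- The introspection effect is stronger on simple tasks and weaker on complex tasks, long-output tasks, and out-of-distribution generalization.
- Models adapt their introspective predictions to reflect changes in their own ground-truth behavior, which provides indirect evidence of introspection.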
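The calibration finding can be made concrete with a small sketch. It assumes MAD stands for mean absolute deviation between the predicted distribution over behaviors and the empirical distribution of the model's sampled behavior, averaged over labels. The paper's exact definition may differ, and the helper names and toy numbers below are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def empirical_distribution(samples: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each behavior label among sampled outputs."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)
    return {label: count / total for label, count in counts.items()}


def mean_absolute_deviation(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Average absolute gap between predicted and actual probability per label (lower is better)."""
    labels = sorted(set(predicted) | set(actual))
    if not labels:
        raise ValueError("need at least one label")
    gaps = [abs(predicted.get(label, 0.0) - actual.get(label, 0.0)) for label in labels]
    return sum(gaps) / len(labels)


# Hypothetical toy data: ten sampled behaviors of M1 on one prompt.
actual = empirical_distribution(["short"] * 6 + ["long"] * 4)

hedged_prediction = {"short": 0.5, "long": 0.5}      # close to the actual 0.6 / 0.4 split
overconfident_prediction = {"short": 1.0, "long": 0.0}

mad_hedged = mean_absolute_deviation(hedged_prediction, actual)
mad_overconfident = mean_absolute_deviation(overconfident_prediction, actual)
```

Under this definition the hedged prediction has the lower MAD, so it counts as better calibrated than the overconfident one.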
This explainer was generated by AI and reviewed by a human editor.