QUICK REVIEW

[Paper Review] Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder, James Chua | arXiv (Cornell University) | 2024-10-17

Natural Language Processing Techniques | Citations: 5

One-line summary

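This paper shows that certain LLMs can introspect: they predict their own future behavior better than other models trained on that behavior, which suggests privileged access to self-knowledge that cannot be derived from training data. It also identifies limitations on complex tasks and examines how robust the predictions are to changes in behavior.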

ABSTRACT

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

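Research motivation and goals

Define introspection in LLMs as access to facts about oneself that cannot be derived from training data.
Develop datasets, finetuning methods, and evaluation methods for measuring introspection.
Provide evidence that frontier LLMs show introspective ability under certain conditions.
Evaluate the calibration and robustness of introspective predictions, and identify their limits.
Release code and datasets to support replication and extension.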

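Proposed method

Finetune M1 to predict properties of its own behavior in hypothetical scenarios (self-prediction).
Train a separate model M2 to predict M1's behavior (cross-prediction).
Test for introspection by comparing M1's self-predictions with M2's predictions on unseen tasks.
Evaluate calibration (MAD) against the distribution of M1's actual behavior.
Change M1's actual behavior and test whether M1 updates its introspective predictions (behavioral change).
Control for non-introspective explanations and run a data-scaling analysis to rule out memorization and data bias.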
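
The self-prediction versus cross-prediction comparison described above can be sketched as a simple accuracy gap. This is a minimal illustration, not the authors' evaluation code: the function names `prediction_accuracy` and `self_prediction_advantage` and the toy labels are hypothetical, and it assumes each behavior property is a single categorical label per prompt (for example, whether the output favors the short- or long-term option).

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of prompts where the predicted property matches M1's actual behavior."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one example")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def self_prediction_advantage(
    self_preds: Sequence[str],
    cross_preds: Sequence[str],
    actual: Sequence[str],
) -> float:
    """Accuracy of M1 predicting itself minus accuracy of M2 predicting M1.

    A positive value on held-out tasks is the pattern the review describes
    as evidence for introspection.
    """
    return prediction_accuracy(self_preds, actual) - prediction_accuracy(cross_preds, actual)


# Toy, made-up labels (NOT results from the paper).
m1_actual = ["short", "long", "short", "short", "long"]       # M1's real behavior
m1_self_preds = ["short", "long", "short", "long", "long"]    # M1 predicting itself
m2_cross_preds = ["short", "short", "short", "long", "long"]  # M2 predicting M1

advantage = self_prediction_advantage(m1_self_preds, m2_cross_preds, m1_actual)
print(f"self-prediction advantage: {advantage:.2f}")
```

In practice the comparison would be run on tasks held out from both models' finetuning data, so that neither M1 nor M2 can succeed by recalling training examples.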

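Experimental results

Research questions

RQ1: Can LLMs report facts about their own behavior that are not contained in their training data?
RQ2: Does a self-prediction-trained model outperform a cross-prediction-trained model at predicting its own behavior on unseen tasks?
RQ3: Are introspective predictions well calibrated, and are they robust to changes in ground-truth behavior?
RQ4: What are the limits of introspection for long-form outputs and out-of-distribution generalization?
RQ5: What mechanisms, beyond self-simulation, explain introspection?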

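Key findings

Self-prediction-trained models outperform cross-prediction models at predicting the target model's behavior on unseen tasks.
The self-prediction advantage persists even after the target model's ground-truth behavior is intentionally changed.
Self-prediction-trained models are better calibrated than cross-prediction or untrained models.
The introspection effect is stronger on simpler tasks and weaker on complex long-form outputs and out-of-distribution generalization.
Models can adjust their introspective predictions to track changes in their own ground-truth behavior, providing indirect evidence of introspection.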
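
The calibration finding can be made concrete with a small sketch. It assumes that MAD stands for mean absolute deviation and that it is computed between the probabilities a model assigns to each behavior option and the frequencies observed when M1's actual behavior is sampled repeatedly; the review does not spell out the exact formula, so the function `calibration_mad` and the toy numbers below are hypothetical. Lower values mean better calibration.

```python
from typing import Dict, List


def calibration_mad(
    predicted: List[Dict[str, float]],
    observed: List[Dict[str, float]],
) -> float:
    """Mean absolute deviation between predicted and observed option probabilities.

    Each list entry is one prompt: a mapping from behavior option to probability.
    Options missing from either mapping are treated as probability 0.
    (Assumed definition; the review only names the metric.)
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total = 0.0
    count = 0
    for pred, obs in zip(predicted, observed):
        for option in set(pred) | set(obs):
            total += abs(pred.get(option, 0.0) - obs.get(option, 0.0))
            count += 1
    if count == 0:
        raise ValueError("need at least one option to compare")
    return total / count


# Toy, made-up distributions (NOT results from the paper).
predicted_dists = [
    {"short": 0.7, "long": 0.3},
    {"short": 0.2, "long": 0.8},
]
observed_dists = [
    {"short": 0.6, "long": 0.4},
    {"short": 0.5, "long": 0.5},
]

mad = calibration_mad(predicted_dists, observed_dists)
print(f"MAD: {mad:.2f}")
```

Under this assumed definition, the same function could be applied to self-prediction, cross-prediction, and untrained models in turn, with the smallest MAD indicating the best-calibrated predictor.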


This review was generated by AI and reviewed by a human editor.