QUICK REVIEW

[Paper Review] The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks, Max Tegmark | arXiv (Cornell University) | 2023-10-10

Topic Modeling | Citations: 16

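One-line summary

The paper shows that LLM representations encode truth along a linear direction, demonstrates that linear truth probes transfer across datasets, and provides causal evidence through targeted interventions; it also introduces mass-mean probing as a robust probing method.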

ABSTRACT

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

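Motivation and Objectives

Curate high-quality true/false datasets of factual statements to study how LLMs represent truth.
Investigate whether truth is encoded along a linear direction in LLM representations.
Evaluate how well truth probes generalize across datasets and statement types.
Provide causal evidence that the identified truth directions influence model outputs.
Introduce mass-mean probing as a robust probing method whose directions are causally implicated in model outputs.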

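Proposed Method

Extract layer-13 residual stream activations at the final token of each statement from LLaMA-13B and LLaMA-2-13B.
Use PCA to visualize the true/false distinction and identify linear structure separating true from false statements.
Train linear probes (logistic regression, mass-mean probing, CCS) to classify truth on one dataset and test their transfer to other datasets.
Perform causal patching that affects model outputs, either by swapping hidden states or by adding a truth-direction vector.
Assess whether the encoding is truth-specific by comparing probes trained on true/false datasets with probes trained on high-likelihood text.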
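The probe-and-transfer step can be illustrated with a minimal sketch of a difference-in-means (mass-mean) probe. This is not the authors' code: the activations are synthetic Gaussian vectors standing in for residual stream activations, the helper names (`fit_mass_mean_probe`, `make_synthetic`) are hypothetical, and placing the decision threshold at the midpoint of the two class means is an assumption made here for simplicity.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_mass_mean_probe(true_acts, false_acts):
    """Return (direction, offset) with direction = mean(true) - mean(false)."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    # Assumption: threshold at the midpoint between the two class means.
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, dot(direction, midpoint)


def predict_true(direction, offset, act):
    """Classify an activation as 'true' if it projects past the threshold."""
    return dot(direction, act) > offset


def make_synthetic(n, dim, truth_axis, shift, rng):
    """Toy stand-in for activations: unit Gaussian noise plus a +/- shift along one axis."""
    true_acts, false_acts = [], []
    for _ in range(n):
        noise = [rng.gauss(0, 1) for _ in range(dim)]
        true_acts.append([x + shift * a for x, a in zip(noise, truth_axis)])
        noise = [rng.gauss(0, 1) for _ in range(dim)]
        false_acts.append([x - shift * a for x, a in zip(noise, truth_axis)])
    return true_acts, false_acts


rng = random.Random(0)
dim = 16
truth_axis = [1.0] + [0.0] * (dim - 1)

# Fit on one toy "dataset", evaluate on a second one drawn separately.
train_true, train_false = make_synthetic(200, dim, truth_axis, 3.0, rng)
test_true, test_false = make_synthetic(200, dim, truth_axis, 3.0, rng)

direction, offset = fit_mass_mean_probe(train_true, train_false)
correct = sum(predict_true(direction, offset, a) for a in test_true)
correct += sum(not predict_true(direction, offset, a) for a in test_false)
accuracy = correct / (len(test_true) + len(test_false))
print(f"accuracy on the held-out toy set: {accuracy:.3f}")
```

With real models, the two toy sets would be replaced by activations from two different true/false datasets, which is what makes the evaluation a transfer test rather than an in-distribution one.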

Figure 1: Projections of residual stream representations of our datasets onto their top two PCs.
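A projection like the one in Figure 1 can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's pipeline: the input is synthetic data with a planted true/false axis, and the top two principal components are approximated with power iteration and deflation in plain Python rather than a library PCA routine. All function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n = len(rows)
    mean = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, mean)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def matvec(mat, v):
    return [dot(row, v) for row in mat]


def leading_eigenpair(mat, rng, iters=300):
    """Power iteration from a random start; returns (unit vector, Rayleigh quotient)."""
    v = [rng.gauss(0, 1) for _ in mat]
    for _ in range(iters):
        w = matvec(mat, v)
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return v, dot(v, matvec(mat, v))


def top_two_pcs(rows, rng):
    centered = center(rows)
    cov = covariance(centered)
    pc1, lam1 = leading_eigenpair(cov, rng)
    # Deflation: remove the first component, then find the next one.
    # The second vector is approximate when the remaining eigenvalues are close.
    size = len(cov)
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(size)]
                for i in range(size)]
    pc2, _ = leading_eigenpair(deflated, rng)
    return centered, pc1, pc2


rng = random.Random(0)
dim, n_per_class, shift = 8, 100, 3.0
labels = [True] * n_per_class + [False] * n_per_class
acts = []
for is_true in labels:
    row = [rng.gauss(0, 1) for _ in range(dim)]
    row[0] += shift if is_true else -shift  # planted toy "truth" axis
    acts.append(row)

centered, pc1, pc2 = top_two_pcs(acts, rng)
points = [(dot(row, pc1), dot(row, pc2)) for row in centered]
mean_true_x = sum(p[0] for p, t in zip(points, labels) if t) / n_per_class
mean_false_x = sum(p[0] for p, t in zip(points, labels) if not t) / n_per_class
print(f"mean PC1 coordinate: true={mean_true_x:.2f}, false={mean_false_x:.2f}")
```

Each entry of `points` is a 2D coordinate that could be scatter-plotted and colored by label. The sign of a principal component is arbitrary, so which class lands on the positive side of PC1 carries no meaning.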

Experimental Results

Research Questions

RQ1: Do LLMs exhibit linear structure in how they represent the truth value of factual statements?
RQ2,

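Key Findings

PCA visualizations show clear linear separation of true and false statements along the top principal components.
Probes trained on one dataset generalize to other datasets, suggesting that the truth direction is transferable.
Causal interventions along the truth direction can substantially change how the model treats true versus false statements.
Mass-mean probing generalizes as well as logistic regression and CCS while identifying directions that are more causally implicated in model outputs.
Probes trained on true/false datasets are more involved in mediating model predictions than probes trained on high-likelihood text.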
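The idea behind the intervention result can be sketched with a toy example. Everything here is an assumption for illustration: the activations are synthetic, `toy_readout` is a hypothetical stand-in for whatever downstream computation turns a hidden state into a true/false judgment, and the shift is applied to a bare vector rather than inside a real LLM forward pass.

```python
import math
import random


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def add_direction(act, direction, alpha=1.0):
    """Shift one activation vector by alpha * direction."""
    return [a + alpha * d for a, d in zip(act, direction)]


def toy_readout(act, weights):
    """Hypothetical downstream readout: sigmoid of a fixed linear score."""
    score = sum(w * a for w, a in zip(weights, act))
    return 1.0 / (1.0 + math.exp(-score))


rng = random.Random(1)
dim, n = 8, 100
# Toy activations: the classes differ by +/- 2.0 along coordinate 0.
true_acts = [[rng.gauss(0, 0.5) + (2.0 if i == 0 else 0.0) for i in range(dim)]
             for _ in range(n)]
false_acts = [[rng.gauss(0, 0.5) - (2.0 if i == 0 else 0.0) for i in range(dim)]
              for _ in range(n)]

# Difference-in-means direction pointing from the false cluster to the true one.
truth_direction = [t - f for t, f in zip(mean_vector(true_acts), mean_vector(false_acts))]

# Readout weights are correlated with the truth axis but not identical to it.
readout_weights = [1.0, 0.3] + [0.0] * (dim - 2)

before = sum(toy_readout(a, readout_weights) for a in false_acts) / n
patched = [add_direction(a, truth_direction) for a in false_acts]
after = sum(toy_readout(a, readout_weights) for a in patched) / n
print(f"mean readout on false statements: before={before:.2f}, after={after:.2f}")
```

In this toy setup, adding the direction moves false-statement activations into the true cluster, so the readout score rises. In a real model the analogous measurement would compare output probabilities before and after modifying hidden states during the forward pass.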

Figure 2: Projections of residual stream representations of datasets onto the top 2 PCs of cities.

This review was generated by AI and reviewed by a human editor.