QUICK REVIEW

[Paper Review] Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns, Haotian Ye | arXiv (Cornell University) | 2022-12-07

Topic Modeling | Citations: 45

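One-Line Summary

The paper proposes Contrast-Consistent Search (CCS), an unsupervised method that answers yes/no questions by extracting a latent representation of truth from a language model's activations. On average it outperforms the zero-shot baseline and reduces prompt sensitivity.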

ABSTRACT

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

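Motivation and Objectives

Motivate and formalize the problem of extracting latent truth from language models without supervision.
Develop a lightweight probe that identifies a truth-related direction in activation space.
Demonstrate that this latent knowledge transfers across tasks and is robust to misleading prompts.
Analyze the properties of the learned representation and its data/sample efficiency.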

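Proposed Method

Format each yes/no question in two ways, as an affirmative statement (x+) and a negative statement (x−), to form a contrast pair.
Extract the model activations for each contrast pair and normalize them, separately for the x+ and x− sets.
Map the normalized activations to probabilities with a linear probe followed by a sigmoid.
Train the probe on an unsupervised loss that combines a consistency term (p(x+) = 1 − p(x−)) with a confidence term, which rules out degenerate solutions such as p(x+) = p(x−) = 0.5.
Infer the answer by averaging p(x+) and 1 − p(x−) and applying a 0.5 decision threshold.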
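The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the activations are synthetic stand-ins for real hidden states, and the function names (`normalize`, `fit_ccs`, `predict`, `demo`), the per-feature standardization, the plain gradient-descent optimizer, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

def normalize(h):
    # Standardize each feature over the dataset; applied separately to the
    # x+ and x- activations so a constant "Yes"/"No" offset is removed.
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def ccs_loss_and_grad(theta, b, h_pos, h_neg):
    p_pos = sigmoid(h_pos @ theta + b)
    p_neg = sigmoid(h_neg @ theta + b)
    gap = p_pos + p_neg - 1.0          # consistency: p(x+) should equal 1 - p(x-)
    m = np.minimum(p_pos, p_neg)       # confidence: the smaller probability should be near 0
    loss = float(np.mean(gap ** 2 + m ** 2))
    # Hand-derived gradient of the loss through the sigmoid.
    g_pos = (2 * gap + 2 * m * (p_pos <= p_neg)) * p_pos * (1 - p_pos)
    g_neg = (2 * gap + 2 * m * (p_neg < p_pos)) * p_neg * (1 - p_neg)
    grad_theta = (h_pos.T @ g_pos + h_neg.T @ g_neg) / len(h_pos)
    grad_b = float(np.mean(g_pos + g_neg))
    return loss, grad_theta, grad_b

def fit_ccs(h_pos, h_neg, n_restarts=5, n_steps=1000, lr=0.2, seed=0):
    # Plain gradient descent from several random starts; keep the lowest loss.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta = rng.normal(scale=0.1, size=h_pos.shape[1])
        b = 0.0
        for _ in range(n_steps):
            _, g_theta, g_b = ccs_loss_and_grad(theta, b, h_pos, h_neg)
            theta = theta - lr * g_theta
            b = b - lr * g_b
        loss = ccs_loss_and_grad(theta, b, h_pos, h_neg)[0]
        if best is None or loss < best[0]:
            best = (loss, theta, b)
    return best

def predict(theta, b, h_pos, h_neg):
    # Average p(x+) and 1 - p(x-), then threshold at 0.5.
    p_pos = sigmoid(h_pos @ theta + b)
    p_neg = sigmoid(h_neg @ theta + b)
    return 0.5 * (p_pos + (1.0 - p_neg)) > 0.5

def demo():
    # Synthetic activations (an assumption): feature 0 carries the hidden truth
    # value, feature 1 carries only the "Yes"/"No" ending of the prompt.
    rng = np.random.default_rng(0)
    n, d = 256, 8
    labels = np.arange(n) % 2
    sign = (2 * labels - 1)[:, None]
    truth_dir = np.eye(d)[0]
    ending_dir = np.eye(d)[1]
    raw_pos = rng.normal(size=(n, d)) + 3 * sign * truth_dir + 5 * ending_dir
    raw_neg = rng.normal(size=(n, d)) - 3 * sign * truth_dir - 5 * ending_dir
    h_pos, h_neg = normalize(raw_pos), normalize(raw_neg)
    loss, theta, b = fit_ccs(h_pos, h_neg)
    acc = float(np.mean(predict(theta, b, h_pos, h_neg) == labels.astype(bool)))
    # The loss cannot tell "true" from "false", so the sign of the direction is arbitrary.
    return loss, max(acc, 1.0 - acc)

if __name__ == "__main__":
    print(demo())
```

Because the loss is symmetric in the two truth values, the probe is only identified up to a sign flip, which is why the sketch reports the larger of the accuracy and its complement. With real models, `raw_pos` and `raw_neg` would be hidden states taken from the language model for the two halves of each contrast pair.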

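Experimental Results

Research Questions

RQ1: Can latent truth representations in a language model be discovered from activations alone, without supervision?
RQ2: Do these representations generalize to datasets and tasks beyond the training data?
RQ3: Are these representations robust to manipulation of model outputs and to misleading prompts?
RQ4: In which layers of the model do these truth representations reside, and how data-efficient is the method?
RQ5: How do the discovered representations relate to the model's own outputs and to ground-truth labels?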

Main Results

Method	RoBERTa	DeBERTa	GPT-J	T5	UQA	T0	Mean
0-shot	60.1(5.7)	68.6(8.2)	53.2(5.2)	55.4(5.7)	76.8(9.6)	87.9(4.8)	62.8(6.9)
Calibrated 0-shot	64.3(6.2)	76.3(6.0)	56.0(5.2)	58.8(6.1)	80.4(7.1)	90.5(2.7)	67.2(6.1)
CCS	62.1(4.1)	78.5(3.8)	61.7(2.5)	71.5(3.0)	82.1(2.7)	77.6(3.3)	71.2(3.2)
CCS (All Data)	60.1(3.7)	77.1(4.1)	62.1(2.3)	72.7(6.0)	84.8(2.6)	84.8(3.7)	71.5(3.7)
LR (Ceiling)	79.8(2.5)	86.1(2.2)	78.0(2.3)	84.6(3.1)	89.8(1.9)	90.7(2.1)	83.7(2.4)

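CCS improves accuracy by about 4 points on average over strong zero-shot baselines across 6 models and 10 datasets (71.2 vs. 67.2 for calibrated zero-shot in the table's average column).
CCS reduces prompt sensitivity, and its average accuracy is more stable across diverse prompts.
Misleading prompts that can degrade zero-shot performance have little effect on CCS accuracy.
The latent truth representation transfers across datasets and tasks, suggesting a truth direction that is not task-specific.
CCS often performs better on intermediate layers than on the final output, suggesting latent knowledge beyond what appears in the outputs.
The truth representation can be data-efficient, sometimes working with only a handful of contrast pairs.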
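As a quick sanity check on the 4-point figure, the average column can be recomputed from the per-model columns of the table. The sketch below copies the accuracies from the table. The helper name `mean_excluding` is hypothetical, and leaving T0 out of the average is an assumption that the arithmetic itself supports: the reported averages are only reproduced (to within rounding of the inputs) when T0 is excluded.

```python
# Per-model accuracies copied from the results table (standard deviations omitted).
models = ["RoBERTa", "DeBERTa", "GPT-J", "T5", "UQA", "T0"]
accuracy = {
    "0-shot": [60.1, 68.6, 53.2, 55.4, 76.8, 87.9],
    "Calibrated 0-shot": [64.3, 76.3, 56.0, 58.8, 80.4, 90.5],
    "CCS": [62.1, 78.5, 61.7, 71.5, 82.1, 77.6],
    "CCS (All Data)": [60.1, 77.1, 62.1, 72.7, 84.8, 84.8],
    "LR (Ceiling)": [79.8, 86.1, 78.0, 84.6, 89.8, 90.7],
}
# Average column as reported in the table.
reported = {
    "0-shot": 62.8,
    "Calibrated 0-shot": 67.2,
    "CCS": 71.2,
    "CCS (All Data)": 71.5,
    "LR (Ceiling)": 83.7,
}

def mean_excluding(scores, excluded="T0"):
    # Average over every model except the excluded one.
    kept = [s for m, s in zip(models, scores) if m != excluded]
    return sum(kept) / len(kept)

recomputed = {method: mean_excluding(scores) for method, scores in accuracy.items()}
gap = recomputed["CCS"] - recomputed["Calibrated 0-shot"]

if __name__ == "__main__":
    for method, value in recomputed.items():
        print(f"{method}: recomputed {value:.2f}, reported {reported[method]}")
    print(f"CCS minus calibrated 0-shot: {gap:.2f} points")
```

Under this reading, the headline gain of about 4 points is the gap between CCS and the calibrated zero-shot baseline; against the uncalibrated zero-shot row the gap is larger.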

This review was generated by AI and reviewed by a human editor.