QUICK REVIEW

[Paper Review] The Internal State of an LLM Knows When It's Lying

Amos Azaria, Tom M. Mitchell | arXiv (Cornell University) | 2023. 04. 26.

Topic Modeling | Citations: 15

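One-Line Summary

This paper presents SAPLMA, a lightweight classifier that predicts whether a statement is true from an LLM's hidden-layer activations, and that outperforms prompting baselines across multiple topics and models.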

ABSTRACT

While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Experiments demonstrate that given a set of test sentences, of which half are true and half false, our trained classifier achieves an average of 71% to 83% accuracy labeling which sentences are true versus false, depending on the LLM base model. Furthermore, we explore the relationship between our classifier's performance and approaches based on the probability assigned to the sentence by the LLM. We show that while LLM-assigned sentence probability is related to sentence truthfulness, this probability is also dependent on sentence length and the frequencies of words in the sentence, resulting in our trained classifier providing a more reliable approach to detecting truthfulness, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.

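Motivation and Objectives

Measure and quantify the risk of false information that LLMs deliver in a confident tone.
Propose SAPLMA, a method that extracts a truthfulness signal from an LLM's internal state without fine-tuning the LLM.
Evaluate SAPLMA across a range of topics and architectures to assess its generalizability and robustness.
Release a true/false dataset and demonstrate integration with LLM systems to show practical applicability.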

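Proposed Method

Train a simple three-layer feedforward classifier on the LLM's hidden-layer activations.
Evaluate several candidate layers as classifier input, including the last layer, the 28th, 24th, and 20th layers, and the middle layer.
Using a dataset of true/false statements spanning six topics, train on every topic except the one held out for testing.
Compare SAPLMA against baselines that include BERT embeddings and few-shot prompts.
Evaluate the internal truthfulness signal on statements generated by the LLM itself.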
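
The train-on-all-topics-but-one protocol above can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: the activation vectors are synthetic stand-ins generated in code (no LLM is queried), the topic names are hypothetical, and a single-layer logistic-regression probe replaces the three-layer feedforward classifier so the example stays short and dependency-free.

```python
import math
import random


def make_synthetic_dataset(topics, n_per_topic=40, dim=4, seed=0):
    """Return (topic, vector, label) rows that stand in for real activations.

    Assumption for illustration only: dimension 0 carries a truthfulness
    signal and dimension 1 carries a topic-specific offset.
    """
    rng = random.Random(seed)
    rows = []
    for t_idx, topic in enumerate(topics):
        for i in range(n_per_topic):
            label = i % 2  # half true (1), half false (0)
            vec = [rng.gauss(0.0, 0.3) for _ in range(dim)]
            vec[0] += 1.0 if label == 1 else -1.0
            vec[1] += 0.5 * t_idx
            rows.append((topic, vec, label))
    return rows


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(rows, epochs=100, lr=0.1, seed=0):
    """Fit a logistic-regression probe with plain SGD (simplified stand-in)."""
    rng = random.Random(seed)
    dim = len(rows[0][1])
    w, b = [0.0] * dim, 0.0
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for _, x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict_proba(probe, x):
    """Probability that the statement behind activation x is true."""
    w, b = probe
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def leave_one_topic_out(rows):
    """Train on every topic except one, test on the held-out topic."""
    accuracy = {}
    for held_out in sorted({topic for topic, _, _ in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        probe = train_probe(train)
        correct = sum(
            (predict_proba(probe, x) >= 0.5) == (y == 1) for _, x, y in test
        )
        accuracy[held_out] = correct / len(test)
    return accuracy


TOPICS = ["topic_a", "topic_b", "topic_c"]  # hypothetical topic names
rows = make_synthetic_dataset(TOPICS)
accuracy_by_topic = leave_one_topic_out(rows)
print(accuracy_by_topic)
```

Because the probe never sees the held-out topic during training, each reported accuracy reflects transfer to an unseen topic rather than memorization of topic-specific cues. With real activations, the rows would instead come from a chosen hidden layer of the LLM.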

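Experimental Results

Research Questions

RQ1: Can an LLM's hidden-layer activations reveal whether a statement is true?
RQ2: How does SAPLMA compare with prompting baselines at truth detection across topics and model families?
RQ3: Which hidden-layer representation best encodes the truthfulness signal for different LLMs?
RQ4: Does SAPLMA's performance hold up when generalizing to topics not seen during training?
RQ5: How does SAPLMA perform on statements generated by the LLM itself, compared with externally sourced true/false data?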

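Key Results

SAPLMA achieves 60% to 80% accuracy on held-out topics with OPT-6.7b and 70% to 90% accuracy with LLAMA2-7b.
SAPLMA consistently outperforms the BERT-embedding and few-shot prompt baselines across all six topics.
For OPT-6.7b, the 20th layer often gives the best results, while LLAMA2-7b favors the middle or higher layers depending on topic and setting.
SAPLMA trained on the 20th layer of OPT-6.7b reaches an average accuracy of 86.4%, suggesting that a representation of truth inside the LLM can be detected.
The probability the LLM assigns to a whole sentence is strongly affected by syntax and length, whereas SAPLMA's sigmoid output aligns better with truthfulness (e.g., on a set of 14 held-out statements).
Applied to sentences generated by the LLM itself, SAPLMA still outperforms the baselines, though absolute accuracy is sometimes lower than on externally sourced true/false data (in the 70% range in some settings).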
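
The dependence of sentence probability on length and word frequency follows from how an autoregressive model scores text: the sentence log-probability is the sum of per-token log-probabilities. The sketch below illustrates this with made-up per-token probabilities (assumptions for illustration, not values from the paper); the per-token average is included only to show that the gap comes from length, and is not a method described in this review.

```python
import math


def sentence_log_prob(token_probs):
    """Sum of per-token log-probabilities, i.e. log P(sentence)."""
    return sum(math.log(p) for p in token_probs)


def mean_token_log_prob(token_probs):
    """Average per-token log-probability (illustrative comparison only)."""
    return sentence_log_prob(token_probs) / len(token_probs)


# Hypothetical per-token probabilities; no model is queried here.
short_sentence = [0.5] * 5            # 5 tokens, each moderately likely
long_sentence = [0.5] * 15            # same per-token confidence, 15 tokens
rare_word_sentence = [0.5] * 4 + [0.01]  # 5 tokens, one infrequent word

for name, probs in [
    ("short", short_sentence),
    ("long", long_sentence),
    ("rare word", rare_word_sentence),
]:
    print(name, round(sentence_log_prob(probs), 3), round(mean_token_log_prob(probs), 3))
```

The long sentence receives a much lower total log-probability than the short one even though every token is equally likely, and a single rare word lowers the score without saying anything about truth. This is the kind of confound that a classifier reading hidden activations does not inherit by construction.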


This review was generated by AI and reviewed by a human editor.