QUICK REVIEW

[Paper Review] Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Shiyuan Huang, Siddarth Mamidanna | arXiv (Cornell University) | 2023-10-17

Topic Modeling | Citations: 19

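One-Line Summary

This paper investigates how ChatGPT's self-explanations for sentiment analysis compare with traditional feature attribution methods, using faithfulness and agreement metrics. It finds comparable faithfulness but large disagreement between methods, and it exposes limitations of prompting.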

ABSTRACT

Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.

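Research Motivation and Goals

Evaluate whether the self-explanations an LLM generates for sentiment analysis are faithful to the model's own predictions.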

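Proposed Method

Presents two self-explanation paradigms: explain-then-predict (E-P) and predict-then-explain (P-E)
Generates either full word-level attributions or the top-k words as the explanation
Compares against the traditional explanation methods occlusion and LIME
Evaluates with faithfulness metrics (comprehensiveness, sufficiency, DF MIT, DF Frac, Rank Del) and disagreement metrics
Examines top-k explanations and their metrics
Qualitatively analyzes how ChatGPT explanations differ from saliency values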
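
To make the occlusion baseline concrete, here is a minimal sketch of occlusion-style saliency. It is not the paper's implementation: the lexicon-based classifier, the `LEXICON` weights, and the helper names are all hypothetical stand-ins for a real sentiment model, and occlusion is implemented here as word deletion (replacing the word with a mask token is another common choice).

```python
import math

# Hypothetical toy lexicon standing in for a real sentiment model.
LEXICON = {"fantastic": 2.0, "memorable": 1.5, "boring": -2.0, "dull": -1.5}


def predict_positive_prob(words):
    """Toy classifier: sigmoid of the summed lexicon scores."""
    score = sum(LEXICON.get(w.lower(), 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_saliency(words):
    """Saliency of word i = drop in P(positive) when word i is deleted."""
    base = predict_positive_prob(words)
    return [
        base - predict_positive_prob(words[:i] + words[i + 1:])
        for i in range(len(words))
    ]


def top_k(words, saliency, k):
    """Return the k words with the largest absolute saliency."""
    order = sorted(range(len(words)), key=lambda i: abs(saliency[i]), reverse=True)
    return [words[i] for i in order[:k]]


review = "a fantastic and memorable film".split()
sal = occlusion_saliency(review)
print(top_k(review, sal, 2))
```

Each word costs one extra model call, which is why occlusion becomes expensive with an LLM, whereas a self-explanation is produced in the same response as the prediction.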

Figure 1: An overview of our investigation. Current conversational LLMs can explain their answers (e.g., by highlighting important words in the input), often automatically or at least when asked to. How should we think of these self-explanations? In this paper, we study them in relationship to trad

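Experimental Results

Research Questions

RQ1 Do LLM-generated self-explanations faithfully support the model's predictions in sentiment analysis?
RQ2 How do ChatGPT's self-explanations compare with traditional attribution methods (occlusion, LIME) in terms of faithfulness and agreement?
RQ3 How does the choice between explain-then-predict and predict-then-explain affect accuracy and faithfulness?
RQ4 Are the LLM's top-k explanations competitive with full attributions in terms of faithfulness and cost?
RQ5 What practical implications follow for interpretability workflows in the LLM era?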

Key Results

Comprehensiveness↑	Sufficiency↓	DF MIT↑	DF Frac↓	Rank Del↑
E-P (Accuracy: 85%)	0.15	0.26	0.18	-0.00
LIME	0.17	0.22	0.13	-0.02
SelfExp	0.19	0.25	0.16	-0.03
P-E (Accuracy: 88%)	0.20	0.23	0.14	-0.02
LIME	0.27	0.20	0.10	-0.02
SelfExp	0.27	0.22	0.07	-0.01
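
The first two columns can be illustrated with a short sketch. It assumes the common definitions (comprehensiveness is the probability drop after removing the words marked important, sufficiency is the drop when keeping only those words); the paper's exact variant may differ, for example by averaging over several removal fractions. The toy classifier and the index sets below are hypothetical.

```python
import math

# Hypothetical toy lexicon standing in for a real sentiment model.
LEXICON = {"fantastic": 2.0, "memorable": 1.5, "boring": -2.0}


def prob_positive(words):
    """Toy classifier: sigmoid of the summed lexicon scores."""
    score = sum(LEXICON.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def comprehensiveness(words, important_idx):
    """Drop in P(positive) after removing the important words (higher is better)."""
    kept = [w for i, w in enumerate(words) if i not in important_idx]
    return prob_positive(words) - prob_positive(kept)


def sufficiency(words, important_idx):
    """Drop in P(positive) when keeping only the important words (lower is better)."""
    kept = [w for i, w in enumerate(words) if i in important_idx]
    return prob_positive(words) - prob_positive(kept)


review = "a fantastic and memorable film".split()
good = {1, 3}  # explanation pointing at "fantastic" and "memorable"
bad = {0, 2}   # explanation pointing at "a" and "and"

print(comprehensiveness(review, good), sufficiency(review, good))
print(comprehensiveness(review, bad), sufficiency(review, bad))
```

The example review is predicted positive, so the positive-class probability is used throughout; in general the probability of the originally predicted class would take its place.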

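Self-explanations perform on par with traditional methods on faithfulness metrics, but the methods disagree substantially with one another
Explain-then-predict slightly hurts prediction accuracy (85%), while predict-then-explain is slightly higher at 88%; both fall below the model without explanations (92%)
LIME and occlusion tend to be costly to compute, or to be misaligned with the LLM's self-explanations on certain metrics
Top-k explanations are not always better than full attributions, and their effectiveness can vary with the task and the prompt
High disagreement between explanation methods persists even when faithfulness is similar, which points to limitations in current evaluation pipelines
ChatGPT explanations tend to produce saliency values at a few round levels (e.g., 0.5, 0.75) rather than fine-grained scores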
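
The disagreement and coarse-saliency findings can be sketched as follows. This is an illustration only: top-k feature agreement is one commonly used agreement measure and is assumed here rather than taken from the paper, and both attribution vectors are made-up values, not results from the study.

```python
def top_k_indices(saliency, k):
    """Indices of the k largest absolute saliency values (ties broken by position)."""
    order = sorted(range(len(saliency)), key=lambda i: abs(saliency[i]), reverse=True)
    return set(order[:k])


def feature_agreement(sal_a, sal_b, k):
    """Fraction of the top-k features shared by two attribution vectors."""
    return len(top_k_indices(sal_a, k) & top_k_indices(sal_b, k)) / k


def distinct_levels(saliency):
    """Number of distinct saliency values; small counts indicate coarse scores."""
    return len(set(saliency))


# Made-up values for illustration only; not taken from the paper.
lime_like = [0.02, 0.41, -0.03, 0.27, 0.11]    # fine-grained scores
self_exp_like = [0.0, 0.75, 0.0, 0.5, 0.75]    # a few round levels

print(feature_agreement(lime_like, self_exp_like, 2))
print(distinct_levels(lime_like), distinct_levels(self_exp_like))
```

With only a few distinct levels, many words tie, so rank-based comparisons against a fine-grained method depend on how ties are broken.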

Figure 2: Visualization of one explanation each for E-P and P-E model. The top-$k$ explanations

This review was generated by AI and reviewed by a human editor.