QUICK REVIEW

[Paper Review] From Understanding to Utilization: A Survey on Explainability for Large Language Models

Haoyan Luo, Lucia Specia | arXiv (Cornell University) | Jan 23, 2024

Topic Modeling | Cited by 14

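One-sentence summary

This survey categorizes explainability methods for pre-trained Transformer-based large language models, distinguishing local from global analysis, and outlines how explanations can improve reliability, editing, and alignment. It also discusses evaluation methods and future directions.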

ABSTRACT

Explainability for Large Language Models (LLMs) is a critical yet challenging aspect of natural language processing. As LLMs are increasingly integral to diverse applications, their "black-box" nature sparks significant concerns regarding transparency and ethical use. This survey underscores the imperative for increased explainability in LLMs, delving into both the research on explainability and the various methodologies and tasks that utilize an understanding of these models. Our focus is primarily on pre-trained Transformer-based LLMs, such as LLaMA family, which pose distinctive interpretability challenges due to their scale and complexity. In terms of existing methods, we classify them into local and global analyses, based on their explanatory objectives. When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement. Additionally, we examine representative evaluation metrics and datasets, elucidating their advantages and limitations. Our goal is to reconcile theoretical and empirical understanding with practical implementation, proposing exciting avenues for explanatory techniques and their applications in the LLMs era.

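Research motivation and objectives

Motivate the need for LLM explainability on grounds of transparency, trust, and ethics.
Categorize existing explainability methods for LLMs into local analysis and global analysis.
Discuss applications of explainability in model editing, capability enhancement, and controlled generation.
Highlight the evaluation metrics and datasets used to assess the quality and usefulness of explanations.
Identify open problems and future directions for connecting theory and practice in LLM explainability.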

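Proposed approach

Divide explainability methods into local analysis (feature attribution, Transformer component analysis) and global analysis (probing, mechanistic interpretability).
Describe local methods: perturbation-based, gradient-based, and vector-based attribution, integrated gradients, attention-based analysis, and FFN/decomposition techniques.
Describe global methods: probing of knowledge and representations, and mechanistic interpretability, including circuit discovery and causal tracing.
Survey how explanations are used for model editing, long-context utilization, and improving in-context learning (ICL).
Outline evaluation strategies for the faithfulness of explanations and the truthfulness of model outputs, including datasets such as ZsRE and CounterFact and the TruthfulQA metrics.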
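
To make the gradient-based attribution family more concrete, here is a minimal sketch of integrated gradients on a toy model. It is not taken from the survey: the three-feature logistic `model`, its `WEIGHTS` and `BIAS`, the all-zeros baseline, and the step count are all assumptions chosen for illustration. A real LLM would need automatic differentiation over token embeddings instead of the hand-written gradient used here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy "model": a logistic score over three input features.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.25

def model(x):
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)

def model_grad(x):
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients.

    Averages the gradient along the straight path from `baseline` to `x`,
    then scales each component by (x_i - baseline_i).
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]  # assumed "uninformative" reference input
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to roughly f(x) - f(baseline).
gap = abs(sum(attributions) - (model(x) - model(baseline)))
print(attributions, gap)
```

The completeness check at the end is one simple sanity test for an attribution method: if the per-feature scores do not add up to the change in model output, the approximation (or the gradient) is off.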

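Experimental results

Research questions

RQ1: Which explainability methods apply to pre-trained Transformer-based LLMs, and how do they differ in scope and granularity?
RQ2: How can local and global explanations be used to improve model transparency, reliability, and downstream task performance?
RQ3: Which evaluation strategies and datasets effectively assess the quality and usefulness of LLM explanations?
RQ4: How can explainability guide model editing, long-context utilization, and controllable generation in LLMs?
RQ5: What open challenges and future research directions exist in LLM explainability?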

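Key findings

Local analysis covers feature attribution, including gradient-based and vector-based methods, for explaining token-level predictions.
Global analysis covers probing-based techniques and mechanistic interpretability methods such as circuit discovery and causal tracing.
Explainability can guide model editing (locate-then-edit) and improve tasks such as long-context utilization and in-context learning.
Evaluation of explanations rests on faithfulness, truthfulness, and usefulness, with datasets including ZsRE, CounterFact, and TruthfulQA.
The survey identifies limitations of current methods and outlines future research directions toward trustworthy and aligned LLMs.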
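
The "locate" step of locate-then-edit is often done with causal tracing, and the idea can be sketched on a toy network. The example below is a simplified illustration, not the survey's or any library's implementation: the two-unit network, its weights, and the hand-picked corrupted input are hypothetical. Published causal tracing corrupts subject-token embeddings with noise and restores hidden states inside a real Transformer; here a single hidden unit is restored instead.

```python
import math

# Hypothetical toy network: two hidden units feeding one scalar output.
W_HIDDEN = [[1.5, 0.0],   # unit 0 reads only feature 0 (the "subject")
            [0.0, 1.0]]   # unit 1 reads only feature 1
W_OUT = [2.0, 0.5]

def hidden(x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]

def output(h):
    return sum(w * hi for w, hi in zip(W_OUT, h))

def causal_trace(clean_x, corrupted_x):
    """For each hidden unit, return the fraction of the clean output that is
    recovered when only that unit is restored to its clean activation."""
    clean_h = hidden(clean_x)
    corrupted_h = hidden(corrupted_x)
    clean_out = output(clean_h)
    corrupted_out = output(corrupted_h)
    effects = []
    for j in range(len(clean_h)):
        patched = list(corrupted_h)
        patched[j] = clean_h[j]  # restore one unit from the clean run
        recovered = output(patched) - corrupted_out
        effects.append(recovered / (clean_out - corrupted_out))
    return effects

clean_x = [1.0, 0.5]
corrupted_x = [-0.3, 0.5]  # feature 0 replaced by a stand-in for noise
effects = causal_trace(clean_x, corrupted_x)
print(effects)
```

A unit whose restoration recovers most of the clean output is a candidate location for an edit. In this toy setup, only unit 0 carries the corrupted feature, so it should account for the whole effect.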


This review was generated by AI and reviewed by human editors.