QUICK REVIEW

[Paper Review] Explainability for Large Language Models: A Survey

Haiyan Zhao, Hanjie Chen|arXiv (Cornell University)|Sep 2, 2023
Topic Modeling | Cited by 18
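One-Sentence Summary

This paper surveys explainability techniques for Transformer-based LLMs, organizes the methods by the traditional fine-tuning and prompting paradigms, and discusses evaluation, debugging, and future challenges.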

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: traditional fine-tuning-based paradigm and prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations, and discuss how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional machine learning models.

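Motivation and Objectives

  • Provide a taxonomy of explainability techniques for LLMs.
  • Organize explanations by training paradigm (fine-tuning versus prompting).
  • Summarize local and global explanation methods for each paradigm.
  • Discuss metrics for evaluating explanations and how explanations are applied to debug and improve models.
  • Highlight challenges and future directions in LLM explainability.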

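Proposed Approach

  • Propose a taxonomy of explainability techniques for Transformer-based LLMs.
  • Categorize methods by the traditional fine-tuning paradigm and the prompting paradigm.
  • Summarize, for each paradigm, local explanations (feature attribution, attention-based, example-based, natural language) and global explanations (probing, neuron activation, concept-based).
  • Review evaluation metrics for explanations and their applicability.
  • Discuss debugging, performance improvement, and future research directions in explainability.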
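
Feature attribution, the first local explanation family named in the list above, can be illustrated with a leave-one-out (occlusion) sketch. This is a minimal illustration, not the survey's own method: the scoring function `toy_sentiment_score` and its word weights are hypothetical stand-ins for a real fine-tuned classifier, and `leave_one_out_attribution` is an illustrative name.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "model": scores a sentence by summing fixed word weights.
# A real fine-tuned LLM classifier would replace this function.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -1.5, "terrible": -2.5}


def toy_sentiment_score(tokens: List[str]) -> float:
    """Return a sentiment score; unknown words contribute 0."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


def leave_one_out_attribution(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Attribute the score to each token as the score drop when it is removed."""
    base = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        perturbed = tokens[:i] + tokens[i + 1:]  # input with token i deleted
        attributions.append((tok, base - score_fn(perturbed)))
    return attributions


if __name__ == "__main__":
    sentence = "the movie was great but the ending was bad".split()
    for tok, attr in leave_one_out_attribution(sentence, toy_sentiment_score):
        print(f"{tok:>8s} {attr:+.1f}")
```

A positive attribution means removing the token lowers the score, so the token pushed the prediction up. Because the sketch only queries the model's output, it is model-agnostic, but it needs one forward pass per token.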
Figure 1: We categorize LLM explainability into two major paradigms. Based on this categorization, we summarize different kinds of explainability techniques associated with LLMs belonging to these two paradigms. We also discuss evaluations for the generated explanations under the two paradigms.

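Experimental Results

Research Questions

  • RQ1: How can explainability techniques for LLMs be categorized systematically?
  • RQ2: Which local and global explanation methods exist for fine-tuned and prompted LLMs?
  • RQ3: Which metrics evaluate the quality and usefulness of explanations?
  • RQ4: How can explanations be used to debug models and improve performance?
  • RQ5: Compared with conventional deep learning (DL) models, what key challenges and opportunities does LLM explainability face?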

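Key Findings

  • The survey provides a taxonomy of LLM explainability techniques, divided into the traditional fine-tuning paradigm and the prompting paradigm.
  • Under each paradigm, it reviews local explanations (feature attribution, attention-based, example-based, natural language) and global explanations (probing, neuron activation, concept-based).
  • It discusses metrics for evaluating generated explanations and their applicability under each paradigm.
  • It covers how explanations can be used to debug models and improve performance.
  • It also identifies the challenges and emerging opportunities of LLM explainability relative to conventional DL models.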
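
Probing, one of the global explanation families listed above, trains a simple classifier on frozen model representations to test whether a property is linearly recoverable from them. The sketch below is a toy illustration under loud assumptions: the "hidden states" are synthetic vectors in which dimension 0 encodes a binary property by construction, the probe is a plain perceptron, and all function names are hypothetical. With a real model, the vectors would be extracted from a chosen layer instead.

```python
import random
from typing import List, Tuple

Vector = List[float]
Dataset = List[Tuple[Vector, int]]


def make_synthetic_representations(n: int, rng: random.Random) -> Dataset:
    """Stand-in for frozen hidden states: dimension 0 encodes a binary property."""
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        vec = [label + rng.uniform(-0.3, 0.3)]                # informative dimension
        vec += [rng.uniform(-1.0, 1.0) for _ in range(3)]     # uninformative noise
        data.append((vec, label))
    return data


def train_linear_probe(data: Dataset, epochs: int = 50) -> Vector:
    """Fit a perceptron probe; the last weight is the bias. Inputs stay frozen."""
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for vec, label in data:
            features = vec + [1.0]
            activation = sum(w * x for w, x in zip(weights, features))
            if label * activation <= 0:  # misclassified: move toward the example
                weights = [w + label * x for w, x in zip(weights, features)]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: probe has converged
            break
    return weights


def probe_accuracy(weights: Vector, data: Dataset) -> float:
    """Fraction of examples whose property the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        activation = sum(w * x for w, x in zip(weights, vec + [1.0]))
        correct += (1 if activation > 0 else -1) == label
    return correct / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_synthetic_representations(200, rng)
    held_out = make_synthetic_representations(100, rng)
    probe = train_linear_probe(train)
    print("held-out probe accuracy:", probe_accuracy(probe, held_out))
```

High held-out accuracy suggests the property is linearly decodable from the representations. It does not by itself show that the model uses that property, which is why probing results are usually read with caution.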
Figure 2: LLMs undergo unsupervised pre-training with random initialization to create a base model. The base model can then be fine-tuned through instruction tuning and RLHF to produce the assistant model.


This review was generated by AI and reviewed by human editors.