QUICK REVIEW

[Paper Review] Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models

Chirag Agarwal, Sree Harsha Tanneru|arXiv (Cornell University)|Feb 7, 2024

Topic Modeling | Cited by 11

One-Sentence Summary

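This paper discusses the dichotomy between faithfulness and plausibility in self-explanations generated by large language models. It argues for evaluating and improving faithfulness, especially in high-stakes applications, without sacrificing plausibility where plausibility is appropriate.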

ABSTRACT

Large Language Models (LLMs) are deployed as powerful tools for several natural language processing (NLP) applications. Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical in LLMs employed for high-stakes decision-making. Moreover, we emphasize the need for a systematic characterization of faithfulness-plausibility requirements of different real-world applications and ensure explanations meet those needs. While there are several approaches to improving plausibility, improving faithfulness is an open challenge. We call upon the community to develop novel methods to enhance the faithfulness of self explanations thereby enabling transparent deployment of LLMs in diverse high-stakes settings.

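Motivation and Objectives

Understand how far self-explanations reflect the model's actual reasoning as opposed to reasoning that merely looks human-like
Define and distinguish plausibility and faithfulness in LLM explanations
Survey existing methods for generating and evaluating self-explanations in LLMs
Highlight the implications for high-stakes domains and their need for faithfulness
Propose directions and open challenges for improving faithfulness without sacrificing usability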

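Proposed Approach

Survey self-explanation techniques such as chain-of-thought, token importance, and counterfactual explanations
Give formal definitions of plausibility and faithfulness
Discuss methods that evaluate faithfulness using counterfactual inputs and post-hoc interventions
Analyze how training objectives and RLHF lead to an overemphasis on plausibility
Propose research directions: fine-tuning, in-context learning, and mechanistic interpretability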
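
The counterfactual-input idea listed above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and does not come from the paper: `toy_classifier` is a hypothetical stand-in for an LLM, the two token lists stand in for token-importance self-explanations, and the check is a simple deletion test (remove the tokens the explanation names and see whether the prediction changes).

```python
from typing import Callable, List

# Hypothetical vocabulary that drives the toy model's decision.
NEGATIVE_WORDS = {"terrible", "awful", "boring"}


def toy_classifier(tokens: List[str]) -> str:
    """Stand-in for an LLM: predicts 'negative' if any negative word appears."""
    return "negative" if any(t in NEGATIVE_WORDS for t in tokens) else "positive"


def deletion_flips_prediction(
    model: Callable[[List[str]], str],
    tokens: List[str],
    claimed_important: List[str],
) -> bool:
    """Build a counterfactual input by deleting the tokens the explanation
    names as important, then report whether the prediction changes."""
    original = model(tokens)
    counterfactual = [t for t in tokens if t not in claimed_important]
    return model(counterfactual) != original


tokens = "the plot was terrible but the acting was fine".split()

# An explanation that names the token the toy model actually relies on.
faithful_explanation = ["terrible"]
# An explanation that sounds reasonable to a reader but names tokens
# the toy model never uses.
plausible_explanation = ["plot", "acting"]

print(deletion_flips_prediction(toy_classifier, tokens, faithful_explanation))   # True
print(deletion_flips_prediction(toy_classifier, tokens, plausible_explanation))  # False
```

In this toy setting the second explanation is plausible yet unfaithful: deleting the tokens it cites leaves the prediction unchanged. With a real black-box LLM the same test is only a proxy, since a deleted token can change the input in ways unrelated to the explanation.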

Results

Research Questions

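RQ1: In LLM self-explanations, how do plausibility and faithfulness differ, and how do they affect trust and reliability?
RQ2: How do current methods measure faithfulness, and what are their limitations for black-box LLMs?
RQ3: In which applications does plausibility or faithfulness matter more, and how should explanations be tailored to the use case?
RQ4: Which strategies can improve faithfulness without unduly sacrificing plausibility?
RQ5: What future benchmarks and methods are needed to evaluate and improve the faithfulness of LLM explanations?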

Key Findings

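LLMs can generate plausible explanations that agree with human reasoning, yet these may not reflect the model's actual reasoning process
Current metrics for measuring faithfulness are limited, and there is no unified standard for the faithfulness of self-explanations
The overemphasis on plausibility that results from training objectives such as RLHF can harm faithfulness in high-stakes settings
Faithfulness is evaluated by simulating counterfactual inputs and intervening on explanations, but the results show these tests are limited in identifying the model's true reasoning
In high-stakes domains, faithfulness is critical, whereas plausibility may be preferred in educational or interactive settings; explanations should be tailored to the needs of the application
The paper calls for reliable metrics, benchmarks, and methods for generating more faithful self-explanations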
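
The "intervening on explanations" test mentioned in the findings can be sketched as a truncation check on a chain of thought: cut the stated reasoning short and see whether the answer changes. This is a minimal sketch under stated assumptions, not the paper's procedure. Both models are hypothetical toys, and `truncation_sensitivity` is a name introduced here for illustration.

```python
from typing import Callable, List

# A model takes a question plus (possibly truncated) reasoning steps
# and returns an answer string.
Model = Callable[[str, List[str]], str]


def reasoning_dependent_model(question: str, steps: List[str]) -> str:
    """Hypothetical model that reads its answer off the last reasoning step."""
    if not steps:
        return "unknown"
    return steps[-1].split("=")[-1].strip()


def reasoning_ignoring_model(question: str, steps: List[str]) -> str:
    """Hypothetical model whose answer is fixed no matter what reasoning it shows."""
    return "9"


def truncation_sensitivity(model: Model, question: str, steps: List[str]) -> float:
    """Fraction of truncation points (keeping 0 .. len(steps)-1 steps) at which
    the answer differs from the answer given with the full reasoning."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = model(question, steps)
    changed = sum(model(question, steps[:k]) != full_answer for k in range(len(steps)))
    return changed / len(steps)


question = "What is 2 + 3 + 4?"
steps = ["2 + 3 = 5", "5 + 4 = 9"]

print(truncation_sensitivity(reasoning_dependent_model, question, steps))  # 1.0
print(truncation_sensitivity(reasoning_ignoring_model, question, steps))   # 0.0
```

Both toy models show the same plausible reasoning and give the same final answer, yet only the first one's answer depends on that reasoning. A low score hints that the explanation may be post hoc, but as the findings note, such interventions are limited evidence about what the model actually computes.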


This review was generated by AI and reviewed by human editors.