QUICK REVIEW

[Paper Review] Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Zhengxuan Wu, Atticus Geiger|arXiv (Cornell University)|May 15, 2023

Explainable Artificial Intelligence (XAI) | Cited by 8

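One-Sentence Summary

The paper scales Distributed Alignment Search (DAS) to large LLMs with Boundless DAS and shows that Alpaca (7B) solves a numerical reasoning task by implementing a simple causal model with two boolean variables, and that the alignment is robust to changes in inputs and instructions.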

ABSTRACT

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at `https://github.com/stanfordnlp/pyvene`.

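Motivation and Objectives

Obtain explanations of large language models that are both interpretable and causally faithful, to improve safety and trustworthiness.
Propose Boundless DAS, which scales causal-abstraction-based interpretability to LLMs.
Show that Alpaca solves a numerical reasoning task by implementing a simple, interpretable causal model.
Evaluate how robust and transferable the discovered alignments are across instructions, inputs, contexts, and output formats.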

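Proposed Method

Extend Distributed Alignment Search (DAS) by learning boundary masks alongside the rotation, which makes the alignment search scalable (Boundless DAS).
Introduce learnable boundary indices b_j that determine the subspace Y_j of dimensions aligned with each high-level variable Z_j, so the dimensionality of each representation is set automatically.
Use SoftDII and weighted interventions to approximate and optimize interchange intervention accuracy (IIA) with respect to the ground-truth causal model.
Optimize a cross-entropy objective (Eq. 5) that aligns neural representations with causal variables while gradually annealing the boundary masks (temperature β).
Evaluate alignment quality with IIA, comparing against a random baseline and against task performance.
Apply Boundless DAS to Alpaca-7B on a price-tagging numerical reasoning task posed in natural language, and test generalization across instructions, contexts, and output formats.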
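
The boundary-mask and rotation steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the product-of-sigmoids mask, and the plain-list linear algebra are assumptions chosen for illustration, and in the actual method the rotation and the boundaries would be trained by gradient descent rather than fixed by hand. The idea shown is that a soft mask with temperature `beta` selects which rotated dimensions are copied from a source representation into a base representation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def boundary_mask(dim, lo, hi, beta):
    """Soft mask: close to 1 for positions between lo and hi, close to 0 elsewhere.

    lo and hi stand in for learnable boundaries (hypothetical names);
    beta is the temperature, and annealing it toward 0 sharpens the mask.
    """
    return [sigmoid((i - lo) / beta) * sigmoid((hi - i) / beta) for i in range(dim)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def soft_interchange(base, source, rotation, mask):
    """Rotate both representations, blend them with the mask, rotate back.

    rotation is assumed to be orthogonal, so its transpose is its inverse.
    """
    rotated_base = matvec(rotation, base)
    rotated_source = matvec(rotation, source)
    mixed = [(1.0 - w) * b + w * s
             for w, b, s in zip(mask, rotated_base, rotated_source)]
    return matvec(transpose(rotation), mixed)

# Example: identity rotation, nearly hard mask over the first two dimensions.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
mask = boundary_mask(4, lo=-0.5, hi=1.5, beta=0.01)
out = soft_interchange([1.0, 2.0, 3.0, 4.0], [9.0, 8.0, 7.0, 6.0], identity, mask)
# out is approximately [9, 8, 3, 4]: the masked dimensions come from the source.
```

With a large `beta` the mask is smooth and every dimension receives a partial blend, which keeps the boundaries differentiable; as `beta` shrinks, the intervention approaches a hard swap of a contiguous block of rotated dimensions.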

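Experimental Results

Research Questions

RQ1 Can Boundless DAS scale causal interpretability to large LLMs such as Alpaca (7B)?
RQ2 Do internal representations align with interpretable causal variables in a way that is robust and generalizes across contexts and instructions?
RQ3 What is the nature of the causal mechanism Alpaca uses to solve the numerical reasoning task?
RQ4 Do the learned alignments transfer to new brackets, inserted contexts, and modified outputs?

Key Findings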

Experiment	Task Acc.	IIA max	Correlation
Left Boundary (♣)	0.85	0.90	1.00
Left and Right Boundary (♥)	0.85	0.86	1.00
Mid-point Distance	0.85	0.70	1.00
Bracket Identity	0.85	0.72	1.00
Correct Only	1.00 †	0.88	0.99 ( ♥ )
Incorrect Only	0.00 †	0.71	0.84 ( ♥ )
New Bracket (Seen)	0.94	0.94	0.97 ( ♣ )
New Bracket (Unseen)	0.95	0.95	0.94 ( ♣ )
Irrelevant Contexts	0.84	0.83	0.99 ( ♥ )
Sibling Instructions	0.84	0.83	0.87 ( ♥ )
+ exclude top right	0.84	0.83	0.92 ( ♥ )

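Boundless DAS reveals that Alpaca solves the task with a simple causal model of two boolean variables (a left-boundary check and a right-boundary check).
IIA on the first two hypothesized models (0.90 and 0.86) matches or exceeds task accuracy (0.85), indicating that the proposed causal structure is faithful to the model's internal computation.
Alignments generalize well to unseen brackets, inserted prefixes, and different output formats, with only small drops in IIA, which demonstrates robustness.
Boundary learning shows that, in successful runs, each aligned variable uses only about 5–10% of the representation space.
Controls on incorrect outputs and on alternative causal models yield markedly lower IIA (0.70 to 0.72), supporting the fidelity of the identified alignments.
In the new-bracket settings, task accuracy is 0.94 to 0.95, and the learned alignments correlate strongly with the main results (0.94 to 0.97).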
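
The two-boolean causal model named in the findings above can be sketched as follows, together with the way counterfactual labels for interchange interventions and the IIA score could be computed. This is an illustrative sketch rather than the paper's code: the function names, the `(amount, lower, upper)` tuple layout, the inclusive comparisons, and the "Yes"/"No" output strings are all assumptions.

```python
def left_check(amount, lower):
    """Boolean variable 1: is the amount at or above the lower bound?"""
    return amount >= lower

def right_check(amount, upper):
    """Boolean variable 2: is the amount at or below the upper bound?"""
    return amount <= upper

def run_causal_model(amount, lower, upper, override=None):
    """Run the high-level model, optionally forcing some variables to given values."""
    variables = {
        "left": left_check(amount, lower),
        "right": right_check(amount, upper),
    }
    if override:
        variables.update(override)
    return "Yes" if variables["left"] and variables["right"] else "No"

def counterfactual_label(base, source, variable):
    """Interchange intervention on the high-level model.

    Compute `variable` on the source input, then run the model on the base
    input with that variable replaced by the source value.
    """
    s_amount, s_lower, s_upper = source
    source_values = {
        "left": left_check(s_amount, s_lower),
        "right": right_check(s_amount, s_upper),
    }
    return run_causal_model(*base, override={variable: source_values[variable]})

def interchange_intervention_accuracy(model_outputs, labels):
    """Fraction of intervened model outputs that match the counterfactual labels."""
    matches = sum(o == l for o, l in zip(model_outputs, labels))
    return matches / len(labels)

# Example inputs are (amount, lower bound, upper bound).
base = (5.0, 2.0, 8.0)     # inside the bracket, so the unintervened answer is "Yes"
source = (1.0, 2.0, 8.0)   # below the lower bound, so the left check is False
label_left = counterfactual_label(base, source, "left")
label_right = counterfactual_label(base, source, "right")
```

In an actual evaluation, `model_outputs` would be the language model's answers after the matching neural intervention; an IIA close to task accuracy then suggests that the aligned subspace plays the role of the swapped variable.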


This review was generated by AI and reviewed by human editors.