QUICK REVIEW

[Paper Review] Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Zhengxuan Wu, Atticus Geiger|arXiv (Cornell University)|May 15, 2023

Explainable Artificial Intelligence (XAI)|Citations: 8

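One-sentence summary

Summary: This paper scales Distributed Alignment Search (DAS) to large language models with Boundless DAS and shows that Alpaca (7B) solves a numerical reasoning task by implementing a simple causal model with two boolean variables, with alignments that are robust to changes in inputs and instructions.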

ABSTRACT

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at `https://github.com/stanfordnlp/pyvene`.

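Motivation and objectives

Motivates the need for interpretable, causally faithful explanations of large language models for safety and trustworthiness.
Proposes Boundless DAS to scale causal-abstraction-based interpretability to LLMs.
Demonstrates that Alpaca solves a numerical reasoning task through a simple, interpretable causal model.
Evaluates the robustness of the discovered alignments across instructions, inputs, contexts, and output formats.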

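Proposed method

Extends Distributed Alignment Search (DAS) by learning both the boundary masks and the rotation, which makes the alignment search scalable (Boundless DAS).
Introduces a learnable boundary index b_j for each high-level variable Z_j that determines the subspace Y_j, so the dimensionality of each aligned representation is chosen automatically.
Uses SoftDII and weighted interventions to approximate and optimize interchange intervention accuracy (IIA) with respect to the true causal model.
Optimizes a cross-entropy objective (Eq. (5)) that aligns neural representations with causal variables while gradually annealing the boundary mask β.
Evaluates alignment quality with IIA, comparing it against a random baseline and against task performance.
Applies Boundless DAS to Alpaca-7B on the price-tagging task, a numerical reasoning task posed in natural language, and tests generalization across instructions, contexts, and output formats.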
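
The boundary-mask and rotation steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the product-of-sigmoids mask, the function names (`boundary_mask`, `soft_interchange`, `givens_rotation`), the fixed rotation, and the toy vectors are all assumptions made for illustration. In the actual method the rotation and boundaries would be learned by gradient descent rather than set by hand.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def boundary_mask(dim, left, right, temperature):
    """Soft mask near 1 for rotated dimensions between the two boundaries
    and near 0 elsewhere (assumed parameterization, not taken from the paper)."""
    centers = [i + 0.5 for i in range(dim)]
    return [
        sigmoid((c - left) / temperature) * sigmoid((right - c) / temperature)
        for c in centers
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def soft_interchange(base, source, rotation, left, right, temperature):
    """Rotate both representations, blend them with the soft mask,
    then rotate the blended vector back to the original basis."""
    rotated_base = matvec(rotation, base)
    rotated_source = matvec(rotation, source)
    mask = boundary_mask(len(base), left, right, temperature)
    mixed = [
        (1.0 - m) * b + m * s
        for m, b, s in zip(mask, rotated_base, rotated_source)
    ]
    inverse = [list(col) for col in zip(*rotation)]  # transpose of an orthogonal matrix
    return matvec(inverse, mixed)


def givens_rotation(dim, i, j, angle):
    """Hypothetical stand-in for a learned orthogonal matrix."""
    r = [[1.0 if a == b else 0.0 for b in range(dim)] for a in range(dim)]
    c, s = math.cos(angle), math.sin(angle)
    r[i][i], r[i][j], r[j][i], r[j][j] = c, -s, s, c
    return r


base = [1.0, 2.0, 3.0, 4.0]
source = [10.0, 20.0, 30.0, 40.0]
rotation = givens_rotation(4, 0, 3, math.pi / 6)

# Lowering the temperature sharpens the mask, mimicking an annealing schedule.
for temperature in (1.0, 0.1, 0.01):
    patched = soft_interchange(base, source, rotation, 0.0, 2.0, temperature)
    print(temperature, [round(v, 3) for v in patched])
```

At high temperature every rotated dimension is partially mixed, while at low temperature the mask approaches a hard selection of the dimensions between the two boundaries. That is the intuition behind annealing: the search starts smooth and differentiable and ends close to a discrete subspace choice.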

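Experimental results

Research questions

RQ1: Can Boundless DAS scale causal interpretability to a large LLM such as Alpaca (7B)?
RQ2: Do internal representations align with interpretable causal variables in a robust, generalizable way across contexts and instructions?
RQ3: What is the nature of the causal mechanism Alpaca uses to solve the numerical reasoning task?
RQ4: Do the learned alignments transfer to new brackets, inserted contexts, and altered outputs?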

Key findings

Experiment	Task Acc.	IIA (max)	Correlation
Left Boundary (♣)	0.85	0.90	1.00
Left and Right Boundary (♥)	0.85	0.86	1.00
Mid-point Distance	0.85	0.70	1.00
Bracket Identity	0.85	0.72	1.00
Correct Only	1.00†	0.88	0.99 (♥)
Incorrect Only	0.00†	0.71	0.84 (♥)
New Bracket (Seen)	0.94	0.94	0.97 (♣)
New Bracket (Unseen)	0.95	0.95	0.94 (♣)
Irrelevant Contexts	0.84	0.83	0.99 (♥)
Sibling Instructions	0.84	0.83	0.87 (♥)
+ exclude top right	0.84	0.83	0.92 (♥)

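Boundless DAS reveals that Alpaca solves the task using a simple causal model with two boolean variables (a left-boundary check and a right-boundary check).
IIA for the top two hypothesized models (0.90 and 0.86) matches or exceeds task accuracy (0.85), indicating that the identified causal structure is faithful to the model's internals.
The alignments generalize to unseen brackets, inserted prefixes, and different output formats without a large drop in IIA, indicating robustness.
Boundary learning shows that, in successful runs, each aligned variable occupies only about 5–10% of the representation space.
IIA is markedly lower for the incorrect-output control and for the alternative models (0.70–0.72), which supports the faithfulness of the identified alignments.
Across the analyses, the new-bracket settings reach high task accuracy (0.94–0.95) and correlate strongly with the main results.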
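
The two-boolean-variable causal model and the IIA metric discussed above can be sketched as follows. This is only an illustrative reading of the findings: the function names, the inclusive boundary comparisons, the "Yes"/"No" labels, and the example inputs and model outputs are assumptions, not values from the paper.

```python
def boundary_variables(amount, lower, upper):
    """The two boolean variables: left-boundary check and right-boundary check.
    Inclusive comparisons are an assumption made for this sketch."""
    return amount >= lower, amount <= upper


def high_level_output(left_ok, right_ok):
    """The high-level model answers "Yes" only if both checks pass."""
    return "Yes" if (left_ok and right_ok) else "No"


def interchange(base, source, variable):
    """Counterfactual label when one boolean variable of the base input
    is replaced by its value on the source input."""
    base_left, base_right = boundary_variables(*base)
    source_left, source_right = boundary_variables(*source)
    if variable == "left":
        return high_level_output(source_left, base_right)
    if variable == "right":
        return high_level_output(base_left, source_right)
    raise ValueError("variable must be 'left' or 'right'")


def interchange_intervention_accuracy(expected, observed):
    """Fraction of interventions where the network's output (observed)
    matches the high-level model's counterfactual label (expected)."""
    if not expected or len(expected) != len(observed):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(e == o for e, o in zip(expected, observed))
    return matches / len(expected)


# Hypothetical (amount, lower bound, upper bound) inputs.
in_range = (5.00, 1.30, 8.55)
too_low = (0.50, 1.30, 8.55)
too_high = (9.00, 1.30, 8.55)

expected = [
    interchange(in_range, too_low, "left"),    # left check now fails
    interchange(in_range, too_low, "right"),   # right check unchanged
    interchange(too_high, in_range, "right"),  # right check now passes
    interchange(too_high, in_range, "left"),   # right check still fails
]

# Made-up outputs of an intervened network, for illustration only.
observed = ["No", "Yes", "No", "No"]
score = interchange_intervention_accuracy(expected, observed)
```

Under these assumptions, an IIA close to the task accuracy means that swapping the aligned subspace changes the network's answer in the same way that swapping the boolean variable changes the high-level model's answer.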


This review was generated by AI and checked by a human editor.