QUICK REVIEW

[Paper Review] Learning Transformer Programs

Dan Friedman, Alexander Wettig|arXiv (Cornell University)|Jun 1, 2023

Explainable Artificial Intelligence (XAI)|Citations: 13

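One-line summary

This paper trains high-performing Transformers that are constrained to be mechanistically interpretable, then converts them into human-readable programs (Python/RASP-like). It reports results on in-context learning, algorithmic tasks, and NLP with little loss in performance, and the extracted code exposes interpretable circuits that can be debugged at the code level.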

ABSTRACT

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short of providing complete, faithful descriptions of the underlying algorithms. In this work, we introduce a procedure for training Transformers that are mechanistically interpretable by design. We build on RASP [Weiss et al., 2021], a programming language that can be compiled into Transformer weights. Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization and then automatically converted into a discrete, human-readable program. We refer to these models as Transformer Programs. To validate our approach, we learn Transformer Programs for a variety of problems, including an in-context learning task, a suite of algorithmic problems (e.g. sorting, recognizing Dyck languages), and NLP tasks including named entity recognition and text classification. The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size; and, more importantly, they are easy to interpret. To demonstrate these advantages, we convert Transformers into Python programs and use off-the-shelf code analysis tools to debug model errors and identify the "circuits" used to solve different sub-problems. We hope that Transformer Programs open a new path toward the goal of intrinsically interpretable machine learning.

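Motivation and objectives

Motivate the need for intrinsically interpretable Transformer models that can be audited and debugged in high-stakes tasks.
Propose a framework for training Transformers under constraints that guarantee a deterministic mapping to a human-readable program.
Show that Transformer Programs can solve a range of algorithmic and NLP tasks with competitive performance.
Automatically extract executable Python/RASP-like programs from trained models to enable circuit-level debugging.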

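Proposed method

Introduce a disentangled residual stream constraint: each module reads a fixed set of variables and writes to its own dedicated orthogonal subspace.
Define and train discrete, interpretable modules with hard attention (categorical attention heads), relaxed with Gumbel-Softmax during optimization.
Map each attention head to a RASP-like predicate/aggregate primitive; learn a distribution over the discrete weights (πK, πQ, πV, Wpredicate) and sample from it with the Gumbel reparameterization.
After training, take the argmax of the discrete weights to deterministically extract a Python program, in which each head becomes a predicate function applied through the select_closest primitive.
Extend the framework with word embeddings, numerical attention, and feed-forward/lookup-style layers to broaden the repertoire of programs.
Cover the extended training procedure and the mapping to interpretable programs, together with example code and a debugging workflow.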
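
The relax-then-discretize idea in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the variable names, the logits, and the temperature are hypothetical, and the code only shows a Gumbel-Softmax sample over one categorical choice (which variable a head reads as its key) followed by the argmax that would be kept after training.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed (soft) one-hot sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def discretize(logits):
    """After training: keep only the most likely choice as a one-hot vector."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]


# Hypothetical residual-stream variables and learned logits (assumptions).
variables = ["tokens", "positions", "head_0_output"]
pi_k_logits = [0.2, 2.5, -1.0]

rng = random.Random(0)
soft = gumbel_softmax(pi_k_logits, tau=0.5, rng=rng)  # used during training
hard = discretize(pi_k_logits)                         # used for extraction
chosen = variables[hard.index(1.0)]
```

Lower temperatures push the soft sample toward a one-hot vector, which is why the relaxed model and the extracted discrete program can be expected to behave similarly once training has converged.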

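Experimental results

Research questions

RQ1: Can Transformer models be trained under constraints that guarantee a deterministic mapping to an interpretable program?
RQ2: How well can such Transformer Programs solve in-context learning, RASP-like algorithmic tasks, and NLP benchmarks while remaining interpretable?
RQ3: What is the qualitative structure of the learned programs and circuits once they are converted to readable Python/RASP-like code?
RQ4: Compared with standard Transformers, how do Transformer Programs achieve or trade off accuracy and interpretability on tasks of varying difficulty?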

Key findings

Dataset	Description	Example	k	L	H	M	Accuracy (%)
Reverse	Reverse a string.	reverse("abbc") = "cbba"	8	3	8	2	99.79
Histogram	For each token, the number of occurrences of that letter in the sequence.	hist("abbc") = "1221"	8	1	4	2	100.0
Double hist.	For each token, the number of unique tokens with the same histogram value.	hist2("abbc") = "2112"	8	3	4	2	98.40
Sort	Sort the input in lexicographical order.	sort("cbab") = "abbc"	8	3	8	4	99.83
Most-Freq	The unique input tokens in order of frequency, using position to break ties.	most_freq("abbc") = "bac"	8	3	8	4	75.69
Dyck-1	For each position i, is the input up until i a valid string in Dyck-1 (T); a valid prefix (P); or invalid (F).	dyck1("()())") = "PTPTF"	16	3	8	2	99.30
Dyck-2	The same as above, but in Dyck-2.	dyck2("({})(}") = "PPPTPF"	16	3	4	4	99.09
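
To make the task definitions in the table concrete, here is a plain-Python reference sketch of each one. These are ordinary implementations written for illustration, not the learned Transformer Programs. One assumption is labeled in the code: for Most-Freq, ties are broken by first occurrence, which matches the table's example.

```python
from collections import Counter


def reverse(s):
    return s[::-1]


def hist(s):
    counts = Counter(s)
    return "".join(str(counts[ch]) for ch in s)


def hist2(s):
    counts = Counter(s)
    # How many distinct tokens share each histogram value.
    tokens_per_count = Counter(counts.values())
    return "".join(str(tokens_per_count[counts[ch]]) for ch in s)


def sort(s):
    return "".join(sorted(s))


def most_freq(s):
    counts = Counter(s)
    first_pos = {}
    for i, ch in enumerate(s):
        first_pos.setdefault(ch, i)
    # Assumption: ties are broken by the position of the first occurrence.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], first_pos[ch]))
    return "".join(ordered)


def dyck(s, pairs):
    """Label each prefix: T (balanced), P (valid prefix), or F (invalid)."""
    closers = {close: open_ for open_, close in pairs.items()}
    stack, labels, invalid = [], [], False
    for ch in s:
        if not invalid:
            if ch in pairs:
                stack.append(ch)
            elif stack and stack[-1] == closers.get(ch):
                stack.pop()
            else:
                invalid = True  # once invalid, every longer prefix is invalid
        labels.append("F" if invalid else ("P" if stack else "T"))
    return "".join(labels)


def dyck1(s):
    return dyck(s, {"(": ")"})


def dyck2(s):
    return dyck(s, {"(": ")", "{": "}"})
```

Each function reproduces the example given in its table row, for instance `hist2("abbc")` returns `"2112"` and `dyck2("({})(}")` returns `"PPPTPF"`.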

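Transformer Programs reach reasonable performance on multiple tasks compared with standard Transformers of similar size.
On the RASP-like tasks, five of the seven tasks in the table above reach 99% accuracy or higher (Double hist. at 98.40 and Most-Freq at 75.69 are the exceptions), and accuracy drops on longer inputs.
On the toy in-context learning task, the model composes attention heads to reproduce induction-head behavior and reaches perfect test accuracy.
On CoNLL-2003 NER, Transformer Programs achieve F1 close to that of standard Transformers and outperform a unigram baseline.
The extracted Python/RASP-like programs expose interpretable circuits and feature weights, which supports debugging and circuit analysis.
As a trade-off, standard Transformers tend to outperform Transformer Programs on long sequences and large vocabularies, which points to scaling challenges.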


This review was generated by AI and checked by a human editor.