QUICK REVIEW

[Paper Review] Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan | arXiv (Cornell University) | Jan 12, 2023

Neural Networks and Applications | Cited by 54

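One-line summary

This paper reverse-engineers a small transformer trained on modular addition, uncovering a Fourier-based algorithm, and defines progress measures (restricted loss and excluded loss) to show that grokking follows a gradual circuit-formation phase and occurs when weight decay removes the memorizing components.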

ABSTRACT

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

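Motivation and objectives

Motivate the search for smooth, causally linked progress measures that underlie emergent behavior in neural networks.
Demonstrate that grokking can be explained by mechanistically reverse-engineering the learned circuits.
Reverse-engineer a modular addition transformer to uncover its Fourier multiplication algorithm.
Define and validate continuous progress measures that foreshadow the grokking phase transition.
Characterize the training dynamics from memorization to generalization as three phases: memorization, circuit formation, and cleanup.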

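Proposed method

Train a small transformer on addition mod P with P = 113.
Reverse-engineer the weights and activations to identify a Fourier-based addition algorithm.
Show that the embeddings, attention, and MLP activations exhibit periodic structure at a set of key frequencies.
Show that the MLP and unembedding layers implement trigonometric identities to compute a + b mod P.
Define restricted loss and excluded loss as progress measures and validate them empirically.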
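
The Fourier-based addition described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction of the idea, not the trained network: the function name `fourier_mod_add` and the specific frequency list are assumptions (the review only states that there are five key frequencies), and the exact trigonometric arithmetic stands in for what the embeddings, MLP, and unembedding approximate.

```python
import math

P = 113
# Illustrative choice of five frequencies; the review does not list them.
KEY_FREQS = [14, 35, 41, 42, 52]


def fourier_mod_add(a, b, p=P, freqs=KEY_FREQS):
    """Compute (a + b) mod p using only sines and cosines of key frequencies."""
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Step 1: represent each input as a point on a circle.
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Step 2: angle-addition identities give cos/sin of w * (a + b).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        # Step 3: cos(w * (a + b - c)) via the angle-difference identity;
        # it equals 1 exactly when c == (a + b) mod p.
        for c in range(p):
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The candidate c where every frequency interferes constructively wins.
    return max(range(p), key=logits.__getitem__)


print(fourier_mod_add(100, 50))  # 37, i.e. (100 + 50) % 113
```

Because P is prime, each individual frequency already peaks only at the correct answer; summing several frequencies widens the margin between the correct logit and the rest.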

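Experimental results

Research questions

RQ1 What is the mechanistic structure behind grokking in a small transformer trained on modular addition?
RQ2 Can continuous, interpretable progress measures that precede the grokking transition be identified?
RQ3 How does the learned algorithm work in terms of frequency components and trigonometric identities?
RQ4 Which training dynamics (phases) characterize the path from memorization to generalization?
RQ5 What role does weight decay play in driving grokking and the phase transition?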

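Key findings

The model embeds its inputs onto circles and performs addition by applying trigonometric identities to Fourier components.
The logit map W_L is well approximated by a sum over five key frequencies, which enables the Fourier multiplication readout.
Most neurons are well approximated by a degree-2 polynomial in the sine and cosine of a single frequency, and their readout to the logits is localized to that frequency.
Ablations show that the key frequencies are necessary, while removing the non-key frequencies can improve performance.
Grokking consists of three phases (memorization, circuit formation, and cleanup), with weight decay driving the switch to the generalizing circuit.
The proposed progress measures (restricted loss and excluded loss) change continuously before grokking and offer a view into the training dynamics.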
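
As a rough illustration of the two progress measures, the sketch below splits a table of logits into the part explained by the key frequencies and the remainder, then scores each part with cross-entropy. The review does not give the exact definitions, so this assumes that restricted loss keeps only the cos/sin(w_k (a + b)) components of the logits and excluded loss removes them. The helper names, the toy modulus p = 7, the frequencies [1, 2], and scoring over all input pairs rather than a train/test split are all assumptions made for a small runnable example.

```python
import math


def key_directions(p, freqs):
    """cos and sin of w_k * (a + b), tabulated over all input pairs (a, b)."""
    dirs = []
    for k in freqs:
        w = 2 * math.pi * k / p
        for fn in (math.cos, math.sin):
            dirs.append([[fn(w * (a + b)) for b in range(p)] for a in range(p)])
    return dirs


def split_logits(logits, p, freqs):
    """Split logits[a][b][c] into a key-frequency part and the remainder.

    Projecting one direction at a time is valid because the directions are
    mutually orthogonal for distinct frequencies in 1..(p - 1) // 2.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    restricted = [[[0.0] * p for _ in range(p)] for _ in range(p)]
    for d in key_directions(p, freqs):
        norm_sq = sum(d[a][b] ** 2 for a, b in pairs)
        for c in range(p):
            coeff = sum(logits[a][b][c] * d[a][b] for a, b in pairs) / norm_sq
            for a, b in pairs:
                restricted[a][b][c] += coeff * d[a][b]
    excluded = [[[logits[a][b][c] - restricted[a][b][c] for c in range(p)]
                 for b in range(p)] for a in range(p)]
    return restricted, excluded


def cross_entropy(logits, p):
    """Mean cross-entropy of logits[a][b] against the label (a + b) mod p."""
    total = 0.0
    for a in range(p):
        for b in range(p):
            row = logits[a][b]
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[(a + b) % p]
    return total / (p * p)


# Toy check: logits produced purely by the Fourier algorithm.
p, freqs = 7, [1, 2]
ideal = [[[5.0 * sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)
           for c in range(p)] for b in range(p)] for a in range(p)]
restricted, excluded = split_logits(ideal, p, freqs)
full_loss = cross_entropy(ideal, p)
restricted_loss = cross_entropy(restricted, p)
excluded_loss = cross_entropy(excluded, p)
print(full_loss, restricted_loss, excluded_loss)
```

For these purely Fourier logits the restricted loss matches the full loss, and the excluded loss falls back to the uniform-guess value log(p), since nothing is left once the key frequencies are removed. A memorizing network would show the opposite pattern, which is what makes the pair useful for tracking which mechanism is doing the work.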


This review was generated by AI and checked by a human editor.