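[Paper Review] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
This paper explains in-context learning (ICL) as implicit finetuning by revealing a dual form between Transformer attention and gradient descent, and shows that momentum-based attention improves both ICL and language modeling performance.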
Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at https://aka.ms/icl.
Research Motivation and Objectives
- Motivation: understand how large GPT models perform in-context learning without parameter updates.
- Propose a theoretical view that Transformer attention implements a dual form of gradient descent.
- Empirically compare ICL and explicit finetuning on real NLP tasks to validate the implicit fine-tuning view.
- Introduce a momentum-based attention mechanism inspired by gradient descent with momentum to improve performance.
Proposed Method
- Derive a dual form between Transformer attention and gradient descent showing attention can act like a gradient-based update.
- Frame ICL as a meta-optimization where a pretrained GPT serves as a meta-optimizer that generates meta-gradients from demonstrations and applies them via attention.
- Compare ICL with finetuning on six classification tasks to show similarities in predictions, attention outputs, and attention weights over demonstration tokens.
- Design and evaluate momentum-based attention (MoAttn) by applying an exponential moving average (EMA) to the attention values, mimicking gradient descent with momentum.
- Conduct experiments on language modeling to test whether momentum-based attention reduces perplexity and improves downstream ICL tasks.
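The dual-form idea in the list above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a simplified linear attention (softmax and scaling dropped), and the names `W_V`, `W_K`, `X_demo`, `X_query`, and `split_icl_update` are illustrative choices rather than identifiers from the paper's code. Under that assumption, attending over demonstration and query tokens together equals applying a zero-shot weight matrix plus an additive update built only from the demonstrations.

```python
import numpy as np

def linear_attention(V, K, q):
    """Unnormalized linear attention: V @ K.T @ q (softmax and scaling dropped)."""
    return V @ (K.T @ q)

def split_icl_update(W_V, W_K, X_demo, X_query, q):
    """Return (full attention output, zero-shot term, demonstration update term)."""
    # Columns are token representations; demonstrations come first.
    X_all = np.concatenate([X_demo, X_query], axis=1)
    full = linear_attention(W_V @ X_all, W_K @ X_all, q)
    # Weights that would be applied to q if no demonstrations were present.
    W_zsl = (W_V @ X_query) @ (W_K @ X_query).T
    # Sum of outer products over demonstration tokens, playing the role of a weight update.
    delta_W = (W_V @ X_demo) @ (W_K @ X_demo).T
    return full, W_zsl @ q, delta_W @ q

rng = np.random.default_rng(0)
d, n_demo, n_query = 8, 5, 3  # hypothetical sizes chosen for illustration
W_V, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
X_demo, X_query = rng.normal(size=(d, n_demo)), rng.normal(size=(d, n_query))
q = rng.normal(size=d)

full, zero_shot, icl_update = split_icl_update(W_V, W_K, X_demo, X_query, q)
print(np.allclose(full, zero_shot + icl_update))
```

The demonstration term has the same outer-product structure as a gradient-descent update to a linear layer, which is the sense in which the demonstrations can be read as producing meta-gradients during the forward pass.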

Experimental Results
Research Questions
- RQ1: Can Transformer attention be interpreted as performing a gradient-descent-like update (a dual form) that underpins ICL?
- RQ2: Is ICL behavior empirically similar to explicit finetuning across predictions and internal representations?
- RQ3: Does incorporating momentum into attention further improve ICL and language modeling, supporting the meta-optimization view?
Key Findings
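The gradient-descent side of RQ1 can be written out in a few lines. This is a sketch with notation chosen here for illustration: $W_0$ is the initial weight of a linear layer, $x_i$ are training inputs, and $e_i$ are the corresponding error signals (with the learning rate and sign folded in). Because the accumulated update is a sum of outer products, applying it to a new input $x$ has the form of an unnormalized linear attention with values $e_i$ and keys $x_i$.

```latex
\begin{aligned}
F(x) &= (W_0 + \Delta W)\,x, \qquad \Delta W = \sum_i e_i x_i^{\top} \\
\Delta W\,x &= \sum_i e_i \left( x_i^{\top} x \right) = \mathrm{LinearAttn}(E, X, x),
\qquad E = [e_1, e_2, \dots], \; X = [x_1, x_2, \dots]
\end{aligned}
```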
- ICL and explicit finetuning share a dual view of gradient descent, with ICL relying on meta-gradients produced by forward computation.
- Empirical evidence from six classification tasks shows ICL behavior is similar to finetuning in predictions and attention dynamics.
- ICL tends to produce attention updates and attention weights that resemble those produced by finetuning, indicating similar representational changes.
- Momentum-based attention (MoAttn) consistently lowers language modeling perplexity and raises ICL accuracy compared with vanilla attention.
- Momentum-based attention demonstrates the practical utility of the meta-optimization view for future model design.
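One way to picture momentum-based attention is as vanilla attention plus an exponentially decayed sum of earlier value vectors. The sketch below is an illustration under stated assumptions, not the paper's code: the function name `momentum_attention`, the decay coefficient `eta`, its default value, and the choice to exclude the current position from the decayed sum are all assumptions made here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def momentum_attention(V, K, q, eta=0.5):
    """Softmax attention for the last position plus a decayed sum of earlier values.

    V, K: arrays of shape (t, d) holding values and keys for positions 1..t.
    q:    query of shape (d,) for position t.
    eta:  hypothetical momentum (decay) coefficient in [0, 1).
    """
    t, d = K.shape
    # Vanilla scaled dot-product attention output, shape (d,).
    attn = softmax(K @ q / np.sqrt(d)) @ V
    # Weight eta**(t - i) for earlier positions i = 1..t-1 (current position excluded).
    decay = eta ** np.arange(t - 1, 0, -1)
    ema = decay @ V[:-1]
    return attn + ema

rng = np.random.default_rng(0)
V, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
q = rng.normal(size=8)
out = momentum_attention(V, K, q, eta=0.5)
print(out.shape)  # (8,)
```

Setting `eta=0.0` recovers vanilla attention, which makes it straightforward to compare the two variants on the same inputs.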

This review was generated by AI and checked by a human editor.