QUICK REVIEW

[Paper Review] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Damai Dai, Yutao Sun|arXiv (Cornell University)|Dec 20, 2022

Topic Modeling|Cited by 25

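One-Sentence Summary

This paper explains in-context learning (ICL) as implicit finetuning, shows that Transformer attention has a dual form of gradient descent, and demonstrates that a momentum-based attention improves both ICL and language modeling.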

ABSTRACT

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at https://aka.ms/icl.

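Motivation and Goals

Motivation: understand how large GPT models perform in-context learning without any parameter updates.
Propose a theoretical view: Transformer attention implements a dual form of gradient descent.
Empirically compare ICL with explicit finetuning on real NLP tasks to validate the implicit-finetuning view.
Introduce a momentum-based attention mechanism, inspired by gradient descent with momentum, to improve performance.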
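
The dual form mentioned above can be written out for a single linear layer. The following is a sketch under stated assumptions, not a reproduction of the paper's exact notation: $W_0$ denotes the initial weights, $x'_i$ the training inputs, $e_i$ the corresponding error signals (taken here to be the negative output gradients scaled by the learning rate), and $\mathrm{LinearAttn}$ an unnormalized attention in which the $e_i$ act as values and the $x'_i$ as keys.

```latex
\begin{aligned}
\mathcal{F}(x) &= (W_0 + \Delta W)\,x, \qquad \Delta W = \sum_i e_i\,{x'_i}^{\top} \\
               &= W_0\,x + \sum_i e_i \left({x'_i}^{\top} x\right) \\
               &= W_0\,x + \mathrm{LinearAttn}(E, X', x)
\end{aligned}
```

Read left to right, a weight update accumulated by gradient descent has the same effect on a test input $x$ as attending over the stored training inputs with $x$ as the query.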

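Proposed Method

Derive the dual form between Transformer attention and gradient descent, showing that attention can act like a gradient-based update.
Frame ICL as meta-optimization, in which a pretrained GPT serves as the meta-optimizer, producing meta-gradients from the demonstrations and applying them through attention.
Compare ICL with finetuning on six classification tasks to show similarities in predictions, attention outputs, and attention over tokens.
Design and evaluate momentum-based attention (MoAttn) by applying an exponential moving average (EMA) to the attention values, mimicking a gradient update with momentum.
Run language modeling experiments to test whether momentum-based attention lowers perplexity and improves downstream ICL tasks.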
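
The meta-optimization framing above can be checked numerically in a simplified setting. The sketch below assumes the softmax is relaxed to unnormalized linear attention; every name in it (`W_K`, `W_V`, `X_demo`, `X_query`, `W_zsl`, `delta_W_icl`) is hypothetical and chosen for illustration. It shows that attending over demonstration tokens plus query-text tokens equals applying a zero-shot weight matrix plus a demonstration-derived update to the query vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_demo, n_query = 8, 5, 3

# Hypothetical projection matrices and token representations (one token per column).
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))
X_demo = rng.normal(size=(d, n_demo))    # demonstration tokens
X_query = rng.normal(size=(d, n_query))  # query-text tokens preceding the current position
q = rng.normal(size=d)                   # attention query vector


def linear_attention(V, K, q):
    """Unnormalized linear attention: sum_i v_i * (k_i . q)."""
    return V @ (K.T @ q)


# Attention view: attend over demonstrations and query text together.
X_all = np.concatenate([X_demo, X_query], axis=1)
icl_output = linear_attention(W_V @ X_all, W_K @ X_all, q)

# Dual view: zero-shot weights plus an update built only from the demonstrations.
W_zsl = (W_V @ X_query) @ (W_K @ X_query).T
delta_W_icl = (W_V @ X_demo) @ (W_K @ X_demo).T
dual_output = (W_zsl + delta_W_icl) @ q

# The two views agree up to floating-point error.
assert np.allclose(icl_output, dual_output)
```

The equality is exact only for linear attention; with softmax normalization the correspondence is an approximation, which is why this is a sketch of the argument rather than a statement about real GPT layers.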

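Experimental Results

Research Questions

RQ1: Can Transformer attention be interpreted as performing a gradient-descent-like update (the dual form) that underpins ICL?
RQ2: Does ICL empirically resemble explicit finetuning in its predictions and internal representations?
RQ3: Does introducing momentum into attention further improve ICL and language modeling, supporting the meta-optimization view?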

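Key Findings

ICL and explicit finetuning share a dual view of gradient descent; ICL relies on meta-gradients produced by forward computation.
Empirical evidence from six classification tasks shows that ICL resembles finetuning in its predictions and attention dynamics.
ICL tends to produce attention-output updates and attention weights similar to those produced by finetuning, suggesting similar representational changes.
Momentum-based attention (MoAttn) yields consistent improvements over vanilla attention in language modeling perplexity and ICL accuracy.
Momentum-based attention demonstrates the practical value of the meta-optimization view for future model design.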
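
To make the momentum-based attention concrete, here is a minimal sketch for a single query position. It assumes the EMA is taken over the value vectors of the preceding positions and added to the vanilla attention output; the decay factor `eta`, the scaling by the square root of the key dimension, and the function names are assumptions for illustration and may differ from the released code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def vanilla_attention(V, K, q):
    """Scaled softmax attention for one query; rows of K and V are preceding positions."""
    d = K.shape[1]
    return softmax(K @ q / np.sqrt(d)) @ V


def momentum_attention(V, K, q, eta=0.9):
    """Vanilla attention plus an exponentially decayed sum of the value vectors.

    The most recent value gets weight eta, the one before it eta**2, and so on,
    by analogy with the momentum term in gradient descent with momentum.
    """
    n = V.shape[0]
    decay = eta ** np.arange(n, 0, -1)  # [eta**n, ..., eta**2, eta**1]
    return vanilla_attention(V, K, q) + decay @ V


rng = np.random.default_rng(0)
K = rng.normal(size=(4, 6))  # 4 preceding positions, dimension 6
V = rng.normal(size=(4, 6))
q = rng.normal(size=6)

out = momentum_attention(V, K, q, eta=0.9)
print(out.shape)
```

Setting `eta=0.0` removes the momentum term and recovers vanilla attention, which gives a simple baseline for comparison.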


This review was generated by AI and reviewed by human editors.