QUICK REVIEW

[Paper Review] What learning algorithm is in-context learning? Investigations with linear models

Ekin Akyürek, Dale Schuurmans|arXiv (Cornell University)|Nov 28, 2022

Neural Networks and Applications | Cited by 85

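ONE-SENTENCE SUMMARY

The paper proves that transformers can implement standard linear learning algorithms in context (gradient descent and closed-form ridge regression), and shows that trained in-context learners behave like these algorithms under a range of conditions, linking in-context learning (ICL) to Bayesian and minimum-norm predictors.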

ABSTRACT

Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at https://github.com/ekinakyurek/google-research/blob/master/incontext.

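MOTIVATION AND OBJECTIVES

Understand whether in-context learning (ICL) in transformers corresponds to an implicit learning algorithm.
Determine which standard linear learning algorithms a transformer can implement in context.
Assess how depth, width, and training-data noise affect ICL behavior and its alignment with classical predictors.
Explore whether intermediate quantities (such as weight vectors and moment matrices) are encoded in the in-context representations.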

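PROPOSED METHODS

Prove by construction that a transformer can implement a single step of gradient descent for a linear model using O(d) hidden units and constant depth.
Prove by construction that a transformer can implement one Sherman–Morrison update for ridge regression using O(d^2) hidden units and constant depth.
Empirically compare ICL predictions with those of gradient descent, ridge regression, and ordinary least squares (OLS) across depths, hidden sizes, and noise conditions.
Quantify the agreement between ICL and standard predictors with two behavioral metrics: squared prediction difference (SPD) and implicit linear weight difference (ILWD).
Probe intermediate representations to determine whether quantities such as X^T Y and w_OLS are encoded in the hidden states.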
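The two algorithms named in the constructive proofs can be sketched in plain NumPy. This is only an illustration of the target computations, not the transformer construction from the paper. The function names, the full-batch form of the gradient step, and the particular ridge objective (sum of squared errors plus lam * ||w||^2) are assumptions made here for concreteness.

```python
import numpy as np

def gd_step(w, X, y, lr=0.01, lam=0.0):
    """One gradient-descent step on sum_i (w.x_i - y_i)^2 + lam * ||w||^2.

    Assumed objective; with a single-row X this is a one-example update.
    """
    grad = 2.0 * X.T @ (X @ w - y) + 2.0 * lam * w
    return w - lr * grad

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, via the Sherman-Morrison formula."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

def ridge_via_sherman_morrison(X, y, lam=1.0):
    """Closed-form ridge weights, built with one rank-one update per example."""
    d = X.shape[1]
    A_inv = np.eye(d) / lam            # inverse of lam * I
    for x in X:                        # fold in x x^T for each example
        A_inv = sherman_morrison_update(A_inv, x, x)
    return A_inv @ X.T @ y             # (X^T X + lam I)^{-1} X^T y
```

Each Sherman–Morrison update costs O(d^2) arithmetic and storage for the d-by-d inverse, which is at least consistent with the O(d^2) hidden-unit figure above, while a gradient step only needs to carry a length-d weight vector.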

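EXPERIMENTAL RESULTS

Research Questions

RQ1: Can transformers implement standard linear learning algorithms (such as gradient descent and ridge regression) in the in-context learning setting?
RQ2: Across depths, widths, and data-noise levels, do the predictions of trained in-context learners agree with classical predictors (OLS, ridge regression, GD) and with Bayesian estimators?
RQ3: Which intermediate quantities are encoded during ICL, and where in the network do they appear?
RQ4: How does model capacity (depth and hidden size) affect the algorithmic behavior of in-context learners, in particular the phase transitions among GD, ridge regression, and OLS?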

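Key Findings

A transformer can compute one step of gradient descent for linear regression with O(d) hidden units and constant depth.
A transformer can perform one Sherman–Morrison update to implement ridge regression with O(d^2) hidden units and constant depth.
ICL predictions largely agree with those of gradient descent, ridge regression, and exact least-squares regression, and shift among these predictors as depth and noise vary.
As width and depth grow, ICL converges toward the Bayesian estimator for linear models.
Intermediate quantities such as X^T Y and w_OLS can be decoded from hidden representations, suggesting that the network computes meaningful algorithmic quantities.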
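As a rough illustration of how agreement between two predictors might be measured, the sketch below computes a squared prediction difference and an implicit linear weight difference for arbitrary prediction functions. The exact definitions in the paper (sampling distribution, expectations, normalization) may differ. The names `ridge_predictor`, `spd`, `implicit_weights`, and `ilwd` are hypothetical, and a ridge predictor stands in for the trained transformer, which is not reproduced here.

```python
import numpy as np

def ridge_predictor(X, y, lam):
    """Fit ridge weights on (X, y) and return a prediction function."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return lambda X_query: X_query @ w

def spd(pred_a, pred_b, X_query):
    """Mean squared difference between two predictors on the query points."""
    return float(np.mean((pred_a(X_query) - pred_b(X_query)) ** 2))

def implicit_weights(pred, X_query):
    """Least-squares linear fit to a predictor's outputs on the query points."""
    w, *_ = np.linalg.lstsq(X_query, pred(X_query), rcond=None)
    return w

def ilwd(pred_a, pred_b, X_query):
    """Squared distance between the implicit linear weights of two predictors."""
    diff = implicit_weights(pred_a, X_query) - implicit_weights(pred_b, X_query)
    return float(diff @ diff)
```

One way to connect this to the Bayesian finding: under the assumption of a Gaussian prior w ~ N(0, tau^2 I) and Gaussian label noise with variance sigma^2, the posterior-mean predictor is ridge regression with lam = sigma^2 / tau^2, so `ridge_predictor(X, y, sigma**2 / tau**2)` could serve as the Bayesian reference and `ridge_predictor(X, y, 0.0)` as the OLS reference when X^T X is invertible.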


This review was generated by AI and reviewed by human editors.