QUICK REVIEW

[Paper Review] Transformers as Algorithms: Generalization and Stability in In-context Learning

Yingcong Li, M. Emrullah Ildiz|arXiv (Cornell University)|Jan 17, 2023

Machine Learning and Algorithms|Cited by 11

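One-Sentence Summary

This paper formalizes in-context learning (ICL) as an algorithm learning problem in which a transformer implicitly constructs a hypothesis function at inference time, and it provides generalization and stability bounds for multitask and transfer learning under i.i.d. and dynamical-system prompts.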

ABSTRACT

In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer. We characterize when transformer/attention architecture provably obeys the stability condition and also provide empirical verification. For generalization on unseen tasks, we identify an inductive bias phenomenon in which the transfer learning risk is governed by the task complexity and the number of MTL tasks in a highly predictable manner. Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions.

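Motivation and Objectives

Motivate and formalize in-context learning (ICL) as an algorithm learning problem in which the transformer constructs a hypothesis function at inference time.
Derive generalization bounds for ICL in multitask learning (MTL) under both i.i.d. and dynamical-system prompt settings.
Characterize the stability properties of the transformer architecture that underpin these generalization guarantees.
Study transfer learning (unseen tasks) and the inductive bias that governs generalization across tasks.
Provide numerical evaluations that verify near-optimal algorithm implementation and the stability insights.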

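Proposed Method

Model ICL as implicit optimization over the in-context sequence, yielding the prediction function $f^{Alg}_{S^{(m)}}$.
Prove generalization bounds via algorithmic stability, giving an MTL rate of $1/\sqrt{nT}$ for both i.i.d. and dynamical data.
Establish stability conditions for transformer self-attention, and relate stability to excess risk through Lipschitz and perturbation analysis.
Extend the framework to dynamical-system prompts with exponential $(C_\rho, \rho)$-stability, adapting the stability-based argument accordingly.
Use covering numbers and Dudley/empirical-process ideas to turn stability into finite-sample excess risk bounds.
Provide empirical verification showing that ICL can implement near-optimal algorithms on classical regression tasks, and validate the transfer/inductive-bias insights.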
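The "ICL as an algorithm" view above can be illustrated with a small sketch. It is not the paper's transformer: a scalar ridge regressor stands in for the algorithm that maps a prompt of (input, label) pairs to a hypothesis function, and the names `ridge_hypothesis`, `autoregressive_predictions`, and the regularizer `lam` are hypothetical choices made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]  # one (input, label) pair in the prompt


def ridge_hypothesis(prompt: Sequence[Example], lam: float = 1e-3) -> Callable[[float], float]:
    """Map a prompt to a hypothesis f(x) = w * x using scalar ridge regression.

    Assumption: ridge regression is a stand-in for whatever algorithm the
    transformer implements; it is not taken from the paper's code.
    """
    sxx = sum(x * x for x, _ in prompt)
    sxy = sum(x * y for x, y in prompt)
    denom = sxx + lam
    # With an empty prompt and lam == 0 there is nothing to fit: predict 0.
    w = sxy / denom if denom > 0 else 0.0
    return lambda x: w * x


def autoregressive_predictions(xs: Sequence[float], ys: Sequence[float],
                               lam: float = 1e-3) -> List[float]:
    """Predict each label from the examples that precede it in the prompt."""
    preds = []
    for i in range(len(xs)):
        f = ridge_hypothesis(list(zip(xs[:i], ys[:i])), lam)
        preds.append(f(xs[i]))
    return preds
```

For noise-free data with labels `y = 2 * x`, calling `autoregressive_predictions([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], lam=0.0)` returns `[0.0, 4.0, 6.0]`: the first prediction has no context, and every later one recovers the task from the prompt alone.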

Figure 1: Examples of ICL. We focus on the lower two settings where a transformer admits a supervised dataset or a dynamical system trajectory as a prompt. Then, it auto-regressively predicts the output following an input example $\bm{x}_{i}$ based on the prompt $(\bm{x}_{1},\dots,\bm{x}_{i})$.

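Experimental Results

Research Questions

RQ1: Under what conditions does in-context learning generalize across tasks in the multitask learning setting?
RQ2: How does transformer stability affect ICL generalization bounds with i.i.d. prompts and dynamical-system prompts?
RQ3: How does ICL behave in transfer learning on unseen tasks, and how do task complexity and the number of source tasks govern it?
RQ4: Can ICL be interpreted as implementing near-optimal algorithms (such as ridge regression) on regression problems, and how does prompt length affect stability?
RQ5: How does the alignment between source-task structure and target-task distance affect transfer risk in the linear and dynamical settings?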
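The stability question in RQ4 can be probed numerically with a toy experiment: change one example in the prompt and measure how far the prediction moves. The sketch below is an assumption-laden illustration, not the paper's experiment. It uses scalar ridge regression in place of a transformer, a fixed deterministic prompt generator, and hypothetical helper names (`ridge_predict`, `prediction_shift`, `make_prompt`, `shifts_by_length`).

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # one (input, label) pair in the prompt


def ridge_predict(prompt: Sequence[Example], x_query: float, lam: float = 1.0) -> float:
    """Scalar ridge regression fitted on the prompt, evaluated at x_query."""
    sxx = sum(x * x for x, _ in prompt)
    sxy = sum(x * y for x, y in prompt)
    return (sxy / (sxx + lam)) * x_query


def prediction_shift(prompt: Sequence[Example], index: int, new_example: Example,
                     x_query: float, lam: float = 1.0) -> float:
    """Absolute change in the prediction when one prompt example is replaced."""
    perturbed = list(prompt)
    perturbed[index] = new_example
    return abs(ridge_predict(prompt, x_query, lam) - ridge_predict(perturbed, x_query, lam))


def make_prompt(m: int, w: float = 2.0) -> List[Example]:
    """Deterministic toy prompt: inputs cycle through 1.0, 1.5, 2.0; labels are w * x."""
    xs = [1.0 + 0.5 * (i % 3) for i in range(m)]
    return [(x, w * x) for x in xs]


def shifts_by_length(lengths: Sequence[int], lam: float = 1.0) -> List[float]:
    """Shift caused by corrupting the first label by +1.0, for each prompt length."""
    out = []
    for m in lengths:
        prompt = make_prompt(m)
        x0, y0 = prompt[0]
        out.append(prediction_shift(prompt, 0, (x0, y0 + 1.0), x_query=1.0, lam=lam))
    return out
```

In this toy setup the shift equals `1 / (sum of squared inputs + lam)`, so `shifts_by_length([5, 10, 20, 40])` decreases as the prompt grows. That mirrors the qualitative pattern of longer prompts giving more stable predictions, but only for this stand-in algorithm.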

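Key Findings

In the multitask setting, ICL generalization attains the $1/\sqrt{nT}$ rate under both i.i.d. and dynamical-system prompts.
Self-attention stability can be bounded; under certain norm constraints, transformer-based ICL obeys the stability condition that yields the generalization guarantees.
Empirically, ICL predictions become more stable as the prompt grows longer, and training on noisy data also promotes stability.
Transfer learning exhibits an inductive bias: the transfer risk is governed by task complexity and the number of MTL tasks, and is nearly independent of model size.
For linear-regression-type tasks, the transfer risk and MTL risk curves align; in the experiments, the transfer risk scales roughly as $d^2/T$.
In dynamical systems, ICL can outperform the autoregressive least-squares (LS) estimator when memory is sufficient and the dynamics are stable.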
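To get a feel for the two scalings quoted in the findings, the sketch below simply evaluates them. It assumes that T counts tasks, that n is a per-task sample count, and that d is the problem dimension; this summary does not define the symbols, so those readings are an interpretation. Constants and logarithmic factors are omitted, so only ratios between values are meaningful. The function names are hypothetical.

```python
import math


def mtl_rate(n: int, T: int) -> float:
    """The 1/sqrt(n*T) multitask rate, up to constants not stated in this summary."""
    return 1.0 / math.sqrt(n * T)


def transfer_rate(d: int, T: int) -> float:
    """The d^2/T transfer-risk scaling, up to constants not stated in this summary."""
    return d * d / T


if __name__ == "__main__":
    # Quadrupling n halves the MTL rate; doubling T halves the transfer scaling.
    print(mtl_rate(40, 100) / mtl_rate(160, 100))
    print(transfer_rate(10, 1000) / transfer_rate(10, 2000))
```

Under these assumptions, the multitask rate improves with the product of n and T, while the transfer scaling depends only on T for a fixed d.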

Figure 3: The benefit of learning across the full task sequence: Right side: Standard ERM where each task trains with all $n=40$ prompts. Left side: ERM focuses on different parts of the trajectory by fitting $n/4=10$ prompts per task over $i\in[1,10]$ to $[31,40]$ (highlighted as the orange ranges)


This review was generated by AI and reviewed by human editors.