QUICK REVIEW

[Paper Review] Pretrained Transformers as Universal Computation Engines

Kevin Lu, Aditya Grover | arXiv (Cornell University) | Mar 9, 2021

Ferroelectric and Negative Capacitance Devices | References: 60 | Cited by: 99

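One-Sentence Summary

A GPT-2-style language transformer can have its self-attention and feedforward layers frozen, with only the input and output layers and layer normalization finetuned, and still reach competitive accuracy with faster convergence across modalities (numerical, vision, protein), suggesting that language pretraining confers general-purpose computation abilities.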

ABSTRACT

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

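Research Motivation and Objectives

Investigate whether a transformer pretrained on natural language can generalize to other modalities with minimal finetuning.
Assess how much the pretraining modality, as opposed to the architecture, contributes to cross-domain transfer.
Assess the importance of freezing the self-attention and feedforward layers and finetuning only the outer components.
Compare the transformer against an LSTM baseline on cross-modal tasks.
Analyze the compute-efficiency gains that language pretraining brings to downstream tasks.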

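Proposed Method

Use a pretrained GPT-2 transformer as a universal computation engine, the Frozen Pretrained Transformer (FPT), by freezing its self-attention and feedforward layers.
Finetune only the input embedding layer, the output layer, and layer normalization (and, optionally, the positional embeddings) to adapt to diverse downstream tasks.
Evaluate on seven classification tasks spanning numerical computation, image classification, and protein fold prediction.
Compare against fully trained transformers and LSTMs, and against other pretraining modalities (Bit Memory, ViT).
Analyze attention patterns, convergence speed, and ablations to identify what drives transfer.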
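
The freezing rule described above can be sketched as a filter over parameter names. This is a minimal illustration, not the authors' implementation: the names `ln_1`, `ln_2`, `ln_f`, `wpe`, `attn`, and `mlp` are assumed to follow the Hugging Face GPT-2 naming layout, while `input_proj` and `output_head` are hypothetical names for the newly added task-specific layers.

```python
# Sketch of FPT-style parameter selection, assuming GPT-2-like parameter names.
# "input_proj" and "output_head" are hypothetical names for the new input and
# output layers; they do not come from the paper.
TRAINABLE_MARKERS = ("ln_", "input_proj", "output_head")
POSITIONAL_MARKER = "wpe"  # positional embeddings, optionally finetuned


def is_trainable(name: str, finetune_positions: bool = True) -> bool:
    """Return True if the named parameter should receive gradient updates."""
    if any(marker in name for marker in TRAINABLE_MARKERS):
        return True
    if POSITIONAL_MARKER in name:
        return finetune_positions
    # Everything else (self-attention and feedforward weights) stays frozen.
    return False


def split_parameters(names, finetune_positions=True):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n, finetune_positions)]
    frozen = [n for n in names if not is_trainable(n, finetune_positions)]
    return trainable, frozen


# Illustrative names for a single residual block plus the outer layers.
EXAMPLE_NAMES = [
    "input_proj.weight",
    "wpe.weight",
    "h.0.ln_1.weight",
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.weight",
    "h.0.ln_2.weight",
    "h.0.mlp.c_fc.weight",
    "h.0.mlp.c_proj.weight",
    "ln_f.weight",
    "output_head.weight",
]

if __name__ == "__main__":
    trainable, frozen = split_parameters(EXAMPLE_NAMES)
    print("trainable:", trainable)
    print("frozen:   ", frozen)
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`, assuming the model's parameter names match the layout above.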

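Experimental Results

Research Questions

RQ1: Can a language-pretrained transformer transfer to different modalities without updating its core attention/FFN parameters?
RQ2: How much does the pretraining modality (language vs. random vs. image) matter for cross-modal transfer?
RQ3: Is the transformer architecture, compared with an LSTM baseline, critical to transfer performance?
RQ4: Does language pretraining improve compute efficiency over random initialization when transferring to other modalities?
RQ5: Which components (input layer, output layer, layer normalization, positional embeddings) matter most for finetuning?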

Key Findings

Model	Bit Memory	XOR	ListOps	MNIST	CIFAR-10	CIFAR-10 LRA	Homology
FPT	100%	100%	38.4%	98.0%	72.1%	38.6%	12.7%
Full	100%	100%	38%	99.1%	70.3%	42%	9%
LSTM	60.9%	50.1%	17.1%	99.5%	73.6%	11.7%	12%

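The frozen pretrained transformer achieves accuracy competitive with fully trained transformers and LSTMs on the seven downstream tasks.
FPT reaches 100% on Bit Memory and XOR. It substantially outperforms the LSTM on ListOps (38.4% vs. 17.1%) and CIFAR-10 LRA (38.6% vs. 11.7%), is roughly on par on Homology (12.7% vs. 12%), and trails the LSTM slightly on MNIST (98.0% vs. 99.5%) and CIFAR-10 (72.1% vs. 73.6%), while staying close to the full transformer baseline throughout.
Language pretraining yields faster convergence than random initialization across all tasks.
Performance improves with model scale: on CIFAR-10, accuracy rises from 68.2% for the base model to 72.1% for a larger variant.
On some bit tasks, the frozen attention layers show interpretable attention patterns that align with what the downstream task requires.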
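
The per-task comparison can be checked directly from the results table. The sketch below copies the accuracies from the table above and computes the gap between FPT and each baseline; the names `ACCURACY` and `gap` are introduced here for illustration only.

```python
# Accuracies (in percent) copied from the results table above.
TASKS = ["Bit Memory", "XOR", "ListOps", "MNIST",
         "CIFAR-10", "CIFAR-10 LRA", "Homology"]
ACCURACY = {
    "FPT":  [100.0, 100.0, 38.4, 98.0, 72.1, 38.6, 12.7],
    "Full": [100.0, 100.0, 38.0, 99.1, 70.3, 42.0, 9.0],
    "LSTM": [60.9, 50.1, 17.1, 99.5, 73.6, 11.7, 12.0],
}


def gap(model_a: str, model_b: str) -> dict:
    """Per-task accuracy difference (model_a minus model_b), in percentage points."""
    return {
        task: round(a - b, 1)
        for task, a, b in zip(TASKS, ACCURACY[model_a], ACCURACY[model_b])
    }


if __name__ == "__main__":
    for baseline in ("Full", "LSTM"):
        print(f"FPT minus {baseline}:")
        for task, diff in gap("FPT", baseline).items():
            print(f"  {task:<13}{diff:+.1f}")
```

A positive gap means FPT is ahead of the baseline on that task. These are single reported numbers, so small differences should not be over-interpreted.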

This review was generated by AI and reviewed by a human editor.