QUICK REVIEW

[Paper Review] Escaping the Big Data Paradigm with Compact Transformers

Ali Hassani, Steven Walton|arXiv (Cornell University)|Apr 12, 2021

Advanced Neural Network Applications | References: 48 | Cited by: 295

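Summary in Brief

This paper introduces compact vision transformers (ViT-Lite, CVT, and CCT) that can be trained from scratch on small datasets and reach competitive or state-of-the-art accuracy with far fewer parameters and much less computation. It demonstrates data-efficient transformer models on CIFAR-10/100, Flowers-102, and ImageNet without large-scale pretraining.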

ABSTRACT

With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that with the right size, convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as little as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when training from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data-efficiency over previous Transformer based models being over 10x smaller than other transformers and is 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.

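Motivation and Objectives

Motivate and enable training transformer models from scratch on small, data-scarce datasets.
Develop compact Transformer variants that combine convolutional tokenization with attention to gain data efficiency and locality.
Propose SeqPool to replace the class token and improve pooling over the output token sequence.
Show that CCT, with its convolutional tokenizer, delivers strong accuracy while keeping parameter count and computation low.
Demonstrate state-of-the-art or competitive results, relative to model size and data scale, on CIFAR-10/100, Flowers-102, and ImageNet.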

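Proposed Method

Propose ViT-Lite, CVT, and CCT as compact vision transformer variants suited to small-data settings.
In CCT, replace the standard patch-based tokenization with a convolutional tokenizer that embeds local structure.
Introduce SeqPool, an attention-based sequence pooling mechanism that maps the transformer output to a single class representation.
Evaluate models trained from scratch on CIFAR-10/100, MNIST, Fashion-MNIST, Flowers-102, and ImageNet-1k, using AdamW with cosine annealing.
Compare against CNN and ViT/DeiT baselines (including distillation settings), and report parameter counts and MACs.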
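
The SeqPool step can be illustrated with a minimal, dependency-free sketch. It assumes the formulation in which a linear layer g assigns one score to each output token, a softmax over the sequence turns those scores into weights, and the pooled vector is the weighted sum of the tokens. The function names, the toy tokens, and the scoring weights below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def seq_pool(tokens, weight, bias=0.0):
    """Pool an (n x d) token sequence into a single d-dimensional vector.

    tokens: list of n token vectors, each of length d (encoder output).
    weight: length-d vector of the linear scoring layer g (hypothetical values).
    bias:   scalar bias of g.
    """
    # g(x): one scalar importance score per token.
    scores = [sum(w * x for w, x in zip(weight, tok)) + bias for tok in tokens]
    # Softmax across the sequence dimension gives n weights that sum to 1.
    attn = softmax(scores)
    # Weighted sum of tokens: (1 x n) times (n x d) gives (1 x d).
    dim = len(tokens[0])
    return [sum(a * tok[j] for a, tok in zip(attn, tokens)) for j in range(dim)]


# Toy example: 3 tokens of width 2 with made-up values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Zero scoring weights make every token equally important (plain mean).
uniform = seq_pool(tokens, weight=[0.0, 0.0])

# Large scoring weights push almost all the mass onto the last token.
peaked = seq_pool(tokens, weight=[10.0, 10.0])

print(uniform, peaked)
```

With all-zero scoring weights the pooling reduces to a plain average of the tokens, while a learned g lets the model emphasize the most informative tokens, which is the role the class token plays in a standard ViT.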

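Experimental Results

Research Questions

RQ1: Can vision transformers be trained effectively from scratch on small datasets without large-scale pretraining?
RQ2: Do compact Transformer architectures with convolutional tokenization and sequence pooling offer data-efficient improvements over ViT and CNNs on small datasets?
RQ3: How do convolutional tokenization and SeqPool affect accuracy and efficiency across a range of image datasets?
RQ4: On a medium-scale dataset such as ImageNet, how does CCT perform relative to conventional CNNs and ViT variants?
RQ5: Is it feasible to deploy transformer models under constrained compute while keeping performance competitive?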

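Key Findings

Trained from scratch on CIFAR-10 with about 3.7M parameters, CCT reaches a best result of 98% (Table 2 reports 98.00% on CIFAR-10 after 5000 epochs).
CCT outperforms ViT and many CNN-based methods on CIFAR-10/100 and Flowers-102 while using fewer parameters and MACs (for example, CVT and CCT variants perform strongly in the 0.28–3.85M parameter range).
On ImageNet-1k, CCT-14/7×2 reaches 80.67% Top-1 (no distillation, 22.36M parameters), and the distilled CCT variant reaches 81.34% Top-1.
On Flowers-102, CCT-14/7×2 reaches 99.76% Top-1 with ImageNet-scale pretraining, using markedly fewer parameters (about 22.17M) and MACs (18.63G).
CCT shrinks the model to about 15% of the size of ResNet50 while matching or exceeding its accuracy on CIFAR-10/100, which indicates better data efficiency.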


This review was generated by AI and checked by a human editor.