QUICK REVIEW

[Paper Review] Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, et al. | arXiv (Cornell University) | Dec 23, 2020

Currency Recognition and Detection | References: 61 | Cited by: 1,049

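One-Sentence Summary

The paper introduces DeiT (Data-efficient image Transformers), a vision transformer trained on ImageNet-1k alone, on a single 8-GPU node in under 3 days, that reaches 83.1% top-1 accuracy. It also proposes a distillation method specific to transformers: a dedicated distillation token interacts with the other tokens through attention so that the student learns from the teacher. This token-based approach outperforms standard distillation, particularly when the teacher is a convolutional network.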

ABSTRACT

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.

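Research Motivation and Objectives

Train a vision transformer to competitive accuracy using only ImageNet-1k, with no external data and no large-scale infrastructure.
Develop a data-efficient training protocol for vision transformers that converges quickly on standard hardware.
Propose a distillation strategy designed specifically for transformers that improves on standard knowledge distillation.
Show that, when trained efficiently, vision transformers can match or surpass convolutional networks in both accuracy and transfer performance.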

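Proposed Method

Introduce a distillation token that plays a role similar to the class token: it interacts with the patch and class tokens through self-attention, but its output is trained to reproduce the teacher's prediction rather than the ground-truth label.
Adopt a teacher-student framework in which the student transformer learns from the teacher's output through this distillation token; the teacher itself need not be attention-based and can be a convnet.
Use repeated augmentation, Mixup, CutMix, and RandAugment to improve generalization and robustness.
Apply weight decay, label smoothing, stochastic depth, and learning-rate scaling to stabilize training.
When fine-tuning at a different resolution, adapt the positional embeddings with bicubic interpolation, which approximately preserves their norm and so protects accuracy.
Train for 300 epochs on a single 8-GPU node; DeiT-B completes training in about 53 hours.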
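
The two-headed objective described in this list can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the hard-label variant of distillation (the distillation head is trained on the teacher's argmax) with equal weighting of the two terms, and it assumes that predictions are made by summing the softmax outputs of the two heads. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one example against an integer class index."""
    return -log_softmax(logits)[target]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label):
    """Average of two cross-entropies (assumed equal weights):
    class head vs. the true label, and
    distillation head vs. the teacher's hard decision (its argmax)."""
    teacher_label = argmax(teacher_logits)
    loss_cls = cross_entropy(cls_logits, true_label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


def fused_prediction(cls_logits, dist_logits):
    """Assumed inference rule: add the two heads' softmax outputs, take argmax."""
    p_cls = [math.exp(v) for v in log_softmax(cls_logits)]
    p_dist = [math.exp(v) for v in log_softmax(dist_logits)]
    return argmax([a + b for a, b in zip(p_cls, p_dist)])


# Toy example: 3 classes, the teacher disagrees with the ground truth.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.0, 0.0],
    dist_logits=[0.0, 3.0, 0.0],
    teacher_logits=[0.1, 0.9, 0.0],
    true_label=0,
)
pred = fused_prediction([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

In the toy example the class head is pulled toward the ground-truth class 0 while the distillation head is pulled toward the teacher's choice, class 1, which is how the two tokens can end up encoding complementary information.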

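Experimental Results

Research Questions

RQ1: Can a vision transformer reach state-of-the-art ImageNet performance using only ImageNet-1k and no external data?
RQ2: Which techniques are key to data-efficient training of vision transformers when data is limited?
RQ3: Does a distillation method designed specifically for transformers outperform standard knowledge distillation?
RQ4: Does distilling from a convolutional network give better results than distilling from another vision transformer?
RQ5: How does the proposed distillation token compare with standard distillation in accuracy and generalization?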

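Key Findings

DeiT-B, trained only on ImageNet-1k on a single 8-GPU node in under 3 days, reaches 83.1% top-1 accuracy.
With the proposed distillation token (DeiT⚗), top-1 accuracy on ImageNet-1k rises to as much as 85.2%, ahead of standard distillation.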
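Distilling from a convnet teacher (a RegNetY model in the paper) gives better results than distilling from a comparable transformer teacher, which suggests that the student inherits some of the convnet's inductive biases.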
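The distillation-token strategy gives a clear gain over standard distillation, and it does so with ImageNet-1k as the only training data.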
DeiT models also achieve competitive results on downstream tasks such as CIFAR-10, CIFAR-100, Oxford-102 Flowers, Stanford Cars, and iNaturalist-18/19.
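Fine-tuning at a higher resolution (384×384) improves results further: at this resolution DeiT-B reaches 83.1% top-1 on ImageNet and 87.7% on ImageNet-ReaL, showing that the model benefits from larger inputs.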
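
To see why the positional embeddings must be resized when moving from 224×224 to 384×384, it helps to count tokens. The sketch below assumes 16×16 patches (the patch size is not stated in this review) and two extra tokens, one class token and one distillation token; `num_tokens` is a hypothetical helper, not part of any library.

```python
def num_tokens(image_size, patch_size=16, extra_tokens=2):
    """Sequence length seen by the transformer for a square image.

    patch_size=16 and extra_tokens=2 (class + distillation) are assumptions
    made for illustration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


print(num_tokens(224))  # 14 * 14 patches + 2 = 198
print(num_tokens(384))  # 24 * 24 patches + 2 = 578
```

Under these assumptions the grid of patch positions grows from 14×14 to 24×24, so the embeddings learned at the lower resolution have to be interpolated onto the larger grid before fine-tuning.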

This review was generated by AI and checked by human editors.