QUICK REVIEW

[Paper Review] TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Kan Wu, Jinnian Zhang | arXiv (Cornell University) | Jul 21, 2022

Advanced Neural Network Applications | Cited by 22

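One-Sentence Summary

TinyViT introduces a fast pretraining distillation framework that trains tiny vision transformers by transferring knowledge from large pretrained teachers, achieving strong ImageNet and downstream-task performance with significantly fewer parameters.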

ABSTRACT

Vision transformer (ViT) recently has drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored in disk in advance to save the memory cost and computation overheads. The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, increasing image resolutions, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% parameters. Last but not the least, we demonstrate a good transfer ability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.

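Research Motivation and Goals

Advance efficient tiny vision transformers for resource-constrained devices.
Use distillation so that small ViTs can benefit from large-scale pretraining data.
Reduce the training memory and compute cost of pretraining distillation.
Propose a scalable framework for pretraining and compressing tiny ViTs while preserving transfer performance.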

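Proposed Method

Store sparse teacher logits together with data-augmentation metadata on disk, enabling fast pretraining distillation without repeated teacher forward passes.
Train the small student ViT with a distillation loss on sparse soft labels recovered from the stored teacher outputs.
Distill without ground-truth labels, relying on the teacher's soft predictions instead.
Progressively shrink a large seed ViT to produce a family of TinyViT models under parameter and throughput constraints.
Adopt a hierarchical, Swin-style architecture with window attention and MBConv blocks to balance accuracy and efficiency.
Pretrain on ImageNet-21k, fine-tune on ImageNet-1k, and optionally fine-tune at higher resolution to improve accuracy.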
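
The storing and recovering of sparse soft labels described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names (`sparsify`, `recover`, `distill_loss`), the example logits, and the choice to spread the leftover probability mass uniformly over the classes that were not stored are all assumptions made here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparsify(teacher_logits, k):
    """Keep only the k largest teacher probabilities as (class_index, prob) pairs."""
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

def recover(sparse, num_classes):
    """Rebuild a dense soft label from the stored pairs.

    Assumption: the probability mass that was not stored is shared
    equally among the remaining classes.
    """
    kept = dict(sparse)
    remaining = num_classes - len(kept)
    rest = (1.0 - sum(kept.values())) / remaining if remaining else 0.0
    return [kept.get(c, rest) for c in range(num_classes)]

def distill_loss(student_logits, soft_label):
    """Cross-entropy between the recovered teacher soft label and the student output."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (s - log_z) for p, s in zip(soft_label, student_logits))

# Made-up numbers for a 6-class toy problem; no ground-truth label is used.
teacher_logits = [4.0, 2.5, 0.1, -1.0, 0.3, -0.5]
stored = sparsify(teacher_logits, k=2)            # what would be written to disk
soft_label = recover(stored, num_classes=6)       # what the student trains against
loss = distill_loss([3.0, 2.0, 0.0, -1.0, 0.5, 0.0], soft_label)
```

Only `stored` has to be kept per training sample, so the per-sample footprint grows with `k` rather than with the number of classes, and the teacher never has to be run again while the student trains.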

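Experimental Results

Research Questions

RQ1: Can small vision transformers reach competitive performance by distilling knowledge from large pretrained models during pretraining?
RQ2: How can the distillation process be made fast and scalable, avoiding the memory and time burden of a large teacher?
RQ3: How does pretraining distillation affect the transferability of tiny ViTs to downstream tasks?
RQ4: How does progressive model shrinking affect TinyViT's accuracy/efficiency trade-off?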

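Key Findings

TinyViT-21M, pretrained on IN-21k and fine-tuned on IN-1k for 30 epochs, reaches 84.8% top-1 accuracy on ImageNet-1k with 21M parameters.
At higher input resolution, TinyViT reaches 86.5% top-1, slightly above Swin-L, while using only about 11% of its parameters.
TinyViT-21M pretrained with distillation on IN-21k transfers well to downstream tasks, e.g. 50.2 AP on COCO object detection (2.1 points above Swin-T, which has 28M parameters).
The fast pretraining distillation framework cuts memory and compute cost by storing sparse teacher logits and encoding the data augmentation, so large-batch distillation can run without loading the teacher during training.
Higher-quality teachers (e.g. Florence, CLIP-ViT-L/14) further improve TinyViT, while the on-disk logits keep the training cost practical.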
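
One way to picture the augmentation encoding mentioned above is to store a single seed per (image, epoch) pair and replay it on the student side, so the student sees exactly the view the teacher scored. The sketch below works under that assumption; `sample_augmentation`, the crop and flip parameters, the sample values, and the in-memory `store` dictionary standing in for on-disk files are all hypothetical and not taken from the TinyViT code.

```python
import random

def sample_augmentation(seed, image_size=256, crop_size=224):
    """Derive crop offsets and a flip decision deterministically from a seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    max_offset = image_size - crop_size
    return {
        "left": rng.randint(0, max_offset),
        "top": rng.randint(0, max_offset),
        "flip": rng.random() < 0.5,
    }

# Offline teacher pass: augment with a known seed, then save the seed next to
# the sparse teacher output (the label values here are placeholders).
store = {}
image_id, epoch, seed = "img_0001", 0, 12345
teacher_view = sample_augmentation(seed)
store[(image_id, epoch)] = {"seed": seed, "sparse_label": [(3, 0.90), (7, 0.05)]}

# Student training: read the record back and replay the same augmentation,
# without running the teacher.
record = store[(image_id, epoch)]
student_view = sample_augmentation(record["seed"])
```

Storing a seed instead of the full augmentation parameters keeps each record small, at the price of requiring the same augmentation code and random generator on both sides.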


This review was generated by AI and reviewed by a human editor.