QUICK REVIEW

[Paper Review] Vision Transformer for Small-Size Datasets

Seung Hoon Lee, Seunghyun Lee|arXiv (Cornell University)|Dec 27, 2021

CCD and CMOS Imaging Sensors | Cited by 122

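One-Sentence Summary

This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which give Vision Transformers a stronger locality inductive bias so that they can be trained from scratch on small datasets, improving performance on Tiny-ImageNet and other small benchmarks.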

ABSTRACT

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a large-size dataset such as JFT-300M, and its dependence on a large dataset is interpreted as due to low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively solve the lack of locality inductive bias and enable it to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, the performance improved by an average of 2.96% in Tiny-ImageNet, which is a representative small-size dataset. Especially, Swin Transformer achieved an overwhelming performance improvement of 4.08% thanks to the proposed SPT and LSA.

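Motivation and Objectives

Address the lack of locality inductive bias in Vision Transformers when they are trained from scratch on small datasets.
Propose generic add-on modules (SPT and LSA) that improve tokenization and local attention.
Demonstrate performance gains on Tiny-ImageNet and CIFAR-10/CIFAR-100, and evaluate the effect on a mid-size dataset such as ImageNet.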

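Proposed Method

Introduce Shifted Patch Tokenization (SPT), which spatially shifts the input image, concatenates the shifted copies with the original, and only then tokenizes, widening the receptive field of each visual token.
Propose Locality Self-Attention (LSA), which uses diagonal masking to remove self-token attention and a learnable softmax temperature, strengthening local attention.
Explain how SPT can be applied to both patch embedding and pooling layers as a simple add-on to ViTs.
Provide quantitative and qualitative analyses showing that SPT and LSA improve locality and help the model capture object shapes.
Compare several ViT variants (ViT, PiT, Swin, CaiT) with and without SPT/LSA on small datasets and on ImageNet.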
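
The SPT step described in the list above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes four diagonal shifts of half a patch with zero padding, a layer normalization without learnable affine parameters, and a randomly initialized projection. The names `shift_image`, `shifted_patch_tokenize`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, zero-filling the vacated area."""
    h, w, _ = img.shape
    out = np.zeros_like(img)
    # Matching source and destination windows for the region that stays in frame.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def shifted_patch_tokenize(img, patch, weight, bias, eps=1e-5):
    """Sketch of SPT: shift, concatenate, patchify, normalize, project.

    Assumes H and W are divisible by `patch` (an assumption of this sketch).
    """
    h, w, _ = img.shape
    s = patch // 2  # assumed shift distance: half a patch
    shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]  # four diagonal directions
    # Concatenate the original and its shifted copies along channels: (H, W, 5C).
    stacked = np.concatenate(
        [img] + [shift_image(img, dy, dx) for dy, dx in shifts], axis=-1
    )
    # Split into non-overlapping patches and flatten each one.
    gh, gw, c5 = h // patch, w // patch, stacked.shape[-1]
    patches = (
        stacked.reshape(gh, patch, gw, patch, c5)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch * patch * c5)
    )
    # Layer normalization over each flattened patch (no learnable affine here).
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    normed = (patches - mean) / np.sqrt(var + eps)
    # Linear projection to the token dimension.
    return normed @ weight + bias

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))       # toy CIFAR-sized input
patch, dim = 8, 64                           # illustrative sizes
weight = rng.standard_normal((patch * patch * 3 * 5, dim)) * 0.02
bias = np.zeros(dim)
tokens = shifted_patch_tokenize(img, patch, weight, bias)
print(tokens.shape)  # (16, 64)
```

Each token is built from five views of the same patch location, so it carries pixels from neighboring patches as well as its own, which is the sense in which the receptive field of a token grows.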

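Experimental Results

Research Questions

RQ1: Can ViTs learn from scratch on small datasets without large-scale pre-training?
RQ2: Do SPT and LSA strengthen the locality inductive bias and improve performance across ViT variants?
RQ3: How large are the accuracy gains on Tiny-ImageNet and CIFAR-type datasets, and what is the effect on mid-size datasets such as ImageNet?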

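Key Findings

Applying SPT and LSA together on Tiny-ImageNet improves accuracy by an average of 2.96% across the tested ViT variants.
The largest observed gain, 4.08%, is on Tiny-ImageNet with Swin Transformer.
On CIFAR-100, the proposed methods improve CaiT and PiT by 3.43% and 4.01%, respectively.
On Tiny-ImageNet, ViT and Swin gain up to 4.00% and 4.08%, respectively.
On ImageNet trained from scratch, the gains reach 1.60% for ViT (SL-ViT), 1.44% for PiT (SL-PiT), and 1.06% for Swin (SL-Swin).
Ablations show that learnable temperature scaling and diagonal masking each contribute to the improvement, and that combining them yields a synergistic gain.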
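
The two LSA components named in the ablation, diagonal masking and temperature scaling, can be sketched as a single-head forward pass in NumPy. This is an illustration under stated assumptions rather than the authors' code: `tau` is passed in as a plain number (in training it would be a learnable parameter), and the function name and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical.

```python
import numpy as np

def locality_self_attention(x, w_q, w_k, w_v, tau):
    """Single-head attention with diagonal masking and temperature tau.

    x: (N, D) tokens; w_q, w_k, w_v: (D, d) projections; tau: positive scalar.
    Returns the attended values (N, d) and the attention weights (N, N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T                      # (N, N) token-to-token similarities
    np.fill_diagonal(scores, -np.inf)     # diagonal masking: drop self-token relations
    scores = scores / tau                 # temperature in place of a fixed sqrt(d) scale
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8      # illustrative sizes
x = rng.standard_normal((n_tokens, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) * 0.1 for _ in range(3))
tau = np.sqrt(d_head)                     # assumed starting value for the temperature
out, attn = locality_self_attention(x, w_q, w_k, w_v, tau)
print(out.shape, attn.shape)  # (5, 8) (5, 5)
```

With the diagonal masked, every row of `attn` spreads its weight over the other tokens only, and a smaller `tau` makes that distribution sharper.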


This review was generated by AI and reviewed by a human editor.